Fix/improve mtp performance #5
base: glm4-moe-mtp
Conversation
…llama.cpp into glm4-mtp-graph-cache
Sounds good to me, I'll look over it more formally over the next 2-3 days.
I was working on the server loop optimization and just noticed that ngxson has recently created a PR: ggml-org#17808. It looks very good and will hopefully be merged soon; after that, perhaps you can start rebasing this entire repo. There has been a lot of traffic since this repo was created, so just merging the changes will already take some work. By the way, could you share your test method (cmdline, gguf, system configuration)? Since I already see a positive gain with this branch, it puzzles me that it is still a regression on your system. If you share the info I can try to reproduce it and see if I can find anything suspicious.
(The name of the branch bothered me, but unfortunately the PR closes if I try to rename it... well, it's going to be MT for now)
Sure. I usually don't test under best-case scenarios. For development I use Windows, which incurs a slight performance drop. Right now my specs are a Threadripper 5965WX and two RTX 3090s; I tried months ago with a Ryzen 7 5700X and noticed the same pattern. I wasn't expecting an offload setup to be great, but it still bothers me that performance without MTP is higher. Here are the commands: … I haven't tried this branch on Linux or with GLM 4.5/4.6 yet, and I also haven't fully optimized for NUMA. That's why I'm eager to see how it performs for others.
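For context, the invocation is along these lines; treat this as a rough sketch with placeholder values (model file, context size, expert-offload pattern, thread count) rather than my exact arguments:

```sh
# Rough sketch only: the model path, context size, expert-offload pattern and
# thread count below are placeholders, not the exact values I run with.
#   -ngl 99        offload every repeating layer that fits onto the two 3090s
#   -ot "exps=CPU" keep the MoE expert tensors in system RAM
#   --no-mmap      load weights up front instead of memory-mapping them
llama-server -m GLM-4.5-Air-Q4_K_M.gguf -c 32768 -ngl 99 \
  -ot "exps=CPU" -t 24 --no-mmap
```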
That's great to hear! I made negligible improvements to the server loop—nothing that yielded significant gains. I will update the branch, take a look at the mentioned PR, and then create a new one that will allow more than one draft per loop.
Thank you for the information! I will take a look. Here are my first intuitions: In a … To validate this, you can either: …
You mentioned NUMA, which I don't think is the major problem here: although llama.cpp is generally not NUMA-aware, inter-CCD bandwidth within a single CPU is still much higher than typical inter-socket bandwidth. The major additional bottleneck on Windows, to my knowledge, is much higher kernel launch overhead. CUDA graphs can be used to eliminate this, but they are currently disabled for split graphs. Anyway, it's irrelevant here because you basically have the same overhead with or without MTP. Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in …
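If you want to rule the pinned-memory limit in or out, one quick comparison (just a sketch; it assumes the `GGML_CUDA_NO_PINNED` environment variable is still honored by current CUDA builds, and the model path is a placeholder) is to benchmark prompt processing with pinned host buffers disabled:

```sh
# Compare prompt-processing speed with and without pinned host buffers.
# Assumes GGML_CUDA_NO_PINNED is still honored by the ggml CUDA backend;
# the model path is a placeholder. On Windows cmd, use `set` instead of
# the inline VAR=value prefix.
GGML_CUDA_NO_PINNED=1 llama-bench -m model.gguf -p 2048 -n 0
llama-bench -m model.gguf -p 2048 -n 0
```

If the two numbers are close, pinned memory is not the bottleneck on your system.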
I will need to test more carefully, but I'm also using a CPU-MoE setup and my tg goes from ~10.2 t/s with MTP off to ~11.5 with it on. I haven't yet tried to optimize settings for the MTP-on case. One very loose thought is …
This is a weird number; does that mean you're usually maxing out RAM/VRAM in your base configuration? Loading the MTP layer requires a bit of extra memory, so maybe you're filling up VRAM or something?
That's good to know, I will measure it after I finish the next PR. At least right now the performance with the Air variant doesn't bother me; my go-to is still GLM 4.6 on Linux, I just prefer to code on Windows. Still, it's worth checking whether something is hurting my performance and preventing MTP from being faster than the baseline. I'm also curious to try a Q1 quant just to fit fully in the GPUs and see how it works.
True, and I saw that when I used GLM 4.5 for the first week; it was around 3 tokens/s, which jumped to 5 after disabling mmap.
Yes, when the model was released I used it quite a lot, and that context filled the entire VRAM. Today I don't need it anymore, but because I reused the old args I never changed them, and as far as I remember I removed some layers from both GPUs to free memory. But I need to double-check whether it is spilling into system memory again and hurting performance.
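The quickest check I know is to watch dedicated VRAM while the server is generating, something like the sketch below (standard nvidia-smi query fields; nothing specific to this branch):

```sh
# Poll GPU memory once per second while the server is generating.
# If memory.used sits at the card's limit, the driver is likely spilling
# into shared system memory, which is the slow path on Windows.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```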
Hey @F1LM1, it's been my turn to be busy lately, so I haven't made much progress. However, this gave me time to rethink the future of this PR, and I decided to clean up the code for a proper review. Here are my thoughts:
I took the latest improvements we made and tested each one with small prompts. This new PR is the result. With my configuration, I'm getting about 84% of the tokens per second compared to running without MTP. I also reviewed the entire project to remove unused code and references, and finally added a command-line argument so everyone can easily try it with or without MTP.
My suggestion is as follows: