

@SamuelOliveirads

Hey @F1LM1, it's been my turn to be busy lately, so I haven't made much progress. However, this gave me time to rethink the future of this PR, and I decided to clean up the code for a proper review. Here are my thoughts:

  • The core idea of the original PR was the MTP implementation. Although performance isn't currently better than having it disabled, it is functional, which I see as a solid contribution to the project.
  • This PR has been open for almost four months, and the next set of improvements could take just as long to finish.

I took the latest improvements we made and tested each one with small prompts. This new PR is the result. With my configuration, I'm getting about 84% of the tokens per second compared to running without MTP. I also reviewed the entire project to remove unused code and references, and finally added a command-line argument so everyone can easily try it with or without MTP.

My suggestion is as follows:

  1. You review this cleanup, and if it looks good, merge it into the original PR branch.
  2. Mark the original PR as ready for review. If necessary, I can tag some of the frequent reviewers to take a look.
  3. I will be available to fix any issues found in the original PR and focus on getting it merged. In parallel, I will keep working on the server loop optimization and multi-token MTP in PR Glm4 mtp optimizations #4.

@F1LM1
Owner

F1LM1 commented Dec 7, 2025

Sounds good to me, I'll look over it more formally over the next 2-3 days.

@wishstudio

I was working on the server loop optimization and just noticed that ngxson recently created a PR for it: ggml-org#17808

It looks very good and will hopefully be merged soon. After that, perhaps you can start rebasing this entire repo. There has been a lot of upstream traffic since this repo was created, so just merging the changes will already take some work.

By the way, could you share your test method (command line, gguf, system configuration)? Since I already see a positive gain with this branch, it puzzles me that it is still a regression on your system. If you share the info, I can try to reproduce it and see if I can find anything suspicious.

SamuelOliveirads changed the title from "Fix/improve mt performance" to "Fix/improve mtp performance" on Dec 7, 2025
SamuelOliveirads deleted the fix/improve-mt-performance branch on December 7, 2025 at 22:06
SamuelOliveirads restored the fix/improve-mt-performance branch on December 7, 2025 at 22:07
@SamuelOliveirads
Author

SamuelOliveirads commented Dec 7, 2025

(The name of the branch bothered me, but unfortunately the PR closes if I try to rename it... well, it's going to be MT for now)

By the way, could you share your test method (command line, gguf, system configuration)? Since I already see a positive gain with this branch, it puzzles me that it is still a regression on your system. If you share the info, I can try to reproduce it and see if I can find anything suspicious.

Sure. I usually don't test under best-case scenarios. For development, I use Windows, which incurs a slight performance drop. Right now, my specs are a Threadripper 5965WX and two RTX 3090s. I tried months ago with a Ryzen 7 5700X and noticed the same pattern. I wasn't expecting an offload setup to be great, but it still bothers me that performance without MTP is higher. Here are the commands:

cmake -B build -DCMAKE_BUILD_TYPE=Debug -DLLAMA_CURL=OFF -DGGML_CUDA=ON

cmake --build build --config Release -j

.\build\bin\release\llama-server.exe ^
    --model "F:\llm_models\glm-4.5-air_Q4_general\GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf" ^
    --alias GLM-4.5-Air ^
    --ctx-size 36864 ^
    -ctk q8_0 -ctv q8_0 ^
    -fa --verbose ^
    --n-gpu-layers 99 ^
    -b 2048 -ub 1500 ^
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18)\.ffn_.*=CUDA0" ^
    -ot "blk\.(20|21|22|23|24|25|26|27|28|29|30|31|32|33|34)\.ffn_.*=CUDA1" ^
    --override-tensor exps=CPU ^
    -mtp ^
    --threads 24 --threads-batch 36 ^
    --host 127.0.0.1 ^
    --port 8080

I haven't tried using Linux for this branch or with GLM 4.5/4.6 yet, and I also didn't fully optimize for NUMA. That's why I'm eager to see how it performs for others.
llama-bench is not the best tool for this kind of testing, at least for me, so I typically test with small prompts and sometimes with code generation or creative writing. On average, small prompts gave 12.19 T/s. This branch originally gave me 10.52 T/s, and it now reaches up to 11.05 T/s, which is about 90% of the original performance.
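
For a single small-prompt run, the generation speed can be read straight from the server: recent llama.cpp builds include a timings block (with a predicted_per_second field) in the non-streaming /completion response. The prompt text and n_predict value below are only placeholders:

curl http://127.0.0.1:8080/completion ^
    -H "Content-Type: application/json" ^
    -d "{\"prompt\": \"Write a short haiku about rivers.\", \"n_predict\": 256}"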

I was working on the server loop optimization and just noticed that ngxson recently created a PR for it: ggml-org#17808

It looks very good and will hopefully be merged soon. After that, perhaps you can start rebasing this entire repo. There has been a lot of upstream traffic since this repo was created, so just merging the changes will already take some work.

That's great to hear! I only made negligible improvements to the server loop, nothing that yielded significant gains. I will update the branch, take a look at the mentioned PR, and then create a new one that allows more than one draft per loop.

@wishstudio

wishstudio commented Dec 7, 2025

Thank you for the information! I will take a look. Here are my first intuitions:

In a --cpu-moe setup, the CPU part is typically memory bound. IIRC, the IQ quants are computationally heavier than regular Qx_0 or Qx_K quants, so using them may shift the workload to being compute bound, which incurs a bigger performance penalty.

To validate this, you can either:

  • Try a computationally lighter quant, like Q4_K, or better yet Q8_0.
  • Log the time spent in the MoE FFNs and check whether they are memory bound. You can measure this easily by timing the loop for (int cur_a = 0; cur_a < n_as; ++cur_a) in the ggml_compute_forward_mul_mat_id function in ggml-cpu.c. You can then compare that time with the theoretical time for an FFN block of your quant: calculate the size in bytes of the corresponding weight matrices, then divide it by your actual memory bandwidth, which can be measured with the Intel MLC tool (a standalone sketch of this calculation follows below). You may also need to log the dimensions of the ids tensors to do this correctly, since consecutive tokens will share some experts. In my experiments on my gaming rig, the actual time spent in FFN blocks is quite close to the theoretical time.
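
To make the arithmetic concrete, here is a small standalone C sketch of that comparison. Every number in it is a placeholder (experts routed per token, bytes per expert, measured bandwidth); substitute the values from your own gguf and an Intel MLC run, and take the measured side from timestamps (e.g. ggml_time_us()) around the cur_a loop mentioned above.

#include <stdio.h>

int main(void) {
    /* Placeholder inputs -- substitute your own measurements. */
    double experts_per_token = 8.0;                /* routed experts actually read per token */
    double bytes_per_expert  = 60.0 * 1024 * 1024; /* quantized up/gate/down weights of one expert, in bytes */
    double bandwidth_gb_s    = 150.0;              /* sustained memory bandwidth from Intel MLC, in GB/s */

    double bytes_touched  = experts_per_token * bytes_per_expert;
    double theoretical_ms = bytes_touched / (bandwidth_gb_s * 1e9) * 1e3;

    printf("bytes touched per token : %.1f MiB\n", bytes_touched / (1024.0 * 1024.0));
    printf("memory-bound lower bound: %.3f ms per MoE FFN block\n", theoretical_ms);

    /* If the time measured around the expert loop in ggml_compute_forward_mul_mat_id
     * is much larger than this lower bound, the block is compute bound rather than
     * memory bound on that machine. */
    return 0;
}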

You mentioned NUMA, which I think is not the major problem here: although llama.cpp is generally not NUMA aware, the inter-CCD bandwidth within a single CPU is still much higher than typical inter-socket bandwidth.

The major additional bottleneck on Windows, to my knowledge, is much higher kernel launch overhead. CUDA graphs can be used to eliminate this, but they are currently disabled for split graphs. Anyway, it's irrelevant here because you have basically the same overhead with or without MTP.

Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in --cpu-moe setups but not tg. But I guess it's also irrelevant to the MTP performance comparison. You can check whether adding --no-mmap makes any difference; without that option, model weights are not placed in pinned memory.
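
Concretely, that check is just the same llama-server invocation with --no-mmap added. A trimmed sketch reusing the model path from above (keep the GPU-offload and thread flags from the earlier command as they were):

.\build\bin\release\llama-server.exe ^
    --model "F:\llm_models\glm-4.5-air_Q4_general\GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf" ^
    -fa -mtp ^
    --no-mmap ^
    --host 127.0.0.1 --port 8080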

@F1LM1
Owner

F1LM1 commented Dec 8, 2025

I will need to test more carefully, but I'm also using a CPU MoE setup and my tg/s goes from ~10.2 with MTP off to ~11.5 with it on. I haven't yet tried to optimize settings for the on case.

One very loose thought is

ctx-size 36864

This is a weird number; does that mean you're usually absolutely maxing out RAM/VRAM in base configuration? Loading the MTP layer will require a bit of memory so maybe you're filling up VRAM or something?

@SamuelOliveirads
Author

  • Try a computationally lighter quant, like Q4_K, or better yet Q8_0.
  • Log the time spent in the MoE FFNs and check whether they are memory bound. You can measure this easily by timing the loop for (int cur_a = 0; cur_a < n_as; ++cur_a) in the ggml_compute_forward_mul_mat_id function in ggml-cpu.c. You can then compare that time with the theoretical time for an FFN block of your quant: calculate the size in bytes of the corresponding weight matrices, then divide it by your actual memory bandwidth, which can be measured with the Intel MLC tool. You may also need to log the dimensions of the ids tensors to do this correctly, since consecutive tokens will share some experts. In my experiments on my gaming rig, the actual time spent in FFN blocks is quite close to the theoretical time.

That's good to know; I will measure it after I finish the next PR. At least right now, the performance of the Air variant doesn't bother me. My go-to is still GLM 4.6 on Linux; I just prefer to code on Windows. Still, it's worth checking whether something is hurting my performance and preventing MTP from being faster than running without it. I'm also curious to try a Q1 quant just to fit the model fully in GPU and see how it works.

Another important limitation on Windows is that it only allows half of physical RAM to be used as pinned memory. That typically hurts prefill performance in --cpu-moe setups but not tg. But I guess it's also irrelevant to the MTP performance comparison. You can check whether adding --no-mmap makes any difference; without that option, model weights are not placed in pinned memory.

True, and I saw that when I used GLM 4.5 in the first week: it was around 3 tokens per second, which jumped to 5 after disabling mmap.

This is a weird number; does that mean you're usually absolutely maxing out RAM/VRAM in base configuration? Loading the MTP layer will require a bit of memory so maybe you're filling up VRAM or something?

Yes, when the model was released I used it quite a lot, and that context filled the entire VRAM. Today I don't need that much anymore, but since I borrowed the old args I didn't change them, and as far as I remember I removed some layers from both GPUs to free memory. But I need to double-check whether it is spilling into system memory again and hurting performance.
