
Conversation

@loci-dev commented Dec 6, 2025

Mirrored from ggml-org/llama.cpp#17808

Fix ggml-org/llama.cpp#12968

I'm testing with llama-server --fim-qwen-7b-spec, but the quality seems to have degraded significantly. Not sure if this is expected (since we no longer sample a single token at a time like before).

TODO: leave a drawing here to explain how it works

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #463

Analysis Overview:
Comparison of version f4ec65a5-3bd7-43ad-b473-ceef01e93350 against base 6ffb5ba3-7159-4bcb-bcc4-0ff094d14c42 for the llama.cpp server speculative decoding optimization.


Summary

This PR refactors speculative decoding to batch draft tokens with the main model inference, eliminating a separate llama_decode() call per iteration. The analysis shows no measurable performance differences between versions, with all binaries reporting 0.0% power consumption change and no function-level Response Time or Throughput Time variations. The code changes are structurally sound, moving draft generation before batch construction and adding rollback logic for token management, but the optimization benefits are not captured in the static analysis environment.
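To make the batching idea concrete, here is a minimal, self-contained C++ sketch of the greedy speculative-decoding loop described above: a cheap draft model proposes a few tokens, the target model verifies them in a single batched pass (one llama_decode() per iteration in the real server), and everything past the first mismatch is rolled back. This is an illustration only, assuming greedy acceptance; target_next, draft_next, n_draft, and the toy integer "models" are placeholders, not part of the llama.cpp API.

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy stand-ins for the models: target_next() plays the main model,
// draft_next() the draft model. Both are arbitrary integer functions.
static int32_t target_next(int32_t tok) { return (tok * 7 + 3) % 101; }
static int32_t draft_next (int32_t tok) {
    // The draft model is deliberately "wrong" on multiples of 10,
    // so the rollback path gets exercised.
    return (tok % 10 == 0) ? tok + 1 : target_next(tok);
}

int main() {
    const int n_draft = 4;              // draft tokens proposed per iteration
    std::vector<int32_t> out = { 42 };  // generated sequence so far

    for (int iter = 0; iter < 6; ++iter) {
        // 1) Draft phase: the cheap model proposes n_draft tokens.
        std::vector<int32_t> draft;
        int32_t cur = out.back();
        for (int i = 0; i < n_draft; ++i) {
            cur = draft_next(cur);
            draft.push_back(cur);
        }

        // 2) Verify phase: ONE batched pass of the target model over the
        //    last accepted token plus every draft token (n_draft + 1 outputs).
        //    In the PR this corresponds to a single llama_decode() call
        //    instead of a separate decode per iteration.
        std::vector<int32_t> verify;
        verify.push_back(target_next(out.back()));
        for (int i = 0; i < n_draft; ++i) {
            verify.push_back(target_next(draft[i]));
        }

        // 3) Accept the longest prefix where draft and target agree, take one
        //    guaranteed token from the target (correction or bonus), and roll
        //    back the remaining draft tokens.
        int n_accept = 0;
        while (n_accept < n_draft && draft[n_accept] == verify[n_accept]) {
            n_accept++;
        }
        out.insert(out.end(), verify.begin(), verify.begin() + n_accept + 1);

        printf("iter %d: accepted %d/%d draft tokens\n", iter, n_accept, n_draft);
    }

    printf("sequence:");
    for (int32_t t : out) printf(" %d", t);
    printf("\n");
    return 0;
}

Under greedy acceptance the sketch produces exactly what the target model alone would generate, at the cost of evaluating a few extra positions per batch; it guarantees at least one token of progress per iteration even when every draft token is rejected.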

@loci-dev force-pushed the main branch 21 times, most recently from cbd1d18 to ebc7ac8 on December 8, 2025 at 15:10