
Conversation

@loci-dev commented Dec 6, 2025

Mirrored from ggml-org/llama.cpp#17808

Fix ggml-org/llama.cpp#12968

I'm testing with llama-server --fim-qwen-7b-spec, but the quality seems to have degraded significantly. Not sure if this is expected (since we no longer sample a single token at a time like before).

TODO: leave a drawing here to explain how it works

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #463

Analysis Overview:
Comparison of version f4ec65a5-3bd7-43ad-b473-ceef01e93350 against base 6ffb5ba3-7159-4bcb-bcc4-0ff094d14c42 for the llama.cpp server speculative decoding optimization.


Summary

This PR refactors speculative decoding to batch draft tokens with the main model inference, eliminating a separate llama_decode() call per iteration. The analysis shows no measurable performance differences between versions, with all binaries reporting 0.0% power consumption change and no function-level Response Time or Throughput Time variations. The code changes are structurally sound, moving draft generation before batch construction and adding rollback logic for token management, but the optimization benefits are not captured in the static analysis environment.
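To make the batching idea concrete, here is a minimal, self-contained C++ sketch of the greedy speculative-decoding loop described above: a cheap draft model proposes a few tokens, the target model verifies them in a single batched pass (one llama_decode() per iteration in the real server), and everything past the first mismatch is rolled back. This is an illustration only, assuming greedy acceptance; target_next, draft_next, n_draft, and the toy integer "models" are placeholders, not part of the llama.cpp API.

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy stand-ins for the models: target_next() plays the main model,
// draft_next() the draft model. Both are arbitrary integer functions.
static int32_t target_next(int32_t tok) { return (tok * 7 + 3) % 101; }
static int32_t draft_next (int32_t tok) {
    // The draft model is deliberately "wrong" on multiples of 10,
    // so the rollback path gets exercised.
    return (tok % 10 == 0) ? tok + 1 : target_next(tok);
}

int main() {
    const int n_draft = 4;              // draft tokens proposed per iteration
    std::vector<int32_t> out = { 42 };  // generated sequence so far

    for (int iter = 0; iter < 6; ++iter) {
        // 1) Draft phase: the cheap model proposes n_draft tokens.
        std::vector<int32_t> draft;
        int32_t cur = out.back();
        for (int i = 0; i < n_draft; ++i) {
            cur = draft_next(cur);
            draft.push_back(cur);
        }

        // 2) Verify phase: ONE batched pass of the target model over the
        //    last accepted token plus every draft token (n_draft + 1 outputs).
        //    In the PR this corresponds to a single llama_decode() call
        //    instead of a separate decode per iteration.
        std::vector<int32_t> verify;
        verify.push_back(target_next(out.back()));
        for (int i = 0; i < n_draft; ++i) {
            verify.push_back(target_next(draft[i]));
        }

        // 3) Accept the longest prefix where draft and target agree, take one
        //    guaranteed token from the target (correction or bonus), and roll
        //    back the remaining draft tokens.
        int n_accept = 0;
        while (n_accept < n_draft && draft[n_accept] == verify[n_accept]) {
            n_accept++;
        }
        out.insert(out.end(), verify.begin(), verify.begin() + n_accept + 1);

        printf("iter %d: accepted %d/%d draft tokens\n", iter, n_accept, n_draft);
    }

    printf("sequence:");
    for (int32_t t : out) printf(" %d", t);
    printf("\n");
    return 0;
}

Under greedy acceptance the sketch produces exactly what the target model alone would generate, at the cost of evaluating a few extra positions per batch; it guarantees at least one token of progress per iteration even when every draft token is rejected.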

@loci-dev force-pushed the main branch 21 times, most recently from cbd1d18 to ebc7ac8 on December 8, 2025 at 15:10