Releases: ggml-org/llama.cpp

b7332

09 Dec 14:41
4e842d5

Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

console: allow using arrow left/right, home/end keys and history mode (#17836)

  • console: allow using arrow left/right to edit the line (with UTF-8 support)

  • console: fix arrow keys on Windows using private-use Unicode

  • console: add Home/End key support for Windows and Linux

  • console: add basic Up/Down history navigation

  • fix build

  • console: remove unreachable wc == 0 check after VK switch

  • console: add Ctrl+Left/Right word navigation

    • Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
    • Windows: detect CTRL modifier via dwControlKeyState
    • Linux: parse ANSI sequences with modifier (1;5D/C)
    • Implement move_word_left/right with space-skipping logic
    • Refactor escape sequence parsing to accumulate params
  • console: add Delete key support

    • Windows: VK_DELETE detection
    • Linux: ESC[3~ sequence parsing
    • Forward character deletion with UTF-8 support
  • console: implement bash-style history editing

    • Edit any history line during UP/DOWN navigation; edits persist
    • Pressing Enter appends the edited version as a new history entry
    • Original lines stay untouched in their positions
  • clean up

  • better history impl

  • fix decode_utf8
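
The Linux escape-sequence handling listed above (Ctrl+arrows as 1;5D / 1;5C, Delete as ESC[3~) can be illustrated with a small parser that accumulates numeric parameters before looking at the final byte. This is a minimal sketch, not the code from the PR: the function name decode_csi and all key codes other than KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT are hypothetical, and the input is assumed to be the bytes that follow "ESC [".

```cpp
#include <string>
#include <vector>

// Hypothetical key codes for this sketch; only the two CTRL names appear in the notes.
enum Key {
    KEY_NONE,
    KEY_ARROW_LEFT,
    KEY_ARROW_RIGHT,
    KEY_HOME,
    KEY_END,
    KEY_DELETE,
    KEY_CTRL_ARROW_LEFT,
    KEY_CTRL_ARROW_RIGHT,
};

// Decode the bytes that follow "ESC [" in a CSI sequence, e.g. "1;5D" or "3~".
// Numeric parameters are accumulated until the first byte that is neither a
// digit nor ';', which is treated as the final byte.
Key decode_csi(const std::string & seq) {
    std::vector<int> params;
    int  cur        = 0;
    bool have_digit = false;
    char final_byte = 0;

    for (char c : seq) {
        if (c >= '0' && c <= '9') {
            cur = cur * 10 + (c - '0');
            have_digit = true;
        } else if (c == ';') {
            params.push_back(have_digit ? cur : 0);
            cur = 0;
            have_digit = false;
        } else {
            if (have_digit) {
                params.push_back(cur);
            }
            final_byte = c;
            break;
        }
    }

    // xterm convention: the second parameter encodes modifiers, and 5 means Ctrl.
    const bool ctrl = params.size() >= 2 && params[1] == 5;

    switch (final_byte) {
        case 'D': return ctrl ? KEY_CTRL_ARROW_LEFT  : KEY_ARROW_LEFT;
        case 'C': return ctrl ? KEY_CTRL_ARROW_RIGHT : KEY_ARROW_RIGHT;
        case 'H': return KEY_HOME;
        case 'F': return KEY_END;
        case '~':
            if (!params.empty() && params[0] == 3) return KEY_DELETE;
            if (!params.empty() && params[0] == 1) return KEY_HOME;
            if (!params.empty() && params[0] == 4) return KEY_END;
            return KEY_NONE;
        default:
            return KEY_NONE;
    }
}
```

Accumulating parameters first keeps the modifier check in one place, so plain and Ctrl-modified arrows share the same final-byte switch.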


Co-authored-by: Pascal

b7331

09 Dec 14:16
ca709e4

CANN: add support for partial RoPE and Vision mode (#17543)

  • cann: add support for partial RoPE and Vision mode

Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.

  1. Support for partial RoPE (rope_dims < ne0):

    • Split tensor into head (first rope_dims dimensions) and tail portions
    • Apply rotation only to head portion using RotaryPositionEmbedding operator
    • Copy unrotated tail portion directly from source to destination
    • Handle both contiguous and non-contiguous tensor layouts
  2. Support for Vision mode (GGML_ROPE_TYPE_VISION):

    • Set rope_dims = ne0 for Vision mode to rotate entire tensor
    • Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
    • No tail handling needed since entire tensor is rotated

Implementation details:

  • Use has_tail flag to determine execution path: head/tail splitting when
    rope_dims < ne0, or full tensor rotation when rope_dims == ne0
  • Support both F32 and F16 data types with intermediate F32 conversion
  • Copy non-contiguous tensors to contiguous buffers before calling
    RotaryPositionEmbedding operator for compatibility
  • Improve cache invalidation logic to include rope_dims and indep_sects
    parameters

These enhancements enable the CANN backend to handle the various RoPE
configurations used in modern vision-language models and models with partial rotation.
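
The head/tail split described above can be sketched as a CPU reference for a single row. This is illustrative only and is not the CANN implementation: the function name rope_row, the adjacent-pair layout (2i, 2i+1), and the frequency formula theta_i = pos * freq_base^(-2i / rope_dims) are assumptions made for the example. Only the has_tail logic mirrors the notes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference sketch of partial RoPE on one row of length ne0 at position `pos`.
// Assumptions (not from the release notes): adjacent elements (2i, 2i+1) form a
// rotation pair, and theta_i = pos * freq_base^(-2i / rope_dims).
std::vector<float> rope_row(const std::vector<float> & src, size_t rope_dims,
                            int pos, float freq_base = 10000.0f) {
    const size_t ne0 = src.size();
    assert(rope_dims <= ne0 && rope_dims % 2 == 0);

    std::vector<float> dst(ne0);
    const bool has_tail = rope_dims < ne0;

    // Head: rotate the first rope_dims elements pair by pair.
    for (size_t i = 0; i < rope_dims; i += 2) {
        const float theta = pos * std::pow(freq_base, -float(i) / float(rope_dims));
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        dst[i]     = src[i] * c - src[i + 1] * s;
        dst[i + 1] = src[i] * s + src[i + 1] * c;
    }

    // Tail: copy the remaining elements unrotated.
    if (has_tail) {
        for (size_t i = rope_dims; i < ne0; ++i) {
            dst[i] = src[i];
        }
    }
    return dst;
}
```

When rope_dims equals ne0 the tail loop is skipped and the whole row is rotated, which corresponds to the full-tensor path. Vision mode pairs dimensions differently (i with i+n_dims) and is not covered by this sketch.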

  • cann: fix review comment

b7330

09 Dec 09:07
0cdce38

CUDA: fix FP16 overflow in tile FA kernel (#17875)

b7329

09 Dec 07:02
e39502e

llama : add token matching support to llama-grammar (#17816)

  • llama : add token support to llama-grammar

  • fix inverse token comment

  • refactor trigger_patterns to replay tokens instead of the entire string

  • add token documentation

  • fix test-llama-grammar

  • improve test cases for tokens

b7328

09 Dec 05:13
1d2a1ab

model : support Rnj-1 (#17811)

  • add support for rnj1

  • refactor gemma3 to support rnj-1

  • address review comments

b7327

08 Dec 23:00
c8554b6

graph : use fill instead of scale_bias in grouped expert selection (#17867)

  • use fill instead of scale_bias in grouped expert selection

  • do not explicitly use _inplace

b7325

08 Dec 17:19
951520d

server: delegate result_state creation to server_task (#17835)

  • server: delegate result_state creation to server_task

  • remove unused states

  • add more docs

b7324

08 Dec 16:05
68522c6

ci : support bfloat16 SYCL release package (#17855)

  • support bfloat16 release package

  • add fallback file

b7318

08 Dec 12:39
2bc9693

server : make cache_reuse configurable per request (#17858)

b7317

08 Dec 11:33
5814b4d

cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

  • ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

    • Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
    • Implement explicit fmaf instructions for the reduction loop.
    • Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to char * before addition).
    • Remove unused MAX_K_FAST definition.
  • Small cleanup

  • Remove comments in solve_tri.cu

  • Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler

  • Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler

  • Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler

  • Use const for variables in solve_tri.cu

  • Replace fmaf with more readable code

  • remove last fmaf
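
The byte-stride convention mentioned above (cast to char * before adding the offset) can be shown with a plain CPU forward substitution. This is a sketch under stated assumptions, not the CUDA kernel: the names elem and solve_lower_tri are hypothetical, A is assumed to be lower triangular with a non-zero diagonal, and the output is written as a contiguous row-major buffer. It uses plain multiply-subtract rather than fmaf, in line with the later commits in this list.

```cpp
#include <cstddef>
#include <vector>

// Address element (row, col) of a float matrix whose strides are given in bytes:
// the base pointer is cast to char * so the offsets are added in bytes.
static inline const float * elem(const float * base, size_t row, size_t col,
                                 size_t row_stride_bytes, size_t col_stride_bytes) {
    const char * p = reinterpret_cast<const char *>(base)
                   + row * row_stride_bytes + col * col_stride_bytes;
    return reinterpret_cast<const float *>(p);
}

// Solve A * X = B by forward substitution, where A is n x n lower triangular
// and B is n x k. Sketch assumptions: non-zero diagonal, row-major output X.
std::vector<float> solve_lower_tri(const float * A, size_t a_rs, size_t a_cs,
                                   const float * B, size_t b_rs, size_t b_cs,
                                   size_t n, size_t k) {
    std::vector<float> X(n * k);
    for (size_t col = 0; col < k; ++col) {
        for (size_t row = 0; row < n; ++row) {
            float sum = *elem(B, row, col, b_rs, b_cs);
            // Subtract the contributions of the already-solved rows.
            for (size_t j = 0; j < row; ++j) {
                sum -= *elem(A, row, j, a_rs, a_cs) * X[j * k + col];
            }
            const float diag = *elem(A, row, row, a_rs, a_cs);
            X[row * k + col] = sum / diag;
        }
    }
    return X;
}
```

Passing strides in bytes lets the same routine read transposed or padded views without knowing the element layout, at the cost of the explicit pointer casts.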


Co-authored-by: Johannes Gäßler
