Releases · ggml-org/llama.cpp

09 Dec 14:41

4e842d5

b7332 Latest


Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
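One way to prepare a deployment script for this change is to choose the extractor from the archive's file name, so the same script handles both formats during the transition. The following is a minimal sketch using only the Python standard library; the helper name `extract_release` is hypothetical and not part of llama.cpp.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_release(archive: str, dest: str) -> list[str]:
    """Extract a .tar.gz or .zip release archive into dest; return member names.

    Hypothetical helper for illustration only.
    """
    path = Path(archive)
    Path(dest).mkdir(parents=True, exist_ok=True)

    # New-style Linux releases: gzip-compressed tarballs.
    if path.name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(path, "r:gz") as tar:
            names = tar.getnames()
            tar.extractall(dest)
            return names

    # Old-style releases (and other platforms): zip archives.
    if path.suffix == ".zip":
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest)
            return zf.namelist()

    raise ValueError(f"unsupported archive format: {path.name}")
```

Dispatching on the name rather than hard-coding one format means the script keeps working whichever archive a given release ships.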

console: allow using arrow left/right, home/end keys and history mode (#17836)

console: allow using arrow left/right to edit the line (with UTF-8 support)
console: fix arrow keys on Windows using private-use Unicode
console: add Home/End key support for Windows and Linux
console: add basic Up/Down history navigation
fix build
console: remove unreachable wc == 0 check after VK switch
console: add Ctrl+Left/Right word navigation

Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
Windows: detect CTRL modifier via dwControlKeyState
Linux: parse ANSI sequences with modifier (1;5D/C)
Implement move_word_left/right with space-skipping logic
Refactor escape sequence parsing to accumulate params
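The space-skipping logic for word navigation can be illustrated as follows. This is a sketch of the general technique under a readline-like convention, not the code from the PR: the function names mirror those in the commit message, but the exact stopping positions of the real implementation are an assumption. The line is modelled as a Python string, so each index is one code point.

```python
def move_word_left(line: str, cursor: int) -> int:
    """Return the cursor index after a Ctrl+Left jump (assumed convention)."""
    # First skip any spaces directly to the left of the cursor...
    while cursor > 0 and line[cursor - 1].isspace():
        cursor -= 1
    # ...then skip back over the word itself, stopping at its first character.
    while cursor > 0 and not line[cursor - 1].isspace():
        cursor -= 1
    return cursor


def move_word_right(line: str, cursor: int) -> int:
    """Return the cursor index after a Ctrl+Right jump (assumed convention)."""
    n = len(line)
    # Skip spaces under and after the cursor...
    while cursor < n and line[cursor].isspace():
        cursor += 1
    # ...then skip forward to just past the end of the next word.
    while cursor < n and not line[cursor].isspace():
        cursor += 1
    return cursor
```

For the line `héllo  wörld` with the cursor at the end (index 12), `move_word_left` lands on index 7, the start of `wörld`; a second call lands on index 0.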

console: add Delete key support

Windows: VK_DELETE detection
Linux: ESC[3~ sequence parsing
Forward character deletion with UTF-8 support
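Forward deletion with UTF-8 support means removing a whole code point, not a single byte. A minimal sketch of that idea, assuming the line is held as a UTF-8 byte buffer and the cursor is a byte offset at a character boundary (both are assumptions; the helper names are hypothetical):

```python
def utf8_char_len(lead: int) -> int:
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # invalid lead byte: treat as one byte so deletion always progresses


def delete_forward(buf: bytearray, pos: int) -> None:
    """Delete the code point that starts at byte offset pos (Delete key)."""
    if pos >= len(buf):
        return  # cursor at end of line: nothing to delete
    n = min(utf8_char_len(buf[pos]), len(buf) - pos)
    del buf[pos:pos + n]
```

Deleting at offset 1 in the UTF-8 encoding of `aé€b` removes both bytes of `é` in one step, leaving `a€b`.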

console: implement bash-style history editing

Edit any history line during UP/DOWN navigation, edits persist
Pressing Enter appends edited version as new history entry
Original lines stay untouched in their positions
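The behaviour described above can be modelled with an overlay of pending edits on top of an immutable list of committed entries. The class below is a hypothetical sketch of that model, not the implementation from the PR; in particular, clearing pending edits once Enter is pressed is an assumption borrowed from bash.

```python
class History:
    """Hypothetical model of bash-style history editing."""

    def __init__(self) -> None:
        self.entries: list[str] = []     # committed lines, never modified
        self.edits: dict[int, str] = {}  # pending edits keyed by entry index
        self.index = 0                   # len(entries) denotes the fresh input line

    def current(self) -> str:
        # A pending edit shadows the committed entry at the same position.
        if self.index in self.edits:
            return self.edits[self.index]
        if self.index < len(self.entries):
            return self.entries[self.index]
        return ""

    def set_current(self, text: str) -> None:
        self.edits[self.index] = text

    def up(self) -> str:
        self.index = max(0, self.index - 1)
        return self.current()

    def down(self) -> str:
        self.index = min(len(self.entries), self.index + 1)
        return self.current()

    def enter(self) -> str:
        # Append the (possibly edited) line as a new entry; originals stay put.
        line = self.current()
        if line:
            self.entries.append(line)
        self.edits.clear()  # assumption: pending edits are dropped after Enter
        self.index = len(self.entries)
        return line
```

Keeping edits in a separate map is what lets an edited line survive further UP/DOWN navigation while the stored entry it shadows remains unchanged.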

clean up
better history impl
fix decode_utf8

Co-authored-by: Pascal

Assets 24

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-12-09T14:41:37Z

cudart-llama-bin-win-cuda-13.1-x64.zip
sha256:f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18
384 MB 2025-12-09T14:41:48Z

llama-b7332-bin-macos-arm64.tar.gz
sha256:e826e121deeba926ccf2f66a33479e62933b93846a86e76d723fceab24f16a6f
13.2 MB 2025-12-09T14:41:59Z

llama-b7332-bin-macos-arm64.zip
sha256:e9c72f969ae14514ad62c5507730a3389953b89be7f9cbd4914f3707bd064517
13.2 MB 2025-12-09T14:42:00Z

llama-b7332-bin-macos-x64.tar.gz
sha256:bbf965b53bbdf1a9412fcb69df1811457494cccb0817b87bfca515951f8800fe
36.4 MB 2025-12-09T14:42:02Z

llama-b7332-bin-macos-x64.zip
sha256:acc9c9a685ea5577c009bcb5e8daeccf0189ae73832de147746fdb41c6d37a32
36.4 MB 2025-12-09T14:42:03Z

llama-b7332-bin-ubuntu-s390x.tar.gz
sha256:f20b56b068cfebea0f64e3f8958809e7378262c6dcb6b8efc678c1927c1064da
17.7 MB 2025-12-09T14:42:05Z

llama-b7332-bin-ubuntu-s390x.zip
sha256:e388490c1a8bfafe4c1082683df209b64a2db25fd8b65d24de5d7f1f0fc45b49
15.1 MB 2025-12-09T14:42:06Z

llama-b7332-bin-ubuntu-vulkan-x64.tar.gz
sha256:e215a39cc5139db5435dce12bc94957a8fcc2e7937df691d7c0b385cd3ca07ad
30.1 MB 2025-12-09T14:42:07Z

llama-b7332-bin-ubuntu-vulkan-x64.zip
sha256:42951ea3bd939dff8bca4b7d7024f284b06be1460f918609c38d5cee0014655b
30.1 MB 2025-12-09T14:42:09Z

Source code (zip)
2025-12-09T10:53:59Z

Source code (tar.gz)
2025-12-09T10:53:59Z
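The sha256 digests listed with each asset can be checked after download. The following is a minimal sketch using Python's `hashlib`; the helper names `sha256_of` and `verify_asset` are hypothetical, and the file is read in chunks so that the larger archives need not fit in memory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_asset(path: str, expected: str) -> bool:
    """Compare a file's digest with a listed value.

    `expected` may carry the "sha256:" prefix used in the asset listing.
    """
    return sha256_of(path) == expected.removeprefix("sha256:").lower()
```

A deployment script would call `verify_asset` with the downloaded file and the digest copied from the listing, and abort when it returns `False`.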

09 Dec 14:16

github-actions

b7331

ca709e4

CANN: add support for partial RoPE and Vision mode (#17543)

cann: add support for partial RoPE and Vision mode

Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.

Support for partial RoPE (rope_dims < ne0):
- Split tensor into head (first rope_dims dimensions) and tail portions
- Apply rotation only to head portion using RotaryPositionEmbedding operator
- Copy unrotated tail portion directly from source to destination
- Handle both contiguous and non-contiguous tensor layouts

Support for Vision mode (GGML_ROPE_TYPE_VISION):
- Set rope_dims = ne0 for Vision mode to rotate entire tensor
- Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
- No tail handling needed since entire tensor is rotated

Implementation details:
- Use has_tail flag to determine execution path: head/tail splitting when rope_dims < ne0, or full tensor rotation when rope_dims == ne0
- Support both F32 and F16 data types with intermediate F32 conversion
- Copy non-contiguous tensors to contiguous buffers before calling RotaryPositionEmbedding operator for compatibility
- Improve cache invalidation logic to include rope_dims and indep_sects parameters

These enhancements enable CANN backend to handle various RoPE configurations
used in modern vision-language models and models with partial rotation.
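The head/tail split for partial RoPE can be illustrated on a single vector. This is a pure-Python sketch of the general technique, not the CANN kernel: the adjacent-pair layout (dimension i rotated with i+1), the frequency base of 10000, and the function name `rope_partial` are assumptions made for illustration.

```python
import math


def rope_partial(x: list[float], pos: int, rope_dims: int,
                 theta_base: float = 10000.0) -> list[float]:
    """Rotate the first rope_dims entries of x; copy the tail unchanged."""
    ne0 = len(x)
    assert rope_dims % 2 == 0 and rope_dims <= ne0
    has_tail = rope_dims < ne0  # mirrors the flag named in the notes above

    out = [0.0] * ne0
    # Head: rotate each pair (i, i+1) by a position-dependent angle.
    for i in range(0, rope_dims, 2):
        angle = pos * theta_base ** (-i / rope_dims)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    # Tail: copied straight from source to destination, unrotated.
    if has_tail:
        out[rope_dims:] = x[rope_dims:]
    return out
```

With `rope_dims == len(x)` the tail branch is skipped and the whole vector is rotated, which corresponds to the full-rotation path described above.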

cann: fix review comment

09 Dec 09:07

github-actions

b7330

0cdce38

CUDA: fix FP16 overflow in tile FA kernel (#17875)

09 Dec 07:02

github-actions

b7329

e39502e

llama : add token matching support to llama-grammar (#17816)

llama : add token support to llama-grammar
fix inverse token comment
refactor trigger_patterns to replay tokens instead of the entire string
add token documentation
fix test-llama-grammar
improve test cases for tokens

09 Dec 05:13

github-actions

b7328

1d2a1ab

model : support Rnj-1 (#17811)

add support for rnj1
refactor gemma3 to support rnj-1
address review comments

08 Dec 23:00

github-actions

b7327

c8554b6

graph : use fill instead of scale_bias in grouped expert selection (#17867)

use fill instead of scale_bias in grouped expert selection
do not explicitly use _inplace

08 Dec 17:19

github-actions

b7325

951520d

server: delegate result_state creation to server_task (#17835)

server: delegate result_state creation to server_task
remove unused states
add more docs

08 Dec 16:05

github-actions

b7324

68522c6

ci : support bfloat16 SYCL release package (#17855)

support bfloat16 release package
add fallback file

08 Dec 12:39

github-actions

b7318

2bc9693

server : make cache_reuse configurable per request (#17858)

08 Dec 11:33

github-actions

b7317

5814b4d

cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
Implement explicit fmaf instructions for the reduction loop.
Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to char * before addition).
Remove unused MAX_K_FAST definition.
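For readers unfamiliar with the operation, a triangular solve reduces to substitution with a multiply-accumulate inner loop, which is the reduction the fmaf note above refers to. The sketch below shows plain forward substitution for a lower-triangular system with a single right-hand side. It is a CPU illustration only: the lower-triangular layout and the function name `solve_lower_tri` are assumptions, and nothing here reflects the register-based CUDA kernel itself.

```python
def solve_lower_tri(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a @ x = b for x, where a is lower triangular with a nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        # Reduction loop: each step is one multiply-add on already-solved entries.
        for j in range(i):
            acc -= a[i][j] * x[j]
        x[i] = acc / a[i][i]
    return x
```

Each row depends only on the rows solved before it, so the per-row reduction can run over values that are already available.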

Small cleanup
Remove comments in solve_tri.cu
Update ggml/src/ggml-cuda/solve_tri.cu


Use const for variables in solve_tri.cu
Replace fmaf with more readable code
remove last fmaf

Co-authored-by: Johannes Gäßler
