Releases: ggml-org/llama.cpp
b7332
Warning
Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
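For deployment scripts that must keep working across the switch, one option is an extraction step that accepts either archive format. The following is a minimal sketch in Python, assuming the archive has already been downloaded to disk. The helper name unpack_release and any file names are hypothetical and not part of llama.cpp.

```python
import shutil
from pathlib import Path


def unpack_release(archive, dest):
    """Extract a release archive into `dest`, whether it is .zip or .tar.gz."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # shutil.unpack_archive selects the format from the file extension,
    # so the same call handles both "x.zip" and "x.tar.gz".
    shutil.unpack_archive(str(archive), str(dest))
    return dest
```

A script could then call unpack_release(path, "llama") without caring which archive type a given release ships.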
console: allow using arrow left/right, home/end keys and history mode (#17836)
- console: allow using arrow left/right to edit the line (with UTF-8 support)
- console: fix arrow keys on Windows using private-use Unicode
- console: add Home/End key support for Windows and Linux
- console: add basic Up/Down history navigation
- fix build
- console: remove unreachable wc == 0 check after VK switch
- console: add Ctrl+Left/Right word navigation
  - Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
  - Windows: detect CTRL modifier via dwControlKeyState
  - Linux: parse ANSI sequences with modifier (1;5D/C)
  - Implement move_word_left/right with space-skipping logic
  - Refactor escape sequence parsing to accumulate params
- console: add Delete key support
  - Windows: VK_DELETE detection
  - Linux: ESC[3~ sequence parsing
  - Forward character deletion with UTF-8 support
- console: implement bash-style history editing
  - Edit any history line during UP/DOWN navigation, edits persist
  - Pressing Enter appends edited version as new history entry
  - Original lines stay untouched in their positions
- clean up
- better history impl
- fix decode_utf8
Co-authored-by: Pascal
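The Linux escape-sequence handling and the space-skipping word movement listed above can be illustrated with a short sketch. This is not the llama.cpp implementation (which is C++ and works on UTF-8 bytes). It is a Python approximation in which the key names returned by decode_csi are hypothetical, cursor positions are code-point indices, and Ctrl+Right is assumed to stop at the end of the next word, as in bash.

```python
def decode_csi(seq):
    """Decode 'ESC [ params final' into a (hypothetical) key name, or None."""
    if not seq.startswith("\x1b[") or len(seq) < 3:
        return None
    body, final = seq[2:-1], seq[-1]
    # Accumulate ';'-separated numeric parameters, e.g. "1;5" -> [1, 5].
    params = [int(p) for p in body.split(";") if p.isdigit()]
    # In xterm-style sequences a second parameter of 5 means Ctrl is held.
    ctrl = len(params) >= 2 and params[1] == 5
    if final == "D":
        return "CTRL_LEFT" if ctrl else "LEFT"
    if final == "C":
        return "CTRL_RIGHT" if ctrl else "RIGHT"
    if final == "~" and params[:1] == [3]:
        return "DELETE"
    return None


def move_word_left(line, pos):
    """Return the cursor index after a Ctrl+Left from `pos`."""
    # Skip spaces directly left of the cursor, then the word before them.
    while pos > 0 and line[pos - 1] == " ":
        pos -= 1
    while pos > 0 and line[pos - 1] != " ":
        pos -= 1
    return pos


def move_word_right(line, pos):
    """Return the cursor index after a Ctrl+Right from `pos`."""
    n = len(line)
    # Skip spaces at the cursor, then advance past the following word.
    while pos < n and line[pos] == " ":
        pos += 1
    while pos < n and line[pos] != " ":
        pos += 1
    return pos
```

Because Python strings index by code point, multi-byte characters such as "é" count as one step here. A byte-oriented implementation would have to step over whole UTF-8 sequences instead.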
b7331
CANN: add support for partial RoPE and Vision mode (#17543)
- cann: add support for partial RoPE and Vision mode
  Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
  and Vision mode rotation.
  Support for partial RoPE (rope_dims < ne0):
  - Split tensor into head (first rope_dims dimensions) and tail portions
  - Apply rotation only to head portion using RotaryPositionEmbedding operator
  - Copy unrotated tail portion directly from source to destination
  - Handle both contiguous and non-contiguous tensor layouts
  Support for Vision mode (GGML_ROPE_TYPE_VISION):
  - Set rope_dims = ne0 for Vision mode to rotate entire tensor
  - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
  - No tail handling needed since entire tensor is rotated
  Implementation details:
  - Use has_tail flag to determine execution path: head/tail splitting when
    rope_dims < ne0, or full tensor rotation when rope_dims == ne0
  - Support both F32 and F16 data types with intermediate F32 conversion
  - Copy non-contiguous tensors to contiguous buffers before calling
    RotaryPositionEmbedding operator for compatibility
  - Improve cache invalidation logic to include rope_dims and indep_sects
    parameters
  These enhancements enable CANN backend to handle various RoPE configurations
  used in modern vision-language models and models with partial rotation.
- cann: fix review comment
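The head/tail split described for partial RoPE can be shown on a single vector. This is a minimal sketch in plain Python, not the CANN kernel. The function name rope_partial, the frequency schedule with base 10000, and the pairing of entry i with entry i + rope_dims/2 inside the head are all assumptions made for illustration.

```python
import math


def rope_partial(x, pos, rope_dims, theta_base=10000.0):
    """Rotate the first `rope_dims` entries of `x`; copy the tail unchanged."""
    assert rope_dims % 2 == 0 and rope_dims <= len(x)
    out = list(x)  # tail x[rope_dims:] passes through unrotated
    half = rope_dims // 2
    for i in range(half):
        # Assumed frequency schedule: theta_i = base^(-2i / rope_dims).
        angle = pos * theta_base ** (-2.0 * i / rope_dims)
        c, s = math.cos(angle), math.sin(angle)
        # Assumed pairing: entry i rotates together with entry i + half.
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

When rope_dims equals len(x) there is no tail and the whole vector is rotated, which corresponds to the rope_dims == ne0 path in the notes above.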
b7330
CUDA: fix FP16 overflow in tile FA kernel (#17875)
b7329
llama : add token matching support to llama-grammar (#17816)
- llama : add token support to llama-grammar
- fix inverse token comment
- refactor trigger_patterns to replay tokens instead of the entire string
- add token documentation
- fix test-llama-grammar
- improve test cases for tokens
b7328
model : support Rnj-1 (#17811)
- add support for rnj1
- refactor gemma3 to support rnj-1
- address review comments
b7327
graph : use fill instead of scale_bias in grouped expert selection (#17867)
- use fill instead of scale_bias in grouped expert selection
- do not explicitly use _inplace
b7325
server: delegate result_state creation to server_task (#17835)
- server: delegate result_state creation to server_task
- remove unused states
- add more docs
b7324
ci : support bfloat16 SYCL release package (#17855)
- support bfloat16 release package
- add fallback file
b7318
server : make cache_reuse configurable per request (#17858)
b7317
cuda: optimize SOLVE_TRI using registers and FMAF (#17703)
- ggml-cuda: optimize solve_tri_f32_fast and fix stride handling
  - Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
  - Implement explicit fmaf instructions for the reduction loop.
  - Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to char * before addition).
  - Remove unused MAX_K_FAST definition.
- Small cleanup
- Remove comments in solve_tri.cu
- Update ggml/src/ggml-cuda/solve_tri.cu
- Use const for variables in solve_tri.cu
- Replace fmaf with more readable code
- remove last fmaf
Co-authored-by: Johannes Gäßler
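The byte-stride convention mentioned above can be illustrated with a small CPU-side sketch. This is not the CUDA kernel. It is plain Python that solves a lower-triangular system by forward substitution while addressing matrix elements through byte offsets, the way a cast to char * followed by pointer addition would. The function names and the assumption that the system is lower-triangular are for illustration only.

```python
import struct


def load_f32(buf, byte_offset):
    """Read one float32 at a byte offset, like *(float *)((char *)p + off)."""
    return struct.unpack_from("<f", buf, byte_offset)[0]


def solve_lower_tri(a_buf, n, nb0, nb1, b):
    """Forward substitution for L x = b.

    Element (row i, col j) of L lives at byte offset i * nb1 + j * nb0,
    so rows may carry padding without any change to the arithmetic.
    """
    x = []
    for i in range(n):
        acc = b[i]
        for j in range(i):
            acc -= load_f32(a_buf, i * nb1 + j * nb0) * x[j]
        x.append(acc / load_f32(a_buf, i * nb1 + i * nb0))
    return x
```

With strides given in elements, a padded row would need an extra multiplication by the element size at every access. Passing bytes keeps the addressing uniform across element types.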