Releases: ggml-org/llama.cpp

b6690

04 Oct 20:24
86df2c9
vulkan: use a more appropriate amount of threads when generating shad…

b6689

04 Oct 13:44
f392839
rpc : check src buffer when copying tensor (#16421)

Only the dst buffer is guaranteed to be an RPC buffer. Add a check for the src one.
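
A minimal C++ sketch of the kind of guard this implies: before forwarding a copy to the remote side, verify that the src buffer is actually an RPC buffer and bail out otherwise. The helper name `is_rpc_buffer` and its stub body are assumptions for illustration, not the code added in the PR.

```cpp
#include "ggml.h"          // ggml_tensor
#include "ggml-backend.h"  // ggml_backend_buffer_t

// assumed helper (stub body for this sketch): true if the buffer belongs to
// the RPC backend; real code would inspect the buffer's type/interface
static bool is_rpc_buffer(ggml_backend_buffer_t buffer) {
    (void) buffer;
    return false;
}

static bool rpc_cpy_tensor_sketch(const ggml_tensor * src, ggml_tensor * dst) {
    (void) dst;
    // the caller only guarantees that dst lives in an RPC buffer
    if (!is_rpc_buffer(src->buffer)) {
        // src belongs to another backend: report "not handled" so the
        // generic byte-by-byte copy path is taken instead of a remote copy
        return false;
    }
    // ... serialize and send the copy request to the server (omitted)
    return true;
}
```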

b6688

04 Oct 10:50
898acba
rpc : add support for multiple devices (#16276)

* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change the RPC protocol to include a device identifier where needed (a sketch follows this list).

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
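
A minimal sketch of what "include a device identifier" can look like on the wire; the struct and field names below are assumptions for illustration, not the protocol's actual message layout.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// hypothetical request header: messages that used to implicitly address
// "the" device now carry an explicit device index
struct rpc_req_sketch {
    uint32_t device; // index into the devices this endpoint exposes
    uint64_t size;   // e.g. requested size for a buffer allocation
};

// server side: keep one backend handle per exposed device, dispatch on the id
struct rpc_server_sketch {
    std::vector<int> backends; // stand-in for per-device backend handles

    int & select_backend(const rpc_req_sketch & req) {
        if (req.device >= backends.size()) {
            throw std::out_of_range("invalid device id in request");
        }
        return backends[req.device];
    }
};
```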

b6687

04 Oct 10:04
e29acf7
vulkan : incremental shader builds (#16341)

* vulkan (DRAFT): split shader generation by GLSL source file to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files that are used as "headers" to use the .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write the embedded shader .hpp/.cpp outputs when their contents change (sketched after this entry)

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* the early-out path did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
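
The "only write when they change" step is the standard trick for keeping a generated file's timestamp stable so its dependents (here, ggml-vulkan.cpp) are not rebuilt needlessly. A minimal standalone C++ sketch, not the project's actual generator code:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// write `contents` to `path` only if the file does not already hold exactly
// that data; skipping the write leaves the mtime untouched, so the build
// system sees no change and does not recompile dependents
static void write_if_changed(const std::string & path, const std::string & contents) {
    std::ifstream in(path, std::ios::binary);
    std::stringstream old;
    old << in.rdbuf();
    if (in.is_open() && old.str() == contents) {
        return; // identical content: leave the timestamp alone
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << contents;
}
```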

b6686

03 Oct 20:02
128d522
chat : support Magistral thinking (#16413)

* feat: added a dedicated Magistral chat format that preserves [THINK] spans and parses reasoning before tool calls (a sketch follows this entry)

* feat: added a new flow to the chat template test suite for Magistral
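
A minimal sketch of the span handling this describes, assuming Magistral wraps reasoning in literal [THINK]...[/THINK] markers; the real parser in llama.cpp's chat layer is more involved:

```cpp
#include <string>
#include <utility>

// split a model response into (reasoning, content): the reasoning span is
// extracted first so any tool-call text in the remainder is parsed after it
static std::pair<std::string, std::string> split_think(const std::string & msg) {
    const std::string open  = "[THINK]";
    const std::string close = "[/THINK]";
    const size_t b = msg.find(open);
    const size_t e = msg.find(close);
    if (b == std::string::npos || e == std::string::npos || e < b) {
        return {"", msg}; // no reasoning span: everything is content
    }
    std::string reasoning = msg.substr(b + open.size(), e - (b + open.size()));
    std::string content   = msg.substr(e + close.size());
    return {reasoning, content};
}
```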

b6685

03 Oct 19:55
f6dcda3
server : context checkpointing for hybrid and recurrent models (#16382)

* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to cover
hybrid/recurrent models such as Jamba and Granite (a sketch follows this entry)

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
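
A minimal sketch of per-sequence checkpointing on top of llama.cpp's public state API (`llama_state_seq_get_size` / `llama_state_seq_get_data` / `llama_state_seq_set_data`); the checkpoint struct here is illustrative, and the server's actual `ctx_checkpoint` logic adds policy around when to snapshot and which checkpoint to roll back to:

```cpp
#include <cstdint>
#include <vector>

#include "llama.h"

// one saved snapshot of a sequence's state (KV cache and/or recurrent state)
struct ctx_checkpoint_sketch {
    llama_pos            pos;  // number of tokens covered by the snapshot
    std::vector<uint8_t> data; // serialized per-sequence state
};

static ctx_checkpoint_sketch save_checkpoint(llama_context * ctx, llama_seq_id seq, llama_pos pos) {
    ctx_checkpoint_sketch cp;
    cp.pos = pos;
    cp.data.resize(llama_state_seq_get_size(ctx, seq));
    llama_state_seq_get_data(ctx, cp.data.data(), cp.data.size(), seq);
    return cp;
}

static void restore_checkpoint(llama_context * ctx, llama_seq_id seq, const ctx_checkpoint_sketch & cp) {
    // rewinds the sequence to the snapshot instead of re-processing the prompt
    llama_state_seq_set_data(ctx, cp.data.data(), cp.data.size(), seq);
}
```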

b6684

03 Oct 16:55
606a73f
metal : fix loop bound in ggml_mem_ranges (#16412)

b6683

03 Oct 12:59
946f71e
llama : fix shapes for bert/mpt q/k norm (#16409)

b6682

03 Oct 12:13
638d330
ggml : fix graph reallocation with multiple chunks (#16396)

Reallocation is needed if a single chunk grows in size, even if the total
allocation size stays the same or decreases (see the sketch below).
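
A minimal illustration of that per-chunk condition: comparing only the total misses the case where one chunk grows while another shrinks, so each chunk's required size must be checked individually. Names are illustrative, not ggml-alloc's internals.

```cpp
#include <cstddef>
#include <vector>

struct chunk_sketch {
    size_t allocated; // bytes currently backing this chunk
    size_t required;  // bytes the new graph needs in this chunk
};

static bool needs_realloc(const std::vector<chunk_sketch> & chunks) {
    for (const chunk_sketch & c : chunks) {
        if (c.required > c.allocated) {
            return true; // this chunk grew, even if the total did not
        }
    }
    return false;
}
```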

b6680

03 Oct 11:10
2aaf0a2
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#1…