Releases: ggml-org/llama.cpp

b6690

04 Oct 20:24
86df2c9
vulkan: use a more appropriate amount of threads when generating shad…

b6689

04 Oct 13:44
f392839
rpc : check src buffer when copying tensor (#16421)

Only the dst buffer is guaranteed to be an RPC buffer. Add a check for the src one.
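
A minimal C++ sketch of the kind of guard this implies: before forwarding a copy to the remote side, verify that the src buffer is actually an RPC buffer and bail out otherwise. The helper name `is_rpc_buffer` and its stub body are assumptions for illustration, not the code added in the PR.

```cpp
#include "ggml.h"          // ggml_tensor
#include "ggml-backend.h"  // ggml_backend_buffer_t

// assumed helper (stub body for this sketch): true if the buffer belongs to
// the RPC backend; real code would inspect the buffer's type/interface
static bool is_rpc_buffer(ggml_backend_buffer_t buffer) {
    (void) buffer;
    return false;
}

static bool rpc_cpy_tensor_sketch(const ggml_tensor * src, ggml_tensor * dst) {
    (void) dst;
    // the caller only guarantees that dst lives in an RPC buffer
    if (!is_rpc_buffer(src->buffer)) {
        // src belongs to another backend: report "not handled" so the
        // generic byte-by-byte copy path is taken instead of a remote copy
        return false;
    }
    // ... serialize and send the copy request to the server (omitted)
    return true;
}
```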

b6688

04 Oct 10:50
898acba
rpc : add support for multiple devices (#16276)

* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change the RPC protocol to include a device identifier where needed (a sketch follows this list).

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
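
A minimal sketch of what "include a device identifier" can look like on the wire; the struct and field names below are assumptions for illustration, not the protocol's actual message layout.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// hypothetical request header: messages that used to implicitly address
// "the" device now carry an explicit device index
struct rpc_req_sketch {
    uint32_t device; // index into the devices this endpoint exposes
    uint64_t size;   // e.g. requested size for a buffer allocation
};

// server side: keep one backend handle per exposed device, dispatch on the id
struct rpc_server_sketch {
    std::vector<int> backends; // stand-in for per-device backend handles

    int & select_backend(const rpc_req_sketch & req) {
        if (req.device >= backends.size()) {
            throw std::out_of_range("invalid device id in request");
        }
        return backends[req.device];
    }
};
```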

b6687

04 Oct 10:04
e29acf7
vulkan : incremental shader builds (#16341)

* vulkan (DRAFT): split shader generation by GLSL source file to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files that are used as "headers" to use the .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write the embedded shader .hpp/.cpp outputs when their contents change (sketched after this entry)

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* the early-out path did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
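
The "only write when they change" step is the standard trick for keeping a generated file's timestamp stable so its dependents (here, ggml-vulkan.cpp) are not rebuilt needlessly. A minimal standalone C++ sketch, not the project's actual generator code:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// write `contents` to `path` only if the file does not already hold exactly
// that data; skipping the write leaves the mtime untouched, so the build
// system sees no change and does not recompile dependents
static void write_if_changed(const std::string & path, const std::string & contents) {
    std::ifstream in(path, std::ios::binary);
    std::stringstream old;
    old << in.rdbuf();
    if (in.is_open() && old.str() == contents) {
        return; // identical content: leave the timestamp alone
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << contents;
}
```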

b6686

03 Oct 20:02
128d522
chat : support Magistral thinking (#16413)

* feat: added a dedicated Magistral chat format that preserves [THINK] spans and parses reasoning before tool calls (a sketch follows this entry)

* feat: added a new flow to the chat template test suite for Magistral
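
A minimal sketch of the span handling this describes, assuming Magistral wraps reasoning in literal [THINK]...[/THINK] markers; the real parser in llama.cpp's chat layer is more involved:

```cpp
#include <string>
#include <utility>

// split a model response into (reasoning, content): the reasoning span is
// extracted first so any tool-call text in the remainder is parsed after it
static std::pair<std::string, std::string> split_think(const std::string & msg) {
    const std::string open  = "[THINK]";
    const std::string close = "[/THINK]";
    const size_t b = msg.find(open);
    const size_t e = msg.find(close);
    if (b == std::string::npos || e == std::string::npos || e < b) {
        return {"", msg}; // no reasoning span: everything is content
    }
    std::string reasoning = msg.substr(b + open.size(), e - (b + open.size()));
    std::string content   = msg.substr(e + close.size());
    return {reasoning, content};
}
```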

b6685

03 Oct 19:55
f6dcda3
server : context checkpointing for hybrid and recurrent models (#16382)

* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to cover
hybrid/recurrent models such as Jamba and Granite (a sketch follows this entry)

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
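
A minimal sketch of per-sequence checkpointing on top of llama.cpp's public state API (`llama_state_seq_get_size` / `llama_state_seq_get_data` / `llama_state_seq_set_data`); the checkpoint struct here is illustrative, and the server's actual `ctx_checkpoint` logic adds policy around when to snapshot and which checkpoint to roll back to:

```cpp
#include <cstdint>
#include <vector>

#include "llama.h"

// one saved snapshot of a sequence's state (KV cache and/or recurrent state)
struct ctx_checkpoint_sketch {
    llama_pos            pos;  // number of tokens covered by the snapshot
    std::vector<uint8_t> data; // serialized per-sequence state
};

static ctx_checkpoint_sketch save_checkpoint(llama_context * ctx, llama_seq_id seq, llama_pos pos) {
    ctx_checkpoint_sketch cp;
    cp.pos = pos;
    cp.data.resize(llama_state_seq_get_size(ctx, seq));
    llama_state_seq_get_data(ctx, cp.data.data(), cp.data.size(), seq);
    return cp;
}

static void restore_checkpoint(llama_context * ctx, llama_seq_id seq, const ctx_checkpoint_sketch & cp) {
    // rewinds the sequence to the snapshot instead of re-processing the prompt
    llama_state_seq_set_data(ctx, cp.data.data(), cp.data.size(), seq);
}
```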

b6684

03 Oct 16:55
606a73f
metal : fix loop bound in ggml_mem_ranges (#16412)

b6683

03 Oct 12:59
946f71e
llama : fix shapes for bert/mpt q/k norm (#16409)

b6682

03 Oct 12:13
638d330
ggml : fix graph reallocation with multiple chunks (#16396)

Reallocation is needed if a single chunk grows in size, even if the total
allocation size stays the same or decreases (see the sketch below).
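
A minimal illustration of that per-chunk condition: comparing only the total misses the case where one chunk grows while another shrinks, so each chunk's required size must be checked individually. Names are illustrative, not ggml-alloc's internals.

```cpp
#include <cstddef>
#include <vector>

struct chunk_sketch {
    size_t allocated; // bytes currently backing this chunk
    size_t required;  // bytes the new graph needs in this chunk
};

static bool needs_realloc(const std::vector<chunk_sketch> & chunks) {
    for (const chunk_sketch & c : chunks) {
        if (c.required > c.allocated) {
            return true; // this chunk grew, even if the total did not
        }
    }
    return false;
}
```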

b6680

03 Oct 11:10
2aaf0a2
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#1…