Releases: ggml-org/llama.cpp
b6690
b6689
rpc : check src buffer when copying tensor (#16421)
Only the dst buffer is guaranteed to be an RPC buffer. Add a check for the src one as well.
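A minimal sketch of the idea behind this fix, with illustrative names rather than the actual ggml-rpc internals: before issuing a remote copy, verify that both buffers were allocated by the RPC backend, not just the destination.

```cpp
// Sketch only: `buffer`, `is_rpc_buffer` and `rpc_copy_tensor` are hypothetical
// stand-ins for the real ggml-rpc types and helpers.
#include <cstring>

struct buffer {
    const char * backend_name; // e.g. "RPC" or "CPU"
    void *       data;
    size_t       size;
};

// a buffer counts as an RPC buffer only if it was allocated by the RPC backend
static bool is_rpc_buffer(const buffer * buf) {
    return buf && std::strcmp(buf->backend_name, "RPC") == 0;
}

// previously only dst was validated; the fix also checks src, which may live
// on another backend, and falls back to a generic copy path in that case
static bool rpc_copy_tensor(const buffer * src, buffer * dst) {
    if (!is_rpc_buffer(dst)) {
        return false; // dst must be an RPC buffer
    }
    if (!is_rpc_buffer(src)) {
        return false; // new check: src is not guaranteed to be an RPC buffer
    }
    // ... issue the remote copy ...
    return true;
}
```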
b6688
rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices. Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210
* fixes
* use ggml_backend_reg_t
* address review comments
* fix llama-bench backend report
* address review comments, change device naming
* fix cmd order
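To illustrate the protocol change described above, here is a rough sketch of how a request can carry a device index so one rpc-server endpoint can serve several devices. The struct and field names are hypothetical, not the actual wire format.

```cpp
// Illustrative only: not the real ggml-rpc message layout.
#include <cstdint>
#include <vector>

struct rpc_msg_alloc_buffer {
    uint32_t device; // new field: which device on the server to target
    uint64_t size;   // requested buffer size in bytes
};

struct rpc_server {
    std::vector<void *> devices; // opaque backend handles, one per exposed local device

    bool handle_alloc(const rpc_msg_alloc_buffer & msg) {
        if (msg.device >= devices.size()) {
            return false; // unknown device id -> reject the request
        }
        // ... allocate msg.size bytes on devices[msg.device] and reply ...
        return true;
    }
};
```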
b6687
vulkan : incremental shader builds (#16341)
* vulkan: split shader generation by GLSL source file, to improve incremental build times
* support dep-files so shaders are recompiled if their included files change
* rename shader files which are used as "headers" to use the .glsl extension
* move glslc extension detection shaders to separate folders, to prevent them from getting glob'd with the actual compute shaders that need to be compiled
* vulkan : only write embedded shader .hpp/.cpp when they change
* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass a single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier
* fix hang in vulkan-shaders-gen when there are compilation errors (the early out did not decrement compile_count)
* clean up
* fix glslc integer dot product test
* unconditionally write the embedded shader cpp output
* replace output filepath in generated dep-files to match output in CMakeLists
Co-authored-by: Jeff Bolz <[email protected]>
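As a sketch of the dep-file mechanism mentioned in this entry (paths and naming are illustrative, not the generator's actual output): after compiling a shader, the generator can emit a Makefile-style `.d` file listing every GLSL source it pulled in, so the build system re-runs it only when one of those files changes.

```cpp
// Hypothetical helper showing the dep-file format, not vulkan-shaders-gen itself.
#include <fstream>
#include <string>
#include <vector>

void write_dep_file(const std::string & dep_path,
                    const std::string & output,              // generated artifact, e.g. embedded .hpp
                    const std::vector<std::string> & deps) { // main .comp plus included .glsl files
    std::ofstream out(dep_path);
    out << output << ":";
    for (const auto & d : deps) {
        out << " \\\n  " << d; // one dependency per line, with Makefile line continuation
    }
    out << "\n";
}
```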
b6686
chat : support Magistral thinking (#16413)
* feat: add a dedicated Magistral chat format that preserves [THINK] spans and parses reasoning before tool calls
* feat: add a new flow in the chat template test suite for Magistral
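A minimal sketch of the parsing idea, assuming the model wraps its reasoning in a `[THINK]...[/THINK]` span: the reasoning is extracted first, and the remainder of the message is what gets inspected for tool calls. This is an illustration, not the server's actual parser.

```cpp
// Hypothetical helper: splits a message into (reasoning, remaining content).
#include <string>
#include <utility>

static std::pair<std::string, std::string> split_think(const std::string & msg) {
    const std::string open  = "[THINK]";
    const std::string close = "[/THINK]";

    const size_t b = msg.find(open);
    if (b == std::string::npos) {
        return {"", msg}; // no thinking span: everything is regular content
    }
    const size_t e = msg.find(close, b + open.size());
    if (e == std::string::npos) {
        return {"", msg}; // unterminated span: leave the message untouched
    }

    std::string reasoning = msg.substr(b + open.size(), e - (b + open.size()));
    std::string content   = msg.substr(0, b) + msg.substr(e + close.size());
    return {reasoning, content}; // content is what would be parsed for tool calls
}
```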
b6685
server : context checkpointing for hybrid and recurrent models (#16382)
* generalize `swa_checkpoint` to `ctx_checkpoint`: this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba and Granite
* disable debug prints
* keep backwards compatibility with `--swa-checkpoints`
* update the prompt re-processing message
* fix an off-by-one error
* keep the `seq_rm` log
* server : fix checkpoint logic to support recurrent caches
* server : cleanup and fixes
Co-authored-by: Georgi Gerganov <[email protected]>
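A rough sketch of the checkpointing idea, with hypothetical names: the server periodically snapshots a sequence's context state (SWA KV cells, or the recurrent state for hybrid/recurrent models) so that when a new prompt only partially matches the cache, it can roll back to the nearest usable checkpoint instead of reprocessing the whole prompt.

```cpp
// Illustrative sketch only; not the actual llama-server data structures.
#include <cstdint>
#include <vector>

struct ctx_checkpoint {
    int32_t              pos_min; // first position covered by this snapshot
    int32_t              pos_max; // last position covered by this snapshot
    std::vector<uint8_t> data;    // opaque serialized sequence state
};

// pick the most recent checkpoint that is still valid given that the new
// prompt matches the cached tokens up to n_match positions
static const ctx_checkpoint * best_checkpoint(
        const std::vector<ctx_checkpoint> & checkpoints, int32_t n_match) {
    const ctx_checkpoint * best = nullptr;
    for (const auto & c : checkpoints) {
        if (c.pos_max <= n_match && (best == nullptr || c.pos_max > best->pos_max)) {
            best = &c;
        }
    }
    return best; // nullptr -> no usable checkpoint, reprocess the prompt from the start
}
```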
b6684
metal : fix loop bound in ggml_mem_ranges (#16412)
b6683
llama : fix shapes for bert/mpt q/k norm (#16409)
b6682
ggml : fix graph reallocation with multiple chunks (#16396)
Reallocation is needed if a single chunk grows in size, even if the total allocation size stays the same or shrinks.
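A small sketch of the condition described above: with multiple chunks, comparing only the summed sizes is not enough, because one chunk can grow while another shrinks. The names are illustrative, not the ggml allocator's internals.

```cpp
// Hypothetical per-chunk check: any single chunk outgrowing its reservation
// forces a reallocation, regardless of the total.
#include <cstddef>
#include <vector>

struct chunk {
    size_t allocated; // bytes currently reserved for this chunk
    size_t needed;    // bytes required by the new graph
};

static bool needs_realloc(const std::vector<chunk> & chunks) {
    for (const auto & c : chunks) {
        if (c.needed > c.allocated) {
            return true;  // this chunk grew, even if others shrank
        }
    }
    return false;         // the total may even be lower, but no chunk grew
}
```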
b6680
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#1…