Releases: ggml-org/llama.cpp

b5272

03 May 23:40
3e959f0
imatrix: fix oob writes if src1 is not contiguous (#13286)

b5271

03 May 18:54
36667c8
clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking c…

b5270

03 May 16:31
3bf785f
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843)

b5269

02 May 19:23
1d36b36
llama : move end-user examples to tools directory (#13249)


Co-authored-by: Xuan Son Nguyen <[email protected]>

b5267

02 May 18:58
a75cb30
context : fix reorder logic (#13267)

b5266

02 May 18:36
3f3769b
ggml : Enable MMA for BF16 in llamafile_sgemm (#13148)

This patch upstreams llamafile's CPU matrix-multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.

This change yields 9x-40x gains in total speed S t/s (i.e. all tokens / total time) across the various batch sizes tested with the llama-batched-bench benchmark.

The patch was tested with the Meta-Llama-3-8B and Mistral-7B models (BF16 models generated with llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine.

Signed-off-by: Shalini Salomi Bodapati <[email protected]>

b5265

02 May 17:50
2f56761
llama-model : support Qwen2 embedding models and pooling_mode_lasttok…

b5261

02 May 13:17
cb06a3c
llama : orion rope type is neox (#13261)

b5260

02 May 12:37
626083f
llama : plamo rope type is neox (#13260)

b5259

02 May 09:51
2af6880
llama-chat : reset glmedge chat template (#13253)

* reset glmedge chat template
* fix glmedge chat template