Conversation

ggerganov (Member) commented on Aug 26, 2025

Print stats for compute buffers and graph nodes also for memory-less contexts (such as those used by embedding models); previously these stats were skipped when the context had no memory module.
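
For reference, a minimal sketch of the guard change involved (based on the discussion below; the surrounding llama-context.cpp code is paraphrased, not verbatim):

// Hedged sketch: the path that reserves the worst-case compute graphs and
// prints buffer sizes / graph node counts was previously gated on the
// context owning a memory module (KV cache), so embedding models never
// reached it.
//
// before:
//   if (!hparams.vocab_only && memory) { /* reserve graphs, print stats */ }
//
// after:
if (!hparams.vocab_only) {
    // reserve worst-case graphs and print compute buffer / graph stats,
    // even when there is no memory module (e.g. embedding models)
}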

Example:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -p "test" -c 512 -b 512 -ub 512
0.00.379.367 I llama_context: constructing llama_context
0.00.379.370 I llama_context: n_seq_max     = 1
0.00.379.370 I llama_context: n_ctx         = 512
0.00.379.370 I llama_context: n_ctx_per_seq = 512
0.00.379.370 I llama_context: n_batch       = 512
0.00.379.370 I llama_context: n_ubatch      = 512
0.00.379.371 I llama_context: causal_attn   = 0
0.00.379.371 I llama_context: flash_attn    = 0
0.00.379.371 I llama_context: kv_unified    = true
0.00.379.371 I llama_context: freq_base     = 10000.0
0.00.379.372 I llama_context: freq_scale    = 1
0.00.379.372 I ggml_metal_init: allocating
0.00.379.401 I ggml_metal_init: found device: Apple M2 Ultra
0.00.379.404 I ggml_metal_init: picking default device: Apple M2 Ultra
0.00.379.997 I ggml_metal_load_library: using embedded metal library
0.00.384.503 I ggml_metal_init: GPU name:   Apple M2 Ultra
0.00.384.507 I ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
0.00.384.508 I ggml_metal_init: simdgroup reduction   = true
0.00.384.509 I ggml_metal_init: simdgroup matrix mul. = true
0.00.384.509 I ggml_metal_init: has residency sets    = true
0.00.384.509 I ggml_metal_init: has bfloat            = true
0.00.384.509 I ggml_metal_init: use bfloat            = true
0.00.384.510 I ggml_metal_init: hasUnifiedMemory      = true
0.00.384.511 I ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
0.00.405.574 I llama_context:        CPU  output buffer size =     0.12 MiB
0.00.406.905 I llama_context:      Metal compute buffer size =    16.75 MiB
0.00.406.909 I llama_context:        CPU compute buffer size =     2.51 MiB
0.00.406.909 I llama_context: graph nodes  = 431
0.00.406.910 I llama_context: graph splits = 2

ggerganov requested a review from danbev on Aug 26, 2025 09:12
ggerganov merged commit 85cc1ae into master on Aug 26, 2025 (53 of 56 checks passed)
ggerganov deleted the gg/context-print-stats-no-mem branch on Aug 26, 2025 09:47
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request on Aug 27, 2025
LostRuins (Collaborator) commented:

Hello, just wanted to point out an observation: after this PR, the GPU memory usage of the embedding model bge-m3-q8_0.gguf has increased by about 4 GB when used with an 8K context.

It does not seem to affect other embedding models, and BGE worked fine previously.

The llama-context.cpp file has been modified many times since, but this PR is where the regression first started.

It's still working fine, and by reverting

    if (!hparams.vocab_only) {

back to

    if (!hparams.vocab_only && memory) {

I seem to be able to generate my embeddings just fine without the extra memory overhead. So I am guessing there are some unnecessary allocations when it comes to BGE?
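
For scale, here is a rough back-of-the-envelope that lines up with the reported ~4 GB. The assumptions are mine and not confirmed in the thread: bge-m3 follows XLM-RoBERTa-large with 16 attention heads, the attention scores are materialized in f32, and the worst-case graph is now reserved with a ubatch spanning the full 8K context.

#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t n_ctx  = 8192; // context size from the report
    const std::size_t n_head = 16;   // XLM-RoBERTa-large heads (assumed for bge-m3)

    // one f32 attention-score tensor of shape [n_head, n_ctx, n_ctx]
    const std::size_t bytes = n_head * n_ctx * n_ctx * sizeof(float);

    std::printf("attention scores: %.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    // prints: attention scores: 4.00 GiB
    return 0;
}

If that is the mechanism, reserving the worst-case graph regardless of the memory module would explain why only the 8K-context BGE setup shows the jump.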

ggerganov (Member, Author) replied:

@LostRuins This issue will be addressed after we accomplish #16148
