Conversation

ggerganov (Member) commented on Aug 26, 2025

Print stats for compute buffers and graph nodes also for memory-less contexts (such as those used by embedding models); previously these stats were skipped when the context had no memory module.
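
For reference, a minimal sketch of the guard change involved (based on the discussion below; the surrounding llama-context.cpp code is paraphrased, not verbatim):

// Hedged sketch: the path that reserves the worst-case compute graphs and
// prints buffer sizes / graph node counts was previously gated on the
// context owning a memory module (KV cache), so embedding models never
// reached it.
//
// before:
//   if (!hparams.vocab_only && memory) { /* reserve graphs, print stats */ }
//
// after:
if (!hparams.vocab_only) {
    // reserve worst-case graphs and print compute buffer / graph stats,
    // even when there is no memory module (e.g. embedding models)
}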

Example:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -p "test" -c 512 -b 512 -ub 512
0.00.379.367 I llama_context: constructing llama_context
0.00.379.370 I llama_context: n_seq_max     = 1
0.00.379.370 I llama_context: n_ctx         = 512
0.00.379.370 I llama_context: n_ctx_per_seq = 512
0.00.379.370 I llama_context: n_batch       = 512
0.00.379.370 I llama_context: n_ubatch      = 512
0.00.379.371 I llama_context: causal_attn   = 0
0.00.379.371 I llama_context: flash_attn    = 0
0.00.379.371 I llama_context: kv_unified    = true
0.00.379.371 I llama_context: freq_base     = 10000.0
0.00.379.372 I llama_context: freq_scale    = 1
0.00.379.372 I ggml_metal_init: allocating
0.00.379.401 I ggml_metal_init: found device: Apple M2 Ultra
0.00.379.404 I ggml_metal_init: picking default device: Apple M2 Ultra
0.00.379.997 I ggml_metal_load_library: using embedded metal library
0.00.384.503 I ggml_metal_init: GPU name:   Apple M2 Ultra
0.00.384.507 I ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
0.00.384.508 I ggml_metal_init: simdgroup reduction   = true
0.00.384.509 I ggml_metal_init: simdgroup matrix mul. = true
0.00.384.509 I ggml_metal_init: has residency sets    = true
0.00.384.509 I ggml_metal_init: has bfloat            = true
0.00.384.509 I ggml_metal_init: use bfloat            = true
0.00.384.510 I ggml_metal_init: hasUnifiedMemory      = true
0.00.384.511 I ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
0.00.405.574 I llama_context:        CPU  output buffer size =     0.12 MiB
0.00.406.905 I llama_context:      Metal compute buffer size =    16.75 MiB
0.00.406.909 I llama_context:        CPU compute buffer size =     2.51 MiB
0.00.406.909 I llama_context: graph nodes  = 431
0.00.406.910 I llama_context: graph splits = 2

ggerganov requested a review from danbev on Aug 26, 2025 09:12
ggerganov merged commit 85cc1ae into master on Aug 26, 2025 (53 of 56 checks passed)
ggerganov deleted the gg/context-print-stats-no-mem branch on Aug 26, 2025 09:47
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request on Aug 27, 2025
LostRuins (Collaborator) commented:

Hello, just wanted to point out an observation: after this PR, the GPU memory usage of the embedding model bge-m3-q8_0.gguf has increased by about 4 GB when used with an 8K context.

It does not seem to affect other embedding models, and BGE worked fine previously.

The llama-context.cpp file has been modified many times since, but this PR is where the regression first started.

It's still working fine, and by reverting

    if (!hparams.vocab_only) {

back to

    if (!hparams.vocab_only && memory) {

I seem to be able to generate my embeddings just fine without the extra memory overhead. So I am guessing there are some unnecessary allocations when it comes to BGE?
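
For scale, here is a rough back-of-the-envelope that lines up with the reported ~4 GB. The assumptions are mine and not confirmed in the thread: bge-m3 follows XLM-RoBERTa-large with 16 attention heads, the attention scores are materialized in f32, and the worst-case graph is now reserved with a ubatch spanning the full 8K context.

#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t n_ctx  = 8192; // context size from the report
    const std::size_t n_head = 16;   // XLM-RoBERTa-large heads (assumed for bge-m3)

    // one f32 attention-score tensor of shape [n_head, n_ctx, n_ctx]
    const std::size_t bytes = n_head * n_ctx * n_ctx * sizeof(float);

    std::printf("attention scores: %.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    // prints: attention scores: 4.00 GiB
    return 0;
}

If that is the mechanism, reserving the worst-case graph regardless of the memory module would explain why only the 8K-context BGE setup shows the jump.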

ggerganov (Member, Author) replied:

@LostRuins This issue will be addressed after we accomplish #16148
