Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I would like to obtain an embedding vector for larger texts, e.g. 4K, 8K tokens or more.
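For context, the workaround I would expect to fall back on is chunking: tokenize the text, evaluate it in pieces small enough for the scratch buffers, and mean-pool the per-chunk embeddings. Below is a rough, untested sketch against the C API as of this build (llama_load_model_from_file, llama_eval, llama_get_embeddings); the chunk size, thread count, and error handling are placeholder assumptions.

// Untested sketch: chunked embeddings with mean pooling.
// Assumes the llama.cpp C API around build 1010; the chunk size (256)
// and thread count (4) are arbitrary placeholder values.
#include "llama.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <text>\n", argv[0]);
        return 1;
    }

    llama_backend_init(false);

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx     = 2048;
    cparams.embedding = true; // we want embeddings, not logits

    struct llama_model   *model = llama_load_model_from_file(argv[1], cparams);
    struct llama_context *ctx   = llama_new_context_with_model(model, cparams);

    // tokenize the whole input once (token count <= byte count)
    const char *text = argv[2];
    int cap = (int) strlen(text) + 8;
    llama_token *tokens = malloc(cap * sizeof(llama_token));
    int n_tokens = llama_tokenize(ctx, text, tokens, cap, true);

    const int n_embd = llama_n_embd(ctx);
    float *acc = calloc(n_embd, sizeof(float));

    // evaluate in chunks small enough for the eval scratch buffers,
    // then average the per-chunk embedding vectors
    const int chunk = 256;
    int n_chunks = 0;
    for (int i = 0; i < n_tokens; i += chunk) {
        int n = n_tokens - i < chunk ? n_tokens - i : chunk;
        if (llama_eval(ctx, tokens + i, n, 0, 4) != 0) {
            fprintf(stderr, "llama_eval failed\n");
            return 1;
        }
        const float *emb = llama_get_embeddings(ctx);
        for (int j = 0; j < n_embd; j++) acc[j] += emb[j];
        n_chunks++;
    }

    for (int j = 0; j < n_embd; j++) printf("%f ", acc[j] / n_chunks);
    printf("\n");

    free(acc);
    free(tokens);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}

Mean pooling loses word-order information across chunks, but it at least produces a fixed-size vector for arbitrarily long inputs. It would still be nicer if the embedding example handled long inputs directly.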
Current Behavior
I get an error:
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)
When I create a text file with 197 lines of "Hello World", like:
Hello World
Hello World
...
I get the embedding vector as expected.
However, when I add just one more line, I receive the error "not enough space in the context's memory pool".
Yet my RAM/VRAM utilization is under 15%!
I know there are many issues related to this error, but I haven't found any solution for embeddings.
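As far as I can tell, the message comes from ggml's fixed-size context/scratch arena rather than from actual RAM/VRAM pressure, which would explain the low utilization above. Here is a minimal standalone sketch (hypothetical sizes, built against ggml directly, not the actual llama.cpp code path) that triggers the same message by over-allocating a tiny pool:

// Sketch: reproduce the "not enough space in the context's memory
// pool" message by asking a deliberately tiny ggml context for more
// tensor data than its fixed arena holds. Sizes are arbitrary.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024, // tiny fixed pool
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context *ctx = ggml_init(params);

    // 1M floats cannot fit in a 16 KB pool: ggml prints
    // "ggml_new_object: not enough space in the context's memory pool"
    // and fails, no matter how much system memory is free
    struct ggml_tensor *t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024 * 1024);
    (void) t;

    ggml_free(ctx);
    return 0;
}

If that is indeed the mechanism, the fix would presumably have to be in how the eval buffers are sized for long prompts, not anything on the user side.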
Environment and Context
- Physical (or virtual) hardware you are using, e.g. for Linux:
- CPU i7-9700K @ 3.60GHz
- GPU RTX 3090 TI VRAM 24 GB
- RAM 80 GB
- Operating System, e.g. for Linux:
- Windows 11
Failure Information (for bugs)
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)
Steps to Reproduce
- Create a text file named "text-of-2367-bytes.txt" containing at least 198 lines of "Hello World" (a small generator sketch follows these steps).
- .\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt
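For completeness, a throwaway generator for the input file; any way of producing 198 identical lines works, and the byte count in the name reflects my original file's line endings.

// Throwaway helper: write 198 lines of "Hello World" to the test file.
#include <stdio.h>

int main(void) {
    FILE *f = fopen("text-of-2367-bytes.txt", "w");
    if (!f) return 1;
    for (int i = 0; i < 198; i++) {
        fprintf(f, "Hello World\n");
    }
    fclose(f);
    return 0;
}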
Failure Logs
Example run with the Windows embedding command:
.\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt
main: build = 1010 (cb1c072)
main: seed = 1692704725
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
llama.cpp: loading model from .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 582.00 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9493 MB
llama_new_context_with_model: kv self size = 1600.00 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)