Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I would like to obtain an embedding vector for larger texts, e.g. 4K, 8K tokens or more.
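For context, the workaround I would expect to fall back on is chunking: tokenize the text, evaluate it in pieces small enough for the scratch buffers, and mean-pool the per-chunk embeddings. Below is a rough, untested sketch against the C API as of this build (llama_load_model_from_file, llama_eval, llama_get_embeddings); the chunk size, thread count, and error handling are placeholder assumptions.

// Untested sketch: chunked embeddings with mean pooling.
// Assumes the llama.cpp C API around build 1010; the chunk size (256)
// and thread count (4) are arbitrary placeholder values.
#include "llama.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <text>\n", argv[0]);
        return 1;
    }

    llama_backend_init(false);

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx     = 2048;
    cparams.embedding = true; // we want embeddings, not logits

    struct llama_model   *model = llama_load_model_from_file(argv[1], cparams);
    struct llama_context *ctx   = llama_new_context_with_model(model, cparams);

    // tokenize the whole input once (token count <= byte count)
    const char *text = argv[2];
    int cap = (int) strlen(text) + 8;
    llama_token *tokens = malloc(cap * sizeof(llama_token));
    int n_tokens = llama_tokenize(ctx, text, tokens, cap, true);

    const int n_embd = llama_n_embd(ctx);
    float *acc = calloc(n_embd, sizeof(float));

    // evaluate in chunks small enough for the eval scratch buffers,
    // then average the per-chunk embedding vectors
    const int chunk = 256;
    int n_chunks = 0;
    for (int i = 0; i < n_tokens; i += chunk) {
        int n = n_tokens - i < chunk ? n_tokens - i : chunk;
        if (llama_eval(ctx, tokens + i, n, 0, 4) != 0) {
            fprintf(stderr, "llama_eval failed\n");
            return 1;
        }
        const float *emb = llama_get_embeddings(ctx);
        for (int j = 0; j < n_embd; j++) acc[j] += emb[j];
        n_chunks++;
    }

    for (int j = 0; j < n_embd; j++) printf("%f ", acc[j] / n_chunks);
    printf("\n");

    free(acc);
    free(tokens);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}

Mean pooling loses word-order information across chunks, but it at least produces a fixed-size vector for arbitrarily long inputs. It would still be nicer if the embedding example handled long inputs directly.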
Current Behavior
I get an error:
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)
When I create a text file with 197 lines of "Hello World", like:
Hello World
Hello World
...
I get the embedding vector as expected.
However, when I add just one more line, I receive the error "not enough space in the context's memory pool".
Yet my RAM/VRAM utilization is under 15%!
I know there are many issues related to this error, but I haven't found any solution for embeddings.
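As far as I can tell, the message comes from ggml's fixed-size context/scratch arena rather than from actual RAM/VRAM pressure, which would explain the low utilization above. Here is a minimal standalone sketch (hypothetical sizes, built against ggml directly, not the actual llama.cpp code path) that triggers the same message by over-allocating a tiny pool:

// Sketch: reproduce the "not enough space in the context's memory
// pool" message by asking a deliberately tiny ggml context for more
// tensor data than its fixed arena holds. Sizes are arbitrary.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024, // tiny fixed pool
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context *ctx = ggml_init(params);

    // 1M floats cannot fit in a 16 KB pool: ggml prints
    // "ggml_new_object: not enough space in the context's memory pool"
    // and fails, no matter how much system memory is free
    struct ggml_tensor *t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024 * 1024);
    (void) t;

    ggml_free(ctx);
    return 0;
}

If that is indeed the mechanism, the fix would presumably have to be in how the eval buffers are sized for long prompts, not anything on the user side.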
Environment and Context
- Physical (or virtual) hardware you are using, e.g. for Linux:
- CPU i7-9700K @ 3.60GHz
- GPU RTX 3090 TI VRAM 24 GB
- RAM 80 GB
- Operating System, e.g. for Linux:
- Windows 11
Failure Information (for bugs)
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)
Steps to Reproduce
- Create a text file named "text-of-2367-bytes.txt" containing at least 198 lines of "Hello World" (a small generator sketch follows these steps).
- .\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt
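For completeness, a throwaway generator for the input file; any way of producing 198 identical lines works, and the byte count in the name reflects my original file's line endings.

// Throwaway helper: write 198 lines of "Hello World" to the test file.
#include <stdio.h>

int main(void) {
    FILE *f = fopen("text-of-2367-bytes.txt", "w");
    if (!f) return 1;
    for (int i = 0; i < 198; i++) {
        fprintf(f, "Hello World\n");
    }
    fclose(f);
    return 0;
}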
Failure Logs
Example run with the Windows embedding command:
.\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt
main: build = 1010 (cb1c072)
main: seed = 1692704725
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
llama.cpp: loading model from .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 582.00 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9493 MB
llama_new_context_with_model: kv self size = 1600.00 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)