I'm using textgen with the llama.cpp Python bindings, built fresh from source today.
Loading camel-platypus2-70b (Q4_K_M).
Inference works normally at small context sizes, and speed is about the same as exllama_hf.
Unfortunately, when using a prompt of ~1848 tokens with the 4096-context model, I get this error followed by a segfault.
System is 2x RTX 3090, 1x P40, and a 16-core Xeon v4 with 256 GB of RAM.
I have set a tensor split of 40/40, set GQA correctly, and offloaded all layers to the GPU. I have tried with and without the new kernel and mlock. Batch size is set to the max, but lowering it has not made much difference.
About all that's left to try is turning off mmap or splitting across all 3 GPUs, but I think this might be a bug. The same inference is no problem with GPTQ. A rough sketch of the load settings is below.
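
For reference, this is roughly how the model is being loaded. A minimal sketch against llama-cpp-python's Llama constructor; the model path and numeric values are placeholders, and some parameter names (e.g. n_gqa, mul_mat_q) may differ depending on the bindings version you built.

```python
# Approximate reproduction of the load settings described above.
# Path and values are placeholders, not the exact ones used.
from llama_cpp import Llama

llm = Llama(
    model_path="models/camel-platypus2-70b.q4_K_M.bin",  # placeholder path
    n_ctx=4096,              # 4096-context model
    n_gpu_layers=200,        # offload all layers to GPU
    tensor_split=[40, 40],   # 40/40 split across the two 3090s
    n_gqa=8,                 # GQA setting for the 70B model
    n_batch=512,             # set to the max in the UI (placeholder value)
    mul_mat_q=True,          # "new kernel" toggle; also tried False
    use_mlock=True,          # also tried False
    use_mmap=True,           # turning this off is one remaining workaround
)

# Short prompts are fine; a ~1848-token prompt triggers the error and segfault.
out = llm("<long prompt here>", max_tokens=200)
```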