ggml_new_tensor_impl: not enough space in the context's memory pool #2691

@Ph0rk0z

Description

I'm using textgen with the llama.cpp Python bindings, built fresh from source today.

Loading camel-platypus2-70b (q4_K_M).
Inference works normally with a small context; speed is about the same as exllama_hf.

Unfortunately, with a prompt of 1848 tokens on the 4096-context model, I get this error followed by a segfault.

System is 2x RTX 3090, 1x P40, and a 16-core Xeon v4 with 256 GB of RAM.

I have set a tensor split of 40/40, set GQA correctly, and offloaded all layers to GPU. I have tried with and without the new kernels, and with and without mlock. Batch size is set to the max, but lowering it has not made much difference.
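For reference, roughly how these settings map onto the llama-cpp-python loader. This is a sketch, not my exact config: the model path, layer count, and parameter values are approximations of what I described above.

```python
from llama_cpp import Llama

# Sketch of my loader settings (path and exact values are illustrative).
llm = Llama(
    model_path="./models/camel-platypus2-70b.q4_K_M.bin",  # illustrative path
    n_ctx=4096,                # the model's 4096 context
    n_gqa=8,                   # GQA setting required for LLaMA-2 70B
    n_gpu_layers=100,          # large enough to offload all layers
    tensor_split=[40, 40, 0],  # 40/40 across the two 3090s, nothing on the P40
    n_batch=512,               # max batch; lowering it made little difference
    use_mmap=True,             # turning this off is one of the workarounds
    use_mlock=False,           # tried both with and without mlock
)

# Short prompts work; a ~1848-token prompt triggers the error and segfault.
out = llm("...", max_tokens=64)
```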

About all I can do is turn off mmap or split across all 3 GPUs, but I think this might be a bug. The same inference is no problem with GPTQ.
