I'm using textgen with the llama.cpp Python bindings, built fresh from source today.
Loading camel-platypus2-70b (Q4_K_M).
Inference works normally at small context sizes, and speed is about the same as exllama_hf.
Unfortunately, when using a prompt of ~1848 tokens with the 4096-context model, I get this error followed by a segfault.
System is 2x RTX 3090, 1x P40, and a 16-core Xeon v4 with 256 GB of RAM.
I have set a tensor split of 40/40, set GQA correctly, and offloaded all layers to the GPU. I have tried with and without the new kernel and mlock. Batch size is set to the max, but lowering it has not made much difference.
About all that's left to try is turning off mmap or splitting across all 3 GPUs, but I think this might be a bug. The same inference is no problem with GPTQ. A rough sketch of the load settings is below.
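
For reference, this is roughly how the model is being loaded. A minimal sketch against llama-cpp-python's Llama constructor; the model path and numeric values are placeholders, and some parameter names (e.g. n_gqa, mul_mat_q) may differ depending on the bindings version you built.

```python
# Approximate reproduction of the load settings described above.
# Path and values are placeholders, not the exact ones used.
from llama_cpp import Llama

llm = Llama(
    model_path="models/camel-platypus2-70b.q4_K_M.bin",  # placeholder path
    n_ctx=4096,              # 4096-context model
    n_gpu_layers=200,        # offload all layers to GPU
    tensor_split=[40, 40],   # 40/40 split across the two 3090s
    n_gqa=8,                 # GQA setting for the 70B model
    n_batch=512,             # set to the max in the UI (placeholder value)
    mul_mat_q=True,          # "new kernel" toggle; also tried False
    use_mlock=True,          # also tried False
    use_mmap=True,           # turning this off is one remaining workaround
)

# Short prompts are fine; a ~1848-token prompt triggers the error and segfault.
out = llm("<long prompt here>", max_tokens=200)
```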