-
That is to be expected. The computation for a single token has a part with constant runtime (weights) and a part with runtime proportional to the context depth (attention). So as the context fills up, more and more computation needs to be done for each token and the speed decreases. Not using …
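As a rough back-of-the-envelope model (illustrative only, not measured numbers from this thread), the per-token generation time splits into those two parts:

t_token ≈ t_weights + n_ctx * t_attn
tokens/s ≈ 1 / (t_weights + n_ctx * t_attn)

where t_weights is the fixed cost of running the weights and t_attn is the per-context-token cost of attention. Each query appends to the conversation, so n_ctx grows, the denominator grows, and the reported tokens/s falls, which is consistent with the 20 → 18 → 16 tps pattern described in the question.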
-
I have been unable to load the model fully without using -nkvo and -fa. I get a malloc error with a Qwen model whose context size is 40960. llama-server will load the model with up to about -c 16384 but says the entire model will not be used. I am loading it across two 16 GB GPUs, and I guess the KV cache ends up in CPU RAM. I am new to this. Is there a way to get it to load entirely on the GPUs to solve it? Another command-line switch/config? Thanks
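Not an answer from the thread, just a sketch of the kind of invocation being asked about, assuming a reasonably recent llama.cpp build (flag names can change between versions, so check llama-server --help on yours). Quantizing the KV cache and splitting the weights across both GPUs is the usual way to trade memory for full-GPU residency; whether a 40960-token context actually fits in 2×16 GB still depends on the model:

llama-server -m model.gguf -ngl 99 -ts 1,1 -fa -c 40960 --cache-type-k q8_0 --cache-type-v q8_0

Here -ngl 99 offloads all layers to the GPUs, -ts 1,1 splits the weights evenly across the two cards, and the q8_0 cache types roughly halve KV-cache memory compared to the default f16 (quantized V cache requires -fa).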
-
Hi,
after running llama-server -m model.gguf -nkvo -fa
I can get the model I want to load fully, but with every query the tokens/s
slowly diminish:
1st query 20 tps, 2nd query 18 tps, 3rd query 16 tps, and so on.
Any idea how to fix it? Is this normal?
Thanks