-
That is to be expected. The computation for a single token has a part with constant runtime (weights) and a part with runtime proportional to the context depth (attention). So as the context fills up, more and more computation needs to be done for each token and the speed decreases. Not using …
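As a rough back-of-the-envelope model (illustrative only, not measured numbers from this thread), the per-token generation time splits into those two parts:

t_token ≈ t_weights + n_ctx * t_attn
tokens/s ≈ 1 / (t_weights + n_ctx * t_attn)

where t_weights is the fixed cost of running the weights and t_attn is the per-context-token cost of attention. Each query appends to the conversation, so n_ctx grows, the denominator grows, and the reported tokens/s falls, which is consistent with the 20 → 18 → 16 tps pattern described in the question.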
-
I have been unable to load the model fully without using -nkvo and -fa. I get a malloc error with a Qwen model whose context size is 40960. llama-server will load the model with up to about -c 16384 but says the entire model will not be used. I am loading it across two 16 GB GPUs, and I guess the KV cache ends up in CPU RAM. I am new to this. Is there a way to get it to load entirely on the GPUs to solve it? Another command-line switch/config? Thanks
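Not an answer from the thread, just a sketch of the kind of invocation being asked about, assuming a reasonably recent llama.cpp build (flag names can change between versions, so check llama-server --help on yours). Quantizing the KV cache and splitting the weights across both GPUs is the usual way to trade memory for full-GPU residency; whether a 40960-token context actually fits in 2×16 GB still depends on the model:

llama-server -m model.gguf -ngl 99 -ts 1,1 -fa -c 40960 --cache-type-k q8_0 --cache-type-v q8_0

Here -ngl 99 offloads all layers to the GPUs, -ts 1,1 splits the weights evenly across the two cards, and the q8_0 cache types roughly halve KV-cache memory compared to the default f16 (quantized V cache requires -fa).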
-
Hi,
after running llama-server -m model.gguf -nkvo -fa
I can get the model I want to load fully, but with every query the tokens/s
slowly diminish:
1st query 20 tps, 2nd query 18 tps, 3rd query 16 tps, and so on.
Any idea how to fix it? Is this normal?
Thanks