Why max 64 parallel requests? #16069

Nico3012 · 2025-09-18T08:39:03Z

Hey,

Why does llama.cpp only allow 64 parallel requests?
If I set --parallel to e.g. 256, I get the error:

"llama_init_from_model: failed to initialize the context: n_seq_max must be <= 64"

What is the reason for this limit?

adhusch · 2025-09-24T14:49:23Z

This was actually changed in 4f81b33#commitcomment-159972791 :)

1 reply

Nico3012 Sep 26, 2025
Author

Thank you!
For me, it makes no sense to have this upper limit. Do you know why we need it? And is it possible to change LLAMA_MAX_PARALLEL_SEQUENCES, e.g. with an environment variable?

Would allowing many more parallel sequences have any downside, apart from slower generation and the context being split between sequences?

I'm using the prebuilt llama-server Docker container.
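
To make the environment-variable question concrete, here is a minimal sketch of how a limit like this typically behaves. It assumes LLAMA_MAX_PARALLEL_SEQUENCES is a preprocessor constant fixed when the binary is built, which the name suggests but this thread does not confirm. The helper check_n_seq_max and the constant k_max_seq are hypothetical names for illustration only, not llama.cpp API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Assumption: the cap is a compile-time macro. The value 64 is taken from the
// error message quoted in this thread; the real definition may differ.
#ifndef LLAMA_MAX_PARALLEL_SEQUENCES
#define LLAMA_MAX_PARALLEL_SEQUENCES 64
#endif

// Hypothetical constant mirroring the macro with an unsigned type.
constexpr uint32_t k_max_seq = LLAMA_MAX_PARALLEL_SEQUENCES;

// Hypothetical helper (not part of llama.cpp): rejects a requested number of
// parallel sequences that exceeds the build-time cap.
static bool check_n_seq_max(uint32_t n_seq_max) {
    if (n_seq_max > k_max_seq) {
        std::fprintf(stderr, "n_seq_max must be <= %u\n", k_max_seq);
        return false;
    }
    return true;
}
```

If the limit really is defined this way, a prebuilt binary (including one shipped in a Docker image) has the value baked in, so an environment variable read at run time could not raise it. The usual route would be to rebuild with a different definition, for example by passing -DLLAMA_MAX_PARALLEL_SEQUENCES=256 to the compiler, provided the source guards the definition with #ifndef as in the sketch.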
