
Conversation

jkrauss82

Related issue and conversation: #60

Closes #60

@mcharytoniuk
Contributor

@jkrauss82 To sum up, from what I see, you are assigning models to specific agents, and then you made changes to how requests are handled to redirect them to agents with that model loaded, correct?

So, your intention is to have a mechanism for the balancer to support multiple models and direct traffic based on that?

@jkrauss82
Author

> @jkrauss82 To sum up, from what I see, you are assigning models to specific agents, and then you made changes to how requests are handled to redirect them to agents with that model loaded, correct?

Yes, this is correct. I added a new property, "model", to the status update and the upstream peer to track which model they are serving, so when the user requests a specific model, paddler takes this into account and chooses the upstream peer accordingly.

> So, your intention is to have a mechanism for the balancer to support multiple models and direct traffic based on that?

Exactly. We can still allow "any routing" in case no model is specified in a request, or it is the empty string. But if the user specifies a model, we only assign a slot if the upstream peer is serving the desired model.
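
The slot-assignment rule described above could be sketched roughly like this (the type and field names are illustrative assumptions, not the actual PR code):

```rust
// Sketch: an agent's status update reports which model its llama-server
// has loaded; the balancer only hands out a slot when that model matches
// the request, or when the request names no model at all ("any routing").

struct StatusUpdate {
    agent: String,
    model: String, // "" means the agent did not report a model
    idle_slots: usize,
}

/// "Any routing" when the request names no model; otherwise require a match.
fn can_serve(update: &StatusUpdate, requested_model: &str) -> bool {
    update.idle_slots > 0
        && (requested_model.is_empty() || update.model == requested_model)
}
```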

@mcharytoniuk
Contributor

@jkrauss82 Ok, that is great because we want to have such a feature. The issue is that we are currently working on a Supervisor feature, something that allows us to remotely manage llama-server instances (swapping models, parameters, etc.), but we are doing that under the assumption that all the agents use the same model.

So far we have planned it more or less like this:

```mermaid
flowchart
    subgraph Fleet
        balancer("Balancer")
        agent_1("Agent 1")
        agent_2("Agent 2")
        llama_1("llama-server 1")
        llama_2("llama-server 2")
        supervisor_1("Supervisor 1")
        supervisor_2("Supervisor 2")
        agent_1 --> balancer
        agent_2 --> balancer
        supervisor_1 --> balancer
        supervisor_1 --> llama_1
        supervisor_2 --> balancer
        supervisor_2 --> llama_2
        llama_1 --> agent_1
        llama_2 --> agent_2
    end
```

Supervisors manage llama-server instances (additional components alongside agents) and connect to the balancer. Then, we planned to expose a management endpoint in the balancer's API to enable swapping models across all llama servers (as a rolling release with zero downtime).

This entire setup (multiple agents and supervisors, single balancer) is what we plan to call a "Fleet" (so a single fleet is a synchronized set of services running the same model, etc.), which should make resource management easy.
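
The rolling, zero-downtime model swap described above could look roughly like the following sketch; the `Supervisor` methods here are hypothetical stubs for illustration, not paddler's real interface:

```rust
// Sketch of a rolling model swap across a fleet: take one llama-server out
// of rotation at a time, swap its model, and only move on once it is
// healthy again, so the rest of the fleet keeps serving traffic throughout.

struct Supervisor {
    model: String,
    healthy: bool,
}

impl Supervisor {
    fn drain(&mut self) {
        // Stop routing new requests to this instance (stub).
        self.healthy = false;
    }
    fn swap_model(&mut self, model: &str) {
        self.model = model.to_string();
    }
    fn wait_healthy(&mut self) {
        // Poll the instance until it serves the new model (stub).
        self.healthy = true;
    }
}

/// Swap the whole fleet to `new_model`, one instance at a time.
fn rolling_swap(fleet: &mut [Supervisor], new_model: &str) {
    for s in fleet.iter_mut() {
        s.drain();
        s.swap_model(new_model);
        s.wait_healthy();
    }
}
```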


What you want to do, we also want to support, but differently - by having something on top of fleets, let's call that a Cluster.

```mermaid
flowchart
    cluster("Cluster")
    fleet_1("Fleet 1: Mistral")
    fleet_2("Fleet 2: Qwen")
    fleet_1 --> cluster
    fleet_2 --> cluster
```

What the new "Cluster" component can do is direct traffic to specific fleets, so we are building a hierarchical, tree-like structure to make everything modular and scalable. Therefore, we do not want to add that feature at the balancer or agent level but rather inside another component that builds on top of those.

Let me know your thoughts on that, and on the concept in general. If you want to help build the system in that way, we will gladly accept your help. :D

@jkrauss82
Author

Thanks for this detailed explanation and the flow charts; that makes sense to me so far, and I think the concept should work like that in general. This is a very valuable addition to paddler's features and a lot of users will benefit from it!

I have two questions (please feel free just to point me to a discussion where these were already answered in case such a discussion exists):

  1. Why not integrate the supervisor and agent capabilities in a single role?
  2. Why integrate the balancer within the fleet and not use an integrated cluster/balancer role?

Remark regarding the second question: due to a limitation of the Pingora proxy (it selects the upstream peer before processing the request body), it would actually be beneficial for inspecting the request body to have a separate router component handle the routing.

@mcharytoniuk
Contributor

@jkrauss82 Thanks a lot!

  1. Purely because we want to make the supervisor an optional feature.
  2. I don't get this one entirely. How would an integrated cluster role be different than what we have now/planned?

@jkrauss82
Author

> I don't get this one entirely. How would an integrated cluster role be different than what we have now/planned?

If I understand the planned architecture correctly, it has a cluster service running in front of one or more fleets. The cluster service would take care of routing incoming requests to the appropriate fleet, where a balancer then assigns a slot.

If the balancer role had knowledge of all fleets (or maybe even all llamacpp instances) in the cluster, it could integrate the routing part as well.

So looking at it this way, my question is more about the fleet concept, which adds another layer to the architecture, because without fleets the cluster service would probably not be required. But my use case involves a comparatively small number of llamacpp instances and requests, and I can imagine that for larger setups the fleet concept will be very welcome.

@mcharytoniuk
Contributor

@jkrauss82 Yes, we planned fleets and clusters around scalable deployments, but in general I think your use case makes sense. I've been thinking for some time about how to approach this, because I think both features are useful (fleets, and some kind of routing based on parameters).

What I had an issue with was that it seemed out of place to me to add parameter-based routing to a balancer (I might be biased on that, though).

I just realized that what you are trying to do is more like a gateway than a balancer (directing traffic to specific hosts based on a specific parameter), which we can add. I have an idea on how we can implement this, so fleets stay as they are, and we can also handle your use case.

I think we need to add a feature, `paddler gateway`, that will handle such traffic. We can reuse agents to connect to it instead of to a balancer, so we would have something like this:

```mermaid
flowchart
    gateway("Gateway")
    agent_1("Agent 1")
    agent_2("Agent 2")
    llama_1("Llama-server 1: Mistral")
    llama_2("Llama-server 2: Qwen")
    agent_1 --> gateway
    agent_2 --> gateway
    llama_1 --> agent_1
    llama_2 --> agent_2
```

So, in general, that would mean Paddler could operate in two modes:

- balancer mode (as it is now), to just direct requests to whichever llama instance has the most available slots
- gateway mode, to direct traffic based on the requested model; it can also potentially integrate with a balancer. Setting up a gateway should be as simple as setting up the balancer, but it could have an entirely different routing algorithm and behavior (it can still have some elements of balancing requests between hosts with identical models).
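
The difference between the two modes could be sketched like this (types and names are illustrative assumptions, not paddler's actual code): both balance on free slots, but gateway mode first narrows the candidates to peers serving the requested model.

```rust
// Sketch: one routing function, two modes. Balancer mode ignores models;
// gateway mode only considers peers whose loaded model matches the request.

struct Peer {
    addr: &'static str,
    model: &'static str,
    free_slots: usize,
}

enum Mode<'a> {
    /// Current behavior: most available slots wins, model is ignored.
    Balancer,
    /// Gateway: only peers serving the requested model are considered.
    Gateway { requested_model: &'a str },
}

fn route<'a>(peers: &'a [Peer], mode: &Mode) -> Option<&'a Peer> {
    peers
        .iter()
        .filter(|p| p.free_slots > 0)
        .filter(|p| match mode {
            Mode::Balancer => true,
            Mode::Gateway { requested_model } => p.model == *requested_model,
        })
        // Among the qualifying peers, still prefer the least loaded one.
        .max_by_key(|p| p.free_slots)
}
```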

So I would be happy to do it that way; this way we will be able to continue working on fleets and scalable deployments (with the assumption that the entire fleet has the same model loaded in every llama-server instance), and at the same time we can have a Gateway that handles multiple models the way you described (maybe even with fallback to 3rd-party APIs?) as an additional feature.

So, if you are open to it, I would do it this way. :) That would pretty much require adding an additional command-line action, `paddler gateway`, moving your code to a gateway proxy, and reusing some code from the balancer. I will be happy to work with you on that and help along the way as much as necessary.

What do you think? :)

@jkrauss82
Author

@mcharytoniuk thanks for the explanation and the suggestion. I think this makes sense and would cover my use case very well. It is also true that I use the balancer like a gateway, yet it also serves as a load balancer, since it distributes requests over multiple llamacpp instances running the same model, basically exactly as you described.

What would be beneficial for the gateway is a more robust way to inspect the request body. Currently this is limited by Pingora's request lifecycle, which selects the upstream peer before the request body is processed.
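
To illustrate why body inspection matters here: in an OpenAI-style request, the model name lives inside the JSON body, so a router must read (at least part of) the body before it can pick an upstream. The extractor below is a deliberately naive, illustrative sketch; real code would use a proper JSON parser such as serde_json.

```rust
// Naive illustration: pull the value of the "model" key out of a JSON body.
// Not robust (no escaping, no nested-key awareness); for sketching only.
fn extract_model(body: &str) -> Option<String> {
    // Locate the "model" key, then take the next double-quoted value.
    let key_pos = body.find("\"model\"")?;
    let rest = &body[key_pos + "\"model\"".len()..];
    let colon = rest.find(':')?;
    let rest = &rest[colon + 1..];
    let open = rest.find('"')?;
    let rest = &rest[open + 1..];
    let close = rest.find('"')?;
    Some(rest[..close].to_string())
}
```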

I would be open to implementing that, and thanks for offering your support along the way. Due to the upcoming holiday season I will not be at the keyboard that often until the end of July, but I can probably set up a few basic things this week and continue the discussion here on GitHub.


Successfully merging this pull request may close these issues.

Does Paddler consider the "model" parameter at all?