Model aware routing #79
base: main
Conversation
…k to handle only first 64k characters available in buffer
@jkrauss82 To sum up, from what I see, you are assigning models to specific agents, and then you made changes to how requests are handled to redirect them to agents with that model loaded, correct? So, your intention is to have a mechanism for the balancer to support multiple models and direct traffic based on that?
Yes, this is correct. I added a new property "model" to the status update and the upstream peer to track which model they are serving, so when the user requests a specific model, Paddler considers this and chooses the upstream peer accordingly.
Exactly. We can still allow "any" routing in case no model is specified in a request, or it is the empty string. But when the user specifies a model, we only assign a slot if the upstream peer is serving the desired model.
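To make the selection rule above concrete, here is a minimal std-only Rust sketch. The types and field names (`UpstreamPeer`, `slots_idle`, `select_peer`) are hypothetical, not Paddler's actual code; it only illustrates "empty model = any routing, otherwise require a matching peer":

```rust
// Hypothetical sketch of model-aware upstream selection (not Paddler's
// real types): each peer reports the model it serves in its status
// update, and the balancer only assigns a slot on a matching peer.

#[derive(Debug)]
pub struct UpstreamPeer {
    pub address: String,
    pub model: String,      // model name reported in the agent's status update
    pub slots_idle: usize,  // free llama-server slots on this peer
}

/// Pick the peer with the most idle slots that serves `requested_model`.
/// An empty `requested_model` keeps the old behaviour and matches any peer.
pub fn select_peer<'a>(
    peers: &'a [UpstreamPeer],
    requested_model: &str,
) -> Option<&'a UpstreamPeer> {
    peers
        .iter()
        .filter(|p| p.slots_idle > 0)
        .filter(|p| requested_model.is_empty() || p.model == requested_model)
        .max_by_key(|p| p.slots_idle)
}
```

For example, with one peer serving "mistral" and one serving "qwen", a request for "mistral" is pinned to the first peer, while a request with an empty model string falls back to whichever peer has the most idle slots.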
@jkrauss82 Ok, that is great because we want to have such a feature, but the issue is: we are currently working on a Supervisor feature, something that allows us to remotely manage llama-server instances (swapping models, parameters, etc.), but we are doing that with the assumption that all the agents use the same model. So we planned it more or less like that so far:

```mermaid
flowchart
    subgraph Fleet
        balancer("Balancer")
        agent_1("Agent 1")
        agent_2("Agent 2")
        llama_1("llama-server 1")
        llama_2("llama-server 2")
        supervisor_1("Supervisor 1")
        supervisor_2("Supervisor 2")
        agent_1 --> balancer
        agent_2 --> balancer
        supervisor_1 --> balancer
        supervisor_1 --> llama_1
        supervisor_2 --> balancer
        supervisor_2 --> llama_2
        llama_1 --> agent_1
        llama_2 --> agent_2
    end
```
Supervisors manage llama-server instances (additional components alongside agents) and connect to the balancer. Then, we planned to expose a management endpoint in the balancer's API to enable swapping models across all llama-servers (as a rolling release with zero downtime). This entire setup (multiple agents and supervisors, single balancer) we plan to call a "Fleet" (so a single fleet is a synchronized set of services, the same model, etc.), which should make resource management easy. What you want to do, we also want to support, but differently: by having something on top of fleets, let's call that a Cluster.

```mermaid
flowchart
    cluster("Cluster")
    fleet_1("Fleet 1: Mistral")
    fleet_2("Fleet 2: Qwen")
    fleet_1 --> cluster
    fleet_2 --> cluster
```
What the new "Cluster" component can do is direct traffic to specific fleets, so we are building a hierarchical, tree-like structure to make everything modular and scalable. Therefore, we do not want to add that feature at the balancer level. Let me know your thoughts on that. If you want to help build the system in that way, we will gladly accept your help. :D Also, let me know your thoughts on the concept.
Thanks for this detailed explanation and the flow charts, that makes sense to me so far and I guess the concept should work like that in general. I think this is a very valuable addition to paddler's features and a lot of users will benefit from this! I have two questions (please feel free just to point me to a discussion where these were already answered in case such a discussion exists):
Remark regarding the second question: due to the limitations of the Pingora proxy, it would actually be beneficial, for the purpose of inspecting the request body, to have a separate router component handle the routing.
@jkrauss82 Thanks a lot!
If I understand the planned architecture correctly, it has a cluster service running in front of one or more fleets. The cluster service would take care of routing incoming requests to the appropriate fleet, where a balancer then assigns a slot. If the balancer had knowledge of all fleets (or maybe even all llama.cpp instances) in the cluster, it could integrate the routing part as well. So looking at it this way, my question is more about the fleet concept, which adds another layer to the architecture, because without fleets the cluster service would probably not be required. But my use case is for a comparably small number of llama.cpp instances and requests, and I can imagine that for larger setups the fleet concept will be very welcome.
@jkrauss82 Yes, we planned fleets and clusters around scalable deployments, but in general, I think your use case makes sense, and I've been thinking for some time about how to approach this, because I think both features are useful (fleets and some kind of routing based on parameters). What I had an issue with was that it seemed out of place to me to add parameter-based routing to a balancer (I might be biased on that, though). I just realized that what you are trying to do is more like a gateway than a balancer (directing traffic to specific hosts based on a specific parameter), which we can add. I have an idea on how we can implement this, so fleets stay as they are, and we can also handle your use case. I think we need to add a feature like this:

```mermaid
flowchart
    gateway("Gateway")
    agent_1("Agent 1")
    agent_2("Agent 2")
    llama_1("Llama-server 1: Mistral")
    llama_2("Llama-server 2: Qwen")
    agent_1 --> gateway
    agent_2 --> gateway
    llama_1 --> agent_1
    llama_2 --> agent_2
```
So, in general, that would mean Paddler could operate in two modes: as a balancer (a fleet in which every llama-server serves the same model) or as a gateway (routing requests across multiple models).
So I would be happy to do it that way; this way we will be able to continue working on fleets and scalable deployments (with the assumption that the entire fleet has the same model loaded in every llama-server instance), and at the same time we can have another feature (a Gateway) that can handle multiple models the way you described (maybe even with fallback to 3rd-party APIs?) as an additional feature. So, if you are open to it, I would do it this way. :) That would pretty much require adding an additional command line action. What do you think? :)
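A rough std-only Rust sketch of what the proposed Gateway mode could look like (all names are hypothetical, and the crude string scan stands in for real JSON parsing, which an actual implementation would use): extract the `model` field from the request body and look up the fleet registered for it.

```rust
// Illustrative Gateway routing sketch, not Paddler's actual API:
// map the "model" field of an OpenAI-style request body to a fleet.

use std::collections::HashMap;

/// Extract the value of the "model" key from a JSON body with a minimal
/// string scan. A real implementation would use a proper JSON parser.
pub fn extract_model(body: &str) -> Option<String> {
    let key = "\"model\"";
    let start = body.find(key)? + key.len();
    let rest = &body[start..];
    let open = rest.find('"')? + 1;            // opening quote of the value
    let close = rest[open..].find('"')? + open; // closing quote of the value
    Some(rest[open..close].to_string())
}

/// Resolve the fleet (or upstream) responsible for the requested model.
/// Returns None when the body has no model or no fleet serves it.
pub fn route<'a>(fleets: &'a HashMap<String, String>, body: &str) -> Option<&'a String> {
    fleets.get(&extract_model(body)?)
}
```

For instance, with `{"mistral" -> "fleet-1", "qwen" -> "fleet-2"}` registered, a body containing `"model": "qwen"` resolves to `fleet-2`, and a body with no known model yields `None`, at which point the gateway could reject the request or fall back to "any" routing.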
@mcharytoniuk Thanks for the explanation and the suggestion. I think this makes sense and would cover my use case very well. It is also true that I use the balancer like a gateway, yet it also serves as a load balancer, as it distributes requests over multiple llama.cpp instances running the same model, basically exactly as you described it.

What would be beneficial for the gateway is a more robust way to inspect the request body. Currently this is limited by the Pingora logic, which selects the upstream peer before processing the request body.

I would be open to implementing that, and thanks for offering your support along the way. Due to the upcoming holiday season I will not be behind the keyboard that often until the end of July, but I can probably set up a few basic things this week and continue discussing here on GitHub.
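Regarding the body-inspection limit mentioned above (and the 64k buffer from the earlier commit), a std-only sketch of the bounded-peek idea, with hypothetical names: buffer at most a fixed prefix of the body before making the routing decision, so routing never has to hold an arbitrarily large body in memory.

```rust
// Illustrative sketch (hypothetical names): inspect only a bounded
// prefix of the request body for routing, mirroring the "first 64k
// characters available in buffer" behaviour mentioned in the commit.

use std::io::Read;

const PEEK_LIMIT: usize = 64 * 1024; // 64 KiB routing-inspection window

/// Read up to `PEEK_LIMIT` bytes from the body stream. Routing logic
/// (e.g. extracting the "model" field) only ever sees this prefix.
pub fn peek_body_prefix<R: Read>(body: &mut R) -> std::io::Result<Vec<u8>> {
    let mut prefix = Vec::new();
    body.take(PEEK_LIMIT as u64).read_to_end(&mut prefix)?;
    Ok(prefix)
}
```

The trade-off is that a request whose `model` field appears after the first 64 KiB would not be routable by model, which is why a dedicated router component with full body access could be more robust than doing this inside the proxy's upstream-selection hook.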
Related issue and conversation: #60
Closes #60