
Conversation

jkrauss82

Related issue and conversation: #60

Closes #60

@mcharytoniuk
Contributor

@jkrauss82 To sum up, from what I see, you are assigning models to specific agents, and then you made changes to how requests are handled to redirect them to agents with that model loaded, correct?

So, your intention is to have a mechanism for the balancer to support multiple models and direct traffic based on that?

@jkrauss82
Author

> @jkrauss82 To sum up, from what I see, you are assigning models to specific agents, and then you made changes to how requests are handled to redirect them to agents with that model loaded, correct?

Yes, this is correct. I added a new property, "model", to the status update and the upstream peer to track which model they are serving, so when the user requests a specific model, paddler takes this into account and chooses the upstream peer accordingly.

> So, your intention is to have a mechanism for the balancer to support multiple models and direct traffic based on that?

Exactly. We can still allow "any routing" in case no model is specified in a request, or it is the empty string. But if the user specifies a model, we only assign a slot if the upstream peer is serving the desired model.
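
The slot-assignment rule described above could be sketched roughly like this (the type and field names are illustrative assumptions, not the actual PR code):

```rust
// Sketch: an agent's status update reports which model its llama-server
// has loaded; the balancer only hands out a slot when that model matches
// the request, or when the request names no model at all ("any routing").

struct StatusUpdate {
    agent: String,
    model: String, // "" means the agent did not report a model
    idle_slots: usize,
}

/// "Any routing" when the request names no model; otherwise require a match.
fn can_serve(update: &StatusUpdate, requested_model: &str) -> bool {
    update.idle_slots > 0
        && (requested_model.is_empty() || update.model == requested_model)
}
```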

@mcharytoniuk
Contributor

@jkrauss82 Ok, that is great because we want to have such a feature. The issue is that we are currently working on a Supervisor feature, something that allows us to remotely manage llama-server instances (swapping models, parameters, etc.), but we are doing that under the assumption that all the agents use the same model.

So far we have planned it more or less like this:

```mermaid
flowchart
    subgraph Fleet
        balancer("Balancer")
        agent_1("Agent 1")
        agent_2("Agent 2")
        llama_1("llama-server 1")
        llama_2("llama-server 2")
        supervisor_1("Supervisor 1")
        supervisor_2("Supervisor 2")
        agent_1 --> balancer
        agent_2 --> balancer
        supervisor_1 --> balancer
        supervisor_1 --> llama_1
        supervisor_2 --> balancer
        supervisor_2 --> llama_2
        llama_1 --> agent_1
        llama_2 --> agent_2
    end
```

Supervisors manage llama-server instances (additional components alongside agents) and connect to the balancer. Then, we planned to expose a management endpoint in the balancer's API to enable swapping models across all llama servers (as a rolling release with zero downtime).

This entire setup (multiple agents and supervisors, single balancer) is what we plan to call a "Fleet" (so a single fleet is a synchronized set of services running the same model, etc.), which should make resource management easy.
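
The rolling, zero-downtime model swap described above could look roughly like the following sketch; the `Supervisor` methods here are hypothetical stubs for illustration, not paddler's real interface:

```rust
// Sketch of a rolling model swap across a fleet: take one llama-server out
// of rotation at a time, swap its model, and only move on once it is
// healthy again, so the rest of the fleet keeps serving traffic throughout.

struct Supervisor {
    model: String,
    healthy: bool,
}

impl Supervisor {
    fn drain(&mut self) {
        // Stop routing new requests to this instance (stub).
        self.healthy = false;
    }
    fn swap_model(&mut self, model: &str) {
        self.model = model.to_string();
    }
    fn wait_healthy(&mut self) {
        // Poll the instance until it serves the new model (stub).
        self.healthy = true;
    }
}

/// Swap the whole fleet to `new_model`, one instance at a time.
fn rolling_swap(fleet: &mut [Supervisor], new_model: &str) {
    for s in fleet.iter_mut() {
        s.drain();
        s.swap_model(new_model);
        s.wait_healthy();
    }
}
```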


What you want to do, we also want to support, but differently - by having something on top of fleets, let's call that a Cluster.

```mermaid
flowchart
    cluster("Cluster")
    fleet_1("Fleet 1: Mistral")
    fleet_2("Fleet 2: Qwen")
    fleet_1 --> cluster
    fleet_2 --> cluster
```

What the new "Cluster" component can do is direct traffic to specific fleets, so we are building a hierarchical, tree-like structure to make everything modular and scalable. Therefore, we do not want to add that feature at the balancer or agent level but rather inside another component that builds on top of those.

Let me know your thoughts on that, and on the concept in general. If you want to help build the system in that way, we will gladly accept your help. :D

@jkrauss82
Author

Thanks for this detailed explanation and the flow charts; that makes sense to me so far, and I think the concept should work like that in general. This is a very valuable addition to paddler's features and a lot of users will benefit from it!

I have two questions (please feel free just to point me to a discussion where these were already answered in case such a discussion exists):

  1. Why not integrate the supervisor and agent capabilities in a single role?
  2. Why integrate the balancer within the fleet and not use an integrated cluster/balancer role?

Remark regarding the second question: due to a limitation of the Pingora proxy (it selects the upstream peer before processing the request body), it would actually be beneficial for inspecting the request body to have a separate router component handle the routing.

@mcharytoniuk
Contributor

@jkrauss82 Thanks a lot!

  1. Purely because we want to make the supervisor an optional feature.
  2. I don't get this one entirely. How would an integrated cluster role be different than what we have now/planned?

@jkrauss82
Author

> I don't get this one entirely. How would an integrated cluster role be different than what we have now/planned?

If I understand the planned architecture correctly, it has a cluster service running in front of one or more fleets. The cluster service would take care of routing incoming requests to the appropriate fleet, where a balancer then assigns a slot.

If the balancer role had knowledge of all fleets (or maybe even all llamacpp instances) in the cluster, it could integrate the routing part as well.

So looking at it this way, my question is more about the fleet concept, which adds another layer to the architecture, because without fleets the cluster service would probably not be required. But my use case involves a comparatively small number of llamacpp instances and requests, and I can imagine that for larger setups the fleet concept will be very welcome.

@mcharytoniuk
Contributor

@jkrauss82 Yes, we planned fleets and clusters around scalable deployments, but in general I think your use case makes sense. I've been thinking for some time about how to approach this, because I think both features are useful (fleets, and some kind of routing based on parameters).

What I had an issue with was that it seemed out of place to me to add parameter-based routing to a balancer (I might be biased on that, though).

I just realized that what you are trying to do is more like a gateway than a balancer (directing traffic to specific hosts based on a specific parameter), which we can add. I have an idea on how we can implement this, so fleets stay as they are, and we can also handle your use case.

I think we need to add a feature, `paddler gateway`, that will handle such traffic. We can reuse agents to connect to it instead of to a balancer, so we would have something like this:

```mermaid
flowchart
    gateway("Gateway")
    agent_1("Agent 1")
    agent_2("Agent 2")
    llama_1("Llama-server 1: Mistral")
    llama_2("Llama-server 2: Qwen")
    agent_1 --> gateway
    agent_2 --> gateway
    llama_1 --> agent_1
    llama_2 --> agent_2
```

So, in general, that would mean Paddler could operate in two modes:

- balancer mode (as it is now), to just direct requests to whichever llama instance has the most available slots
- gateway mode, to direct traffic based on the requested model; it can also potentially integrate with a balancer. Setting up a gateway should be as simple as setting up the balancer, but it could have an entirely different routing algorithm and behavior (it can still have some elements of balancing requests between hosts with identical models).
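
The difference between the two modes could be sketched like this (types and names are illustrative assumptions, not paddler's actual code): both balance on free slots, but gateway mode first narrows the candidates to peers serving the requested model.

```rust
// Sketch: one routing function, two modes. Balancer mode ignores models;
// gateway mode only considers peers whose loaded model matches the request.

struct Peer {
    addr: &'static str,
    model: &'static str,
    free_slots: usize,
}

enum Mode<'a> {
    /// Current behavior: most available slots wins, model is ignored.
    Balancer,
    /// Gateway: only peers serving the requested model are considered.
    Gateway { requested_model: &'a str },
}

fn route<'a>(peers: &'a [Peer], mode: &Mode) -> Option<&'a Peer> {
    peers
        .iter()
        .filter(|p| p.free_slots > 0)
        .filter(|p| match mode {
            Mode::Balancer => true,
            Mode::Gateway { requested_model } => p.model == *requested_model,
        })
        // Among the qualifying peers, still prefer the least loaded one.
        .max_by_key(|p| p.free_slots)
}
```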

So I would be happy to do it that way; this way we will be able to continue working on fleets and scalable deployments (with the assumption that the entire fleet has the same model loaded in every llama-server instance), and at the same time we can have a Gateway that handles multiple models the way you described (maybe even with fallback to 3rd-party APIs?) as an additional feature.

So, if you are open to it, I would do it this way. :) That would pretty much require adding an additional command-line action, `paddler gateway`, moving your code to a gateway proxy, and reusing some code from the balancer. I will be happy to work with you on that and help along the way as much as necessary.

What do you think? :)

@jkrauss82
Author

@mcharytoniuk thanks for the explanation and the suggestion. I think this makes sense and would cover my use case very well. It is also true that I use the balancer like a gateway, yet it also serves as a load balancer, since it distributes requests over multiple llamacpp instances running the same model, basically exactly as you described.

What would be beneficial for the gateway is a more robust way to inspect the request body. Currently this is limited by Pingora's request lifecycle, which selects the upstream peer before the request body is processed.
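
To illustrate why body inspection matters here: in an OpenAI-style request, the model name lives inside the JSON body, so a router must read (at least part of) the body before it can pick an upstream. The extractor below is a deliberately naive, illustrative sketch; real code would use a proper JSON parser such as serde_json.

```rust
// Naive illustration: pull the value of the "model" key out of a JSON body.
// Not robust (no escaping, no nested-key awareness); for sketching only.
fn extract_model(body: &str) -> Option<String> {
    // Locate the "model" key, then take the next double-quoted value.
    let key_pos = body.find("\"model\"")?;
    let rest = &body[key_pos + "\"model\"".len()..];
    let colon = rest.find(':')?;
    let rest = &rest[colon + 1..];
    let open = rest.find('"')?;
    let rest = &rest[open + 1..];
    let close = rest.find('"')?;
    Some(rest[..close].to_string())
}
```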

I would be open to implementing that, and thanks for offering your support along the way. Due to the upcoming holiday season I will not be at the keyboard that often until the end of July, but I can probably set up a few basic things this week and continue the discussion here on GitHub.


Successfully merging this pull request may close these issues.

Does Paddler consider the "model" parameter at all?