Initial model swap work #1807
Conversation
Reviewer's Guide

Introduce a containerized daemon service with new CLI commands, a Python HTTP server exposing /api and /model endpoints to catalog, launch, stop, and proxy llama-server instances, using a ModelRunner and CommandFactory for process orchestration and structured logging for lifecycle management.

Sequence diagram for serving a new model via the Daemon API

```mermaid
sequenceDiagram
    actor User
    participant CLI
    participant DaemonAPI
    participant ModelRunner
    participant CommandFactory
    participant LlamaServer
    User->>CLI: POST /api/serve (model_name, runtime, exec_args)
    CLI->>DaemonAPI: POST /api/serve
    DaemonAPI->>ModelRunner: next_available_port()
    DaemonAPI->>CommandFactory: build(model, runtime, port, exec_args)
    CommandFactory-->>DaemonAPI: serve command
    DaemonAPI->>ModelRunner: add_model(managed_model)
    DaemonAPI->>ModelRunner: start_model(model_id)
    ModelRunner->>LlamaServer: start process
    DaemonAPI-->>CLI: ServeResponse (model_id, serve_path)
```

Sequence diagram for proxying requests to a running model

```mermaid
sequenceDiagram
    actor User
    participant ProxyAPI
    participant ModelRunner
    participant LlamaServer
    User->>ProxyAPI: GET/POST /model/{model_id}/...
    ProxyAPI->>ModelRunner: lookup managed_model by model_id
    ProxyAPI->>LlamaServer: forward request to correct port
    LlamaServer-->>ProxyAPI: response
    ProxyAPI-->>User: response
```
Summary of Changes
Hello @engelmi, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how models are served and managed in `ramalama` by introducing a new daemon component. This daemon provides a robust, centralized, and API-driven system for orchestrating AI model serving processes, moving towards a more dynamic and scalable model management architecture.
Highlights
- **New Daemon Component:** This pull request introduces a new `daemon` component to the `ramalama` project. This daemon is designed to run as a server, providing a centralized and dynamic way to manage and serve AI models within a containerized environment.
- **Enhanced CLI Commands:** New command-line interface (CLI) commands, specifically `ramalama daemon setup` and `ramalama daemon run`, have been added. These commands facilitate the setup and execution of the new daemon, including container orchestration via `podman`.
- **RESTful Daemon API:** A new RESTful API is exposed by the daemon, offering endpoints for core operations. This includes listing available models (`/api/tags`), initiating new model serving processes (`/api/serve`), and stopping currently running models (`/api/stop`).
- **Model Proxying Capability:** The daemon now includes a proxy API, accessible via the `/model` path. This proxy intelligently forwards incoming requests to the appropriate dynamically started `llama-server` instances, enabling seamless interaction with served models.
- **Dynamic Model Lifecycle Management:** The daemon is capable of dynamically starting and stopping `llama-server` processes for various models. It handles the allocation of unique network ports for each served model and manages their complete lifecycle, from initiation to graceful shutdown.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces a new `daemon` component for managing and proxying AI models, which is a significant and valuable feature. The overall architecture, with a daemon, API handlers, and a model runner, is well structured.
However, the initial implementation has several critical issues that need to be addressed before merging. These include potential crashes due to a `NameError` and unsafe dictionary modifications, as well as incorrect logic that could lead to arguments being dropped. There are also several high-severity issues related to hardcoded values that break the application's configurability, and an incorrect network client implementation.
I've provided detailed comments and suggestions to fix these issues. Addressing them will significantly improve the stability, correctness, and maintainability of the new daemon feature.
Nice work @engelmi!!! I'm trying to test this on my nvidia machine but running into this:

Would it be possible to add `--image` to this argument?

After fixing that error and resolving the image I got it to work!!
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- ModelRunner’s internal dicts and port tracking are accessed by multiple handler threads—consider adding thread‐safety (e.g. locks) to avoid race conditions under concurrent requests (see the sketch after this list).
- There’s a lot of duplicated referer extraction and path parsing logic between RamalamaHandler and ModelProxyHandler—extracting that into a shared helper would DRY up the code and reduce inconsistencies.
- CommandFactory’s _set_defaults and _build_llama_serve_command methods build up exec_args via repeated dict updates—consider using a dataclass with default values or a clear merge strategy to simplify and harden argument handling.
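For the thread-safety point, a minimal sketch of what guarding ModelRunner's shared state with a lock could look like; the attribute names (`_models`, `_used_ports`) and method signatures are assumptions for illustration, not the actual implementation:

```python
import threading
from typing import Optional


class ModelRunner:
    def __init__(self) -> None:
        # Hypothetical shared state, guarded by a single lock so concurrent
        # handler threads cannot race on model registration or port tracking.
        self._lock = threading.Lock()
        self._models: dict[str, object] = {}
        self._used_ports: set[int] = set()

    def add_model(self, model_id: str, managed_model: object) -> None:
        with self._lock:
            self._models[model_id] = managed_model

    def next_available_port(self, start: int = 8081, end: int = 8181) -> Optional[int]:
        # Reserving the port while holding the lock prevents two requests
        # from being handed the same port.
        with self._lock:
            for port in range(start, end):
                if port not in self._used_ports:
                    self._used_ports.add(port)
                    return port
            return None
```

The same pattern (acquire the lock around every read or modification of `_models`) would apply to starting, stopping, and looking up models from the proxy path.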
## Individual Comments
### Comment 1
<location> `ramalama/daemon/model_runner/command_factory.py:104` </location>
<code_context>
+ if self.request_args.get("webui") == "off":
+ cmd.extend(["--no-webui"])
+
+ if check_nvidia() or check_metal(SimpleNamespace({"container": False})):
+ cmd.extend(["--flash-attn"])
+
</code_context>
<issue_to_address>
Incorrect usage of SimpleNamespace constructor.
Passing a dictionary to SimpleNamespace creates an attribute 'container' containing the dictionary, not a boolean. This may cause check_metal to behave incorrectly.
</issue_to_address>
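A possible correction, shown as a hedged sketch; it assumes `check_metal` only reads an `args.container` attribute, which is not visible in the snippet above:

```python
from types import SimpleNamespace

# Keyword arguments set a boolean `container` attribute on the namespace.
# (A positional mapping is only accepted from Python 3.13 onwards and raises
# a TypeError on older interpreters.)
args = SimpleNamespace(container=False)
print(args.container)  # False

# In command_factory.py the call would then read (sketch, not the actual diff):
# if check_nvidia() or check_metal(SimpleNamespace(container=False)):
#     cmd.extend(["--flash-attn"])
```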
### Comment 2
<location> `ramalama/daemon/model_runner/command_factory.py:108` </location>
<code_context>
+ cmd.extend(["--flash-attn"])
+
+ # gpu arguments
+ ngl = self.request_args.get("ngl")
+ if ngl < 0:
+ ngl = 999
+ cmd.extend(["-ngl", f"{ngl}"])
</code_context>
<issue_to_address>
Potential type issue with ngl comparison.
Since ngl may be a string, ensure it is converted to an integer before performing the comparison to avoid a TypeError.
</issue_to_address>
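A hedged sketch of the normalization; the helper name and the -1 default are illustrative, not taken from the code under review:

```python
def normalize_ngl(raw_value, default: int = -1) -> int:
    """Coerce the requested GPU layer count to an int; negatives mean offload all layers."""
    try:
        ngl = int(raw_value)
    except (TypeError, ValueError):
        ngl = default
    return 999 if ngl < 0 else ngl


# Hypothetical usage inside the command builder:
# cmd.extend(["-ngl", str(normalize_ngl(self.request_args.get("ngl")))])
```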
### Comment 3
<location> `ramalama/daemon/model_runner/command_factory.py:112` </location>
<code_context>
+ if ngl < 0:
+ ngl = 999
+ cmd.extend(["-ngl", f"{ngl}"])
+ cmd.extend(["--threads", f"{self.request_args.get("threads")}"])
+
+ return cmd
</code_context>
<issue_to_address>
Possible type inconsistency for threads argument.
If llama-server requires threads as an integer, ensure it is cast from the request_args value.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
cmd.extend(["--threads", f"{self.request_args.get("threads")}"])
=======
threads = int(self.request_args.get("threads"))
cmd.extend(["--threads", f"{threads}"])
>>>>>>> REPLACE
</suggested_fix>
### Comment 4
<location> `ramalama/daemon/model_runner/runner.py:14` </location>
<code_context>
+ self.id = id
+ self.model = model
+ self.run_cmd: list[str] = run_cmd
+ self.port: str = port
+ self.process: Optional[subprocess.Popen] = None
+
</code_context>
<issue_to_address>
Port should be typed as int, not str.
Typing port as str could lead to type errors or unexpected behavior when used as an int elsewhere.
</issue_to_address>
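A minimal sketch of the typing change; the class name `ManagedModel` is an assumption, since the snippet does not show it:

```python
import subprocess
from typing import Optional


class ManagedModel:
    def __init__(self, id: str, model, run_cmd: list[str], port: int) -> None:
        self.id = id
        self.model = model
        self.run_cmd: list[str] = run_cmd
        # Keep the port as an int and convert to str only where it is
        # interpolated into a command line or URL.
        self.port: int = int(port)
        self.process: Optional[subprocess.Popen] = None
```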
### Comment 5
<location> `ramalama/daemon/model_runner/runner.py:71` </location>
<code_context>
+ del self._models[model_id]
+
+ def stop(self):
+ for id in self._models.keys():
+ self.stop_model(id)
</code_context>
<issue_to_address>
Modifying dictionary during iteration may cause issues.
Iterate over a list of keys instead to avoid runtime errors: for id in list(self._models.keys()).
</issue_to_address>
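The fix is a one-line change, sketched here against the quoted method:

```python
def stop(self):
    # Snapshot the keys first: stop_model() removes entries from self._models,
    # and mutating a dict while iterating over its live view raises RuntimeError.
    for model_id in list(self._models.keys()):
        self.stop_model(model_id)
```

Renaming the loop variable from `id` to `model_id` also avoids shadowing the built-in `id()`.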
### Comment 6
<location> `ramalama/daemon/handler/proxy.py:132` </location>
<code_context>
+ handler.wfile.flush()
+
+ logger.debug(f"Received response from -X {method} {target_url}\nRESPONSE: ")
+ except urllib.error.HTTPError as e:
+ handler.send_response(e.code)
+ handler.end_headers()
+ raise e
+ except urllib.error.URLError as e:
+ handler.send_response(500)
</code_context>
<issue_to_address>
Raising HTTPError after sending response may cause double error handling.
Instead of re-raising the exception after sending the error response, log the error and return to avoid duplicate handling.
</issue_to_address>
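A sketch of the error branches only, mirroring the quoted snippet (the surrounding try block and method body are omitted): relay the upstream status, log it, and return instead of re-raising.

```python
except urllib.error.HTTPError as e:
    # Relay the upstream status, log it, and return; re-raising after the
    # response has been started would cause a second error-handling pass.
    logger.warning(f"Upstream returned HTTP {e.code} for -X {method} {target_url}")
    handler.send_response(e.code)
    handler.end_headers()
    return
except urllib.error.URLError as e:
    logger.error(f"Failed to reach {target_url}: {e.reason}")
    handler.send_response(500)
    handler.end_headers()
    return
```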
### Comment 7
<location> `ramalama/daemon/logging.py:20` </location>
<code_context>
+ formatter = logging.Formatter(fmt, datefmt)
+
+ log_file_path = os.path.join(log_file, "ramalama-daemon.log")
+ handler = logging.FileHandler(log_file_path)
+ handler.setLevel(lvl)
+ handler.setFormatter(formatter)
</code_context>
<issue_to_address>
Logger may add multiple handlers on repeated configuration.
Repeated calls to configure_logger will add duplicate handlers, causing repeated log entries. Please check for existing handlers before adding new ones.
</issue_to_address>
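A sketch of the guard, assuming the function is called `configure_logger` and writes to the same file name as the snippet:

```python
import logging
import os


def configure_logger(log_dir: str, lvl: int = logging.DEBUG) -> logging.Logger:
    logger = logging.getLogger("ramalama-daemon")
    # If a previous call already attached handlers, reuse them instead of
    # stacking another FileHandler, which would duplicate every log line.
    if logger.handlers:
        return logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handler = logging.FileHandler(os.path.join(log_dir, "ramalama-daemon.log"))
    handler.setLevel(lvl)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(lvl)
    return logger
```

An alternative is calling `logger.handlers.clear()` before adding the handler, if reconfiguring with different settings should be supported.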
### Comment 8
<location> `ramalama/daemon/handler/ramalama.py:28` </location>
<code_context>
+ finally:
+ self.finish()
+
+ def do_GET(self):
+ logger.debug(f"Handling GET request for path: {self.path}")
+
</code_context>
<issue_to_address>
Consider refactoring the repeated logic in the HTTP verb methods into a single dispatch helper to reduce boilerplate and improve maintainability.
You can collapse all five `do_*` methods into a single dispatch helper, then have each `do_*` just call it. That both removes the repeated referer-parsing/logging and the routing boilerplate:
```python
class RamalamaHandler(http.server.SimpleHTTPRequestHandler):
    # … __init__ stays the same …

    def _dispatch(self, verb: str):
        logger.debug(f"Handling {verb} request for path: {self.path}")
        referer = self.headers.get("Referer")
        if referer:
            logger.debug(f"Request referer: {referer}")

        # 1) API handler
        if self.path.startswith(DaemonAPIHandler.PATH_PREFIX):
            handler = DaemonAPIHandler(self.model_store_path, self.model_runner)
            getattr(handler, f"handle_{verb.lower()}")(self)
            return

        # 2) Proxy handler
        is_referred = bool(referer and f"{ModelProxyHandler.PATH_PREFIX}/sha256-" in referer)
        if self.path.startswith(ModelProxyHandler.PATH_PREFIX) or is_referred:
            handler = ModelProxyHandler(self.model_runner)
            method = getattr(handler, f"handle_{verb.lower()}")
            method(self, is_referred)

    def do_GET(self):
        return self._dispatch("GET")

    def do_HEAD(self):
        return self._dispatch("HEAD")

    def do_POST(self):
        return self._dispatch("POST")

    def do_PUT(self):
        return self._dispatch("PUT")

    def do_DELETE(self):
        return self._dispatch("DELETE")
```
This preserves exactly the same behavior but pulls out the common bits into `_dispatch()`, cutting down on boilerplate by ~80%.
</issue_to_address>
## Security Issues
### Issue 1
<location> `ramalama/daemon/model_runner/runner.py:20` </location>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
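The command is already passed to `Popen` as an argument list without `shell=True`, so no shell interpolation occurs; since parts of the list come from request arguments, a defensive sketch could additionally validate them. The allowed character set below is purely illustrative, not the project's actual policy:

```python
import re
import subprocess


def start_process(run_cmd: list[str]) -> subprocess.Popen:
    # Passing a list without shell=True avoids shell interpolation, but
    # validating request-derived values still adds defense in depth.
    for arg in run_cmd:
        if not re.fullmatch(r"[\w./:=,\-]+", arg):
            raise ValueError(f"Refusing suspicious argument: {arg!r}")
    return subprocess.Popen(run_cmd)
```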
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Fixes: #598
@rhatdan PTAL. I think it is ready to be reviewed/merged (sorry for the size of the PR). Quick, updated summary of this PR:

I'd try to get this PR merged since it's already quite big. In follow-up PRs we can tackle things like:

I can create GH issues for these to keep track of and plan them.
ramalama/daemon/daemon.py (Outdated)

```python
continue

try:
    logger.error(f"Stopping expired model '{name}'...")
```
Should be Warn or Debug?
Changed to info since stopping an expired model is something expected and "normal" and debug seems not enough.
I would like to discuss how you see this working from a user's point of view. In the case of containers, would the ramalama daemon execute within the container?
My only request would be to try and test for interoperability with Docker Model Runner... Let's say a user takes a client that implements Docker Model Runner support; for the sake of argument, let's say it's something like AnythingLLM (a client could be a lot of things)... It would be nice if users could just switch the port the client points to between RamaLama/Docker Model Runner, etc. and everything "just works"... Same for the OCI artifact standard, etc.
I agree, I think we should have good integration between RamaLama and Docker Model Runner.
Thanks for the feedback! @ericcurtin Definitely! So far I haven't looked at the REST API of DMR, though. The DMR endpoints should be easy to mimic after some reverse engineering (since I can't find details such as the payload, unfortunately). The provided OpenAI endpoints are not yet clear to me. For example, we currently spin up multiple llama.cpp processes (containers without the daemon), so each llama.cpp instance has its own models... probably we need to improve the llama.cpp integration to achieve this, although I don't know yet if having multiple processes is better or worse than having one big process. So far I only looked at the Ollama CLI and made the two endpoints
Yes, the ramalama daemon starts a simple HTTP server - either directly on the host or in the ramalama container. Because of the latter, it would not be usable directly since this code is missing in the current ramalama images. This is also the reason why I didn't do the integration with the ramalama serve/run/chat commands yet. I think this delayed release of the change is not good, but I couldn't come up with a better idea other than pushing it to "build a new image on each merge". From the user perspective, nothing would change for these three commands - it would all be under the hood. However, we could then add additional CLI options over time leveraging this daemon, e.g.
Ok, let's merge and then we can continue to discuss and play with it. LGTM
This PR adds initial work on the model swap feature.

Basic idea

The core idea is roughly depicted in the following diagram:

The new `daemon` component is essentially a server providing two different REST APIs:

The command `ramalama daemon setup` starts the container with a running daemon inside, ready to handle requests. Currently, the following endpoints are available for the Daemon API:

Trying it out

At the moment, the `daemon` package is missing in the latest ramalama container images. Therefore, one needs to "rebuild" a custom container image, for example:

And then build the wheel package and container image:

When done, one can start the container via

and issue requests, e.g. via curl:
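The original example commands are not reproduced here; purely as an illustration of the endpoints described above, requests against the Daemon API might look like the following Python sketch (the port 8080, the payload keys, and the response shapes are assumptions, not a documented schema):

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed daemon address

# List available models (GET /api/tags).
with urllib.request.urlopen(f"{BASE}/api/tags") as resp:
    print(json.loads(resp.read()))

# Ask the daemon to serve a model (POST /api/serve); the payload keys mirror
# the sequence diagram above.
payload = json.dumps({"model_name": "tinyllama", "runtime": "llama.cpp", "exec_args": {}}).encode()
req = urllib.request.Request(
    f"{BASE}/api/serve",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```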
Navigating to `/model` in the browser (or again via curl) will list all running models.

Using the returned serve path from the serve call on the API makes it possible to use the webui of llama-server. Navigate to the webui:
Summary by Sourcery
Introduce a containerized Ramalama daemon to dynamically manage and proxy multiple model serving processes via REST APIs.
New Features:
- New `ramalama daemon` CLI with `setup` and `run` commands to launch a persistent model-serving daemon