Initial model swap work #1807

engelmi · 2025-08-06T15:27:34Z

This PR adds initial work on the model swap feature.

Basic idea

The core idea is roughly depicted in the following diagram:

The new daemon component is essentially a server providing two different REST APIs:

Daemon API to control the models run in the container
Proxy API to forward requests to the running models inside the container

The command ramalama daemon setup is starting the container with a running daemon inside, ready to handle requests. Currently, there are these endpoints for the Daemon API available:

GET /api/tags - lists all models available in the mounted model store (excluding oci ones)
GET /api/model - lists all running llama-server processes (so the running models) with additional information
POST /api/serve - start a new llama-server process with the passed parameters (and defaults from ramalama)
POST /api/stop - stop a llama-server process

Trying it out

At the moment, the daemon package is missing in the latest ramalama container images. Therefore, one needs to "rebuild" a custom container image, for example:

FROM quay.io/ramalama/rocm:latest

COPY dist/ /
RUN pip install --force-reinstall /ramalama-0.11.3-py3-none-any.whl

And then build the wheel package and container image:

# in ramalama root directory
$ make pypi-build
$ podman build -f custom-rocm -t quay.io/ramalama/rocm:latest .

When done, one can start the container via

./bin/ramalama daemon setup

and issue requests, e.g. via curl:

# list all available models
$ curl -X GET http://localhost:8080/api/tags
[
    {
        "name": "hf://mlx-community/Llama-3.2-1B-Instruct-4bit",
        "modified_at": "2025-07-24T14:32:16.156628+00:00",
        "size": 712575975,
        "digest": "",
        "details": {
            "format": "",
            "family": "",
            "families": [],
            "parameter_size": "",
            "quantization_level": ""
        }
    },
...
# serve smollm2:135m
curl  -X POST http://localhost:8080/api/serve -d '{"model_name":"smollm2:135m", "runtime":"llama.cpp", "exec_args": {"thinking": "true", "context": 2048, "temp": 0.8}}' -H "Content-Type: application/json"
{
    "model_id": "sha256-8cc4a7f5d0b22d87f9ca41680d72e87d62236588951180cdc479dbfa1c22a184",
    "serve_path": "/model/sha256-8cc4a7f5d0b22d87f9ca41680d72e87d62236588951180cdc479dbfa1c22a184"
}
# stop smollm2:135m
curl  -X POST http://localhost:8080/api/stop -d '{"model_name":"smollm2:135m"}' -H "Content-Type: application/json"

Navigating to /model in the browser (or again via curl) will list all running models:

Using the returned serve path from the serve call on the API enables to use the webui of llama-server:
Navigate to the webui:

Summary by Sourcery

Introduce a containerized Ramalama daemon to dynamically manage and proxy multiple model serving processes via REST APIs.

New Features:

Add ramalama daemon CLI with setup and run commands to launch a persistent model-serving daemon
Implement an HTTP server providing /api endpoints for listing available models, starting new model processes, and stopping them
Add a proxy handler under /model to list running models and forward requests to individual llama-server instances
Develop a ModelRunner framework for spawning, tracking, and terminating managed model processes with dynamic port allocation
Introduce CommandFactory and DTO classes to build serve commands and serialize/deserialize serve/stop request and response payloads
Provide graceful shutdown handling and file-based logging for the daemon

sourcery-ai · 2025-08-06T15:27:43Z

Reviewer's Guide

Introduce a containerized daemon service with new CLI commands, a Python HTTP server exposing /api and /model endpoints to catalog, launch, stop, and proxy llama-server instances, using a ModelRunner and CommandFactory for process orchestration and structured logging for lifecycle management.

Sequence diagram for serving a new model via Daemon API

sequenceDiagram
    actor User
    participant CLI
    participant DaemonAPI
    participant ModelRunner
    participant CommandFactory
    participant LlamaServer

    User->>CLI: POST /api/serve (model_name, runtime, exec_args)
    CLI->>DaemonAPI: POST /api/serve
    DaemonAPI->>ModelRunner: next_available_port()
    DaemonAPI->>CommandFactory: build(model, runtime, port, exec_args)
    CommandFactory-->>DaemonAPI: serve command
    DaemonAPI->>ModelRunner: add_model(managed_model)
    DaemonAPI->>ModelRunner: start_model(model_id)
    ModelRunner->>LlamaServer: start process
    DaemonAPI-->>CLI: ServeResponse (model_id, serve_path)

Sequence diagram for proxying requests to a running model

sequenceDiagram
    actor User
    participant ProxyAPI
    participant ModelRunner
    participant LlamaServer

    User->>ProxyAPI: GET/POST /model/{model_id}/...
    ProxyAPI->>ModelRunner: lookup managed_model by model_id
    ProxyAPI->>LlamaServer: forward request to correct port
    LlamaServer-->>ProxyAPI: response
    ProxyAPI-->>User: response

File-Level Changes

Change	Details	Files
CLI integration for daemon orchestration	Add `daemon_parser` in CLI subcommands Implement `daemon_setup_cli` to launch the container via `podman run` Implement `daemon_run_cli` to invoke the internal daemon entrypoint	`ramalama/cli.py`
Daemon entrypoint and server lifecycle	Create `RamalamaServer` with threading support Implement `ShutdownHandler` for graceful SIGINT/SIGTERM shutdown Provide `parse_args` and `run` functions to start the HTTP server	`ramalama/daemon/daemon.py`
Central request routing in HTTP handler	Implement `RamalamaHandler` to dispatch /api to API handler Route /model (and referer-based) requests to the proxy handler	`ramalama/daemon/handler/ramalama.py`
Daemon API to manage models	Implement GET /api/tags to list model store contents Implement POST /api/serve to parse ServeRequest, build and launch a model process Implement POST /api/stop to stop a running model by ID	`ramalama/daemon/handler/daemon.py`
Proxy API to forward inference/UI requests	Add GET listing of running models under /model Forward HEAD/GET/POST/PUT/DELETE to the correct model server based on path or Referer Handle hop-by-hop headers and errors during forwarding	`ramalama/daemon/handler/proxy.py`
Process orchestration utilities and DTOs	CommandFactory builds llama-server invocation with defaults and runtime-specific args ModelRunner tracks ports, generates unique model IDs, and manages subprocess lifecycle Define ServeRequest/ServeResponse, StopServeRequest, ModelResponse, RunningModelResponse DTOs Configure structured file-based logging for the daemon	`ramalama/daemon/model_runner/command_factory.py` `ramalama/daemon/model_runner/runner.py` `ramalama/daemon/dto/serve.py` `ramalama/daemon/dto/model.py` `ramalama/daemon/dto/proxy.py` `ramalama/daemon/dto/errors.py` `ramalama/daemon/logging.py`

Possibly linked issues

#0: The PR introduces a daemon with an API to serve models, fulfilling the issue's requirement for RamaLama images to be compatible with Podman AI Lab's model serving needs.
Create Nvidia Branch for testing and development #239: The PR introduces a daemon and APIs to manage and serve multiple models concurrently on a server, directly addressing the issue's goal of multi-model server support.

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

gemini-code-assist

Summary of Changes

Hello @engelmi, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how models are served and managed in ramalama by introducing a new daemon component. This daemon provides a robust, centralized, and API-driven system for orchestrating AI model serving processes, moving towards a more dynamic and scalable model management architecture.

Highlights

New Daemon Component: This pull request introduces a new daemon component to the ramalama project. This daemon is designed to run as a server, providing a centralized and dynamic way to manage and serve AI models within a containerized environment.
Enhanced CLI Commands: New command-line interface (CLI) commands, specifically ramalama daemon setup and ramalama daemon run, have been added. These commands facilitate the setup and execution of the new daemon, including container orchestration via podman.
RESTful Daemon API: A new RESTful API is exposed by the daemon, offering endpoints for core operations. This includes listing available models (/api/tags), initiating new model serving processes (/api/serve), and stopping currently running models (/api/stop).
Model Proxying Capability: The daemon now includes a proxy API, accessible via the /model path. This proxy intelligently forwards incoming requests to the appropriate dynamically started llama-server instances, enabling seamless interaction with served models.
Dynamic Model Lifecycle Management: The daemon is capable of dynamically starting and stopping llama-server processes for various models. It handles the allocation of unique network ports for each served model and manages their complete lifecycle, from initiation to graceful shutdown.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new daemon component for managing and proxying AI models, which is a significant and valuable feature. The overall architecture with a daemon, API handlers, and a model runner is well-structured.

However, the initial implementation has several critical issues that need to be addressed before merging. These include potential crashes due to a NameError and unsafe dictionary modifications, and incorrect logic that could lead to arguments being dropped. There are also several high-severity issues related to hardcoded values that break the application's configurability, and incorrect network client implementation.

I've provided detailed comments and suggestions to fix these issues. Addressing them will significantly improve the stability, correctness, and maintainability of the new daemon feature.

ramalama/daemon/dto/serve.py

ramalama/daemon/model_runner/runner.py

ramalama/cli.py

ramalama/daemon/handler/proxy.py

ramalama/daemon/logging.py

ramalama/daemon/daemon.py

ramalama/daemon/handler/daemon.py

ramalama/daemon/handler/ramalama.py

ramalama/daemon/model_runner/runner.py

bmahabirbu · 2025-08-16T04:28:42Z

Nice work @engelmi!!!

I'm trying to test this on my nvidia machine but running into this

brian@DESKTOP-SB69448:~/ramalama$ ./bin/ramalama --debug daemon setup
2025-08-16 00:24:54 - DEBUG - exec_cmd: podman run --pull never -i -t -d -p 8080:8080 -v /home/brian/.local/share/ramalama:/ramalama/models quay.io/ramalama/rocm:latest ramalama daemon run --store /ramalama/models
Error: quay.io/ramalama/rocm:latest: image not known
brian@DESKTOP-SB69448:~/ramalama$

Would it be possible to add --image to this argument?

bmahabirbu · 2025-08-16T04:49:10Z

After fixing that error and resolving the image I got it to work!!

brian@DESKTOP-SB69448:~/ramalama$ curl -X GET http://localhost:8080/api/tags
[
    {
        "name": "ollama://llama3.2/llama3.2:latest",
        "modified_at": "2025-07-25T00:19:06.530056+00:00",
        "size": 2019379366,
        "digest": "",
        "details": {
            "format": "",
            "family": "",
            "families": [],
            "parameter_size": "",
            "quantization_level": ""
        }
    },
    {
        "name": "ollama://tinyllama/tinyllama:latest",
        "modified_at": "2025-05-14T02:46:44.949334+00:00",
        "size": 637700077,
        "digest": "",
        "details": {
            "format": "",
            "family": "",
            "families": [],
            "parameter_size": "",
            "quantization_level": ""
        }
    },
   ..etc

ramalama/cli.py

ramalama/daemon/model_runner/command_factory.py

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

ModelRunner’s internal dicts and port tracking are accessed by multiple handler threads—consider adding thread‐safety (e.g. locks) to avoid race conditions under concurrent requests.
There’s a lot of duplicated referer extraction and path parsing logic between RamalamaHandler and ModelProxyHandler—extracting that into a shared helper would DRY up the code and reduce inconsistencies.
CommandFactory’s _set_defaults and _build_llama_serve_command methods build up exec_args via repeated dict updates—consider using a dataclass with default values or a clear merge strategy to simplify and harden argument handling.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- ModelRunner’s internal dicts and port tracking are accessed by multiple handler threads—consider adding thread‐safety (e.g. locks) to avoid race conditions under concurrent requests.
- There’s a lot of duplicated referer extraction and path parsing logic between RamalamaHandler and ModelProxyHandler—extracting that into a shared helper would DRY up the code and reduce inconsistencies.
- CommandFactory’s _set_defaults and _build_llama_serve_command methods build up exec_args via repeated dict updates—consider using a dataclass with default values or a clear merge strategy to simplify and harden argument handling.

## Individual Comments

### Comment 1
<location> `ramalama/daemon/model_runner/command_factory.py:104` </location>
<code_context>
+        if self.request_args.get("webui") == "off":
+            cmd.extend(["--no-webui"])
+
+        if check_nvidia() or check_metal(SimpleNamespace({"container": False})):
+            cmd.extend(["--flash-attn"])
+
</code_context>

<issue_to_address>
Incorrect usage of SimpleNamespace constructor.

Passing a dictionary to SimpleNamespace creates an attribute 'container' containing the dictionary, not a boolean. This may cause check_metal to behave incorrectly.
</issue_to_address>

### Comment 2
<location> `ramalama/daemon/model_runner/command_factory.py:108` </location>
<code_context>
+            cmd.extend(["--flash-attn"])
+
+        # gpu arguments
+        ngl = self.request_args.get("ngl")
+        if ngl < 0:
+            ngl = 999
+        cmd.extend(["-ngl", f"{ngl}"])
</code_context>

<issue_to_address>
Potential type issue with ngl comparison.

Since ngl may be a string, ensure it is converted to an integer before performing the comparison to avoid a TypeError.
</issue_to_address>

### Comment 3
<location> `ramalama/daemon/model_runner/command_factory.py:112` </location>
<code_context>
+        if ngl < 0:
+            ngl = 999
+        cmd.extend(["-ngl", f"{ngl}"])
+        cmd.extend(["--threads", f"{self.request_args.get("threads")}"])
+
+        return cmd
</code_context>

<issue_to_address>
Possible type inconsistency for threads argument.

If llama-server requires threads as an integer, ensure it is cast from the request_args value.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        cmd.extend(["--threads", f"{self.request_args.get("threads")}"])
=======
        threads = int(self.request_args.get("threads"))
        cmd.extend(["--threads", f"{threads}"])
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `ramalama/daemon/model_runner/runner.py:14` </location>
<code_context>
+        self.id = id
+        self.model = model
+        self.run_cmd: list[str] = run_cmd
+        self.port: str = port
+        self.process: Optional[subprocess.Popen] = None
+
</code_context>

<issue_to_address>
Port should be typed as int, not str.

Typing port as str could lead to type errors or unexpected behavior when used as an int elsewhere.
</issue_to_address>

### Comment 5
<location> `ramalama/daemon/model_runner/runner.py:71` </location>
<code_context>
+        del self._models[model_id]
+
+    def stop(self):
+        for id in self._models.keys():
+            self.stop_model(id)
</code_context>

<issue_to_address>
Modifying dictionary during iteration may cause issues.

Iterate over a list of keys instead to avoid runtime errors: for id in list(self._models.keys()).
</issue_to_address>

### Comment 6
<location> `ramalama/daemon/handler/proxy.py:132` </location>
<code_context>
+                handler.wfile.flush()
+
+                logger.debug(f"Received response from -X {method} {target_url}\nRESPONSE: ")
+        except urllib.error.HTTPError as e:
+            handler.send_response(e.code)
+            handler.end_headers()
+            raise e
+        except urllib.error.URLError as e:
+            handler.send_response(500)
</code_context>

<issue_to_address>
Raising HTTPError after sending response may cause double error handling.

Instead of re-raising the exception after sending the error response, log the error and return to avoid duplicate handling.
</issue_to_address>

### Comment 7
<location> `ramalama/daemon/logging.py:20` </location>
<code_context>
+    formatter = logging.Formatter(fmt, datefmt)
+
+    log_file_path = os.path.join(log_file, "ramalama-daemon.log")
+    handler = logging.FileHandler(log_file_path)
+    handler.setLevel(lvl)
+    handler.setFormatter(formatter)
</code_context>

<issue_to_address>
Logger may add multiple handlers on repeated configuration.

Repeated calls to configure_logger will add duplicate handlers, causing repeated log entries. Please check for existing handlers before adding new ones.
</issue_to_address>

### Comment 8
<location> `ramalama/daemon/handler/ramalama.py:28` </location>
<code_context>
+        finally:
+            self.finish()
+
+    def do_GET(self):
+        logger.debug(f"Handling GET request for path: {self.path}")
+
</code_context>

<issue_to_address>
Consider refactoring the repeated logic in the HTTP verb methods into a single dispatch helper to reduce boilerplate and improve maintainability.

You can collapse all five `do_*` methods into a single dispatch helper, then have each `do_*` just call it. That both removes the repeated referer-parsing/logging and the routing boilerplate:

```python
class RamalamaHandler(http.server.SimpleHTTPRequestHandler):
    # … __init__ stays the same …

    def _dispatch(self, verb: str):
        logger.debug(f"Handling {verb} request for path: {self.path}")
        referer = self.headers.get("Referer")
        if referer:
            logger.debug(f"Request referer: {referer}")

        # 1) API handler
        if self.path.startswith(DaemonAPIHandler.PATH_PREFIX):
            handler = DaemonAPIHandler(self.model_store_path, self.model_runner)
            getattr(handler, f"handle_{verb.lower()}")(self)
            return

        # 2) Proxy handler
        is_referred = bool(referer and f"{ModelProxyHandler.PATH_PREFIX}/sha256-" in referer)
        if self.path.startswith(ModelProxyHandler.PATH_PREFIX) or is_referred:
            handler = ModelProxyHandler(self.model_runner)
            method = getattr(handler, f"handle_{verb.lower()}")
            method(self, is_referred)

    def do_GET(self):
        return self._dispatch("GET")

    def do_HEAD(self):
        return self._dispatch("HEAD")

    def do_POST(self):
        return self._dispatch("POST")

    def do_PUT(self):
        return self._dispatch("PUT")

    def do_DELETE(self):
        return self._dispatch("DELETE")
```

This preserves exactly the same behavior but pulls out the common bits into `_dispatch()`, cutting down on boilerplate by ~80%.
</issue_to_address>

## Security Issues

### Issue 1
<location> `ramalama/daemon/model_runner/runner.py:20` </location>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

ramalama/daemon/model_runner/command_factory.py

ramalama/daemon/model_runner/runner.py

ramalama/daemon/handler/daemon.py

ramalama/daemon/handler/proxy.py

ramalama/daemon/logging.py

ramalama/daemon/model_runner/command_factory.py

rhatdan · 2025-08-27T10:34:01Z

Fixes: #598

engelmi · 2025-09-08T10:48:56Z

@rhatdan PTAL. I think it is ready to be reviewed/merged (sorry for the size of the PR).

Quick, updated summary of this PR:

The REST API of the daemon has two path prefixes: /api and /model
/api provides endpoints for REST operations, e.g. starting to serve a model and listing running ones (see curl examples in the description)
/api endpoints are compliant with ollama cli (only listing available (ollama ls) and running models (ollama ps) at the moment) - provided the ramalama port is 11434 (used by ollama cli)
/model/<org>/<name> provides the llama-server API (OpenAPI compliant)
it also adds an expiration timeout, stopping served models when no traffic was redirected there (currently ~5min)

I'd try to get this PR merged since its already quite big. In follow-up PRs we can tackle things like:

ramalama CLI integration, e.g. :
- serve, run and chat starting the daemon and triggering the serve endpoint (if necessary) and then using REST API requests for communication
- adding a nice web dashboard, e.g. in the REST API root path / (as discussed in #598)
- defining a swap config - with sensible defaults - defining the behavior for swapping models
- etc.

I can create GH issues for these to keep track of and plan them.

rhatdan · 2025-09-08T12:45:14Z

@olliewalsh @mikebonnet @bmahabirbu @ericcurtin PTAL

rhatdan · 2025-09-08T12:47:43Z

ramalama/daemon/daemon.py

+                continue
+
+            try:
+                logger.error(f"Stopping expired model '{name}'...")


Should be Warn or Debug?

Changed to info since stopping an expired model is something expected and "normal" and debug seems not enough.

ramalama/daemon/daemon.py

rhatdan · 2025-09-08T12:55:10Z

I would like to discuss with you how you see this working from a user point of view?

With the case of containers, would the ramalama daemon execute within the container?

ericcurtin · 2025-09-08T13:06:12Z

My only request would be to try and test for interoperability with Docker Model Runner... Lets say a user takes a client that implements Docker Model Runner support for the sake of argument, lets say it's something like AnythingLLM (a client could be a lot of things)... It would be nice if users could just switch the port the client points to between RamaLama/Docker Model Runner, etc. and everything "just works"... Same for OCI artifact standard, etc.

rhatdan · 2025-09-08T15:33:25Z

I agree, I think we should have good integration between RamaLama and Docker Model Runner.

Signed-off-by: Michael Engel <[email protected]>

engelmi · 2025-09-09T07:04:28Z

My only request would be to try and test for interoperability with Docker Model Runner... Lets say a user takes a client that implements Docker Model Runner support for the sake of argument, lets say it's something like AnythingLLM (a client could be a lot of things)... It would be nice if users could just switch the port the client points to between RamaLama/Docker Model Runner, etc. and everything "just works"... Same for OCI artifact standard, etc.

Thanks for the feedback! @ericcurtin Definitely! So far I haven't looked at the REST API of DMR, though. The DMR endpoints should be easy to mimic after some reverse engineering (since I can't find details such as payload, unfortunately). The provided OpenAI endpoints are not yet clear to me. For example, we currently spin up multiple llama.cpp processes (containers without the daemon), so each llama.cpp instance has its own models... probably we need to improve the llama.cpp integration to achieve this, although I don't know yet if having multiple processes is better/worse than having one big process.
Regarding the OCI artifacts: Yeah, that is definitely missing at the moment - I only mount the local model store directory into the daemon container, but don't know how it could be done with OCI artifacts - that needs to be investigated. Or we just keep the listing of available models on the host.

So far I only looked at Ollama CLI and made the two endpoints /api/ps (listing running models) and /api/ tags (listing available models) compatible with it - so running the daemon on port 11434 and using ollama ps would work. But as I wrote, adding more path aliases (and using the user-agent field to distinguish) should be a viable option.

engelmi · 2025-09-09T07:24:53Z

I would like to discuss with you how you see this working from a user point of view?

With the case of containers, would the ramalama daemon execute within the container?

Yes, the ramalama daemon starts a simple HTTP server - either directly on the host or in the ramalama container. Because of the latter, it would not be usable directly since this code is missing in the current ramalama images. This is also the reason why I didn't do the integration with the ramalma serve/run/chat command yet. I think this delayed release of change is not good, but couldn't come up with a better idea other than pushing it to "build a new image on each merge".

From the user perspective, nothing would change for these three commands - it would all be under the hood. However, we could add then additional CLI options over time leveraging this daemon, e.g. ramalama serve smollm2 --ttl=5min to define a max. idle time. Or ramalama serve smollm2 --group=xyz allowing group definitions for model swapping.

rhatdan · 2025-09-09T13:28:39Z

Ok lets merge and then we can continue to discuss and play with it.

LGTM

gemini-code-assist bot reviewed Aug 6, 2025

View reviewed changes

bmahabirbu marked this pull request as ready for review August 16, 2025 05:09

bmahabirbu requested review from rhatdan, bmahabirbu, maxamillion, swarajpande5, jhjaggars and cgruver as code owners August 16, 2025 05:09

bmahabirbu requested changes Aug 16, 2025

View reviewed changes

ramalama/cli.py Outdated Show resolved Hide resolved

ramalama/daemon/model_runner/command_factory.py Outdated Show resolved Hide resolved

sourcery-ai bot requested changes Aug 16, 2025

View reviewed changes

engelmi force-pushed the initial-model-swap-work branch 3 times, most recently from c7f9e92 to 6d52669 Compare August 23, 2025 09:34

bmahabirbu mentioned this pull request Aug 26, 2025

rag: ramalama serve --rag doesn't take into account of RAG #1163

Open

rhatdan mentioned this pull request Aug 27, 2025

Provide model info in chat ui & allow multiple models #598

Open

engelmi force-pushed the initial-model-swap-work branch 9 times, most recently from 90db157 to 5c5b81c Compare September 4, 2025 12:16

engelmi force-pushed the initial-model-swap-work branch from 5c5b81c to 09f786d Compare September 8, 2025 08:02

rhatdan reviewed Sep 8, 2025

View reviewed changes

ramalama/daemon/daemon.py Outdated Show resolved Hide resolved

engelmi added 10 commits September 9, 2025 08:38

Initial ramalama daemon for model swap

6468539

Signed-off-by: Michael Engel <[email protected]>

Initial serve and stop of models and proxying requests

d38d507

Signed-off-by: Michael Engel <[email protected]>

Improved the daemon start and run cli

61fc3fc

Signed-off-by: Michael Engel <[email protected]>

Align daemon APIs /api/tags endpoint with ollama CLI

ad43242

Signed-off-by: Michael Engel <[email protected]>

Added daemon API /api/ps and align with ollama CLI

5503027

Signed-off-by: Michael Engel <[email protected]>

Added expiration timeout and handling for models

0b60604

Signed-off-by: Michael Engel <[email protected]>

Use organization and name of model for serve path instead of id

2c493dd

Signed-off-by: Michael Engel <[email protected]>

Parameterize wait_for_healthy to enable custom health functions

71b9472

Signed-off-by: Michael Engel <[email protected]>

Added log path to llama-server in daemon

1bc1751

Signed-off-by: Michael Engel <[email protected]>

Simplify error handling for proxying requests

8317f0c

Signed-off-by: Michael Engel <[email protected]>

engelmi force-pushed the initial-model-swap-work branch from 09f786d to 8317f0c Compare September 9, 2025 06:39

rhatdan merged commit 8d44110 into containers:main Sep 9, 2025
41 of 46 checks passed

Initial model swap work #1807

Initial model swap work #1807

Uh oh!

Conversation

engelmi commented Aug 6, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Basic idea

Trying it out

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for serving a new model via Daemon API

Sequence diagram for proxying requests to a running model

File-Level Changes

Possibly linked issues

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bmahabirbu commented Aug 16, 2025

Uh oh!

bmahabirbu commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rhatdan commented Aug 27, 2025

Uh oh!

engelmi commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhatdan commented Sep 8, 2025

Uh oh!

rhatdan Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

engelmi Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rhatdan commented Sep 8, 2025

Uh oh!

ericcurtin commented Sep 8, 2025

Uh oh!

rhatdan commented Sep 8, 2025

engelmi commented Aug 6, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Aug 6, 2025 •

edited

Loading

bmahabirbu commented Aug 16, 2025 •

edited

Loading

engelmi commented Sep 8, 2025 •

edited

Loading

engelmi commented Sep 9, 2025 •

edited

Loading