
Conversation

engelmi
Member

@engelmi engelmi commented Aug 6, 2025

This PR adds initial work on the model swap feature.

Basic idea

The core idea is roughly depicted in the following diagram:

[architecture diagram]

The new daemon component is essentially a server providing two different REST APIs:

  • Daemon API to control the models running in the container
  • Proxy API to forward requests to the running models inside the container

The command ramalama daemon setup starts the container with the daemon running inside, ready to handle requests. Currently, the following Daemon API endpoints are available:

  • GET /api/tags - lists all models available in the mounted model store (excluding oci ones)
  • GET /api/model - lists all running llama-server processes (i.e. the running models) with additional information
  • POST /api/serve - start a new llama-server process with the passed parameters (and defaults from ramalama)
  • POST /api/stop - stop a llama-server process

Trying it out

At the moment, the daemon package is missing from the latest ramalama container images. Therefore, one needs to "rebuild" a custom container image, for example:

FROM quay.io/ramalama/rocm:latest

COPY dist/ /
RUN pip install --force-reinstall /ramalama-0.11.3-py3-none-any.whl

And then build the wheel package and container image:

# in ramalama root directory
$ make pypi-build
$ podman build -f custom-rocm -t quay.io/ramalama/rocm:latest .

When done, one can start the container via

./bin/ramalama daemon setup

and issue requests, e.g. via curl:

# list all available models
$ curl -X GET http://localhost:8080/api/tags
[
    {
        "name": "hf://mlx-community/Llama-3.2-1B-Instruct-4bit",
        "modified_at": "2025-07-24T14:32:16.156628+00:00",
        "size": 712575975,
        "digest": "",
        "details": {
            "format": "",
            "family": "",
            "families": [],
            "parameter_size": "",
            "quantization_level": ""
        }
    },
...
# serve smollm2:135m
curl  -X POST http://localhost:8080/api/serve -d '{"model_name":"smollm2:135m", "runtime":"llama.cpp", "exec_args": {"thinking": "true", "context": 2048, "temp": 0.8}}' -H "Content-Type: application/json"
{
    "model_id": "sha256-8cc4a7f5d0b22d87f9ca41680d72e87d62236588951180cdc479dbfa1c22a184",
    "serve_path": "/model/sha256-8cc4a7f5d0b22d87f9ca41680d72e87d62236588951180cdc479dbfa1c22a184"
}
# stop smollm2:135m
curl  -X POST http://localhost:8080/api/stop -d '{"model_name":"smollm2:135m"}' -H "Content-Type: application/json"
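
For programmatic use, the same requests can be issued from Python. A minimal client sketch, assuming the daemon listens on localhost:8080 as in the curl examples above (it only uses the endpoints and payload shown there):

```python
import json
import urllib.request

BASE = "http://localhost:8080"

def list_models():
    # GET /api/tags - models available in the mounted model store
    with urllib.request.urlopen(f"{BASE}/api/tags") as resp:
        return json.load(resp)

def serve_model(model_name: str, context: int = 2048, temp: float = 0.8):
    # POST /api/serve - ask the daemon to spawn a llama-server for the model
    payload = json.dumps({
        "model_name": model_name,
        "runtime": "llama.cpp",
        "exec_args": {"thinking": "true", "context": context, "temp": temp},
    }).encode()
    req = urllib.request.Request(
        f"{BASE}/api/serve",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # contains model_id and serve_path

if __name__ == "__main__":
    print(list_models())
    print(serve_model("smollm2:135m"))
```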

Navigating to /model in the browser (or again via curl) will list all running models:
[screenshot: running models listed under /model]

Using the serve path returned by the /api/serve call makes the llama-server web UI available. Navigate to the web UI:
[screenshot: llama-server web UI]
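
Since each running model is a regular llama-server instance behind the proxy, its OpenAI-compatible endpoints should also be reachable under the serve path. A hedged sketch, assuming /v1/chat/completions is forwarded unchanged (the serve path is the value returned by /api/serve above):

```python
import json
import urllib.request

BASE = "http://localhost:8080"
# serve_path as returned by POST /api/serve in the example above
SERVE_PATH = "/model/sha256-8cc4a7f5d0b22d87f9ca41680d72e87d62236588951180cdc479dbfa1c22a184"

payload = json.dumps({
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
}).encode()
req = urllib.request.Request(
    f"{BASE}{SERVE_PATH}/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```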

Summary by Sourcery

Introduce a containerized Ramalama daemon to dynamically manage and proxy multiple model serving processes via REST APIs.

New Features:

  • Add ramalama daemon CLI with setup and run commands to launch a persistent model-serving daemon
  • Implement an HTTP server providing /api endpoints for listing available models, starting new model processes, and stopping them
  • Add a proxy handler under /model to list running models and forward requests to individual llama-server instances
  • Develop a ModelRunner framework for spawning, tracking, and terminating managed model processes with dynamic port allocation
  • Introduce CommandFactory and DTO classes to build serve commands and serialize/deserialize serve/stop request and response payloads
  • Provide graceful shutdown handling and file-based logging for the daemon

Contributor

sourcery-ai bot commented Aug 6, 2025

Reviewer's Guide

Introduce a containerized daemon service with new CLI commands and a Python HTTP server exposing /api and /model endpoints to catalog, launch, stop, and proxy llama-server instances, using a ModelRunner and CommandFactory for process orchestration and structured logging for lifecycle management.

Sequence diagram for serving a new model via Daemon API

sequenceDiagram
    actor User
    participant CLI
    participant DaemonAPI
    participant ModelRunner
    participant CommandFactory
    participant LlamaServer

    User->>CLI: POST /api/serve (model_name, runtime, exec_args)
    CLI->>DaemonAPI: POST /api/serve
    DaemonAPI->>ModelRunner: next_available_port()
    DaemonAPI->>CommandFactory: build(model, runtime, port, exec_args)
    CommandFactory-->>DaemonAPI: serve command
    DaemonAPI->>ModelRunner: add_model(managed_model)
    DaemonAPI->>ModelRunner: start_model(model_id)
    ModelRunner->>LlamaServer: start process
    DaemonAPI-->>CLI: ServeResponse (model_id, serve_path)

Sequence diagram for proxying requests to a running model

sequenceDiagram
    actor User
    participant ProxyAPI
    participant ModelRunner
    participant LlamaServer

    User->>ProxyAPI: GET/POST /model/{model_id}/...
    ProxyAPI->>ModelRunner: lookup managed_model by model_id
    ProxyAPI->>LlamaServer: forward request to correct port
    LlamaServer-->>ProxyAPI: response
    ProxyAPI-->>User: response

File-Level Changes

CLI integration for daemon orchestration (ramalama/cli.py)
  • Add daemon_parser in CLI subcommands
  • Implement daemon_setup_cli to launch the container via podman run
  • Implement daemon_run_cli to invoke the internal daemon entrypoint
Daemon entrypoint and server lifecycle (ramalama/daemon/daemon.py)
  • Create RamalamaServer with threading support
  • Implement ShutdownHandler for graceful SIGINT/SIGTERM shutdown
  • Provide parse_args and run functions to start the HTTP server
Central request routing in HTTP handler (ramalama/daemon/handler/ramalama.py)
  • Implement RamalamaHandler to dispatch /api to the API handler
  • Route /model (and referer-based) requests to the proxy handler
Daemon API to manage models (ramalama/daemon/handler/daemon.py)
  • Implement GET /api/tags to list model store contents
  • Implement POST /api/serve to parse ServeRequest, build and launch a model process
  • Implement POST /api/stop to stop a running model by ID
Proxy API to forward inference/UI requests (ramalama/daemon/handler/proxy.py)
  • Add GET listing of running models under /model
  • Forward HEAD/GET/POST/PUT/DELETE to the correct model server based on path or Referer
  • Handle hop-by-hop headers and errors during forwarding
Process orchestration utilities and DTOs (ramalama/daemon/model_runner/command_factory.py, ramalama/daemon/model_runner/runner.py, ramalama/daemon/dto/serve.py, ramalama/daemon/dto/model.py, ramalama/daemon/dto/proxy.py, ramalama/daemon/dto/errors.py, ramalama/daemon/logging.py)
  • CommandFactory builds the llama-server invocation with defaults and runtime-specific args
  • ModelRunner tracks ports, generates unique model IDs, and manages subprocess lifecycle
  • Define ServeRequest/ServeResponse, StopServeRequest, ModelResponse, RunningModelResponse DTOs
  • Configure structured file-based logging for the daemon

Possibly linked issues

  • #0: The PR introduces a daemon with an API to serve models, fulfilling the issue's requirement for RamaLama images to be compatible with Podman AI Lab's model serving needs.
  • Create Nvidia Branch for testing and development #239: The PR introduces a daemon and APIs to manage and serve multiple models concurrently on a server, directly addressing the issue's goal of multi-model server support.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @engelmi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how models are served and managed in ramalama by introducing a new daemon component. This daemon provides a robust, centralized, and API-driven system for orchestrating AI model serving processes, moving towards a more dynamic and scalable model management architecture.

Highlights

  • New Daemon Component: This pull request introduces a new daemon component to the ramalama project. This daemon is designed to run as a server, providing a centralized and dynamic way to manage and serve AI models within a containerized environment.
  • Enhanced CLI Commands: New command-line interface (CLI) commands, specifically ramalama daemon setup and ramalama daemon run, have been added. These commands facilitate the setup and execution of the new daemon, including container orchestration via podman.
  • RESTful Daemon API: A new RESTful API is exposed by the daemon, offering endpoints for core operations. This includes listing available models (/api/tags), initiating new model serving processes (/api/serve), and stopping currently running models (/api/stop).
  • Model Proxying Capability: The daemon now includes a proxy API, accessible via the /model path. This proxy intelligently forwards incoming requests to the appropriate dynamically started llama-server instances, enabling seamless interaction with served models.
  • Dynamic Model Lifecycle Management: The daemon is capable of dynamically starting and stopping llama-server processes for various models. It handles the allocation of unique network ports for each served model and manages their complete lifecycle, from initiation to graceful shutdown.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new daemon component for managing and proxying AI models, which is a significant and valuable feature. The overall architecture with a daemon, API handlers, and a model runner is well-structured.

However, the initial implementation has several critical issues that need to be addressed before merging. These include potential crashes due to a NameError and unsafe dictionary modifications, and incorrect logic that could lead to arguments being dropped. There are also several high-severity issues related to hardcoded values that break the application's configurability, and incorrect network client implementation.

I've provided detailed comments and suggestions to fix these issues. Addressing them will significantly improve the stability, correctness, and maintainability of the new daemon feature.

@bmahabirbu
Collaborator

Nice work @engelmi!!!

I'm trying to test this on my nvidia machine but running into this

brian@DESKTOP-SB69448:~/ramalama$ ./bin/ramalama --debug daemon setup
2025-08-16 00:24:54 - DEBUG - exec_cmd: podman run --pull never -i -t -d -p 8080:8080 -v /home/brian/.local/share/ramalama:/ramalama/models quay.io/ramalama/rocm:latest ramalama daemon run --store /ramalama/models
Error: quay.io/ramalama/rocm:latest: image not known
brian@DESKTOP-SB69448:~/ramalama$ 

Would it be possible to add an --image option to this command?

@bmahabirbu
Collaborator

bmahabirbu commented Aug 16, 2025

After fixing that error and resolving the image I got it to work!!

brian@DESKTOP-SB69448:~/ramalama$ curl -X GET http://localhost:8080/api/tags
[
    {
        "name": "ollama://llama3.2/llama3.2:latest",
        "modified_at": "2025-07-25T00:19:06.530056+00:00",
        "size": 2019379366,
        "digest": "",
        "details": {
            "format": "",
            "family": "",
            "families": [],
            "parameter_size": "",
            "quantization_level": ""
        }
    },
    {
        "name": "ollama://tinyllama/tinyllama:latest",
        "modified_at": "2025-05-14T02:46:44.949334+00:00",
        "size": 637700077,
        "digest": "",
        "details": {
            "format": "",
            "family": "",
            "families": [],
            "parameter_size": "",
            "quantization_level": ""
        }
    },
   ..etc

@bmahabirbu bmahabirbu marked this pull request as ready for review August 16, 2025 05:09
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'. (link)

General comments:

  • ModelRunner’s internal dicts and port tracking are accessed by multiple handler threads—consider adding thread‐safety (e.g. locks) to avoid race conditions under concurrent requests.
  • There’s a lot of duplicated referer extraction and path parsing logic between RamalamaHandler and ModelProxyHandler—extracting that into a shared helper would DRY up the code and reduce inconsistencies.
  • CommandFactory’s _set_defaults and _build_llama_serve_command methods build up exec_args via repeated dict updates—consider using a dataclass with default values or a clear merge strategy to simplify and harden argument handling.
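
On the thread-safety point, a minimal sketch of the kind of locking that would cover the shared state (the class shape and everything beyond the _models dict are assumptions; the real ModelRunner has more responsibilities):

```python
import threading

class ModelRunner:
    """Sketch only: guard the shared model dict with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._models: dict = {}

    def add_model(self, model_id: str, managed_model) -> None:
        with self._lock:
            self._models[model_id] = managed_model

    def stop_model(self, model_id: str) -> None:
        # Pop under the lock, terminate outside it so a slow shutdown
        # does not block other handler threads.
        with self._lock:
            model = self._models.pop(model_id, None)
        if model is not None and model.process is not None:
            model.process.terminate()

    def stop(self) -> None:
        with self._lock:
            model_ids = list(self._models.keys())
        for model_id in model_ids:
            self.stop_model(model_id)
```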
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- ModelRunner’s internal dicts and port tracking are accessed by multiple handler threads—consider adding thread‐safety (e.g. locks) to avoid race conditions under concurrent requests.
- There’s a lot of duplicated referer extraction and path parsing logic between RamalamaHandler and ModelProxyHandler—extracting that into a shared helper would DRY up the code and reduce inconsistencies.
- CommandFactory’s _set_defaults and _build_llama_serve_command methods build up exec_args via repeated dict updates—consider using a dataclass with default values or a clear merge strategy to simplify and harden argument handling.

## Individual Comments

### Comment 1
<location> `ramalama/daemon/model_runner/command_factory.py:104` </location>
<code_context>
+        if self.request_args.get("webui") == "off":
+            cmd.extend(["--no-webui"])
+
+        if check_nvidia() or check_metal(SimpleNamespace({"container": False})):
+            cmd.extend(["--flash-attn"])
+
</code_context>

<issue_to_address>
Incorrect usage of SimpleNamespace constructor.

Passing a dictionary positionally to SimpleNamespace does not reliably set a boolean container attribute (Python versions before 3.13 reject positional arguments entirely), so check_metal may behave incorrectly or the call may fail at runtime.
</issue_to_address>
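
For reference, keyword arguments (or dict unpacking) set real attributes on every Python version; the positional form is rejected with a TypeError before Python 3.13 and only accepted as a mapping from 3.13 on. A minimal illustration:

```python
from types import SimpleNamespace

# Keyword arguments set a real boolean attribute regardless of Python version.
ns = SimpleNamespace(container=False)
assert ns.container is False

# Equivalent when the values already live in a dict: unpack it.
args = {"container": False}
ns = SimpleNamespace(**args)
assert ns.container is False
```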

### Comment 2
<location> `ramalama/daemon/model_runner/command_factory.py:108` </location>
<code_context>
+            cmd.extend(["--flash-attn"])
+
+        # gpu arguments
+        ngl = self.request_args.get("ngl")
+        if ngl < 0:
+            ngl = 999
+        cmd.extend(["-ngl", f"{ngl}"])
</code_context>

<issue_to_address>
Potential type issue with ngl comparison.

Since ngl may be a string, ensure it is converted to an integer before performing the comparison to avoid a TypeError.
</issue_to_address>
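
A small helper along the lines of the suggestion (the name normalized_ngl and the fallback handling are hypothetical, mirroring the existing ngl = 999 default):

```python
def normalized_ngl(request_args: dict) -> int:
    # Coerce the request value to int; missing or invalid values fall back to -1,
    # which (like any negative value) means "offload all layers".
    try:
        ngl = int(request_args.get("ngl", -1))
    except (TypeError, ValueError):
        ngl = -1
    return 999 if ngl < 0 else ngl

# e.g. cmd.extend(["-ngl", str(normalized_ngl(self.request_args))])
```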

### Comment 3
<location> `ramalama/daemon/model_runner/command_factory.py:112` </location>
<code_context>
+        if ngl < 0:
+            ngl = 999
+        cmd.extend(["-ngl", f"{ngl}"])
+        cmd.extend(["--threads", f"{self.request_args.get("threads")}"])
+
+        return cmd
</code_context>

<issue_to_address>
Possible type inconsistency for threads argument.

If llama-server requires threads as an integer, ensure it is cast from the request_args value.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        cmd.extend(["--threads", f"{self.request_args.get("threads")}"])
=======
        threads = int(self.request_args.get("threads"))
        cmd.extend(["--threads", f"{threads}"])
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `ramalama/daemon/model_runner/runner.py:14` </location>
<code_context>
+        self.id = id
+        self.model = model
+        self.run_cmd: list[str] = run_cmd
+        self.port: str = port
+        self.process: Optional[subprocess.Popen] = None
+
</code_context>

<issue_to_address>
Port should be typed as int, not str.

Typing port as str could lead to type errors or unexpected behavior when used as an int elsewhere.
</issue_to_address>

### Comment 5
<location> `ramalama/daemon/model_runner/runner.py:71` </location>
<code_context>
+        del self._models[model_id]
+
+    def stop(self):
+        for id in self._models.keys():
+            self.stop_model(id)
</code_context>

<issue_to_address>
Modifying dictionary during iteration may cause issues.

Iterate over a list of keys instead to avoid runtime errors: for id in list(self._models.keys()).
</issue_to_address>
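
A standalone demonstration of the fix, snapshotting the keys so entries can be deleted while looping:

```python
# Iterating over a copy of the keys avoids "dictionary changed size during iteration".
models = {"sha256-aaa": object(), "sha256-bbb": object()}

for model_id in list(models.keys()):
    del models[model_id]  # safe: the loop iterates over the snapshot, not the dict

assert not models
```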

### Comment 6
<location> `ramalama/daemon/handler/proxy.py:132` </location>
<code_context>
+                handler.wfile.flush()
+
+                logger.debug(f"Received response from -X {method} {target_url}\nRESPONSE: ")
+        except urllib.error.HTTPError as e:
+            handler.send_response(e.code)
+            handler.end_headers()
+            raise e
+        except urllib.error.URLError as e:
+            handler.send_response(500)
</code_context>

<issue_to_address>
Raising HTTPError after sending response may cause double error handling.

Instead of re-raising the exception after sending the error response, log the error and return to avoid duplicate handling.
</issue_to_address>
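
A sketch of the suggested handling, logging and returning instead of re-raising (the forward helper, its parameters, and the 502 fallback are assumptions about the surrounding code):

```python
import logging
import urllib.error
import urllib.request

logger = logging.getLogger("ramalama-daemon")

def forward(handler, target_url: str) -> None:
    """Forward a request and report upstream errors exactly once."""
    try:
        with urllib.request.urlopen(target_url) as resp:
            handler.send_response(resp.status)
            handler.end_headers()
            handler.wfile.write(resp.read())
    except urllib.error.HTTPError as e:
        # Propagate the upstream status to the client, log it, and stop here.
        handler.send_response(e.code)
        handler.end_headers()
        logger.debug("Upstream returned HTTP %s for %s", e.code, target_url)
    except urllib.error.URLError as e:
        handler.send_response(502)
        handler.end_headers()
        logger.error("Could not reach %s: %s", target_url, e.reason)
```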

### Comment 7
<location> `ramalama/daemon/logging.py:20` </location>
<code_context>
+    formatter = logging.Formatter(fmt, datefmt)
+
+    log_file_path = os.path.join(log_file, "ramalama-daemon.log")
+    handler = logging.FileHandler(log_file_path)
+    handler.setLevel(lvl)
+    handler.setFormatter(formatter)
</code_context>

<issue_to_address>
Logger may add multiple handlers on repeated configuration.

Repeated calls to configure_logger will add duplicate handlers, causing repeated log entries. Please check for existing handlers before adding new ones.
</issue_to_address>
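
A sketch of the guard (the logger name is an assumption; the log file name follows the quoted code):

```python
import logging
import os

def configure_logger(log_dir: str, lvl: int = logging.DEBUG) -> logging.Logger:
    logger = logging.getLogger("ramalama-daemon")
    if logger.handlers:
        return logger  # already configured: skip adding a duplicate handler
    handler = logging.FileHandler(os.path.join(log_dir, "ramalama-daemon.log"))
    handler.setLevel(lvl)
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(lvl)
    return logger
```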

### Comment 8
<location> `ramalama/daemon/handler/ramalama.py:28` </location>
<code_context>
+        finally:
+            self.finish()
+
+    def do_GET(self):
+        logger.debug(f"Handling GET request for path: {self.path}")
+
</code_context>

<issue_to_address>
Consider refactoring the repeated logic in the HTTP verb methods into a single dispatch helper to reduce boilerplate and improve maintainability.

You can collapse all five `do_*` methods into a single dispatch helper, then have each `do_*` just call it. That both removes the repeated referer-parsing/logging and the routing boilerplate:

```python
class RamalamaHandler(http.server.SimpleHTTPRequestHandler):
    # … __init__ stays the same …

    def _dispatch(self, verb: str):
        logger.debug(f"Handling {verb} request for path: {self.path}")
        referer = self.headers.get("Referer")
        if referer:
            logger.debug(f"Request referer: {referer}")

        # 1) API handler
        if self.path.startswith(DaemonAPIHandler.PATH_PREFIX):
            handler = DaemonAPIHandler(self.model_store_path, self.model_runner)
            getattr(handler, f"handle_{verb.lower()}")(self)
            return

        # 2) Proxy handler
        is_referred = bool(referer and f"{ModelProxyHandler.PATH_PREFIX}/sha256-" in referer)
        if self.path.startswith(ModelProxyHandler.PATH_PREFIX) or is_referred:
            handler = ModelProxyHandler(self.model_runner)
            method = getattr(handler, f"handle_{verb.lower()}")
            method(self, is_referred)

    def do_GET(self):
        return self._dispatch("GET")

    def do_HEAD(self):
        return self._dispatch("HEAD")

    def do_POST(self):
        return self._dispatch("POST")

    def do_PUT(self):
        return self._dispatch("PUT")

    def do_DELETE(self):
        return self._dispatch("DELETE")
```

This preserves exactly the same behavior but pulls out the common bits into `_dispatch()`, cutting down on boilerplate by ~80%.
</issue_to_address>

## Security Issues

### Issue 1
<location> `ramalama/daemon/model_runner/runner.py:20` </location>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'.

*Source: opengrep*
</issue_to_address>


@engelmi engelmi force-pushed the initial-model-swap-work branch 3 times, most recently from c7f9e92 to 6d52669 on August 23, 2025 09:34
@rhatdan
Member

rhatdan commented Aug 27, 2025

Fixes: #598

@engelmi engelmi force-pushed the initial-model-swap-work branch 9 times, most recently from 90db157 to 5c5b81c on September 4, 2025 12:16
@engelmi engelmi force-pushed the initial-model-swap-work branch from 5c5b81c to 09f786d on September 8, 2025 08:02
@engelmi
Member Author

engelmi commented Sep 8, 2025

@rhatdan PTAL. I think it is ready to be reviewed/merged (sorry for the size of the PR).

Quick, updated summary of this PR:

  • The REST API of the daemon has two path prefixes: /api and /model
  • /api provides endpoints for REST operations, e.g. starting to serve a model and listing running ones (see curl examples in the description)
  • /api endpoints are compatible with the ollama CLI (only listing available models (ollama ls) and running models (ollama ps) at the moment), provided the ramalama port is 11434 (the port used by the ollama CLI)
  • /model/<org>/<name> provides the llama-server API (OpenAPI compliant)
  • it also adds an expiration timeout, stopping served models when no traffic has been routed to them for a while (currently ~5 min); a rough sketch of such a check follows this list
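
A rough, hypothetical sketch of such an idle-expiration check (the actual implementation in this PR is not shown here; names and the 5-minute default are assumptions):

```python
import time
from typing import Dict, List, Optional

EXPIRATION_SECONDS = 5 * 60

def expired_models(last_used: Dict[str, float], now: Optional[float] = None) -> List[str]:
    """Return ids of models whose last proxied request is older than the timeout."""
    now = time.monotonic() if now is None else now
    return [
        model_id
        for model_id, timestamp in last_used.items()
        if now - timestamp > EXPIRATION_SECONDS
    ]
```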

I'd like to try to get this PR merged since it's already quite big. In follow-up PRs we can tackle things like:

  • ramalama CLI integration, e.g. :
    • serve, run and chat starting the daemon and triggering the serve endpoint (if necessary) and then using REST API requests for communication
    • adding a nice web dashboard, e.g. in the REST API root path / (as discussed in #598)
    • defining a swap config - with sensible defaults - defining the behavior for swapping models
    • etc.

I can create GH issues for these to keep track of and plan them.

@rhatdan
Member

rhatdan commented Sep 8, 2025

continue

try:
logger.error(f"Stopping expired model '{name}'...")
Member


Should be Warn or Debug?

Member Author


Changed to info since stopping an expired model is something expected and "normal" and debug seems not enough.

@rhatdan
Member

rhatdan commented Sep 8, 2025

I would like to discuss with you how you see this working from a user point of view?

With the case of containers, would the ramalama daemon execute within the container?

@ericcurtin
Member

My only request would be to try and test for interoperability with Docker Model Runner... Let's say a user takes a client that implements Docker Model Runner support, for the sake of argument let's say it's something like AnythingLLM (a client could be a lot of things)... It would be nice if users could just switch the port the client points to between RamaLama/Docker Model Runner, etc. and everything "just works"... Same for the OCI artifact standard, etc.

@rhatdan
Member

rhatdan commented Sep 8, 2025

I agree, I think we should have good integration between RamaLama and Docker Model Runner.

@engelmi engelmi force-pushed the initial-model-swap-work branch from 09f786d to 8317f0c on September 9, 2025 06:39
@engelmi
Member Author

engelmi commented Sep 9, 2025

My only request would be to try and test for interoperability with Docker Model Runner... Let's say a user takes a client that implements Docker Model Runner support, for the sake of argument let's say it's something like AnythingLLM (a client could be a lot of things)... It would be nice if users could just switch the port the client points to between RamaLama/Docker Model Runner, etc. and everything "just works"... Same for the OCI artifact standard, etc.

Thanks for the feedback! @ericcurtin Definitely! So far I haven't looked at the REST API of DMR, though. The DMR endpoints should be easy to mimic after some reverse engineering (since I can't find details such as payload, unfortunately). The provided OpenAI endpoints are not yet clear to me. For example, we currently spin up multiple llama.cpp processes (containers without the daemon), so each llama.cpp instance has its own models... probably we need to improve the llama.cpp integration to achieve this, although I don't know yet if having multiple processes is better/worse than having one big process.
Regarding the OCI artifacts: Yeah, that is definitely missing at the moment - I only mount the local model store directory into the daemon container, but don't know how it could be done with OCI artifacts - that needs to be investigated. Or we just keep the listing of available models on the host.

So far I have only looked at the Ollama CLI and made the two endpoints /api/ps (listing running models) and /api/tags (listing available models) compatible with it - so running the daemon on port 11434 and using ollama ps would work. But as I wrote, adding more path aliases (and using the user-agent field to distinguish) should be a viable option.

@engelmi
Member Author

engelmi commented Sep 9, 2025

I would like to discuss with you how you see this working from a user point of view?

With the case of containers, would the ramalama daemon execute within the container?

Yes, the ramalama daemon starts a simple HTTP server - either directly on the host or in the ramalama container. Because of the latter, it would not be usable directly since this code is missing in the current ramalama images. This is also the reason why I didn't do the integration with the ramalama serve/run/chat commands yet. I think this delayed rollout of the change is not good, but I couldn't come up with a better idea other than pushing it to "build a new image on each merge".

From the user perspective, nothing would change for these three commands - it would all be under the hood. However, we could then add additional CLI options over time leveraging this daemon, e.g. ramalama serve smollm2 --ttl=5min to define a max idle time, or ramalama serve smollm2 --group=xyz allowing group definitions for model swapping.

@rhatdan
Member

rhatdan commented Sep 9, 2025

Ok, let's merge and then we can continue to discuss and play with it.

LGTM

@rhatdan rhatdan merged commit 8d44110 into containers:main Sep 9, 2025
41 of 46 checks passed