5 changes: 4 additions & 1 deletion docs/ramalama-perplexity.1.md
@@ -29,8 +29,11 @@ URL support means if a model is on a web site or even on your local system, you
#### **--authfile**=*password*
path of the authentication file for OCI registries

#### **--cache-reuse**=256
Min chunk size to attempt reusing from the cache via KV shifting

#### **--ctx-size**, **-c**
size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)
Review comment (medium):

The PR description mentions that llama.cpp defaults to a context size of 4098, but the documentation here states the default is 4096. To ensure accuracy, could you please verify the current default ctx-size in llama.cpp and update the documentation accordingly? This will help avoid confusion for users relying on the default behavior.


#### **--device**
Add a host device to the container. Optional permissions parameter can
5 changes: 4 additions & 1 deletion docs/ramalama-run.1.md
@@ -33,12 +33,15 @@ The default can be overridden in the ramalama.conf file.
#### **--authfile**=*password*
path of the authentication file for OCI registries

#### **--cache-reuse**=256
Min chunk size to attempt reusing from the cache via KV shifting

#### **--color**
Indicate whether or not to use color in the chat.
Possible values are "never", "always" and "auto". (default: auto)

#### **--ctx-size**, **-c**
size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)
Review comment (medium):

The PR description mentions that llama.cpp defaults to a context size of 4098, but the documentation here states the default is 4096. To ensure accuracy, could you please verify the current default ctx-size in llama.cpp and update the documentation accordingly? This will help avoid confusion for users relying on the default behavior.


#### **--device**
Add a host device to the container. Optional permissions parameter can
7 changes: 5 additions & 2 deletions docs/ramalama-serve.1.md
@@ -57,8 +57,11 @@ The default can be overridden in the ramalama.conf file.
#### **--authfile**=*password*
Path of the authentication file for OCI registries

#### **--cache-reuse**=256
Min chunk size to attempt reusing from the cache via KV shifting

#### **--ctx-size**, **-c**
size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)
Review comment (medium):

The PR description mentions that llama.cpp defaults to a context size of 4098, but the documentation here states the default is 4096. To ensure accuracy, could you please verify the current default ctx-size in llama.cpp and update the documentation accordingly? This will help avoid confusion for users relying on the default behavior.


#### **--detach**, **-d**
Run the container in the background and print the new container ID.
@@ -426,7 +429,7 @@ spec:
- name: model-server
image: quay.io/ramalama/ramalama:0.8
command: ["llama-server"]
args: ['--port', '8081', '--model', '/mnt/models/model.file', '--alias', 'quay.io/rhatdan/granite:latest', '--ctx-size', 2048, '--temp', '0.8', '--jinja', '--cache-reuse', '256', '-v', '--threads', 16, '--host', '127.0.0.1']
args: ['--port', '8081', '--model', '/mnt/models/model.file', '--alias', 'quay.io/rhatdan/granite:latest', '--temp', '0.8', '--jinja', '--cache-reuse', '256', '-v', '--threads', 16, '--host', '127.0.0.1']
securityContext:
allowPrivilegeEscalation: false
capabilities:
7 changes: 5 additions & 2 deletions docs/ramalama.conf
@@ -31,10 +31,13 @@
#
#container = true

#size of the prompt context (0 = loaded from model)
#Min chunk size to attempt reusing from the cache via KV shifting
#
#ctx_size=2048
#cache_reuse=256

#size of the prompt context (0 = loaded from model)
#
#ctx_size=0

# Run RamaLama using the specified container engine.
#
6 changes: 5 additions & 1 deletion docs/ramalama.conf.5.md
@@ -71,12 +71,16 @@ OCI model car image

Image to be used when building and pushing --type=car models

**cache_reuse**=256

Min chunk size to attempt reusing from the cache via KV shifting

**container**=true

Run RamaLama in the default container.
RAMALAMA_IN_CONTAINER environment variable overrides this field.

**ctx_size**=2048
**ctx_size**=0

Size of the prompt context (0 = loaded from model)

10 changes: 10 additions & 0 deletions ramalama/cli.py
@@ -776,17 +776,27 @@ def runtime_options(parser, command):
)
parser.add_argument("--authfile", help="path of the authentication file")
if command in ["run", "perplexity", "serve"]:
parser.add_argument(
"--cache-reuse",
dest="cache_reuse",
type=int,
default=CONFIG.cache_reuse,
help="min chunk size to attempt reusing from the cache via KV shifting",
suggestion: Consider clarifying the unit for --cache-reuse in the help text.

Specifying the unit in the help text will make it clearer for users and prevent misunderstandings.

Suggested change
help="min chunk size to attempt reusing from the cache via KV shifting",
help="min chunk size (in bytes) to attempt reusing from the cache via KV shifting",

completer=suppressCompleter,
)
parser.add_argument(
"-c",
"--ctx-size",
dest="context",
type=int,
default=CONFIG.ctx_size,
help="size of the prompt context (0 = loaded from model)",
completer=suppressCompleter,
)
parser.add_argument(
"--max-model-len",
dest="context",
type=int,
default=CONFIG.ctx_size,
help=argparse.SUPPRESS,
completer=suppressCompleter,
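Both `--ctx-size` and `--max-model-len` write to `dest="context"`, so either spelling populates the same attribute. The standalone sketch below illustrates that alias behavior and one possible way to reject negative values for `--cache-reuse`, in the spirit of the review comment on `ramalama/model.py` further down; the `non_negative_int` helper is hypothetical and not part of RamaLama's code.

```python
import argparse


def non_negative_int(value: str) -> int:
    """Hypothetical validator: reject negative sizes for --cache-reuse / --ctx-size."""
    parsed = int(value)
    if parsed < 0:
        raise argparse.ArgumentTypeError(f"expected a non-negative integer, got {value!r}")
    return parsed


parser = argparse.ArgumentParser(prog="ramalama-options-sketch")
parser.add_argument("--cache-reuse", dest="cache_reuse", type=non_negative_int, default=256)
# Both spellings share dest="context", so whichever appears last on the command line wins.
parser.add_argument("-c", "--ctx-size", dest="context", type=non_negative_int, default=0)
parser.add_argument("--max-model-len", dest="context", type=non_negative_int, default=0)

args = parser.parse_args(["--cache-reuse", "512", "--max-model-len", "8192"])
assert args.cache_reuse == 512 and args.context == 8192
```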
3 changes: 2 additions & 1 deletion ramalama/config.py
@@ -69,7 +69,8 @@ class BaseConfig:
api: str = "none"
carimage: str = "registry.access.redhat.com/ubi10-micro:latest"
container: bool = None # type: ignore
ctx_size: int = 2048
ctx_size: int = 0
cache_reuse: int = 256
default_image: str = DEFAULT_IMAGE
dryrun: bool = False
engine: SUPPORTED_ENGINES | None = field(default_factory=get_default_engine)
7 changes: 4 additions & 3 deletions ramalama/model.py
@@ -657,13 +657,14 @@ def llama_serve(self, args):
exec_args += [
"--alias",
self.model,
"--ctx-size",
f"{args.context}",
"--temp",
f"{args.temp}",
"--cache-reuse",
"256",
Comment on lines 660 to 665:
suggestion: Switching --cache-reuse from a hardcoded value to a parameter increases flexibility but may require validation.

Please add validation for the --cache-reuse parameter to prevent invalid values and potential performance issues.

f"{args.cache_reuse}",
]
if args.context > 0:
exec_args += ["--ctx-size", f"{args.context}"]

exec_args += args.runtime_args

if draft_model_path:
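The net effect of this hunk: `--cache-reuse` is now taken from the parsed arguments, and `--ctx-size` is forwarded to llama-server only when a positive value is set, so the new default of 0 lets llama.cpp apply its own context-size handling. A minimal sketch of that assembly logic in isolation, assuming the argument names from the diff (this is not the full `llama_serve` method, and it omits `--alias` and the other flags):

```python
def runtime_flags(context: int, temp: str, cache_reuse: int) -> list[str]:
    """Sketch of the conditional flag assembly shown in the diff above."""
    flags = ["--temp", f"{temp}", "--cache-reuse", f"{cache_reuse}"]
    if context > 0:  # 0 means "let llama.cpp decide / load from the model"
        flags += ["--ctx-size", f"{context}"]
    return flags


# With the new defaults (ctx_size=0, cache_reuse=256) no --ctx-size flag is emitted:
assert runtime_flags(0, "0.8", 256) == ["--temp", "0.8", "--cache-reuse", "256"]
assert runtime_flags(4096, "0.8", 512)[-2:] == ["--ctx-size", "4096"]
```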
10 changes: 6 additions & 4 deletions test/system/030-run.bats
@@ -20,7 +20,8 @@ EOF
run_ramalama -q --dryrun run ${MODEL}
is "$output" "${verify_begin}.*"
is "$output" ".*${MODEL}" "verify model name"
is "$output" ".*--ctx-size 2048" "verify model name"
is "$output" ".*--cache-reuse 256" "verify cache-reuse is being set"
assert "$output" !~ ".*--ctx-size" "assert ctx-size is not show by default"
assert "$output" !~ ".*--seed" "assert seed does not show by default"
assert "$output" !~ ".*-t -i" "assert -t -i not present without tty"

@@ -38,10 +39,11 @@ EOF
run_ramalama -q --dryrun run --oci-runtime foobar ${MODEL}
is "$output" ".*--runtime foobar" "dryrun correct with --oci-runtime"

RAMALAMA_CONFIG=/dev/null run_ramalama -q --dryrun run --seed 9876 -c 4096 --net bridge --name foobar ${MODEL}
RAMALAMA_CONFIG=/dev/null run_ramalama -q --dryrun run --cache-reuse 512 --seed 9876 -c 4096 --net bridge --name foobar ${MODEL}
is "$output" ".*--network bridge.*" "dryrun correct with --name"
is "$output" ".*${MODEL}" "verify model name"
is "$output" ".*--ctx-size 4096" "verify ctx-size is set"
is "$output" ".*--cache-reuse 512" "verify cache-reuse is being set"
is "$output" ".*--temp 0.8" "verify temp is set"
is "$output" ".*--seed 9876" "verify seed is set"
if not_docker; then
@@ -90,8 +92,8 @@ EOF

else
run_ramalama -q --dryrun run --ctx-size 4096 ${MODEL}
is "$output" '.*serve.*--ctx-size 4096 --temp 0.8.*' "dryrun correct"
is "$output" ".*--ctx-size 4096" "verify model name"
is "$output" '.*--ctx-size 4096.*' "verify ctx-size is set"
is "$output" '.*--cache-reuse 256.*' "assert cache-reuse is set by default to 256"

run_ramalama 22 run --ctx-size=4096 --name foobar ${MODEL}
is "${lines[0]}" "Error: --nocontainer and --name options conflict. The --name option requires a container." "conflict between nocontainer and --name line"
3 changes: 2 additions & 1 deletion test/unit/test_config.py
@@ -12,7 +12,8 @@ def test_correct_config_defaults(monkeypatch):

assert cfg.carimage == "registry.access.redhat.com/ubi10-micro:latest"
assert cfg.container in [True, False] # depends on env/system
assert cfg.ctx_size == 2048
assert cfg.ctx_size == 0
assert cfg.cache_reuse == 256
assert cfg.engine in ["podman", "docker", None]
assert cfg.env == []
assert cfg.host == "0.0.0.0"