Commit 3247632
Do not set the ctx-size by default
llama.cpp defaults to a ctx-size of 4096, but we were hard-coding 2048, which means we were never using the upstream default. This PR changes the default to ctx-size=0: the option is only added to the command when the value is > 0, so otherwise the llama-server default is used. We also hard-coded cache_reuse=256 with no way for the user to override it; this PR adds support for setting cache_reuse in ramalama.conf and on the command line.

Signed-off-by: Daniel J Walsh <[email protected]>
1 parent b0e1226 · commit 3247632

File tree: 9 files changed, +43 -13 lines
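
The intended usage is unchanged; only the defaults become overridable. A hedged sketch of the two paths (MODEL is a placeholder; the flag spellings come from the docs and tests changed below):

```bash
# Let llama-server pick its own context size; cache-reuse falls back to the
# ramalama.conf default of 256
ramalama serve MODEL

# Override both values explicitly on the command line (as the system test below does)
ramalama --dryrun run --cache-reuse 512 -c 4096 MODEL
```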

docs/ramalama-perplexity.1.md

Lines changed: 4 additions & 1 deletion

```diff
@@ -29,8 +29,11 @@ URL support means if a model is on a web site or even on your local system, you
 #### **--authfile**=*password*
 path of the authentication file for OCI registries
 
+#### **--cache-reuse**=256
+Min chunk size to attempt reusing from the cache via KV shifting
+
 #### **--ctx-size**, **-c**
-size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
+size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)
 
 #### **--device**
 Add a host device to the container. Optional permissions parameter can
```

docs/ramalama-run.1.md

Lines changed: 4 additions & 1 deletion

```diff
@@ -33,12 +33,15 @@ The default can be overridden in the ramalama.conf file.
 #### **--authfile**=*password*
 path of the authentication file for OCI registries
 
+#### **--cache-reuse**=256
+Min chunk size to attempt reusing from the cache via KV shifting
+
 #### **--color**
 Indicate whether or not to use color in the chat.
 Possible values are "never", "always" and "auto". (default: auto)
 
 #### **--ctx-size**, **-c**
-size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
+size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)
 
 #### **--device**
 Add a host device to the container. Optional permissions parameter can
```

docs/ramalama-serve.1.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -57,8 +57,11 @@ The default can be overridden in the ramalama.conf file.
 #### **--authfile**=*password*
 Path of the authentication file for OCI registries
 
+#### **--cache-reuse**=256
+Min chunk size to attempt reusing from the cache via KV shifting
+
 #### **--ctx-size**, **-c**
-size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
+size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)
 
 #### **--detach**, **-d**
 Run the container in the background and print the new container ID.
@@ -426,7 +429,7 @@ spec:
       - name: model-server
         image: quay.io/ramalama/ramalama:0.8
         command: ["llama-server"]
-        args: ['--port', '8081', '--model', '/mnt/models/model.file', '--alias', 'quay.io/rhatdan/granite:latest', '--ctx-size', 2048, '--temp', '0.8', '--jinja', '--cache-reuse', '256', '-v', '--threads', 16, '--host', '127.0.0.1']
+        args: ['--port', '8081', '--model', '/mnt/models/model.file', '--alias', 'quay.io/rhatdan/granite:latest', '--temp', '0.8', '--jinja', '--cache-reuse', '256', '-v', '--threads', 16, '--host', '127.0.0.1']
         securityContext:
           allowPrivilegeEscalation: false
           capabilities:
```

docs/ramalama.conf

Lines changed: 5 additions & 2 deletions

```diff
@@ -31,10 +31,13 @@
 #
 #container = true
 
-#size of the prompt context (0 = loaded from model)
+#Min chunk size to attempt reusing from the cache via KV shifting
 #
-#ctx_size=2048
+#cache_reuse=256
 
+#size of the prompt context (0 = loaded from model)
+#
+#ctx_size=0
 
 # Run RamaLama using the specified container engine.
 #
```
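
As a hedged companion to the commented defaults above, this is roughly what a user override could look like in ramalama.conf; the `[ramalama]` table name is assumed here, the values are illustrative, and only the `cache_reuse` and `ctx_size` keys come from this change:

```toml
[ramalama]
# Reuse KV-cache chunks of at least 512 tokens instead of the shipped default of 256
cache_reuse = 512
# Pin the prompt context instead of deferring to the runtime's own default (0)
ctx_size = 8192
```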

docs/ramalama.conf.5.md

Lines changed: 5 additions & 1 deletion

```diff
@@ -71,12 +71,16 @@ OCI model car image
 
 Image to be used when building and pushing --type=car models
 
+**cache_reuse**=256
+
+Min chunk size to attempt reusing from the cache via KV shifting
+
 **container**=true
 
 Run RamaLama in the default container.
 RAMALAMA_IN_CONTAINER environment variable overrides this field.
 
-**ctx_size**=2048
+**ctx_size**=0
 
 Size of the prompt context (0 = loaded from model)
 
```
ramalama/cli.py

Lines changed: 10 additions & 0 deletions

```diff
@@ -776,17 +776,27 @@ def runtime_options(parser, command):
     )
     parser.add_argument("--authfile", help="path of the authentication file")
     if command in ["run", "perplexity", "serve"]:
+        parser.add_argument(
+            "--cache-reuse",
+            dest="cache_reuse",
+            type=int,
+            default=CONFIG.cache_reuse,
+            help="min chunk size to attempt reusing from the cache via KV shifting",
+            completer=suppressCompleter,
+        )
         parser.add_argument(
             "-c",
             "--ctx-size",
             dest="context",
+            type=int,
             default=CONFIG.ctx_size,
             help="size of the prompt context (0 = loaded from model)",
             completer=suppressCompleter,
         )
         parser.add_argument(
             "--max-model-len",
             dest="context",
+            type=int,
             default=CONFIG.ctx_size,
             help=argparse.SUPPRESS,
             completer=suppressCompleter,
```
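
The `type=int` additions matter because argparse returns option values as strings unless told otherwise, and the new `args.context > 0` check in model.py would raise a TypeError on a string. A minimal standalone sketch (CONFIG and suppressCompleter are stubbed out; this is not the real parser):

```python
import argparse

# Stand-ins for CONFIG.ctx_size and CONFIG.cache_reuse
DEFAULT_CTX_SIZE = 0
DEFAULT_CACHE_REUSE = 256

parser = argparse.ArgumentParser()
# Without type=int, "--ctx-size 4096" would arrive as the string "4096" and the
# later "args.context > 0" comparison would fail; with type=int it compares cleanly.
parser.add_argument("-c", "--ctx-size", dest="context", type=int, default=DEFAULT_CTX_SIZE)
parser.add_argument("--cache-reuse", dest="cache_reuse", type=int, default=DEFAULT_CACHE_REUSE)

args = parser.parse_args(["--ctx-size", "4096", "--cache-reuse", "512"])
assert args.context > 0 and args.cache_reuse == 512
```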

ramalama/config.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -69,7 +69,8 @@ class BaseConfig:
     api: str = "none"
     carimage: str = "registry.access.redhat.com/ubi10-micro:latest"
     container: bool = None  # type: ignore
-    ctx_size: int = 2048
+    ctx_size: int = 0
+    cache_reuse: int = 256
     default_image: str = DEFAULT_IMAGE
     dryrun: bool = False
     engine: SUPPORTED_ENGINES | None = field(default_factory=get_default_engine)
```
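
A hedged sketch of how these dataclass defaults are meant to interact with ramalama.conf: values present in the file win, anything missing keeps the new defaults. RamaLama's real loader differs; the path, table name, and helper below are illustrative only.

```python
import tomllib  # Python 3.11+
from dataclasses import dataclass, fields


@dataclass
class BaseConfig:
    ctx_size: int = 0        # 0 = defer to the runtime's own context-size default
    cache_reuse: int = 256   # minimum KV-cache chunk size to attempt to reuse


def load_config(path: str = "ramalama.conf") -> BaseConfig:
    """Overlay values from the [ramalama] table onto the dataclass defaults."""
    cfg = BaseConfig()
    try:
        with open(path, "rb") as fh:
            values = tomllib.load(fh).get("ramalama", {})
    except FileNotFoundError:
        return cfg
    for f in fields(cfg):
        if f.name in values:
            setattr(cfg, f.name, values[f.name])
    return cfg
```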

ramalama/model.py

Lines changed: 4 additions & 3 deletions

```diff
@@ -657,13 +657,14 @@ def llama_serve(self, args):
         exec_args += [
             "--alias",
             self.model,
-            "--ctx-size",
-            f"{args.context}",
             "--temp",
             f"{args.temp}",
             "--cache-reuse",
-            "256",
+            f"{args.cache_reuse}",
         ]
+        if args.context > 0:
+            exec_args += ["--ctx-size", f"{args.context}"]
+
         exec_args += args.runtime_args
 
         if draft_model_path:
```
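
The effect of the model.py change, isolated as a self-contained sketch (the function name is made up; the Namespace fields mirror the diff): `--ctx-size` is appended only when the user supplied a non-zero context, while `--cache-reuse` now flows from the parsed arguments instead of a hard-coded 256.

```python
from argparse import Namespace


def build_llama_server_args(model: str, args: Namespace) -> list[str]:
    """Build the llama-server option list, adding --ctx-size only on request."""
    exec_args = [
        "--alias", model,
        "--temp", f"{args.temp}",
        "--cache-reuse", f"{args.cache_reuse}",
    ]
    if args.context > 0:  # 0 means "let llama-server use its own default"
        exec_args += ["--ctx-size", f"{args.context}"]
    return exec_args


# With context left at 0, no --ctx-size appears in the generated command
print(build_llama_server_args("granite", Namespace(temp=0.8, cache_reuse=256, context=0)))
```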

test/system/030-run.bats

Lines changed: 4 additions & 2 deletions

```diff
@@ -20,7 +20,8 @@ EOF
     run_ramalama -q --dryrun run ${MODEL}
     is "$output" "${verify_begin}.*"
     is "$output" ".*${MODEL}" "verify model name"
-    is "$output" ".*--ctx-size 2048" "verify model name"
+    is "$output" ".*--cache-reuse 256" "verify cache-reuse is being set"
+    assert "$output" !~ ".*--ctx-size" "assert ctx-size is not shown by default"
     assert "$output" !~ ".*--seed" "assert seed does not show by default"
     assert "$output" !~ ".*-t -i" "assert -t -i not present without tty"
 
@@ -38,10 +39,11 @@ EOF
     run_ramalama -q --dryrun run --oci-runtime foobar ${MODEL}
     is "$output" ".*--runtime foobar" "dryrun correct with --oci-runtime"
 
-    RAMALAMA_CONFIG=/dev/null run_ramalama -q --dryrun run --seed 9876 -c 4096 --net bridge --name foobar ${MODEL}
+    RAMALAMA_CONFIG=/dev/null run_ramalama -q --dryrun run --cache-reuse 512 --seed 9876 -c 4096 --net bridge --name foobar ${MODEL}
     is "$output" ".*--network bridge.*" "dryrun correct with --name"
     is "$output" ".*${MODEL}" "verify model name"
     is "$output" ".*--ctx-size 4096" "verify ctx-size is set"
+    is "$output" ".*--cache-reuse 512" "verify cache-reuse is being set"
     is "$output" ".*--temp 0.8" "verify temp is set"
     is "$output" ".*--seed 9876" "verify seed is set"
     if not_docker; then
```
