Do not set the ctx-size by default #1915
@@ -33,12 +33,15 @@ The default can be overridden in the ramalama.conf file.
 #### **--authfile**=*password*
 path of the authentication file for OCI registries

+#### **--cache-reuse**=256
+Min chunk size to attempt reusing from the cache via KV shifting
+
 #### **--color**
 Indicate whether or not to use color in the chat.
 Possible values are "never", "always" and "auto". (default: auto)

 #### **--ctx-size**, **-c**
-size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
+size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)

 #### **--device**
 Add a host device to the container. Optional permissions parameter can

Review comment: The PR description mentions that llama.cpp defaults to a context size of 4098, but the documentation here states the default is 4096. To ensure accuracy, could you please verify the current default ctx-size in llama.cpp and update the documentation accordingly? This will help avoid confusion for users relying on the default behavior.
@@ -57,8 +57,11 @@ The default can be overridden in the ramalama.conf file.
 #### **--authfile**=*password*
 Path of the authentication file for OCI registries

+#### **--cache-reuse**=256
+Min chunk size to attempt reusing from the cache via KV shifting
+
 #### **--ctx-size**, **-c**
-size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 2048, 0 = loaded from model)
+size of the prompt context. This option is also available as **--max-model-len**. Applies to llama.cpp and vllm regardless of alias (default: 4096, 0 = loaded from model)

 #### **--detach**, **-d**
 Run the container in the background and print the new container ID.

Review comment: Same question as above about whether the documented default should be 4096 or 4098.
@@ -426,7 +429,7 @@ spec:
       - name: model-server
         image: quay.io/ramalama/ramalama:0.8
         command: ["llama-server"]
-        args: ['--port', '8081', '--model', '/mnt/models/model.file', '--alias', 'quay.io/rhatdan/granite:latest', '--ctx-size', 2048, '--temp', '0.8', '--jinja', '--cache-reuse', '256', '-v', '--threads', 16, '--host', '127.0.0.1']
+        args: ['--port', '8081', '--model', '/mnt/models/model.file', '--alias', 'quay.io/rhatdan/granite:latest', '--temp', '0.8', '--jinja', '--cache-reuse', '256', '-v', '--threads', 16, '--host', '127.0.0.1']
         securityContext:
           allowPrivilegeEscalation: false
           capabilities:
@@ -776,17 +776,27 @@ def runtime_options(parser, command):
     )
     parser.add_argument("--authfile", help="path of the authentication file")
     if command in ["run", "perplexity", "serve"]:
+        parser.add_argument(
+            "--cache-reuse",
+            dest="cache_reuse",
+            type=int,
+            default=CONFIG.cache_reuse,
+            help="min chunk size to attempt reusing from the cache via KV shifting",
+            completer=suppressCompleter,
+        )
         parser.add_argument(
             "-c",
             "--ctx-size",
             dest="context",
             type=int,
             default=CONFIG.ctx_size,
             help="size of the prompt context (0 = loaded from model)",
             completer=suppressCompleter,
         )
         parser.add_argument(
             "--max-model-len",
             dest="context",
             type=int,
             default=CONFIG.ctx_size,
             help=argparse.SUPPRESS,
             completer=suppressCompleter,

Review comment (suggestion): Consider clarifying the unit for --cache-reuse in the help text. Specifying the unit in the help text will make it clearer for users and prevent misunderstandings.
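The reviewer's attached suggested change is not reproduced above. Purely as an illustration of the kind of wording they may have in mind, and assuming the chunk size is measured in tokens (an assumption, not something stated in this PR), a self-contained argparse sketch could look like this:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cache-reuse",
    dest="cache_reuse",
    type=int,
    default=256,  # stand-in for CONFIG.cache_reuse; the real default comes from ramalama.conf
    # "in tokens" is an assumed unit used for illustration; verify against llama.cpp before adopting.
    help="min chunk size (in tokens) to attempt reusing from the cache via KV shifting",
)

args = parser.parse_args(["--cache-reuse", "512"])
print(args.cache_reuse)  # prints: 512
```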
@@ -657,13 +657,14 @@ def llama_serve(self, args):
         exec_args += [
             "--alias",
             self.model,
-            "--ctx-size",
-            f"{args.context}",
             "--temp",
             f"{args.temp}",
             "--cache-reuse",
-            "256",
+            f"{args.cache_reuse}",
         ]
+        if args.context > 0:
+            exec_args += ["--ctx-size", f"{args.context}"]

         exec_args += args.runtime_args

         if draft_model_path:

Review comment on lines 660 to 665 (suggestion): Switching --cache-reuse from a hardcoded value to a parameter increases flexibility but may require validation. Please add validation for the --cache-reuse parameter to prevent invalid values and potential performance issues.
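A minimal, self-contained sketch of one way to add that validation, assuming non-negative integers are the only values worth accepting (the valid range for llama.cpp's --cache-reuse is not stated in this PR, so the bound is an assumption):

```python
import argparse

def non_negative_int(value: str) -> int:
    """Reject negative values so an invalid --cache-reuse never reaches llama-server."""
    parsed = int(value)
    if parsed < 0:
        raise argparse.ArgumentTypeError(f"--cache-reuse must be >= 0, got {parsed}")
    return parsed

parser = argparse.ArgumentParser()
# 256 stands in for CONFIG.cache_reuse read from ramalama.conf.
parser.add_argument("--cache-reuse", dest="cache_reuse", type=non_negative_int, default=256)

print(parser.parse_args([]).cache_reuse)                        # 256
print(parser.parse_args(["--cache-reuse", "512"]).cache_reuse)  # 512
# parser.parse_args(["--cache-reuse", "-1"]) would exit with an argparse error.
```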
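Taken together, the change means ramalama only forwards --ctx-size to llama-server when a non-zero value is set. A standalone sketch that mirrors, but is not, the actual llama_serve logic, using hypothetical values for the alias, temperature, and cache-reuse:

```python
def build_server_args(context: int, cache_reuse: int, temp: float, alias: str) -> list:
    """Mirror the new behavior: --ctx-size is only passed when it is explicitly non-zero."""
    args = ["--alias", alias, "--temp", f"{temp}", "--cache-reuse", f"{cache_reuse}"]
    if context > 0:
        args += ["--ctx-size", f"{context}"]
    return args

# context == 0: no --ctx-size is sent, so llama-server falls back to its own default.
print(build_server_args(0, 256, 0.8, "example/model:latest"))
# A non-zero context is still forwarded unchanged.
print(build_server_args(8192, 256, 0.8, "example/model:latest"))
```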