docs/backend/lora.ipynb (+2 lines changed: 2 additions & 0 deletions)

@@ -33,6 +33,8 @@
"\n",
"* `max_loras_per_batch`: Maximum number of adapters used by each batch. This argument can affect the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8.\n",
"\n",
+ "* `max_loaded_loras`: If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `max_loras_per_batch`.\n",
+ "\n",
"* `lora_backend`: The backend for running GEMM kernels for LoRA modules. It can be one of `triton` or `flashinfer`, and is set to `triton` by default. For better performance and stability, we recommend using the Triton LoRA backend. In the future, faster backends built upon Cutlass or CUDA kernels will be added.\n",
"\n",
"* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.\n",

docs/backend/server_arguments.md (+1 line changed: 1 addition & 0 deletions)

@@ -181,6 +181,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|`--lora-target-modules`| The union set of all target modules where LoRA should be applied (e.g., `q_proj`, `k_proj`, `gate_proj`). If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of different target modules after server startup. You can also set it to `all` to enable LoRA for all supported modules. However, enabling LoRA on additional modules introduces a minor performance overhead. If your application is performance-sensitive, we recommend only specifying the modules for which you plan to load adapters. | None |
|`--lora-paths`| The list of LoRA adapters. Each entry can be either a plain path or a renamed path in the format {name}={path}. | None |
|`--max-loras-per-batch`| Maximum number of adapters for a running batch, including base-only requests. | 8 |
+ |`--max-loaded-loras`| If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `--max-loras-per-batch`. | None |
|`--lora-backend`| Choose the kernel backend for multi-LoRA serving. | triton |
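
For illustration, the arguments above can be combined into a single server launch. The sketch below assumes the usual `python -m sglang.launch_server` entry point; the model path, adapter paths, and flag values are placeholders, not part of this change.

```python
# Sketch: launching a server with the multi-LoRA flags documented above.
# Assumes the standard `python -m sglang.launch_server` entry point; the
# model path and adapter paths are placeholders.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    "--lora-paths",
    "sql_adapter=/adapters/sql",   # renamed form: {name}={path}
    "/adapters/chat",              # plain path form
    "--max-loras-per-batch", "4",  # adapters allowed in a single running batch
    "--max-loaded-loras", "16",    # must be >= --max-loras-per-batch
    "--lora-backend", "triton",    # recommended backend
]
subprocess.run(cmd, check=True)
```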

server_args.py
            help="Maximum number of adapters for a running batch, including base-only requests.",
        )
+       parser.add_argument(
+           "--max-loaded-loras",
+           type=int,
+           default=ServerArgs.max_loaded_loras,
+           help="If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `--max-loras-per-batch`.",
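
Since the help text requires `--max-loaded-loras` to be at least `--max-loras-per-batch`, a minimal sketch of how such a constraint could be checked is shown below. This is illustrative only and is not the actual SGLang validation code.

```python
# Minimal sketch of the documented constraint between the two flags.
# Illustrative only; not the actual ServerArgs validation logic.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRALimits:
    max_loras_per_batch: int = 8
    max_loaded_loras: Optional[int] = None  # None means no explicit CPU-side cap

    def validate(self) -> None:
        if (
            self.max_loaded_loras is not None
            and self.max_loaded_loras < self.max_loras_per_batch
        ):
            raise ValueError(
                "max_loaded_loras must be >= max_loras_per_batch "
                f"({self.max_loaded_loras} < {self.max_loras_per_batch})"
            )


LoRALimits(max_loras_per_batch=4, max_loaded_loras=16).validate()  # passes
```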