Commit 235f544

Update documentation from main repository

1 parent 196cc92 commit 235f544

1 file changed: docs/api/sllm-store-cli.md (+242, -0)
---
sidebar_position: 2
---

# ServerlessLLM Store CLI

The ServerlessLLM Store CLI exposes `sllm-store`'s functionality from a terminal window. It provides the following commands:
- `start`: Start the gRPC server with the specified configuration.
- `save`: Convert a Hugging Face model into a loading-optimized format and save it to a local path.
- `load`: Load a model onto the given GPUs.

## Requirements
- OS: Ubuntu 22.04
- Python: 3.10
- GPU: compute capability 7.0 or higher
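
You can quickly check the Python and GPU requirements with standard tooling (`nvidia-smi` is NVIDIA's own utility, not part of `sllm-store`; the `compute_cap` query field assumes a reasonably recent driver):

```bash
# Verify the Python version in the active environment.
python --version

# Report each GPU's name and compute capability (should be 7.0 or higher).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```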

## Installation

### Create a virtual environment
```bash
conda create -n sllm-store python=3.10 -y
conda activate sllm-store
```

### Install C++ Runtime Library (required for compiling and running CUDA/C++ extensions)
```bash
conda install -c conda-forge libstdcxx-ng=12 -y
```

### Install with pip
```bash
pip install serverless-llm-store
```
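
To confirm the CLI is installed and on your `PATH`, you can print its help text (assuming the package exposes the `sllm-store` entry point used throughout this page):

```bash
# Lists the available subcommands (start, save, load) and their options.
sllm-store --help
```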

## Example Workflow
1. First, start the ServerlessLLM Store server. By default, it uses `./models` as the storage path.

Launch the checkpoint store server in a separate process:
```bash
# 'mem_pool_size' is the maximum size of the memory pool in GB. It should be larger than the model size.
sllm-store start --storage-path $PWD/models --mem-pool-size 4GB
```

2. Convert a model to ServerlessLLM format and save it to a local path (a quick way to confirm the save succeeded is shown after this list):
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm
```

3. Load a previously saved model into memory, ready for inference:
```bash
sllm-store load --model facebook/opt-1.3b --backend vllm
```
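
After step 2, the converted model should appear under the storage path. A minimal sanity check, assuming the default `./models` directory and that saved models mirror the Hugging Face model name:

```bash
# List the converted checkpoint files for the model saved above.
ls ./models/facebook/opt-1.3b
```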

## sllm-store start

Start a gRPC server to serve models stored using ServerlessLLM. This enables fast, low-latency access to models registered via `sllm-store save`, allowing external clients to load model weights, retrieve metadata, and perform inference-related operations efficiently.

The server supports in-memory caching with customizable memory pooling and chunking, optimized for parallel read access and minimal I/O latency.
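
Because the server runs in the foreground and is meant to stay up while clients load models, you may want to run it as a separate, long-lived process. A minimal sketch using plain shell job control (not an `sllm-store` feature):

```bash
# Start the store server in the background and capture its logs.
nohup sllm-store start --storage-path $PWD/models --mem-pool-size 4GB > sllm-store.log 2>&1 &

# Follow the logs to confirm the server came up.
tail -f sllm-store.log
```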

#### Usage
```bash
sllm-store start [OPTIONS]
```

#### Options

- `--host <host>`
  - Host address to bind the gRPC server to.

- `--port <port>`
  - Port number on which the gRPC server will listen for incoming requests.

- `--storage-path <storage_path>`
  - Path to the directory containing models previously saved with `sllm-store save`.

- `--num-thread <num_thread>`
  - Number of threads to use for I/O operations and chunk handling.

- `--chunk-size <chunk_size>`
  - Size of individual memory chunks used for caching model data (e.g., 64MiB, 512KB). Must include a unit suffix.

- `--mem-pool-size <mem_pool_size>`
  - Total memory pool size to allocate for the in-memory cache (e.g., 4GiB, 2GB). Must include a unit suffix.

- `--disk-size <disk_size>`
  - (Currently unused.) Reserved for setting the maximum disk cache size that `sllm-store` may occupy.

- `--registration-required`
  - If specified, models must be registered with the server before loading.

#### Examples

Start the server using all default values:
```bash
sllm-store start
```

Start the server with a custom storage path:
```bash
sllm-store start --storage-path /your/folder
```

Specify a custom port and host:
```bash
sllm-store start --host 127.0.0.1 --port 9090
```

Use a larger chunk size and memory pool for large models in a multi-threaded environment:
```bash
sllm-store start --num-thread 16 --chunk-size 128MB --mem-pool-size 8GB
```

Run with access control enabled:
```bash
sllm-store start --registration-required True
```

A full example for a production-style setup:
```bash
sllm-store start \
  --host 0.0.0.0 \
  --port 8000 \
  --storage-path /data/models \
  --num-thread 8 \
  --chunk-size 64MB \
  --mem-pool-size 16GB \
  --registration-required True
```

## sllm-store save

Save a model to a local directory through a backend of choice, making it available for future inference requests. Only the model name and backend are required; the remaining options have default values.

It supports downloading [PEFT LoRA (Low-Rank Adaptation)](https://huggingface.co/docs/peft/main/en/index) adapters for `transformers` models, and configurable tensor parallel sizes for `vllm` models.

#### Usage
```bash
sllm-store save [OPTIONS]
```

#### Options

- `--model <model_name>`
  - Name of the model to save. Must be a Hugging Face pretrained model name. You can find the list of available models [here](https://huggingface.co/models).

- `--backend <backend_name>`
  - Backend used to convert the model to ServerlessLLM format. Supported backends are `vllm` and `transformers`.

- `--adapter`
  - Enable LoRA adapter support (off by default). Only the `transformers` backend is supported.

- `--adapter-name <adapter_name>`
  - Adapter name to save. Must be a Hugging Face pretrained LoRA adapter name.

- `--tensor-parallel-size <tensor_parallel_size>`
  - Number of GPUs to use. Only the `vllm` backend is supported.

- `--local-model-path <local_model_path>`
  - Save the model from a local path that contains a Hugging Face snapshot of the model.

- `--storage-path <storage_path>`
  - Location where the model will be saved.

#### Examples
Save a vLLM model with the default configuration:
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm
```

Save a vLLM model to a specified storage path:
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm --storage-path ./your/folder
```

Save a vLLM model from a locally stored snapshot and override the tensor parallel size:
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm --tensor-parallel-size 4 --local-model-path ./path/to/snapshot
```

Save a transformers model with a LoRA adapter:
```bash
sllm-store save --model facebook/opt-1.3b --backend transformers --adapter --adapter-name crumb/FLAN-OPT-1.3b-LoRA
```

## sllm-store load

Load a model from local storage and run example inference to verify the deployment. This command supports both the `transformers` and `vllm` backends, with optional support for PEFT LoRA adapters and quantized precision formats including `int8`, `fp4`, and `nf4` (LoRA and quantization are supported on the `transformers` backend only).

When using the `transformers` backend, the command warms up the GPU devices, loads the base model from disk, and optionally merges a LoRA adapter if specified. With `vllm`, it loads the model in ServerlessLLM format.

#### Usage
```bash
sllm-store load [OPTIONS]
```

#### Options

- `--model <model_name>`
  - Name of the model to load. Must be a Hugging Face pretrained model name. You can find the list of available models [here](https://huggingface.co/models).

- `--backend <backend_name>`
  - Backend used to load the model. Supported backends are `vllm` and `transformers`.

- `--adapter`
  - Enable LoRA adapter support (`transformers` backend only).

- `--adapter-name <adapter_name>`
  - Name of the LoRA adapter to load and merge into the base model. Must be a Hugging Face pretrained LoRA adapter name.

- `--precision <precision>`
  - Precision to use when loading the model (`transformers` backend only). For more info on quantization in ServerlessLLM, visit [here](https://serverlessllm.github.io/docs/stable/store/quickstart#quantization).

- `--storage-path <storage_path>`
  - Location where the model will be loaded from.

#### Examples
Load a vLLM model from storage:
```bash
sllm-store load --model facebook/opt-1.3b --backend vllm
```

Load a transformers model from storage with int8 quantization:
```bash
sllm-store load --model facebook/opt-1.3b --backend transformers --precision int8 --storage-path ./your/models
```

Load a transformers model with a LoRA adapter:
```bash
sllm-store load --model facebook/opt-1.3b --backend transformers --adapter --adapter-name crumb/FLAN-OPT-1.3b-LoRA
```

#### Note: loading vLLM models

To load models with vLLM, you need to apply a compatibility patch to your vLLM installation. This patch has been tested with vLLM version `0.9.0.1`.

```bash
./sllm_store/vllm_patch/patch.sh
```

:::note
The patch file is located at `sllm_store/vllm_patch/sllm_load.patch` in the ServerlessLLM repository.
:::
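
If you installed `sllm-store` from PyPI rather than from a source checkout, the patch script has to come from the ServerlessLLM repository. A hedged sketch (the repository URL below is the project's GitHub home and is assumed here rather than stated on this page):

```bash
# Fetch the repository that ships the patch script, then apply the patch
# to the vLLM installation in the current environment.
git clone https://github.com/ServerlessLLM/ServerlessLLM.git
cd ServerlessLLM
./sllm_store/vllm_patch/patch.sh
```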
