---
sidebar_position: 2
---

# ServerlessLLM Store CLI

The ServerlessLLM Store CLI lets you use `sllm-store`'s functionality from a terminal window. It provides the following commands:
- `start`: Starts the gRPC server with the specified configuration.
- `save`: Converts a HuggingFace model into a loading-optimized format and saves it to a local path.
- `load`: Loads a model onto the given GPUs.

## Requirements
- OS: Ubuntu 22.04
- Python: 3.10
- GPU: compute capability 7.0 or higher

## Installation

### Create a virtual environment
```bash
conda create -n sllm-store python=3.10 -y
conda activate sllm-store
```

### Install C++ Runtime Library (required for compiling and running CUDA/C++ extensions)
```bash
conda install -c conda-forge libstdcxx-ng=12 -y
```

### Install with pip
```bash
pip install serverless-llm-store
```
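
To confirm the installation, you can print the CLI's help text (this assumes the standard `--help` flag exposed by typical Python command-line tools):
```bash
# List the available sub-commands (start, save, load) and their options.
sllm-store --help
```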

## Example Workflow
1. First, start the ServerlessLLM Store server. By default, it uses `./models` as the storage path.
Launch the checkpoint store server in a separate process:
```bash
# '--mem-pool-size' is the maximum size of the memory pool in GB. It should be larger than the model size.
sllm-store start --storage-path $PWD/models --mem-pool-size 4GB
```

2. Convert a model to ServerlessLLM format and save it to a local path:
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm
```

3. Load a previously saved model into memory, ready for inference:
```bash
sllm-store load --model facebook/opt-1.3b --backend vllm
```
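
The three steps above can also be run end to end as a single script. This is only a minimal sketch using the same paths and model name as the examples; adjust them for your setup, and note that the fixed `sleep` is just a crude way to wait for the server to come up:
```bash
#!/bin/bash
set -e

mkdir -p ./models

# 1. Start the store server in the background and remember its PID.
sllm-store start --storage-path $PWD/models --mem-pool-size 4GB &
SERVER_PID=$!
sleep 10  # give the gRPC server a moment to finish initializing

# 2. Convert the model to ServerlessLLM format and save it.
sllm-store save --model facebook/opt-1.3b --backend vllm

# 3. Load the saved model to verify it can be served.
sllm-store load --model facebook/opt-1.3b --backend vllm

# Shut the server down when finished.
kill $SERVER_PID
```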

## sllm-store start

Start a gRPC server to serve models stored using ServerlessLLM. This enables fast, low-latency access to models registered via `sllm-store save`, allowing external clients to load model weights, retrieve metadata, and perform inference-related operations efficiently.

The server supports in-memory caching with customizable memory pooling and chunking, optimized for parallel read access and minimal I/O latency.
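
When sizing the memory pool, a simple rule of thumb is to check how much space the saved checkpoints occupy on disk and leave some headroom, since the pool should be larger than the models you intend to cache (the path below assumes the default `./models` storage directory):
```bash
# Show the on-disk size of each saved model; pick a --mem-pool-size at least
# as large as the models you expect to keep cached at the same time.
du -sh ./models/*
```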

#### Usage
```bash
sllm-store start [OPTIONS]
```

#### Options

- `--host <host>`
  - Host address to bind the gRPC server to.

- `--port <port>`
  - Port number on which the gRPC server will listen for incoming requests.

- `--storage-path <storage_path>`
  - Path to the directory containing models previously saved with `sllm-store save`.

- `--num-thread <num_thread>`
  - Number of threads to use for I/O operations and chunk handling.

- `--chunk-size <chunk_size>`
  - Size of the individual memory chunks used for caching model data (e.g., 64MiB, 512KB). Must include a unit suffix.

- `--mem-pool-size <mem_pool_size>`
  - Total memory pool size to allocate for the in-memory cache (e.g., 4GiB, 2GB). Must include a unit suffix.

- `--disk-size <disk_size>`
  - (Currently unused) Would set the maximum disk space `sllm-store` can occupy for its disk cache.

- `--registration-required`
  - If specified, models must be registered with the server before loading.

#### Examples

Start the server using all default values:
```bash
sllm-store start
```

Start the server with a custom storage path:
```bash
sllm-store start --storage-path /your/folder
```

Specify a custom port and host:
```bash
sllm-store start --host 127.0.0.1 --port 9090
```

Use larger chunk size and memory pool for large models in a multi-threaded environment:
```bash
sllm-store start --num-thread 16 --chunk-size 128MB --mem-pool-size 8GB
```

Run with access control enabled:
```bash
sllm-store start --registration-required True
```

Full example for production-style setup:
```bash
sllm-store start \
  --host 0.0.0.0 \
  --port 8000 \
  --storage-path /data/models \
  --num-thread 8 \
  --chunk-size 64MB \
  --mem-pool-size 16GB \
  --registration-required True
```
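
Because the server runs in the foreground (the workflow above launches it in a separate process), a long-running deployment typically keeps it alive in the background and captures its output. The snippet below is only a minimal sketch: the log and PID file names are example choices, and a process supervisor such as systemd would work equally well:
```bash
# Keep the server running after the shell exits and write its output to a log file.
nohup sllm-store start \
  --host 0.0.0.0 \
  --port 8000 \
  --storage-path /data/models \
  --mem-pool-size 16GB \
  > sllm-store.log 2>&1 &

# Remember the PID so the server can be stopped later with `kill`.
echo $! > sllm-store.pid
```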

## sllm-store save

Saves a model to a local directory using the backend of your choice, making it available for future inference requests. Only the model name and backend are required; the remaining options have default values.

It supports downloading [PEFT LoRA (Low-Rank Adaptation)](https://huggingface.co/docs/peft/main/en/index) adapters for `transformers` models, and a configurable tensor parallel size for `vllm` models.

#### Usage
```bash
sllm-store save [OPTIONS]
```

#### Options

- `--model <model_name>`
  - Model name to deploy with default configuration. The model name must be a Hugging Face pretrained model name. You can find the list of available models [here](https://huggingface.co/models).

- `--backend <backend_name>`
  - Backend used to convert the model to ServerlessLLM format. Supported backends are `vllm` and `transformers`.

- `--adapter`
  - Enable LoRA adapter support. Overrides `adapter`, which defaults to `False`. Only the `transformers` backend is supported.

- `--adapter-name <adapter_name>`
  - Adapter name to save. Must be a Hugging Face pretrained LoRA adapter name.

- `--tensor-parallel-size <tensor_parallel_size>`
  - Number of GPUs to use. Only the `vllm` backend is supported.

- `--local-model-path <local_model_path>`
  - Save the model from a local path that contains a Hugging Face snapshot of the model.

- `--storage-path <storage_path>`
  - Location where the model will be saved.

#### Examples
Save a vLLM model with the default configuration:
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm
```

Save a transformers model to a specific location:
```bash
sllm-store save --model facebook/opt-1.3b --backend transformers --storage-path ./your/folder
```

Save a vLLM model from a locally stored snapshot and override the tensor parallel size:
```bash
sllm-store save --model facebook/opt-1.3b --backend vllm --tensor-parallel-size 4 --local-model-path ./path/to/snapshot
```
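
If you do not already have a local snapshot, one way to obtain one is with the Hugging Face CLI (this assumes the `huggingface_hub` package and its `huggingface-cli` tool are installed; the target directory is just an example):
```bash
# Download a snapshot of the model into a local directory, then point
# --local-model-path at that directory when saving.
huggingface-cli download facebook/opt-1.3b --local-dir ./snapshots/opt-1.3b
sllm-store save --model facebook/opt-1.3b --backend vllm --local-model-path ./snapshots/opt-1.3b
```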

Save a transformers model with a LoRA adapter:
```bash
sllm-store save --model facebook/opt-1.3b --backend transformers --adapter --adapter-name crumb/FLAN-OPT-1.3b-LoRA
```

## sllm-store load

Load a model from local storage and run example inference to verify deployment. This command supports both the `transformers` and `vllm` backends, with optional support for PEFT LoRA adapters and quantized precision formats including `int8`, `fp4`, and `nf4` (LoRA and quantization are supported on the `transformers` backend only).

When using the `transformers` backend, the command warms up the GPU devices, loads the base model from disk, and optionally merges a LoRA adapter if specified. With `vllm`, it loads the model from the ServerlessLLM format.

#### Usage
```bash
sllm-store load [OPTIONS]
```

#### Options

- `--model <model_name>`
  - Model name to deploy with default configuration. The model name must be a Hugging Face pretrained model name. You can find the list of available models [here](https://huggingface.co/models).

- `--backend <backend_name>`
  - Backend to use when loading the model from ServerlessLLM format. Supported backends are `vllm` and `transformers`.

- `--adapter`
  - Enable LoRA adapter support for the transformers backend. Overrides `adapter` in the default configuration (`transformers` backend only).

- `--adapter-name <adapter_name>`
  - Adapter name to load. Must be a Hugging Face pretrained LoRA adapter name.

- `--precision <precision>`
  - Precision to use when loading the model (`transformers` backend only). For more info on quantization in ServerlessLLM, visit [here](https://serverlessllm.github.io/docs/stable/store/quickstart#quantization).

- `--storage-path <storage_path>`
  - Location from which the model will be loaded.

#### Examples
Load a vLLM model from storage:
```bash
sllm-store load --model facebook/opt-1.3b --backend vllm
```

Load a transformers model from storage with int8 quantization:
```bash
sllm-store load --model facebook/opt-1.3b --backend transformers --precision int8 --storage-path ./your/models
```

Load a transformers model with a LoRA adapter:
```bash
sllm-store load --model facebook/opt-1.3b --backend transformers --adapter --adapter-name crumb/FLAN-OPT-1.3b-LoRA
```

#### Note: loading vLLM models

To load models with vLLM, you need to apply a compatibility patch to your vLLM installation. This patch has been tested with vLLM version `0.9.0.1`.

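Since the patch targets a specific vLLM release, it is worth confirming your installed version first (a quick sanity check; this simply prints the package version):

```bash
# Print the installed vLLM version; the patch has been tested against 0.9.0.1.
python -c "import vllm; print(vllm.__version__)"
```

If the version matches, apply the patch:
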
```bash
./sllm_store/vllm_patch/patch.sh
```

:::note
The patch file is located at `sllm_store/vllm_patch/sllm_load.patch` in the ServerlessLLM repository.
:::