Commit 0ba55e4

Update documentation from main repository
1 parent 235f544 commit 0ba55e4

File tree

1 file changed (+7 / -9 lines)


docs/stable/store/quickstart.md

Lines changed: 7 additions & 9 deletions
````diff
@@ -193,19 +193,15 @@ for output in outputs:
 
 ## Quantization
 
-ServerlessLLM currently supports model quantization using `bitsandbytes` through the Hugging Face Transformers' `BitsAndBytesConfig`.
-
-Available precisions include:
-- `int8`
-- `fp4`
-- `nf4`
+> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
+> Note: Our current capabilities do not support pre-quantization or CPU offloading, which is why other quantization methods are not available at the moment.
 
-For further information, consult the [HuggingFace Documentation for BitsAndBytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes)
+ServerlessLLM currently supports `bitsandbytes` quantization through `transformers`.
 
-> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
+For further information, consult the [HuggingFace Documentation for Quantization](https://huggingface.co/docs/transformers/en/main_classes/quantization)
 
 ### Usage
-To use quantization, create a `BitsAndBytesConfig` object with your desired settings:
+To use quantization, create a quantization config object with your desired settings using the `transformers` format:
 
 ```python
 from transformers import BitsAndBytesConfig
@@ -238,7 +234,9 @@ model = load_model(
     quantization_config=quantization_config,
 )
 ```
+A full example can be found [here](https://github.com/ServerlessLLM/ServerlessLLM/blob/main/sllm_store/examples/load_quantized_transformers_model.py).
 
+For users with multi-GPU setups, ensure that the number of CUDA visible devices are the same on both the store server and the user environment via `export CUDA_VISIBLE_DEVICES=<num_gpus>`.
 
 # Fine-tuning
 ServerlessLLM currently supports LoRA fine-tuning using peft through the Hugging Face Transformers PEFT.
````
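
For readers following the updated quickstart, the sketch below shows how the pieces referenced in this diff fit together: a `BitsAndBytesConfig` is built with `transformers` and passed to `load_model` via `quantization_config`. This is a minimal illustration, not the committed example: the `sllm_store.transformers` import, the model name `facebook/opt-1.3b`, and the `storage_path` value are assumptions drawn from the surrounding quickstart rather than from this commit.

```python
import torch
from transformers import BitsAndBytesConfig

# Assumption: load_model is the sllm_store helper used elsewhere in the quickstart.
from sllm_store.transformers import load_model

# Example bitsandbytes setting: 4-bit NF4 quantization.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

# Hypothetical model name and storage path; use whatever was saved earlier with sllm-store.
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models",
    quantization_config=quantization_config,
)
```

On multi-GPU hosts, set the same `CUDA_VISIBLE_DEVICES` in both the store server's environment and the client environment before running this, per the note added in the diff.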
