docs/stable/store/quickstart.md: 7 additions & 9 deletions
@@ -193,19 +193,15 @@ for output in outputs:

## Quantization

-ServerlessLLM currently supports model quantization using `bitsandbytes` through the Hugging Face Transformers' `BitsAndBytesConfig`.
-
-Available precisions include:
-- `int8`
-- `fp4`
-- `nf4`
+> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
+> Note: Our current capabilities do not support pre-quantization or CPU offloading, which is why other quantization methods are not available at the moment.

-For further information, consult the [HuggingFace Documentation for BitsAndBytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes)
+ServerlessLLM currently supports `bitsandbytes` quantization through `transformers`.

-> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
+For further information, consult the [HuggingFace Documentation for Quantization](https://huggingface.co/docs/transformers/en/main_classes/quantization)

### Usage
-To use quantization, create a `BitsAndBytesConfig` object with your desired settings:
+To use quantization, create a quantization config object with your desired settings using the `transformers` format:

```python
from transformers import BitsAndBytesConfig
@@ -238,7 +234,9 @@ model = load_model(
    quantization_config=quantization_config,
)
```
+A full example can be found [here](https://github.com/ServerlessLLM/ServerlessLLM/blob/main/sllm_store/examples/load_quantized_transformers_model.py).

+For users with multi-GPU setups, ensure that the number of CUDA-visible devices is the same on both the store server and the user environment via `export CUDA_VISIBLE_DEVICES=<num_gpus>`.
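
The diff above shows only the beginning and end of the usage example. As a rough end-to-end sketch of how the pieces fit together, assuming the `sllm_store.transformers.load_model` helper with the `storage_path` and `device_map` arguments used elsewhere in this quickstart, plus a placeholder model name, loading a 4-bit NF4 quantized model might look like this:

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig

# Assumed import path for the loader used elsewhere in this quickstart.
from sllm_store.transformers import load_model

# Standard transformers/bitsandbytes 4-bit NF4 configuration.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Placeholder model name and storage path; the model must already have been
# saved to the sllm-store storage path as described earlier in the guide.
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For 8-bit loading, `load_in_8bit=True` can be used in place of the 4-bit options; see the linked full example and the Hugging Face quantization documentation for the complete set of `BitsAndBytesConfig` settings.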

# Fine-tuning
ServerlessLLM currently supports LoRA fine-tuning using `peft` through the Hugging Face Transformers PEFT integration.
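
The fine-tuning section itself is not part of this diff. Purely as an illustration of the kind of configuration the sentence above refers to, here is generic Hugging Face `peft` LoRA usage (model name, target modules, and hyperparameters are placeholders, not a ServerlessLLM-specific API):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Generic PEFT usage sketch; not ServerlessLLM-specific.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

lora_config = LoraConfig(
    r=8,                                  # rank of the LoRA update matrices
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```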