[Model]: Add transformers backend support (#11330)
# Adds support for `transformers` as a backend
Following huggingface/transformers#35235, many models should already be supported, and we are ramping up support for more.
Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the Hub, if it implements attention the correct way, can be natively supported!
- tensor parallel support (see the usage sketch below)
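
A minimal usage sketch under assumed settings (the model name, prompt, and TP degree are placeholders, not part of this PR; whether the `transformers` backend actually kicks in depends on whether vLLM already has a native implementation for the model):

```python
# Illustrative sketch only: the standard vLLM entry point with tensor
# parallelism and remote code enabled. Names and values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder: any Transformers-compatible model
    tensor_parallel_size=2,              # tensor parallel support
    trust_remote_code=True,              # allow custom modeling code from the Hub
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```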
---------
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
File changed: `docs/source/models/supported_models.md` (76 additions, 0 deletions)
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
### Transformers fallback
After the merge of <gh-pr:11330>, `vllm` can fall back to model implementations that are available in `transformers`. This does not yet work for all models, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is `transformers`, you can simply do this:

```python
from vllm import LLM
llm = LLM(model=..., task="generate")  # Name or path of your model
llm.apply_model(lambda model: print(type(model)))  # Print the class of the loaded model
```

If the printed class is `TransformersModel`, then it means it's based on `transformers`!
#### Supported features
##### LoRA and quantization
Neither is supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
Usually, `transformers` models load adapter weights via the `load_adapter` API, which depends on PEFT. We need to do some work to either use this API (for now, this would result in some weights not being marked as loaded) or to replace the relevant modules accordingly; a sketch of the plain `transformers` flow is shown below.

The blocker is that you need to specify the supported LoRA layers, when ideally we would want to load whatever is inside the checkpoint!
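
For context, here is a minimal sketch of the PEFT-backed adapter flow in plain `transformers` (not vLLM code; the model and adapter names are placeholders):

```python
# Sketch of how `transformers` loads LoRA adapters today, via PEFT.
# Requires `pip install peft`; the names below are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")
model.load_adapter("your-org/your-lora-adapter")  # PEFT decides which layers get LoRA weights
```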
##### Remote code
This fallback also means that any model on the Hub that can be used in `transformers` with `trust_remote_code=True`, and that correctly implements attention, can be used in production!
```python
from vllm import LLM
llm = LLM(model=..., task="generate", trust_remote_code=True) # Name or path of your model

Here is what happens in the background:

1. The config is loaded.
2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model sets `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See `vllm/model_executor/models/transformers.py`, which leverages `self.config._attn_implementation = "vllm"`, hence the need to go through `ALL_ATTENTION_FUNCTIONS` (as sketched below).
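
A rough sketch of what such a remote-code model needs, based on the pattern above (names, shapes, and the `__init__` signature are illustrative, not taken from this PR):

```python
# Illustrative sketch of a Hub model compatible with the transformers backend.
# Key points: forward() accepts and forwards **kwargs, attention is dispatched
# through ALL_ATTENTION_FUNCTIONS, and the model opts in via
# _supports_attention_backend = True.
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):
    def __init__(self, config: PretrainedConfig):
        super().__init__()
        self.config = config
        # ... q/k/v/o projections would be defined here ...

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        query_states, key_states, value_states = ...  # compute your projections here
        # When vLLM sets config._attn_implementation = "vllm", this lookup
        # resolves to vLLM's attention function instead of eager/sdpa/flash.
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self, query_states, key_states, value_states, attention_mask, **kwargs
        )
        ...  # output projection and return


class MyModel(PreTrainedModel):
    _supports_attention_backend = True  # the flag checked in step 2 above
```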
That's it!
### ModelScope
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
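
A minimal sketch (the environment variable is `VLLM_USE_MODELSCOPE`; the model ID is a placeholder):

```python
# Enable ModelScope downloads before vLLM resolves the model.
# Shell equivalent: export VLLM_USE_MODELSCOPE=True
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

llm = LLM(model=...)  # A ModelScope model ID instead of a HuggingFace Hub one
```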