diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 4879a7bf045e..cbfabff80e06 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -89,6 +89,22 @@
   - local: using-diffusers/image_quality
     title: FreeU
 
+- title: Quantization
+  isExpanded: false
+  sections:
+  - local: quantization/overview
+    title: Overview
+  - local: quantization/bitsandbytes
+    title: bitsandbytes
+  - local: quantization/gguf
+    title: gguf
+  - local: quantization/torchao
+    title: torchao
+  - local: quantization/quanto
+    title: quanto
+  - local: quantization/modelopt
+    title: NVIDIA ModelOpt
+
 - title: Hybrid Inference
   isExpanded: false
   sections:
@@ -171,22 +187,6 @@
   - local: training/ddpo
     title: Reinforcement learning training with DDPO
 
-- title: Quantization
-  isExpanded: false
-  sections:
-  - local: quantization/overview
-    title: Getting started
-  - local: quantization/bitsandbytes
-    title: bitsandbytes
-  - local: quantization/gguf
-    title: gguf
-  - local: quantization/torchao
-    title: torchao
-  - local: quantization/quanto
-    title: quanto
-  - local: quantization/modelopt
-    title: NVIDIA ModelOpt
-
 - title: Model accelerators and hardware
   isExpanded: false
   sections:
diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md
index f97119d5f4cd..e7aaacfd08b4 100644
--- a/docs/source/en/quantization/bitsandbytes.md
+++ b/docs/source/en/quantization/bitsandbytes.md
@@ -13,56 +13,49 @@ specific language governing permissions and limitations under the License.
 
 # bitsandbytes
 
-[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.
+[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is a k-bit quantization library with two quantization algorithms.
 
-4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
+- [LLM.int8](https://huggingface.co/papers/2208.07339) reduces memory use by half by quantizing most features to 8-bits, while handling outliers with 16-bit operations, all without performance loss.
+- [QLoRA](https://huggingface.co/papers/2305.14314) compresses weights to 4-bits, and adds a small set of trainable low-rank adapters, reducing memory use without hurting performance.
 
-This guide demonstrates how quantization can enable running
-[FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
-on less than 16GB of VRAM and even on a free Google
-Colab instance.
+This guide demonstrates how quantization enables inference with large diffusion models on less than 16GB of memory.
 
-![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png)
-
-To use bitsandbytes, make sure you have the following libraries installed:
+Make sure the bitsandbytes library is installed.
 
 ```bash
-pip install diffusers transformers accelerate bitsandbytes -U
+pip install -U diffusers transformers accelerate bitsandbytes
 ```
 
-Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. 
This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers. +Pass a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`] to quantize a model. The [`BitsAndBytesConfig`] contains your quantization configuration. The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers. bitsandbytes is supported in Transformers and Diffusers, so you can quantize [`FluxTransformer2DModel`] and [`transformers.T5EncoderModel`]. - - +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. -Quantizing a model in 8-bit halves the memory-usage: +This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers. -bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the -[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`]. +> [!NOTE] +> For Ada and higher-series GPUs, change `torch_dtype` to `torch.bfloat16`. -For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`. + + -> [!TIP] -> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers. +Quantizing a model to 8-bits reduces memory usage by 2x. ```py -from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig -from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig import torch from diffusers import AutoModel +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig from transformers import T5EncoderModel +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,) - text_encoder_2_8bit = T5EncoderModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", quantization_config=quant_config, - torch_dtype=torch.float16, + dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,) - transformer_8bit = AutoModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", @@ -71,26 +64,12 @@ transformer_8bit = AutoModel.from_pretrained( ) ``` -By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. - -```diff -transformer_8bit = AutoModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quant_config, -+ torch_dtype=torch.float32, -) -``` - -Let's generate an image using our quantized models. - -Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the -CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory. +Set `device_map="cuda"` to place the pipeline on an accelerator like a GPU. 
```py
from diffusers import FluxPipeline

-pipe = FluxPipeline.from_pretrained(
+pipeline = FluxPipeline.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     transformer=transformer_8bit,
     text_encoder_2=text_encoder_2_8bit,
@@ -98,57 +77,35 @@ pipe = FluxPipeline.from_pretrained(
-    device_map="auto",
+    device_map="cuda",
 )
 
-pipe_kwargs = {
-    "prompt": "A cat holding a sign that says hello world",
-    "height": 1024,
-    "width": 1024,
-    "guidance_scale": 3.5,
-    "num_inference_steps": 50,
-    "max_sequence_length": 512,
-}
+prompt="""
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
 
-image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
+image = pipeline(prompt).images[0]
 ```
 
-
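+To keep the non-quantized modules in a different precision, a minimal sketch (reusing the same FLUX.1-dev checkpoint and 8-bit config as above; `torch.float32` is only an example value) is to pass `torch_dtype` when loading the quantized transformer.
+
+```py
+import torch
+from diffusers import AutoModel
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+
+quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
+
+# keep the non-quantized modules, such as torch.nn.LayerNorm, in float32 instead of the float16 default
+transformer_8bit = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quant_config,
+    torch_dtype=torch.float32,
+)
+```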
- -
- -When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage. - -Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`]. -
-Quantizing a model in 4-bit reduces your memory-usage by 4x: - -bitsandbytes is supported in both Transformers and Diffusers, so you can can quantize both the -[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`]. - -For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`. - -> [!TIP] -> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers. +Quantizing a model to 4-bit reduces your memory usage by 4x. ```py -from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig -from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig import torch from diffusers import AutoModel +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig from transformers import T5EncoderModel +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,) - text_encoder_2_4bit = T5EncoderModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", quantization_config=quant_config, - torch_dtype=torch.float16, + dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,) - transformer_4bit = AutoModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", @@ -157,96 +114,45 @@ transformer_4bit = AutoModel.from_pretrained( ) ``` -By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. - -```diff -transformer_4bit = AutoModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quant_config, -+ torch_dtype=torch.float32, -) -``` - -Let's generate an image using our quantized models. - -Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory. +Set `device_map="cuda"` to place the pipeline on an accelerator like a GPU. ```py from diffusers import FluxPipeline -pipe = FluxPipeline.from_pretrained( +pipeline = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", transformer=transformer_4bit, text_encoder_2=text_encoder_2_4bit, torch_dtype=torch.float16, - device_map="auto", + device_map="cuda", ) -pipe_kwargs = { - "prompt": "A cat holding a sign that says hello world", - "height": 1024, - "width": 1024, - "guidance_scale": 3.5, - "num_inference_steps": 50, - "max_sequence_length": 512, -} +prompt=""" +cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California +highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain +""" -image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0] +image = pipeline(prompt).images[0] ``` -
- -
- -When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage. - -Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`]. -
- - -Training with 8-bit and 4-bit weights are only supported for training *extra* parameters. - - - -Check your memory footprint with the `get_memory_footprint` method: +Use [`~ModelMixin.get_memory_footprint`] to estimate the memory footprint of the model parameters. It does not estimate the inference memory requirements. ```py print(model.get_memory_footprint()) ``` -Note that this only tells you the memory footprint of the model params and does _not_ estimate the inference memory requirements. - -Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters: - -```py -from diffusers import AutoModel, BitsAndBytesConfig +## LLM.int8 -quantization_config = BitsAndBytesConfig(load_in_4bit=True) - -model_4bit = AutoModel.from_pretrained( - "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer" -) -``` - -## 8-bit (LLM.int8() algorithm) - - - -Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)! - - - -This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion. +This section goes over outlier thresholds and skipping module conversion, features specific to the LLM.int8 algorithm. ### Outlier threshold -An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning). +An "outlier" is a hidden state value greater than a certain threshold and they're computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or fine-tuning). -To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]: +Experiment with the `llm_int8_threshold` argument to find the best threshold for your model. ```py from diffusers import AutoModel, BitsAndBytesConfig @@ -264,7 +170,7 @@ model_8bit = AutoModel.from_pretrained( ### Skip module conversion -For some models, you don't need to quantize every module to 8-bit which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]: +Some models don't require every module to be quantized to 8-bits. This can actually cause instability. For example, in [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), skip the `proj_out` module using the `llm_int8_skip_modules` argument. 
```py from diffusers import SD3Transformer2DModel, BitsAndBytesConfig @@ -280,53 +186,37 @@ model_8bit = SD3Transformer2DModel.from_pretrained( ) ``` +## QLoRA -## 4-bit (QLoRA algorithm) - - - -Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes). - - - -This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization. - +This section goes over compute data type, Normal Float 4 (NF4) data type, and nested quantization, features specific to QLoRA. ### Compute data type -To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]: +Change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` argument to speed up computation. Use the same `bnb_4bit_compute_dtype` and `torch_dtype` values to remain consistent. ```py import torch -from diffusers import BitsAndBytesConfig +from diffusers import BitsAndBytesConfig, AutoModel -quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) +quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) +model = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.bfloat16, +) ``` ### Normal Float 4 (NF4) -NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]: +NF4 is a 4-bit data type adapted for weights initialized from a normal distribution. Use NF4 for training 4-bit base models. For inference, NF4 does not have a significant impact on performance. -```py -from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig -from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig - -from diffusers import AutoModel -from transformers import T5EncoderModel +Configure the `bnb_4bit_quant_type` argument to `"nf4"`. -quant_config = TransformersBitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_quant_type="nf4", -) - -text_encoder_2_4bit = T5EncoderModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="text_encoder_2", - quantization_config=quant_config, - torch_dtype=torch.float16, -) +```py +from diffusers import AutoModel, BitsAndBytesConfig -quant_config = DiffusersBitsAndBytesConfig( +quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", ) @@ -339,32 +229,14 @@ transformer_4bit = AutoModel.from_pretrained( ) ``` -For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values. - ### Nested quantization -Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. +Nested quantization quantizes the already quantized weights to save an additional 0.4 bits/parameter. Set `bnb_4bit_use_double_quant=True` to enable nested quantization. 
```py -from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig -from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig - -from diffusers import AutoModel -from transformers import T5EncoderModel - -quant_config = TransformersBitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_use_double_quant=True, -) - -text_encoder_2_4bit = T5EncoderModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="text_encoder_2", - quantization_config=quant_config, - torch_dtype=torch.float16, -) +from diffusers import AutoModel, BitsAndBytesConfig -quant_config = DiffusersBitsAndBytesConfig( +quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, ) @@ -377,30 +249,16 @@ transformer_4bit = AutoModel.from_pretrained( ) ``` -## Dequantizing `bitsandbytes` models - -Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model. - -```python -from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig -from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig +## Dequantize a model -from diffusers import AutoModel -from transformers import T5EncoderModel +Dequantizing recovers the model weights original precision but you may experience a small loss in quality. Make sure you have enough GPU memory to fit the dequantized model. -quant_config = TransformersBitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_use_double_quant=True, -) +Call [`~ModelMixin.dequantize`] to dequantize a model. -text_encoder_2_4bit = T5EncoderModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="text_encoder_2", - quantization_config=quant_config, - torch_dtype=torch.float16, -) +```python +from diffusers import AutoModel, BitsAndBytesConfig -quant_config = DiffusersBitsAndBytesConfig( +quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, ) @@ -412,50 +270,34 @@ transformer_4bit = AutoModel.from_pretrained( torch_dtype=torch.float16, ) -text_encoder_2_4bit.dequantize() transformer_4bit.dequantize() ``` ## torch.compile -Speed up inference with `torch.compile`. Make sure you have the latest `bitsandbytes` installed and we also recommend installing [PyTorch nightly](https://pytorch.org/get-started/locally/). +Speed up inference with [torch.compile](../optimization/fp16#torchcompile). Make sure you have the latest bitsandbytes and [PyTorch nightly](https://pytorch.org/get-started/locally/) installed. 
- - ```py -torch._dynamo.config.capture_dynamic_output_shape_ops = True - -quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) -transformer_4bit = AutoModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quant_config, - torch_dtype=torch.float16, -) -transformer_4bit.compile(fullgraph=True) -``` +import torch +from diffusers import BitsAndBytesConfig, AutoModel - - +torch._dynamo.config.capture_dynamic_output_shape_ops = True -```py -quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True) -transformer_4bit = AutoModel.from_pretrained( +quant_config = BitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = AutoModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) -transformer_4bit.compile(fullgraph=True) +transformer_8bit.compile(fullgraph=True) ``` - - - -On an RTX 4090 with compilation, 4-bit Flux generation completed in 25.809 seconds versus 32.570 seconds without. -Check out the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) for more details. +On an RTX 4090 with compilation, 4-bit Flux generation completed in 25.809 seconds versus 32.570 seconds without. Check out the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) for more details. ## Resources -* [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4) -* [Training](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization) \ No newline at end of file +* Read [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about 8-bit quantization. +* Read [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) to learn more about 4-bit quantization. +* Check out this [notebook](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4) for an example of FLUX.1-dev inference on a free-tier instance of Colab. +* Take a look at this [training script](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization) which quantizes the base model with bitsandbytes. \ No newline at end of file diff --git a/docs/source/en/quantization/gguf.md b/docs/source/en/quantization/gguf.md index 47804c102da2..bb49996195c6 100644 --- a/docs/source/en/quantization/gguf.md +++ b/docs/source/en/quantization/gguf.md @@ -13,74 +13,80 @@ specific language governing permissions and limitations under the License. # GGUF -The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block wise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported. +GGUF is a binary file format for storing and loading [GGML](https://github.com/ggerganov/ggml) models for inference. 
It's designed to support various blockwise quantization options, single-file deployment, and fast loading and saving. -The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant. +Diffusers only supports loading GGUF *model* files as opposed to an entire GGUF pipeline checkpoint. -Before starting please install gguf in your environment +
+Supported quantization types -```shell -pip install -U gguf -``` +- BF16 +- Q4_0 +- Q4_1 +- Q5_0 +- Q5_1 +- Q8_0 +- Q2_K +- Q3_K +- Q4_K +- Q5_K +- Q6_K -Since GGUF is a single file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`]. +
-When using GGUF checkpoints, the quantized weights remain in a low memory `dtype`(typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`. +Make sure gguf is installed. + +```bash +pip install -U gguf +``` -The functions used for dynamic dequantizatation are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the Pytorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade). +Load GGUF files with [`~loaders.FromSingleFileMixin.from_single_file`] and pass [`GGUFQuantizationConfig`] to configure the `compute_type`. Quantized weights remain in a low memory data type and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. ```python import torch +from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig -from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig - -ckpt_path = ( - "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf" -) -transformer = FluxTransformer2DModel.from_single_file( - ckpt_path, +transformer = AutoModel.from_single_file( + "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf", quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), torch_dtype=torch.bfloat16, ) -pipe = FluxPipeline.from_pretrained( +pipeline = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16, + device_map="cuda" ) -pipe.enable_model_cpu_offload() -prompt = "A cat holding a sign that says hello world" -image = pipe(prompt, generator=torch.manual_seed(0)).images[0] +prompt = """ +cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California +highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain +""" +image = pipeline(prompt).images[0] image.save("flux-gguf.png") ``` -## Using Optimized CUDA Kernels with GGUF +## CUDA kernels -Optimized CUDA kernels can accelerate GGUF quantized model inference by approximately 10%. This functionality requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the kernels library: +Optimized CUDA kernels accelerate GGUF model inference by ~10%. You need a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the [kernels](https://huggingface.co/docs/kernels/index) library. -```shell +```bash pip install -U kernels ``` -Once installed, set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to use optimized kernels when available. Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable `DIFFUSERS_GGUF_CUDA_KERNELS=false`. +Set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to enable optimized kernels. CUDA kernels introduce minor numerical differences compared to the original GGUF implementation, which may cause subtle visual variations in generated images. 
-## Supported Quantization Types +```python +import os -- BF16 -- Q4_0 -- Q4_1 -- Q5_0 -- Q5_1 -- Q8_0 -- Q2_K -- Q3_K -- Q4_K -- Q5_K -- Q6_K +# Enable CUDA kernels for ~10% speedup +os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true" +# Disable CUDA kernels +# os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "false" +``` ## Convert to GGUF -Use the Space below to convert a Diffusers checkpoint into the GGUF format for inference. -run conversion: +Use the Space below to convert a Diffusers checkpoint into a GGUF file. +GGUF files stored in the [Diffusers format](../using-diffusers/other-formats) require the model's `config` path. Provide the `subfolder` argument if the model config is inside a subfolder. ```py import torch +from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig -from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig - -ckpt_path = ( - "https://huggingface.co/sayakpaul/different-lora-from-civitai/blob/main/flux_dev_diffusers-q4_0.gguf" -) -transformer = FluxTransformer2DModel.from_single_file( - ckpt_path, +transformer = AutoModel.from_single_file( + "https://huggingface.co/sayakpaul/different-lora-from-civitai/blob/main/flux_dev_diffusers-q4_0.gguf", quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), config="black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16, ) -pipe = FluxPipeline.from_pretrained( - "black-forest-labs/FLUX.1-dev", - transformer=transformer, - torch_dtype=torch.bfloat16, -) -pipe.enable_model_cpu_offload() -prompt = "A cat holding a sign that says hello world" -image = pipe(prompt, generator=torch.manual_seed(0)).images[0] -image.save("flux-gguf.png") -``` - -When using Diffusers format GGUF checkpoints, it's a must to provide the model `config` path. If the -model config resides in a `subfolder`, that needs to be specified, too. \ No newline at end of file +``` \ No newline at end of file diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md index 38abeeac6d4d..07a6295bb9e4 100644 --- a/docs/source/en/quantization/overview.md +++ b/docs/source/en/quantization/overview.md @@ -11,34 +11,34 @@ specific language governing permissions and limitations under the License. --> -# Getting started +# Overview -Quantization focuses on representing data with fewer bits while also trying to preserve the precision of the original data. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits. +Quantization represents data in lower precision to save memory. For example, quantizing model weights from fp32 to fp16 halves the model size. Lower precision also speeds up inference because calculations take less time with fewer bits. -Diffusers supports multiple quantization backends to make large diffusion models like [Flux](../api/pipelines/flux) more accessible. This guide shows how to use the [`~quantizers.PipelineQuantizationConfig`] class to quantize a pipeline during its initialization from a pretrained or non-quantized checkpoint. +Diffusers supports multiple quantization backends. This makes large diffusion models accessible on all hardware types. 
This guide shows how to use [`~quantizers.PipelineQuantizationConfig`] to quantize a pipeline from a pretrained checkpoint. ## Pipeline-level quantization -There are two ways to use [`~quantizers.PipelineQuantizationConfig`] depending on how much customization you want to apply to the quantization configuration. +You can use [`~quantizers.PipelineQuantizationConfig`] in two ways. -- for basic use cases, define the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments -- for granular quantization control, define a `quant_mapping` that provides the quantization configuration for individual model components +- For a single backend, define the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments. +- For multiple backends, define a `quant_mapping` that provides the quantization configuration for individual model components. -### Basic quantization +### Single quantization backend -Initialize [`~quantizers.PipelineQuantizationConfig`] with the following parameters. +Initialize [`~quantizers.PipelineQuantizationConfig`] with these parameters. -- `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`. +- `quant_backend` specifies which quantization backend to use. Supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`. - `quant_kwargs` specifies the quantization arguments to use. -> [!TIP] -> These `quant_kwargs` arguments are different for each backend. Refer to the [Quantization API](../api/quantization) docs to view the arguments for each backend. +> [!NOTE] +> The `quant_kwargs` arguments differ for each backend. Refer to the [Quantization API](../api/quantization) docs to view the specific arguments for each backend. -- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Typically, you should quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact. +- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact. `components_to_quantize` accepts either a list for multiple models or a string for a single model. -The example below loads the bitsandbytes backend with the following arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`], `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype`. +The example below configures the bitsandbytes backend with the `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype` arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`]. ```py import torch @@ -55,21 +55,19 @@ pipeline_quant_config = PipelineQuantizationConfig( Pass the `pipeline_quant_config` to [`~DiffusionPipeline.from_pretrained`] to quantize the pipeline. 
```py -pipe = DiffusionPipeline.from_pretrained( +pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, -).to("cuda") - -image = pipe("photo of a cute dog").images[0] + device_map="cuda" +) ``` +### Multi-quantization backend -### Advanced quantization +The `quant_mapping` argument provides more options for quantizing each individual component in a pipeline to combine different quantization backends. -The `quant_mapping` argument provides more options for how to quantize each individual component in a pipeline, like combining different quantization backends. - -Initialize [`~quantizers.PipelineQuantizationConfig`] and pass a `quant_mapping` to it. The `quant_mapping` allows you to specify the quantization options for each component in the pipeline such as the transformer and text encoder. +Initialize [`~quantizers.PipelineQuantizationConfig`] and pass a `quant_mapping` to it. The `quant_mapping` lets you specify the quantization options for each component in the pipeline such as the transformer and text encoder. The example below uses two quantization backends, [`~quantizers.quantization_config.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder. @@ -91,10 +89,10 @@ pipeline_quant_config = PipelineQuantizationConfig( ) ``` -There is a separate bitsandbytes backend in [Transformers](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig). You need to import and use [`transformers.BitsAndBytesConfig`] for components that come from Transformers. For example, `text_encoder_2` in [`FluxPipeline`] is a [`~transformers.T5EncoderModel`] from Transformers so you need to use [`transformers.BitsAndBytesConfig`] instead of [`diffusers.BitsAndBytesConfig`]. +[Transformers](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig) has a separate bitsandbytes backend. Import and use [`transformers.BitsAndBytesConfig`] for components that come from Transformers. For example, `text_encoder_2` in [`FluxPipeline`] is from Transformers, so use [`transformers.BitsAndBytesConfig`] instead of [`diffusers.BitsAndBytesConfig`]. > [!TIP] -> Use the [basic quantization](#basic-quantization) method above if you don't want to manage these distinct imports or aren't sure where each pipeline component comes from. +> Use the [single quantization backend](#single-quantization-backend) method above if you don't want to manage these distinct imports or aren't sure where each pipeline component comes from. ```py import torch @@ -116,26 +114,51 @@ pipeline_quant_config = PipelineQuantizationConfig( Pass the `pipeline_quant_config` to [`~DiffusionPipeline.from_pretrained`] to quantize the pipeline. ```py -pipe = DiffusionPipeline.from_pretrained( +pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, -).to("cuda") + device_map="cuda" +) -image = pipe("photo of a cute dog").images[0] +image = pipeline("photo of a cute dog").images[0] +``` + +## Saving a quantized pipeline + +Use the [`~PushToHubMixin.push_to_hub`] method to push the quantized pipeline to the Hub. This saves a quantization `config.json` file and the quantized model weights. + +```py +pipeline.push_to_hub("my-repo") +``` + +You can also save the model locally with [`~ModelMixin.save_pretrained`]. 
+ +```py +pipeline.save_pretrained("path/to/save/") +``` + +Reload the quantized model with [`~ModelMixin.from_pretrained`] without defining a [`~quantizers.PipelineQuantizationConfig`]. + +```py +from diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained( + "nunchaku-tech/nunchaku-flux.1-dev" +) ``` ## Resources -Check out the resources below to learn more about quantization. +Check out these resources to learn more about quantization. -- If you are new to quantization, we recommend checking out the following beginner-friendly courses in collaboration with DeepLearning.AI. +- If you're new to quantization, check out these beginner-friendly courses in collaboration with DeepLearning.AI. - [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/) - [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/) -- Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) if you're interested in adding a new quantization method. +- Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) if you want to add a new quantization method. -- The Transformers quantization [Overview](https://huggingface.co/docs/transformers/quantization/overview#when-to-use-what) provides an overview of the pros and cons of different quantization backends. +- The Transformers quantization [Overview](https://huggingface.co/docs/transformers/quantization/overview#when-to-use-what) shows the pros and cons of different quantization backends. -- Read the [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization) blog post for a brief introduction to each quantization backend, how to choose a backend, and combining quantization with other memory optimizations. +- Read the [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization) blog post for an introduction to each quantization backend, how to choose a backend, and combining quantization with other memory optimizations. diff --git a/docs/source/en/quantization/quanto.md b/docs/source/en/quantization/quanto.md index d322d76be267..dbeff8cf53ed 100644 --- a/docs/source/en/quantization/quanto.md +++ b/docs/source/en/quantization/quanto.md @@ -13,136 +13,116 @@ specific language governing permissions and limitations under the License. # Quanto -[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). It has been designed with versatility and simplicity in mind: +[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend inside the [Optimum](https://huggingface.co/docs/optimum/index) ecosystem. It's designed to work in eager mode, automatically inserts quantization/dequantization steps, supports a variety of weights and activations, and features accelerated matrix multiplication on CUDA devices. -- All features are available in eager mode (works with non-traceable models) -- Supports quantization aware training -- Quantized models are compatible with `torch.compile` -- Quantized models are Device agnostic (e.g CUDA,XPU,MPS,CPU) +Quanto doesn't quantize `nn.Conv2d` and `nn.LayerNorm` modules because Diffusers can only quantize weights in `nn.Linear` layers. 
-In order to use the Quanto backend, you will first need to install `optimum-quanto>=0.2.6` and `accelerate` +Make sure Quanto and [Accelerate](https://huggingface.co/docs/optimum/index) are installed. -```shell -pip install optimum-quanto accelerate +```bash +pip install -U optimum-quanto accelerate ``` -Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto. +Create and pass `weights_dtype` to [`QuantoConfig`] configure the target data type to quantize a model to. The example below quantizes the model to `float8`. Check [`QuantoConfig`] for a list of supported weight types. ```python import torch -from diffusers import FluxTransformer2DModel, QuantoConfig +from diffusers import AutoModel, QuantoConfig, FluxPipeline -model_id = "black-forest-labs/FLUX.1-dev" quantization_config = QuantoConfig(weights_dtype="float8") transformer = FluxTransformer2DModel.from_pretrained( - model_id, + "black-forest-labs/FLUX.1-dev", subfolder="transformer", quantization_config=quantization_config, torch_dtype=torch.bfloat16, ) -pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch_dtype) -pipe.to("cuda") - -prompt = "A cat holding a sign that says hello world" -image = pipe( - prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512 -).images[0] -image.save("output.png") +pipeline = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + transformer=transformer, + torch_dtype=torch.bfloat16, + device_map="cuda" +) +prompt = """ +cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California +highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain +""" +image = pipeline(prompt).images[0] +image.save("flux-quanto.png") ``` -## Skipping Quantization on specific modules - -It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`. Please ensure that the modules passed in to this argument match the keys of the modules in the `state_dict` +[`QuantoConfig`] also works with single files with [`~loaders.FromOriginalModelMixin.from_single_file`]. ```python import torch -from diffusers import FluxTransformer2DModel, QuantoConfig +from diffusers import AutoModel, QuantoConfig -model_id = "black-forest-labs/FLUX.1-dev" -quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"]) -transformer = FluxTransformer2DModel.from_pretrained( - model_id, - subfolder="transformer", - quantization_config=quantization_config, - torch_dtype=torch.bfloat16, +quantization_config = QuantoConfig(weights_dtype="float8") +transformer = AutoModel.from_single_file( + "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16 ) ``` -## Using `from_single_file` with the Quanto Backend +## torch.compile -`QuantoConfig` is compatible with `~FromOriginalModelMixin.from_single_file`. +Quanto supports torch.compile for `int8` weights only. 
```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import FluxPipeline, AutoModel, QuantoConfig
 
-ckpt_path = "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors"
-quantization_config = QuantoConfig(weights_dtype="float8")
-transformer = FluxTransformer2DModel.from_single_file(ckpt_path, quantization_config=quantization_config, torch_dtype=torch.bfloat16)
+quantization_config = QuantoConfig(weights_dtype="int8")
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
+)
+transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    transformer=transformer,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
 ```
 
-## Saving Quantized models
-
-Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method.
+## Skipping quantization on specific modules
 
-The serialization and loading requirements are different for models quantized directly with the Quanto library and models quantized
-with Diffusers using Quanto as the backend. It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained`
+Use `modules_to_not_convert` to skip quantization on specific modules. The modules passed to this argument must match the module keys in `state_dict`.
 
 ```python
 import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import AutoModel, QuantoConfig
 
-model_id = "black-forest-labs/FLUX.1-dev"
-quantization_config = QuantoConfig(weights_dtype="float8")
-transformer = FluxTransformer2DModel.from_pretrained(
-    model_id,
+quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     quantization_config=quantization_config,
     torch_dtype=torch.bfloat16,
 )
-# save quantized model to reuse
-transformer.save_pretrained("")
-
-# you can reload your quantized model with
-model = FluxTransformer2DModel.from_pretrained("")
 ```
 
-## Using `torch.compile` with Quanto
-
-Currently the Quanto backend supports `torch.compile` for the following quantization types:
+## Saving quantized models
 
-- `int8` weights
+Save a Quanto model with [`~ModelMixin.save_pretrained`]. Models quantized directly with the Quanto library - not as a backend in Diffusers - can't be loaded in Diffusers with [`~ModelMixin.from_pretrained`]. 
```python import torch -from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig - -model_id = "black-forest-labs/FLUX.1-dev" -quantization_config = QuantoConfig(weights_dtype="int8") -transformer = FluxTransformer2DModel.from_pretrained( - model_id, - subfolder="transformer", - quantization_config=quantization_config, - torch_dtype=torch.bfloat16, -) -transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True) +from diffusers import AutoModel, QuantoConfig -pipe = FluxPipeline.from_pretrained( - model_id, transformer=transformer, torch_dtype=torch_dtype +quantization_config = QuantoConfig(weights_dtype="float8") +transformer = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, ) -pipe.to("cuda") -images = pipe("A cat holding a sign that says hello").images[0] -images.save("flux-quanto-compile.png") -``` - -## Supported Quantization Types - -### Weights - -- float8 -- int8 -- int4 -- int2 - +transformer.save_pretrained("path/to/saved/model") +# Reload quantized model +model = AutoModel.from_pretrained("path/to/saved/model") +``` \ No newline at end of file
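+
+The reloaded model can be plugged back into a pipeline the same way as a freshly quantized one. A short sketch, reusing the FLUX.1-dev pipeline setup from the examples above and the `model` variable reloaded in the previous snippet:
+
+```python
+import torch
+from diffusers import FluxPipeline
+
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    transformer=model,  # the quantized transformer reloaded with from_pretrained above
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
+image = pipeline("cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California").images[0]
+```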