
@kylesayrs (Contributor) commented Jun 23, 2025

Purpose

Without this change, attempting to CPU-offload the encoder raises a device error:

RuntimeError: Tensor on device meta is not on the expected device cuda:0!
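For context, a minimal illustration (shapes hypothetical) of the offloaded state: accelerate's hook-based offloading leaves a data-less placeholder on the meta device in place of the real parameter, so any code path that reads the attribute directly never sees real weights.

import torch

# Hypothetical stand-in for an offloaded parameter: a meta tensor carries
# shape and dtype metadata but has no storage. The real weight only appears
# when a module's pre-forward hook onloads it.
weight = torch.empty(1500, 1280, device="meta")
print(weight.device)  # meta
print(weight.shape)   # torch.Size([1500, 1280])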

Changes

  • Instead of reading the embed_positions.weight attribute directly, leverage the HF hooks attached to the embed_positions module to onload the weight properly (see the sketch below).
    • This incurs a small, once-per-request runtime cost, since F.embedding must be called with an identity matrix rather than grabbing the weight value directly.
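A minimal sketch of the pattern (illustrative names and sizes, not the actual diff): going through the module's forward instead of its weight attribute gives the attached accelerate hook a chance to onload the weight to the execution device first.

import torch
import torch.nn as nn

# Illustrative stand-in for the encoder's embed_positions module
embed_positions = nn.Embedding(1500, 1280)

# Before: direct attribute access bypasses any hooks attached to the module,
# so under offloading it returns the meta-device placeholder.
# embed_pos = embed_positions.weight

# After: calling the module runs its (hook-wrapped) forward; looking up every
# index recovers the full weight table while letting the hook fire first.
position_ids = torch.arange(embed_positions.num_embeddings)
embed_pos = embed_positions(position_ids)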

Testing

Use the following test script to verify that generation works with this device map:

device_map={
    "model.encoder": "cpu",
    "model.decoder": 0,
    "proj_out": 0,
},
test_whisper_offload.py
import torch
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor


def load_sample(processor):
    # Fetch one sample from the People's Speech test split and preprocess it for Whisper
    ds = load_dataset(
        "MLCommons/peoples_speech",
        "test", split="test[:1]",
        trust_remote_code=True,
    )

    sample = next(iter(ds))
    sample = processor(
        audio=sample["audio"]["array"],
        sampling_rate=sample["audio"]["sampling_rate"],
        text=(" " + sample["text"].capitalize()),
        add_special_tokens=True,
        return_tensors="pt",
    )

    sample["input_features"] = sample["input_features"].to(dtype=torch.bfloat16)
    sample["decoder_input_ids"] = torch.tensor([processor.tokenizer.prefix_tokens])
    del sample["labels"]

    return sample


if __name__ == "__main__":
    model_id = "openai/whisper-large-v3"
    model = WhisperForConditionalGeneration.from_pretrained(
        model_id,
        device_map={
            "model.encoder": "cpu",
            "model.decoder": 0,
            "proj_out": 0,
        },
        torch_dtype=torch.bfloat16
    )
    processor = WhisperProcessor.from_pretrained(model_id)

    # With the encoder offloaded, its position-embedding weight is a meta-device placeholder
    assert model.model.encoder.embed_positions.weight.device == torch.device("meta")
    sample = load_sample(processor)
    output = model.generate(**sample, language="en")
    print(processor.batch_decode(output, skip_special_tokens=True))
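On a single-GPU machine, running this script should print the decoded transcript. The assertion beforehand confirms that the encoder's position-embedding weight really is the meta-device placeholder, i.e. that the encoder is offloaded when generate runs.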


@zucchini-nlp (Member) left a comment

Thanks, I like that we removed direct access to weight.data. Can you also un-skip offload tests in whisper and make sure they are green?

For example:

@unittest.skip(reason="Some undefined behavior encountered with tiny versions of this model. Skip for now.")
def test_cpu_offload(self):
    pass

@unittest.skip(reason="Some undefined behavior encountered with tiny versions of this model. Skip for now.")
def test_disk_offload_bin(self):
    pass

@unittest.skip(reason="Some undefined behavior encountered with tiny versions of this model. Skip for now.")
def test_disk_offload_safetensors(self):
    pass

@kylesayrs (Contributor, Author) commented Jun 24, 2025

@zucchini-nlp @SunMarc Tests unskipped and passing!

@zucchini-nlp (Member)

run-slow: whisper

This comment contains run-slow, running the specified jobs:

models: ['models/whisper']
quantizations: [] ...

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kylesayrs (Contributor, Author)

@zucchini-nlp Does this test failure indicate something to fix, or is this test noisy?

First differing element 3:
" Fol[1422 chars]ugitives bug out bindle of news that is my segment. Meanwhile."
" Fol[1422 chars]ugitives bug out bindle of news that is my segment. Meanwhile!"

@vasqu (Contributor) left a comment

@kylesayrs Can you rebase/merge? The failing tests are expected, no worries :D

cc @gante @eustlb for viz

@kylesayrs (Contributor, Author)

@vasqu Merged, glad to hear it :)

@vasqu enabled auto-merge (squash) June 26, 2025 15:43
@vasqu merged commit 0a8081b into huggingface:main Jun 26, 2025
20 checks passed
@vasqu (Contributor) commented Jun 26, 2025

Thanks @kylesayrs 🤗

zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* fix cpu offloading for whisper

Signed-off-by: Kyle Sayers <[email protected]>

* unskip offloading tests

Signed-off-by: Kyle Sayers <[email protected]>

* revert small change

Signed-off-by: Kyle Sayers <[email protected]>

* remove tests

Signed-off-by: Kyle Sayers <[email protected]>

---------

Signed-off-by: Kyle Sayers <[email protected]>