[Model] Support Deepseek V3.2 #25869
Conversation
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
fix smoke tests
Signed-off-by: Lucas Wilkinson <[email protected]>
moved to FlashMLA repo
Signed-off-by: Lucas Wilkinson <[email protected]>
removed pytorch shim
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
setup sparse attention backend
Signed-off-by: Lucas Wilkinson <[email protected]>
…ild-sparse-flash-mla Build and bind sparse-FlashMLA kernels
…integration [Feature] DeepGEMM integration
* and env and MQA path for both prefill and decode
  Signed-off-by: Lucas Wilkinson <[email protected]>
* fix shapes
  Signed-off-by: Lucas Wilkinson <[email protected]>
---------
Signed-off-by: Lucas Wilkinson <[email protected]>
* code from ds
  Signed-off-by: youkaichao <[email protected]>
* doc from ds
  Signed-off-by: youkaichao <[email protected]>
* Fixes for support_materials/2-tilelang/
  Signed-off-by: mgoin <[email protected]>
* Fix example 1
  Signed-off-by: mgoin <[email protected]>
* Fix Einsum in deepgemm
* Fix `libc10.so` unimported error
* fix reference code
  Signed-off-by: youkaichao <[email protected]>
* adding missing indexer args
* passing index args into the module
* init
  Signed-off-by: Chen Zhang <[email protected]>
* build indexer k cache metadata
* prefill indexer, but weight_proj will output -inf
* unquantized paged indexer, still has -inf issue
* remove support material
* adding topk_indices mask
* add weight scale
* unittest infrastructure and fix weight_proj, numeric error due to quantization
* varlen prefill passed
* paged prefill
* add indices mask
---------
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Lucia Fang <[email protected]>
* prefill mla
  Signed-off-by: Chen Zhang <[email protected]>
* can run now
  Signed-off-by: Chen Zhang <[email protected]>
* tmp
  Signed-off-by: Chen Zhang <[email protected]>
* can output the first token
  Signed-off-by: Chen Zhang <[email protected]>
* fix bug
  Signed-off-by: Chen Zhang <[email protected]>
* remove some debug
  Signed-off-by: Chen Zhang <[email protected]>
* update
  Signed-off-by: Chen Zhang <[email protected]>
* hack through cu_seqlen_ks exploding issue
* update basic.py
  Signed-off-by: Chen Zhang <[email protected]>
* remove some unnecessary changes
  Signed-off-by: Chen Zhang <[email protected]>
* clean up
  Signed-off-by: Chen Zhang <[email protected]>
---------
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Yongye Zhu <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Fix MLA for non-dsv32 models
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Code Review
This pull request introduces support for the Deepseek v3.2 model, which involves substantial changes across the codebase. Key additions include updates to the build system for a new version of FlashMLA, new CUDA kernels for custom attention and quantization mechanisms specific to this model, and a new sparse attention backend. The changes are extensive, touching upon model definitions, attention operations, and KV cache management to accommodate the unique architecture of Deepseek v3.2. My review has identified a critical bug in the new Triton utilities for sequence packing and unpacking, as well as a significant performance issue in a data gathering function. I have provided detailed suggestions to address these points. Overall, this is a major and valuable contribution that expands vLLM's model support.
expected_value = []
expected_scale = []
for b in range(batch_size):
    s = cu_seq_lens[b + 1] - cu_seq_lens[b]
    if s == 0:
        continue
    tot = cdiv(s, block_size)
    blocks = block_table[b, :tot]

    value = []
    scale = []
    full_block = torch.arange(tot - 1, device=kv_cache.device, dtype=torch.int32)
    # print(f"full_blocks: {blocks[full_block]}")
    non_remaining_value = kv_cache[blocks[full_block], : block_size * head_dim].view(-1, head_dim)
    non_remaining_scale = kv_cache[blocks[full_block], block_size * head_dim:].view(-1, 4)

    # for i in range(tot - 1):
    #     value.append(kv_cache[blocks[i], :block_size * head_dim])
    #     scale.append(kv_cache[blocks[i], block_size * head_dim:])

    remaining = s - (tot - 1) * block_size
    # value.append(kv_cache[blocks[-1], :remaining * head_dim])
    # scale.append(kv_cache[blocks[-1], block_size * head_dim: block_size * head_dim + remaining * 4])

    value = torch.cat([non_remaining_value, kv_cache[blocks[-1], :remaining * head_dim].view(-1, head_dim)], dim=0)
    scale = torch.cat([non_remaining_scale, kv_cache[blocks[-1], block_size * head_dim: block_size * head_dim + remaining * 4].view(-1, 4)], dim=0)

    expected_value.append(value)
    expected_scale.append(scale)

gather_value = torch.cat(expected_value, dim=0).view(-1, head_dim)
gather_scale = torch.cat(expected_scale, dim=0).view(-1, 4)
gather_value = gather_value.view(torch.float8_e4m3fn)
gather_scale = gather_scale.view(torch.float32)
dst_value.copy_(gather_value)
dst_scale.copy_(gather_scale)
The current implementation of cp_gather_indexer_k_quant_cache is inefficient for GPU execution. It iterates over batches in a Python loop and uses list.append followed by torch.cat to build the final tensors. This pattern creates many small, intermediate tensors and can lead to significant overhead from memory allocations and kernel launch latencies, especially with a large number of batches.
A more performant approach is to pre-allocate the destination tensors (dst_value, dst_scale) and write the results directly into the appropriate slices within the loop. This avoids the overhead of list manipulation and multiple concatenations, resulting in a more efficient operation.
for b in range(batch_size):
    s = cu_seq_lens[b + 1] - cu_seq_lens[b]
    if s == 0:
        continue
    start_idx = cu_seq_lens[b]
    tot = cdiv(s, block_size)
    blocks = block_table[b, :tot]
    num_full_blocks = s // block_size
    num_full_tokens = 0
    if num_full_blocks > 0:
        full_blocks_indices = blocks[:num_full_blocks]
        num_full_tokens = num_full_blocks * block_size
        value_full = kv_cache[full_blocks_indices, :block_size * head_dim].reshape(-1, head_dim)
        dst_value[start_idx:start_idx + num_full_tokens].copy_(value_full.view(torch.float8_e4m3fn))
        scale_full = kv_cache[full_blocks_indices, block_size * head_dim:].reshape(-1, 4)
        dst_scale[start_idx:start_idx + num_full_tokens].copy_(scale_full.view(torch.float32))
    remaining = s - num_full_tokens
    if remaining > 0:
        last_block_idx = blocks[-1]
        rem_start_idx = start_idx + num_full_tokens
        value_rem = kv_cache[last_block_idx, :remaining * head_dim].view(-1, head_dim)
        dst_value[rem_start_idx:rem_start_idx + remaining].copy_(value_rem.view(torch.float8_e4m3fn))
        scale_rem = kv_cache[last_block_idx, block_size * head_dim:block_size * head_dim + remaining * 4].view(-1, 4)
        dst_scale[rem_start_idx:rem_start_idx + remaining].copy_(scale_rem.view(torch.float32))
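As a side note on the pattern being recommended here, the difference between accumulate-then-concat and write-into-preallocated-slices can be illustrated with a small, self-contained sketch; the shapes are toy values and unrelated to the actual kv_cache layout in this PR:

    import torch

    def gather_concat(chunks):
        # builds one small tensor per chunk, then a single large concatenation
        return torch.cat([c * 2 for c in chunks], dim=0)

    def gather_preallocated(chunks):
        # writes each per-chunk result directly into its slice of a pre-allocated buffer
        total = sum(c.shape[0] for c in chunks)
        out = torch.empty(total, chunks[0].shape[1], dtype=chunks[0].dtype)
        offset = 0
        for c in chunks:
            out[offset:offset + c.shape[0]].copy_(c * 2)
            offset += c.shape[0]
        return out

    chunks = [torch.randn(n, 8) for n in (3, 5, 2)]
    assert torch.equal(gather_concat(chunks), gather_preallocated(chunks))

The second form avoids materializing the intermediate list of tensors and the extra copy performed by torch.cat, which is the same trade-off the suggestion above applies to dst_value and dst_scale.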
* fix unpack kernel
* increase atol/rtol in test case
---------
Co-authored-by: Lucia Fang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
fix mtp config and padding
close as we will merge #25896
Purpose
This PR adds support for the new DeepSeek V3.2 model. It is a verified, stable version that can be used now.
We are also working on a cleaner implementation in #25896, which will be the one merged into the main branch.
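For reference, a minimal offline-inference sketch of how the model would be used through the standard vLLM Python API once this support is available; the Hugging Face model ID and the tensor-parallel size below are illustrative assumptions, not values taken from this PR:

    from vllm import LLM, SamplingParams

    # Assumed model ID; check supported_models.md for the actual entry once merged.
    llm = LLM(
        model="deepseek-ai/DeepSeek-V3.2-Exp",
        tensor_parallel_size=8,   # illustrative; a model of this size typically needs multi-GPU tensor parallelism
        trust_remote_code=True,
    )

    outputs = llm.generate(
        ["Explain sparse attention in one sentence."],
        SamplingParams(temperature=0.7, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)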
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.