
Conversation

@heheda12345 (Collaborator) commented Sep 29, 2025

Purpose

This PR adds support for the new DeepSeek V3.2 model. It is a verified, stable version that can be used now.

We are also working on a cleaner implementation in #25896, which will be the one merged into the main branch.
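
For reference, a minimal offline-inference sketch against this branch (the Hugging Face model id below is an assumption for illustration; use the official DeepSeek V3.2 checkpoint name from the model card):

    # Minimal sketch: offline inference with vLLM's Python API.
    # NOTE: the model id is assumed for illustration; check the official
    # DeepSeek V3.2 model card for the exact name and GPU requirements.
    from vllm import LLM, SamplingParams

    llm = LLM(model="deepseek-ai/DeepSeek-V3.2-Exp", tensor_parallel_size=8)
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain sparse attention in one paragraph."], params)
    print(outputs[0].outputs[0].text)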

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

heheda12345 and others added 30 commits September 20, 2025 18:24
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>

fix smoke tests

Signed-off-by: Lucas Wilkinson <[email protected]>

moved to FlashMLA repo

Signed-off-by: Lucas Wilkinson <[email protected]>

removed pytorch shim

Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
…ild-sparse-flash-mla

Build and bind sparse-FlashMLA kernels
…integration

[Feature] DeepGEMM integration
* add env and MQA path for both prefill and decode

Signed-off-by: Lucas Wilkinson <[email protected]>

* fix shapes

Signed-off-by: Lucas Wilkinson <[email protected]>

---------

Signed-off-by: Lucas Wilkinson <[email protected]>
* code from ds

Signed-off-by: youkaichao <[email protected]>

* doc from ds

Signed-off-by: youkaichao <[email protected]>

* Fixes for support_materials/2-tilelang/

Signed-off-by: mgoin <[email protected]>

* Fix example 1

Signed-off-by: mgoin <[email protected]>

* Fix Einsum in deepgemm

* Fix `libc10.so` import error

* fix reference code

Signed-off-by: youkaichao <[email protected]>

* adding missing indexer args

* passing index args into the module

* init

Signed-off-by: Chen Zhang <[email protected]>

* build indexer k cache metadata

* prefill indexer, but weight_proj will output -inf

* unquantized paged indexer, still has -inf issue

* remove support material

* adding topk_indices mask

* add weight scale

* unittest infrastructure and fix weight_proj, numeric error due to quantization

* varlen prefill passed

* paged prefill

* add indices mask

---------

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
* prefill mla

Signed-off-by: Chen Zhang <[email protected]>

* can run now

Signed-off-by: Chen Zhang <[email protected]>

* tmp

Signed-off-by: Chen Zhang <[email protected]>

* can output the first token

Signed-off-by: Chen Zhang <[email protected]>

* fix bug

Signed-off-by: Chen Zhang <[email protected]>

* remove some debug

Signed-off-by: Chen Zhang <[email protected]>

* update

Signed-off-by: Chen Zhang <[email protected]>

* hack through cu_seqlen_ks exploding issue

* update basic.py

Signed-off-by: Chen Zhang <[email protected]>

* remove some unnecessary changes

Signed-off-by: Chen Zhang <[email protected]>

* clean up

Signed-off-by: Chen Zhang <[email protected]>

---------

Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Yongye Zhu <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 added the new-model (Requests to new models) label Sep 29, 2025
@mergify bot added the documentation (Improvements or additions to documentation), ci/build, deepseek (Related to DeepSeek models), rocm (Related to AMD ROCm), speculative-decoding, v1, and tpu (Related to Google TPUs) labels Sep 29, 2025

mergify bot commented Sep 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 29, 2025
Signed-off-by: Chen Zhang <[email protected]>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for the Deepseek v3.2 model, which involves substantial changes across the codebase. Key additions include updates to the build system for a new version of FlashMLA, new CUDA kernels for custom attention and quantization mechanisms specific to this model, and a new sparse attention backend. The changes are extensive, touching upon model definitions, attention operations, and KV cache management to accommodate the unique architecture of Deepseek v3.2. My review has identified a critical bug in the new Triton utilities for sequence packing and unpacking, as well as a significant performance issue in a data gathering function. I have provided detailed suggestions to address these points. Overall, this is a major and valuable contribution that expands vLLM's model support.
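
For context on the sequence packing/unpacking mentioned above: variable-length requests are typically flattened into one packed tensor, with a cu_seq_lens prefix sum recording where each sequence starts (the same convention the test code below uses). A minimal PyTorch sketch of the idea, illustrative only and not the PR's Triton kernels:

    import torch

    def pack(seqs):
        # Flatten a list of (len_i, head_dim) tensors into one packed tensor
        # plus a cumulative-length index: cu_seq_lens[b] is the start of batch b.
        lens = torch.tensor([s.shape[0] for s in seqs])
        cu_seq_lens = torch.cat([torch.zeros(1, dtype=torch.long), lens.cumsum(0)])
        return torch.cat(seqs, dim=0), cu_seq_lens

    def unpack(packed, cu_seq_lens):
        # Recover per-sequence views by slicing between consecutive offsets.
        return [packed[cu_seq_lens[b]:cu_seq_lens[b + 1]]
                for b in range(len(cu_seq_lens) - 1)]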

Comment on lines +546 to +582
expected_value = []
expected_scale = []
for b in range(batch_size):
    s = cu_seq_lens[b + 1] - cu_seq_lens[b]
    if s == 0:
        continue
    tot = cdiv(s, block_size)
    blocks = block_table[b, :tot]

    # All blocks except the last are full; gather their values and scales.
    full_block = torch.arange(tot - 1, device=kv_cache.device, dtype=torch.int32)
    non_remaining_value = kv_cache[blocks[full_block], : block_size * head_dim].view(-1, head_dim)
    non_remaining_scale = kv_cache[blocks[full_block], block_size * head_dim:].view(-1, 4)

    # The last block is only partially filled with the remaining tokens.
    remaining = s - (tot - 1) * block_size
    value = torch.cat([non_remaining_value, kv_cache[blocks[-1], :remaining * head_dim].view(-1, head_dim)], dim=0)
    scale = torch.cat([non_remaining_scale, kv_cache[blocks[-1], block_size * head_dim: block_size * head_dim + remaining * 4].view(-1, 4)], dim=0)

    expected_value.append(value)
    expected_scale.append(scale)

gather_value = torch.cat(expected_value, dim=0).view(-1, head_dim)
gather_scale = torch.cat(expected_scale, dim=0).view(-1, 4)
gather_value = gather_value.view(torch.float8_e4m3fn)
gather_scale = gather_scale.view(torch.float32)
dst_value.copy_(gather_value)
dst_scale.copy_(gather_scale)

@gemini-code-assist bot (Contributor) commented (severity: high)

The current implementation of cp_gather_indexer_k_quant_cache is inefficient for GPU execution. It iterates over batches in a Python loop and uses list.append followed by torch.cat to build the final tensors. This pattern creates many small, intermediate tensors and can lead to significant overhead from memory allocations and kernel launch latencies, especially with a large number of batches.

A more performant approach is to pre-allocate the destination tensors (dst_value, dst_scale) and write the results directly into the appropriate slices within the loop. This avoids the overhead of list manipulation and multiple concatenations, resulting in a more efficient, vectorized operation.

    for b in range(batch_size):
        s = cu_seq_lens[b + 1] - cu_seq_lens[b]
        if s == 0:
            continue

        start_idx = cu_seq_lens[b]

        tot = cdiv(s, block_size)
        blocks = block_table[b, :tot]

        num_full_blocks = s // block_size
        num_full_tokens = 0
        if num_full_blocks > 0:
            full_blocks_indices = blocks[:num_full_blocks]
            num_full_tokens = num_full_blocks * block_size

            value_full = kv_cache[full_blocks_indices, :block_size * head_dim].reshape(-1, head_dim)
            dst_value[start_idx:start_idx + num_full_tokens].copy_(value_full.view(torch.float8_e4m3fn))

            scale_full = kv_cache[full_blocks_indices, block_size * head_dim:].reshape(-1, 4)
            dst_scale[start_idx:start_idx + num_full_tokens].copy_(scale_full.view(torch.float32))

        remaining = s - num_full_tokens
        if remaining > 0:
            last_block_idx = blocks[-1]

            rem_start_idx = start_idx + num_full_tokens

            value_rem = kv_cache[last_block_idx, :remaining * head_dim].view(-1, head_dim)
            dst_value[rem_start_idx:rem_start_idx + remaining].copy_(value_rem.view(torch.float8_e4m3fn))

            scale_rem = kv_cache[last_block_idx, block_size * head_dim:block_size * head_dim + remaining * 4].view(-1, 4)
            dst_scale[rem_start_idx:rem_start_idx + remaining].copy_(scale_rem.view(torch.float32))
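
A note on the layout both versions assume (inferred from the indexing itself, not documented separately): each kv_cache block stores block_size * head_dim bytes of fp8 values first, followed by 4 bytes of fp32 scale per token, so the slice [: block_size * head_dim] selects values and the remainder selects scales. Pre-allocating dst_value and dst_scale and copying into slices also avoids the intermediate tensors that repeated torch.cat calls would allocate.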

@simon-mo changed the title from "[Model] Support Deepseek v3.2" to "[Model] Support Deepseek V3.2" Sep 29, 2025
luccafong and others added 2 commits September 29, 2025 02:35
* fix unpack kernel

* increase atol/rtol in test case

---------

Co-authored-by: Lucia Fang <[email protected]>
zyongye and others added 2 commits September 29, 2025 11:35
youkaichao and others added 2 commits September 29, 2025 21:21
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
@youkaichao (Member) commented

Closing as we will merge #25896.

@youkaichao closed this Sep 30, 2025