[Model] Support Deepseek V3.2 #25869
Conversation
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
fix smoke tests
Signed-off-by: Lucas Wilkinson <[email protected]>
moved to FlashMLA repo
Signed-off-by: Lucas Wilkinson <[email protected]>
removed pytorch shim
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
setup sparse attention backend
Signed-off-by: Lucas Wilkinson <[email protected]>
…ild-sparse-flash-mla Build and bind sparse-FlashMLA kernels
…integration [Feature] DeepGEMM integration
* and env and MQA path for both prefill and decode
  Signed-off-by: Lucas Wilkinson <[email protected]>
* fix shapes
  Signed-off-by: Lucas Wilkinson <[email protected]>
---------
Signed-off-by: Lucas Wilkinson <[email protected]>
* code from ds
  Signed-off-by: youkaichao <[email protected]>
* doc from ds
  Signed-off-by: youkaichao <[email protected]>
* Fixes for support_materials/2-tilelang/
  Signed-off-by: mgoin <[email protected]>
* Fix example 1
  Signed-off-by: mgoin <[email protected]>
* Fix Einsum in deepgemm
* Fix `libc10.so` unimported error
* fix reference code
  Signed-off-by: youkaichao <[email protected]>
* adding missing indexer args
* passing index args into the module
* init
  Signed-off-by: Chen Zhang <[email protected]>
* build indexer k cache metadata
* prefill indexer, but weight_proj will output -inf
* unquantized paged indexer, still has -inf issue
* remove support material
* adding topk_indices mask
* add weight scale
* unittest infrastructure and fix weight_proj, numeric error due to quantization
* varlen prefill passed
* paged prefill
* add indices mask
---------
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Lucia Fang <[email protected]>
* prefill mla
  Signed-off-by: Chen Zhang <[email protected]>
* can run now
  Signed-off-by: Chen Zhang <[email protected]>
* tmp
  Signed-off-by: Chen Zhang <[email protected]>
* can output the first token
  Signed-off-by: Chen Zhang <[email protected]>
* fix bug
  Signed-off-by: Chen Zhang <[email protected]>
* remove some debug
  Signed-off-by: Chen Zhang <[email protected]>
* update
  Signed-off-by: Chen Zhang <[email protected]>
* hack through cu_seqlen_ks exploding issue
* update basic.py
  Signed-off-by: Chen Zhang <[email protected]>
* remove some unnecessary changes
  Signed-off-by: Chen Zhang <[email protected]>
* clean up
  Signed-off-by: Chen Zhang <[email protected]>
---------
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Yongye Zhu <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Fix MLA for non-dsv32 models
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Code Review
This pull request introduces support for the Deepseek v3.2 model, which involves substantial changes across the codebase. Key additions include updates to the build system for a new version of FlashMLA, new CUDA kernels for custom attention and quantization mechanisms specific to this model, and a new sparse attention backend. The changes are extensive, touching upon model definitions, attention operations, and KV cache management to accommodate the unique architecture of Deepseek v3.2. My review has identified a critical bug in the new Triton utilities for sequence packing and unpacking, as well as a significant performance issue in a data gathering function. I have provided detailed suggestions to address these points. Overall, this is a major and valuable contribution that expands vLLM's model support.
expected_value = []
expected_scale = []
for b in range(batch_size):
    s = cu_seq_lens[b + 1] - cu_seq_lens[b]
    if s == 0:
        continue
    tot = cdiv(s, block_size)
    blocks = block_table[b, :tot]

    value = []
    scale = []
    full_block = torch.arange(tot - 1, device=kv_cache.device, dtype=torch.int32)
    # print(f"full_blocks: {blocks[full_block]}")
    non_remaining_value = kv_cache[blocks[full_block], : block_size * head_dim].view(-1, head_dim)
    non_remaining_scale = kv_cache[blocks[full_block], block_size * head_dim:].view(-1, 4)

    # for i in range(tot - 1):
    #     value.append(kv_cache[blocks[i], :block_size * head_dim])
    #     scale.append(kv_cache[blocks[i], block_size * head_dim:])

    remaining = s - (tot - 1) * block_size
    # value.append(kv_cache[blocks[-1], :remaining * head_dim])
    # scale.append(kv_cache[blocks[-1], block_size * head_dim: block_size * head_dim + remaining * 4])

    value = torch.cat([non_remaining_value, kv_cache[blocks[-1], :remaining * head_dim].view(-1, head_dim)], dim=0)
    scale = torch.cat([non_remaining_scale, kv_cache[blocks[-1], block_size * head_dim: block_size * head_dim + remaining * 4].view(-1, 4)], dim=0)

    expected_value.append(value)
    expected_scale.append(scale)

gather_value = torch.cat(expected_value, dim=0).view(-1, head_dim)
gather_scale = torch.cat(expected_scale, dim=0).view(-1, 4)
gather_value = gather_value.view(torch.float8_e4m3fn)
gather_scale = gather_scale.view(torch.float32)
dst_value.copy_(gather_value)
dst_scale.copy_(gather_scale)
The current implementation of cp_gather_indexer_k_quant_cache is inefficient for GPU execution. It iterates over batches in a Python loop and uses list.append followed by torch.cat to build the final tensors. This pattern creates many small, intermediate tensors and can lead to significant overhead from memory allocations and kernel launch latencies, especially with a large number of batches.
A more performant approach is to pre-allocate the destination tensors (dst_value, dst_scale) and write the results directly into the appropriate slices within the loop. This avoids the overhead of list manipulation and multiple concatenations, resulting in a more efficient operation.
for b in range(batch_size):
    s = cu_seq_lens[b + 1] - cu_seq_lens[b]
    if s == 0:
        continue
    start_idx = cu_seq_lens[b]
    tot = cdiv(s, block_size)
    blocks = block_table[b, :tot]
    num_full_blocks = s // block_size
    num_full_tokens = 0
    if num_full_blocks > 0:
        full_blocks_indices = blocks[:num_full_blocks]
        num_full_tokens = num_full_blocks * block_size
        value_full = kv_cache[full_blocks_indices, :block_size * head_dim].reshape(-1, head_dim)
        dst_value[start_idx:start_idx + num_full_tokens].copy_(value_full.view(torch.float8_e4m3fn))
        scale_full = kv_cache[full_blocks_indices, block_size * head_dim:].reshape(-1, 4)
        dst_scale[start_idx:start_idx + num_full_tokens].copy_(scale_full.view(torch.float32))
    remaining = s - num_full_tokens
    if remaining > 0:
        last_block_idx = blocks[-1]
        rem_start_idx = start_idx + num_full_tokens
        value_rem = kv_cache[last_block_idx, :remaining * head_dim].view(-1, head_dim)
        dst_value[rem_start_idx:rem_start_idx + remaining].copy_(value_rem.view(torch.float8_e4m3fn))
        scale_rem = kv_cache[last_block_idx, block_size * head_dim:block_size * head_dim + remaining * 4].view(-1, 4)
        dst_scale[rem_start_idx:rem_start_idx + remaining].copy_(scale_rem.view(torch.float32))
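As a side note on the pattern being recommended here, the difference between accumulate-then-concat and write-into-preallocated-slices can be illustrated with a small, self-contained sketch; the shapes are toy values and unrelated to the actual kv_cache layout in this PR:

    import torch

    def gather_concat(chunks):
        # builds one small tensor per chunk, then a single large concatenation
        return torch.cat([c * 2 for c in chunks], dim=0)

    def gather_preallocated(chunks):
        # writes each per-chunk result directly into its slice of a pre-allocated buffer
        total = sum(c.shape[0] for c in chunks)
        out = torch.empty(total, chunks[0].shape[1], dtype=chunks[0].dtype)
        offset = 0
        for c in chunks:
            out[offset:offset + c.shape[0]].copy_(c * 2)
            offset += c.shape[0]
        return out

    chunks = [torch.randn(n, 8) for n in (3, 5, 2)]
    assert torch.equal(gather_concat(chunks), gather_preallocated(chunks))

The second form avoids materializing the intermediate list of tensors and the extra copy performed by torch.cat, which is the same trade-off the suggestion above applies to dst_value and dst_scale.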
* fix unpack kernel
* increase atol/rtol in test case
---------
Co-authored-by: Lucia Fang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
fix mtp config and padding
close as we will merge #25896
Purpose
This PR adds support for the new DeepSeek V3.2 model. It is a verified, stable version that can be used now.
We are also working on a cleaner implementation in #25896, which will be the one merged into the main branch.
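For reference, a minimal offline-inference sketch of how the model would be used through the standard vLLM Python API once this support is available; the Hugging Face model ID and the tensor-parallel size below are illustrative assumptions, not values taken from this PR:

    from vllm import LLM, SamplingParams

    # Assumed model ID; check supported_models.md for the actual entry once merged.
    llm = LLM(
        model="deepseek-ai/DeepSeek-V3.2-Exp",
        tensor_parallel_size=8,   # illustrative; a model of this size typically needs multi-GPU tensor parallelism
        trust_remote_code=True,
    )

    outputs = llm.generate(
        ["Explain sparse attention in one sentence."],
        SamplingParams(temperature=0.7, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)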
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.