[POC] enable trtllm fp4 from trtllm wheel #7711
Conversation
Summary of Changes
Hello @Alcanderian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enabling FP4 quantization for Mixture of Experts (MoE) layers using TensorRT-LLM, aiming to optimize performance for models like DeepSeekV3. It introduces a new, optimized execution flow for MoE operations and includes minor adjustments to existing dependencies and MoE processing logic.
Highlights
- TensorRT-LLM FP4 MoE Integration: I've integrated a new execution path for Mixture of Experts (MoE) layers that leverages TensorRT-LLM's FP4 quantization and block-scale MoE runner. This path is conditionally enabled and uses `torch.ops.trtllm.fp4_quantize` and `torch.ops.trtllm.fp4_block_scale_moe_runner` for optimized processing, with hardcoded parameters for DeepSeekV3.
- FlashInfer Dependency Adjustment: I've updated the required `flashinfer_python` version in `pyproject.toml` and the corresponding version assertion in `engine.py` from `0.2.7` to `0.2.6.post1`.
- DeepSeekV2 MoE Optimization: For DeepSeekV2 models, I've increased the `DUAL_STREAM_TOKEN_THRESHOLD` from 1024 to 2048 and reordered the `router_logits` calculation within the `forward_normal_dual_stream` method to occur inside the `alt_stream` context, aiming to improve dual-stream processing efficiency (a sketch of that pattern follows this list).
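For reference, the dual-stream pattern described in the last highlight looks roughly like the sketch below. This is a minimal illustration built only on standard `torch.cuda` stream APIs; the attribute names `shared_experts`, `experts`, `gate`, and `alt_stream` follow the discussion in this PR, but the body is not the actual sglang implementation.

```python
import torch


def forward_normal_dual_stream(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # Overlap the shared-expert GEMMs (current stream) with routing and the
    # routed experts (alternate stream). Sketch only, not the PR's code.
    current_stream = torch.cuda.current_stream()
    self.alt_stream.wait_stream(current_stream)

    shared_output = self.shared_experts(hidden_states)

    with torch.cuda.stream(self.alt_stream):
        # The PR moves the router_logits computation into this context.
        router_logits = self.gate(hidden_states)
        routed_output = self.experts(hidden_states, router_logits)

    current_stream.wait_stream(self.alt_stream)
    return shared_output + routed_output
```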
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state.
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state.
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | `/gemini help` | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces support for TensorRT-LLM FP4, which is a great optimization. The changes are mostly on the right track, but I've found a critical logic error in `modelopt_quant.py` that makes some code unreachable and could lead to incorrect behavior. Additionally, the new TRT-LLM implementation contains several hardcoded, model-specific values that should be generalized. Addressing these issues will make the new feature more robust and maintainable.
/gemini review
Code Review
This pull request enables FP4 quantization for MoE layers using TensorRT-LLM, which is a significant performance optimization. The changes involve adding new code paths controlled by the `SGLANG_ENABLE_TRTLLM_GEN_MOE` environment variable, including new weight processing logic and a custom kernel runner for MoE. A comprehensive test suite from TensorRT-LLM has also been added to validate the new functionality.
My review focuses on improving the maintainability and robustness of the new code. I've pointed out several hardcoded values that are tied to specific model configurations or kernel internals, which could make the code brittle. I've suggested using named constants or deriving these values from configuration where possible. I also noted some areas where the code complexity has increased and could benefit from refactoring to improve clarity. Overall, this is a great step towards higher performance, and addressing these points will make the implementation more solid.
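For orientation, the `ENABLE_TRTLMM_GEN_MOE` flag seen in the diff hunks below is presumably derived from this environment variable at import time; a minimal sketch of that gating pattern (the exact parsing in the PR may differ):

```python
import os

# Assumed mapping from the environment variable to the module-level flag used
# in the diffs below; the real parsing logic in the PR may differ.
ENABLE_TRTLMM_GEN_MOE = os.environ.get(
    "SGLANG_ENABLE_TRTLLM_GEN_MOE", "false"
).lower() in ("1", "true")
```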
if not ENABLE_TRTLMM_GEN_MOE:
    w13_blockscale_swizzled = self.swizzle_blockscale(layer.w13_weight_scale)
    layer.w13_weight_scale = Parameter(
        w13_blockscale_swizzled, requires_grad=False
    )
else:
    layer.w13_weight_scale = Parameter(
        layer.w13_weight_scale.data, requires_grad=False
    )
    self.trtllm_gen_process_expert_w3_w1_weight(layer)
    self.trtllm_gen_process_expert_w3_w1_weight_scale_nvfp4(layer)
The logic within `process_weights_after_loading` has become quite complex with the addition of the `ENABLE_TRTLMM_GEN_MOE` flag. Specifically, the reassignment of `layer.w13_weight_scale` (and `w2_weight_scale`) is confusing. Initially, it's a `ModelWeightParameter`, but in this `if not ENABLE_TRTLMM_GEN_MOE` branch, it's replaced with a `Parameter` containing swizzled scales. This can make the code harder to understand and maintain.

Consider refactoring the logic for the two paths into separate helper methods to improve clarity, for example `_process_weights_flashinfer()` and `_process_weights_trtllm()`. Also, instead of reassigning `layer.w13_weight_scale`, it might be clearer to use a different attribute name for the swizzled scales, like `layer.w13_blockscale_swizzled` (as it was before), to avoid confusion about its type and content.
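A minimal sketch of that suggested split, reusing the attribute and helper names from the diff above; it illustrates the reviewer's proposal rather than the PR's actual code:

```python
from torch.nn import Parameter


# Sketch of methods on the FP4 MoE quantization class.
def process_weights_after_loading(self, layer) -> None:
    # Dispatch once instead of interleaving both paths inline.
    if ENABLE_TRTLMM_GEN_MOE:
        self._process_weights_trtllm(layer)
    else:
        self._process_weights_flashinfer(layer)


def _process_weights_flashinfer(self, layer) -> None:
    # Keep the swizzled scales under a dedicated attribute instead of
    # overwriting layer.w13_weight_scale with a differently laid-out tensor.
    layer.w13_blockscale_swizzled = Parameter(
        self.swizzle_blockscale(layer.w13_weight_scale), requires_grad=False
    )


def _process_weights_trtllm(self, layer) -> None:
    layer.w13_weight_scale = Parameter(
        layer.w13_weight_scale.data, requires_grad=False
    )
    self.trtllm_gen_process_expert_w3_w1_weight(layer)
    self.trtllm_gen_process_expert_w3_w1_weight_scale_nvfp4(layer)
```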
Nice work! FYI, the TRTLLM Gen MoE kernel is being added to flashinfer, so that will help ease the dependency setup.
self.quant_config = quant_config

if ENABLE_TRTLMM_GEN_MOE:
    self.kernel = torch.ops.trtllm.nvfp4_gemm
The trtllm nvfp4_gemm needs profiling to find the best config. How is that handled in sglang?
There is probably none from what I can see. By default it'll use some heuristics, but they are not the best. There is an autotuner in trtllm that handles this: https://github.com/NVIDIA/TensorRT-LLM/pull/5207/files. I can bring that autotuner in with the flashinfer integration: flashinfer-ai/flashinfer#1214
We have no autotuner in sglang for now.
Then the performance may not be optimal without profiling.
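Until the flashinfer autotuner integration lands, a generic benchmark-and-pick loop is one way to approximate profiling-based config selection. The sketch below is not the TensorRT-LLM autotuner; `gemm_fn` and `configs` are hypothetical stand-ins for a tunable kernel and its candidate configurations.

```python
import torch


def pick_best_config(gemm_fn, configs, a, b, warmup=3, iters=10):
    """Time each candidate config with CUDA events and return the fastest."""
    best_cfg, best_ms = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):
            gemm_fn(a, b, cfg)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            gemm_fn(a, b, cfg)
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end) / iters
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg
```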
# NOTE: For some unknown reason, router_gemm seems to degrade accept length.
if ENABLE_TRTLMM_GEN_MOE and not self.is_nextn:
    return torch.ops.trtllm.dsv3_router_gemm_op(
How is this one different from the `dsv3_router_gemm` on line 255?
The `dsv3_router_gemm` below does not have a cuBLAS fallback, and `F.linear` does not support fp16 input with fp32 output for now.
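To illustrate the `F.linear` limitation mentioned here: the output dtype follows the input dtype, so producing fp32 router logits from fp16 activations requires an explicit upcast, which is the extra work a dedicated router GEMM with a cuBLAS fallback avoids. Shapes below are illustrative (DeepSeek-V3's 7168-dim hidden state and 256 routed experts).

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 7168, dtype=torch.float16, device="cuda")    # tokens x hidden
w = torch.randn(256, 7168, dtype=torch.float16, device="cuda")  # experts x hidden

# F.linear keeps the input dtype: fp16 in -> fp16 logits out.
logits_fp16 = F.linear(x, w)
assert logits_fp16.dtype == torch.float16

# fp32 router logits currently require an explicit upcast of the operands.
logits_fp32 = F.linear(x.float(), w.float())
assert logits_fp32.dtype == torch.float32
```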
if ENABLE_TRTLMM_GEN_MOE:
    router_logits = self.gate(hidden_states)
Why this change?
This change is to reproduce the trtllm profile's timeline.
tile_tokens_dim = 8

# https://github.com/NVIDIA/TensorRT-LLM/blob/v1.0.0rc1/tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py#L195
outputs = torch.ops.trtllm.fp4_block_scale_moe_runner(
I don't know if you have benchmarked the prefill perf separately. trtllm gen MoE is optimized for decoding, i.e., small input num_tokens.
I have not benchmarked prefill, but I can see that it is also called in the trtllm prefill stage. By the way, is there any fp4_gemm interface that can use the same weight/scaling-factor layout as trtllm gen MoE? We cannot change the layout of the weights/scaling factors in the forward stage.
Yes, I believe trtllm has added it: https://github.com/NVIDIA/TensorRT-LLM/blob/d4d21a106e8176bf20e627ee432cca5ef920c325/tests/unittest/_torch/thop/test_fp4_gemm_quantize.py#L122
But the weight layout is different from MoE.
Are you planning to run MoE as individual gemms in the forward stage? If so, this might be close to what you want:
https://github.com/NVIDIA/TensorRT-LLM/blob/d4d21a106e8176bf20e627ee432cca5ef920c325/tensorrt_llm/_torch/models/modeling_llama_min_latency.py#L47
The precision is not nvfp4.
Unfortunately, neither of these seems to provide effective help for our case.
What I wanted to say is that the layout changes will not be compatible with single gemms, because some of the layout shuffling is for swiglu fusion. The only possibility is the nvfp4 version of GatedMLP kernels, which shares the same weight layout as MoE. These kernels can be built. They are just not checked into trtllm.
Actually, now that I think about it, the GatedMLP kernels would be optimized for low latency too.
based tbh
FYI, we merged flashinfer-ai/flashinfer#1214 and the plan is to enable these kernels through flashinfer in SGL.
Motivation
ATTENTION: flashinfer cutlass MoE may not work after installing the trtllm wheel.
UPD 072025: achieved up to 134 tps with tp8
UPD 070525: achieved up to 128 tps
ENV
RUN
TODO: add a custom cublas_mm kernel so the router gemm always produces fp32 output: https://github.com/sgl-project/sglang/pull/7711/files#diff-5b9e34dd492bd8a14702a18b594721091092276fad1cf8736fba6ef1f33c1b04R247
Modifications
Checklist