Conversation

@Alcanderian (Collaborator) commented Jul 2, 2025

Motivation

ATTENTION: flashinfer cutlass MoE may not work after installing the trtllm wheel.

UPD 072025: achieves up to 134 tps with tp8

UPD 070525: achieves up to 128 tps

SGLANG_TRTLLM_GEN_MOE_EP_SIZE=2 SGLANG_ENABLE_TRTLLM_GEN_MOE=1 python3 -m sglang.launch_server \
--model-path /dev/shm/DeepSeek-R1-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 \
--enable-flashinfer-moe --enable-ep-moe --enable-flashinfer-allreduce-fusion

ENV

pip3 install --no-cache-dir tensorrt-llm==1.0.0rc0 --no-deps
pip3 install --no-cache-dir tensorrt~=10.11.0 \
  tensorrt_cu12_bindings~=10.11.0 tensorrt_cu12~=10.11.0 \
  tensorrt_cu12_libs~=10.11.0 --no-deps
pip3 install --no-cache-dir nvtx mpi4py onnx "onnx_graphsurgeon>=0.5.2" \
  StrEnum "accelerate>=0.25.0" "nvidia-modelopt[torch]~=0.31.0"
pip3 install --no-cache-dir aenum backoff click_option_group \
  colored "diffusers>=0.27.0" etcd3 evaluate meson openai opencv-python-headless
pip3 install --no-cache-dir optimum ordered-set peft polygraphy \
  "tornado>=6.5.0" "setuptools<80" h5py==3.12.1 blake3

pip3 install -e "python[blackwell]"
pip3 install --no-cache-dir nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps
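
Since the note above warns that the trtllm wheel can break the flashinfer cutlass MoE path, a quick check of the pinned versions before launching can help. The snippet below is only a convenience sketch; the distribution names are taken from the pip commands above.

# Sanity-check the pinned packages from the commands above (sketch only).
from importlib.metadata import PackageNotFoundError, version

for pkg in ("tensorrt-llm", "tensorrt", "tensorrt_cu12", "nvidia-nccl-cu12", "nvidia-modelopt"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")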

RUN

SGLANG_ENABLE_TRTLLM_GEN_MOE=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-R1-FP4 \
    --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-moe

# ACC
python3 benchmark/gsm8k/bench_sglang.py --parallel 1400 --num-questions 1400
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:40<00:00, 32.91it/s]
Accuracy: 0.951
Invalid: 0.000
Latency: 40.467 s
Output throughput: 3108.238 token/s

lm_eval
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9659|±  | 0.005|
|     |       |strict-match    |     5|exact_match|↑  |0.9492|±  | 0.006|

# TPS
[2025-07-02 11:51:13 TP0] Decode batch. #running-req: 1, #token: 1236, token usage: 0.00, cuda graph: True, gen throughput (token/s): 119.54, #queue-req:

TODO: add a custom cublas_mm kernel so that the router gemm always gives fp32 output, https://github.com/sgl-project/sglang/pull/7711/files#diff-5b9e34dd492bd8a14702a18b594721091092276fad1cf8736fba6ef1f33c1b04R247

Modifications

Checklist

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @Alcanderian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enabling FP4 quantization for Mixture of Experts (MoE) layers using TensorRT-LLM, aiming to optimize performance for models like DeepSeekV3. It introduces a new, optimized execution flow for MoE operations and includes minor adjustments to existing dependencies and MoE processing logic.

Highlights

  • TensorRT-LLM FP4 MoE Integration: I've integrated a new execution path for Mixture of Experts (MoE) layers that leverages TensorRT-LLM's FP4 quantization and block-scale MoE runner. This path is conditionally enabled and uses torch.ops.trtllm.fp4_quantize and torch.ops.trtllm.fp4_block_scale_moe_runner for optimized processing, with hardcoded parameters for DeepSeekV3.
  • FlashInfer Dependency Adjustment: I've updated the required flashinfer_python version in pyproject.toml and the corresponding version assertion in engine.py from 0.2.7 to 0.2.6.post1.
  • DeepSeekV2 MoE Optimization: For DeepSeekV2 models, I've increased the DUAL_STREAM_TOKEN_THRESHOLD from 1024 to 2048 and reordered the router_logits calculation within the forward_normal_dual_stream method to occur inside the alt_stream context, aiming to improve dual-stream processing efficiency (a sketch follows this list).
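
A rough sketch of the dual-stream ordering described in the last bullet (method and attribute names are illustrative, not the exact sglang code): the gate and routed experts run on the alternate stream, overlapping with the shared experts on the default stream.

import torch

# Raised from 1024 to 2048 in this PR.
DUAL_STREAM_TOKEN_THRESHOLD = 2048

def forward_normal_dual_stream(self, hidden_states: torch.Tensor) -> torch.Tensor:
    current_stream = torch.cuda.current_stream()
    self.alt_stream.wait_stream(current_stream)
    with torch.cuda.stream(self.alt_stream):
        # router_logits is now computed inside the alt_stream context,
        # together with the routed-expert path.
        router_logits = self.gate(hidden_states)
        routed_output = self.experts(hidden_states, router_logits)
    # Shared experts overlap on the default stream.
    shared_output = self.shared_experts(hidden_states)
    current_stream.wait_stream(self.alt_stream)
    return shared_output + routed_output
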
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

|Feature|Command|Description|
|---|---|---|
|Code Review|/gemini review|Performs a code review for the current pull request in its current state.|
|Pull Request Summary|/gemini summary|Provides a summary of the current pull request in its current state.|
|Comment|@gemini-code-assist|Responds in comments when explicitly tagged, both in issue comments and review comments.|
|Help|/gemini help|Displays a list of available commands.|

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for TensorRT-LLM FP4, which is a great optimization. The changes are mostly on the right track, but I've found a critical logic error in modelopt_quant.py that makes some code unreachable and could lead to incorrect behavior. Additionally, the new TRT-LLM implementation contains several hardcoded, model-specific values that should be generalized. Addressing these issues will make the new feature more robust and maintainable.

@Alcanderian (Collaborator, Author)

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables FP4 quantization for MoE layers using TensorRT-LLM, which is a significant performance optimization. The changes involve adding new code paths controlled by the SGLANG_ENABLE_TRTLLM_GEN_MOE environment variable, including new weight processing logic and a custom kernel runner for MoE. A comprehensive test suite from TensorRT-LLM has also been added to validate the new functionality.

My review focuses on improving the maintainability and robustness of the new code. I've pointed out several hardcoded values that are tied to specific model configurations or kernel internals, which could make the code brittle. I've suggested using named constants or deriving these values from configuration where possible. I also noted some areas where the code complexity has increased and could benefit from refactoring to improve clarity. Overall, this is a great step towards higher performance, and addressing these points will make the implementation more solid.

Comment on lines +898 to +908
if not ENABLE_TRTLMM_GEN_MOE:
    w13_blockscale_swizzled = self.swizzle_blockscale(layer.w13_weight_scale)
    layer.w13_weight_scale = Parameter(
        w13_blockscale_swizzled, requires_grad=False
    )
else:
    layer.w13_weight_scale = Parameter(
        layer.w13_weight_scale.data, requires_grad=False
    )
    self.trtllm_gen_process_expert_w3_w1_weight(layer)
    self.trtllm_gen_process_expert_w3_w1_weight_scale_nvfp4(layer)

Contributor

Severity: medium

The logic within process_weights_after_loading has become quite complex with the addition of the ENABLE_TRTLMM_GEN_MOE flag. Specifically, the reassignment of layer.w13_weight_scale (and w2_weight_scale) is confusing. Initially, it's a ModelWeightParameter, but in this if not ENABLE_TRTLMM_GEN_MOE branch, it's replaced with a Parameter containing swizzled scales. This can make the code harder to understand and maintain.

Consider refactoring the logic for the two paths into separate helper methods to improve clarity. For example, _process_weights_flashinfer() and _process_weights_trtllm(). Also, instead of reassigning layer.w13_weight_scale, it might be clearer to use a different attribute name for the swizzled scales, like layer.w13_blockscale_swizzled (as it was before), to avoid confusion about its type and content.
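
A rough sketch of that split, reusing the names from the diff and the suggestion above (only the w13 handling is shown; w2 would follow the same pattern):

from torch.nn.parameter import Parameter

def process_weights_after_loading(self, layer):
    if ENABLE_TRTLMM_GEN_MOE:
        self._process_weights_trtllm(layer)
    else:
        self._process_weights_flashinfer(layer)

def _process_weights_flashinfer(self, layer):
    # Keep the swizzled scales under a dedicated attribute instead of
    # overwriting w13_weight_scale, so its type and content stay unambiguous.
    layer.w13_blockscale_swizzled = Parameter(
        self.swizzle_blockscale(layer.w13_weight_scale), requires_grad=False
    )

def _process_weights_trtllm(self, layer):
    layer.w13_weight_scale = Parameter(
        layer.w13_weight_scale.data, requires_grad=False
    )
    self.trtllm_gen_process_expert_w3_w1_weight(layer)
    self.trtllm_gen_process_expert_w3_w1_weight_scale_nvfp4(layer)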

@trevor-m (Collaborator) commented Jul 3, 2025

Nice work! FYI, the TRTLLM Gen MoE kernel is being added to flashinfer, so that will help ease the dependency setup.

self.quant_config = quant_config

if ENABLE_TRTLMM_GEN_MOE:
    self.kernel = torch.ops.trtllm.nvfp4_gemm

Collaborator

The trtllm nvfp4_gemm needs profiling to find the best config. How is that handled in sglang?

Collaborator

There is probably none, from what I can see. By default it will use some heuristics, but they are not the best. There is an autotuner in trtllm that handles this: https://github.com/NVIDIA/TensorRT-LLM/pull/5207/files. I can bring that autotuner in with the flashinfer integration: flashinfer-ai/flashinfer#1214

Collaborator Author

We have no autotuner in sglang for now.

Collaborator

Then the performance may not be optimal without profiling.


# NOTE: For some unknown reason, router_gemm seems to degrade accept length.
if ENABLE_TRTLMM_GEN_MOE and not self.is_nextn:
    return torch.ops.trtllm.dsv3_router_gemm_op(

Collaborator

How is this one different from the dsv3_router_gemm on line 255?

Collaborator Author

The dsv3_router_gemm below does not have a cublas fallback, and F.linear does not support fp16 input with fp32 output for now.
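
For illustration only (not the kernel from this PR): with plain PyTorch, fp32 router logits require upcasting before F.linear, which is exactly the extra cast a cublas-backed router gemm with native fp16-in/fp32-out would avoid.

import torch
import torch.nn.functional as F

def router_logits_fp32_fallback(hidden_states: torch.Tensor,
                                router_weight: torch.Tensor) -> torch.Tensor:
    # hidden_states: [num_tokens, hidden_size], fp16/bf16
    # router_weight: [num_experts, hidden_size], fp16/bf16
    # F.linear cannot emit fp32 from fp16 inputs directly, so upcast first.
    return F.linear(hidden_states.float(), router_weight.float())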

Comment on lines 444 to 445
if ENABLE_TRTLMM_GEN_MOE:
    router_logits = self.gate(hidden_states)

Collaborator

Why this change?

Collaborator Author

This change is to reproduce the timeline from the trtllm profile.

tile_tokens_dim = 8

# https://github.com/NVIDIA/TensorRT-LLM/blob/v1.0.0rc1/tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py#L195
outputs = torch.ops.trtllm.fp4_block_scale_moe_runner(

Collaborator

I don't know if you have benchmarked the prefill perf separately. trtllm gen MoE is optimized for decoding, i.e., small input num_tokens.

Collaborator Author

I have not benchmarked prefill, but I can see that it is also called in the trtllm prefill stage. By the way, is there any fp4_gemm interface that can use the same weight/scaling-factor layout as trtllm gen MoE? We cannot change the layout of the weights/scaling factors in the forward stage.

Collaborator

Yes, I believe trtllm has added it: https://github.com/NVIDIA/TensorRT-LLM/blob/d4d21a106e8176bf20e627ee432cca5ef920c325/tests/unittest/_torch/thop/test_fp4_gemm_quantize.py#L122
But the weight layout is different from MoE.
Are you planning to run MoE as individual gemms in the forward stage? If so, this might be close to what you want:
https://github.com/NVIDIA/TensorRT-LLM/blob/d4d21a106e8176bf20e627ee432cca5ef920c325/tensorrt_llm/_torch/models/modeling_llama_min_latency.py#L47
The precision is not nvfp4.

Collaborator Author

Unfortunately, neither of these seems to help in our case.

@hlu1 (Collaborator) commented Jul 17, 2025

What I wanted to say is that the layout changes will not be compatible with single gemms, because some of the layout shuffling is for swiglu fusion. The only possibility is the nvfp4 version of GatedMLP kernels, which shares the same weight layout as MoE. These kernels can be built. They are just not checked into trtllm.

Collaborator

Actually, now that I think about it, the GatedMLP kernels would be optimized for low latency too.

@azhurkevich (Collaborator)

based tbh

@azhurkevich (Collaborator)

FYI, we merged flashinfer-ai/flashinfer#1214, and the plan is to enable these kernels through flashinfer in SGL.
