
Conversation

ch-wan
Collaborator

@ch-wan commented Jul 11, 2025

Motivation

This pull request extracts the select_experts computation from within FusedMoE and EPMoE, moving it outside these modules (a before/after sketch appears below). This refactoring offers three key benefits:

  • Enabling gate-router fusion.
  • Simplifying MoE's inputs: reducing the number of forward inputs from 16 to 7.
  • Unifying the API with DeepEPMoE.

This PR temporarily disables triton_kernel_moe, which will be added back later.
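
Below is a minimal before/after sketch of the calling convention this change moves to. It is illustrative only: the names (TopK, topk_output) come from this PR, but constructor arguments and surrounding code vary per model.

# Before: the MoE layer consumed raw router logits and ran
# select_experts internally.
router_logits, _ = self.gate(hidden_states)
final_hidden_states = self.experts(hidden_states, router_logits)

# After: a standalone TopK module owns expert selection, and the MoE
# layer consumes its bundled output.
self.topk = TopK(top_k=self.top_k, renormalize=True)      # in __init__

router_logits, _ = self.gate(hidden_states)               # in forward
topk_output = self.topk(hidden_states, router_logits)
final_hidden_states = self.experts(hidden_states, topk_output)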

Modifications

Checklist

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @ch-wan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the Mixture-of-Experts (MoE) expert selection mechanism. The core intent is to centralize and simplify how expert selection parameters are handled and passed throughout the MoE layers and their integrations within various models and quantization methods. By introducing a dedicated TopK class and TopKOutput data structure, I've reduced parameter proliferation in function signatures and improved the modularity and maintainability of the MoE implementation.

Highlights

  • MoE Refactor: Centralized Expert Selection Logic: I've refactored the Mixture-of-Experts (MoE) expert selection logic by introducing a new TopK class and a TopKOutput NamedTuple. The TopK class now encapsulates all parameters and logic related to selecting top-k experts (e.g., top_k, renormalize, use_grouped_topk, custom_routing_function, etc.). The select_experts function, previously returning a tuple, now returns an instance of TopKOutput, which bundles topk_weights, topk_ids, and the original router_logits (see the sketch after this list).
  • Simplified MoE Layer Interfaces: The forward methods across various MoE-related layers (e.g., DeepEPMoE, FusedMoE, and their underlying implementations in fused_moe_native.py, fused_moe_triton/fused_moe.py, fused_moe_triton/layer.py) have been updated. Instead of accepting a multitude of individual parameters for expert selection, they now primarily accept the new TopKOutput object, streamlining their signatures and reducing parameter clutter.
  • Model Integration Updates: All affected model implementations (deepseek.py, deepseek_v2.py, grok.py, hunyuan.py, llama4.py, mixtral.py, olmoe.py, qwen2_moe.py, qwen3_moe.py) have been updated to instantiate the new TopK class in their __init__ methods. Their forward passes now call this TopK instance to obtain the TopKOutput, which is then passed to the respective MoE expert layers. This centralizes the expert selection configuration within each model's MoE layer.
  • Quantization Method Alignment: Various quantization methods (blockwise_int8.py, compressed_tensors_moe.py, fp8.py, gptq.py, modelopt_quant.py, moe_wna16.py, w8a8_fp8.py, w8a8_int8.py) that interact with MoE layers have been updated to consume the new TopKOutput object, ensuring consistency across the MoE pipeline.
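
As a reference for the highlights above, here is a condensed sketch of the two new abstractions. Field names follow this summary; the real TopK class derives from CustomOp and accepts more options (e.g., grouped top-k parameters and correction bias) than shown here.

from typing import Callable, NamedTuple, Optional

import torch

class TopKOutput(NamedTuple):
    # Bundles what select_experts previously returned as a bare tuple.
    topk_weights: torch.Tensor
    topk_ids: torch.Tensor
    router_logits: torch.Tensor

class TopK(torch.nn.Module):  # in the PR this inherits from CustomOp
    def __init__(
        self,
        top_k: int,
        renormalize: bool = False,
        use_grouped_topk: bool = False,
        custom_routing_function: Optional[Callable] = None,
    ):
        super().__init__()
        self.top_k = top_k
        self.renormalize = renormalize
        self.use_grouped_topk = use_grouped_topk
        self.custom_routing_function = custom_routing_function

    def forward(self, hidden_states, router_logits) -> TopKOutput:
        # Dispatches to the standalone select_experts(...) internally.
        ...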
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

The changes refactor the Mixture-of-Experts (MoE) expert selection logic by encapsulating the expert selection process, leading to cleaner and more modular code. I've identified a couple of critical issues that could lead to runtime errors, plus a medium-severity API-design issue whose fix would improve code clarity and prevent future bugs.

@ch-wan force-pushed the cheng/refactor/select_experts branch 2 times, most recently from e842518 to ec627cb on July 11, 2025 23:17
@ch-wan force-pushed the cheng/refactor/select_experts branch from ec627cb to dbcc849 on July 11, 2025 23:27
@ch-wan force-pushed the cheng/refactor/select_experts branch from a793320 to 4fc1a66 on July 12, 2025 11:03
@ch-wan force-pushed the cheng/refactor/select_experts branch from 4fc1a66 to 1c2be52 on July 12, 2025 11:30
@ch-wan changed the title from "[1/N] [WIP] MoE Refactor: refactor select_experts" to "[1/N] MoE Refactor: refactor select_experts" on Jul 19, 2025
@ch-wan requested a review from Copilot on July 19, 2025 06:12
@ch-wan
Collaborator Author

ch-wan commented Jul 19, 2025

/gemini review

@ch-wan
Collaborator Author

ch-wan commented Jul 19, 2025

/gemini summary


Copilot AI left a comment


Pull Request Overview

This pull request refactors the MoE (Mixture of Experts) architecture by extracting the select_experts computation from within FusedMoE and EPMoE modules, moving it outside to enable gate-router fusion and reduce input parameters. The refactoring introduces a new TopK class and TopKOutput data structure to encapsulate expert selection logic.

  • Introduces TopK class to handle expert selection with various topk algorithms and routing functions
  • Creates TopKOutput named tuple to standardize the interface between expert selection and MoE computation
  • Updates all MoE-related classes and quantization methods to use the new topk_output parameter instead of multiple individual parameters

Reviewed Changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated 5 comments.

Summary per file:

File                                                       | Description
python/sglang/srt/layers/moe/topk.py                       | Introduces TopK class and TopKOutput structure for expert selection
python/sglang/srt/layers/moe/fused_moe_triton/layer.py     | Updates FusedMoE to accept TopKOutput instead of routing parameters
python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py | Modifies fused_moe and fused_experts to use TopKOutput
python/sglang/srt/layers/quantization/*.py                 | Updates all quantization methods to use TopKOutput interface
python/sglang/srt/models/*.py                              | Updates all MoE model implementations to use TopK class
test/srt/*.py                                              | Updates test files to use the new TopK interface

Comment on lines 86 to 89
if self.use_grouped_topk:
    assert num_expert_group is not None and topk_group is not None
self.top_k = top_k
self.use_grouped_topk = use_grouped_topk

Copilot AI Jul 19, 2025


The condition if self.use_grouped_topk: references self.use_grouped_topk, but this attribute has not yet been defined at that point in __init__. The parameter use_grouped_topk is only assigned to self.use_grouped_topk after the check.

Suggested change

From:

if self.use_grouped_topk:
    assert num_expert_group is not None and topk_group is not None
self.top_k = top_k
self.use_grouped_topk = use_grouped_topk

To:

self.use_grouped_topk = use_grouped_topk
if self.use_grouped_topk:
    assert num_expert_group is not None and topk_group is not None
self.top_k = top_k


)

if TYPE_CHECKING:
    from sglang.srt.layers.moe.topk import TopKOutput

Copilot AI Jul 19, 2025


This import creates a circular import since we're importing TopKOutput from the same module (sglang.srt.layers.moe.topk) where it's being defined.

Suggested change

From:

    from sglang.srt.layers.moe.topk import TopKOutput

To:

    pass  # Avoid circular import; use a forward reference for TopKOutput

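For reference, the fix that actually landed (per the activity summary later in this thread) keeps the import but defers it to type-checking time rather than replacing it with pass. A minimal sketch of that idiom follows; moe_forward is a hypothetical consumer, not code from this PR.

from __future__ import annotations  # annotations are no longer evaluated at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers, never executed at runtime,
    # so no import cycle is created.
    from sglang.srt.layers.moe.topk import TopKOutput

def moe_forward(topk_output: TopKOutput):  # annotation resolved lazily
    ...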

  hidden_states = hidden_states.view(-1, self.hidden_size)
  router_logits, _ = self.gate(hidden_states)
- final_hidden_states = self.experts(hidden_states, router_logits)
+ topk_output = self.topk(router_logits)

Copilot AI Jul 19, 2025


The TopK.forward method expects hidden_states as the first parameter, but here only router_logits is passed. This should be self.topk(hidden_states, router_logits) to match the method signature.

Suggested change

From:

topk_output = self.topk(router_logits)

To:

topk_output = self.topk(hidden_states, router_logits)


Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the Mixture of Experts (MoE) select_experts logic by encapsulating the expert selection into a dedicated TopK operator, simplifying the API of the MoE layers and improving code clarity and maintainability. The changes are consistently applied across the various MoE implementations and quantization methods.

Comment on lines 98 to 109
  def apply(
      self,
      layer: torch.nn.Module,
      x: torch.Tensor,
-     router_logits: torch.Tensor,
-     top_k: int,
-     renormalize: bool,
-     use_grouped_topk: bool,
+     topk_output: TopKOutput,
      *,
      activation: str = "silu",
      apply_router_weight_on_input: bool = False,
      inplace: bool = True,
      no_combine: bool = False,
      routed_scaling_factor: Optional[float] = None,
  ) -> torch.Tensor:
Contributor


medium

The signature of this abstract method apply doesn't include **kwargs, but several of its implementations in subclasses (e.g., in awq.py, gptq.py, compressed_tensors_moe.py) do include it. This violates the Liskov substitution principle and can lead to unexpected behavior with type checkers and other tools. To improve consistency and maintainability, add **kwargs to the signature here.

Suggested change

From:

def apply(
    self,
    layer: torch.nn.Module,
    x: torch.Tensor,
    topk_output: TopKOutput,
    *,
    activation: str = "silu",
    apply_router_weight_on_input: bool = False,
    inplace: bool = True,
    no_combine: bool = False,
    routed_scaling_factor: Optional[float] = None,
) -> torch.Tensor:

To:

def apply(
    self,
    layer: torch.nn.Module,
    x: torch.Tensor,
    topk_output: TopKOutput,
    *,
    activation: str = "silu",
    apply_router_weight_on_input: bool = False,
    inplace: bool = True,
    no_combine: bool = False,
    routed_scaling_factor: Optional[float] = None,
    **kwargs,
) -> torch.Tensor:

Contributor

Summary of Changes

This pull request undertakes a significant refactoring of the Mixture-of-Experts (MoE) implementation by centralizing the expert selection logic into a new TopK class and standardizing its output through a TopKOutput NamedTuple. This change aims to simplify the interfaces of MoE layers and their associated quantization methods, reduce input parameter count, and ensure API consistency across different MoE implementations, including DeepEPMoE. As part of this transition, the Triton MoE kernel has been temporarily disabled.

Highlights

  • MoE Expert Selection Refactoring: The select_experts computation has been extracted from FusedMoE and EPMoE modules into a new, dedicated TopK class. This centralizes the logic for selecting top-k experts and allows for more flexible and unified handling of routing.
  • Simplified MoE Layer Interfaces: A new TopKOutput NamedTuple is introduced to encapsulate the results of expert selection (weights, IDs, and router logits). MoE layer forward methods and quantization apply methods now accept a single TopKOutput object instead of numerous individual parameters, significantly simplifying their signatures and reducing input complexity from 16 to 7 parameters (a consumption sketch follows this list).
  • API Unification and Input Reduction: This refactoring enables better gate-router fusion and unifies the API with DeepEPMoE, making the overall MoE implementation more consistent and easier to manage across different model architectures.
  • Temporary Triton Kernel Disablement: The triton_kernel_moe functionality has been temporarily disabled as part of this refactoring. It is planned to be re-enabled in a future update once compatibility is ensured.
  • Widespread Model and Test Integration: All affected MoE models (Deepseek, GraniteMoE, Grok, Hunyuan, Llama4, Mixtral, OlmoE, PhiMoE, Qwen2-MoE, Qwen3-MoE) and various test cases have been updated to utilize the new TopK class for expert selection and pass the TopKOutput object to their respective MoE layers and quantization methods.
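
A rough sketch of the consumption pattern described above. Signatures are abbreviated: the real apply methods take more keyword options, and weight attribute names like w13_weight/w2_weight follow FusedMoE conventions rather than being verified against every backend here.

def apply(self, layer, x, topk_output, **kwargs):
    # Expert selection already happened upstream in the TopK module.
    # Some backends unpack the bundle for their own kernels...
    topk_weights, topk_ids, _router_logits = topk_output
    # ...while the Triton path forwards it whole, since fused_experts
    # now accepts a TopKOutput directly.
    return fused_experts(
        x,
        layer.w13_weight,
        layer.w2_weight,
        topk_output=topk_output,
        **kwargs,
    )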
Changelog
  • python/sglang/srt/custom_op.py
    • Added TopK class to the torch.compile workaround for single-token batch sizes, ensuring forward_native is used.
  • python/sglang/srt/layers/linear.py
    • Updated type hint for quant_config parameter to use direct type reference instead of a string forward reference.
  • python/sglang/srt/layers/moe/ep_moe/layer.py
    • Removed various unused imports (Callable, einops, Module, CustomOp, ExpertLocationDispatchInfo).
    • Changed import from select_experts to TopKOutput.
    • Simplified __init__ method by removing parameters related to expert selection (e.g., renormalize, use_grouped_topk, correction_bias), as these are now handled by the TopK object.
    • Updated forward, forward_deepgemm, and forward_normal methods to accept topk_output: TopKOutput instead of router_logits and other individual parameters.
    • Removed internal calls to select_experts, now unpacking topk_weights, topk_ids from topk_output.
  • python/sglang/srt/layers/moe/fused_moe_native.py
    • Removed unused import Callable.
    • Changed import from select_experts to TopKOutput.
    • Updated fused_moe_forward_native and moe_forward_native function signatures to accept topk_output: TopKOutput and removed numerous individual expert selection parameters.
    • Removed internal calls to select_experts, now unpacking topk_weights, topk_ids from topk_output.
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
    • Removed unused import Callable.
    • Changed import from select_experts to TopKOutput.
    • Updated fused_experts function signature to accept topk_output: TopKOutput instead of separate topk_weights and topk_ids.
    • Updated fused_moe function signature to accept topk_output: TopKOutput and removed gating_output, topk, renormalize, and other expert selection parameters.
    • Removed internal calls to select_experts.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Removed unused import Callable.
    • Added import TopKOutput.
    • Simplified __init__ method by removing expert selection parameters (e.g., renormalize, use_grouped_topk, custom_routing_function).
    • Updated forward method to accept topk_output: TopKOutput instead of router_logits.
    • Adjusted parameters passed to self.quant_method.apply to use topk_output.
  • python/sglang/srt/layers/moe/topk.py
    • Added from __future__ import annotations and TYPE_CHECKING for improved type hinting.
    • Introduced TopK class inheriting from CustomOp, encapsulating expert selection logic.
    • Added TopKOutput as a NamedTuple to return topk_weights, topk_ids, and router_logits.
    • Modified the standalone select_experts function to return TopKOutput and updated its signature to use keyword-only arguments for clarity (a usage sketch appears after this changelog).
    • Implemented forward_native, forward_cuda, forward_cpu, and forward_npu methods within the TopK class, which internally call select_experts.
  • python/sglang/srt/layers/quantization/__init__.py
    • Added from __future__ import annotations and TYPE_CHECKING.
    • Added TopKOutput import under TYPE_CHECKING.
    • Updated new_apply method signature to accept topk_output: TopKOutput and removed many individual parameters.
  • python/sglang/srt/layers/quantization/awq.py
    • Added TYPE_CHECKING and TopKOutput import.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts and unpacked topk_weights, topk_ids, router_logits from topk_output.
  • python/sglang/srt/layers/quantization/base_config.py
    • Added from __future__ import annotations and TYPE_CHECKING.
    • Added TopKOutput import under TYPE_CHECKING.
    • Updated abstract apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters, also adding **kwargs for consistency.
  • python/sglang/srt/layers/quantization/blockwise_int8.py
    • Added TYPE_CHECKING and TopKOutput import.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts.
  • python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
    • Added from __future__ import annotations and TYPE_CHECKING.
    • Added TopKOutput import.
    • Made CompressedTensorsMoEMethod inherit from FusedMoEMethodBase.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts and unpacked topk_weights, topk_ids, router_logits from topk_output.
  • python/sglang/srt/layers/quantization/fp8.py
    • Added TopKOutput import under TYPE_CHECKING.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts and unpacked topk_weights, topk_ids from topk_output.
    • Updated maybe_apply_hip_fused_experts signature to accept topk_output.
  • python/sglang/srt/layers/quantization/gptq.py
    • Added TYPE_CHECKING and TopKOutput import.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts and unpacked topk_weights, topk_ids, router_logits from topk_output.
  • python/sglang/srt/layers/quantization/modelopt_quant.py
    • Added TYPE_CHECKING and TopKOutput import.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts.
  • python/sglang/srt/layers/quantization/moe_wna16.py
    • Added TYPE_CHECKING and TopKOutput import, and import numpy as np.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts.
    • Adjusted top_k parameter in npu_fused_experts call to topk_ids.shape[1].
  • python/sglang/srt/layers/quantization/unquant.py
    • Added from __future__ import annotations and TYPE_CHECKING.
    • Added TopKOutput import.
    • Removed direct attributes for MoE forward functions from __init__.
    • Updated apply, forward_cuda, forward_cpu, and forward_npu signatures to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal calls to select_experts.
    • Temporarily disabled Triton kernel support in forward_cuda with a NotImplementedError.
    • Local imports for moe_forward_native and fused_experts added where needed.
  • python/sglang/srt/layers/quantization/w8a8_fp8.py
    • Added TYPE_CHECKING and TopKOutput import.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts.
  • python/sglang/srt/layers/quantization/w8a8_int8.py
    • Added TYPE_CHECKING and TopKOutput import.
    • Removed UnquantizedLinearMethod import.
    • Updated apply method signature to accept topk_output: TopKOutput and removed individual expert selection parameters.
    • Removed internal call to select_experts.
    • Changed intermediate_size type hint in create_weights from List[int] to int.
    • Adjusted top_k parameter in npu_fused_experts call to topk_ids.shape[1].
  • python/sglang/srt/models/deepseek.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Updated forward method to use self.topk for expert selection and pass topk_output to fused_moe.
  • python/sglang/srt/models/deepseek_v2.py
    • Changed import from select_experts to TopK.
    • Instantiated self.topk = TopK(...) in __init__ with relevant grouped top-k parameters.
    • Simplified experts __init__ parameters, relying on TopK for expert selection details.
    • Updated forward_normal_dual_stream, forward_normal, forward_cpu, forward_deepep, and op_select_experts methods to use self.topk for expert selection and pass topk_output where appropriate.
  • python/sglang/srt/models/granitemoe.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/grok.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/hunyuan.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/llama4.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated _forward_core_normal and _forward_core_shared_routed_overlap methods to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/mixtral.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/olmoe.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/phimoe.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/qwen2_moe.py
    • Added TopK import.
    • Instantiated self.topk = TopK(...) in __init__.
    • Simplified experts __init__ parameters.
    • Updated forward method to use self.topk for expert selection and pass topk_output to self.experts.
  • python/sglang/srt/models/qwen3_moe.py
    • Changed import from select_experts to TopK.
    • Instantiated self.topk = TopK(...) in __init__.
    • Removed renormalize attribute from Qwen3MoE __init__.
    • Updated forward_normal, forward_deepep, and op_select_experts methods to use self.topk for expert selection and pass topk_output where appropriate.
  • python/sglang/test/test_block_fp8.py
    • Added select_experts import.
    • Updated _w8a8_block_fp8_fused_moe to call select_experts and pass the resulting TopKOutput to fused_moe.
  • python/sglang/test/test_block_fp8_ep.py
    • Updated ep_moe function to unpack topk_weights, topk_ids, and _ from the select_experts return value.
  • python/sglang/test/test_cutlass_w4a8_moe.py
    • Updated test_cutlass_w4a8_moe to unpack topk_weights, topk_ids, and _ from the select_experts return value, and removed redundant use_grouped_topk and renormalize parameters from the call.
  • python/sglang/test/test_fp4_moe.py
    • Updated test_cutlass_fp4_moe_no_graph to unpack topk_weights, topk_ids, and _ from the select_experts return value, and removed redundant use_grouped_topk and renormalize parameters from the call.
  • test/srt/test_block_int8.py
    • Added select_experts import.
    • Updated _w8a8_block_int8_fused_moe to call select_experts and pass the resulting TopKOutput to fused_moe.
  • test/srt/test_fused_moe.py
    • Added select_experts import.
    • Updated _test_case to call select_experts and pass the resulting TopKOutput to fused_moe for both sglang_output and triton_output.
  • test/srt/test_int8_kernel.py
    • Added select_experts import.
    • Updated _w8a8_int8_fused_moe to call select_experts and pass the resulting TopKOutput to fused_moe.
  • test/srt/test_triton_moe_channel_fp8_kernel.py
    • Added select_experts import.
    • Updated _w8a8_fp8_fused_moe to call select_experts and pass the resulting TopKOutput to fused_moe.
  • test/srt/test_triton_moe_wna16.py
    • Added select_experts import.
    • Updated test_fused_moe_wn16 to call select_experts and pass the resulting TopKOutput to fused_moe.
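
Per the test entries above (see the forward reference in the topk.py item), a sketch of the updated test-side call pattern. Shapes, dtypes, and the w1/w2 expert weight tensors are illustrative only.

import torch

from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_moe
from sglang.srt.layers.moe.topk import select_experts

hidden_states = torch.randn(4, 128, device="cuda", dtype=torch.bfloat16)
router_logits = torch.randn(4, 8, device="cuda", dtype=torch.float32)
w1 = torch.randn(8, 2 * 64, 128, device="cuda", dtype=torch.bfloat16)  # [E, 2*I, H]
w2 = torch.randn(8, 128, 64, device="cuda", dtype=torch.bfloat16)      # [E, H, I]

# select_experts is now keyword-only and returns a TopKOutput NamedTuple.
topk_output = select_experts(
    hidden_states=hidden_states,
    router_logits=router_logits,
    top_k=2,
    renormalize=True,
)
topk_weights, topk_ids, _ = topk_output  # tuple unpacking still works

out = fused_moe(hidden_states, w1, w2, topk_output)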
Activity
  • Author ch-wan requested a Gemini review and summary.
  • Bot gemini-code-assist[bot] identified a critical TypeError due to a missing gating_output argument in fused_moe calls, suggesting refactoring fused_moe to remove the redundant parameter. (This was addressed by the PR's core refactoring, which removes gating_output from fused_moe's signature).
  • Bot gemini-code-assist[bot] suggested updating an assertion to use topk_output.router_logits if gating_output is removed. (This assertion was removed as part of the refactoring).
  • Bot Copilot pointed out that self.use_grouped_topk was referenced before assignment in TopK.__init__.
  • Bot Copilot noted a circular import issue with TopKOutput in sglang.srt.layers.moe.topk, which was addressed by using from __future__ import annotations and TYPE_CHECKING.
  • Bot Copilot identified incorrect TopK.forward calls in granitemoe.py and phimoe.py where hidden_states was missing, which were subsequently corrected.
  • Bot Copilot provided a nitpick suggestion to add a GitHub issue reference to the TODO comment about temporarily disabled Triton kernel.
  • Bot gemini-code-assist[bot] reiterated the critical AttributeError in TopK.__init__ regarding self.use_grouped_topk being used before assignment, which was a duplicate of Copilot's finding.
  • Bot gemini-code-assist[bot] highlighted a medium priority issue with the abstract apply method in base_config.py lacking **kwargs in its signature, leading to Liskov substitution principle violation. (This was addressed by adding **kwargs to the signature).

@ch-wan
Collaborator Author

ch-wan commented Jul 19, 2025

@yuan-luo This PR temporarily disables triton_kernel_moe, as its routing function is incompatible with our select_experts. As you are familiar with these APIs, could you please help with adding this feature back? Thanks.

@zhyncs merged commit 15ad6c9 into sgl-project:main on Jul 19, 2025
167 of 188 checks passed
@yuan-luo
Collaborator

yuan-luo commented Jul 19, 2025

@yuan-luo This PR temporarily disables triton_kernel_moe, as its routing function is incompatible with our select_experts. As you are familiar with these APIs, could you please help with adding this feature back? Thanks.

Thanks for the great work! I'll try to follow up on this part.

ch-wan added a commit that referenced this pull request Jul 23, 2025
@ch-wan mentioned this pull request on Aug 2, 2025
yuan-luo pushed a commit to antgroup/sglang that referenced this pull request Sep 18, 2025
Merge branch 'sglang_public_tracker' of [email protected]:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/192


Reviewed-by: 得泽 <[email protected]>


* fix duplicate args in schedule_batch (sgl-project#7816)
* [AMD] Fail gracefully when AITER is unavailable gfx90a GPUs (sgl-project#7187)
* docs: update README (sgl-project#7821)
* [theta] add py-spy deps
* feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (sgl-project#7762)
* Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (sgl-project#7129)
* [Minor] Fix sporadic CI timeout caused by underestimated tests. (sgl-project#7850)
* [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (sgl-project#7853)
* Fix cache modules of triton import error (sgl-project#7832)
* [router] forward stream_options in request (sgl-project#7860)
* Fix illegal memory in trtllm allreduce fusion (sgl-project#7864)
* Fix llama4 vision (sgl-project#7840)
* Support Mimo-VL (sgl-project#7579)
* fix: Handles input_embeds in GenerateReqInput when n>1 (sgl-project#7830)
* [Multimodal][Perf] Use `pybase64` instead of `base64` (sgl-project#7724)
* Bump xgrammar's version to 0.1.20 (sgl-project#7866)
* [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (sgl-project#7818)
* [PD] Add guidance for prefill bootstrap timeout (sgl-project#7846)
* Update native_api doc to match the change in the `get_model_info` endpoint (sgl-project#7660)
* Revert "Embedding parallel by attn_tp (sgl-project#7623)" (sgl-project#7880)
* chore: bump v0.4.9.post1 (sgl-project#7882)
* Fixes typo in assertion message (sgl-project#7895)
* [CI] Add deepep tests to CI (sgl-project#7872)
* [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (sgl-project#7885)
* [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (sgl-project#7838)
* Remove unused imports (sgl-project#7898)
* [router] Update metrics when request completes (sgl-project#7899)
* [feature] Add start step profile argument in /start_profile (sgl-project#7608)
* [bugfix] add pd router policy validation (sgl-project#7904)
* vlm: support video as an input modality (sgl-project#5888)
* Feat: Support Phi-3.5-MoE in SGLang (sgl-project#7907)
* add sentencepiece as dependency explicitly (sgl-project#7922)
* Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (sgl-project#6449)
* [feature]Ascend quantization support (sgl-project#7791)
* [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (sgl-project#7775)
* Support Kimi K2 (sgl-project#7940)
* [feature] kv transfer support of ascend npu (sgl-project#7795)
* fix: minor fix for modelopt weight load compatibility (sgl-project#7953)
* temporarily disable deepep-8-gpu and activate two small tests (sgl-project#7961)
* [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel (sgl-project#7932)
* chore: bump sgl-kernel v0.2.5 (sgl-project#7964)
* Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)" (sgl-project#7968)
* chore: upgrade xgrammar 0.1.21 (sgl-project#7962)
* delete uselese code caused by fuse allreduce+add_rmsnorm pr (sgl-project#7970)
* Fix wrong gemm branch cause 250us slower (sgl-project#7969)
* [router] add worker abstraction (sgl-project#7960)
* chore: upgrade sgl-kernel 0.2.5 (sgl-project#7971)
* chore: bump v0.4.9.post2 (sgl-project#7963)
* [minor fix] llama4 hybrid memory (sgl-project#7950)
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h20_3e
* [theta] perf: add kimi k2 fused_moe tuning config for h200
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h30_3e
* docs: update README (sgl-project#7985)
* Overlap the gating function with shared experts in DeepSeek (sgl-project#7978)
* [BugFix] fix pre_reorder_triton_kernel default int32 issue (sgl-project#7814)
* [minor] Add server_args check for Llama4 with hybrid (sgl-project#7988)
* Tiny fix mooncake log warning wrong output (sgl-project#7952)
* [BugFix] add verify logit_bias to avoid crash because of IndexError  (sgl-project#7749)
* SWA Prefix Cache (sgl-project#7367)
* chore: remove unnecessary limits on quantization methods in test script (sgl-project#7997)
* Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (sgl-project#7844)
* Support for Phi-1.5 & Phi-2 models (sgl-project#7862)
* [Dockerfile] Multi-arch support for ROCm (sgl-project#7902)
* [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (sgl-project#8010)
* perf: add kimi k2 fused_moe tuning config for h30_3e (sgl-project#8021)
* [ci] CI supports use cached models (sgl-project#7874)
* [Minor] Remove redundant print (sgl-project#8005)
* [Feature]TP Group Switching for PD-Multiplexing (sgl-project#7653)
* [Feature] CUDA Green Context Support (sgl-project#7649)
* Fix flaky CI: test_vlm_models (sgl-project#8006)
* Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (sgl-project#7982)
* prevent server crash from potential invalid grammar (sgl-project#7897)
* Setup workflow for releasing mi300x and mi350x dockers. (sgl-project#8035)
* fix: modality length mismatch with image_data (sgl-project#7887)
* Update CODEOWNERS (sgl-project#8044)
* perf: add qwen3-30b-a3b fused moe tuning config for h20
* [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023)
* feat: update multimodal data handling in engine entrypoint (sgl-project#8002)
* fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022)
* Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)
* fix: resolve arm build issue (sgl-project#8052)
* concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943)
* H20 tune config for Kimi (sgl-project#8047)
* Update amd docker image. (sgl-project#8045)
* feat: replace Decord with video_reader-rs (sgl-project#5163)
* remove kv_a.congigous in DeepseekV2AttentionMLA (sgl-project#8058)
* update transformers to 4.53.2 (sgl-project#8029)
* Fix different device type adjustment in PP (sgl-project#7760)
* Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001)
* Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077)
* Fix CI xeon test with triton 3.3.1 (sgl-project#8086)
* fix greenctx stream compability (sgl-project#8090)
* [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098)
* [Feature] Layer-wise Prefill (sgl-project#7634)
* [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992)
* refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075)
* feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597)
* [ci] limit cmake build nproc (sgl-project#8100)
* [ci] disable memory imbalance check for draft worker (sgl-project#8108)
* [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110)
* [ci] recover 8-gpu deepep test (sgl-project#8105)
* Refactor: move all quantization-related code to `srt/layer/quantization` (sgl-project#7989)
* [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884)
* Super tiny fix typo (sgl-project#8046)
* fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102)
* [Hunyuan]: Fix Dense Model Support (sgl-project#8117)
* feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030)
* refactor: simply MultimodalTokens logic (sgl-project#7924)
* [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127)
* Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959)
* load draft model fix (sgl-project#7506)
* [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input"  (sgl-project#7889)
* [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820)
* Hicache Storage Layer Prototype (sgl-project#7704)
* Revert "Fix different device type adjustment in PP" (sgl-project#8141)
* feat: enchance green context stream creation robust with backward compatibility (sgl-project#8136)
* fix compressed tensors WNA16 imports (sgl-project#8142)
* [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147)
* [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113)
* [router] Refactor router and policy traits with dependency injection (sgl-project#7987)
* [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661)
* [Doc] Steps to add a new attention backend (sgl-project#8155)
* chore: tune mem fraction static for vlm (sgl-project#6881)
* Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302)
* Feat: Support audio in Phi4-mm model (sgl-project#8048)
* [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931)
* [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028)
* [router] router metrics cleanup (sgl-project#8158)
* [router] allow router to have empty workers (sgl-project#8160)
* Add GB200 wide-EP docker (sgl-project#8157)
* [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966)
* chore: bump sgl-kernel v0.2.6 (sgl-project#8165)
* chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166)
* [theta] sync bailing
* Fix suffix mismatch for the metrics. (sgl-project#8168)
* Update README.md (sgl-project#8171)
* Clean up server args (sgl-project#8161)
* Fix LoRA buffer contamination during adapter eviction (sgl-project#8103)
* Fix Dockerfile.gb200 (sgl-project#8169)
* [router] add ut for worker and errors (sgl-project#8170)
* bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167)
* Support start up LoRA server without initial adapters (sgl-project#8019)
* Clean warning logs for gate_proj loading in Lora (sgl-project#8172)
* Fix tuning_fused_moe_triton.py (sgl-project#8175)
* [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115)
* Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999)
* Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230)
* Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176)
* feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178)
* fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825)
* Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181)
* feat: update nccl 2.27.6 (sgl-project#8182)
* Feat: Support for Persimmon Model (sgl-project#7983)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183)
* Fix eagle3 cuda graph (sgl-project#8163)
* fix: fix the bug of loading Internvl3 (sgl-project#8067)
* Fix dtype error in CI (sgl-project#8197)
* Cherry-pick commit 2dc5de40 "perf: add bailing mo..." into the current branch
* [router] add ut for pd request, metrics and config (sgl-project#8184)
* [feature] enable NPU CI (sgl-project#7935)
* [fix] fix modelopt fp4 on b200 (sgl-project#8195)
* chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200)
* Apply fused sorted token ids padding (sgl-project#8193)
* [Refactor] simplify multimodal data processing (sgl-project#8107)
* [theta] feat vl name
* [router] add ut for pd router (sgl-project#8208)
* [router] upgade router version to 0.1.6 (sgl-project#8209)
* Remve router gemm output dtype conversion (sgl-project#8204)
* chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202)
* [Feature] Add a test for Layer-wise Prefill (sgl-project#8231)
* docs: update 2025 h2 roadmap (sgl-project#8237)
* fix: retrieve mm token by modality, raise error if none (sgl-project#8221)
* [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484)
* [theta] tune h20 config for qwen3 235b
* [theta] tune h20 config for qwen3 235b
* fix: sgl-router remove dead code (sgl-project#8257)
* [fix] benchmark : routed_scaling_factor is None (sgl-project#8059)
* [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822)
* Preliminary Support for Qwen3XMLDetector (sgl-project#8260)
* chore: bump v0.4.9.post3 (sgl-project#8265)
* PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config
* [theta] tune h20 config for qwen3 480b
* Skip llama4 vision module loading when multimodal disabled (sgl-project#8272)
* PullRequest: 180 Add Fused MoE Triton configs for Qwen480B and Qwen235B on NVIDIA H20-3e
* Fix sgl-kernel ci test (sgl-project#8284)
* [theta] tune h200 config for qwen3 480b
* Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261)
* Hicache IO kernel refactoring (sgl-project#8264)
* bug fix and tag (sgl-project#8282)
* HiCache Fix (sgl-project#8288)
* [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130)
* [router] add common ut infra to mock worker and app (sgl-project#8295)
* fix: workaround for deepgemm warmup issue (sgl-project#8302)
* [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133)
* Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708)
* fix: match chat-template for internvl3 (sgl-project#8262)
* Fix gemma3n with hybrid swa (sgl-project#8240)
* chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304)
* fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685)
* feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043)
* Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP  (sgl-project#7562)
* [AMD] Pull latest image for AMD CI (sgl-project#8070)
* Pin the version of petit kernel to fix the APIs (sgl-project#8235)
* [bug] fix pd completion protocol for batching support (sgl-project#8317)
* [router] fix pd model completion request (sgl-project#8303)
* fix bug when eos_ids==0 (sgl-project#8315)
* [router] add endpoint unit test (sgl-project#8298)
* [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310)
* chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301)
* [router] add streaming unit test (sgl-project#8299)
* [router] add request format unit test (sgl-project#8300)
* HiCache Storage TP Refinement (sgl-project#8307)
* breakdown kernel update (sgl-project#8334)
* support idle batch for TBO (sgl-project#8233)
* [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619)
* DP Enhancement (sgl-project#8280)
* fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266)
* [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135)
* [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000)
* chore: upgrade mooncake 0.3.5 (sgl-project#8341)
* [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353)
* [P/D] Support ipv6 in P/D scenario (sgl-project#7858)
* Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344)
* [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357)
* Clean up server_args, triton cache manager (sgl-project#8332)
* fix: upgrade nccl version (sgl-project#8359)
* [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363)
* fix: kimi k2 xgrammar crash (sgl-project#8367)
* Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333)
* [CI] Fix flaky threshold (sgl-project#8370)
* chore: bump v0.4.9.post4 (sgl-project#8305)
* Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374)
* Uodate Dockerfile.gb200 to latest sglang (sgl-project#8356)
* chore: improve mmmu benchmark (sgl-project#7000)
* Save peak memory in logits processor (sgl-project#8343)
* Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267)
* chore: improvements on mm_utils (sgl-project#7737)
* vlm: optimize tensor transport (sgl-project#6003)
* Tiny assert EPLB is used together with expert parallel (sgl-project#8381)
* model: support intern-s1 (sgl-project#8350)
* Add perf tests for LoRA (sgl-project#8314)
* Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396)
* Add docker release flow for gb200 (sgl-project#8394)
* HiCache, check before terminate prefetching (sgl-project#8372)
* Add nvfp4 scaled mm benchmark. (sgl-project#8401)
* Urgent Fix: intern-s1 chat-template matching (sgl-project#8403)
* Tool to dump and compare internal activation tensors (sgl-project#7976)
* Minor tool for comparison of benchmark results (sgl-project#7974)
* Fix bench script making input data on L2 cache (sgl-project#7739)
* [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036)
* Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392)
* fix: minor fix TransportProxyTensor under tp (sgl-project#8382)
* [router] add different policies for p node and d node (sgl-project#8395)
* Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351)
* fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720)
* [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397)
* Use FlashInfer FP4 gemm. (sgl-project#8241)
* Support precomputed_embeddings for Llama 4 (sgl-project#8156)
* [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405)
* chore: update CODEOWNERS (sgl-project#8407)
* chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406)
* Support triton kernels v3.4.0 for fused_moe (sgl-project#8258)
* [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062)
* Change to use native arm runner (sgl-project#8414)
* Support overlapped lora updates  (sgl-project#8213)
* Support ue8m0 for triton quant kernel (sgl-project#7603)
* Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316)
* bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417)
* Fix test_openai_server (sgl-project#8419)
* Fix docker buildx push error (sgl-project#8425)
* bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422)
* [router] improve router logs and request id header (sgl-project#8415)
* [feat] Support different attention backends for prefill and decode  (sgl-project#6338)
* chore: bump transformer to 4.54.0 (sgl-project#8416)
* [PD] Fix abort_request for PD disaggregation (sgl-project#8352)
* GLM-4.5 Model Support (sgl-project#8224)
* Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442)
* doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441)
* GLM-4.5 Model Support Follow-up (sgl-project#8445)
* fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456)
* Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (sgl-project#8449)
* Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457)
* chore: bump v0.4.9.post5 (sgl-project#8458)
* fix:reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125)
* perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049)
* Cherry-pick commit 4a75e015 "Add draft model fuse..." into the current branch
* Update PR template (sgl-project#8465)
* feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565)
* [theta] tuning script for glm4 moe
* perf: add fused moe kernel config glm4.5,h20-3e,tp8
* [theta] tuning script for glm4 moe h20
* fix: update dep (sgl-project#8467)
* [NVIDIA] Change to use `num_local_experts` (sgl-project#8453)
* Fix parsing ChatCompletionMessage (sgl-project#7273)
* [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421)
* feat: support glm4 tuning (sgl-project#8473)
* Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469)
* Update codeowner (sgl-project#8476)
* chore: add glm4 fp8 tp8 config (sgl-project#8478)
* chore: add glm 4.5 fp8 tp4 config (sgl-project#8480)
* [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477)
* Update CODEOWNERS (sgl-project#8485)
* Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486)
* Reduce memory usage for fp4 moe (sgl-project#8413)
* Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426)
* Support colocating requests (sgl-project#7973)
* Fix incorrect KV cache allocation for MTP models. (sgl-project#8482)
* Add PVC and update resource limits in k8s config (sgl-project#8489)
* chore: bump v0.4.9.post6 (sgl-project#8517)
* Always trigger pr-test (sgl-project#8527)
* Update README.md (sgl-project#8528)
* [sgl-kernel performace] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499)
* Update cutlass_moe.py (sgl-project#8535)
* Fix moe align kernel test (sgl-project#8531)
* Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483)
* bring back kimi vl ci (sgl-project#8537)
* fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431)
* Support EPLB in FusedMoE (sgl-project#8448)
* feat(hicache): support file backend reading directory config form env. (sgl-project#8498)
* feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516)
* [router] allow longer time out for router e2e (sgl-project#8560)
* Update cutlass_moe.py (sgl-project#8545)
* Update CODEOWNERS (sgl-project#8562)
* [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869)
* [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461)
* Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." into the current branch
* [Model] Add support for Arcee Foundational Model (sgl-project#8154)
* Revert "Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)" (sgl-project#8584)
* Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280)
* [router] migrate router from actix to axum (sgl-project#8479)
* [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564)
* Bump transfomers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541)
* Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (sgl-project#8544)
* Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365)
* update sgl-kernel for EP: kernel part  (sgl-project#8514)
* chore: bump sgl-kernel v0.2.8 (sgl-project#8599)
* [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404)
* fix incorrect increase of hit count (sgl-project#8533)
* Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211)
* [theta] Conditionally import HiCacheHF3FS sgl-project#8598
* update sgl-kernel for EP: python part (sgl-project#8550)
* add SVG logo (sgl-project#8603)
* [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515)
* fix: fork should not run pypi router (sgl-project#8604)
* model: support Step3V (sgl-project#8583)
* [Feature] Hybrid EP and TP (sgl-project#8590)
* chore: bump v0.4.10 (sgl-project#8608)
* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595)
* [bugifx] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. (sgl-project#8611)
* Fix hf3fs_fuse import error (sgl-project#8623)
* Update step3v default config (sgl-project#8626)
* [ci] fix genai-bench execution cmd (sgl-project#8629)
* [router] update router pypi version (sgl-project#8628)
* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577)
* Fix typos in py_test/test_launch_server.py (sgl-project#6227)
* misc: Remove debug print to logger.info (sgl-project#8633)
* SGLang HiCache NIXL Connector (sgl-project#8488)
* [bug] remove pdlb from minilb since its no longer available (sgl-project#8634)
* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630)
* Conditionally import HiCacheHF3FS (sgl-project#8598)
* TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632)
* Fix nan value generated after custom all reduce (sgl-project#8532)
* Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642)
* Feature/modelscope model download (sgl-project#8083)
* chore: speedup NPU CI by cache (sgl-project#8270)
* [Bugfix] fix w8a8_int8 load issue (sgl-project#8308)
* [bugfix] fix router python parser for pd urls (sgl-project#8644)
* [router] add basic usage doc (sgl-project#8640)
* [router] upgrade router version to 0.1.8 (sgl-project#8645)
* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450)
* HiCache, fixing hash value indexing (sgl-project#8636)
* Interface change for kvcache io to support page first layout (sgl-project#8318)
* Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051)
* chore: bump v0.4.10.post1 (sgl-project#8652)
* Add hf3fs_utils.cpp to package-data (sgl-project#8653)
* Fix chat template handling for OpenAI serving (sgl-project#8635)
* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511)
* [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658)
* Increase tolerance to address CI failures (sgl-project#8643)
* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013)
* [DOC]Update sgl-kernel README (sgl-project#8665)
* fix per token cuda kernel hidden dim cannot divide by 16 (sgl-project#8543)
* fix arg typo for --disaggregation-transfer-backend (sgl-project#8664)
* [fix] fix pd disagg error of vlms (sgl-project#8094)
* Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647) (sgl-project#8647)
* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685)
* [bug] limit bootstrap room to to [0, 2^63 - 1] (sgl-project#8684)
* Update CODEOWNERS (sgl-project#8686)
* Fix deepgemm masked grouped gemm jit compile (sgl-project#8679)
* Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648)
* bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668)
* upgrade xgrammar 0.1.22 (sgl-project#8522)
* [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually (sgl-project#8618)
* Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238)
* [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364)
* chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694)
* [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688)
* Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)
* model: adapt mllama4 to VisionAttention (sgl-project#8512)
* Add tensor.detach() back to update weight util (sgl-project#8691)
* [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704)
* [theta] merge 0802-3
* Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706)
* [router] minor code clean up and and refactoring (sgl-project#8711)
* [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701)
* chore: bump sgl-kernel v0.2.9 (sgl-project#8713)
* Remove assertions about per group quant fp8 (sgl-project#8717)
* [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693)
* Fix triton moe error caused by TopK refactor (sgl-project#8705)
* [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714)
* [Feature] Radix Tree in C++ (sgl-project#7369)
* [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722)
* Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709)
* Tiny fix CI pytest error (sgl-project#8524)
* [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721)
* Support limiting max loaded loras in CPU. (sgl-project#8650)
* Reduce memory accumulation in long-running server (sgl-project#8306)
* HiCache storage, style change and bug fix (sgl-project#8719)
* [feat] support minimum token load balance in dp attention (sgl-project#7379)
* Do layernorm before allgather for DP attention (sgl-project#8631)
* [fix] Fix divide by zero error for llama4. (sgl-project#8683)
* feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547)
* [Improvements] Merge health check route (sgl-project#8444)
* chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718)
* Save cuda graph memory for fa3 (sgl-project#8567)
* [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579)
* [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723)
* chore: bump v0.4.10.post2 (sgl-project#8727)
* feat: Support DP Attention for step3_vl (sgl-project#8699)
* [RL] fix update weight for FusedMoE with EP (sgl-project#8676)
* use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729)
* Fix triton kernels topk with keyword arguments (sgl-project#8732)
* feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678)
* Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114)
* fix args typo in memory_pool_host (sgl-project#8662)
* [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737)
* [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735)
* Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733)
* Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745)
* Rename lora_path to lora_id in batches (sgl-project#8437)
* [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738)
* [CI] Ascend NPU CI enhancement (sgl-project#8294)
* [bugfix] fix import path in HiCacheController (sgl-project#8749)