
Conversation

@yanbing-j (Contributor) commented Aug 1, 2025

Motivation

Modifications

This PR adds support for FP8 block quantization when N or K is not a multiple of 128.
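
For context, here is a minimal PyTorch sketch of what 128x128 block-wise FP8 quantization has to handle when N or K is not a multiple of 128: the scale grid is sized with ceil division and each scale is computed over the actual (possibly partial) edge tile. The helper name `block_quant_fp8` and the pure-Python loop are illustrative assumptions only; the real change lives in the fused kernels in sgl-kernel.

```python
# Illustrative sketch only -- not the PR's kernel. Assumes PyTorch >= 2.1
# (for torch.float8_e4m3fn); `block_quant_fp8` is a hypothetical helper name.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3


def block_quant_fp8(w: torch.Tensor, block: int = 128):
    """Quantize a [N, K] weight to FP8 with one scale per (block x block) tile.

    Trailing tiles may be smaller than `block` when N or K is not a multiple
    of 128, so the scale tensor is sized by ceil division and every max-abs
    reduction runs over the real tile extent, not a padded one.
    """
    n, k = w.shape
    n_blocks = (n + block - 1) // block  # ceil(N / block)
    k_blocks = (k + block - 1) // block  # ceil(K / block)
    w_q = torch.empty(n, k, dtype=torch.float8_e4m3fn, device=w.device)
    scales = torch.empty(n_blocks, k_blocks, dtype=torch.float32, device=w.device)

    for i in range(n_blocks):
        for j in range(k_blocks):
            rows = slice(i * block, min((i + 1) * block, n))  # partial row tile at the edge
            cols = slice(j * block, min((j + 1) * block, k))  # partial col tile at the edge
            tile = w[rows, cols].float()
            scale = tile.abs().amax().clamp(min=1e-12) / FP8_MAX
            w_q[rows, cols] = (tile / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
            scales[i, j] = scale
    return w_q, scales
```

Dequantization (or a block-scaled GEMM) then multiplies each tile by its per-block scale; the ragged edge tiles simply reuse the same scale layout.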

Accuracy Test

Benchmark & Profiling

Checklist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@mingfeima mingfeima added the ready-to-merge The PR is ready to merge after the CI is green. label Aug 1, 2025
@mingfeima mingfeima marked this pull request as ready for review August 1, 2025 06:41
@zhyncs zhyncs merged commit 1fe691a into sgl-project:main Aug 1, 2025
37 checks passed
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
yuan-luo pushed a commit to antgroup/sglang that referenced this pull request Sep 18, 2025
Merge branch 'sglang_public_tracker' of [email protected]:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/192


Reviewed-by: 得泽 <[email protected]>


* fix duplicate args in schedule_batch (sgl-project#7816)
* [AMD] Fail gracefully when AITER is unavailable on gfx90a GPUs (sgl-project#7187)
* docs: update README (sgl-project#7821)
* [theta] add py-spy deps
* feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (sgl-project#7762)
* Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (sgl-project#7129)
* [Minor] Fix sporadic CI timeout caused by underestimated tests. (sgl-project#7850)
* [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (sgl-project#7853)
* Fix cache modules of triton import error (sgl-project#7832)
* [router] forward stream_options in request (sgl-project#7860)
* Fix illegal memory in trtllm allreduce fusion (sgl-project#7864)
* Fix llama4 vision (sgl-project#7840)
* Support Mimo-VL (sgl-project#7579)
* fix: Handles input_embeds in GenerateReqInput when n>1 (sgl-project#7830)
* [Multimodal][Perf] Use `pybase64` instead of `base64` (sgl-project#7724)
* Bump xgrammar's version to 0.1.20 (sgl-project#7866)
* [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (sgl-project#7818)
* [PD] Add guidance for prefill bootstrap timeout (sgl-project#7846)
* Update native_api doc to match the change in the `get_model_info` endpoint (sgl-project#7660)
* Revert "Embedding parallel by attn_tp (sgl-project#7623)" (sgl-project#7880)
* chore: bump v0.4.9.post1 (sgl-project#7882)
* Fixes typo in assertion message (sgl-project#7895)
* [CI] Add deepep tests to CI (sgl-project#7872)
* [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (sgl-project#7885)
* [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (sgl-project#7838)
* Remove unused imports (sgl-project#7898)
* [router] Update metrics when request completes (sgl-project#7899)
* [feature] Add start step profile argument in /start_profile (sgl-project#7608)
* [bugfix] add pd router policy validation (sgl-project#7904)
* vlm: support video as an input modality (sgl-project#5888)
* Feat: Support Phi-3.5-MoE in SGLang (sgl-project#7907)
* add sentencepiece as dependency explicitly (sgl-project#7922)
* Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (sgl-project#6449)
* [feature]Ascend quantization support (sgl-project#7791)
* [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (sgl-project#7775)
* Support Kimi K2 (sgl-project#7940)
* [feature] kv transfer support of ascend npu (sgl-project#7795)
* fix: minor fix for modelopt weight load compatibility (sgl-project#7953)
* temporarily disable deepep-8-gpu and activate two small tests (sgl-project#7961)
* [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel (sgl-project#7932)
* chore: bump sgl-kernel v0.2.5 (sgl-project#7964)
* Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)" (sgl-project#7968)
* chore: upgrade xgrammar 0.1.21 (sgl-project#7962)
* delete useless code caused by fuse allreduce+add_rmsnorm pr (sgl-project#7970)
* Fix wrong gemm branch causing 250us slowdown (sgl-project#7969)
* [router] add worker abstraction (sgl-project#7960)
* chore: upgrade sgl-kernel 0.2.5 (sgl-project#7971)
* chore: bump v0.4.9.post2 (sgl-project#7963)
* [minor fix] llama4 hybrid memory (sgl-project#7950)
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h20_3e
* [theta] perf: add kimi k2 fused_moe tuning config for h200
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h30_3e
* docs: update README (sgl-project#7985)
* Overlap the gating function with shared experts in DeepSeek (sgl-project#7978)
* [BugFix] fix pre_reorder_triton_kernel default int32 issue (sgl-project#7814)
* [minor] Add server_args check for Llama4 with hybrid (sgl-project#7988)
* Tiny fix mooncake log warning wrong output (sgl-project#7952)
* [BugFix] add logit_bias verification to avoid crash because of IndexError (sgl-project#7749)
* SWA Prefix Cache (sgl-project#7367)
* chore: remove unnecessary limits on quantization methods in test script (sgl-project#7997)
* Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (sgl-project#7844)
* Support for Phi-1.5 & Phi-2 models (sgl-project#7862)
* [Dockerfile] Multi-arch support for ROCm (sgl-project#7902)
* [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (sgl-project#8010)
* perf: add kimi k2 fused_moe tuning config for h30_3e (sgl-project#8021)
* [ci] CI supports use cached models (sgl-project#7874)
* [Minor] Remove redundant print (sgl-project#8005)
* [Feature]TP Group Switching for PD-Multiplexing (sgl-project#7653)
* [Feature] CUDA Green Context Support (sgl-project#7649)
* Fix flaky CI: test_vlm_models (sgl-project#8006)
* Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (sgl-project#7982)
* prevent server crash from potential invalid grammar (sgl-project#7897)
* Setup workflow for releasing mi300x and mi350x dockers. (sgl-project#8035)
* fix: modality length mismatch with image_data (sgl-project#7887)
* Update CODEOWNERS (sgl-project#8044)
* perf: add qwen3-30b-a3b fused moe tuning config for h20
* [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023)
* feat: update multimodal data handling in engine entrypoint (sgl-project#8002)
* fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022)
* Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)
* fix: resolve arm build issue (sgl-project#8052)
* concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943)
* H20 tune config for Kimi (sgl-project#8047)
* Update amd docker image. (sgl-project#8045)
* feat: replace Decord with video_reader-rs (sgl-project#5163)
* remove kv_a.contiguous in DeepseekV2AttentionMLA (sgl-project#8058)
* update transformers to 4.53.2 (sgl-project#8029)
* Fix different device type adjustment in PP (sgl-project#7760)
* Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001)
* Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077)
* Fix CI xeon test with triton 3.3.1 (sgl-project#8086)
* fix greenctx stream compatibility (sgl-project#8090)
* [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098)
* [Feature] Layer-wise Prefill (sgl-project#7634)
* [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992)
* refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075)
* feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597)
* [ci] limit cmake build nproc (sgl-project#8100)
* [ci] disable memory imbalance check for draft worker (sgl-project#8108)
* [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110)
* [ci] recover 8-gpu deepep test (sgl-project#8105)
* Refactor: move all quantization-related code to `srt/layer/quantization` (sgl-project#7989)
* [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884)
* Super tiny fix typo (sgl-project#8046)
* fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102)
* [Hunyuan]: Fix Dense Model Support (sgl-project#8117)
* feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030)
* refactor: simplify MultimodalTokens logic (sgl-project#7924)
* [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127)
* Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959)
* load draft model fix (sgl-project#7506)
* [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input"  (sgl-project#7889)
* [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820)
* Hicache Storage Layer Prototype (sgl-project#7704)
* Revert "Fix different device type adjustment in PP" (sgl-project#8141)
* feat: enhance green context stream creation robustness with backward compatibility (sgl-project#8136)
* fix compressed tensors WNA16 imports (sgl-project#8142)
* [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147)
* [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113)
* [router] Refactor router and policy traits with dependency injection (sgl-project#7987)
* [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661)
* [Doc] Steps to add a new attention backend (sgl-project#8155)
* chore: tune mem fraction static for vlm (sgl-project#6881)
* Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302)
* Feat: Support audio in Phi4-mm model (sgl-project#8048)
* [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931)
* [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028)
* [router] router metrics cleanup (sgl-project#8158)
* [router] allow router to have empty workers (sgl-project#8160)
* Add GB200 wide-EP docker (sgl-project#8157)
* [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966)
* chore: bump sgl-kernel v0.2.6 (sgl-project#8165)
* chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166)
* [theta] sync bailing
* Fix suffix mismatch for the metrics. (sgl-project#8168)
* Update README.md (sgl-project#8171)
* Clean up server args (sgl-project#8161)
* Fix LoRA buffer contamination during adapter eviction (sgl-project#8103)
* Fix Dockerfile.gb200 (sgl-project#8169)
* [router] add ut for worker and errors (sgl-project#8170)
* bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167)
* Support start up LoRA server without initial adapters (sgl-project#8019)
* Clean warning logs for gate_proj loading in Lora (sgl-project#8172)
* Fix tuning_fused_moe_triton.py (sgl-project#8175)
* [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115)
* Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999)
* Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230)
* Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176)
* feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178)
* fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825)
* Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181)
* feat: update nccl 2.27.6 (sgl-project#8182)
* Feat: Support for Persimmon Model (sgl-project#7983)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183)
* Fix eagle3 cuda graph (sgl-project#8163)
* fix: fix the bug of loading Internvl3 (sgl-project#8067)
* Fix dtype error in CI (sgl-project#8197)
* Cherry-pick commit 2dc5de40 "perf: add bailing mo..." into the current branch
* [router] add ut for pd request, metrics and config (sgl-project#8184)
* [feature] enable NPU CI (sgl-project#7935)
* [fix] fix modelopt fp4 on b200 (sgl-project#8195)
* chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200)
* Apply fused sorted token ids padding (sgl-project#8193)
* [Refactor] simplify multimodal data processing (sgl-project#8107)
* [theta] feat vl name
* [router] add ut for pd router (sgl-project#8208)
* [router] upgrade router version to 0.1.6 (sgl-project#8209)
* Remove router gemm output dtype conversion (sgl-project#8204)
* chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202)
* [Feature] Add a test for Layer-wise Prefill (sgl-project#8231)
* docs: update 2025 h2 roadmap (sgl-project#8237)
* fix: retrieve mm token by modality, raise error if none (sgl-project#8221)
* [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484)
* [theta] tune h20 config for qwen3 235b
* [theta] tune h20 config for qwen3 235b
* fix: sgl-router remove dead code (sgl-project#8257)
* [fix] benchmark : routed_scaling_factor is None (sgl-project#8059)
* [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822)
* Preliminary Support for Qwen3XMLDetector (sgl-project#8260)
* chore: bump v0.4.9.post3 (sgl-project#8265)
* PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config
* [theta] tune h20 config for qwen3 480b
* Skip llama4 vision module loading when multimodal disabled (sgl-project#8272)
* PullRequest: 180 Add Fused MoE Triton configs for Qwen480B and Qwen235B on NVIDIA H20-3e
* Fix sgl-kernel ci test (sgl-project#8284)
* [theta] tune h200 config for qwen3 480b
* Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261)
* Hicache IO kernel refactoring (sgl-project#8264)
* bug fix and tag (sgl-project#8282)
* HiCache Fix (sgl-project#8288)
* [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130)
* [router] add common ut infra to mock worker and app (sgl-project#8295)
* fix: workaround for deepgemm warmup issue (sgl-project#8302)
* [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133)
* Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708)
* fix: match chat-template for internvl3 (sgl-project#8262)
* Fix gemma3n with hybrid swa (sgl-project#8240)
* chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304)
* fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685)
* feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043)
* Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP  (sgl-project#7562)
* [AMD] Pull latest image for AMD CI (sgl-project#8070)
* Pin the version of petit kernel to fix the APIs (sgl-project#8235)
* [bug] fix pd completion protocol for batching support (sgl-project#8317)
* [router] fix pd model completion request (sgl-project#8303)
* fix bug when eos_ids==0 (sgl-project#8315)
* [router] add endpoint unit test (sgl-project#8298)
* [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310)
* chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301)
* [router] add streaming unit test (sgl-project#8299)
* [router] add request format unit test (sgl-project#8300)
* HiCache Storage TP Refinement (sgl-project#8307)
* breakdown kernel update (sgl-project#8334)
* support idle batch for TBO (sgl-project#8233)
* [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619)
* DP Enhancement (sgl-project#8280)
* fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266)
* [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135)
* [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000)
* chore: upgrade mooncake 0.3.5 (sgl-project#8341)
* [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353)
* [P/D] Support ipv6 in P/D scenario (sgl-project#7858)
* Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344)
* [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357)
* Clean up server_args, triton cache manager (sgl-project#8332)
* fix: upgrade nccl version (sgl-project#8359)
* [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363)
* fix: kimi k2 xgrammar crash (sgl-project#8367)
* Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333)
* [CI] Fix flaky threshold (sgl-project#8370)
* chore: bump v0.4.9.post4 (sgl-project#8305)
* Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374)
* Update Dockerfile.gb200 to latest sglang (sgl-project#8356)
* chore: improve mmmu benchmark (sgl-project#7000)
* Save peak memory in logits processor (sgl-project#8343)
* Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267)
* chore: improvements on mm_utils (sgl-project#7737)
* vlm: optimize tensor transport (sgl-project#6003)
* Tiny assert EPLB is used together with expert parallel (sgl-project#8381)
* model: support intern-s1 (sgl-project#8350)
* Add perf tests for LoRA (sgl-project#8314)
* Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396)
* Add docker release flow for gb200 (sgl-project#8394)
* HiCache, check before terminate prefetching (sgl-project#8372)
* Add nvfp4 scaled mm benchmark. (sgl-project#8401)
* Urgent Fix: intern-s1 chat-template matching (sgl-project#8403)
* Tool to dump and compare internal activation tensors (sgl-project#7976)
* Minor tool for comparison of benchmark results (sgl-project#7974)
* Fix bench script making input data on L2 cache (sgl-project#7739)
* [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036)
* Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392)
* fix: minor fix TransportProxyTensor under tp (sgl-project#8382)
* [router] add different policies for p node and d node (sgl-project#8395)
* Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351)
* fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720)
* [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397)
* Use FlashInfer FP4 gemm. (sgl-project#8241)
* Support precomputed_embeddings for Llama 4 (sgl-project#8156)
* [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405)
* chore: update CODEOWNERS (sgl-project#8407)
* chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406)
* Support triton kernels v3.4.0 for fused_moe (sgl-project#8258)
* [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062)
* Change to use native arm runner (sgl-project#8414)
* Support overlapped lora updates  (sgl-project#8213)
* Support ue8m0 for triton quant kernel (sgl-project#7603)
* Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316)
* bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417)
* Fix test_openai_server (sgl-project#8419)
* Fix docker buildx push error (sgl-project#8425)
* bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422)
* [router] improve router logs and request id header (sgl-project#8415)
* [feat] Support different attention backends for prefill and decode  (sgl-project#6338)
* chore: bump transformers to 4.54.0 (sgl-project#8416)
* [PD] Fix abort_request for PD disaggregation (sgl-project#8352)
* GLM-4.5 Model Support (sgl-project#8224)
* Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442)
* doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441)
* GLM-4.5 Model Support Follow-up (sgl-project#8445)
* fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456)
* Fix per_token_group_quant_8bit when hidden_dim // group_size is not divisible by 4. (sgl-project#8449)
* Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457)
* chore: bump v0.4.9.post5 (sgl-project#8458)
* fix:reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125)
* perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049)
* Cherry-pick commit 4a75e015 "Add draft model fuse..." into the current branch
* Update PR template (sgl-project#8465)
* feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565)
* [theta] tuning script for glm4 moe
* perf: add fused moe kernel config glm4.5,h20-3e,tp8
* [theta] tuning script for glm4 moe h20
* fix: update dep (sgl-project#8467)
* [NVIDIA] Change to use `num_local_experts` (sgl-project#8453)
* Fix parsing ChatCompletionMessage (sgl-project#7273)
* [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421)
* feat: support glm4 tuning (sgl-project#8473)
* Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469)
* Update codeowner (sgl-project#8476)
* chore: add glm4 fp8 tp8 config (sgl-project#8478)
* chore: add glm 4.5 fp8 tp4 config (sgl-project#8480)
* [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477)
* Update CODEOWNERS (sgl-project#8485)
* Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486)
* Reduce memory usage for fp4 moe (sgl-project#8413)
* Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426)
* Support colocating requests (sgl-project#7973)
* Fix incorrect KV cache allocation for MTP models. (sgl-project#8482)
* Add PVC and update resource limits in k8s config (sgl-project#8489)
* chore: bump v0.4.9.post6 (sgl-project#8517)
* Always trigger pr-test (sgl-project#8527)
* Update README.md (sgl-project#8528)
* [sgl-kernel performance] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499)
* Update cutlass_moe.py (sgl-project#8535)
* Fix moe align kernel test (sgl-project#8531)
* Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483)
* bring back kimi vl ci (sgl-project#8537)
* fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431)
* Support EPLB in FusedMoE (sgl-project#8448)
* feat(hicache): support file backend reading directory config form env. (sgl-project#8498)
* feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516)
* [router] allow longer time out for router e2e (sgl-project#8560)
* Update cutlass_moe.py (sgl-project#8545)
* Update CODEOWNERS (sgl-project#8562)
* [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869)
* [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461)
* Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." into the current branch
* [Model] Add support for Arcee Foundational Model (sgl-project#8154)
* Revert "Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)" (sgl-project#8584)
* Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280)
* [router] migrate router from actix to axum (sgl-project#8479)
* [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564)
* Bump transformers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541)
* Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (sgl-project#8544)
* Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365)
* update sgl-kernel for EP: kernel part  (sgl-project#8514)
* chore: bump sgl-kernel v0.2.8 (sgl-project#8599)
* [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404)
* fix incorrect increase of hit count (sgl-project#8533)
* Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211)
* [theta] Conditionally import HiCacheHF3FS sgl-project#8598
* update sgl-kernel for EP: python part (sgl-project#8550)
* add SVG logo (sgl-project#8603)
* [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515)
* fix: fork should not run pypi router (sgl-project#8604)
* model: support Step3V (sgl-project#8583)
* [Feature] Hybrid EP and TP (sgl-project#8590)
* chore: bump v0.4.10 (sgl-project#8608)
* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595)
* [bugfix] QWen-1M context support [2/3]: use the current cuda stream in the DCA kernel (sgl-project#8611)
* Fix hf3fs_fuse import error (sgl-project#8623)
* Update step3v default config (sgl-project#8626)
* [ci] fix genai-bench execution cmd (sgl-project#8629)
* [router] update router pypi version (sgl-project#8628)
* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577)
* Fix typos in py_test/test_launch_server.py (sgl-project#6227)
* misc: Remove debug print to logger.info (sgl-project#8633)
* SGLang HiCache NIXL Connector (sgl-project#8488)
* [bug] remove pdlb from minilb since it's no longer available (sgl-project#8634)
* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630)
* Conditionally import HiCacheHF3FS (sgl-project#8598)
* TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632)
* Fix nan value generated after custom all reduce (sgl-project#8532)
* Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642)
* Feature/modelscope model download (sgl-project#8083)
* chore: speedup NPU CI by cache (sgl-project#8270)
* [Bugfix] fix w8a8_int8 load issue (sgl-project#8308)
* [bugfix] fix router python parser for pd urls (sgl-project#8644)
* [router] add basic usage doc (sgl-project#8640)
* [router] upgrade router version to 0.1.8 (sgl-project#8645)
* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450)
* HiCache, fixing hash value indexing (sgl-project#8636)
* Interface change for kvcache io to support page first layout (sgl-project#8318)
* Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051)
* chore: bump v0.4.10.post1 (sgl-project#8652)
* Add hf3fs_utils.cpp to package-data (sgl-project#8653)
* Fix chat template handling for OpenAI serving (sgl-project#8635)
* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511)
* [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658)
* Increase tolerance to address CI failures (sgl-project#8643)
* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013)
* [DOC]Update sgl-kernel README (sgl-project#8665)
* fix per-token cuda kernel when hidden dim is not divisible by 16 (sgl-project#8543)
* fix arg typo for --disaggregation-transfer-backend (sgl-project#8664)
* [fix] fix pd disagg error of vlms (sgl-project#8094)
* Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647)
* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685)
* [bug] limit bootstrap room to [0, 2^63 - 1] (sgl-project#8684)
* Update CODEOWNERS (sgl-project#8686)
* Fix deepgemm masked grouped gemm jit compile (sgl-project#8679)
* Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648)
* bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668)
* upgrade xgrammar 0.1.22 (sgl-project#8522)
* [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually (sgl-project#8618)
* Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238)
* [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364)
* chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694)
* [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688)
* Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)
* model: adapt mllama4 to VisionAttention (sgl-project#8512)
* Add tensor.detach() back to update weight util (sgl-project#8691)
* [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704)
* [theta] merge 0802-3
* Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706)
* [router] minor code cleanup and refactoring (sgl-project#8711)
* [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701)
* chore: bump sgl-kernel v0.2.9 (sgl-project#8713)
* Remove assertions about per group quant fp8 (sgl-project#8717)
* [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693)
* Fix triton moe error caused by TopK refactor (sgl-project#8705)
* [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714)
* [Feature] Radix Tree in C++ (sgl-project#7369)
* [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722)
* Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709)
* Tiny fix CI pytest error (sgl-project#8524)
* [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721)
* Support limiting max loaded loras in CPU. (sgl-project#8650)
* Reduce memory accumulation in long-running server (sgl-project#8306)
* HiCache storage, style change and bug fix (sgl-project#8719)
* [feat] support minimum token load balance in dp attention (sgl-project#7379)
* Do layernorm before allgather for DP attention (sgl-project#8631)
* [fix] Fix divide by zero error for llama4. (sgl-project#8683)
* feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547)
* [Improvements] Merge health check route (sgl-project#8444)
* chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718)
* Save cuda graph memory for fa3 (sgl-project#8567)
* [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579)
* [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723)
* chore: bump v0.4.10.post2 (sgl-project#8727)
* feat: Support DP Attention for step3_vl (sgl-project#8699)
* [RL] fix update weight for FusedMoE with EP (sgl-project#8676)
* use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729)
* Fix triton kernels topk with keyword arguments (sgl-project#8732)
* feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678)
* Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114)
* fix args typo in memory_pool_host (sgl-project#8662)
* [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737)
* [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735)
* Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733)
* Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745)
* Rename lora_path to lora_id in batches (sgl-project#8437)
* [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738)
* [CI] Ascend NPU CI enhancement (sgl-project#8294)
* [bugfix] fix import path in HiCacheController (sgl-project#8749)