SGLang NPU support on 2025 H2
During 2025 H1, we contributed initial NPU support (#3853, #7022), which makes it possible for users to run SGLang on NPU hardware.
Our goal for 2025 H2 is to provide a seamless experience on NPUs. Here is a rough development roadmap:
CI on NPU hardware
- [July] Enable autoscaling runners [feature] enable NPU CI #7935
- E2E/unittest test coverage [CI] Ascend NPU CI enhancement #8294
User / Developer experience
User experience is also a priority: container images and documentation will be provided soon.
- [July] Docker image [feat] add ascend readme and docker release #8700
- [July] Docs (Quickstart / Installation / tutorials…)
Model support
We will start by supporting the most popular models:
- [July] DeepseekV2 / V3 family
- [July] Qwen3 family
- [July] Qwen3-MoE family
Performance Enhancement
Attention Backend
- [July] Ascend Attention Backend implementation w/ PA & MLA fused kernels Ascend attention backend(PA&MLA) #7722
Parallelism
- [August] Support DeepEP expert parallelism [Feature] Optimize DeepSeek's DeepEP on Ascend NPU #8355
- [August] Optimize the DeepEPMoE implementation with fused kernels
Quantization
- [July] Support for Ascend-specific W8A8 quant method [feature]Ascend quantization support #7791
- [September] Support for AWQ quant method [Feature] Support AWQ quantization on NPU #9104 thx @ErvinXie
- [September] Support for GPTQ quant method
Cache
- [July] A new transfer-engine implementation that supports device-to-device transfer on NPUs [feature] kv transfer support of ascend npu #7795
- [November] A new cache pooling system that supports mixed HBM & DRAM pooling, coherent memory access, and direct copy from remote L3 cache to L1 cache on NPUs
- [October] An optimized bucketing router policy for extremely uneven prompt lengths
Support Graph Mode
- [November] NPU graph mode support [Feature] support ACLGraph #8030
EPLB
- [October] Support Expert Distribution Recorder on NPUs
- [October] Support asynchronous loading of expert weights
Speculative Decoding
- [August] Support DeepSeek-R1's MTP
Community
- The #npu-support channel on the SGLang Slack is under active construction