[Roadmap] Supporting Ascend NPU on 2025 H2 #8004

@iforgetmyname

SGLang NPU support on 2025 H2

During 2025 H1, we contributed initial support for NPUs (#3853, #7022), which makes it possible for users to run SGLang on NPU hardware.

Our goal for 2025 H2 is to provide a seamless experience running SGLang on NPUs. Here is a rough development roadmap:

CI on NPU hardware

User / Developer experience

User experience will also be taken into consideration; containers and documentation will be provided soon.

Model support

We will start by supporting the hottest models:

  • [July] DeepseekV2 / V3 family
  • [July] Qwen3 family
  • [July] Qwen3-MoE family

Performance Enhancement

Attention Backend

Parallelism

Quantization

Cache

  • [July] A new transfer-engine implementation that supports device-to-device transfer on NPUs ([feature] kv transfer support of ascend npu #7795)
  • [November] A new cache pooling system that supports HBM & DRAM mixed pooling, coherent memory access, and direct copy from remote L3 cache to L1 cache on NPUs
  • [October] An optimized bucketing router policy for extremely uneven prompt lengths

Support Graph Mode

EPLB

  • [October] Support Expert Distribution Recorder on NPUs
  • [October] Support asynchronous loading of expert weights

Speculative Decoding

  • [August] Support DeepSeek-R1's MTP

Community

  • The #npu-support channel is being actively built up on the SGLang Slack
