[Roadmap] Supporting Ascend NPU on 2025 H2

# SGLang NPU support in 2025 H2

During 2025 H1, we contributed initial NPU support ([#3853](https://github.com/sgl-project/sglang/pull/3853), [#7022](https://github.com/sgl-project/sglang/pull/7022)), which makes it possible for users to run SGLang on NPU hardware.

Our goal for 2025 H2 is to provide a seamless experience on NPUs. Here is a rough development roadmap:

## CI on NPU hardware

- [x] [**_July_**] Enable autoscaling runners #7935 
- [ ] E2E / unit test coverage #8294

## User / Developer experience

*User experience is also a priority; container images and documentation will be provided soon.*

- [ ] [**_July_**] Docker image #8700 
- [ ] [**_July_**] Docs (Quickstart / Installation / tutorials…)

## Model support

*We will start by supporting the most popular models.*

- [ ] [**_July_**] DeepseekV2 / V3 family
- [ ] [**_July_**] Qwen3 family
- [ ] [**_July_**] Qwen3-MoE family

## Performance Enhancement

### Attention Backend

- [x] [**_July_**] Ascend Attention Backend implementation w/ PA & MLA fused kernels #7722 

### Parallelism

- [ ] [**_August_**] Support DeepEP expert parallelism #8355 
- [ ] [**_August_**] Optimize the DeepEPMoE implementation with fused kernels

### Quantization

- [x] [**_July_**] Support for Ascend-specific W8A8 quant method #7791 
- [ ] [**_September_**] Support for AWQ quant method #9104 (thanks @ErvinXie)
- [ ] [**_September_**] Support for GPTQ quant method

### Cache

- [x] [**_July_**] A new transfer-engine implementation that supports device-to-device transfer on NPUs #7795
- [ ] [**_November_**] A new cache pooling system that supports mixed HBM & DRAM pooling, coherent memory access, and direct copy from remote L3 cache to L1 cache on NPUs
- [ ] [**_October_**] An optimized bucketing router policy for extremely uneven prompt lengths

### Support Graph Mode

- [ ] [**_November_**] NPU graph mode support #8030

### EPLB

- [ ] [**_October_**] Support Expert Distribution Recorder on NPUs
- [ ] [**_October_**] Support async loading of expert weights

### Speculative Decoding

- [ ] [**_August_**] Support DeepSeek-R1's MTP

## Community

- [ ] The `#npu-support` channel on the SGLang Slack is being actively built up


[Roadmap] Supporting Ascend NPU on 2025 H2 #8004
