[Roadmap] Distributed Serving Enhancement on 2025 H2

Roadmap of Distributed Serving Enhancement on 2025 H2
1. P/D Disaggregated Serving @ShangmingCai
* [ ] Implement P/D disaggregation without GPUDirect RDMA by leveraging CPU buffer copy. @alogfans
* [x] Support Mooncake P/D Disaggregation with PP @ShangmingCai @ssssnow #8846
* [ ] Minimize reconfiguration overhead during P/D scaling, role transitions, and topology changes. @hzh0425  @LLLL114 https://github.com/sgl-project/sglang/pull/9325
* [ ] OME integration (auto config) @slin1237 

2. Global KVCache Pool @ykwd  @huangtingwei9988

* [x] Enable global KVCache sharing via Mooncake store with a pluggable backend interface. (Based on https://github.com/sgl-project/sglang/pull/7704)
	- [x] Initial integration @huangtingwei9988  @zhangzuo21 https://github.com/sgl-project/sglang/pull/7211
	- [x] Fine-grained prefetch
	- [ ] Advanced prefetch policy
* [ ] Integrate global KV cache sharing with P/D disaggregation. @ShangmingCai @hzh0425 
	- [x] KV Cache sharing among Prefill instances https://github.com/sgl-project/sglang/pull/8516
	- [ ] Decode instances asynchronously contribute KV cache to reduce TTFT in multi-turn conversations
* [ ] Support KVCache-aware request routing with global cache integration. @slin1237 @ShangmingCai 
* [x] 3FS Mini Manager: enabling cross-machine reuse capability for 3FS @hzh0425 @pansicheng 
* [ ] 3FS Operator: providing one-click deployment for 3FS (OME Operator repository). @pansicheng @hzh0425 @WANNA959

3. RAS (Reliability, Availability, Serviceability)

* [ ] Implement P/D health monitoring and fast reconfiguration leveraging disaggregated architecture. @ShangmingCai @whybeyoung 
	- [x] Abort requests gracefully in P/D mode when the client actively kills or disconnects the HTTP request #8177 #8352
	- [x] Support health check based on /health_generate https://github.com/sgl-project/sglang/issues/8444  @whybeyoung 
	- [x] Address Tokenizer Manager bottleneck issue #8964 @whybeyoung @LLLL114 
	- [ ] Address Detokenizer Manager bottleneck issue https://github.com/sgl-project/sglang/pull/9970/ @whybeyoung @LLLL114
	- [ ] Fast recovery/reconfiguration policy support
* [ ] Introduce Elastic EP and cooperate with EPLB to tolerate partial GPU failures during inference. @UNIDY2002 @HanHan009527 https://github.com/sgl-project/sglang/pull/10423
* [ ] Implement fine-grained profiling for P/D with EP/DP/PP. #8965 #9962 @sufeng-buaa

[Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
