Roadmap for Distributed Serving Enhancements in 2025 H2
- P/D Disaggregated Serving @ShangmingCai
- Implement P/D disaggregation without GPUDirect RDMA by leveraging CPU buffer copy. @alogfans
- Support Mooncake P/D Disaggregation with PP @ShangmingCai @ssssnow [PD] Support PD disaggregation with Prefill PP #8846
- Minimize reconfiguration overhead during P/D scaling, role transitions, and topology changes. @hzh0425 @LLLL114 [PD] Support P/D role conversion of running sglang server #9325
- OME integration (auto config) @slin1237
- Global KVCache Pool @ykwd @huangtingwei9988
- Enable global KVCache sharing via Mooncake store with a pluggable backend interface. (Based on Hicache Storage Layer Prototype #7704)
- Initial integration @huangtingwei9988 @zhangzuo21 Support l3 cache (mooncake store) for hiradix cache #7211
- Fine-grained prefetch
- Advanced prefetch policy
- Integrate global KV cache sharing with P/D disaggregation. @ShangmingCai @hzh0425
- KV Cache sharing among Prefill instances feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. #8516
- Decode instances asynchronously contribute KV cache to reduce TTFT in multi-turn conversations
- Support KVCache-aware request routing with global cache integration. @slin1237 @ShangmingCai
- 3FS Mini Manager: enabling cross-machine reuse capability for 3FS @hzh0425 @pansicheng
- 3FS Operator: providing one-click deployment for 3FS. (OME Operator repository @pansicheng @hzh0425 @WANNA959)
- RAS (Reliability, Availability, Serviceability)
- Implement P/D health monitoring and fast reconfiguration leveraging disaggregated architecture. @ShangmingCai @whybeyoung
- Abort requests gracefully in P/D mode when the client kills or disconnects the HTTP request. [Bug] PD disaggregation Abort Issue #8177 [PD] Fix abort_request for PD disaggregation #8352
- Support health check based on /health_generate [Improvements] Merge health check route #8444 @whybeyoung
- Address the Tokenizer Manager bottleneck: Support Multi Process Tokenizer Manager(#6555) #8964 @whybeyoung @LLLL114
- Address the Detokenizer Manager bottleneck: Support Multi Detokenizer based on Multi Tokenizer #9970 @whybeyoung @LLLL114
- Fast recovery/reconfiguration policy support
- Introduce Elastic EP and cooperate with EPLB to tolerate partial GPU failures during inference. @UNIDY2002 @HanHan009527 [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP #10423
- Implement fine-grained profiling for PD with EP/DP/PP. [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency #8965 [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 #9962 @sufeng-buaa
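
To make the "KVCache-aware request routing" item above more concrete, here is a minimal sketch of the general idea: send each request to the instance that already holds the longest matching token prefix, and fall back to the least-loaded instance on ties. This is not SGLang's router or Mooncake's API. The `Worker` class, its fields, and the `route` function are hypothetical names introduced only for illustration, and a real implementation would likely use a radix tree and a global cache index rather than a linear scan.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Worker:
    """Hypothetical router-side view of one serving instance (illustrative only)."""
    name: str
    pending_requests: int = 0
    # Token-ID prefixes this worker is assumed to hold in its KV cache.
    cached_prefixes: List[Tuple[int, ...]] = field(default_factory=list)


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_match_len(worker: Worker, tokens: Sequence[int]) -> int:
    """Longest overlap between `tokens` and any prefix cached on the worker."""
    return max(
        (shared_prefix_len(p, tokens) for p in worker.cached_prefixes),
        default=0,
    )


def route(workers: List[Worker], tokens: Sequence[int]) -> Worker:
    """Prefer the longest cache hit; break ties by the lighter load."""
    return max(
        workers,
        key=lambda w: (best_match_len(w, tokens), -w.pending_requests),
    )


if __name__ == "__main__":
    pool = [
        Worker("prefill-0", pending_requests=5, cached_prefixes=[(1, 2, 3, 4)]),
        Worker("prefill-1", pending_requests=0, cached_prefixes=[(1, 2)]),
    ]
    print(route(pool, (1, 2, 3, 9)).name)
```

Ranking by cache hit first and load second is only one possible policy. A production router would probably weigh the two against each other so that a single hot prefix does not overload one instance.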