Checklist
1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
2. Please use English, otherwise it will be closed.
Features
CY25H2
- Overlapped LoRA updates (#8213) @lifuhuang
- Compatibility with radix attention (#2880, #9144, #7216) @Fridge003
- Adapter GPU pinning, including pinning via server args (#8053, #8697, #9249) @lifuhuang
- LRU cache support for the LoRA memory pool (#8053) (TBD)
- FlashInfer LoRA backend deprecation (#7809) @lifuhuang
- Perf: LoRA batch preparation optimization (#6961) @lifuhuang @Fridge003
- Perf: kernel benchmark & optimization (#9040, #7910) @Qiaolin-Yu @Fridge003 @lifuhuang
- LoRA support for the embedding layer (#3438) @Beichen-Ma
- LoRA support for MoE layers, including expert weights (#9897)
- Unified paging, i.e. serving LoRA adapters with different ranks (#3647) @Sunt-ing @jcbjcbjc
- Async LoRA prefetch (#8712)
- OpenAI-compatible API
- LRU offloading, decoupling user limits from CPU memory constraints (#10266)
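Several of the CY25H2 items above (adapter GPU pinning, LRU eviction, LRU offloading) revolve around one policy question: when the adapter memory pool is full, evict the least recently used adapter while never evicting a pinned one. A minimal sketch of that policy, using a hypothetical `AdapterPool` class for illustration (this is not SGLang's actual implementation):

```python
from collections import OrderedDict

class AdapterPool:
    """Toy LRU pool for LoRA adapter slots with pinning.

    Illustrative sketch only; SGLang's real memory pool also manages
    GPU buffers, ranks, and in-flight request ref-counts.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # adapter name -> pinned flag

    def acquire(self, name, pinned=False):
        if name in self.slots:
            self.slots.move_to_end(name)  # mark as most recently used
            return
        if len(self.slots) >= self.capacity:
            self._evict_one()
        self.slots[name] = pinned

    def _evict_one(self):
        # Evict the least recently used *unpinned* adapter.
        for name, pinned in self.slots.items():
            if not pinned:
                del self.slots[name]
                return
        raise RuntimeError("all adapter slots are pinned")

pool = AdapterPool(capacity=2)
pool.acquire("math-lora", pinned=True)
pool.acquire("code-lora")
pool.acquire("chat-lora")   # evicts code-lora; math-lora stays pinned
print(list(pool.slots))     # ['math-lora', 'chat-lora']
```

A production version would additionally refuse to evict adapters referenced by in-flight requests, which is what makes combining eviction with overlapped updates and async prefetch non-trivial.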
CY25H1
- Triton kernel & benchmark (#3161) @Fridge003
- Accuracy alignment with HuggingFace inference (#2671, #3413) @Fridge003
- Test case enhancement (#3414, #3652, #4492, #4925) @aoshen524 @jcbjcbjc
- Support for multi-rank adapters (#4492) @jcbjcbjc
- Support for tensor parallelism and weight slicing (#2931, #4274) @aoshen524
- Compatibility with CUDA graph (#3282, #4115) @Qiaolin-Yu @Beichen-Ma
- Support for Phi-4-MM (#6544) @lifuhuang
- Dynamic adapter load/unload in the engine/server API (#7412, #7446) @lifuhuang @Fridge003
- Documentation for LoRA serving (#5521) @Fridge003
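For context on the accuracy-alignment item (#2671, #3413): a LoRA-adapted linear layer computes y = x(W + (alpha/r)·AB), and the serving engine must apply the same alpha/r scaling as the reference implementation for outputs to match HuggingFace. A pure-Python sketch of the standard formula from the LoRA paper (function names and shapes here are illustrative, not SGLang code):

```python
# Reference LoRA forward pass: y = x @ W + (alpha / r) * (x @ A) @ B
# Shapes: x is (n, d_in), W is (d_in, d_out), A is (d_in, r), B is (r, d_out).

def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_forward(x, W, A, B, alpha):
    r = len(B)                       # LoRA rank
    scale = alpha / r
    base = matmul(x, W)              # frozen base layer
    delta = matmul(matmul(x, A), B)  # low-rank adapter path
    return [[base[i][j] + scale * delta[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity base weight
A = [[1.0], [1.0]]            # rank-1 adapter
B = [[0.5, -0.5]]
print(lora_forward(x, W, A, B, alpha=1))  # [[2.5, 0.5]]
```

Because the delta is rank r, a batched kernel (the Triton backend above) computes the x·A and (xA)·B products for many adapters at once rather than merging each adapter into W.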
Related resources