Support updating weights at once by stopping all requests #6698

tianyuzhou95 · 2025-05-28T09:15:42Z

Motivation

Currently, when updating model weights, we have to wait for all requests to complete generation before the update can start. This significantly reduces the throughput of batch inference, especially in RL scenarios. Therefore, a method is needed to quickly stop all running and queued requests so that model weights can be updated immediately.

Modifications

Return a result when aborting a request in the waiting queue, rather than silently ignoring it.
Extend the native API /abort_request so that it can abort all requests.
Add a new boolean field, abort_all_requests, to the weight-update request to decide whether to abort all requests before updating weights.
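
As a rough illustration of the third change, the sketch below builds (but does not send) a weight-update call that asks the server to abort everything first. Only the field abort_all_requests comes from the description above. The endpoint path /update_weights_from_disk, the model_path field, the base URL, and the helper name build_update_weights_request are assumptions for illustration and should be checked against the SGLang version in use.

```python
import json
import urllib.request


def build_update_weights_request(
    base_url: str, model_path: str, abort_all_requests: bool = False
) -> urllib.request.Request:
    """Build a POST request for a weight update (hypothetical helper).

    Only `abort_all_requests` is taken from the PR description; the
    endpoint path and the `model_path` field are assumptions.
    """
    payload = {
        "model_path": model_path,
        "abort_all_requests": abort_all_requests,
    }
    return urllib.request.Request(
        url=f"{base_url}/update_weights_from_disk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_weights_request(
    "http://localhost:30000", "/path/to/new/weights", abort_all_requests=True
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req).
```

With abort_all_requests left at its default of False, the same request would presumably wait for in-flight generation to finish before the update runs.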

gemini-code-assist

Hello @tianyuzhou95, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This PR introduces functionality to quickly stop all ongoing and queued requests within the SGLang runtime. The primary motivation, as described by the author, is to enable immediate model weight updates without having to wait for potentially long-running requests to complete, which is particularly beneficial for use cases like batch inference in RL scenarios where throughput is critical.

The changes involve adding a new native HTTP API endpoint /stop_all that can be called directly, and also integrating this stop mechanism into the existing weight update APIs (update_weights_from_disk, update_weights_from_distributed, update_weights_from_tensor) via a new boolean parameter stop_all_requests. When triggered, the system will modify the maximum number of new tokens for both waiting and running requests, effectively causing them to finish immediately or at their current state.

Highlights

New API Endpoint: Adds a new /stop_all HTTP API endpoint to trigger the stopping of all requests.
Weight Update Integration: Adds a stop_all_requests boolean field to the weight update request inputs, allowing requests to be stopped automatically before a weight update.
Scheduler Logic: Implements the core logic in the scheduler to handle the StopAllReq, modifying the max_new_tokens for queued and running requests to halt generation.

Changelog

python/sglang/srt/entrypoints/http_server.py
- Added /stop_all API endpoint (GET/POST) that calls _global_state.tokenizer_manager.stop_all().
python/sglang/srt/managers/io_struct.py
- Added StopAllReq dataclass.
- Added stop_all_requests: bool = False field to UpdateWeightFromDiskReqInput.
- Added stop_all_requests: bool = False field to UpdateWeightsFromDistributedReqInput.
- Added stop_all_requests: bool = False field to UpdateWeightsFromTensorReqInput.
python/sglang/srt/managers/scheduler.py
- Imported StopAllReq.
- Added handler for StopAllReq in the scheduler's request mapping.
- Implemented stop_all_request method to set max_new_tokens to 0 for waiting requests and to the current output length for running requests.
python/sglang/srt/managers/tokenizer_manager.py
- Imported StopAllReq.
- Added stop_all() method that sends a StopAllReq to the scheduler.
- Added checks in update_weights_from_disk, update_weights_from_distributed, and update_weights_from_tensor to call self.stop_all() if stop_all_requests is true.
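
The scheduler change listed above (clamping max_new_tokens so requests end where they are) can be sketched with a toy model. ToyRequest and stop_all_requests are hypothetical names, not the actual scheduler code; the sketch only shows the clamping rule the changelog describes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a scheduler request."""

    rid: str
    max_new_tokens: int
    output_ids: List[int] = field(default_factory=list)

    def finished(self) -> bool:
        # A request is done once it has produced its full token budget.
        return len(self.output_ids) >= self.max_new_tokens


def stop_all_requests(waiting: List[ToyRequest], running: List[ToyRequest]) -> int:
    """Clamp token budgets so every request finishes at its current state."""
    for req in waiting:
        req.max_new_tokens = 0  # never starts generating
    for req in running:
        req.max_new_tokens = len(req.output_ids)  # stops after tokens already produced
    return len(waiting) + len(running)


waiting = [ToyRequest("w0", max_new_tokens=128)]
running = [ToyRequest("r0", max_new_tokens=128, output_ids=[11, 12, 13])]
num_stopped = stop_all_requests(waiting, running)
```

The counter is named num_stopped rather than sum, which keeps the Python built-in unshadowed.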

Requests flow fast,
But models need update,
Stop them all now.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a valuable feature to pause/stop ongoing requests before model weights are updated, which is particularly useful for scenarios like reinforcement learning. The approach involves a new API endpoint /stop_all and flags in weight update requests to trigger this behavior. The mechanisms for marking requests to stop within the scheduler by adjusting max_new_tokens seem correct.

The main concern revolves around the synchronization: ensuring that requests are effectively stopped or paused in the scheduler before the weight update operation commences. The current implementation sends a stop signal asynchronously, which might lead to race conditions where weights are updated while requests are still being processed with old weights, or caches are flushed prematurely. I've provided a detailed comment and suggestions on this critical aspect.

Minor points include a variable naming choice, which is noted in the summary but not as an individual comment due to review settings.

Summary of Findings

Synchronization of Request Stopping and Weight Updates: Critical: The current mechanism for stopping requests (stop_all()) sends an asynchronous message to the scheduler. There's no guarantee that requests are stopped before the weight update process begins, potentially leading to race conditions or failed cache flushes. This needs to be made synchronous or ensure quiescence before updates.
Variable Naming: Medium: In scheduler.py, the variable sum shadows the built-in function, potentially causing confusion.
Clarity of /stop_all API behavior: Medium: The /stop_all API endpoint's current fire-and-forget nature might not align with user expectations of immediate, confirmed stopping. Its behavior should be clarified or made more robust if tokenizer_manager.stop_all becomes async.
Indentation: Low: In tokenizer_manager.py, the line req = StopAllReq() has inconsistent indentation (1 space instead of 4). (Not commented due to severity settings)
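
The synchronization concern in the first finding can be illustrated with a small asyncio sketch in which the update path waits for an explicit acknowledgement from the scheduler before touching weights. ToyScheduler, its idle event, and update_weights are hypothetical names for illustration; this is one possible way to make the stop step synchronous, not the approach taken in the PR.

```python
import asyncio
from typing import List, Set


class ToyScheduler:
    """Hypothetical scheduler that acknowledges an abort-all request."""

    def __init__(self) -> None:
        self.in_flight: Set[str] = {"req-0", "req-1"}
        self.idle = asyncio.Event()

    async def handle_abort_all(self) -> None:
        await asyncio.sleep(0.01)  # simulate the scheduler loop picking up the message
        self.in_flight.clear()
        self.idle.set()  # acknowledgement: nothing is running any more


async def update_weights(scheduler: ToyScheduler, log: List[str]) -> None:
    # Send the abort, then wait for the acknowledgement before updating.
    abort_task = asyncio.create_task(scheduler.handle_abort_all())
    await scheduler.idle.wait()
    await abort_task
    log.append(f"update with {len(scheduler.in_flight)} requests in flight")


async def main() -> List[str]:
    log: List[str] = []
    await update_weights(ToyScheduler(), log)
    return log


log = asyncio.run(main())
```

Without the wait on idle, the log entry could be written while requests are still in flight, which is the race the review describes.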

Merge Readiness

This pull request introduces an important capability. However, due to the critical synchronization issue identified between stopping requests and updating model weights, I recommend that these changes not be merged until this issue is thoroughly addressed. Addressing this will ensure the stability and correctness of the dynamic weight update feature. The other medium severity suggestions should also be considered for improving clarity and robustness. I am not authorized to approve pull requests, so please ensure further review and approval by authorized maintainers after addressing the feedback.

tianyuzhou95 · 2025-06-10T07:43:19Z

@zhuzilin it seems this PR also covers your use case? cc @zhaochenyang20

Related PR: #6855

zhaochenyang20 · 2025-06-10T23:30:57Z

@tianyuzhou95 please resolve the conflicts. I will merge #6855 and #6698.

tianyuzhou95 · 2025-06-11T02:05:39Z

@zhaochenyang20 Thanks, already resolved :)

zhyncs · 2025-06-23T01:04:07Z

please rebase

tianyuzhou95 · 2025-06-23T02:04:25Z

Thanks, already rebased.

zhaochenyang20 · 2025-06-25T04:45:21Z

Add a new boolean field abort_all_requests inside the request of updating weights to decide whether to abort all requests before updating weights.

Does that mean that if this is turned off, the update waits for all requests to finish and only then runs?

tianyuzhou95 · 2025-06-25T04:58:25Z

Yes, please refer to https://github.com/sgl-project/sglang/blob/v0.4.8/python/sglang/srt/managers/tokenizer_manager.py#L889; the update will be blocked by the write lock.
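
To make the blocking behaviour concrete, here is a minimal sketch of a reader-writer lock in which generation requests hold the read side and the weight update takes the write side. ToyRWLock and the surrounding coroutines are hypothetical and written for illustration only; they are not the lock implementation used in tokenizer_manager.py.

```python
import asyncio
from typing import List


class ToyRWLock:
    """Minimal asyncio reader-writer lock (illustrative, no writer priority)."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._readers = 0
        self._writer = False

    async def acquire_read(self) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: not self._writer)
            self._readers += 1

    async def release_read(self) -> None:
        async with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    async def acquire_write(self) -> None:
        async with self._cond:
            # The writer waits until no reader and no other writer holds the lock.
            await self._cond.wait_for(lambda: not self._writer and self._readers == 0)
            self._writer = True

    async def release_write(self) -> None:
        async with self._cond:
            self._writer = False
            self._cond.notify_all()


async def generate(lock: ToyRWLock, events: List[str], rid: int) -> None:
    await lock.acquire_read()
    try:
        await asyncio.sleep(0.02)  # simulate decoding
        events.append(f"finished {rid}")
    finally:
        await lock.release_read()


async def update_weights(lock: ToyRWLock, events: List[str]) -> None:
    await lock.acquire_write()
    try:
        events.append("weights updated")
    finally:
        await lock.release_write()


async def main() -> List[str]:
    lock, events = ToyRWLock(), []
    gens = [asyncio.create_task(generate(lock, events, i)) for i in range(2)]
    await asyncio.sleep(0)  # let both generations take the read lock first
    await update_weights(lock, events)
    await asyncio.gather(*gens)
    return events


events = asyncio.run(main())
```

In this sketch the update is always recorded after both generations finish, which mirrors the default behaviour when abort_all_requests is off.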

hebiao064 · 2025-06-29T06:14:49Z

I will ask @zhyncs and @zhaochenyang20 to merge this PR once CI passes. Thanks a lot!

hebiao064 · 2025-06-29T06:46:27Z

Please fix lint by running `pre-commit run --all-files`.

tianyuzhou95 · 2025-06-29T06:52:04Z

Thanks, linting is fixed. I also added a unit test for updating weights with abort_all_requests=True.

zhyncs · 2025-07-03T02:33:39Z

Hi @tianyuzhou95 please rebase cc @zhuzilin

Set abort_all_requests=True to enable this feature. Signed-off-by: Tianyu Zhou

tianyuzhou95 · 2025-07-03T02:45:57Z

Thanks for pointing that out, already rebased :)

…t#6698) Signed-off-by: Tianyu Zhou. Co-authored-by: Zilin Zhu.

protected]> Co-authored-by: Ziming Huang <[email protected]> Co-authored-by: ayrnb <[email protected]> Co-authored-by: HydraQYH <[email protected]> Co-authored-by: ronnie_zheng <[email protected]> Co-authored-by: Maksim <[email protected]> Co-authored-by: VDV1985 <[email protected]> Co-authored-by: ispobock <[email protected]> Co-authored-by: TianyuZhang1214 <[email protected]> Co-authored-by: alpha-baby <[email protected]> Co-authored-by: Yuchen Cheng <[email protected]> Co-authored-by: Kay Yan <[email protected]> Co-authored-by: Caproni <[email protected]> Co-authored-by: Ximingwang-09 <[email protected]> Co-authored-by: 纬杭 <[email protected]> Co-authored-by: zyksir <[email protected]> Co-authored-by: SijiaYang <[email protected]> Co-authored-by: yicwang <[email protected]> Co-authored-by: Leng Yue <[email protected]> Co-authored-by: Qi Yuhang <[email protected]> Co-authored-by: Gang Chen <[email protected]> Co-authored-by: Pranjal Shankhdhar <[email protected]> Co-authored-by: jay <[email protected]>

…t#6698) Signed-off-by: Tianyu Zhou <[email protected]> Co-authored-by: Zilin Zhu <[email protected]>

tianyuzhou95 requested review from merrymercy, Ying1123, hnyls2002, xiezhq-hermann and zhaochenyang20 as code owners May 28, 2025 09:15

gemini-code-assist bot reviewed May 28, 2025


tianyuzhou95 changed the title ~~Albert/pause requests v0~~ Support updating weights at once by stopping all requests May 28, 2025

gemini-code-assist bot suggested changes May 28, 2025


python/sglang/srt/managers/tokenizer_manager.py (outdated, resolved)

python/sglang/srt/managers/scheduler.py (outdated, resolved)

python/sglang/srt/entrypoints/http_server.py (outdated, resolved)

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch 4 times, most recently from 2c08b33 to 7c0775e Compare June 3, 2025 03:29

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 7c0775e to c5bf0a3 Compare June 11, 2025 02:04

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 78811f2 to c063e38 Compare June 12, 2025 03:37

tianyuzhou95 requested review from zhyncs, ispobock, ByronHsu and CatherineSue as code owners June 12, 2025 03:37

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from da7c2a6 to 194428b Compare June 14, 2025 16:01

garrett4wade mentioned this pull request Jun 16, 2025

[Feature] Interrupt running requests when updating weights for RL #6486

Closed

2 tasks

zhuzilin mentioned this pull request Jun 21, 2025

[sglang] Tracking sglang compatibility in slime THUDM/slime#6

Open

20 tasks

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 194428b to a1ab3aa Compare June 23, 2025 02:03

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 833ecab to ffada2c Compare June 27, 2025 06:38

hebiao064 reviewed Jun 29, 2025


python/sglang/srt/managers/tokenizer_manager.py (outdated, resolved)

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from a0b2b53 to ffa89fe Compare June 29, 2025 06:08

hebiao064 approved these changes Jun 29, 2025


tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 40ee70d to 6edd44b Compare June 29, 2025 06:24

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 6edd44b to f52f998 Compare June 29, 2025 06:49

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 1d0bf68 to a48c38f Compare June 29, 2025 06:57

This was referenced Jun 30, 2025

[RL] support abort all and fix abort on waiting queue #6855

Closed

[RL] Add test for /abort_request #7626

Merged

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch 2 times, most recently from d34a764 to 6361e85 Compare July 2, 2025 07:39

hebiao064 approved these changes Jul 2, 2025


zhyncs assigned zhuzilin Jul 3, 2025

zhyncs added the high priority label Jul 3, 2025

zhuzilin and others added 4 commits July 3, 2025 10:42

support abort all and fix abort on waiting queue

9f47206

Co-authored-by: Tianyu Zhou <[email protected]>

add unit test for abort_all

3d54887

Signed-off-by: Tianyu Zhou <[email protected]>

support aborting all requests before updating weights

0116262

Set abort_all_requests=True to enable this feature. Signed-off-by: Tianyu Zhou <[email protected]>

add unit test for aborting all requests during update weights

5094845

Signed-off-by: Tianyu Zhou <[email protected]>
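
As a rough illustration of the `abort_all_requests=True` switch described in commit 0116262 above, the sketch below builds (but does not send) a weight-update request that asks the server to abort all running and queued requests first. The `abort_all_requests` field name comes from this PR. The `model_path` field, the `/update_weights_from_disk` route, the `localhost:30000` address, the checkpoint path, and the helper name `build_update_weights_request` are assumptions made for this example and should be checked against the SGLang version in use.

```python
import json
import urllib.request


def build_update_weights_request(base_url, model_path, abort_all_requests=True):
    """Build a POST request for a weight update (hypothetical helper).

    Assumption: the server exposes /update_weights_from_disk and accepts a
    JSON body containing `model_path` and the `abort_all_requests` flag
    added by this PR.
    """
    payload = {
        "model_path": model_path,
        # When True, the server is asked to abort every running and queued
        # request before loading the new weights, instead of waiting for
        # them to finish.
        "abort_all_requests": abort_all_requests,
    }
    return urllib.request.Request(
        url=f"{base_url}/update_weights_from_disk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed server address and checkpoint path, for illustration only.
req = build_update_weights_request(
    "http://localhost:30000", "/path/to/new/checkpoint"
)
# To actually send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Leaving the flag at `False` would keep the earlier behaviour described in the Motivation, where the update waits for in-flight generation to complete.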

tianyuzhou95 force-pushed the albert/pause-requests-v0 branch from 6361e85 to 5094845 Compare July 3, 2025 02:44

Merge branch 'main' into albert/pause-requests-v0

4997fef

zhyncs merged commit d3c275b into sgl-project:main Jul 3, 2025
95 of 104 checks passed

zhuzilin mentioned this pull request Jul 16, 2025

Update sglang_example.py THUDM/slime#63

Closed

chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025

Support updating weights at once by stopping all requests (sgl-projec…

0354142

…t#6698) Signed-off-by: Tianyu Zhou <[email protected]> Co-authored-by: Zilin Zhu <[email protected]>

shuaills pushed a commit to shuaills/sglang that referenced this pull request Jul 21, 2025

Support updating weights at once by stopping all requests (sgl-projec…

e33732a

…t#6698) Signed-off-by: Tianyu Zhou <[email protected]> Co-authored-by: Zilin Zhu <[email protected]>


tianyuzhou95 commented May 28, 2025 • edited


Motivation

Modifications


gemini-code-assist bot left a comment


Summary of Changes

Highlights

Changelog

Footnotes


gemini-code-assist bot left a comment


Code Review

Summary of Findings

Merge Readiness


tianyuzhou95 commented Jun 10, 2025

zhaochenyang20 commented Jun 10, 2025

tianyuzhou95 commented Jun 11, 2025 • edited

zhyncs commented Jun 23, 2025

tianyuzhou95 commented Jun 23, 2025 • edited

zhaochenyang20 commented Jun 25, 2025

tianyuzhou95 commented Jun 25, 2025

hebiao064 commented Jun 29, 2025

hebiao064 commented Jun 29, 2025

tianyuzhou95 commented Jun 29, 2025

zhyncs commented Jul 3, 2025

tianyuzhou95 commented Jul 3, 2025
