feat(oai refactor): Replace openai_api with entrypoints/openai (#7351)

Conversation

@CatherineSue (Collaborator) commented Jun 19, 2025

Motivation

This PR aims to refactor and modernize the OpenAI API implementation in SGLang by removing the legacy openai_api module and consolidating it into the entrypoints/openai structure.

Modifications

1. Remove Legacy OpenAI API Module

  • Removed sglang/srt/openai_api directory (commit 5a7df078)
  • Replaced all imports with sglang/srt/entrypoints/openai
  • Removed batch and file endpoints from http_server.py
  • Updated OpenAI-compatible endpoints in http_server.py
  • Removed openai/api_server.py and conftest.py (commit d6d7351c)
  • Consolidated functionality into the new structure

2. Introduce Centralized Template Management (commit 7d343747)

  • Created TemplateManager class to eliminate global template state
  • Replaced global chat_template_name/completion_template_name variables
  • Integrated TemplateManager into _GlobalState with proper dependency injection
  • Added proper type hints for TokenizerManager and TemplateManager parameters
  • Extracted and reorganized Jinja template utilities
    • Created jinja_template_utils.py with template detection and processing logic (moved from utils.py)
    • Moved utilities from openai/utils.py to avoid improper dependencies
    • Renamed detect_template_content_format → detect_jinja_template_content_format
  • Optimized template content format detection (see the sketch after this list)
    • Detect format during template loading instead of on every request
    • Cache detected format in TemplateManager.jinja_template_content_format
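A minimal sketch of the caching idea, with illustrative names and signatures (not the exact SGLang implementation): the content format of the tokenizer's Jinja chat template is detected once at load time and exposed as a property, so request handling only reads the cached value.

```python
from typing import Optional


def detect_jinja_template_content_format(jinja_template: str) -> str:
    """Toy heuristic standing in for the real detection logic: templates that loop
    over a message's content parts expect the OpenAI list-of-parts format,
    otherwise plain strings."""
    loops_over_content = "for content in message" in jinja_template
    return "openai" if loops_over_content else "string"


class TemplateManager:
    """Holds chat/completion template state instead of module-level globals."""

    def __init__(self) -> None:
        self.chat_template_name: Optional[str] = None
        self.completion_template_name: Optional[str] = None
        self._jinja_format: Optional[str] = None

    def load_chat_template(
        self, jinja_template: Optional[str], template_name: Optional[str]
    ) -> None:
        # Detection happens once here, at load time, not on every request.
        self.chat_template_name = template_name
        if template_name is None and jinja_template:
            self._jinja_format = detect_jinja_template_content_format(jinja_template)

    @property
    def jinja_template_content_format(self) -> Optional[str]:
        return self._jinja_format
```

With this shape, http_server.py can construct a single TemplateManager, store it in _GlobalState, and inject it into the serving classes, which is the dependency-injection pattern described above.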

3. Error Fixes and Improvements

  • Fixed validation and error handling (commit b0c940cb)
    • Added validation_exception_handler to http_server.py. It enforces Content-Type: application/json in the request header, per the OpenAI standard, and lets FastAPI automatically decode and validate the payload, removing the need to handle it manually in each endpoint (a hedged sketch follows this list).
  • Fixed import and parameter errors (commits 90cc976e, 40a5c5a2)
    • Fixed the is_multimodal not found error
    • Fixed enable_thinking parameter issues
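A hedged sketch of what such a FastAPI validation_exception_handler can look like; the handler name matches the description above, but the response shape here is illustrative rather than SGLang's exact format.

```python
from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

app = FastAPI()


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    # By the time this runs, FastAPI has already parsed the JSON body and validated it
    # against the Pydantic request model; the handler only translates failures into an
    # OpenAI-style error payload instead of FastAPI's default 422 response.
    return JSONResponse(
        status_code=400,
        content={"error": {"message": str(exc.errors()), "type": "invalid_request_error"}},
    )
```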

4. Code Cleanup and Optimization

  • Cleaned up unused imports and code (commit f9add484)
    • Removed unused imports in entrypoints/openai
    • Removed unnecessary tests in test/srt/openai
    • Updated test timings in run_suite.py
    • Removed V1RerankReqInput duplication between openai.protocol and io_struct
  • Improved logging (commits c3be8638, a7d3b35c)
    • Use logger.exception for better error details
    • Skip embeddings in log_requests for more concise logs

5. Bug Fixes for Specific Tests

  • Fixed streaming response format (commit f9add484)
    • Updated test_openai_server.py to handle the separated usage chunk and finish_reason chunk
    • Removed the batch test in test_openai_server.py
  • Removed enable_thinking from reasoning content handling (commits e583fd9, (b22eb4a...2dc0341))
    • The problem relates to the enable_thinking parameter, which was used as a condition for reasoning content parsing: if reasoning_parser and request.separate_reasoning and enable_thinking:. See adapter.py and serving_chat.py.
    • Some models, such as DeepSeek, do not support the enable_thinking parameter.
    • This parameter is only supported by Qwen3, in its chat_template.
    • Our logic set it to True unless it was explicitly set through chat_template_kwargs. However, this confuses developers, since not all models support enable_thinking.
    • Suggestion (a minimal sketch of the revised condition follows this list):
      • The server should not use enable_thinking as a condition for reasoning content parsing, since it is specific to Qwen3's chat template and not a general OpenAI-compatible parameter.
      • It is more appropriate to treat it as a sampling parameter, controlled entirely by the chat template.
      • ReasoningParser had force_thinking=True, which caused models like Qwen3 to treat even the initial user message as part of the reasoning, leading to unexpected behavior.
      • For models like DeepSeek-R1, the tag was removed in their templates, so enable_thinking is irrelevant.
      • This PR changed force_thinking=True to False in Qwen3Detector and fixed the logic in BaseReasoningFormatDetector so it can handle cases where force_thinking=False.
      • Test: test_reasoning_content currently passes.
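As referenced in the suggestion above, a minimal sketch of the revised condition (illustrative, not the exact SGLang code): reasoning content is parsed whenever a reasoning parser is configured and the request asks for separate reasoning, without consulting the Qwen3-specific enable_thinking flag.

```python
def should_parse_reasoning(reasoning_parser, request) -> bool:
    # Old condition (simplified): reasoning_parser and request.separate_reasoning and enable_thinking
    # New condition: enable_thinking is left entirely to the chat template / sampling side.
    return bool(reasoning_parser) and bool(getattr(request, "separate_reasoning", False))
```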

Checklist

@gemini-code-assist bot left a comment

Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the OpenAI-compatible API serving infrastructure and template management within the system. The primary goal is to enhance modularity, improve code organization, and centralize template handling, leading to a more maintainable and extensible codebase.

Highlights

  • API Module Reorganization: The core OpenAI API serving logic has been moved from sglang.srt.openai_api to sglang.srt.entrypoints.openai, establishing a clearer separation of concerns and improving code structure.
  • Centralized Template Management: A new TemplateManager class is introduced to consolidate the loading and management of both chat and completion templates, replacing scattered logic and global state for improved modularity.
  • Dedicated API Handlers: The HTTP server now utilizes specialized OpenAIServing classes for different OpenAI API endpoints (e.g., chat, completions, embeddings, rerank, score), promoting a more object-oriented approach to API request handling.
  • Expanded API Functionality: New OpenAI-compatible endpoints for reranking (/v1/rerank) and scoring (/v1/score) are added, leveraging the new serving class architecture.
  • Streamlined Protocol Definitions: Minor adjustments are made to the OpenAI protocol definitions, such as making finish_reason and dimensions optional, for increased flexibility and robustness.
  • Removal of Batch API: Existing batch API endpoints (/v1/files, /v1/batches) and their associated test cases have been removed, simplifying the API surface.
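To illustrate the "Dedicated API Handlers" highlight, here is a rough wiring sketch with assumed constructor signatures; the real serving classes and routes live under sglang/srt/entrypoints/openai and may differ in detail.

```python
from dataclasses import dataclass
from typing import Any


class OpenAIServingChat:
    """Stub standing in for the real serving class; the actual implementation in
    sglang/srt/entrypoints/openai contains the /v1/chat/completions logic."""

    def __init__(self, tokenizer_manager: Any, template_manager: Any) -> None:
        self.tokenizer_manager = tokenizer_manager
        self.template_manager = template_manager


@dataclass
class _GlobalState:
    tokenizer_manager: Any
    template_manager: Any


def build_chat_handler(state: _GlobalState) -> OpenAIServingChat:
    # One dedicated serving object per endpoint, constructed from shared state,
    # instead of one monolithic adapter module.
    return OpenAIServingChat(state.tokenizer_manager, state.template_manager)
```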

@gemini-code-assist bot left a comment

Code Review

This pull request primarily refactors the OpenAI API integration by moving code from an openai_api directory to entrypoints/openai and introducing a TemplateManager for better organization. It also adds new serving classes for different OpenAI-compatible endpoints like rerank and score.

Overall, the refactoring appears to be a positive step towards better code organization and maintainability. My main concerns revolve around potential API contract changes in protocol.py and the removal of batch file processing functionality, which should be clarified if it's an intended change for this PR.

Key Changes & Observations:

  • Refactoring: Successfully moved OpenAI related logic to the entrypoints/openai path.
  • TemplateManager: Introduced TemplateManager to centralize chat and completion template loading and management, which is a good improvement over global state or scattered logic.
  • New Serving Classes: Added dedicated classes (OpenAIServingChat, OpenAIServingCompletion, OpenAIServingEmbedding, OpenAIServingRerank, OpenAIServingScore) to handle specific OpenAI endpoints. This enhances modularity.
  • Endpoint Removal: The /v1/files and /v1/batches endpoints, along with their associated functionality and tests, have been removed. This is a significant functional change and needs confirmation if it's within the scope of this refactor.
  • Protocol Changes:
    • finish_reason in CompletionResponseChoice and ChatCompletionResponseChoice is now Optional.
    • ChatCompletionRequest's validate_messages_not_empty validator was removed.
    • ChatCompletionRequest's set_tool_choice_default validator was simplified.
    • A new RerankResponse class was added.
  • Logging: Improved error logging in serving_base.py by using logger.exception.
  • Docstrings: Added a more detailed module docstring to python/sglang/srt/conversation.py.

Recommendations:

  • Clarify if the removal of batch file processing APIs (/v1/files, /v1/batches) is an intended part of this refactoring PR.
  • Review the API contract changes in protocol.py to ensure they are intentional and to understand any potential impact on clients.
  • Consider re-adding Pydantic-level validation for non-empty messages in ChatCompletionRequest if it's a strict requirement.

The changes generally improve the codebase structure. Addressing the points above will help ensure the quality and clarity of this refactor.

```diff
     text: str
     logprobs: Optional[LogProbs] = None
-    finish_reason: Literal["stop", "length", "content_filter", "abort"]
+    finish_reason: Optional[Literal["stop", "length", "content_filter", "abort"]] = None
```
@gemini-code-assist bot commented (severity: medium):

Making finish_reason optional is an API contract change. Is it now possible for finish_reason to be None in some scenarios? If so, could you briefly explain these scenarios or point to where this change is handled downstream? This also applies to ChatCompletionResponseChoice.finish_reason on line 428.
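One scenario where a None finish_reason commonly appears (an assumption based on typical OpenAI-compatible streaming behavior, not something confirmed in this thread) is intermediate streaming chunks, where only the final chunk carries a finish reason; this PR also emits the usage chunk separately from the finish_reason chunk, as noted in the description.

```python
# Illustrative streaming chunk payloads; only the final choice carries a finish_reason.
intermediate_chunk = {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": None}]}
final_chunk = {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
```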

Comment on lines 380 to 412
```diff
     @model_validator(mode="before")
     @classmethod
     def set_tool_choice_default(cls, values):
-        if isinstance(values, dict):
-            if values.get("tool_choice") is None:
-                if values.get("tools") is None:
-                    values["tool_choice"] = "none"
-                else:
-                    values["tool_choice"] = "auto"
+        if values.get("tool_choice") is None:
+            if values.get("tools") is None:
+                values["tool_choice"] = "none"
+            else:
+                values["tool_choice"] = "auto"
         return values
```
@gemini-code-assist bot commented (severity: medium):

The isinstance(values, dict) check was removed from set_tool_choice_default validator. This makes the validator slightly less robust if ChatCompletionRequest could be instantiated with non-dict-like data that Pydantic can't coerce to a dict before this validator runs. This could lead to an AttributeError on values.get().


```diff
         except Exception as e:
-            logger.error(f"Error in request: {e}")
+            logger.exception(f"Error in request: {e}")
```
@gemini-code-assist bot commented (severity: medium):

Using logger.exception here instead of logger.error is a good change as it will automatically include the stack trace in the log, which is very helpful for debugging unexpected errors.

Comment on lines 225 to 152
```diff
         if self.template_manager.chat_template_name is None:
             prompt, prompt_ids, image_data, audio_data, modalities, stop = (
                 self._apply_jinja_template(request, tools, is_multimodal)
             )
         else:
-            prompt, image_data, audio_data, modalities, stop = (
+            prompt, prompt_ids, image_data, audio_data, modalities, stop = (
                 self._apply_conversation_template(request)
             )
```
@gemini-code-assist bot commented (severity: medium):

The logic for choosing between Jinja template and conversation template application has been updated to use self.template_manager.chat_template_name. This is a good centralization of template-related state and logic into the new TemplateManager.

```diff
         )

-        template_content_format = self._cached_template_format
+        template_content_format = self.template_manager.jinja_template_content_format
```
@gemini-code-assist bot commented (severity: medium):

Using self.template_manager.jinja_template_content_format instead of detecting and caching the format within this class is a good simplification and aligns with the TemplateManager's role.

Comment on lines +331 to 255
```python
        stop = request.stop
        image_data = image_data if image_data else None
        audio_data = audio_data if audio_data else None
        modalities = modalities if modalities else []
        return prompt, prompt_ids, image_data, audio_data, modalities, stop
```
@gemini-code-assist bot commented (severity: medium):

The assignment stop = request.stop is a bit more direct than request.stop or []. If request.stop can be None, this change is fine as None is often a valid value for stop sequences (meaning use default or no specific stop sequences beyond EOS). The conditional initialization of image_data, audio_data, and modalities is also clearer.

Comment on lines +381 to +306
```python
        if not is_multimodal:
            prompt_ids = self.tokenizer_manager.tokenizer.encode(prompt)

        return prompt, prompt_ids, image_data, audio_data, modalities, stop
```
@gemini-code-assist bot commented (severity: medium):

Moving the tokenization of prompt (if not multimodal) into _apply_conversation_template from its previous location in _process_messages makes sense, as _apply_conversation_template is now responsible for returning prompt_ids as well.

Comment on lines 686 to 650
```python
        if request.chat_template_kwargs is not None:
            enable_thinking = request.chat_template_kwargs.get(
                "enable_thinking", True
            )
```
@gemini-code-assist bot commented (severity: medium):

Checking if request.chat_template_kwargs is not None: before accessing request.chat_template_kwargs.get(...) is a good defensive check. The previous getattr(request, "chat_template_kwargs", {}).get(...) was also safe, but this is arguably more explicit and readable.

@CatherineSue (Collaborator, Author) replied:

The original issue is that the default is None; it will override {} and raise an error on None.get().
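Concretely, a minimal illustration of the failure mode described above: getattr's default only applies when the attribute is missing, not when it exists and is explicitly None.

```python
class FakeRequest:
    chat_template_kwargs = None  # explicitly None, as in the default request model


req = FakeRequest()
try:
    getattr(req, "chat_template_kwargs", {}).get("enable_thinking", True)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'get'

# The explicit None check sidesteps this:
enable_thinking = True
if req.chat_template_kwargs is not None:
    enable_thinking = req.chat_template_kwargs.get("enable_thinking", True)
```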

Comment on lines +1 to 5
"""Template utilities for Jinja template processing.
This module provides utilities for analyzing and processing Jinja chat templates,
including content format detection and message processing.
"""
@gemini-code-assist bot commented (severity: medium):

The file rename from openai_api/utils.py to jinja_template_utils.py and the updated module docstring make the purpose of this module clearer, focusing specifically on Jinja template utilities.

CatherineSue pushed a commit to woodx9/sglang that referenced this pull request Jun 20, 2025
@CatherineSue CatherineSue force-pushed the chang/remove-adapter branch 2 times, most recently from fa46bfc to 019e5e1 on June 20, 2025 23:28
@CatherineSue CatherineSue marked this pull request as ready for review June 21, 2025 01:47
@slin1237 slin1237 self-requested a review June 21, 2025 01:51
CatherineSue and others added 11 commits June 21, 2025 02:20
- Replace all imports with sglang/srt/entrypoints/openai
- Remove batch and file endpoints in http_server.py
- Update openai-compatible endpoints in http_server.py
- Add TODO find a better way for the chat template management
- logger.error won't print detailed exec details
- prompt and prompt_ids are not created before referenced
- modalities data should be None in `_apply_jinja_template` otherwise there will be an error in GenerateReqInput.normalize
…utilities

* Create centralized TemplateManager class to eliminate global template state
  - Replace global chat_template_name/completion_template_name variables
  - Integrate TemplateManager into _GlobalState and inject into serving classes
  - Add proper type hints for TokenizerManager and TemplateManager parameters

* Extract and reorganize Jinja template utilities for better separation of concerns
  - Create jinja_template_utils.py with template detection and processing logic
  - Move utilities from openai/utils.py to avoid improper dependencies
  - Rename detect_template_content_format -> detect_jinja_template_content_format

* Optimize template content format detection for better performance
  - Detect format during template loading instead of on every request
  - Cache detected format in TemplateManager.jinja_template_content_format property
  - Add logging for template format detection results

* Clean up codebase and improve maintainability
  - Remove unused imports and clean up import organization
  - Simplify TemplateManager interface by removing unused is_initialized property
  - Update all serving classes (chat, completions, embedding) to use dependency injection
  - Improve code organization and eliminate architectural debt

Benefits:
- ✅ Eliminates global state pollution
- ✅ Better separation of concerns (generic vs OpenAI-specific utilities)
- ✅ Improved performance through caching
- ✅ Cleaner dependency injection pattern
- ✅ More testable and maintainable architecture
- The handling for a single string in a list can be removed as #7396 is merged.
- Add UT cases in test_openai_server for such case
…ontent

- Remove enable_thinking in the check condition when handling reasoning content:

enable_thinking is a flag that is only supported by Qwen3, in its chat_template. We pass this parameter in `self.tokenizer_manager.tokenizer.apply_chat_template`.

When handling reasoning content, as other models don't support enable_thinking, this flag should be removed from the check condition.

- Add back `_get_enable_thinking_from_request` as a util function, as some reasoning-related parsers or backends may need it in the future.
@woodx9 (Contributor) commented Jun 21, 2025

I tested the embedding, rerank, and score endpoints; all seem good.

- Correct name should be `detect_jinja_template_content_format`
Qwen3 should not assume it is always in reasoning mode, since Qwen3 supports a parameter called `enable_thinking=False` in its chat_template; in that case, it won't generate thinking content.
…ing is False

- Introduce a self._in_reasoning to handle case when force_reasoning is False
We don't need to separate _in_reasoning and _force_in_reasoning
- If there is already a tool, the next token can be simply a tool_call_separator, before following with a `{`
@CatherineSue CatherineSue changed the title feat(oai refactor): Remove openai_api with entrypoints/openai feat(oai refactor): Replace openai_api with entrypoints/openai Jun 21, 2025
- Add `retrieve_model` in `http_server.py`. It will retrieve a model's information and return 404 if the model is not in the served model list.
- Add UT
@CatherineSue (Collaborator, Author) commented:

unittest-test-backend-4-gpu

Local test:
test_local_attn — [screenshot attached]
test_pp_single_node — [screenshot attached]

@zhyncs zhyncs merged commit 72676cd into main Jun 21, 2025
67 of 91 checks passed
@zhyncs zhyncs deleted the chang/remove-adapter branch June 21, 2025 20:21
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
yilian49 pushed a commit to yilian49/sglang that referenced this pull request Jun 24, 2025
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
@taegeonum commented Jul 9, 2025

@CatherineSue Hello, may I ask why the files/batch APIs (/v1/files, /v1/batches) have been removed? Since we need these features, how can we support them again?

chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Jul 17, 2025
* Use seq_len_fill_value in the cuda graph runners (sgl-project#7233)

* support custom weight loader for model runner (sgl-project#7122)

Co-authored-by: kavioyu <[email protected]>

* Fix AMD speculative decoding (sgl-project#7252)

* [Refactor] OAI Server components (sgl-project#7167)

Signed-off-by: Xinyuan Tong <[email protected]>

* OAI Server Skeleton & Core Utility Endpoints (sgl-project#7179)

* [amd] Opt dsv3 moe (sgl-project#7160)

Co-authored-by: wunhuang <[email protected]>

* update ci node for xeon (sgl-project#7265)

* feat: mtp support dp-attention (sgl-project#6081)

Co-authored-by: austindeng <[email protected]>
Co-authored-by: tianqilin.99 <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: ch-wan <[email protected]>

* support qwen2 running on ascend npu device (sgl-project#7022)

Co-authored-by: 刁莹煜 <[email protected]>

* Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. (sgl-project#7164)

* bugfix(tool call ebnf): Fix EBNF generation for optional function parameters (sgl-project#7283)

* Fix AWQ Dequant and Weight Loading of deepseek v2 (sgl-project#6842)

* fix: resolve b200 dsv3 mtp issue (sgl-project#7286)

* ci: Fix test_ebnf_generate_all_optional_function_params (sgl-project#7288)

* fix: only enable flash_attn test on sm80 sm90 (sgl-project#7289)

* [PD] Support get local ip from NIC for PD disaggregation (sgl-project#7237)

Signed-off-by: Shangming Cai <[email protected]>

* [PD] Add custom memory pool option to support Mooncake PD with NVLink  (sgl-project#7264)

Signed-off-by: Shangming Cai <[email protected]>

* Upstreaming hicache bug fixes (sgl-project#7267)

* Update python API of activation, topk, norm and rope and remove vllm dependency (sgl-project#6614)

Co-authored-by: Wu, Chunyuan <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: sdp <[email protected]>

* Fix hicache benchmark script bug - some sampled input_request is [] (sgl-project#7300)

* chore: change logs from`INFO` to `DEBUG` for dp and add force quit for tokenizer manager (sgl-project#7251)

* update invalid link in doc (sgl-project#7297)

* Fix mini_lb for PD with long output: limit chunk size of decode response (sgl-project#7301)

Signed-off-by: ch-tiger1 <[email protected]>
Co-authored-by: ch-tiger1 <[email protected]>

* Fix profiler error when there are idle passes (sgl-project#7003)

* [pd] optimize dockerfile for  pd disaggregation (sgl-project#7319)

Co-authored-by: zhyncs <[email protected]>

* Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router (sgl-project#7096)

* Add more refactored openai test & in CI (sgl-project#7284)

* fix: resolve blackwell deepep image issue (sgl-project#7331)

* add seed in CPU UTs to avoid flaky failure (sgl-project#7333)

* Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (sgl-project#7099)

* Reintroduce tiny fix sampler error when prob is not contiguous (sgl-project#7354)

* [Refactor] Clean up radix cache related API (sgl-project#7303)

Co-authored-by: Zhiqiang Xie <[email protected]>

* Put `_normalize_rid` before other normalization in `io_struct` (sgl-project#7363)

* [PD] Transfer hidden states for mtp when disaggregation (sgl-project#7242)

* [Bugfix][PD] Set conclude state before clear when failure happens (sgl-project#7362)

Signed-off-by: Shangming Cai <[email protected]>

* docs: update installation (sgl-project#7366)

* [Docker] optimize dockerfile  remove deepep and blackwell merge it to… (sgl-project#7343)

Co-authored-by: Yineng Zhang <[email protected]>

* Clean unused import for mimo mtp model (sgl-project#7370)

* [Bugfix]Fix hang bug using dp attention with HiRadixCache (sgl-project#7159)

Signed-off-by: huanglong <[email protected]>

* [Doc] add embedding rerank doc (sgl-project#7364)

* Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization (sgl-project#7371)

* Feat/refactor embedding server (sgl-project#7322)

* Purge VerlEngine (sgl-project#7326)

Signed-off-by: Ata Fatahi <[email protected]>

* support return logprobs for pipeline (sgl-project#7356)

Co-authored-by: Zhang Kaihong <[email protected]>

* [PD] Optimize custom mem pool usage and bump mooncake version (sgl-project#7393)

Signed-off-by: Shangming Cai <[email protected]>

* Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. (sgl-project#5485)

* Refine OpenAI serving entrypoint to remove batch requests (sgl-project#7372)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: Chang Su <[email protected]>

* [Feature] Comprehensive Hybrid Parallelism Support (sgl-project#6389)

* [DeepSeekNextN] fix: residual of head norm can be None (sgl-project#7398)

* [OAI refactor] Add rerank and score serving (sgl-project#7399)

Co-authored-by: Chang Su <[email protected]>

* [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor (sgl-project#7360)

Co-authored-by: Chang Su <[email protected]>

* Fix All-Gather under world size one (sgl-project#7219)

* Optimize DP attn scheduling for speculative decoding (sgl-project#7285)

* Update usage_processor.py (sgl-project#7402)

* Fix 7285 Merge Conflicts (sgl-project#7403)

* chore: upgrade mooncake-transfer-engine 0.3.4 (sgl-project#7401)

* [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State (sgl-project#7329)

Signed-off-by: keru <[email protected]>

* Remove batches api in docs & example (sgl-project#7400)

* [BugFix]: fix EmbeddingReqInput single input error (sgl-project#7396)

* [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator (sgl-project#7394)

* fix overlap pagecount (sgl-project#6984)

Co-authored-by: Zhiqiang Xie <[email protected]>

* fix: Fix CI test_function_call_parser.py (sgl-project#7425)

* Fix CPU offloading for MLA memory pool (sgl-project#7409)

* [fix] PD disaggregation when enable mtp and tp!=dp (sgl-project#7420)

* feat(oai refactor): Replace `openai_api` with `entrypoints/openai`  (sgl-project#7351)

Co-authored-by: Jin Pan <[email protected]>

* Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (sgl-project#7412)

* refactor(test): reorganize OpenAI test file structure (sgl-project#7408)

* [minor] simplify the `TokenToKVPoolAllocator` (sgl-project#7414)

* Tiny add logging for GC  (sgl-project#7406)

* FlashInfer NVFP4 MoE with EP & 2-stream shared expert (sgl-project#7327)

Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: alcanderian <[email protected]>

* Remove copy after bmm (sgl-project#7441)

* Fix torch compile run (sgl-project#7391)

Co-authored-by: wunhuang <[email protected]>
Co-authored-by: Sai Enduri <[email protected]>

* [misc] Add PD service discovery support in router (sgl-project#7361)

* add fused moe config for qwen3 in triton3.3.1 (sgl-project#7445)

* Fix CUDA Graph Check under Deepep with DP FFN (sgl-project#7451)

* Update hyperparameter_tuning.md (sgl-project#7454)

* feat: integrate deepgemm into EPMoE (sgl-project#6821)

Co-authored-by: tianqilin.99 <[email protected]>
Co-authored-by: TianQiLin666666 <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>

* Solve docker build failed in the virtual machine (sgl-project#7290)

Co-authored-by: wunhuang <[email protected]>
Co-authored-by: Sai Enduri <[email protected]>
Co-authored-by: HAI <[email protected]>

* Fix a bug in BatchTokenIDOut & Misc style and dependency updates (sgl-project#7457)

* [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests (sgl-project#7472)

Signed-off-by: Shangming Cai <[email protected]>

* Fix prefill OOM due to wrong token calculation when page > 1  (sgl-project#7397)

* feat(func_call): Add more check in `BaseFormatDetector.parse_streaming_increment` (sgl-project#7479)

* Fix dtype for idle input in spec decoding (sgl-project#7456)

* update mooncake in dockerfile (sgl-project#7480)

* kvcache io kernels and test case (sgl-project#7382)

* [perf] slightly imporve DeepSeek-R1-FP4 TP8 (sgl-project#7481)

* Quick fix for DeepGemm requant to also cover MTP. (sgl-project#7378)

* Support weight loading without mmap (sgl-project#7469)

* ci: Revert openai_server related tests in AMD suites (sgl-project#7449)

* Perormance: Enable cuda graph for dp idle batch (sgl-project#7269)

Co-authored-by: austindeng <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: ch-wan <[email protected]>

* bugfix: Prevent global mutation of conv.stop_str across requests (sgl-project#7347)

Co-authored-by: Chang Su <[email protected]>

* Fix RequestValidationError response format (sgl-project#7487)

* Fix MTP with Deepseek R1 Fp4 (sgl-project#7376)

* chore: bump sgl-kernel v0.2.0 (sgl-project#7490)

* chore: bump v0.4.8 (sgl-project#7493)

* [AMD] add aiter fused moe in DeepEP path (sgl-project#7268)

* enable aiter_biased_grouped_topk kernel (sgl-project#7423)

* [PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)

* Remove cumsum_buffer initilization (sgl-project#7439)

* [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm (sgl-project#7422)

* Support multi-thread model weight loading (sgl-project#7277)

* [PD] NIXL: Register kv args in advance and cleanup finished requests (sgl-project#6717)

* fix: Add `--model` as an alias for `--model-path` in server_args (sgl-project#7505)

* misc: Improvement to serving_chat.py and add more ut (sgl-project#7489)

* Fuse sorted_token_ids padding to moe_align_block_size kernel (sgl-project#7437)

* [OAI] patch origin request_id logic (sgl-project#7508)

* [PD][Spec] Fix hidden state transfer for spec decode (sgl-project#7516)

Signed-off-by: Shangming Cai <[email protected]>

* EPLB support for MTP (sgl-project#7510)

* clean duplicate code (sgl-project#7512)

* [ci] add router benchmark script and CI (sgl-project#7498)

* fix: force synchronization between TP workers when update_weights (sgl-project#6626)

Co-authored-by: dangkai.dk <[email protected]>

* [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model (sgl-project#6641)

Co-authored-by: Thien Tran <[email protected]>

* [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug (sgl-project#7522)

Signed-off-by: Shangming Cai <[email protected]>

* npu fused op (sgl-project#7386)

Co-authored-by: Li Junwen <[email protected]>

* feat: send kvmetrics from sglang scheduler (sgl-project#6721)

* [PD] Add different TP sizes support for no-MLA models (sgl-project#6793)

Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>

* enable aiter fp8 blockscale quant (sgl-project#7520)

* take aiter get_rope back (sgl-project#7521)

* Fix typo of flash_cache (sgl-project#7513)

* feat: add return hidden_states at async generation (sgl-project#7507)

* minor: 'role' must be system/assistant/tool, but case insensitive for now (sgl-project#7499)

* Fix FP8 KV Cache Support in FA3 Backend (sgl-project#7148)

* Fix gathered_buffer issues in tbo (sgl-project#7531)

* [PD] Raise error for incompatible mooncake version and some minor fixes (sgl-project#7527)

Signed-off-by: Shangming Cai <[email protected]>

* [CMake] Fix sgl-kernel CMakeLists for Blackwell (sgl-project#7543)

* Add Tencent HunYuanMoEV1 model support (sgl-project#7549)

* Update seed in CPU UTs to avoid flaky failure with single test (sgl-project#7544)

* chore: improve ci bug reporting (sgl-project#7542)

* chore: remove vlm unnecessary import (sgl-project#7541)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Mick <[email protected]>

* chore: bump v0.4.8.post1 (sgl-project#7559)

* [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND (sgl-project#7330)

* [Fix] incorrect assert in EPLB (sgl-project#7575)

* Updates Gemma3n MLP layer to adapt latest transformers version (sgl-project#7573)

Signed-off-by: Xinyuan Tong <[email protected]>

* Fix MTP error when enabling two-batch overlap  (sgl-project#7569)

* Add e2e test for multi instance multi stage memory release/resume occupuation (sgl-project#7208)

Signed-off-by: Ata Fatahi <[email protected]>

* [CI] Add CI Testing for Prefill-Decode Disaggregation with Router (sgl-project#7540)

* Updates transformers and timm dependencies (sgl-project#7577)

Signed-off-by: Xinyuan Tong <[email protected]>

* feat: support compatibility between MTP and two-batch-overlap (sgl-project#7225)

Co-authored-by: Cheng Wan <[email protected]>

* Move multimodal processors into a separate folder (sgl-project#7581)

* Fix broken CI TestVILAServer (sgl-project#7610)

* [router] add centralized configuration module for sgl-router (sgl-project#7588)

* Fix: Minicpm (sgl-project#7612)

Signed-off-by: Xinyuan Tong <[email protected]>

* Hybrid kv cache for LLaMA4 (sgl-project#6563)

Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: Hanming Lu <[email protected]>

* [CPU] add optimizations for INT8 and FP8 DeepSeek (sgl-project#6769)

Co-authored-by: Zheng, Beilei <[email protected]>

* Tiny add logs for expert location updater (sgl-project#7308)

* Fix flakiness in LoRA batch test. (sgl-project#7552)

* [BUG] fix local_rank in initialize_dp_attention (sgl-project#7584)

* Support dynamic LoRA loading / unloading in engine/server API (sgl-project#7446)

* [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated (sgl-project#7598)

Signed-off-by: Shangming Cai <[email protected]>

* fix unit tests (sgl-project#7618)

* Let ep_scatter support arbitrary strides / ue8m0 format (sgl-project#7309)

* Let EP prefill support new DeepGEMM (sgl-project#7310)

* docs: add gb200 nvl72 and a16z grant (sgl-project#7620)

* oai: Adds support for OpenAI chat completions API in bench_serving (sgl-project#7036)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Mick <[email protected]>

* [bugfix] Remove PR comment posting from Rust benchmark workflow (sgl-project#7625)

* [Minor] clean up multimodal processor and tokenizer manager (sgl-project#7624)

* Add dsv3 fused a gemm to sgl-kernel (sgl-project#7630)

* Add @mickqian as the CODEOWNERS of multimodal (sgl-project#7636)

* Fix stream reasoning parser and Adds Kimi reasoning parser  (sgl-project#7432)

Signed-off-by: Xinyuan Tong <[email protected]>

* Fix sgl-router startup crash (sgl-project#7619)

* [bugfix] fix runtime dropping panic in editable (sgl-project#7628)

* Move files related to EPLB (sgl-project#7580)

* [misc] reduce weird rope_scaling_factor warning (sgl-project#7176)

* [AMD] Add unit-test-sgl-kernel-amd to AMD CI (sgl-project#7539)

* Update CODEOWNERS (sgl-project#7640)

* [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py (sgl-project#7643)

* [CPU] add c++ kernel to bind CPU cores and memory node (sgl-project#7524)

* Improve streaming, log_level, memory report, weight loading, and benchmark script (sgl-project#7632)

Co-authored-by: Kan Wu <[email protected]>

* Add dsv3 router gemm kernel (sgl-project#7627)

* chore: upgrade flashinfer v0.2.7 jit (sgl-project#7663)

* [doc] update lws doc for pd (sgl-project#7318)

* Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes (sgl-project#7648)

* Add small requirements for benchmark/parse_result tools (sgl-project#7671)

* [CPU] remove process_group from inputs of shm_allreduce and shm_allgather (sgl-project#7486)

* chore: bump sgl-kernel v0.2.1 (sgl-project#7675)

* support llama4 eagle3  (sgl-project#6985)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Shenggui Li <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: yizhang2077 <[email protected]>

* Refactor mm processors and Enable mixed modality processing (sgl-project#7629)

Signed-off-by: Xinyuan Tong <[email protected]>

* upgrade sgl kernel to 0.2.1 for main (sgl-project#7676)

* add description for llama4 eagle3 (sgl-project#7688)

* fix(model loader): use safe_open to prevent file handle leaks. (sgl-project#7684)

* chore: upgrade flashinfer v0.2.7.post1 (sgl-project#7698)

* Improve error handling for requests with unloaded LoRA path(s) (sgl-project#7642)

* Apply dsv3_fused_a_gemm kernel (sgl-project#7635)

* Fix GPTQMarlinMoE (sgl-project#7697)

* [1/n] apply wna16marlin kernel in moe weight only quantization (sgl-project#7683)

Co-authored-by: 晟海 <[email protected]>
Co-authored-by: yych0745 <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: 弋云 <[email protected]>
Co-authored-by: walker-ai <[email protected]>

* Apply dsv3 router gemm kernel for deepseek-r1 fp4 (sgl-project#7677)

* [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill (sgl-project#7717)

* [RL] add --skip-warmup (sgl-project#7416)

* [RL] support update_weights_from_distributed with different group and multiple weights (sgl-project#7292)

* [router] add --log-level to sgl-router (sgl-project#6512)

* [b200] support trt-llm allreduce fuse rms_norm_add kernel (sgl-project#7621)

* [CPU] Bind threads and numa node for each TP rank (sgl-project#6549)

Co-authored-by: srinarayan-srikanthan <[email protected]>

* Support non-contiguous query input for extend/decode attention (sgl-project#7462)

* Support updating weights at once by stopping all requests (sgl-project#6698)

Signed-off-by: Tianyu Zhou <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>

* Fix num_tokens_pre_allocated in disaggregation log (sgl-project#7714)

* [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll (sgl-project#7734)

* [CPU] fix all_reduce and all_gather (sgl-project#6770)

Co-authored-by: blzheng <[email protected]>

* fix awq and dsv3 fused gemm compatible (sgl-project#7735)

* [CI][Router] Fix bench_one_batch_server for pd router test (sgl-project#7731)

Signed-off-by: Shangming Cai <[email protected]>

* Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (sgl-project#7278)

Co-authored-by: HydraQYH <[email protected]>
Co-authored-by: TianQiLin666666 <[email protected]>

* fix dsv3 fused proj check  (sgl-project#7738)

* Ascend attention backend(PA&MLA) (sgl-project#7722)

Co-authored-by: Maksim <[email protected]>
Co-authored-by: VDV1985 <[email protected]>

* [fix] fix dsv3_router_gemm filter (sgl-project#7750)

* [CPU] refine CPU integration code (sgl-project#7647)

* [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (sgl-project#6771)

* support qwen3 dense model dp attention (sgl-project#7681)

* [optimize] add two stream norm for qwen3 (sgl-project#7740)

Co-authored-by: ispobock <[email protected]>

* feat: use D2D instead of H2H in pp (sgl-project#7673)

Co-authored-by: alpha-baby <[email protected]>

* [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (sgl-project#7723)

* [fix] put cpu in the first priority in get_device() (sgl-project#7752)

* [optimize] fuse renormalize into moe_topk_softmax (sgl-project#7744)

Co-authored-by: ispobock <[email protected]>

* chore: bump sgl-kernel 0.2.2 (sgl-project#7755)

* fix CI: update native api ipynb (sgl-project#7754)

Signed-off-by: Xinyuan Tong <[email protected]>

* fuse renormal into moe topk softmax kernel python code (sgl-project#7751)

Co-authored-by: ispobock <[email protected]>
Co-authored-by: zhyncs <[email protected]>

* Remove type conversion and fix id map in topk (sgl-project#7759)

* Add V2-lite model test (sgl-project#7390)

Co-authored-by: DiweiSun <[email protected]>

* refactor llama4 dp attention logic (sgl-project#7729)

* fix(docs): fix the broken link in `docs/references/production_metrics.md` (sgl-project#7741)

Signed-off-by: rudeigerc <[email protected]>

* [fix] update bench_speculative.py for compatibility (sgl-project#7764)

Signed-off-by: Kay Yan <[email protected]>

* Move mem_fraction_static adjustment for multimodal models to `server_args.py` & Fix session control & Other cleanups (sgl-project#7748)

* [RL] Add --nccl-port to prevent port conflict (sgl-project#7418)

* [RL] add pause and continue generation for async rl training (sgl-project#7419)

* [Fix] Alloc return type error (sgl-project#7778)

Signed-off-by: Capronir <[email protected]>

* [feat] Support EAGLE3 for Qwen (sgl-project#7745)

Co-authored-by: 纬杭 <[email protected]>
Co-authored-by: zyksir <[email protected]>

* saving hidden_states.clone() (sgl-project#7705)

* [1/n]: add cutlass W4A8 moe kernel for hopper architecture (sgl-project#7772)

Signed-off-by: yangsijia.614 <[email protected]>
Co-authored-by: yicwang <[email protected]>

* add model: qwen2-audio (sgl-project#7596)

* Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (sgl-project#7782)

* Embedding parallel by attn_tp (sgl-project#7623)

* fix: fix apply_shuffle_mul_sum (sgl-project#7444)

* chore: bump sgl-kernel v0.2.3 (sgl-project#7784)

* fix: use nvidia-nccl-cu12 2.27.5 (sgl-project#7787)

* DP Attention with Auto DeepEP Dispatch (sgl-project#7222)

* chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786)

* Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757)

* [fix] fix misusing of is_cuda (sgl-project#7790)

* Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756)

Co-authored-by: Pranjal Shankhdhar <[email protected]>

* chore: bump sgl-kernel v0.2.4 (sgl-project#7800)

* ci: fix port args (sgl-project#7792)

* Fix CI test OOM issue. (sgl-project#7799)

* chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801)

* chore: bump v0.4.9 (sgl-project#7802)

* fix merge conflict issue

* fix hpu attention nonetyep issue

* fix alignment

* fix alignment2

* Ci failure fixes

* fix attention-backend choices

---------

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: ch-tiger1 <[email protected]>
Signed-off-by: huanglong <[email protected]>
Signed-off-by: Ata Fatahi <[email protected]>
Signed-off-by: keru <[email protected]>
Signed-off-by: Tianyu Zhou <[email protected]>
Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Capronir <[email protected]>
Signed-off-by: yangsijia.614 <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: KavioYu <[email protected]>
Co-authored-by: kavioyu <[email protected]>
Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: kk <[email protected]>
Co-authored-by: wunhuang <[email protected]>
Co-authored-by: DiweiSun <[email protected]>
Co-authored-by: u4lr451 <[email protected]>
Co-authored-by: austindeng <[email protected]>
Co-authored-by: tianqilin.99 <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: ch-wan <[email protected]>
Co-authored-by: Yijie Zhu <[email protected]>
Co-authored-by: 刁莹煜 <[email protected]>
Co-authored-by: Charles Chen <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: AniZpZ <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: Wu, Chunyuan <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: sdp <[email protected]>
Co-authored-by: Binyao Jiang <[email protected]>
Co-authored-by: ishandhanani <[email protected]>
Co-authored-by: linzhuo <[email protected]>
Co-authored-by: ch-tiger1 <[email protected]>
Co-authored-by: ch-tiger1 <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: Jinn <[email protected]>
Co-authored-by: Stefan He <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>
Co-authored-by: Atream <[email protected]>
Co-authored-by: Li Hui <[email protected]>
Co-authored-by: Huang Long <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Ata Fatahi <[email protected]>
Co-authored-by: strgrb <[email protected]>
Co-authored-by: Zhang Kaihong <[email protected]>
Co-authored-by: Wenbo Yang <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Keyang Ru <[email protected]>
Co-authored-by: ehuaa <[email protected]>
Co-authored-by: pansicheng <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: alcanderian <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: Sai Enduri <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: xutizhou <[email protected]>
Co-authored-by: TianQiLin666666 <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: Alex Sun <[email protected]>
Co-authored-by: valarLip <[email protected]>
Co-authored-by: Francis <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: xianzhiT <[email protected]>
Co-authored-by: yilian49 <[email protected]>
Co-authored-by: DangKai <[email protected]>
Co-authored-by: dangkai.dk <[email protected]>
Co-authored-by: Thien Tran <[email protected]>
Co-authored-by: ll819214 <[email protected]>
Co-authored-by: Li Junwen <[email protected]>
Co-authored-by: zixuanzhang226 <[email protected]>
Co-authored-by: Hongbo Xu <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: eigen <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Meng, Peng <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: Hanming Lu <[email protected]>
Co-authored-by: Zheng, Beilei <[email protected]>
Co-authored-by: Sheng Qi <[email protected]>
Co-authored-by: finetune <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: Kan Wu <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: narutolhy <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Shenggui Li <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Simon_CQK <[email protected]>
Co-authored-by: Kyungmin Lee <[email protected]>
Co-authored-by: 晟海 <[email protected]>
Co-authored-by: yych0745 <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: 弋云 <[email protected]>
Co-authored-by: walker-ai <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>
Co-authored-by: srinarayan-srikanthan <[email protected]>
Co-authored-by: Albert <[email protected]>
Co-authored-by: Ziming Huang <[email protected]>
Co-authored-by: ayrnb <[email protected]>
Co-authored-by: HydraQYH <[email protected]>
Co-authored-by: ronnie_zheng <[email protected]>
Co-authored-by: Maksim <[email protected]>
Co-authored-by: VDV1985 <[email protected]>
Co-authored-by: ispobock <[email protected]>
Co-authored-by: TianyuZhang1214 <[email protected]>
Co-authored-by: alpha-baby <[email protected]>
Co-authored-by: Yuchen Cheng <[email protected]>
Co-authored-by: Kay Yan <[email protected]>
Co-authored-by: Caproni <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: 纬杭 <[email protected]>
Co-authored-by: zyksir <[email protected]>
Co-authored-by: SijiaYang <[email protected]>
Co-authored-by: yicwang <[email protected]>
Co-authored-by: Leng Yue <[email protected]>
Co-authored-by: Qi Yuhang <[email protected]>
Co-authored-by: Gang Chen <[email protected]>
Co-authored-by: Pranjal Shankhdhar <[email protected]>
Co-authored-by: jay <[email protected]>
shuaills pushed a commit to shuaills/sglang that referenced this pull request Jul 21, 2025