
Conversation


@yuan-luo yuan-luo commented Sep 17, 2025

Motivation

This PR adds support for PyTorch Symm Mem AllReduce, which handles large message sizes.

  • Delivers performance gains for scenarios with msg_size > 512 KB.
  • Currently, MSCCLPP only supports a maximum message size of 1 MB; messages larger than 1 MB error out.
  • Symm Mem AllReduce supports messages ≥ 2 MB.
  • This PR lays the groundwork for long-context support and enables large-message AllReduce (AR).

Torch Symm Mem AllReduce is based on NVLS:
  • On Blackwell, NVLS is always deterministic.
  • On Hopper, NVLS is deterministic with CUDA 12.8+ or an updated CUDA 12.4.
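For context, here is a minimal sketch of how the torch symmetric-memory all-reduce path can be invoked. It assumes PyTorch ≥ 2.5 with the private torch.distributed._symmetric_memory API and an NVLS-capable multi-GPU node; it is illustrative only, not this PR's SymmMemCommunicator.

```python
# Minimal sketch (assumption: PyTorch >= 2.5, NVLS-capable GPUs, launched via
# torchrun --nproc-per-node=<N>). Not the PR's actual SymmMemCommunicator.
import os
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    group_name = dist.group.WORLD.group_name

    # Symmetric buffer: same size/dtype on every rank, registered for NVLS.
    buf = symm_mem.empty(1 << 20, dtype=torch.bfloat16, device="cuda")  # 2 MiB of bf16
    symm_mem.rendezvous(buf, group_name)

    buf.fill_(local_rank)  # stand-in for the tensor to reduce (normally copied in)
    # NVLS multimem all-reduce performed in place on the symmetric buffer.
    torch.ops.symm_mem.multimem_all_reduce_(buf, "sum", group_name)
    torch.cuda.synchronize()
    if local_rank == 0:
        print(buf[:4])

if __name__ == "__main__":
    main()
```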

SYMM_MEM:
| msg_size   |   torch eager time |   symm mem eager time |   symm mem graph time |   pynccl graph time |
|------------|--------------------|-----------------------|-----------------------|---------------------|
| 2.0 KiB    |            48.6752 |               42.6112 |               8.95296 |             16.4637 |
| 4.0 KiB    |            26.304  |               27.6096 |               8.99104 |             18.583  |
| 8.0 KiB    |            21.9744 |               27.3632 |               9.21984 |             18.6733 |
| 16.0 KiB   |            20.3968 |               27.1584 |               9.3568  |             19.1792 |
| 32.0 KiB   |            20.2848 |               34.1888 |               9.40416 |             20.8093 |
| 64.0 KiB   |            20.5088 |               26.5888 |               9.73728 |             20.968  |
| 128.0 KiB  |           110.179  |               26.8448 |              10.0224  |             21.5664 |
| 256.0 KiB  |            20.9184 |               29.7504 |              10.6275  |             21.7779 |
| 512.0 KiB  |            23.5904 |               27.5392 |              12.2198  |             21.9856 |
| 1.0 MiB    |            28.1664 |               29.1744 |              14.2813  |             27.6038 |
|------------|--------------------|-----------------------|-----------------------|---------------------|
| 2.0 MiB    |            39.2576 |               29.1744 |              19.0234  |             39.5312 |  <<<<<< MSCCLPP does not support messages from this size onward
| 4.0 MiB    |            52.5408 |               33.4528 |              28.1082  |             52.8202 |
| 8.0 MiB    |            78.064  |               74.0832 |              46.6918  |             86.2893 |
| 16.0 MiB   |           118.87   |               93.2256 |              88.1763  |            119.501  |

MSCCLPP:
| msg_size   |   torch eager time |   msccl eager time |   msccl graph time |   pynccl graph time |
|------------|--------------------|--------------------|--------------------|---------------------|
| 2.0 KiB    |            67.6608 |            19.7952 |            7.17696 |             18.5763 |
| 4.0 KiB    |            22.7936 |            19.248  |            6.94688 |             18.6624 |
| 8.0 KiB    |            20.5888 |            19.4688 |            7.32512 |             19.0259 |
| 16.0 KiB   |            21.4752 |            20.9152 |            7.58144 |             19.3904 |
| 32.0 KiB   |            21.4208 |            21.1424 |            7.55168 |             20.9411 |
| 64.0 KiB   |            21.84   |            19.8304 |            8.00928 |             21.1987 |
| 128.0 KiB  |            20.9728 |            20.72   |            8.91872 |             21.479  |
| 256.0 KiB  |            24.2208 |            20.2752 |            9.8064  |             21.7882 |
| 512.0 KiB  |            24.4448 |            21.9616 |           11.9722  |             22.0272 |
| 1.0 MiB    |            26.8448 |            22.1696 |           18.2694  |             27.7914 |
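For reference, the eager-mode columns in the tables above correspond to a straightforward latency sweep over message sizes. The following is a hedged sketch (illustrative only, not the benchmark script added in this PR) that times torch.distributed.all_reduce per message size with CUDA events:

```python
# Hedged sketch of an eager all-reduce latency sweep; sizes, dtype, and iteration
# counts are illustrative. Assumes a process group is already initialized.
import torch
import torch.distributed as dist

def bench_all_reduce(sizes_kib, dtype=torch.bfloat16, iters=100, warmup=10):
    results = {}
    for kib in sizes_kib:
        numel = kib * 1024 // dtype.itemsize
        x = torch.randn(numel, dtype=dtype, device="cuda")
        for _ in range(warmup):          # warm up NCCL / allocator
            dist.all_reduce(x)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            dist.all_reduce(x)
        end.record()
        torch.cuda.synchronize()
        # elapsed_time() returns milliseconds; report microseconds per call.
        results[kib] = start.elapsed_time(end) * 1000.0 / iters
    return results
```

Graph-mode numbers would come from capturing the same call into a CUDA graph and timing its replay instead of the eager launch.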

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates PyTorch Symmetric Memory AllReduce into the system, providing a more robust and performant solution for distributed communication, especially for large data transfers. By extending support beyond the limitations of existing AllReduce mechanisms, it enhances the system's capability to handle demanding workloads, such as those found in long-context language models, ensuring better scalability and efficiency in distributed environments.

Highlights

  • PyTorch Symmetric Memory AllReduce Support: This pull request introduces support for PyTorch Symmetric Memory AllReduce, a new communication primitive designed to handle larger message sizes more efficiently than existing methods like MSCCLPP.
  • Performance Gains for Large Messages: The new Symmetric Memory AllReduce delivers significant performance improvements for message sizes greater than 512KB, particularly for messages 2MB and larger, where MSCCLPP previously hit its limits.
  • Groundwork for Long-Context Models: This feature lays crucial groundwork for supporting long-context models and enabling large-message AllReduce operations, which are essential for scaling such models.
  • Comprehensive Benchmarking and Testing: New benchmark scripts and unit tests have been added to validate the performance and correctness of the Symmetric Memory AllReduce implementation across various message sizes and data types.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for PyTorch Symmetric Memory AllReduce, aiming to improve performance for large message sizes. The changes include a new SymmMemCommunicator, integration into the distributed parallel state management, and new benchmarks and tests.

My review has identified a critical bug in parallel_state.py due to a variable name typo. Additionally, I've found several areas for improvement across the new files, including removing dead code and unused imports, clarifying magic numbers with comments, and making the implementation more flexible by avoiding hardcoded data types. The new tests could also be more robust to ensure the new feature is exercised correctly across different configurations. Overall, the changes are well-structured, but addressing these points will enhance the code's correctness and maintainability.

@trevor-m

Hi @yuan-luo #8238 enabled NCCL symmetric memory all-reduce with --enable-symm-mem. It avoids a copy into the symm memory buffer by using a custom memory allocation pool.
That approach used pynccl because native torch support for symmetric memory wasn't added yet.


yuan-luo commented Sep 18, 2025

> Hi @yuan-luo #8238 enabled NCCL symmetric memory all-reduce with --enable-symm-mem. It avoids a copy into the symm memory buffer by using a custom memory allocation pool. That approach used pynccl because native torch support for symmetric memory wasn't added yet.

@trevor-m I'm aware of that PR. These are two different mechanisms: one uses NCCL and the other uses the Torch mechanism. I'll add a new feature flag to distinguish them; see the sketch below.
Per the test results, Torch Symm Mem provides a more robust and higher-performance AllReduce at large message sizes.
Btw, vLLM has made Torch Symmetric Memory All-Reduce its default All-Reduce, replacing the previous custom All-Reduce.
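For illustration, a hedged sketch of what such flag-gated dispatch could look like (the helper name should_use_symm_mem is hypothetical; the actual wiring lives in the communicator setup touched by this PR):

```python
# Hypothetical sketch of flag-gated communicator selection; not the exact code
# in this PR. symm_mem_comm / pynccl_comm stand in for the configured communicators.
def tensor_model_parallel_all_reduce(inp, *, enable_torch_symm_mem,
                                     symm_mem_comm, pynccl_comm):
    # Prefer the torch symmetric-memory path when the flag is on and the
    # message is eligible (NVLS-capable hardware, fits the registered buffer).
    if (enable_torch_symm_mem and symm_mem_comm is not None
            and symm_mem_comm.should_use_symm_mem(inp)):  # hypothetical eligibility check
        return symm_mem_comm.all_reduce(inp)
    # Otherwise fall back to the existing pynccl / custom all-reduce path.
    return pynccl_comm.all_reduce(inp)
```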

@yuan-luo yuan-luo changed the title Support PyTorch Symm Mem AllReduce Support Torch Symm Mem AllReduce Sep 18, 2025
@yuan-luo yuan-luo changed the title Support Torch Symm Mem AllReduce [WIP] Support Torch Symm Mem AllReduce Sep 18, 2025

yuan-luo commented Sep 18, 2025

E2E test with the new server arg --enable-torch-symm-mem passed.

$python3 -m sglang.launch_server --model-path /home/admin/DeepSeek-R1 --host 0.0.0.0 --port 30000 --trust-remote-code --enable-cache-report --quantization fp8 --log-level info --max-running-requests 32 --mem-fraction-static 0.92 --chunked-prefill-size 16384 --context-length 65535 --attention-backend flashinfer --disable-radix-cache --page-size 64 --tp-size 8 --enable-metrics --cuda-graph-max-bs 32 --enable-torch-symm-mem
......

[2025-09-18 17:13:48] INFO:     Started server process [80415]
[2025-09-18 17:13:48] INFO:     Waiting for application startup.
[2025-09-18 17:13:48] INFO:     Application startup complete.
[2025-09-18 17:13:48] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-09-18 17:13:49] INFO:     127.0.0.1:46614 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-18 17:13:49 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-09-18 17:13:49 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-09-18 17:13:49 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=4096, K=512, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32768/32768 [00:06<00:00, 5397.02it/s]
[2025-09-18 17:13:58] INFO:     127.0.0.1:46618 - "POST /generate HTTP/1.1" 200 OK
[2025-09-18 17:13:58] The server is fired up and ready to roll!
[2025-09-18 17:14:19 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-09-18 17:14:19 TP0] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.26, #queue-req: 0, 
[2025-09-18 17:14:20 TP0] Decode batch. #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.71, #queue-req: 0, 
[2025-09-18 17:14:20 TP0] Decode batch. #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.76, #queue-req: 0, 
[2025-09-18 17:14:21 TP0] Decode batch. #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.95, #queue-req: 0, 
[2025-09-18 17:14:21 TP0] Decode batch. #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.86, #queue-req: 0, 
[2025-09-18 17:14:21] INFO:     127.0.0.1:39634 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client:

$cat test_openai.py
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals. Tell me how you rank them"},
    ],
    temperature=0,
    max_tokens=200,
)
print(response)

$python test_openai.py 
ChatCompletion(id='b060b6252b784549b14f4268ba7a3c47', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="Okay, the user wants me to list three countries and their capitals and then rank them. Let me start by picking three countries. Maybe I should choose well-known ones to keep it simple. Let's see, France with Paris, Japan with Tokyo, and Brazil with Brasília. That covers different continents.\n\nNow, how to rank them. The user didn't specify the criteria, so I need to decide on a basis. Population? Economic size? Area? Or maybe cultural significance. Since capitals are involved, maybe the population of the capital cities? Or perhaps the historical importance of the capitals. Alternatively, alphabetical order, but that's not really a ranking. \n\nWait, the user might want a subjective ranking based on some interesting factors. Let me think. If I rank them by the population of the capitals, Tokyo is the largest, then Paris, then Brasília. But maybe the user wants a more engaging reason. For example, cultural influence, tourism, or something like that. \n\n", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1758186861, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=25, total_tokens=225, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

@yuan-luo yuan-luo changed the title [WIP] Support Torch Symm Mem AllReduce [Feat] Support Torch Symm Mem AllReduce Sep 18, 2025