[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 #9962
Conversation
Summary of Changes
Hello @sufeng-buaa, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive request tracing system to SGLang, enabling detailed monitoring of request latency and execution flow. By integrating with OpenTelemetry, it provides the capability to visualize how requests are processed across different components and threads, which is crucial for performance analysis and debugging in complex, distributed environments. The changes lay the groundwork for understanding and optimizing the system's behavior under various loads.
Highlights
- **New Tracing Package:** Introduces a new `sglang.srt.tracing` package with static instrumentation APIs for fine-grained request tracking.
- **OpenTelemetry Integration:** Leverages OpenTelemetry for exporting trace data, allowing visualization in tools like Jaeger. Includes new Docker Compose and OpenTelemetry configuration files for easy setup.
- **Request Lifecycle Tracing:** Implements tracing for the full lifecycle of normal requests, covering tokenization, scheduling, and execution, including scenarios with MultiTokenizer and TP>1.
- **Context Propagation:** Designs a three-level trace context (`SglangTraceReqContext`, `SglangTraceThreadContext`, `SglangTraceSliceContext`) and explicit context propagation (`SglangTracePropagateContext`) to handle concurrent requests and cross-thread execution flows.
- **Performance Overhead:** Evaluates tracing overhead: approximately 40 μs per slice and 90 μs for cross-thread context propagation. Disabled tracing has negligible overhead (300-400 ns).
- **Partial Implementation:** This pull request is the first part of a two-part feature, focusing on core tracing infrastructure and normal request scenarios. Future parts will cover PD disaggregation and Perfetto data conversion.
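The three-level context hierarchy described above might be sketched roughly as follows. This is an illustration only: the field names, types, and usage are assumptions, not the actual SGLang definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SglangTraceSliceContext:
    """One instrumented slice of work; prev_span_context links it to the
    slice that ran immediately before it (fields are illustrative)."""
    name: str
    prev_span_context: Optional[str] = None


@dataclass
class SglangTraceThreadContext:
    """Per-thread view of a request: the slices executed on that thread."""
    thread_label: str
    slices: List[SglangTraceSliceContext] = field(default_factory=list)


@dataclass
class SglangTraceReqContext:
    """Root context for a single request, holding one thread context per
    thread the request passes through (tokenizer, scheduler, ...)."""
    rid: str
    threads: Dict[str, SglangTraceThreadContext] = field(default_factory=dict)


# Hypothetical usage: a request traced on the tokenizer thread.
req = SglangTraceReqContext(rid="req-1")
req.threads["tokenizer"] = SglangTraceThreadContext(thread_label="tokenizer")
req.threads["tokenizer"].slices.append(SglangTraceSliceContext(name="tokenize"))
```

The nesting mirrors the description: one request context, one thread context per thread, many slice contexts per thread.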
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces a comprehensive tracing feature using OpenTelemetry, which is a great addition for observability. The implementation is well-structured, with a clear separation of concerns in the new `tracing` package. The changes span the request lifecycle, from the HTTP server entrypoint to the scheduler, correctly instrumenting key stages. The inclusion of documentation and tests is also commendable.
I have identified a critical bug due to a missing import, and another bug where request attributes were not being set. I've also included some suggestions for improving code quality and adhering to Python best practices.
```python
if server_args.enable_trace:
```
Seems like we can enable this when using the HTTP server entrypoint. What do you think about enabling this for the `sgl.Engine` API as well?
Yes, that makes sense. I'll add it, run some tests, and push an update shortly.
I've enabled this for the `sgl.Engine` API.
Neat! Thank you. I've kicked off the PR checks.
Seems like there are some conflicts. Can you resolve them?
OK, I have rebased my branch onto the latest main.
@zhyncs - can you take a look?
```python
if batch:
    for req in batch.reqs:
        trace_event("schedule", req.rid)
```
Wrap this into a function and only call it if tracing is enabled. The principle: if tracing is not enabled, the overhead should be just a single if/else, not a for loop.
```python
elif batch.forward_mode.is_extend():
    self.process_batch_result_prefill(batch, result, launch_done)
    for req in batch.reqs:
```
If tracing is not enabled, the overhead should be just a single if/else, not a for loop.
```python
batch = self.get_next_batch_to_run()
self.cur_batch = batch

if batch:
```
There is an `if batch` condition right after this; should you put this block under L855?
```python
for req in batch.reqs:
    trace_event("schedule", req.rid)
```
Wrap this into a function called `trace_event_batch`. The code in `event_loop_XXXX` should be very concise and expose only the core logic.
Motivation
This PR is in response to #8965. For details on the motivation and visual output, please refer to that issue.
Modifications
To avoid overwhelming reviewers with a large amount of code, we have split the patch into two parts.
This is the first part, which includes:
The second part will be submitted after this part has been reviewed or merged, and will include:
To help reviewers better understand the design, we would like to clarify a few key points:
To capture the execution flow of a request, adjacent spans are linked using the opentelemetry.sdk.trace.span.Span.add_link() API. As a result, the SglangTraceSliceContext keeps track of the prev_span_context to enable proper linking.
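A minimal sketch of how a trace context might be handed across a thread boundary, using a plain queue in place of SGLang's actual inter-thread channel. The field names and string span ids here are illustrative assumptions, not the real implementation.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class SglangTracePropagateContext:
    # Root span context: ties spans created on the new thread to the
    # request's trace. Prev span context: lets the first slice on the new
    # thread link back to the last slice on the previous thread.
    root_span_context: str
    prev_span_context: Optional[str]


def scheduler_thread(inbox: queue.Queue, linked: list) -> None:
    ctx: SglangTracePropagateContext = inbox.get()
    # The first span created here would carry a link to ctx.prev_span_context.
    linked.append((ctx.root_span_context, ctx.prev_span_context))


inbox: queue.Queue = queue.Queue()
linked: list = []
t = threading.Thread(target=scheduler_thread, args=(inbox, linked))
t.start()
# The tokenizer thread finishes its last slice, then hands the context over.
inbox.put(SglangTracePropagateContext("req-1-root", "tokenize-end"))
t.join()
```

Carrying both span contexts is what preserves continuity: the root keeps all spans in one trace, while the previous span context preserves the ordering link across the thread hop.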
When a request crosses thread boundaries, the trace context is explicitly propagated using a dedicated structure, `SglangTracePropagateContext`, which includes both the root span context and the previous span context to ensure continuity across threads.

How to enable request tracing?

Please refer to `docs/references/production_request_trace.md`.

How to use request tracing APIs?

Please refer to `docs/references/production_request_trace.md` and `test/srt/test_tracing.py`.
Instrumentation Overhead Evaluation
Our testing platform is based on an Intel® Xeon® Platinum 8469C processor, configured with 192 CPU cores, 1 TB of RAM, and 8 NVIDIA H20 GPUs.
The overhead of tracing a single slice is approximately 40 μs, including the combined cost of `trace_slice_start()` and `trace_slice_end()`.
In the scheduler, multiple requests are processed concurrently, so the actual overhead is roughly the single-slice overhead multiplied by the batch size.
However, due to the overlap scheduling mechanism, the tracing overhead can be largely hidden by the GPU-side forward computation time.
The overhead of a single cross-thread trace context propagation is approximately 90 μs.
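As a back-of-the-envelope check on the numbers above (the helper function is illustrative, not part of the codebase):

```python
SLICE_OVERHEAD_US = 40   # trace_slice_start() + trace_slice_end(), per request
PROPAGATION_US = 90      # one cross-thread context propagation


def tracing_overhead_us(batch_size: int, propagations: int = 0) -> int:
    """Rough per-iteration tracing cost in the scheduler, in microseconds."""
    return batch_size * SLICE_OVERHEAD_US + propagations * PROPAGATION_US


# A batch of 64 requests with one cross-thread hand-off:
cost = tracing_overhead_us(64, propagations=1)  # 64 * 40 + 90 = 2650 us
```

A few milliseconds per scheduler iteration at large batch sizes is why the overlap-scheduling point above matters: that cost can be hidden behind the GPU-side forward pass.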
TODO
Accuracy Tests
Benchmarking and Profiling
Checklist