[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 #9962
Conversation
Summary of Changes
Hello @sufeng-buaa, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive request tracing system to SGLang, enabling detailed monitoring of request latency and execution flow. By integrating with OpenTelemetry, it provides the capability to visualize how requests are processed across different components and threads, which is crucial for performance analysis and debugging in complex, distributed environments. The changes lay the groundwork for understanding and optimizing the system's behavior under various loads.
Highlights
- **New Tracing Package:** Introduces a new `sglang.srt.tracing` package with static instrumentation APIs for fine-grained request tracking.
- **OpenTelemetry Integration:** Leverages OpenTelemetry for exporting trace data, allowing visualization in tools like Jaeger. Includes new Docker Compose and OpenTelemetry configuration files for easy setup.
- **Request Lifecycle Tracing:** Implements tracing for the full lifecycle of normal requests, covering tokenization, scheduling, and execution, including scenarios with MultiTokenizer and TP>1.
- **Context Propagation:** Designs a three-level trace context (`SglangTraceReqContext`, `SglangTraceThreadContext`, `SglangTraceSliceContext`) and explicit context propagation (`SglangTracePropagateContext`) to handle concurrent requests and cross-thread execution flows.
- **Performance Overhead:** Evaluates tracing overhead: approximately 40 μs per slice and 90 μs for cross-thread context propagation. Disabled tracing has negligible overhead (300-400 ns).
- **Partial Implementation:** This pull request is the first part of a two-part feature, focusing on core tracing infrastructure and normal request scenarios. Future parts will cover PD disaggregation and Perfetto data conversion.
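The three-level context hierarchy described above might be sketched roughly as follows. This is an illustration only: the field names, types, and usage are assumptions, not the actual SGLang definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SglangTraceSliceContext:
    """One instrumented slice of work; prev_span_context links it to the
    slice that ran immediately before it (fields are illustrative)."""
    name: str
    prev_span_context: Optional[str] = None


@dataclass
class SglangTraceThreadContext:
    """Per-thread view of a request: the slices executed on that thread."""
    thread_label: str
    slices: List[SglangTraceSliceContext] = field(default_factory=list)


@dataclass
class SglangTraceReqContext:
    """Root context for a single request, holding one thread context per
    thread the request passes through (tokenizer, scheduler, ...)."""
    rid: str
    threads: Dict[str, SglangTraceThreadContext] = field(default_factory=dict)


# Hypothetical usage: a request traced on the tokenizer thread.
req = SglangTraceReqContext(rid="req-1")
req.threads["tokenizer"] = SglangTraceThreadContext(thread_label="tokenizer")
req.threads["tokenizer"].slices.append(SglangTraceSliceContext(name="tokenize"))
```

The nesting mirrors the description: one request context, one thread context per thread, many slice contexts per thread.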
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces a comprehensive tracing feature using OpenTelemetry, which is a great addition for observability. The implementation is well-structured, with a clear separation of concerns in the new `tracing` package. The changes span the request lifecycle, from the HTTP server entrypoint to the scheduler, correctly instrumenting key stages. The inclusion of documentation and tests is also commendable.
I have identified a critical bug due to a missing import, and another bug where request attributes were not being set. I've also included some suggestions for improving code quality and adhering to Python best practices.
```python
if server_args.enable_trace:
```
Seems like we can enable this when using the HTTP server entrypoint. What do you think about enabling this for the `sgl.Engine` API as well?
Yes, that makes sense. I'll add it, run some tests, and push an update shortly.
I've enabled this for the `sgl.Engine` API.
Neat! Thank you. I've kicked off the PR checks.
Seems like there are some conflicts. Can you resolve them?
OK, I have rebased my branch onto the latest main.
@zhyncs - can you take a look?
```python
if batch:
    for req in batch.reqs:
        trace_event("schedule", req.rid)
```
Wrap this into a function and only call it if tracing is enabled. The principle: if tracing is not enabled, the overhead should be just a single if/else, not a for loop.
```python
elif batch.forward_mode.is_extend():
    self.process_batch_result_prefill(batch, result, launch_done)
    for req in batch.reqs:
```
If tracing is not enabled, the overhead should be just a single if/else, not a for loop.
```python
batch = self.get_next_batch_to_run()
self.cur_batch = batch

if batch:
```
There is an `if batch` condition right after this; should you put this block under L855?
```python
for req in batch.reqs:
    trace_event("schedule", req.rid)
```
Wrap this into a function called `trace_event_batch`. The code in `event_loop_XXXX` should be very concise and expose only the core logic.
Motivation
This PR is in response to #8965. For details on the motivation and visual output, please refer to that issue.
Modifications
To avoid overwhelming reviewers with a large amount of code, we have split the patch into two parts.
This is the first part, which includes:
The second part will be submitted after this part has been reviewed or merged, and will include:
To help reviewers better understand the design, we would like to clarify a few key points:
To capture the execution flow of a request, adjacent spans are linked using the opentelemetry.sdk.trace.span.Span.add_link() API. As a result, the SglangTraceSliceContext keeps track of the prev_span_context to enable proper linking.
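A minimal sketch of how a trace context might be handed across a thread boundary, using a plain queue in place of SGLang's actual inter-thread channel. The field names and string span ids here are illustrative assumptions, not the real implementation.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class SglangTracePropagateContext:
    # Root span context: ties spans created on the new thread to the
    # request's trace. Prev span context: lets the first slice on the new
    # thread link back to the last slice on the previous thread.
    root_span_context: str
    prev_span_context: Optional[str]


def scheduler_thread(inbox: queue.Queue, linked: list) -> None:
    ctx: SglangTracePropagateContext = inbox.get()
    # The first span created here would carry a link to ctx.prev_span_context.
    linked.append((ctx.root_span_context, ctx.prev_span_context))


inbox: queue.Queue = queue.Queue()
linked: list = []
t = threading.Thread(target=scheduler_thread, args=(inbox, linked))
t.start()
# The tokenizer thread finishes its last slice, then hands the context over.
inbox.put(SglangTracePropagateContext("req-1-root", "tokenize-end"))
t.join()
```

Carrying both span contexts is what preserves continuity: the root keeps all spans in one trace, while the previous span context preserves the ordering link across the thread hop.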
When a request crosses thread boundaries, the trace context is explicitly propagated using a dedicated structure, `SglangTracePropagateContext`, which includes both the root span context and the previous span context to ensure continuity across threads.

How to enable request tracing?

Please refer to `docs/references/production_request_trace.md`.

How to use request tracing APIs?

Please refer to `docs/references/production_request_trace.md` and `test/srt/test_tracing.py`.
Instrumentation Overhead Evaluation
Our testing platform is based on an Intel® Xeon® Platinum 8469C processor, configured with 192 CPU cores, 1 TB of RAM, and 8 NVIDIA H20 GPUs.
The overhead of tracing a single slice is approximately 40 μs, including the combined cost of `trace_slice_start()` and `trace_slice_end()`.
In the scheduler, multiple requests are processed concurrently, so the actual overhead is roughly the single-slice overhead multiplied by the batch size.
However, due to the overlap scheduling mechanism, the tracing overhead can be largely hidden by the GPU-side forward computation time.
The overhead of a single cross-thread trace context propagation is approximately 90 μs.
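As a back-of-the-envelope check on the numbers above (the helper function is illustrative, not part of the codebase):

```python
SLICE_OVERHEAD_US = 40   # trace_slice_start() + trace_slice_end(), per request
PROPAGATION_US = 90      # one cross-thread context propagation


def tracing_overhead_us(batch_size: int, propagations: int = 0) -> int:
    """Rough per-iteration tracing cost in the scheduler, in microseconds."""
    return batch_size * SLICE_OVERHEAD_US + propagations * PROPAGATION_US


# A batch of 64 requests with one cross-thread hand-off:
cost = tracing_overhead_us(64, propagations=1)  # 64 * 40 + 90 = 2650 us
```

A few milliseconds per scheduler iteration at large batch sizes is why the overlap-scheduling point above matters: that cost can be hidden behind the GPU-side forward pass.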
TODO
Accuracy Tests
Benchmarking and Profiling
Checklist