Conversation

sufeng-buaa (Contributor)

Motivation

This PR is a response to #8965. For details on the motivation and visual output, please refer to the issue.

Modifications

To avoid overwhelming reviewers with a large amount of code, we have split the patch into two parts.

This is the first part, which includes:

  • A request tracing package, which provides a set of static instrumentation APIs.
  • Implementation of request tracing for normal requests, covering scenarios with MultiTokenizer and TP>1.

The second part, to be submitted after this part is reviewed or merged, will include:

  • Request tracing support for PD disaggregation and DP attention scenarios
  • A script for converting opentelemetry data to perfetto data

To help reviewers better understand the design, we would like to clarify a few key points:

  1. To trace multiple concurrently executing requests and observe intra-request parallelism (e.g., TP > 1), a global variable is used to maintain the trace context of the currently active request.
SglangTraceReqContext (rid="req-123")
├── SglangTraceThreadContext (thread_label="scheduler", tp_rank=0)
│   └── SglangTraceSliceContext (name="prefill")  # current slice
└── SglangTraceThreadContext (thread_label="scheduler", tp_rank=1)
    └── SglangTraceSliceContext (name="prefill")  # current slice
  2. To capture the execution flow of a request, adjacent spans are linked using the opentelemetry.sdk.trace.span.Span.add_link() API. Accordingly, SglangTraceSliceContext keeps track of prev_span_context to enable proper linking.

  3. When a request crosses thread boundaries, the trace context is explicitly propagated using a dedicated structure, SglangTracePropagateContext, which includes both the root span context and the previous span context to ensure continuity across threads.
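The three-level hierarchy and the propagation structure described above can be sketched as plain dataclasses. This is an illustrative sketch only: the field names are assumptions inferred from the description, not the actual SGLang definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SglangTraceSliceContext:
    name: str                                   # e.g. "prefill"
    prev_span_context: Optional[object] = None  # used for Span.add_link()


@dataclass
class SglangTraceThreadContext:
    thread_label: str                           # e.g. "scheduler"
    tp_rank: int
    cur_slice: Optional[SglangTraceSliceContext] = None


@dataclass
class SglangTraceReqContext:
    rid: str                                    # request id, e.g. "req-123"
    threads: Dict[int, SglangTraceThreadContext] = field(default_factory=dict)


@dataclass
class SglangTracePropagateContext:
    # Shipped across thread boundaries together with the request payload:
    # the root span context keeps all spans under one request trace, and
    # the previous span context lets the receiver link adjacent slices.
    root_span_context: object
    prev_span_context: Optional[object]


# Global table keyed by request id, as described in point 1.
active_requests: Dict[str, SglangTraceReqContext] = {}
```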

How to enable request tracing?

Please refer to docs/references/production_request_trace.md

How to use request tracing APIs?

Please refer to docs/references/production_request_trace.md and test/srt/test_tracing.py

Instrumentation Overhead Evaluation

Our testing platform is based on an Intel® Xeon® Platinum 8469C processor, configured with 192 CPU cores, 1 TB of RAM, and 8 NVIDIA H20 GPUs.

  1. enable tracing
  • The overhead of tracing a single slice is approximately 40 μs, covering the combined cost of trace_slice_start() and trace_slice_end().
    In the scheduler, multiple requests are processed concurrently, so the actual overhead is roughly the single-slice overhead multiplied by the batch size.

    However, due to the overlap scheduling mechanism, the tracing overhead can be largely hidden by the GPU-side forward computation time.

  • The overhead of a single cross-thread trace context propagation is approximately 90 μs.

  2. disable tracing
  • The overhead of an empty instrumentation point is approximately 300–400 ns, which is negligible under the overlap scheduling mechanism.
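A sketch of why a disabled instrumentation point stays in the hundreds of nanoseconds: with tracing off, each API call reduces to a single flag check and an early return. The class and flag names here are assumptions for illustration, not the actual implementation.

```python
class _Tracer:
    enabled = False  # assumed flag name; checked first on every call

    def __init__(self):
        self.spans = []

    def trace_slice_start(self, name, rid):
        if not self.enabled:
            return  # empty instrumentation point: one call + one branch
        # Stand-in for the real span bookkeeping done when tracing is on.
        self.spans.append((name, rid))


tracer = _Tracer()
tracer.trace_slice_start("prefill", "req-123")  # effectively a no-op
```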

TODO

  1. Dynamically enable or disable tracing via an HTTP POST interface.
  2. Request tracing for EP and PP scenarios is currently under implementation.

@gemini-code-assist (bot) left a comment:

Summary of Changes

Hello @sufeng-buaa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive request tracing system to SGLang, enabling detailed monitoring of request latency and execution flow. By integrating with OpenTelemetry, it provides the capability to visualize how requests are processed across different components and threads, which is crucial for performance analysis and debugging in complex, distributed environments. The changes lay the groundwork for understanding and optimizing the system's behavior under various loads.

Highlights

  • New Tracing Package: Introduces a new sglang.srt.tracing package with static instrumentation APIs for fine-grained request tracking.
  • OpenTelemetry Integration: Leverages OpenTelemetry for exporting trace data, allowing visualization in tools like Jaeger. Includes new Docker Compose and OpenTelemetry configuration files for easy setup.
  • Request Lifecycle Tracing: Implements tracing for the full lifecycle of normal requests, covering tokenization, scheduling, and execution, including scenarios with MultiTokenizer and TP>1.
  • Context Propagation: Designed a three-level trace context (SglangTraceReqContext, SglangTraceThreadContext, SglangTraceSliceContext) and explicit context propagation (SglangTracePropagateContext) to handle concurrent requests and cross-thread execution flows.
  • Performance Overhead: Evaluated tracing overhead: approximately 40μs per slice and 90μs for cross-thread context propagation. Disabled tracing has negligible overhead (300-400ns).
  • Partial Implementation: This pull request is the first part of a two-part feature, focusing on core tracing infrastructure and normal request scenarios. Future parts will cover PD disaggregation and Perfetto data conversion.

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces a comprehensive tracing feature using OpenTelemetry, which is a great addition for observability. The implementation is well-structured, with clear separation of concerns in the new tracing package. The changes span across the request lifecycle, from the HTTP server entrypoint to the scheduler, correctly instrumenting key stages. The inclusion of documentation and tests is also commendable.

I have identified a critical bug due to a missing import, and another bug where request attributes were not being set. I've also included some suggestions for improving code quality and adhering to Python best practices.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing branch 2 times, most recently from 691ff9b to df556fc Compare September 3, 2025 11:55
@sufeng-buaa changed the title to "[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1" on Sep 3, 2025
)
)

if server_args.enable_trace:
Collaborator:

Seems like we can enable this when using the http server entrypoint. What do you think about enabling this for the sgl.Engine API as well?

Contributor (author):

Yes, that makes sense. I'll add it, run some tests, and push an update shortly.

Contributor (author):

I've enabled this for the sgl.Engine API.

Collaborator:

Neat! Thank you.

I've kicked off the PR checks

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing branch 2 times, most recently from 1854fa4 to c359921 Compare September 5, 2025 03:15
@ishandhanani (Collaborator):

Seems like there are some conflicts. Can you resolve?

@sufeng-buaa sufeng-buaa closed this Sep 9, 2025
@sufeng-buaa sufeng-buaa reopened this Sep 9, 2025
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing branch from c359921 to a434ac1 Compare September 9, 2025 07:30
@sufeng-buaa (author):

Seems like there are some conflicts. Can you resolve?

OK, I have rebased my branch onto the latest main.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing branch 2 times, most recently from d93a355 to 3dea78a Compare September 11, 2025 06:22
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Huaixin Chang <[email protected]>
Signed-off-by: Peng Wang <[email protected]>
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing branch from 3dea78a to 624b110 Compare September 12, 2025 06:28
@ishandhanani (Collaborator):

@zhyncs - can you take a look?

@hnyls2002 hnyls2002 merged commit 4c21b09 into sgl-project:main Sep 14, 2025
70 of 77 checks passed

if batch:
    for req in batch.reqs:
        trace_event("schedule", req.rid)
Contributor:

Wrap this into a function and only call it if tracing is enabled.
The principle: if tracing is not enabled, the overhead should be just a single if/else, not a for loop.


elif batch.forward_mode.is_extend():
    self.process_batch_result_prefill(batch, result, launch_done)
    for req in batch.reqs:
Contributor:

If tracing is not enabled, the overhead should just be a single if/else, not a for loop

batch = self.get_next_batch_to_run()
self.cur_batch = batch

if batch:
Contributor:

There is an "if batch" condition right after this; should you put this block under L855?

Comment on lines +852 to +853
for req in batch.reqs:
    trace_event("schedule", req.rid)
@merrymercy (Contributor), Oct 1, 2025:

Wrap this into a function called trace_event_batch.

The code in event_loop_XXXX should be very concise, exposing only core logic.
