
Commit d433e43

Fixes usage calculation in streaming mode
Corrects the usage calculation for streaming responses by passing `request.n` (the number of requested choices) to the base helper instead of the full request object. This keeps token counting accurate when `n` > 1 is requested, preventing discrepancies in billing or rate limiting.

Signed-off-by: Xinyuan Tong <[email protected]>
1 parent c5a60e0 · commit d433e43
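For context, the helper's last parameter is the number of parallel choices requested, not the request object itself. The sketch below illustrates the kind of aggregation such a helper performs; it is inferred from this diff, not taken from sglang, and the signature, dict shapes, and per-choice prompt accounting are all assumptions:

    # Hypothetical sketch of the base helper, inferred from this diff.
    # Names, shapes, and the prompt accounting are assumptions for
    # illustration; see sglang's real _calculate_streaming_usage_base.
    from typing import Dict

    def calculate_streaming_usage_sketch(
        prompt_tokens: Dict[int, int],      # prompt tokens keyed by choice index
        completion_tokens: Dict[int, int],  # completion tokens keyed by choice index
        cached_tokens: Dict[int, int],      # cached prompt tokens keyed by choice index
        n_choices: int,                     # the fix: pass request.n here, not request
    ) -> dict:
        # All n choices share a single prompt, so count it once per
        # request rather than once per choice.
        total_prompt = sum(prompt_tokens.values()) // max(n_choices, 1)
        total_completion = sum(completion_tokens.values())
        return {
            "prompt_tokens": total_prompt,
            "completion_tokens": total_completion,
            "total_tokens": total_prompt + total_completion,
        }

Passing the whole request where this integer is expected would break the per-choice arithmetic, which is the discrepancy the commit message refers to.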

2 files changed, 2 insertions(+), 2 deletions(-)

python/sglang/srt/entrypoints/openai/serving_chat.py (1 addition, 1 deletion)

@@ -578,7 +578,7 @@ async def generate_stream_resp():
         # Final chunk with usage
         if request.stream_options and request.stream_options.include_usage:
             usage = self._calculate_streaming_usage_base(
-                prompt_tokens, completion_tokens, cached_tokens, request
+                prompt_tokens, completion_tokens, cached_tokens, request.n
             )
         else:
             usage = None

python/sglang/srt/entrypoints/openai/serving_completions.py (1 addition, 1 deletion)

@@ -289,7 +289,7 @@ async def generate_stream_resp():
         # Handle final usage chunk
         if request.stream_options and request.stream_options.include_usage:
             usage = self._calculate_streaming_usage_base(
-                prompt_tokens, completion_tokens, cached_tokens, request
+                prompt_tokens, completion_tokens, cached_tokens, request.n
             )
             final_usage_chunk = CompletionStreamResponse(
                 id=content["meta_info"]["id"],
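The fixed path only runs when a client opts into streaming usage reporting. A hedged client-side example that exercises it against an OpenAI-compatible sglang endpoint, where the base URL, API key, and model name are placeholders:

    # Streaming request with include_usage and n > 1: the case this commit fixes.
    # Base URL, api_key, and model name below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "Say hello."}],
        n=2,  # multiple parallel choices
        stream=True,
        stream_options={"include_usage": True},
    )

    for chunk in stream:
        if chunk.usage is not None:
            # The final chunk carries aggregate usage across all n choices.
            print(chunk.usage)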
