Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
root@gaolaobao-Legion-R9000P-ADR10:/home/gaolaobao/project/20250823/sglang# python launch_server.py
[2025-08-25 22:39:09] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30725, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.636, max_running_requests=128, max_queued_requests=9223372036854775807, max_total_tokens=20480, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=612489127, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='qwen/qwen2.5-0.5b-instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, 
cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=4, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
terminate called after throwing an instance of 'c10::Error'
what(): fast_1 >= fast_0 INTERNAL ASSERT FAILED at "/pytorch/c10/util/ApproximateClock.cpp":18, please report a bug to PyTorch. getCount is non-monotonic.
Exception raised from measurePair at /pytorch/c10/util/ApproximateClock.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x70d08d97eeb0 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x65 (0x70d08d91ba89 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x42 (0x70d08d97b402 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #3: + 0x8966e (0x70d08d97666e in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #4: c10::ApproximateClockToUnixTimeConverter::measurePairs() + 0x27 (0x70d08d976697 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #5: c10::ApproximateClockToUnixTimeConverter::ApproximateClockToUnixTimeConverter() + 0xc (0x70d08d9766bc in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #6: + 0x2b591 (0x70d08dd8c591 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10_cuda.so)
frame #7: + 0x11c50 (0x70d08dd72c50 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10_cuda.so)
frame #8: + 0x54af (0x70d1514bd4af in /lib64/ld-linux-x86-64.so.2)
frame #9: + 0x55c4 (0x70d1514bd5c4 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x132 (0x70d1514ba552 in /lib64/ld-linux-x86-64.so.2)
frame #11: + 0xcb89 (0x70d1514c4b89 in /lib64/ld-linux-x86-64.so.2)
frame #12: _dl_catch_exception + 0x9c (0x70d1514ba4bc in /lib64/ld-linux-x86-64.so.2)
frame #13: + 0xcfb4 (0x70d1514c4fb4 in /lib64/ld-linux-x86-64.so.2)
frame #14: + 0x9e684 (0x70d15109e684 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: _dl_catch_exception + 0x9c (0x70d1514ba4bc in /lib64/ld-linux-x86-64.so.2)
frame #16: + 0x2609 (0x70d1514ba609 in /lib64/ld-linux-x86-64.so.2)
frame #17: + 0x9e173 (0x70d15109e173 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: dlopen + 0x6f (0x70d15109e73f in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: /usr/bin/python3() [0x698762]
frame #20: /usr/bin/python3() [0x5920d5]
frame #21: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #22: /usr/bin/python3() [0x58cd34]
frame #23: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #24: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #26: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x6778d5]
frame #28: /usr/bin/python3() [0x5838db]
frame #29: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x58cd34]
frame #31: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #32: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #34: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #35: /usr/bin/python3() [0x6778d5]
frame #36: /usr/bin/python3() [0x5838db]
frame #37: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #38: /usr/bin/python3() [0x58cd34]
frame #39: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #40: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #41: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #42: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #43: /usr/bin/python3() [0x6778d5]
frame #44: /usr/bin/python3() [0x5838db]
frame #45: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #46: /usr/bin/python3() [0x58cd34]
frame #47: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #48: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #49: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #50: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #51: /usr/bin/python3() [0x6778d5]
frame #52: /usr/bin/python3() [0x5838db]
frame #53: PyObject_Vectorcall + 0x33 (0x550fc3 in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x3293 (0x567a93 in /usr/bin/python3)
frame #55: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #56: /usr/bin/python3() [0x67d782]
frame #57: /usr/bin/python3() [0x67968e]
frame #58: /usr/bin/python3() [0x66b766]
frame #59: /usr/bin/python3() [0x66b671]
frame #60: Py_RunMain + 0x2be (0x690e5e in /usr/bin/python3)
frame #61: Py_BytesMain + 0x2d (0x64bdcd in /usr/bin/python3)
frame #62: + 0x2a578 (0x70d15102a578 in /lib/x86_64-linux-gnu/libc.so.6)
[2025-08-25 22:39:13] Using default HuggingFace chat template with detected content format: string
[2025-08-25 22:39:21] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-08-25 22:39:21] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-08-25 22:39:21] Init torch distributed ends. mem usage=0.00 GB
[2025-08-25 22:39:22] MOE_RUNNER_BACKEND is not initialized, using triton backend
[2025-08-25 22:39:22] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-25 22:39:22] Load weight begin. avail mem=6.81 GB
[2025-08-25 22:39:24] Using model weights format ['*.safetensors']
[2025-08-25 22:39:25] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.09it/s]
[2025-08-25 22:39:26] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=5.82 GB, mem usage=0.99 GB.
[2025-08-25 22:39:26] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-08-25 22:39:26] Memory pool end. avail mem=5.42 GB
[2025-08-25 22:39:26] Capture cuda graph begin. This can take up to several minutes. avail mem=4.78 GB
[2025-08-25 22:39:26] Capture cuda graph bs [1, 2, 4]
Capturing batches (bs=4 avail_mem=4.79 GB): 0%| | 0/3 [00:00<?, ?it/s][2025-08-25 22:39:26] IS_TBO_ENABLED is not initialized, using False
Capturing batches (bs=4 avail_mem=4.79 GB): 0%| | 0/3 [00:00<?, ?it/s]
[2025-08-25 22:39:27] Scheduler hit an exception: Traceback (most recent call last):
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 384, in init
self.capture()
~~~~~~~~~~~~^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 492, in capture
) = self.capture_one_batch_size(bs, forward)
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 663, in capture_one_batch_size
run_once()
~~~~~~~~^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 652, in run_once
logits_output_or_pp_proxy_tensors = forward(
input_ids,
...<2 lines>...
**kwargs,
)
File "/usr/local/lib/python3.13/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/models/qwen2.py", line 472, in forward
hidden_states = self.model(
input_ids,
...<3 lines>...
pp_proxy_tensors=pp_proxy_tensors,
)
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/models/qwen2.py", line 340, in forward
hidden_states, residual = layer(
~~~~~^
positions,
^^^^^^^^^^
...<2 lines>...
residual,
^^^^^^^^^
)
^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/models/qwen2.py", line 241, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/custom_op.py", line 59, in forward
return self._forward_method(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/layers/layernorm.py", line 87, in forward_cuda
out = rmsnorm(x, self.weight.data, self.variance_epsilon)
File "/usr/local/lib/python3.13/dist-packages/sgl_kernel/elementwise.py", line 45, in rmsnorm
torch.ops.sgl_kernel.rmsnorm.default(out, input, weight, eps, enable_pdl)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/_ops.py", line 829, in call
return self._op(*args, **kwargs)
~~~~~~~~^^^^^^^^^^^^^^^^^
RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/scheduler.py", line 2561, in run_scheduler_process
scheduler = Scheduler(
server_args,
...<6 lines>...
dp_balance_meta=balance_meta,
)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/scheduler.py", line 323, in init
self.tp_worker = TpWorkerClass(
~~~~~~~~~~~~~^
server_args=server_args,
^^^^^^^^^^^^^^^^^^^^^^^^
...<5 lines>...
nccl_port=port_args.nccl_port,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in init
self.worker = TpModelWorker(
~~~~~~~~~~~~~^
server_args, gpu_id, tp_rank, moe_ep_rank, pp_rank, dp_rank, nccl_port
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/tp_worker.py", line 84, in init
self.model_runner = ModelRunner(
~~~~~~~~~~~^
model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<13 lines>...
token_to_kv_pool_allocator=token_to_kv_pool_allocator,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/model_runner.py", line 245, in init
self.initialize(min_per_gpu_memory)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/model_runner.py", line 350, in initialize
self.init_device_graphs()
~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/model_runner.py", line 1622, in init_device_graphs
CudaGraphRunner(self) if not _is_npu else NPUGraphRunner(self)
~~~~~~~~~~~~~~~^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 386, in init
raise Exception(
f"Capture cuda graph failed: {e}\n{CUDA_GRAPH_CAPTURE_FAILED_MSG}"
)
Exception: Capture cuda graph failed: RMSNorm failed with error code no kernel image is available for execution on the device
Possible solutions:
- set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
- set --cuda-graph-max-bs to a smaller value (e.g., 16)
- disable torch compile by not using --enable-torch-compile
- disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2025-08-25 22:39:27] Received sigquit from a child process. It usually means the child failed.
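The fatal error is "RMSNorm failed with error code no kernel image is available for execution on the device", raised inside torch.ops.sgl_kernel.rmsnorm during CUDA graph capture. The GPU is an RTX 5060 Laptop GPU with compute capability 12.0 (sm_120, Blackwell), so this symptom usually means a wheel in the stack was built without cubins or PTX covering that architecture. A quick diagnostic sketch using only standard PyTorch APIs (this checks the torch build; it cannot inspect the sgl_kernel wheel itself):

import torch

# Compute capability of GPU 0 -- reported as (12, 0) in the environment dump below.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm_{major}{minor}")

# CUDA architectures the installed torch wheel ships kernels for,
# e.g. ['sm_80', 'sm_90', 'sm_120', 'compute_120']. If neither sm_120
# nor a covering PTX entry appears, torch kernels cannot run on this GPU.
print("torch built for:", torch.cuda.get_arch_list())

Since torch itself loads and model weights are placed on the GPU, the missing kernel image most likely lives in the sgl_kernel 0.3.5 wheel, which would need a build that includes sm_120 support.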
Reproduction
python launch_server.py
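The contents of launch_server.py were not attached. Judging from the logged ServerArgs, a hypothetical reconstruction using the stock sglang entrypoints would look roughly like the following (field names and values are copied from the dump above; the original script may differ):

# Hypothetical reconstruction -- the original launch_server.py was not shared.
from sglang.srt.entrypoints.http_server import launch_server
from sglang.srt.server_args import ServerArgs

if __name__ == "__main__":
    server_args = ServerArgs(
        model_path="qwen/qwen2.5-0.5b-instruct",
        host="0.0.0.0",
        port=30725,
        mem_fraction_static=0.636,   # values taken from the ServerArgs log above
        max_running_requests=128,
        max_total_tokens=20480,
        chunked_prefill_size=2048,
        cuda_graph_max_bs=4,
    )
    launch_server(server_args)

The same configuration should also be reachable via the CLI, e.g. python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --port 30725 --mem-fraction-static 0.636 --cuda-graph-max-bs 4.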
Environment
root@gaolaobao-Legion-R9000P-ADR10:/home/gaolaobao/project/20250823/sglang# python3 -m sglang.check_env
Python: 3.13.3 (main, Aug 14 2025, 11:53:40) [GCC 14.2.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 5060 Laptop GPU
GPU 0 Compute Capability: 12.0
CUDA_HOME: /usr/local/cuda-12.9
NVCC: Cuda compilation tools, release 12.9, V12.9.41
CUDA Driver Version: 575.64.03
PyTorch: 2.8.0+cu128
sglang: 0.5.0rc2
sgl_kernel: 0.3.5
flashinfer_python: 0.2.11.post3
triton: 3.4.0
transformers: 4.55.2
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.1
orjson: 3.11.2
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.23
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.64.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks