
[Bug] RTX 5060: RMSNorm failed (same as issue #7249) when running the qwen2.5-0.5b-instruct model #9600

@gaolaobao

Description


Checklist

  • 1. I have searched related issues but could not find the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

root@gaolaobao-Legion-R9000P-ADR10:/home/gaolaobao/project/20250823/sglang# python launch_server.py
[2025-08-25 22:39:09] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30725, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.636, max_running_requests=128, max_queued_requests=9223372036854775807, max_total_tokens=20480, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=612489127, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='qwen/qwen2.5-0.5b-instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, 
cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=4, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
terminate called after throwing an instance of 'c10::Error'
what(): fast_1 >= fast_0 INTERNAL ASSERT FAILED at "/pytorch/c10/util/ApproximateClock.cpp":18, please report a bug to PyTorch. getCount is non-monotonic.
Exception raised from measurePair at /pytorch/c10/util/ApproximateClock.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x70d08d97eeb0 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x65 (0x70d08d91ba89 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x42 (0x70d08d97b402 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x8966e (0x70d08d97666e in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #4: c10::ApproximateClockToUnixTimeConverter::measurePairs() + 0x27 (0x70d08d976697 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #5: c10::ApproximateClockToUnixTimeConverter::ApproximateClockToUnixTimeConverter() + 0xc (0x70d08d9766bc in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10.so)
frame #6: <unknown function> + 0x2b591 (0x70d08dd8c591 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0x11c50 (0x70d08dd72c50 in /usr/local/lib/python3.13/dist-packages/torch/lib/libc10_cuda.so)
frame #8: <unknown function> + 0x54af (0x70d1514bd4af in /lib64/ld-linux-x86-64.so.2)
frame #9: <unknown function> + 0x55c4 (0x70d1514bd5c4 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x132 (0x70d1514ba552 in /lib64/ld-linux-x86-64.so.2)
frame #11: <unknown function> + 0xcb89 (0x70d1514c4b89 in /lib64/ld-linux-x86-64.so.2)
frame #12: _dl_catch_exception + 0x9c (0x70d1514ba4bc in /lib64/ld-linux-x86-64.so.2)
frame #13: <unknown function> + 0xcfb4 (0x70d1514c4fb4 in /lib64/ld-linux-x86-64.so.2)
frame #14: <unknown function> + 0x9e684 (0x70d15109e684 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: _dl_catch_exception + 0x9c (0x70d1514ba4bc in /lib64/ld-linux-x86-64.so.2)
frame #16: <unknown function> + 0x2609 (0x70d1514ba609 in /lib64/ld-linux-x86-64.so.2)
frame #17: <unknown function> + 0x9e173 (0x70d15109e173 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: dlopen + 0x6f (0x70d15109e73f in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: /usr/bin/python3() [0x698762]
frame #20: /usr/bin/python3() [0x5920d5]
frame #21: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #22: /usr/bin/python3() [0x58cd34]
frame #23: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #24: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #26: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x6778d5]
frame #28: /usr/bin/python3() [0x5838db]
frame #29: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x58cd34]
frame #31: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #32: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #34: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #35: /usr/bin/python3() [0x6778d5]
frame #36: /usr/bin/python3() [0x5838db]
frame #37: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #38: /usr/bin/python3() [0x58cd34]
frame #39: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #40: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #41: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #42: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #43: /usr/bin/python3() [0x6778d5]
frame #44: /usr/bin/python3() [0x5838db]
frame #45: _PyEval_EvalFrameDefault + 0x4907 (0x569107 in /usr/bin/python3)
frame #46: /usr/bin/python3() [0x58cd34]
frame #47: PyObject_CallMethodObjArgs + 0xe5 (0x5cdd75 in /usr/bin/python3)
frame #48: PyImport_ImportModuleLevelObject + 0x234 (0x5ccc44 in /usr/bin/python3)
frame #49: _PyEval_EvalFrameDefault + 0x637d (0x56ab7d in /usr/bin/python3)
frame #50: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #51: /usr/bin/python3() [0x6778d5]
frame #52: /usr/bin/python3() [0x5838db]
frame #53: PyObject_Vectorcall + 0x33 (0x550fc3 in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x3293 (0x567a93 in /usr/bin/python3)
frame #55: PyEval_EvalCode + 0xcd (0x65c99d in /usr/bin/python3)
frame #56: /usr/bin/python3() [0x67d782]
frame #57: /usr/bin/python3() [0x67968e]
frame #58: /usr/bin/python3() [0x66b766]
frame #59: /usr/bin/python3() [0x66b671]
frame #60: Py_RunMain + 0x2be (0x690e5e in /usr/bin/python3)
frame #61: Py_BytesMain + 0x2d (0x64bdcd in /usr/bin/python3)
frame #62: <unknown function> + 0x2a578 (0x70d15102a578 in /lib/x86_64-linux-gnu/libc.so.6)

[2025-08-25 22:39:13] Using default HuggingFace chat template with detected content format: string
[2025-08-25 22:39:21] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-08-25 22:39:21] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-08-25 22:39:21] Init torch distributed ends. mem usage=0.00 GB
[2025-08-25 22:39:22] MOE_RUNNER_BACKEND is not initialized, using triton backend
[2025-08-25 22:39:22] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-25 22:39:22] Load weight begin. avail mem=6.81 GB
[2025-08-25 22:39:24] Using model weights format ['*.safetensors']
[2025-08-25 22:39:25] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.09it/s]

[2025-08-25 22:39:26] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=5.82 GB, mem usage=0.99 GB.
[2025-08-25 22:39:26] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-08-25 22:39:26] Memory pool end. avail mem=5.42 GB
[2025-08-25 22:39:26] Capture cuda graph begin. This can take up to several minutes. avail mem=4.78 GB
[2025-08-25 22:39:26] Capture cuda graph bs [1, 2, 4]
Capturing batches (bs=4 avail_mem=4.79 GB): 0%| | 0/3 [00:00<?, ?it/s][2025-08-25 22:39:26] IS_TBO_ENABLED is not initialized, using False
Capturing batches (bs=4 avail_mem=4.79 GB): 0%| | 0/3 [00:00<?, ?it/s]
[2025-08-25 22:39:27] Scheduler hit an exception: Traceback (most recent call last):
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 384, in init
self.capture()
~~~~~~~~~~~~^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 492, in capture
) = self.capture_one_batch_size(bs, forward)
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 663, in capture_one_batch_size
run_once()
~~~~~~~~^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 652, in run_once
logits_output_or_pp_proxy_tensors = forward(
input_ids,
...<2 lines>...
**kwargs,
)
File "/usr/local/lib/python3.13/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/models/qwen2.py", line 472, in forward
hidden_states = self.model(
input_ids,
...<3 lines>...
pp_proxy_tensors=pp_proxy_tensors,
)
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/models/qwen2.py", line 340, in forward
hidden_states, residual = layer(
~~~~~^
positions,
^^^^^^^^^^
...<2 lines>...
residual,
^^^^^^^^^
)
^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/models/qwen2.py", line 241, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/custom_op.py", line 59, in forward
return self._forward_method(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/layers/layernorm.py", line 87, in forward_cuda
out = rmsnorm(x, self.weight.data, self.variance_epsilon)
File "/usr/local/lib/python3.13/dist-packages/sgl_kernel/elementwise.py", line 45, in rmsnorm
torch.ops.sgl_kernel.rmsnorm.default(out, input, weight, eps, enable_pdl)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/torch/_ops.py", line 829, in call
return self._op(*args, **kwargs)
~~~~~~~~^^^^^^^^^^^^^^^^^
RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/scheduler.py", line 2561, in run_scheduler_process
scheduler = Scheduler(
server_args,
...<6 lines>...
dp_balance_meta=balance_meta,
)
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/scheduler.py", line 323, in init
self.tp_worker = TpWorkerClass(
~~~~~~~~~~~~~^
server_args=server_args,
^^^^^^^^^^^^^^^^^^^^^^^^
...<5 lines>...
nccl_port=port_args.nccl_port,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in init
self.worker = TpModelWorker(
~~~~~~~~~~~~~^
server_args, gpu_id, tp_rank, moe_ep_rank, pp_rank, dp_rank, nccl_port
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/managers/tp_worker.py", line 84, in init
self.model_runner = ModelRunner(
~~~~~~~~~~~^
model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<13 lines>...
token_to_kv_pool_allocator=token_to_kv_pool_allocator,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/model_runner.py", line 245, in init
self.initialize(min_per_gpu_memory)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/model_runner.py", line 350, in initialize
self.init_device_graphs()
~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/model_runner.py", line 1622, in init_device_graphs
CudaGraphRunner(self) if not _is_npu else NPUGraphRunner(self)
~~~~~~~~~~~~~~~^^^^^^
File "/home/gaolaobao/project/20250823/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 386, in init
raise Exception(
f"Capture cuda graph failed: {e}\n{CUDA_GRAPH_CAPTURE_FAILED_MSG}"
)
Exception: Capture cuda graph failed: RMSNorm failed with error code no kernel image is available for execution on the device
Possible solutions:

  1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
  2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
  3. disable torch compile by not using --enable-torch-compile
  4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
    Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

[2025-08-25 22:39:27] Received sigquit from a child process. It usually means the child failed.
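
The log above already contains the diagnosis: "no kernel image is available for execution on the device" is an architecture mismatch, not a memory or CUDA-graph problem. The RTX 5060 reports compute capability 12.0 (sm_120, Blackwell; see Environment below), and the sgl_kernel 0.3.5 wheel apparently ships no sm_120 binary (or JIT-compatible PTX) for its RMSNorm kernel. A minimal check of what the installed stack supports, using only standard PyTorch APIs:

import torch

# Expected on an RTX 5060 Laptop GPU: compute capability (12, 0), i.e. sm_120.
print(torch.cuda.get_device_name(0))
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")

# Architectures the installed PyTorch build ships kernels for. If sm_120 is
# listed (the cu128 wheels should include it, consistent with weight loading
# succeeding above), the gap is in sgl_kernel, not in PyTorch itself.
print(torch.cuda.get_arch_list())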

Reproduction

python launch_server.py
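
launch_server.py is a local wrapper whose contents are not shown. Judging from the server_args printed at startup, an equivalent invocation with the stock CLI would be roughly the following (flag values copied from the log; a reconstruction, not the actual wrapper):

python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30725 --mem-fraction-static 0.636 --max-total-tokens 20480 --chunked-prefill-size 2048 --cuda-graph-max-bs 4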

Environment

root@gaolaobao-Legion-R9000P-ADR10:/home/gaolaobao/project/20250823/sglang# python3 -m sglang.check_env
Python: 3.13.3 (main, Aug 14 2025, 11:53:40) [GCC 14.2.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 5060 Laptop GPU
GPU 0 Compute Capability: 12.0
CUDA_HOME: /usr/local/cuda-12.9
NVCC: Cuda compilation tools, release 12.9, V12.9.41
CUDA Driver Version: 575.64.03
PyTorch: 2.8.0+cu128
sglang: 0.5.0rc2
sgl_kernel: 0.3.5
flashinfer_python: 0.2.11.post3
triton: 3.4.0
transformers: 4.55.2
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.1
orjson: 3.11.2
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.23
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.64.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     0-31          0              N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
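
Since the failing op lives in sgl_kernel rather than PyTorch, it is also worth listing which architectures its compiled extension actually contains. A sketch, assuming the CUDA toolkit's cuobjdump is on PATH; the extension filename inside the wheel varies between releases, so it is located by glob rather than hard-coded:

import glob
import os
import subprocess

import sgl_kernel

# Find the compiled extension(s) shipped with the installed sgl_kernel wheel.
pkg_dir = os.path.dirname(sgl_kernel.__file__)
for lib in glob.glob(os.path.join(pkg_dir, "**", "*.so"), recursive=True):
    print(lib)
    # Each ELF line names an embedded architecture such as sm_90. If no sm_120
    # entry (and no PTX the driver could JIT) appears, this wheel cannot run on
    # a compute-capability-12.0 GPU, which matches the error above.
    subprocess.run(["cuobjdump", "--list-elf", lib], check=False)

If sm_120 is indeed absent, a wheel or source build of sgl_kernel targeting sm_120 would be the actual fix; the memory-oriented workarounds suggested by the generic CUDA-graph error message would not help here.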
