
[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479


Description

@fernandaspets

Your current environment

--tensor-parallel-size 2 seems broken for the RTX 6000 Pro (Blackwell) since version 0.10. I can run with -tp 2 on version 0.9.2 with two RTX 6000 Pro Blackwell cards, but 0.9.2 doesn't support newer models like GLM-4.5, etc.

Version: 0.10.1.dev446+g7e3a8dc90.d20250808.cu129

🐛 Describe the bug

Tensor parallelism is broken for RTX 6000 Pros.
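The exact launch command isn't shown above; here is a minimal sketch that should exercise the same tp=2 path, assuming the model path and context length taken from the engine config dumped in the log below (everything else is illustrative, not the reporter's exact setup):

```python
# Minimal reproduction sketch, not the reporter's exact command.
# Model path and max_model_len are taken from the dumped engine config below;
# the prompt and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801",
    tensor_parallel_size=2,   # the setting that triggers the hang on 0.10.x
    max_model_len=40960,
)
outputs = llm.generate(
    ["Hello"],
    SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```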

```
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████| 67/67 [00:07<00:00,  8.75it/s]
(VllmWorker TP0 pid=1422634) INFO 08-07 19:06:29 [custom_all_reduce.py:196] Registering 8643 cuda graph addresses
(VllmWorker TP1 pid=1422635) INFO 08-07 19:06:30 [custom_all_reduce.py:196] Registering 8643 cuda graph addresses
(VllmWorker TP1 pid=1422635) INFO 08-07 19:06:31 [gpu_model_runner.py:2537] Graph capturing finished in 9 secs, took 1.21 GiB
(VllmWorker TP0 pid=1422634) INFO 08-07 19:06:31 [gpu_model_runner.py:2537] Graph capturing finished in 9 secs, took 1.21 GiB
(EngineCore_0 pid=1422484) INFO 08-07 19:06:31 [core.py:198] init engine (profile, create kv cache, warmup model) took 114.28 seconds
(APIServer pid=1422128) INFO 08-07 19:06:32 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 29329
(APIServer pid=1422128) INFO 08-07 19:06:32 [api_server.py:1610] Supported_tasks: ['generate']
(APIServer pid=1422128) WARNING 08-07 19:06:32 [config.py:1670] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1422128) INFO 08-07 19:06:32 [serving_responses.py:107] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1422128) INFO 08-07 19:06:32 [serving_chat.py:133] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1422128) INFO 08-07 19:06:32 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1422128) INFO 08-07 19:06:32 [api_server.py:1865] Starting vLLM API server 0 on http://0.0.0.0:8001
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:29] Available routes are:
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=1422128) INFO:     Started server process [1422128]
(APIServer pid=1422128) INFO:     Waiting for application startup.
(APIServer pid=1422128) INFO:     Application startup complete.
(APIServer pid=1422128) INFO 08-07 19:06:43 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1422128) INFO:     127.0.0.1:57230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.dev446+g7e3a8dc90.d20250808) with config: model='/mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801', speculative_config=None, tokenizer='/mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=MindLink-32B-0801, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-5300a9bf236543b98c3d48fedd197299,prompt_token_ids_len=557,mm_inputs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40403, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-5300a9bf236543b98c3d48fedd197299: 557}, total_num_scheduled_tokens=557, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[35], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0012274540557127844, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=557, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] EngineCore encountered a fatal error.
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] Traceback (most recent call last):
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 226, in get_response
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     status, result = w.worker_response_mq.dequeue(
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     with self.acquire_read(timeout, cancel) as buf:
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/.pyenv/versions/3.12.11/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     return next(self.gen)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]            ^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 469, in acquire_read
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     raise TimeoutError
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] TimeoutError
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] The above exception was the direct cause of the following exception:
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] Traceback (most recent call last):
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/engine/core.py", line 675, in run_engine_core
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     engine_core.run_busy_loop()
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/engine/core.py", line 702, in run_busy_loop
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     self._process_engine_step()
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/engine/core.py", line 727, in _process_engine_step
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     outputs, model_executed = self.step_fn()
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]                               ^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/engine/core.py", line 272, in step
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/engine/core.py", line 258, in execute_model_with_error_logging
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     raise err
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/engine/core.py", line 249, in execute_model_with_error_logging
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     return model_fn(scheduler_output)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     (output, ) = self.collective_rpc(
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]   File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 249, in collective_rpc
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] TimeoutError: RPC call to execute_model timed out.
(VllmWorker TP1 pid=1422635) INFO 08-07 19:11:43 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP0 pid=1422634) INFO 08-07 19:11:43 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430]   File "/home/giga/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430]     outputs = await engine_core.get_output_async()
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430]   File "/home/giga/vllm/vllm/v1/engine/core_client.py", line 809, in get_output_async
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430]     raise self._format_exception(outputs) from None
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] Error in chat completion stream generator.
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] Traceback (most recent call last):
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]   File "/home/giga/vllm/vllm/entrypoints/openai/serving_chat.py", line 544, in chat_completion_stream_generator
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]     async for res in result_generator:
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]   File "/home/giga/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]     out = q.get_nowait() or await q.get()
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]                             ^^^^^^^^^^^^^
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]   File "/home/giga/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]     raise output
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]   File "/home/giga/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]     outputs = await engine_core.get_output_async()
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]   File "/home/giga/vllm/vllm/v1/engine/core_client.py", line 809, in get_output_async
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050]     raise self._format_exception(outputs) from None
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1422128) INFO:     127.0.0.1:57230 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1422128) INFO:     Shutting down
(APIServer pid=1422128) INFO:     Waiting for application shutdown.
(APIServer pid=1422128) INFO:     Application shutdown complete.
(APIServer pid=1422128) INFO:     Finished server process [1422128]
/home/giga/.pyenv/versions/3.12.11/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/giga/.pyenv/versions/3.12.11/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
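The traceback above shows the engine timing out while waiting on the TP worker response queue during execute_model, and the dumped config has disable_custom_all_reduce=False. As a narrowing step only (a sketch under that assumption, not a verified fix), the same load could be retried with the custom all-reduce path disabled:

```python
# Diagnostic sketch only: same model and TP settings as above, but with vLLM's
# custom all-reduce kernel disabled (the dumped config shows it enabled).
# If the tp=2 timeout goes away, the regression likely sits in that path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801",
    tensor_parallel_size=2,
    max_model_len=40960,
    disable_custom_all_reduce=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```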


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
