### Your current environment
`--tensor-parallel-size 2` appears to be broken on RTX 6000 Pro Blackwell cards since the 0.10 releases. With 0.9.2 I can run `-tp 2` across two RTX 6000 Pro Blackwell GPUs, but that version doesn't support newer models such as GLM-4.5.
Version: 0.10.1.dev446+g7e3a8dc90.d20250808.cu129
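For reference, an approximate launch command reconstructed from the engine config dump further down in the log; the model path, served model name, TP size, max model length, seed, and port all appear there, but the exact original invocation may have used different or additional flags.

```bash
# Approximate launch command, reconstructed from the config dump in the log below.
# Values come from the dump (tensor_parallel_size=2, max_seq_len=40960, seed=0,
# served_model_name=MindLink-32B-0801, port 8001); other flags may have differed.
vllm serve /mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801 \
  --served-model-name MindLink-32B-0801 \
  --tensor-parallel-size 2 \
  --max-model-len 40960 \
  --seed 0 \
  --port 8001
```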
### 🐛 Describe the bug
Tensor parallelism is broken on RTX 6000 Pro Blackwell GPUs: the server starts up normally, serves a first streaming chat request (200 OK), and then about five minutes later the worker RPC times out, the engine core dies with `TimeoutError: RPC call to execute_model timed out`, and the request ends in `EngineDeadError` (500).
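The exact request body was not captured; below is a minimal sketch of its shape, assuming only the endpoint, port, served model name, and streaming mode that appear in the log. The prompt is a placeholder (the real one was roughly 557 prompt tokens according to the scheduler dump).

```bash
# Hypothetical request sketch: endpoint, port, model name, and streaming mode are
# taken from the log; the prompt is a placeholder (the real one was ~557 tokens).
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MindLink-32B-0801",
        "messages": [{"role": "user", "content": "<placeholder prompt>"}],
        "stream": true
      }'
```

The full server log from CUDA graph capture to the crash: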
```
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████| 67/67 [00:07<00:00, 8.75it/s]
(VllmWorker TP0 pid=1422634) INFO 08-07 19:06:29 [custom_all_reduce.py:196] Registering 8643 cuda graph addresses
(VllmWorker TP1 pid=1422635) INFO 08-07 19:06:30 [custom_all_reduce.py:196] Registering 8643 cuda graph addresses
(VllmWorker TP1 pid=1422635) INFO 08-07 19:06:31 [gpu_model_runner.py:2537] Graph capturing finished in 9 secs, took 1.21 GiB
(VllmWorker TP0 pid=1422634) INFO 08-07 19:06:31 [gpu_model_runner.py:2537] Graph capturing finished in 9 secs, took 1.21 GiB
(EngineCore_0 pid=1422484) INFO 08-07 19:06:31 [core.py:198] init engine (profile, create kv cache, warmup model) took 114.28 seconds
(APIServer pid=1422128) INFO 08-07 19:06:32 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 29329
(APIServer pid=1422128) INFO 08-07 19:06:32 [api_server.py:1610] Supported_tasks: ['generate']
(APIServer pid=1422128) WARNING 08-07 19:06:32 [config.py:1670] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1422128) INFO 08-07 19:06:32 [serving_responses.py:107] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1422128) INFO 08-07 19:06:32 [serving_chat.py:133] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1422128) INFO 08-07 19:06:32 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1422128) INFO 08-07 19:06:32 [api_server.py:1865] Starting vLLM API server 0 on http://0.0.0.0:8001
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:29] Available routes are:
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=1422128) INFO 08-07 19:06:32 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=1422128) INFO: Started server process [1422128]
(APIServer pid=1422128) INFO: Waiting for application startup.
(APIServer pid=1422128) INFO: Application startup complete.
(APIServer pid=1422128) INFO 08-07 19:06:43 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1422128) INFO: 127.0.0.1:57230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.dev446+g7e3a8dc90.d20250808) with config: model='/mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801', speculative_config=None, tokenizer='/mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=MindLink-32B-0801, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-5300a9bf236543b98c3d48fedd197299,prompt_token_ids_len=557,mm_inputs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40403, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-5300a9bf236543b98c3d48fedd197299: 557}, total_num_scheduled_tokens=557, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[35], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0012274540557127844, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=557, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] EngineCore encountered a fatal error.
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] Traceback (most recent call last):
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 226, in get_response
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] status, result = w.worker_response_mq.dequeue(
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] with self.acquire_read(timeout, cancel) as buf:
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/.pyenv/versions/3.12.11/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] return next(self.gen)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 469, in acquire_read
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] raise TimeoutError
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] TimeoutError
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] The above exception was the direct cause of the following exception:
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684]
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] Traceback (most recent call last):
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/engine/core.py", line 675, in run_engine_core
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] engine_core.run_busy_loop()
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/engine/core.py", line 702, in run_busy_loop
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] self._process_engine_step()
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/engine/core.py", line 727, in _process_engine_step
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] outputs, model_executed = self.step_fn()
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/engine/core.py", line 272, in step
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/engine/core.py", line 258, in execute_model_with_error_logging
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] raise err
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/engine/core.py", line 249, in execute_model_with_error_logging
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] return model_fn(scheduler_output)
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] (output, ) = self.collective_rpc(
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] File "/home/giga/vllm/vllm/v1/executor/multiproc_executor.py", line 249, in collective_rpc
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=1422484) ERROR 08-07 19:11:43 [core.py:684] TimeoutError: RPC call to execute_model timed out.
(VllmWorker TP1 pid=1422635) INFO 08-07 19:11:43 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP0 pid=1422634) INFO 08-07 19:11:43 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] File "/home/giga/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] outputs = await engine_core.get_output_async()
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] File "/home/giga/vllm/vllm/v1/engine/core_client.py", line 809, in get_output_async
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] raise self._format_exception(outputs) from None
(APIServer pid=1422128) ERROR 08-07 19:11:43 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] Error in chat completion stream generator.
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] Traceback (most recent call last):
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] File "/home/giga/vllm/vllm/entrypoints/openai/serving_chat.py", line 544, in chat_completion_stream_generator
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] async for res in result_generator:
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] File "/home/giga/vllm/vllm/v1/engine/async_llm.py", line 337, in generate
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] out = q.get_nowait() or await q.get()
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] ^^^^^^^^^^^^^
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] File "/home/giga/vllm/vllm/v1/engine/output_processor.py", line 57, in get
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] raise output
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] File "/home/giga/vllm/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] outputs = await engine_core.get_output_async()
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] File "/home/giga/vllm/vllm/v1/engine/core_client.py", line 809, in get_output_async
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] raise self._format_exception(outputs) from None
(APIServer pid=1422128) ERROR 08-07 19:11:43 [serving_chat.py:1050] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1422128) INFO: 127.0.0.1:57230 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1422128) INFO: Shutting down
(APIServer pid=1422128) INFO: Waiting for application shutdown.
(APIServer pid=1422128) INFO: Application shutdown complete.
(APIServer pid=1422128) INFO: Finished server process [1422128]
/home/giga/.pyenv/versions/3.12.11/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/giga/.pyenv/versions/3.12.11/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```
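For triage: the config dump shows `enforce_eager=False` and `disable_custom_all_reduce=False`. If it helps narrow things down, a run with CUDA graphs and the custom all-reduce kernel disabled would show whether the hang is specific to that path. This is only a sketch using standard vLLM flags, not a verified workaround.

```bash
# Triage sketch only: relaunch with CUDA graphs and the custom all-reduce kernel
# disabled to see whether the execute_model RPC still times out. These are standard
# vLLM flags; whether this avoids the hang has not been verified.
vllm serve /mnt/2king/llama-models/Skywork/Skywork/MindLink-32B-0801 \
  --served-model-name MindLink-32B-0801 \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --disable-custom-all-reduce \
  --port 8001
```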
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.