
[Bug] 'tensor_model_parallel_all_reduce' is not defined #2931

@bakch92

Description

Describe the bug

I attempted to serve a LoRA fine-tuned Phi-4 model with tensor parallel size 2 using the sglang framework, but the following error occurred.

[Error Log]

[2025-01-17 01:51:55 TP0] LoRA manager ready.
[2025-01-17 01:51:57 TP1] Load weight end. type=Phi3ForCausalLM, dtype=torch.float16, avail mem=15.70 GB
[2025-01-17 01:52:00 TP1] LoRA manager ready.
[2025-01-17 01:52:00 TP0] Memory pool end. avail mem=39.54 GB
[2025-01-17 01:52:02 TP1] Memory pool end. avail mem=13.43 GB
[2025-01-17 01:52:02 TP1] max_total_num_tokens=16384, max_prefill_tokens=16384, max_running_requests=2049, context_len=16384
[2025-01-17 01:52:02 TP0] max_total_num_tokens=16384, max_prefill_tokens=16384, max_running_requests=2049, context_len=16384
[2025-01-17 01:52:02] INFO:     Started server process [649817]
[2025-01-17 01:52:02] INFO:     Waiting for application startup.
[2025-01-17 01:52:02] INFO:     Application startup complete.
[2025-01-17 01:52:02] INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
[2025-01-17 01:52:03] INFO:     127.0.0.1:47632 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-17 01:52:03 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-17 01:52:11 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 101, in forward_thread_func
    self.forward_thread_func_()
  File "/home/work/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 132, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 154, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 679, in forward
    return self.forward_extend(forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 648, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 337, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 288, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 237, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 175, in forward
    output, _ = self.o_proj(attn_output)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/lora/lora.py", line 248, in forward
    output_ = tensor_model_parallel_all_reduce(output_parallel)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NameError: name 'tensor_model_parallel_all_reduce' is not defined
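
The NameError indicates that tensor_model_parallel_all_reduce is used in sglang/srt/lora/lora.py (line 248 of the installed package) without being imported into that module's namespace. As a quick diagnostic, the sketch below checks whether the symbol is importable at all in this environment and whether the LoRA module can see it; the module names are my assumptions, based on the traceback and on vllm 0.6.4.post1 (where the collective op is normally defined) being installed alongside sglang 0.4.0.

```python
# Diagnostic sketch (my own helper, not part of sglang): check whether
# tensor_model_parallel_all_reduce exists where the LoRA code might expect it.
import importlib

CANDIDATES = [
    "vllm.distributed",      # where the collective op lives in vllm 0.6.x (assumption)
    "sglang.srt.lora.lora",  # the module raising the NameError in the traceback
]

for name in CANDIDATES:
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        print(f"{name}: import failed ({exc})")
        continue
    present = hasattr(module, "tensor_model_parallel_all_reduce")
    print(f"{name}.tensor_model_parallel_all_reduce: {'found' if present else 'MISSING'}")
```

If the symbol imports cleanly from vllm.distributed but is missing from sglang.srt.lora.lora, the failure is most likely a missing import in the LoRA tensor-parallel code path rather than an environment problem.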

Reproduction

Model Name: Microsoft Phi-4

nohup python -m sglang.launch_server \
    --model-path /home/work/ai/Microsoft_Phi-4/phi-4_quantized_8bit \
    --lora-paths lora=/home/work/ai/Microsoft_Phi-4/lora_tuning_1221 \
    --port 8001 \
    --mem-fraction-static 0.8 \
    --host 0.0.0.0 \
    --dtype auto \
    --disable-radix-cache \
    --disable-cuda-graph \
    --quantization gptq_marlin \
    --max-total-tokens 16384 \
    --tp 2 &
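
For reference, the server starts up fine and the crash only appears once the first request hits the LoRA + tensor-parallel path (the prefill batch at 01:52:03 in the log). A minimal request that exercises that path is sketched below; it assumes sglang's native /generate endpoint on port 8001 and the adapter name "lora" from --lora-paths above, so adjust it if your client or endpoint differs.

```python
# Minimal client sketch to hit the LoRA + TP path (assumptions: native
# /generate endpoint, port 8001, adapter registered as "lora").
import requests

resp = requests.post(
    "http://localhost:8001/generate",
    json={
        "text": "Hello, who are you?",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
        "lora_path": "lora",  # adapter name from --lora-paths lora=<path>
    },
    timeout=60,
)
print(resp.status_code, resp.text)
```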

Environment

Python: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2: CUDA GPU
GPU 0,1,2 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
CUDA Driver Version: 535.54.03
PyTorch: 2.5.1+cu124
sglang: 0.4.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.8
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.4
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.58.1
anthropic: Module Not Found
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   NODE  SYS   1,3,5,7,9,11  1             N/A
GPU1  PIX    X    NODE  SYS   1,3,5,7,9,11  1             N/A
GPU2  NODE  NODE   X    SYS   1,3,5,7,9,11  1             N/A
NIC0  SYS   SYS   SYS    X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0

ulimit soft: 1048576

Labels: bug, lora
