Describe the bug
With ZeRO-2 + CPU offload + overlap_comm=true, the IPG (Independent Partition Gradient) buckets are never populated. During gradient reduction (reduce_ipg_grads), we consistently observe empty buckets (bucket.index=0 and len(bucket.buffer)=0). Earlier we also hit a "list index out of range" error (now avoided), but the buckets remain empty.
We instrumented both our trainer and the DeepSpeed ZeRO code. After backward, engine.optimizer.ipg_buckets[torch.float32] always shows elements=0 / params_len=0 / grads_len=0, with contiguous_gradients=true and false (tried both), partition_gradients=true, and overlap_comm=true.
The forward pass and loss are normal and the loss decreases, so gradients flow somewhere, but they never reach the IPG buckets.
To Reproduce
- Launcher: torchrun (not the deepspeed launcher)
- Training: multi-task loop. Each iteration has multiple tasks. We run forward per task, and either:
- backward per task (engine.backward(loss_task)), with GAS set to the number of tasks in this iteration; or
- sum all task losses and backward once, with GAS=1 (a sketch of this variant follows the pseudo-code below).
Both strategies reproduce the empty IPG buckets.
- zero_grad: In DeepSpeed mode, we call engine.zero_grad() once per iteration (outside the per-task loop) to avoid clearing buckets during accumulation.
- GAS: We set engine.set_gradient_accumulation_steps(num_tasks_in_this_iter) so GAS matches actual backward calls in that iteration.
Pseudo-code:

```python
# each train iter
engine.zero_grad()
engine.set_gradient_accumulation_steps(num_tasks)

for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)
    loss = sum(loss_dict.values())
    engine.backward(loss)

# later, engine.step() is called in a callback
```
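For completeness, the second strategy listed under "To Reproduce" (summed losses, a single backward, GAS=1) looks roughly like this; it is illustrative pseudo-code in the same style and reproduces the same empty buckets.

```python
# each train iter -- variant (b): sum all task losses, single backward, GAS=1
engine.zero_grad()
engine.set_gradient_accumulation_steps(1)

total_loss = 0.0
for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)
    total_loss = total_loss + sum(loss_dict.values())

engine.backward(total_loss)
# engine.step() is still called later in a callback
```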
Key logs (from our instrumentation):

```
comm_type: torch.float32, bucket.index=0, len(bucket.buffer)=0
contig=True: params_len=0, elements=0
[DeepSpeed] bucket.buffer out-of-range: bucket.index=0, len(bucket.buffer)=0
[IPG][trainer] dtype=torch.float32 elements=0 params_len=0 grads_len=0 idx=0 buf_len=0 contig=True part_grads=True overlap=True
```
We also added prints inside reduce_independent_p_g_buckets_and_remove_grads (enter/swap/copy-before-after/append lengths), but those prints never appear, suggesting the fill path didn’t run or was short-circuited.
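For reference, the trainer-side probe that produced the [IPG][trainer] line above is roughly the following. The attribute names (ipg_buckets, buffer, params, grads, elements, index, contiguous_gradients, partition_gradients, overlap_comm) are what we see on the stage-2 optimizer in DeepSpeed 0.17.6 and may differ in other versions.

```python
def dump_ipg_buckets(engine, tag="[IPG][trainer]"):
    # Inspect the stage-2 optimizer's IPG buckets right after engine.backward(loss).
    opt = engine.optimizer
    buckets = getattr(opt, "ipg_buckets", None)
    if buckets is None:
        print(f"{tag} optimizer has no ipg_buckets attribute")
        return
    for dtype, bucket in buckets.items():
        print(
            f"{tag} dtype={dtype} "
            f"elements={bucket.elements} "
            f"params_len={len(bucket.params)} "
            f"grads_len={len(bucket.grads)} "
            f"idx={bucket.index} "
            f"buf_len={len(bucket.buffer)} "
            f"contig={getattr(opt, 'contiguous_gradients', None)} "
            f"part_grads={getattr(opt, 'partition_gradients', None)} "
            f"overlap={getattr(opt, 'overlap_comm', None)}"
        )

# usage: engine.backward(loss); dump_ipg_buckets(engine)
```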
Expected behavior
- With ZeRO-2 + CPU offload + overlap_comm=true, reduce_independent_p_g_buckets_and_remove_grads should be triggered during backward hooks and populate the IPG buckets so that bucket.params/grads/elements > 0.
- reduce_ipg_grads should not encounter empty buffers, or should at least fall back safely rather than repeatedly hitting empty buckets.
ds_report output
We will attach the full ds_report output when submitting (currently not available in this environment).
Screenshots
N/A (logs attached above).
System info
- OS: Linux 5.4.0-125-generic
- Python: 3.8 (Conda env)
- DeepSpeed: 0.17.6
- PyTorch: 2.0.x (launched with torchrun)
- GPUs / Interconnect: see ds_report
Launcher context
- torchrun (not the deepspeed launcher)
Docker context
- Non-official DeepSpeed Docker; using a Conda-based image. We can provide image details if needed.
Configuration
Current DeepSpeed config (also tried variants; issue persists):
```json
{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "adamw",
    "params": { "lr": 1e-4, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.0 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 50000000,
    "overlap_comm": true,
    "contiguous_gradients": false
  },
  "fp16": { "enabled": false },
  "bf16": { "enabled": false },
  "checkpoint_activations": true,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false,
    "synchronize_checkpoint_boundary": true,
    "number_checkpoints": 0,
    "profile": false
  }
}
```
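For context, the engine is created from this config roughly as follows (names like ds_config_path and train.py are illustrative, not our exact code). Under torchrun, deepspeed.initialize sets up torch.distributed from the environment variables that torchrun provides.

```python
import deepspeed

# a minimal sketch, assuming `model` is the multi-task model and
# `ds_config_path` points at the JSON config shown above
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config_path,
)
# launched with: torchrun --nproc_per_node=<num_gpus> train.py
```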
We also tried:
- contiguous_gradients=true (CPU offload prefers it) and false; both result in empty buckets.
- Different reduce_bucket_size/allgather_bucket_size values.
- Letting DeepSpeed create the optimizer (DeepSpeedCPUAdam) vs passing in AdamW (see the sketch after this list).
- fp16/bf16 on/off (our model ultimately needs full FP32, so currently off).
- Per-task backward with GAS=task_num vs single backward with GAS=1.
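The external-optimizer variant mentioned above was set up roughly like this (a sketch, not our exact code); ds_config_no_opt is assumed to be the same config as above with the "optimizer" block removed. Both variants show the same empty-bucket behavior.

```python
import torch
import deepspeed

# pass an externally constructed AdamW instead of the config-defined optimizer
external_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=external_opt,
    config=ds_config_no_opt,
)
```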
Additional context
- Multi-task training loop. We ensured:
- No inner-loop zero_grad when using DeepSpeed; only once per iteration.
- GAS matches the number of backward calls in this iteration.
- Fixed earlier issues (non-contiguous parameters, ZeRO-Offload with external optimizer warning, CPU/GPU dtype mismatch in inputs).
What we’d like help with:
- Under ZeRO-2 + CPU offload + overlap_comm=true, under what conditions would reduce_independent_p_g_buckets_and_remove_grads not be triggered, leaving the IPG buckets empty?
- Are there known incompatibilities with multi-task/multi-microstep backward ordering and IPG?
- Is using torchrun instead of the deepspeed launcher a factor for IPG trigger paths?
- Are there any env flags or debug toggles to verify that the backward hooks are installed and that the IPG fill path is triggered? (A standalone autograd-side probe we can run is sketched below.)
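As a concrete example of the last point, one probe that is independent of DeepSpeed internals: register plain Tensor hooks on a few parameters to confirm that autograd computes gradients for them at all. This only verifies the autograd side, not DeepSpeed's own reduction hooks; the function name is hypothetical.

```python
def attach_grad_probes(model, max_params=3):
    # Tensor.register_hook fires when the gradient w.r.t. that leaf tensor is
    # computed, so a print here confirms backward reaches the parameter
    # regardless of what DeepSpeed's hooks do afterwards.
    probed = 0
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue

        def make_hook(param_name):
            def hook(grad):
                print(f"[probe] grad computed for {param_name}: shape={tuple(grad.shape)}")
            return hook

        p.register_hook(make_hook(name))
        probed += 1
        if probed >= max_params:
            break
```

Calling attach_grad_probes(engine.module) once before training and watching for the prints during engine.backward(...) would tell us whether the problem is upstream (no gradients at all) or specifically in the IPG fill path.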
We can provide a minimal reproducible script (simplified multi-task iteration + per-task backward + torchrun) and ds_report upon request. Thank you!