Describe the bug
With ZeRO-2 + CPU offload + overlap_comm=true, the IPG (Independent Partition Gradient) buckets are never populated. During gradient reduction (reduce_ipg_grads), we consistently observe empty buckets (bucket.index=0 and len(bucket.buffer)=0). Earlier we also hit a "list index out of range" error (now avoided), but the buckets remain empty.
We instrumented both our trainer and the DeepSpeed ZeRO code. After backward, engine.optimizer.ipg_buckets[torch.float32] always shows elements=0 / params_len=0 / grads_len=0, with contiguous_gradients=true and false (tried both), partition_gradients=true, and overlap_comm=true.
The forward pass and loss are normal and the loss decreases, so gradients flow somewhere, but they never reach the IPG buckets.
To Reproduce
- Launcher: torchrun (not the deepspeed launcher)
- Training: multi-task loop. Each iteration has multiple tasks. We run forward per task, and either:
- backward per task (engine.backward(loss_task)), with GAS set to the number of tasks in this iteration; or
- sum all task losses and backward once, with GAS=1 (a sketch of this variant follows the pseudo-code below).
Both strategies reproduce the empty IPG buckets.
- zero_grad: In DeepSpeed mode, we call engine.zero_grad() once per iteration (outside the per-task loop) to avoid clearing buckets during accumulation.
- GAS: We set engine.set_gradient_accumulation_steps(num_tasks_in_this_iter) so GAS matches actual backward calls in that iteration.
Pseudo-code:

```python
# each train iter
engine.zero_grad()
engine.set_gradient_accumulation_steps(num_tasks)

for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)
    loss = sum(loss_dict.values())
    engine.backward(loss)

# later, engine.step() is called in a callback
```
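For completeness, the second strategy listed under "To Reproduce" (summed losses, a single backward, GAS=1) looks roughly like this; it is illustrative pseudo-code in the same style and reproduces the same empty buckets.

```python
# each train iter -- variant (b): sum all task losses, single backward, GAS=1
engine.zero_grad()
engine.set_gradient_accumulation_steps(1)

total_loss = 0.0
for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)
    total_loss = total_loss + sum(loss_dict.values())

engine.backward(total_loss)
# engine.step() is still called later in a callback
```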
Key logs (from our instrumentation):

```
comm_type: torch.float32, bucket.index=0, len(bucket.buffer)=0
contig=True: params_len=0, elements=0
[DeepSpeed] bucket.buffer out-of-range: bucket.index=0, len(bucket.buffer)=0
[IPG][trainer] dtype=torch.float32 elements=0 params_len=0 grads_len=0 idx=0 buf_len=0 contig=True part_grads=True overlap=True
```
We also added prints inside reduce_independent_p_g_buckets_and_remove_grads (enter/swap/copy-before-after/append lengths), but those prints never appear, suggesting the fill path didn’t run or was short-circuited.
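For reference, the trainer-side probe that produced the [IPG][trainer] line above is roughly the following. The attribute names (ipg_buckets, buffer, params, grads, elements, index, contiguous_gradients, partition_gradients, overlap_comm) are what we see on the stage-2 optimizer in DeepSpeed 0.17.6 and may differ in other versions.

```python
def dump_ipg_buckets(engine, tag="[IPG][trainer]"):
    # Inspect the stage-2 optimizer's IPG buckets right after engine.backward(loss).
    opt = engine.optimizer
    buckets = getattr(opt, "ipg_buckets", None)
    if buckets is None:
        print(f"{tag} optimizer has no ipg_buckets attribute")
        return
    for dtype, bucket in buckets.items():
        print(
            f"{tag} dtype={dtype} "
            f"elements={bucket.elements} "
            f"params_len={len(bucket.params)} "
            f"grads_len={len(bucket.grads)} "
            f"idx={bucket.index} "
            f"buf_len={len(bucket.buffer)} "
            f"contig={getattr(opt, 'contiguous_gradients', None)} "
            f"part_grads={getattr(opt, 'partition_gradients', None)} "
            f"overlap={getattr(opt, 'overlap_comm', None)}"
        )

# usage: engine.backward(loss); dump_ipg_buckets(engine)
```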
Expected behavior
- With ZeRO-2 + CPU offload + overlap_comm=true, reduce_independent_p_g_buckets_and_remove_grads should be triggered during backward hooks and populate the IPG buckets so that bucket.params/grads/elements > 0.
- reduce_ipg_grads should not encounter empty buffers, or should at least fall back safely rather than repeatedly hitting empty buckets.
ds_report output
We will attach the full ds_report output when submitting (currently not available in this environment).
Screenshots
N/A (logs attached above).
System info
- OS: Linux 5.4.0-125-generic
- Python: 3.8 (Conda env)
- DeepSpeed: 0.17.6
- PyTorch: 2.0.x (launched with torchrun)
- GPUs / Interconnect: see ds_report
Launcher context
- torchrun (not the deepspeed launcher)
Docker context
- Non-official DeepSpeed Docker; using a Conda-based image. We can provide image details if needed.
Configuration
Current DeepSpeed config (also tried variants; issue persists):
```json
{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "adamw",
    "params": { "lr": 1e-4, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.0 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 50000000,
    "overlap_comm": true,
    "contiguous_gradients": false
  },
  "fp16": { "enabled": false },
  "bf16": { "enabled": false },
  "checkpoint_activations": true,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false,
    "synchronize_checkpoint_boundary": true,
    "number_checkpoints": 0,
    "profile": false
  }
}
```
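For context, the engine is created from this config roughly as follows (names like ds_config_path and train.py are illustrative, not our exact code). Under torchrun, deepspeed.initialize sets up torch.distributed from the environment variables that torchrun provides.

```python
import deepspeed

# a minimal sketch, assuming `model` is the multi-task model and
# `ds_config_path` points at the JSON config shown above
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config_path,
)
# launched with: torchrun --nproc_per_node=<num_gpus> train.py
```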
We also tried:
- contiguous_gradients=true (CPU offload prefers it) and false; both result in empty buckets.
- Different reduce_bucket_size/allgather_bucket_size values.
- Letting DeepSpeed create the optimizer (DeepSpeedCPUAdam) vs passing in AdamW (see the sketch after this list).
- fp16/bf16 on/off (our model ultimately needs full FP32, so currently off).
- Per-task backward with GAS=task_num vs single backward with GAS=1.
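The external-optimizer variant mentioned above was set up roughly like this (a sketch, not our exact code); ds_config_no_opt is assumed to be the same config as above with the "optimizer" block removed. Both variants show the same empty-bucket behavior.

```python
import torch
import deepspeed

# pass an externally constructed AdamW instead of the config-defined optimizer
external_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=external_opt,
    config=ds_config_no_opt,
)
```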
Additional context
- Multi-task training loop. We ensured:
- No inner-loop zero_grad when using DeepSpeed; only once per iteration.
- GAS matches the number of backward calls in this iteration.
- Fixed earlier issues (non-contiguous parameters, ZeRO-Offload with external optimizer warning, CPU/GPU dtype mismatch in inputs).
What we’d like help with:
- Under ZeRO-2 + CPU offload + overlap_comm=true, under what conditions would reduce_independent_p_g_buckets_and_remove_grads not be triggered, leaving the IPG buckets empty?
- Are there known incompatibilities with multi-task/multi-microstep backward ordering and IPG?
- Is using torchrun instead of the deepspeed launcher a factor for IPG trigger paths?
- Are there any env flags or debug toggles to verify that the backward hooks are installed and that the IPG fill path is triggered? (A standalone autograd-side probe we can run is sketched below.)
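As a concrete example of the last point, one probe that is independent of DeepSpeed internals: register plain Tensor hooks on a few parameters to confirm that autograd computes gradients for them at all. This only verifies the autograd side, not DeepSpeed's own reduction hooks; the function name is hypothetical.

```python
def attach_grad_probes(model, max_params=3):
    # Tensor.register_hook fires when the gradient w.r.t. that leaf tensor is
    # computed, so a print here confirms backward reaches the parameter
    # regardless of what DeepSpeed's hooks do afterwards.
    probed = 0
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue

        def make_hook(param_name):
            def hook(grad):
                print(f"[probe] grad computed for {param_name}: shape={tuple(grad.shape)}")
            return hook

        p.register_hook(make_hook(name))
        probed += 1
        if probed >= max_params:
            break
```

Calling attach_grad_probes(engine.module) once before training and watching for the prints during engine.backward(...) would tell us whether the problem is upstream (no gradients at all) or specifically in the IPG fill path.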
We can provide a minimal reproducible script (simplified multi-task iteration + per-task backward + torchrun) and ds_report upon request. Thank you!