
[BUG] With ZeRO-2 + CPU Offload + overlap_comm=true, the IPG (Independent Partition Gradient) buckets are never populated #7579

@XiDianZuoYun


Describe the bug

With ZeRO-2 + CPU Offload + overlap_comm=true, the IPG (Independent Partition Gradient) buckets are never populated. During gradient reduction (reduce_ipg_grads), we consistently observe empty buckets (bucket.index=0 and len(bucket.buffer)=0). Earlier we also hit a "list index out of range" error (now avoided), but the buckets remain empty.

We instrumented both our trainer and the DeepSpeed ZeRO code. After backward, engine.optimizer.ipg_buckets[torch.float32] always shows elements=0 / params_len=0 / grads_len=0, regardless of contiguous_gradients (we tried both true and false), with partition_gradients=true and overlap_comm=true.

Forward/loss is normal and loss decreases, so gradients flow somewhere, but they never reach the IPG buckets.
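
For reference, the check we run after each backward looks roughly like this (a minimal sketch based on the internal attributes we observed on DeepSpeed 0.17.6; ipg_buckets and the params/grads/elements/index/buffer fields on each bucket are internals and may differ in other versions):

def dump_ipg_buckets(engine, tag=""):
    # ipg_buckets is an internal dict on the ZeRO stage-2 optimizer, keyed by dtype.
    buckets = getattr(engine.optimizer, "ipg_buckets", None)
    if buckets is None:
        print(f"[IPG]{tag} optimizer has no ipg_buckets attribute")
        return
    for dtype, bucket in buckets.items():
        print(
            f"[IPG]{tag} dtype={dtype} elements={bucket.elements} "
            f"params_len={len(bucket.params)} grads_len={len(bucket.grads)} "
            f"idx={bucket.index} buf_len={len(bucket.buffer)}"
        )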

To Reproduce

  • Launcher: torchrun (not deepspeed launcher)
  • Training: multi-task loop. Each iteration has multiple tasks. We run forward per task, and either:
    1. backward per task (engine.backward(loss_task)), with GAS set to the number of tasks in this iteration; or
    2. sum all task losses and backward once, with GAS=1.
    Both strategies reproduce the empty IPG buckets.
  • zero_grad: In DeepSpeed mode, we call engine.zero_grad() once per iteration (outside the per-task loop) to avoid clearing buckets during accumulation.
  • GAS: We set engine.set_gradient_accumulation_steps(num_tasks_in_this_iter) so that GAS matches the actual number of backward calls in that iteration.

Pseudo-code:

# each training iteration
engine.zero_grad()  # only once per iteration, outside the per-task loop
engine.set_gradient_accumulation_steps(num_tasks)  # GAS = number of backward calls this iteration
for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)  # per-task forward
    loss = sum(loss_dict.values())  # combine the task's loss terms
    engine.backward(loss)  # one backward per task
# engine.step() is called later from a callback

Key logs (from our instrumentation):

comm_type: torch.float32, bucket.index=0, len(bucket.buffer)=0
contig=True: params_len=0, elements=0
[DeepSpeed] bucket.buffer out-of-range: bucket.index=0, len(bucket.buffer)=0
[IPG][trainer] dtype=torch.float32 elements=0 params_len=0 grads_len=0 idx=0 buf_len=0 contig=True part_grads=True overlap=True

We also added prints inside reduce_independent_p_g_buckets_and_remove_grads (enter/swap/copy-before-after/append lengths), but those prints never appear, suggesting the fill path didn’t run or was short-circuited.
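
In case it is useful, this is roughly how we now verify at runtime whether that method is reached at all (a sketch that monkey-patches the bound method on the live optimizer; the method name is internal API taken from our instrumentation, and this assumes the backward hooks resolve the method through the instance at call time):

import functools

def trace_reduce_path(engine):
    opt = engine.optimizer
    original = opt.reduce_independent_p_g_buckets_and_remove_grads

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        # If this never prints during backward, the IPG fill path is not being reached.
        print("[IPG][trace] reduce_independent_p_g_buckets_and_remove_grads called")
        return original(*args, **kwargs)

    # Rebind on the instance so calls through self pick up the wrapper.
    opt.reduce_independent_p_g_buckets_and_remove_grads = wrapped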

Expected behavior

  • With ZeRO-2 + CPU offload + overlap_comm=true, reduce_independent_p_g_buckets_and_remove_grads should be triggered during backward hooks and populate the IPG buckets so that bucket.params/grads/elements > 0.
  • reduce_ipg_grads should not encounter empty buffers; if it does, it should fall back safely instead of repeatedly hitting empty buckets.

ds_report output

We will attach full ds_report output when submitting (currently not available in this environment).

Screenshots

N/A (logs attached above).

System info

  • OS: Linux 5.4.0-125-generic
  • Python: 3.8 (Conda env)
  • DeepSpeed: 0.17.6
  • PyTorch: 2.0.x (launched with torchrun)
  • GPUs / Interconnect: see ds_report

Launcher context

  • torchrun (not the deepspeed launcher)

Docker context

  • Non-official DeepSpeed Docker; using a Conda-based image. We can provide image details if needed.

Configuration

Current DeepSpeed config (also tried variants; issue persists):

{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "adamw",
    "params": { "lr": 1e-4, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.0 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 50000000,
    "overlap_comm": true,
    "contiguous_gradients": false
  },
  "fp16": { "enabled": false },
  "bf16": { "enabled": false },
  "checkpoint_activations": true,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false,
    "synchronize_checkpoint_boundary": true,
    "number_checkpoints": 0,
    "profile": false
  }
}
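
For completeness, the engine is created along these lines (a minimal sketch assuming the config above is saved as ds_config.json and the script is launched with torchrun; the placeholder model stands in for our real multi-task network):

import torch
import deepspeed

model = torch.nn.Linear(16, 16)  # placeholder for our multi-task model

# DeepSpeed builds the optimizer from the "optimizer" block in the config
# (DeepSpeedCPUAdam when optimizer offload is enabled).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)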

We also tried:

  • contiguous_gradients=true (CPU offload prefers it) and false; both result in empty buckets.
  • Different reduce_bucket_size/allgather_bucket_size values.
  • Letting DeepSpeed create the optimizer (DeepSpeedCPUAdam) vs passing in AdamW (see the sketch after this list).
  • fp16/bf16 on/off (our model ultimately needs full FP32, so currently off).
  • Per-task backward with GAS=task_num vs single backward with GAS=1.
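
The two optimizer setups from that bullet look roughly like this (a sketch, each variant used in a separate run; ds_config is the dict shown above and model is the same placeholder as in the earlier sketch):

import torch
import deepspeed

# Variant A: DeepSpeed builds the optimizer from the "optimizer" block in the
# config (with optimizer offload this becomes DeepSpeedCPUAdam).
engine_a, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Variant B: pass in a client torch.optim.AdamW and drop the "optimizer" block
# from the config so DeepSpeed wraps the external optimizer instead.
ds_config_no_opt = {k: v for k, v in ds_config.items() if k != "optimizer"}
client_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
engine_b, _, _, _ = deepspeed.initialize(
    model=model,
    optimizer=client_optimizer,
    config=ds_config_no_opt,
)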

Additional context

  • Multi-task training loop. We ensured:
    • No inner-loop zero_grad when using DeepSpeed; only once per iteration.
    • GAS matches the number of backward calls in this iteration.
    • Fixed earlier issues (non-contiguous parameters, ZeRO-Offload with external optimizer warning, CPU/GPU dtype mismatch in inputs).

What we’d like help with:

  • With ZeRO-2 + CPU offload + overlap_comm=true, under what conditions would reduce_independent_p_g_buckets_and_remove_grads not be triggered, leaving the IPG buckets empty?
  • Are there known incompatibilities between multi-task / multi-microstep backward ordering and IPG?
  • Is using torchrun instead of the deepspeed launcher a factor in the IPG trigger paths?
  • Are there any environment flags or debug toggles to verify that the backward hooks are installed and that the IPG fill path triggers?

We can provide a minimal reproducible script (simplified multi-task iteration + per-task backward + torchrun) and ds_report upon request. Thank you!
