[BUG] Exception raised at the end of training with deepcompile enabled #7578

@eternalNight

Describe the bug

With a training script like the following:

import deepspeed
import deepspeed.comm as dist

def main(args):
    deepspeed.init_distributed()
    model = Model()
    # ... (setup and training steps elided in the report)
    model.destroy()
    dist.destroy_process_group()

The following exception is raised at the end of the training process if and only if deepcompile is enabled:

Exception ignored in: <function DeepSpeedEngine.__del__ at 0x7f241b4fe830>
Traceback (most recent call last):
  File "/mnt/engines/deepspeed/deepspeed/runtime/engine.py", line 519, in __del__
    self.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/engine.py", line 523, in destroy
    self.optimizer.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/stage3.py", line 468, in destroy
    self.parameter_offload.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/parameter_offload.py", line 227, in destroy
    self._remove_module_hooks()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/parameter_offload.py", line 241, in _remove_module_hooks
    print_rank_0(f'Deleted module hooks: forward = {num_forward_hooks}, backward = {num_backward_hooks}',
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 113, in print_rank_0
    rank = dist.get_rank()
  File "/mnt/engines/deepspeed/deepspeed/comm/comm.py", line 720, in get_rank
    assert cdb is not None and cdb.is_initialized(
AssertionError: DeepSpeed backend not set, please initialize it using init_process_group()
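
The traceback shows the teardown path (`__del__` -> `destroy` -> `print_rank_0`) reaching `dist.get_rank()` after the process group has already been destroyed. The ordering can be sketched in isolation as below. `FakeComm`, `FakeEngine`, `safe_print_rank_0` and `run` are hypothetical stand-ins written for illustration, not DeepSpeed code, and the sketch does not explain why the late teardown only happens with deepcompile enabled. The guarded variant is one possible mitigation under these assumptions, not a confirmed fix.

```python
# Stand-in sketch: FakeComm and FakeEngine are hypothetical and only mimic
# the call order visible in the traceback; they are not DeepSpeed code.


class FakeComm:
    """Mimics a comm module whose get_rank() asserts on a live backend."""

    def __init__(self):
        self.initialized = False
        self.rank = 0

    def init_distributed(self):
        self.initialized = True

    def destroy_process_group(self):
        self.initialized = False

    def is_initialized(self):
        return self.initialized

    def get_rank(self):
        assert self.initialized, "backend not set"
        return self.rank


def print_rank_0(comm, message, log):
    # Unguarded: raises if the process group is already gone.
    if comm.get_rank() == 0:
        log.append(message)


def safe_print_rank_0(comm, message, log):
    # Guarded variant: skip the rank lookup once the backend is torn down.
    if comm.is_initialized() and comm.get_rank() == 0:
        log.append(message)


class FakeEngine:
    def __init__(self, comm, printer, log):
        self.comm, self.printer, self.log = comm, printer, log

    def destroy(self):
        # Stands in for the hook-removal message seen in the traceback.
        self.printer(self.comm, "Deleted module hooks", self.log)


def run(printer):
    """Return (log, error) when destroy() runs after the group is destroyed."""
    comm, log = FakeComm(), []
    comm.init_distributed()
    engine = FakeEngine(comm, printer, log)
    comm.destroy_process_group()
    try:
        engine.destroy()  # the late teardown that __del__ triggers
    except AssertionError as exc:
        return log, str(exc)
    return log, None
```

Calling `run(print_rank_0)` reproduces the assertion, while `run(safe_print_rank_0)` completes without raising, at the cost of dropping the log message once the backend is gone.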

To Reproduce

Steps to reproduce the behavior:

  1. Run https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 with deepspeed --num_gpus=N openvla-like.py -c

Expected behavior

No exception is raised.

Labels

bug, training