Describe the bug
With a training script like the following:
import deepspeed
import deepspeed.comm as dist

def main(args):
    deepspeed.init_distributed()
    model = Model()
    ......
    model.destroy()
    dist.destroy_process_group()
The following exception is raised at the end of the training process if and only if DeepCompile is enabled:
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x7f241b4fe830>
Traceback (most recent call last):
  File "/mnt/engines/deepspeed/deepspeed/runtime/engine.py", line 519, in __del__
    self.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/engine.py", line 523, in destroy
    self.optimizer.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/stage3.py", line 468, in destroy
    self.parameter_offload.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/parameter_offload.py", line 227, in destroy
    self._remove_module_hooks()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/parameter_offload.py", line 241, in _remove_module_hooks
    print_rank_0(f'Deleted module hooks: forward = {num_forward_hooks}, backward = {num_backward_hooks}',
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 113, in print_rank_0
    rank = dist.get_rank()
  File "/mnt/engines/deepspeed/deepspeed/comm/comm.py", line 720, in get_rank
    assert cdb is not None and cdb.is_initialized(
AssertionError: DeepSpeed backend not set, please initialize it using init_process_group()
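The traceback suggests an ordering problem: destroy() runs again from __del__ after dist.destroy_process_group() has already torn down the backend, and the rank-aware log call then fails. The sketch below imitates that pattern in plain Python with no DeepSpeed dependency. FakeComm, ToyEngine, and run_once are hypothetical names made up for illustration, and the sketch does not attempt to explain why only DeepCompile triggers the second cleanup.

```python
class FakeComm:
    """Hypothetical stand-in for a communication backend (not DeepSpeed's API)."""

    def __init__(self):
        self.initialized = False

    def init(self):
        self.initialized = True

    def destroy(self):
        self.initialized = False

    def get_rank(self):
        # Mirrors the shape of the assertion seen in the traceback.
        assert self.initialized, "backend not set"
        return 0


class ToyEngine:
    """Hypothetical engine whose cleanup logs through a rank-aware helper."""

    def __init__(self, comm):
        self.comm = comm
        self.destroy_calls = 0

    def destroy(self):
        # Rank-aware logging requires a live backend.
        self.comm.get_rank()
        self.destroy_calls += 1


def run_once():
    comm = FakeComm()
    comm.init()
    engine = ToyEngine(comm)

    engine.destroy()   # explicit cleanup while the backend is alive: fine
    comm.destroy()     # analogous to dist.destroy_process_group()

    try:
        engine.destroy()   # stands in for the later call made from __del__
    except AssertionError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(run_once())
```

In this toy model the first destroy() succeeds and the second one fails, which matches the order of frames in the traceback above.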
To Reproduce
Steps to reproduce the behavior:
- Run https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 with:
  deepspeed --num_gpus=N openvla-like.py -c
Expected behavior
No exception is raised.
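One way to picture the expected behavior is an idempotent destroy(): once cleanup has run, a later call from __del__ returns without touching the communication backend. The following is a minimal sketch under that assumption, not DeepSpeed's actual fix. FakeComm, GuardedEngine, and demo are hypothetical names used only for illustration.

```python
class FakeComm:
    """Hypothetical stand-in for a communication backend (not DeepSpeed's API)."""

    def __init__(self):
        self.initialized = True

    def destroy(self):
        self.initialized = False

    def get_rank(self):
        assert self.initialized, "backend not set"
        return 0


class GuardedEngine:
    """Hypothetical engine whose destroy() is safe to call more than once."""

    def __init__(self, comm):
        self.comm = comm
        self._destroyed = False
        self.hook_removals = 0

    def destroy(self):
        if self._destroyed:
            # Cleanup already happened; skip the rank-aware logging entirely.
            return
        self._destroyed = True
        self.comm.get_rank()        # only reached while the backend is alive
        self.hook_removals += 1

    def __del__(self):
        self.destroy()


def demo():
    comm = FakeComm()
    engine = GuardedEngine(comm)
    engine.destroy()    # explicit cleanup, as in the training script
    comm.destroy()      # analogous to dist.destroy_process_group()
    engine.destroy()    # the late call is now a no-op
    return engine.hook_removals


if __name__ == "__main__":
    print(demo())
```

The guard only helps when the explicit destroy() happens before the backend is torn down, which is the order used in the script in this report.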