Warmup schedulers crash if t_warmup == max_duration #1077

@abhi-mosaic

Description

To reproduce

Steps to reproduce the behavior:

  1. Use a scheduler like cosine_decay_with_warmup and set t_warmup == max_duration == 100ba.
  2. Launch a training run. It runs until the last step and then crashes with a ZeroDivisionError:
Traceback (most recent call last):
  File "/root/composer/examples/run_composer_trainer.py", line 67, in <module>
    main()
  File "/root/composer/examples/run_composer_trainer.py", line 63, in main
    trainer.fit()
  File "/root/composer/composer/trainer/trainer.py", line 1289, in fit
    self._train_loop()
  File "/root/composer/composer/trainer/trainer.py", line 1497, in _train_loop
    scheduler.step()
  File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 154, in step
    values = self.get_lr()
  File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 252, in get_lr
    return [base_lr * lmbda(self.last_epoch)
  File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 252, in <listcomp>
    return [base_lr * lmbda(self.last_epoch)
  File "/root/composer/composer/optim/scheduler.py", line 190, in scheduler_fn
    return scheduler(state, ssr)
  File "/root/composer/composer/optim/scheduler.py", line 697, in __call__
    frac_of_total = ((current_time - t_warmup) / (t_max - t_warmup)).value
  File "/root/composer/composer/core/time.py", line 308, in __truediv__
    return Time(self.value / other.value, TimeUnit.DURATION)
ZeroDivisionError: division by zero
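
The arithmetic behind the failing line can be sketched without Composer. This is only an illustration using plain integers in place of Composer's `Time` objects; the function name `frac_of_total` and its integer arguments are assumptions chosen to mirror the expression in the traceback, not the library's actual code.

```python
def frac_of_total(current_step: int, t_warmup: int, t_max: int) -> float:
    # Same shape as the expression in the traceback:
    # (current_time - t_warmup) / (t_max - t_warmup)
    return (current_step - t_warmup) / (t_max - t_warmup)


# With a decay phase after warmup, the fraction is well defined.
print(frac_of_total(150, t_warmup=100, t_max=200))  # 0.5

# With t_warmup == t_max the denominator is zero.
try:
    frac_of_total(100, t_warmup=100, t_max=100)
except ZeroDivisionError as exc:
    print(f"ZeroDivisionError: {exc}")
```

Presumably the crash only appears at the last step because that is the first step where the current time is no longer inside the warmup phase, so the decay expression gets evaluated for the first time.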

Expected behavior

We should either allow this setting and simply not advance the schedule past the warmup phase, or catch this edge case and raise a ValueError in the scheduler's __init__.
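
A rough sketch of both options follows, again with plain integer steps rather than Composer's `Time` type. The names `validate_warmup` and `lr_multiplier` are hypothetical, the linear warmup and cosine decay are simplified stand-ins, and holding the multiplier at 1.0 when there is no decay phase is one possible choice, not a confirmed design.

```python
import math


def validate_warmup(t_warmup: int, t_max: int) -> None:
    """Option 2 (hypothetical helper): reject the edge case up front."""
    if t_warmup == t_max:
        raise ValueError(
            f"t_warmup ({t_warmup}) must differ from max_duration ({t_max}); "
            "there would be no decay phase after warmup."
        )


def lr_multiplier(step: int, t_warmup: int, t_max: int) -> float:
    """Option 1 (hypothetical helper): tolerate t_warmup == t_max."""
    if step < t_warmup:
        # Linear warmup from 0 toward 1.
        return step / t_warmup
    if t_max <= t_warmup:
        # No decay phase exists, so hold the peak value instead of dividing by zero.
        return 1.0
    # Cosine decay over the remaining steps, clamped at the end of training.
    frac = min(1.0, (step - t_warmup) / (t_max - t_warmup))
    return 0.5 * (1.0 + math.cos(math.pi * frac))


print(lr_multiplier(50, t_warmup=100, t_max=100))   # 0.5 (halfway through warmup)
print(lr_multiplier(100, t_warmup=100, t_max=100))  # 1.0 (no crash at the last step)
```

The first option keeps existing configs running, while the second surfaces the problem before any training time is spent.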

Labels: bug (Something isn't working)
