Closed
Labels
bug (Something isn't working)
Description
To reproduce
Steps to reproduce the behavior:
- Use a scheduler like cosine_decay_with_warmup and set t_warmup == max_duration == 100ba.
- Launch a training run. It will run until the last step and then crash with a ZeroDivisionError:
Traceback (most recent call last):
  File "/root/composer/examples/run_composer_trainer.py", line 67, in <module>
    main()
  File "/root/composer/examples/run_composer_trainer.py", line 63, in main
    trainer.fit()
  File "/root/composer/composer/trainer/trainer.py", line 1289, in fit
    self._train_loop()
  File "/root/composer/composer/trainer/trainer.py", line 1497, in _train_loop
    scheduler.step()
  File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 154, in step
    values = self.get_lr()
  File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 252, in get_lr
    return [base_lr * lmbda(self.last_epoch)
  File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 252, in <listcomp>
    return [base_lr * lmbda(self.last_epoch)
  File "/root/composer/composer/optim/scheduler.py", line 190, in scheduler_fn
    return scheduler(state, ssr)
  File "/root/composer/composer/optim/scheduler.py", line 697, in __call__
    frac_of_total = ((current_time - t_warmup) / (t_max - t_warmup)).value
  File "/root/composer/composer/core/time.py", line 308, in __truediv__
    return Time(self.value / other.value, TimeUnit.DURATION)
ZeroDivisionError: division by zero
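To make the failure concrete, here is a minimal sketch of the arithmetic. It assumes the scheduler takes a linear warmup branch while current_time < t_warmup and only afterwards evaluates the decay expression shown in the traceback. The functions cosine_with_warmup and first_failing_step are hypothetical stand-ins that use plain integers as batch counts. They are not Composer's code.

```python
import math
from typing import Optional


def cosine_with_warmup(step: int, t_warmup: int, t_max: int, alpha_f: float = 0.0) -> float:
    """Simplified stand-in for the LR multiplier (assumed structure, not Composer's code)."""
    if step < t_warmup:
        # Linear warmup from 0 towards 1.
        return step / t_warmup
    # Same shape as the failing line: the denominator is zero when t_warmup == t_max.
    frac_of_total = (step - t_warmup) / (t_max - t_warmup)
    frac_of_total = min(1.0, frac_of_total)
    cosine = 0.5 * (1 + math.cos(math.pi * frac_of_total))
    return alpha_f + (1 - alpha_f) * cosine


def first_failing_step(t_warmup: int, t_max: int) -> Optional[int]:
    """Return the first step that raises ZeroDivisionError, or None if none does."""
    for step in range(t_max + 1):
        try:
            cosine_with_warmup(step, t_warmup, t_max)
        except ZeroDivisionError:
            return step
    return None
```

Under these assumptions, every step below t_warmup stays in the warmup branch, so the division by zero only appears once the step counter reaches t_warmup, which here is also the last step. That would match the run crashing only at the very end.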
Expected behavior
We should either allow this setting and simply not advance the schedule past the warmup, or catch this edge case and raise a ValueError in the scheduler's __init__.
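The two options could look roughly like the sketch below. It uses plain integer batch counts rather than Composer's Time objects, and validate_warmup and safe_cosine_with_warmup are hypothetical names invented for illustration, not existing Composer functions. Rejecting t_warmup > t_max as well as t_warmup == t_max is an added assumption.

```python
import math


def validate_warmup(t_warmup: int, t_max: int) -> None:
    """Option 2: reject the edge case up front (hypothetical helper)."""
    if t_warmup >= t_max:
        raise ValueError(
            f"t_warmup ({t_warmup}) must be smaller than max_duration ({t_max}); "
            "otherwise there is no decay phase to schedule."
        )


def safe_cosine_with_warmup(step: int, t_warmup: int, t_max: int, alpha_f: float = 0.0) -> float:
    """Option 1: allow the setting and never advance past the warmup (hypothetical)."""
    if step < t_warmup:
        # Linear warmup from 0 towards 1.
        return step / t_warmup
    decay_span = t_max - t_warmup
    if decay_span <= 0:
        # No decay phase exists, so hold the multiplier at its post-warmup peak.
        return 1.0
    frac_of_total = min(1.0, (step - t_warmup) / decay_span)
    cosine = 0.5 * (1 + math.cos(math.pi * frac_of_total))
    return alpha_f + (1 - alpha_f) * cosine
```

The guard keeps existing configurations running, while the up-front check surfaces the misconfiguration before any training time is spent. Which one is preferable depends on whether a warmup-only run is considered a valid use case.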