Fix the Speed Monitor #1123
The speed monitor sometimes returns negative times.

* Fixed by re-implementing the speed monitor to use the wall clock tracking built into `state.timestamp` and `state.eval_timestamp`. Using these variables ensures that all ranks have consistent timing information and simplifies the speed monitor implementation, as the timestamp variables are already in duration units rather than wall clock timestamps.
* Added a validation dataloader and checks to ensure no negative values in the test.
* Logging wall clock time on every batch, rather than every epoch, to support NLP and single-epoch jobs.
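For readers skimming the thread, here is a minimal sketch of the idea, not the code in this PR: throughput is derived from accumulated durations rather than from differences between raw wall clock readings, and a non-positive elapsed time yields no value rather than a negative one. `Snapshot` and `ThroughputWindow` are hypothetical names for illustration, standing in for whatever timing fields the real state object exposes.

```python
from collections import deque
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for the timing fields of a training state."""
    samples: int           # total samples processed so far
    total_wct: timedelta   # accumulated training duration, not a clock reading


class ThroughputWindow:
    """Rolling samples-per-second estimate over the last `window_size` batches."""

    def __init__(self, window_size: int = 100) -> None:
        # One extra slot so the window spans `window_size` batch intervals.
        self._history = deque(maxlen=window_size + 1)

    def update(self, snap: Snapshot) -> Optional[float]:
        """Record the latest snapshot; return throughput, or None if undefined."""
        self._history.append(snap)
        if len(self._history) < 2:
            return None
        oldest, newest = self._history[0], self._history[-1]
        elapsed = (newest.total_wct - oldest.total_wct).total_seconds()
        if elapsed <= 0:
            # Skip the measurement instead of reporting a negative or infinite rate.
            return None
        return (newest.samples - oldest.samples) / elapsed
```

Calling `update` once per batch matches the per-batch logging described above; the window size of 100 is an arbitrary default chosen for this sketch.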
@ravi-mosaicml Good stuff! Do you have an example WandB report of what the speed monitor looks like now? It'll help evaluate the PR pretty easily -- the code looks good to me, but I want to check for downstream effects.
Forgot to approve; most comments above are about improving code readability.
@ravi-mosaicml Glad I checked for downstream effects -- I get the following error when testing this PR:
…poser into fix_speed_monitor
Much better, it works! Bellissimo, I like this behavior a lot better too.
The speed monitor sometimes returns negative times (example: https://wandb.ai/mosaic-ml/regression-v0.7.0-RC1/table?workspace=user-ravimosaicml)