Auto resumption #1615
Conversation
Thanks for fixing this!! Approving to unblock; left some really minor nits for conventions we've used in the repo. Actually, I'm not sure we're consistent with these... but it'd be nice to be 🤷
Discussed offline -- let's actually hold this PR until we can test multi-node instead of doing it in two parts
I'll let Hanlin approve since it'd be nice to get two pairs of eyes on this. LGTM tho. Amazing job!
LGTM, one suggestion to clean up the code a bit
@hanlint I changed the |
Approving to unblock!
What does this PR do?
Add support for autoresume with distributed training.
What issue(s) does this change relate to?
Fixes CO-1238
While fixing this issue, we discovered a related race condition, tracked as CO-1270
Manual testing
Manually tested with a script that creates a trainer, trains it, creates another trainer with autoresume enabled, and verifies that the params and run name are the same. This was tested on multiple GPUs on one node.
Multi-node does not currently work, but I also can't really test/debug it because of limited capacity on our cluster.
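The manual test above can be sketched roughly as follows. This is a hypothetical stand-in, not the actual test script or the Composer `Trainer` API: `ToyTrainer`, its `autoresume` flag, and the checkpoint format here are all illustrative. The idea it demonstrates is the one being tested: a second trainer constructed with autoresume should restore the first trainer's params and run name from the latest checkpoint instead of reinitializing.

```python
# Toy sketch of the autoresume check (hypothetical API, not Composer's).
import json
import random
import tempfile
from pathlib import Path


class ToyTrainer:
    def __init__(self, save_folder: str, autoresume: bool = False):
        self.ckpt = Path(save_folder) / "latest.json"
        if autoresume and self.ckpt.exists():
            # Resume: restore params and run name from the checkpoint.
            state = json.loads(self.ckpt.read_text())
            self.run_name = state["run_name"]
            self.params = state["params"]
        else:
            # Fresh run: random init and a new run name.
            self.run_name = f"run-{random.randrange(10**6)}"
            self.params = [random.random() for _ in range(4)]

    def fit(self):
        # "Training" perturbs the params, then writes a checkpoint.
        self.params = [p + 0.1 for p in self.params]
        self.ckpt.write_text(
            json.dumps({"run_name": self.run_name, "params": self.params})
        )


with tempfile.TemporaryDirectory() as folder:
    first = ToyTrainer(folder)
    first.fit()
    # Second trainer with autoresume picks up where the first left off.
    resumed = ToyTrainer(folder, autoresume=True)
    assert resumed.run_name == first.run_name
    assert resumed.params == first.params
```

In the distributed case the same check must hold per rank, which is where the multi-node behavior still needs verification.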
Before submitting
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)