
Conversation

@dakinggg (Contributor) commented Oct 12, 2022

What does this PR do?

Add support for autoresume with distributed training.
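As a rough illustration (not code from this PR; argument names like save_folder and save_interval are taken from the Composer Trainer API as I understand it, and defaults may differ), an autoresumable run looks something like this:

```python
# Sketch only: enabling autoresume on a training run. `model` and
# `train_dataloader` are placeholders for a ComposerModel and a DataLoader.
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    run_name='my-run',                 # must be the same across restarts
    save_folder='checkpoints/my-run',  # checkpoints written here each interval
    save_interval='1ep',
    autoresume=True,                   # on restart, load the latest checkpoint
)
trainer.fit()
```

Launched across ranks with the Composer launcher (e.g. `composer -n 8 train.py`), each rank should resolve the same latest checkpoint, which is the distributed case this PR adds support for.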

What issue(s) does this change relate to?

Fixes CO-1238.
While fixing this issue, we discovered a related race condition, tracked as CO-1270.

Manual testing

Manually tested a script that makes a trainer, trains it, makes another trainer with autoresume, and verifies that the params and run name are the same. This was tested on multiple GPUs on one node.

Multi-node does not currently work, but I also can't really test or debug it because of limited capacity on our cluster.
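The manually tested script was roughly of the following shape (a hedged sketch, not the actual script; `make_trainer` and its internals are hypothetical placeholders):

```python
# Hypothetical sketch of the manual test: train once, then rebuild the
# trainer with autoresume=True and check that the run name and parameters match.
import torch

def make_trainer(autoresume: bool):
    # Build the model, dataloader, and Trainer the same way both times
    # (placeholder -- the real construction lives in the test script).
    ...

trainer1 = make_trainer(autoresume=False)
trainer1.fit()

trainer2 = make_trainer(autoresume=True)  # should pick up the saved run

# The run name should be recovered from the previous run's checkpoint/metadata
assert trainer2.state.run_name == trainer1.state.run_name

# The parameters should match the last checkpoint that trainer1 saved
for p1, p2 in zip(trainer1.state.model.parameters(),
                  trainer2.state.model.parameters()):
    assert torch.equal(p1.cpu(), p2.cpu())
```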

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg marked this pull request as ready for review October 13, 2022 16:47
@dakinggg requested a review from hanlint October 13, 2022 16:47
@mvpatel2000 (Contributor) previously approved these changes Oct 13, 2022

Thanks for fixing this!! Approving to unblock; left some really minor nits for conventions we've used in the repo. Actually, I'm not sure we're consistent with these... but it'd be nice to do so 🤷

@mvpatel2000 dismissed their stale review October 13, 2022 17:30

Discussed offline -- let's actually hold this PR until we can test multi-node, instead of doing it in two parts.

@mvpatel2000 (Contributor) left a comment

I'll let Hanlin approve since it'd be nice to get two pairs of eyes on this. LGTM tho. Amazing job!

@hanlint (Contributor) left a comment

LGTM, one suggestion to clean up the code a bit

@dakinggg requested a review from eracah as a code owner October 15, 2022 00:16
@dakinggg requested a review from hanlint October 15, 2022 00:40
@dakinggg (Contributor, Author) commented Oct 15, 2022

@hanlint I changed the num_concurrent_uploads default to 1. I experimented a bit, and at least on the demo example, if you checkpoint often enough to want multiple upload workers, you end up bottlenecked on network communication with S3 (streaming data in and sending checkpoints out). With one worker, you should only end up with an invalid symlink (pointing to the checkpoint before the actual latest one) if the job dies between finishing the checkpoint upload and uploading the symlink, which is a tiny window. That would be quite unlikely, and even this issue can be resolved by setting save_overwrite=True when using autoresume.
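For reference, the settings being discussed map onto configuration roughly like this (a sketch; num_concurrent_uploads and save_overwrite come from the comment above, while the ObjectStoreLogger / S3ObjectStore names and import paths are assumptions about the Composer API of the time and may differ):

```python
# Sketch of the discussed configuration; class names and import paths are
# assumptions and may not match the actual API exactly.
from composer import Trainer
from composer.loggers import ObjectStoreLogger
from composer.utils.object_store import S3ObjectStore

remote_uploader = ObjectStoreLogger(
    object_store_cls=S3ObjectStore,
    object_store_kwargs={'bucket': 'my-bucket'},
    num_concurrent_uploads=1,  # new default: a single upload worker
)

trainer = Trainer(
    # ... model, dataloaders, and other arguments elided ...
    loggers=[remote_uploader],
    autoresume=True,
    save_overwrite=True,  # per the comment above, resolves the stale-symlink
                          # case when using autoresume
)
```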

@dakinggg enabled auto-merge (squash) October 15, 2022 01:04
@eracah (Contributor) left a comment

Approving to unblock!

@dakinggg merged commit 07dc317 into mosaicml:dev Oct 15, 2022
@dakinggg deleted the auto_resumption branch October 20, 2022 18:28