Add the ability to load a checkpoint without restoring state #169

moinnadeem · 2021-12-16T20:33:47Z

Motivation

This pull request adds the ability to load a checkpoint without loading the associated state. It does so via the following YAML changes:

Introduces a CheckpointLoaderHparams object, with three fields: checkpoint_filepath, load_weights_only, and strict.
If load_weights_only = False, then nothing changes and the previous codepaths are used. The strict value isn't considered if load_weights_only = False, since restoring a checkpoint with state should ensure that the model exactly matches up. YAML validation ensures that the strict value cannot be set without load_weights_only = True.
If load_weights_only = True, then it loads the checkpoint and avoids recovering the state via a new codepath. If strict = False, it also prints the keys that did not match up for user safety.

The PR creates the CheckpointLoader object when the Trainer is created via create_from_hparams, and passes the CheckpointLoader into the Trainer.
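As a rough illustration of the validation rule described above (strict may only be set together with load_weights_only = True), here is a minimal sketch. The dataclass below is a hypothetical stand-in, not the actual composer class: the field names come from the description, but the defaults, the use of None to mean "not set", and the validate method are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointLoaderHparams:
    """Illustrative stand-in; the real hparams class in composer may differ."""

    checkpoint_filepath: str
    load_weights_only: bool = False
    # Assumption: None means "not set by the user", which lets validate()
    # distinguish an unset value from an explicit False.
    strict: Optional[bool] = None

    def validate(self) -> None:
        # strict is only meaningful on the weights-only codepath, so reject
        # configurations that set it while restoring full state.
        if self.strict is not None and not self.load_weights_only:
            raise ValueError(
                "strict can only be set when load_weights_only is True")


# A weights-only load with non-strict key matching passes validation.
CheckpointLoaderHparams("ckpt.pt", load_weights_only=True, strict=False).validate()
```

Under this sketch, a config that sets strict while leaving load_weights_only at False fails at validation time instead of being silently ignored.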

Discussion Points

Are we happy with the YAML API change?
Should we add any tests for these new codepaths?

composer/trainer/checkpoint_hparams.py

ravi-mosaicml · 2021-12-16T20:41:25Z

composer/trainer/checkpoint_hparams.py

+            return CheckpointLoader(checkpoint_filepath=self.checkpoint_filepath,
+                                    load_weights_only=self.load_weights_only,
+                                    strict=self.strict)
+        return None


This return None is probably causing a pyright bug.

Taken care of!

composer/trainer/checkpoint.py

ravi-mosaicml

Looks pretty good overall, just some minor changes and comments.

composer/trainer/trainer.py

moinnadeem · 2021-12-18T21:55:01Z

Cool, addressed all feedback!

Re: the PyRight issues, they're not problems in practice. Should we add a manual ignore? In more detail:

 /home/runner/work/composer/composer/composer/trainer/checkpoint.py:47:60 - error: Argument of type "bool | None" cannot be assigned to parameter "strict" of type "bool" in function "load_model_state"
    Type "bool | None" cannot be assigned to type "bool"
      Type "None" cannot be assigned to type "bool" (reportGeneralTypeIssues)

The argument can't be None, because the method signature enforces a default. It seems as if PyRight isn't catching onto this?

/home/runner/work/composer/composer/composer/trainer/trainer.py
  /home/runner/work/composer/composer/composer/trainer/trainer.py:304:52 - error: "load_checkpoint" is not a known member of "None" (reportOptionalMemberAccess)

Before this line, we check if the checkpoint_loader is not None, so load_checkpoint can never be run on an object of NoneType. Should we manually ignore this?

ravi-mosaicml · 2021-12-21T01:18:14Z

For the pyright issues where it's complaining about optional variables, you need to add one of these options before the line it's complaining about:

assert x is not None (if it's an invariant violation)
if x is None: raise ValueError(f"x is None, but it shouldn't be because ...") (if it's a user error)
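The two options above can be sketched as follows. The CheckpointLoader class here is a hypothetical stand-in used only to show how each check narrows an Optional value for the type checker; it is not the class from this PR.

```python
from typing import Optional


class CheckpointLoader:
    """Hypothetical stand-in used only to illustrate type narrowing."""

    def load_checkpoint(self) -> str:
        return "loaded"


def load_when_invariant(loader: Optional[CheckpointLoader]) -> str:
    # Invariant violation: internal code should never reach this point
    # without a loader, so an assert documents and enforces that.
    assert loader is not None, "loader must be constructed before loading"
    # After the assert, the type checker treats loader as CheckpointLoader.
    return loader.load_checkpoint()


def load_when_user_error(loader: Optional[CheckpointLoader]) -> str:
    # User error: a missing loader comes from bad configuration, so raise
    # an exception with an actionable message instead of asserting.
    if loader is None:
        raise ValueError(
            "loader is None, but it shouldn't be because a checkpoint was requested")
    # The early raise narrows loader to CheckpointLoader on this path.
    return loader.load_checkpoint()
```

Both forms remove the "is not a known member of None" style of error without a manual ignore. The difference is the failure mode: asserts can be stripped when Python runs with -O, so the ValueError form is the safer choice for anything a user can trigger.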

ravi-mosaicml

Code path looks great! Please add a test to verify it. Thinking something like this:

Train, save checkpoint
Load checkpoint with a different optimizer and scheduler with weights_only=True
Assert that the weights are the same as the first trainer, but that the optimizer is the new one
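The shape of such a test can be sketched with a toy stand-in. ToyTrainer and its methods are hypothetical and only mimic the save/load behavior described in this PR; the real test would use the composer Trainer and its hparams.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyTrainer:
    """Hypothetical stand-in: model weights plus some optimizer state."""

    optimizer_name: str
    weights: Dict[str, float] = field(default_factory=lambda: {"w": 0.0})
    step: int = 0

    def fit(self, steps: int) -> None:
        for _ in range(steps):
            self.weights["w"] += 0.5  # pretend parameter update
            self.step += 1

    def save_checkpoint(self) -> dict:
        return copy.deepcopy({
            "weights": self.weights,
            "optimizer_name": self.optimizer_name,
            "step": self.step,
        })

    def load_checkpoint(self, ckpt: dict, load_weights_only: bool) -> None:
        self.weights = copy.deepcopy(ckpt["weights"])
        if not load_weights_only:
            # A full restore also brings back optimizer and progress state.
            self.optimizer_name = ckpt["optimizer_name"]
            self.step = ckpt["step"]


def test_load_weights_only() -> None:
    # 1. Train and save a checkpoint.
    first = ToyTrainer(optimizer_name="sgd")
    first.fit(steps=3)
    ckpt = first.save_checkpoint()

    # 2. Load it into a trainer configured with a different optimizer.
    second = ToyTrainer(optimizer_name="adam")
    second.load_checkpoint(ckpt, load_weights_only=True)

    # 3. Weights match the first trainer, but the optimizer is the new one.
    assert second.weights == first.weights
    assert second.optimizer_name == "adam"
```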

composer/trainer/checkpoint.py

ravi-mosaicml · 2021-12-21T17:51:06Z

composer/trainer/checkpoint.py

+    def __init__(self,
+                 checkpoint_filepath: str,
+                 load_weights_only: Optional[bool] = False,
+                 strict: Optional[bool] = False):


Suggested change

-                 strict: Optional[bool] = False):
+                 strict: bool = False):

composer/trainer/checkpoint_hparams.py

composer/trainer/trainer.py

moinnadeem · 2022-01-03T11:10:40Z

Cool, I've addressed all feedback and added the tests that Ravi requested. Hanlin also requested that, instead of passing the Checkpoint{Loader, Saver} object to the Trainer, I pass the hparams directly to make it more BYOT friendly. I agree there, so that has also been reflected.

I've addressed all PyRight issues on my changed files, but it seems as if PyRight is complaining about a few extra files that are outside of the scope of this PR. Namely, composer/algorithms/augmix/augmix.py, composer/algorithms/randaugment/randaugment.py, and composer/datasets/brats.py. What should we do about these?

ravi-mosaicml · 2022-01-03T15:04:36Z

Can you merge in the latest from dev and see if that fixes those files?

moinnadeem · 2022-01-03T15:22:34Z

@ravi-mosaicml Just did -- didn't help for some reason. Any clue why this is happening?

ravi-mosaicml · 2022-01-03T15:24:18Z

Probably a pyright update...can you fix what it's complaining about in those files if they're small changes?


ravi-mosaicml

LGTM

composer/algorithms/alibi/alibi.py

composer/core/state.py

composer/datasets/hparams.py

composer/optim/pytorch_future.py

composer/trainer/trainer.py

…l#169)

* fixing checkpoint bug
* finalizing fine-tuning a checkpointed model
* address PR feedback
* adding save_checkpoint and load_checkpoint hparams interface
* yapf & pyright
* changing interface
* everyone always asks 'what is yapf', but never 'how is yapf'?
* renaming Checkpointer -> CheckpointSaver
* renaming Checkpointer -> CheckpointSaver
* addressing feedback & friendly renaming
* addressing pyright
* yapf
* adding tests
* moving commits to BERT branch
* changing folder to be relative to run dir
* adding tests
* pyright part 1
* pyright on trainer file
* moving restoring RNG & random seed to else clause
* Fix tests
* Addressed comments

Co-authored-by: Moin Nadeem <[email protected]>
Co-authored-by: Ravi Rahman <[email protected]>

jbloxham · 2022-03-02T22:22:30Z

tests/trainer/test_checkpoint.py

+    # setup a new LR scheduler
+    scheduler_options = [ConstantLRHparams(), CosineAnnealingLRHparams(T_max=f"{second_trainer_hparams.max_epochs}ep")]
+    second_trainer_hparams.schedulers = [random.choice(scheduler_options)]


@moinnadeem just ran into this now - it's very strange to use randomness in a test in this way since it can potentially cause flaky tests. Was there a reason you wanted this randomness here?

Ah, I see the rationale. I just wanted to make sure that we test several schedulers, but didn't think it was worth the time to test all of them. In hindsight, we should either pick the more difficult one, or do both. I agree with you.

I'm modifying this line in a PR for a different purpose anyways, so I'll fix it here.

Thanks Jamie!
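One deterministic way to "do both", sketched with only the standard library: iterate over every option inside the test instead of calling random.choice. The scheduler names and the build_schedule helper below are hypothetical stand-ins for ConstantLRHparams and CosineAnnealingLRHparams, not composer code.

```python
import math
import unittest
from typing import List

# Hypothetical names standing in for the two scheduler hparams in the test.
SCHEDULER_OPTIONS = ["constant", "cosine_annealing"]


def build_schedule(name: str, max_epochs: int) -> List[float]:
    """Toy per-epoch learning-rate multipliers for each scheduler name."""
    if name == "constant":
        return [1.0] * max_epochs
    if name == "cosine_annealing":
        return [0.5 * (1 + math.cos(math.pi * epoch / max_epochs))
                for epoch in range(max_epochs)]
    raise ValueError(f"unknown scheduler: {name}")


class TestSchedulers(unittest.TestCase):

    def test_every_scheduler(self) -> None:
        # Every option runs on every invocation, so a failure is reproducible
        # and the report names the scheduler that broke.
        for name in SCHEDULER_OPTIONS:
            with self.subTest(scheduler=name):
                schedule = build_schedule(name, max_epochs=4)
                self.assertEqual(len(schedule), 4)
                self.assertEqual(schedule[0], 1.0)
```

In a pytest-based suite, pytest.mark.parametrize over the same list of options would achieve the same effect, with one reported test case per scheduler.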

moinnadeem requested a review from ravi-mosaicml December 16, 2021 20:33

ravi-mosaicml reviewed Dec 16, 2021

composer/trainer/checkpoint_hparams.py Outdated

ravi-mosaicml reviewed Dec 16, 2021

composer/trainer/checkpoint.py Outdated

ravi-mosaicml reviewed Dec 16, 2021

composer/trainer/checkpoint.py Outdated

ravi-mosaicml reviewed Dec 16, 2021

hanlint reviewed Dec 16, 2021

composer/trainer/trainer.py Outdated

ravi-mosaicml reviewed Dec 21, 2021

Moin Nadeem and others added 18 commits January 3, 2022 15:10

fixing checkpoint bug

c91d43c

finalizing fine-tuning a checkpointed model

7515cd1

address PR feedback

984c036

adding save_checkpoint and load_checkpoint hparams interface

8c44109

yapf & pyright

0897838

changing interface

4d5e02c

everyone always asks 'what is yapf', but never 'how is yapf'?

70aa099

renaming Checkpointer -> CheckpointSaver

c2fe018

renaming Checkpointer -> CheckpointSaver

56bce8c

addressing feedback & friendly renaming

44bb59d

addressing pyright

06b0292

yapf

5416b41

adding tests

0668582

moving commits to BERT branch

a94e238

changing folder to be relative to run dir

de064a8

adding tests

1ad6202

pyright part 1

b6780eb

pyright on trainer file

6f71a71

moinnadeem force-pushed the moin/finetune_checkpoints branch from 15c5695 to 6f71a71 Compare January 3, 2022 15:15

Moin Nadeem and others added 4 commits January 3, 2022 21:44

moving restoring RNG & random seed to else clause

9b0d764

Merge branch 'dev' into moin/finetune_checkpoints

cbafb18

Merge branch 'dev' into moin/finetune_checkpoints

9098c41

Fix tests

3d7478b

ravi-mosaicml approved these changes Jan 3, 2022

Addressed comments

bab0a07

moinnadeem merged commit 2b25192 into dev Jan 3, 2022

moinnadeem deleted the moin/finetune_checkpoints branch January 3, 2022 23:31

This was referenced Jan 3, 2022

Handle Miscellaneous / Extra Keys in load_state_dict #158

Closed

Load a checkpoint without loading its associated state #159

Closed

jbloxham reviewed Mar 2, 2022
