@ravirahman ravirahman commented Jan 5, 2022

Added support to load checkpoints stored in object storage (rather than just on the local disk) and from URLs. Closes #192.

  • Refactored the run directory uploader, moving the object store utilities into composer.utils.object_store (and added test coverage).
  • Updated the checkpoint loader and its hparams to support checkpoints stored at URLs and in object stores. The download chunk size and whether to show tqdm progress bars are both hparams.
  • Refactored how DDP checkpoints are loaded and stored to avoid calling os.listdir and writing temporary files to the run directory.
  • Updated Trainer.__init__ to propagate this change.
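
For readers unfamiliar with chunked downloads, the URL-loading behaviour described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code added in this PR: the function name `download_checkpoint`, its parameters, and the `on_progress` callback are all hypothetical stand-ins for the chunk-size and progress-bar hparams.

```python
import urllib.request
from typing import Callable, Optional


def download_checkpoint(
    url: str,
    destination: str,
    chunk_size: int = 1 << 20,  # hypothetical default of 1 MiB
    on_progress: Optional[Callable[[int], None]] = None,
) -> int:
    """Stream `url` to `destination` in fixed-size chunks.

    Hypothetical helper for illustration only; it is not Composer's API.
    Returns the total number of bytes written.
    """
    total = 0
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        while True:
            # Read at most `chunk_size` bytes so the whole checkpoint
            # never has to fit in memory at once.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
            if on_progress is not None:
                # Report how many bytes this chunk added.
                on_progress(len(chunk))
    return total
```

If tqdm is installed, passing a progress bar's `update` method as `on_progress` is one way to get the optional progress display; leaving the callback as `None` corresponds to turning the bar off.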

Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes #192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which is used to download the checkpoint from storage.
- Updated the trainer init to propagate this change.

@ravi-mosaicml

Going to re-open this PR once #199 is merged in and this PR is rebased on top of it.

@ravi-mosaicml ravi-mosaicml reopened this Jan 7, 2022
@moinnadeem

I've tested this and it works! Great job Ravi!

@ravi-mosaicml ravi-mosaicml merged commit 2dd7a05 into dev Jan 7, 2022
@ravi-mosaicml ravi-mosaicml deleted the ravi/load_checkpoints_from_cloud_storage branch January 7, 2022 20:37
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Added support to load checkpoints stored in object storage (rather than just on the local disk) and from URLs. Closes mosaicml#192.

- Refactored the run directory uploader to separate out object store related utilities to `composer.utils.object_store` (and added test coverage).
- Updated the checkpoint loader and hparams to support checkpoints stored at URLs and in object stores. Chunk sizing and whether to show tqdm bars are hparams.
- Refactored how DDP checkpoints are loaded and stored to avoid an `os.listdir` and writing temporary files to the run directory.
- Updated the `Trainer.__init__` to propagate through this change.
Development

Successfully merging this pull request may close these issues.

Enable loading a checkpoint from a blob storage
3 participants