Conversation

moinnadeem
Contributor

@moinnadeem moinnadeem commented Dec 29, 2021

Motivation

Composer currently contains autoregressive language models, but masked language models are just as important a target for speeding up training. This PR adds the first masked language model (BERT) to Composer.

Merging Criteria

  • Demonstrate that BERT achieves acceptable downstream GLUE performance on all benchmarks
  • Demonstrate average GLUE performance on the validation and test sets
  • Code quality should be up to par, with well-commented abstractions
  • New TorchMetrics should include tests
  • New YAML files should be clean and contain no unnecessary hparams
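The TorchMetrics criterion above can be illustrated with a minimal sketch. This is not Composer's actual metric class (the names and signatures here are hypothetical); it only mirrors the torchmetrics `update()`/`compute()` accumulation contract that the PR's per-token loss metric follows, without the torch dependency:

```python
# Hypothetical sketch of a running-average language-modeling loss metric,
# mirroring the torchmetrics update()/compute() contract. Names are
# illustrative, not Composer's actual API.

class MeanCrossEntropy:
    """Accumulates summed loss and token counts across batches."""

    def __init__(self) -> None:
        self.sum_loss = 0.0
        self.total_items = 0

    def update(self, batch_loss_sum: float, num_items: int) -> None:
        # batch_loss_sum is the *summed* (not averaged) loss over num_items tokens
        self.sum_loss += batch_loss_sum
        self.total_items += num_items

    def compute(self) -> float:
        if self.total_items == 0:
            raise ValueError("compute() called before any update()")
        return self.sum_loss / self.total_items
```

A test for such a metric only needs to check that averaging happens over the total token count, not per batch: two batches of summed loss 6.0 over 3 tokens and 4.0 over 2 tokens should yield 10.0 / 5 = 2.0.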

ravi-mosaicml and others added 29 commits January 4, 2022 15:33
Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes #192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which would be used to download the checkpoint from storage.
- Updated the trainer init to propagate through this change.
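The dispatch this commit describes can be sketched as follows. This is an illustrative stand-in, not Composer's actual code: `resolve_checkpoint` and the injected `download` callable are hypothetical names, with `download` playing the role of the object-store provider (e.g. a libcloud download call); when no provider is configured the checkpoint path is used as a local file directly:

```python
# Illustrative sketch (not Composer's implementation) of loading a checkpoint
# either from local disk or from object storage via a provider callable.
import os
import tempfile
from typing import Callable, Optional


def resolve_checkpoint(
    checkpoint: str,
    download: Optional[Callable[[str, str], None]] = None,
) -> str:
    """Return a local filesystem path for `checkpoint`.

    `download(remote_name, local_path)` stands in for the object-store
    provider; when it is None, `checkpoint` is assumed to already be a
    path on local disk and is returned unchanged.
    """
    if download is None:
        return checkpoint
    # Download the remote object into a fresh temporary directory first.
    local_path = os.path.join(tempfile.mkdtemp(), os.path.basename(checkpoint))
    download(checkpoint, local_path)
    return local_path
```

Keeping the provider behind a callable (or hparams object, as in the PR) means the trainer init only ever sees a local path, which is why the change could propagate through with minimal trainer-side modification.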
```python
return self.sum_loss / self.total_items  # type: ignore (third-party)


# TODO (Moin): write tests for this!
```
Contributor

Ah last thing (I think), could you remove these TODOs?

Contributor Author

ooh good catch! Let me actually grep for TODOs overall

Contributor Author

Done!

@Landanjs
Contributor

Mostly reviewed new changes to metrics. LGTM!

@moinnadeem moinnadeem merged commit 0d6b3af into dev Jan 11, 2022
@moinnadeem moinnadeem deleted the moin/bert branch January 11, 2022 00:13
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
* Load Checkpoints from Cloud Storage

Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes mosaicml#192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which would be used to download the checkpoint from storage.
- Updated the trainer init to propagate through this change.

* Libcloud intersphinx

* rebasing off of dev

* starting an LR sweep

* adding proper dataset and batch size

* 2.0e-3 LR causes NaNs, lowering lr

* changing adam

* adding SST-2

* adding validation tracking

* adding SST-2 -- training but not at the right accuracy

* cleaning up code & debugging why training loss is so large

* finalized YAML for SST-2, starting hparam sweeps

* updating hparams to sweep:

* finalized current setup for SST-2

* starting hparam sweeps on RTE

* adding support for warmup_ratio

* adding non-standard metrics

* adding support for duration as a time abstraction

* adding compatibility with DataloaderSpec changes

* adding a linear learning rate decay

* adding linear LR warmup

* finalizing GLUE

* refactoring implementation to add regression tasks

* fixing checkpoint bug

* finalizing fine-tuning a checkpointed model

* fixing checkpoint bug

* adding validation

* adding mid-training

* starting LR sweep

* adding checkpointing feedback part 1

* fix validation interval

* address PR feedback

* address PR feedback

* adding save_checkpoint and load_checkpoint hparams interface

* adding save_checkpoint and load_checkpoint hparams interface

* yapf & pyright

* fixed error with logging pre-training validation loss

* cleaning up model forward pass

* cleaning up custom metrics

* renaming Checkpointer -> CheckpointSaver

* addressing pyright

* adding tests

* moving commits to BERT branch

* changing folder to be relative to run dir

* formatting

* adding tests

* adding initial YAML changes

* removing a copy of outdated files

* adding GLUE default params

* addressing pyright

* finalizing task-specific YAMLs

* code cleanup

* yapf

* adding license

* addressing tests

* formatting

* adding tests for the duration abstraction

* can i sue pyright for emotional damages?

* final formatting

* adding in finalized pre-training hyperparameters

* Update composer/models/bert/bert_hparams.py

Co-authored-by: Abhi Venigalla <[email protected]>

* Load Checkpoints from Cloud Storage

Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes mosaicml#192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which would be used to download the checkpoint from storage.
- Updated the trainer init to propagate through this change.

* Libcloud intersphinx

* addressing PR feedback

* changing checkpoints into a cloud URL

* addressing Landan's feedback

* filepath -> checkpoint in the YAMLs

* Fixed merge

* Removed auto-parsing s3 and gs urls, as libcloud requires authentication. Fixed tests.

* Flattened run directory uploader hparams

* Fixed object store provider hparams

* updating sampler to be composer.dist

* Added tqdm progress bars and chunk sizing parameterization
Refactored checkpoint storage

* Fix pyright

* Fixed timeout

* Fix checkpointing

* Fixed deepspeed checkpoints

* Cleaned up PR

* finalized checkpointing loading

* refactored metric to avoid lists

* addressing pyright

* updating YAMLs with checkpoints

* final change

* adding unit tests

* adding LICENSE

* addressing conflicts & tests

* isort

* removing finished TODOs

* adding new GPT-2 YAMLs

Co-authored-by: Ravi Rahman <[email protected]>
Co-authored-by: Moin Nadeem <[email protected]>
Co-authored-by: Abhi Venigalla <[email protected]>
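Several of the commits above add linear LR warmup (including `warmup_ratio` support) followed by linear decay, the standard schedule for BERT fine-tuning. A minimal sketch of such a schedule, illustrative rather than Composer's actual scheduler (the function name and the 0.06 default are assumptions):

```python
# Hypothetical sketch of a linear warmup + linear decay LR schedule.
# Returns a multiplier in [0, 1] to scale the base learning rate.

def linear_warmup_decay(step: int, total_steps: int, warmup_ratio: float = 0.06) -> float:
    """Ramp linearly from 0 to 1 over the warmup fraction of training,
    then decay linearly back to 0 by total_steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: fraction of warmup completed.
        return step / max(1, warmup_steps)
    # Decay phase: fraction of post-warmup steps remaining.
    remaining = total_steps - warmup_steps
    return max(0.0, (total_steps - step) / max(1, remaining))
```

With `total_steps=100` and `warmup_ratio=0.1`, the multiplier rises from 0.0 at step 0 to 1.0 at step 10, then falls linearly to 0.0 at step 100, matching the `warmup_ratio`-style configuration the commits describe.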