Add BERT Base to Composer #195
Conversation
Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes #192.
- Refactored the run directory uploader to separate out object-store-related utilities into composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which is used to download the checkpoint from storage.
- Updated the trainer init to propagate this change.
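The fallback behavior described above (use the local path unless an object store is configured) can be sketched in plain Python. The dataclass and function below are illustrative stand-ins only, not Composer's actual API; the real entry point is `composer.utils.object_store.ObjectStoreProviderHparams`, and all field names here are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectStoreHparams:
    """Illustrative stand-in for ObjectStoreProviderHparams (field names assumed)."""
    provider: str     # libcloud provider name, e.g. "s3" or "google_storage"
    container: str    # bucket name
    key_environ: str  # environment variable holding the access key


def resolve_checkpoint(path: str, object_store: Optional[ObjectStoreHparams],
                       download_dir: str = "/tmp") -> str:
    """Return a local filesystem path for ``path``.

    With no object store configured, ``path`` is treated as a local file,
    matching the pre-existing local-disk behavior. Otherwise the checkpoint
    would be downloaded from the bucket into ``download_dir`` first.
    """
    if object_store is None:
        return path  # local disk: use the path as-is
    local_path = os.path.join(download_dir, os.path.basename(path))
    # Real code would download via libcloud here before returning local_path.
    return local_path


# Usage: no object store configured -> path passed through unchanged
assert resolve_checkpoint("runs/ep10.pt", None) == "runs/ep10.pt"
```

The point of the optional hparam is exactly this split: checkpoint loading stays a single call site, and the storage backend is a configuration detail.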
Refactored checkpoint storage
composer/models/nlp_metrics.py (outdated)
return self.sum_loss / self.total_items  # type: ignore (third-party)
# TODO (Moin): write tests for this!
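The line under review computes a mean by dividing an accumulated loss by an accumulated item count. Below is a minimal, framework-free sketch of that accumulator pattern; the class name and the `update()`/`compute()` method names follow the torchmetrics convention, but this is not the actual `nlp_metrics.py` implementation.

```python
class MeanLoss:
    """Streaming mean of per-batch loss in the update()/compute() style.

    Accumulates a weighted sum so that batches of different sizes are
    averaged correctly rather than averaging the batch means directly.
    """

    def __init__(self) -> None:
        self.sum_loss = 0.0
        self.total_items = 0

    def update(self, batch_mean_loss: float, num_items: int) -> None:
        # Weight the batch-mean loss by the number of items it covers.
        self.sum_loss += batch_mean_loss * num_items
        self.total_items += num_items

    def compute(self) -> float:
        if self.total_items == 0:
            raise ValueError("compute() called before any update()")
        return self.sum_loss / self.total_items


m = MeanLoss()
m.update(2.0, 10)  # batch of 10 items, mean loss 2.0
m.update(4.0, 30)  # batch of 30 items, mean loss 4.0
assert m.compute() == 3.5  # (2.0*10 + 4.0*30) / 40
```

Tests for such a metric are cheap to write (the TODO above), since the whole contract is a weighted average plus an empty-state error.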
Ah last thing (I think), could you remove these TODOs?
ooh good catch! Let me actually grep for TODOs overall
Done!
Mostly reviewed new changes to metrics. LGTM!
* Load Checkpoints from Cloud Storage: added support to load checkpoints stored in object storage rather than just on the local disk (closes mosaicml#192); refactored the run directory uploader to move object-store utilities into composer.utils.object_store with test coverage; updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams` for downloading checkpoints; propagated the change through the trainer init
* Libcloud intersphinx
* rebasing off of dev
* starting an LR sweep
* adding proper dataset and batch size
* 2.0e-3 LR causes NaNs, lowering LR
* changing adam
* adding SST-2
* adding validation tracking
* adding SST-2 -- training but not at the right accuracy
* cleaning up code & debugging why training loss is so large
* finalized YAML for SST-2, starting hparam sweeps
* updating hparams to sweep
* finalized current setup for SST-2
* starting hparam sweeps on RTE
* adding support for warmup_ratio
* adding non-standard metrics
* adding support for duration as a time abstraction
* adding compatibility with DataloaderSpec changes
* adding a linear learning rate decay
* adding linear LR warmup
* finalizing GLUE
* refactoring implementation to add regression tasks
* fixing checkpoint bug
* finalizing fine-tuning a checkpointed model
* fixing checkpoint bug
* adding validation
* adding mid-training
* starting LR sweep
* adding checkpointing feedback part 1
* fix validation interval
* address PR feedback
* address PR feedback
* adding save_checkpoint and load_checkpoint hparams interface
* adding save_checkpoint and load_checkpoint hparams interface
* yapf & pyright
* fixed error with logging pre-training validation loss
* cleaning up model forward pass
* cleaning up custom metrics
* renaming Checkpointer -> CheckpointSaver
* addressing pyright
* adding tests
* moving commits to BERT branch
* changing folder to be relative to run dir
* formatting
* adding tests
* adding initial YAML changes
* removing a copy of outdated files
* adding GLUE default params
* addressing pyright
* finalizing task-specific YAMLs
* code cleanup
* yapf
* adding license
* addressing tests
* formatting
* adding tests for the duration abstraction
* can i sue pyright for emotional damages?
* final formatting
* adding in finalized pre-training hyperparameters
* Update composer/models/bert/bert_hparams.py (Co-authored-by: Abhi Venigalla <[email protected]>)
* Load Checkpoints from Cloud Storage
* Libcloud intersphinx
* addressing PR feedback
* changing checkpoints into a cloud URL
* addressing Landan's feedback
* filepath -> checkpoint in the YAMLs
* Fixed merge
* Removed auto-parsing s3 and gs urls, as libcloud requires authentication. Fixed tests.
* Flattened run directory uploader hparams
* Fixed object store provider hparams
* updating sampler to be composer.dist
* Added tqdm progress bars and chunk sizing parameterization; refactored checkpoint storage
* Fix pyright
* Fixed timeout
* Fix checkpointing
* Fixed deepspeed checkpoints
* Cleaned up PR
* finalized checkpointing loading
* refactored metric to avoid lists
* addressing pyright
* updating YAMLs with checkpoints
* final change
* adding unit tests
* adding LICENSE
* addressing conflicts & tests
* isort
* removing finished TODOs
* adding new GPT-2 YAMLs

Co-authored-by: Ravi Rahman <[email protected]>
Co-authored-by: Moin Nadeem <[email protected]>
Co-authored-by: Abhi Venigalla <[email protected]>
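Several commits above add a `warmup_ratio`, linear LR warmup, and linear LR decay. A sketch of that common schedule is below; the function name and the exact shape (rise linearly from zero, then decay linearly back to zero) are assumptions about the intent, not Composer's implementation.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_ratio: float,
                        base_lr: float) -> float:
    """LR at ``step``: rises linearly from 0 to ``base_lr`` over the first
    ``warmup_ratio`` fraction of training, then decays linearly to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: fraction of the way through warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fraction of training remaining after warmup.
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, remaining))


# Usage: 10% warmup over 1000 steps, peak LR 1.0
assert linear_warmup_decay(0, 1000, 0.1, 1.0) == 0.0     # start of warmup
assert linear_warmup_decay(100, 1000, 0.1, 1.0) == 1.0   # peak at end of warmup
```

Expressing the warmup as a ratio of total duration (rather than a fixed step count) is what lets the same YAML work across sweeps with different training lengths.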
Motivation
Composer currently contains autoregressive language models, but masked language models are just as important for speeding up training. This PR adds the first masked language model, BERT, to Composer.
Merging Criteria