Conversation

moinnadeem
Contributor

@moinnadeem moinnadeem commented Dec 29, 2021

Motivation

Composer currently contains autoregressive language models, but masked language models are just as important a target for speeding up training. This PR adds the first masked language model (BERT) to Composer.

Merging Criteria

  • Demonstrate that BERT achieves acceptable downstream GLUE performance on all benchmarks
  • Demonstrate average GLUE performance on the validation and test sets
  • Code quality should be up to par, with well-commented abstractions
  • New TorchMetrics should include tests
  • New YAML files should be clean and contain no unnecessary hparams
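The TorchMetrics criterion above can be illustrated with a minimal sketch. This is not Composer's actual metric class (the names and signatures here are hypothetical); it only mirrors the torchmetrics `update()`/`compute()` accumulation contract that the PR's per-token loss metric follows, without the torch dependency:

```python
# Hypothetical sketch of a running-average language-modeling loss metric,
# mirroring the torchmetrics update()/compute() contract. Names are
# illustrative, not Composer's actual API.

class MeanCrossEntropy:
    """Accumulates summed loss and token counts across batches."""

    def __init__(self) -> None:
        self.sum_loss = 0.0
        self.total_items = 0

    def update(self, batch_loss_sum: float, num_items: int) -> None:
        # batch_loss_sum is the *summed* (not averaged) loss over num_items tokens
        self.sum_loss += batch_loss_sum
        self.total_items += num_items

    def compute(self) -> float:
        if self.total_items == 0:
            raise ValueError("compute() called before any update()")
        return self.sum_loss / self.total_items
```

A test for such a metric only needs to check that averaging happens over the total token count, not per batch: two batches of summed loss 6.0 over 3 tokens and 4.0 over 2 tokens should yield 10.0 / 5 = 2.0.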

ravi-mosaicml and others added 29 commits January 4, 2022 15:33
Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes #192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which would be used to download the checkpoint from storage.
- Updated the trainer init to propagate through this change.
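The dispatch this commit describes can be sketched as follows. This is an illustrative stand-in, not Composer's actual code: `resolve_checkpoint` and the injected `download` callable are hypothetical names, with `download` playing the role of the object-store provider (e.g. a libcloud download call); when no provider is configured the checkpoint path is used as a local file directly:

```python
# Illustrative sketch (not Composer's implementation) of loading a checkpoint
# either from local disk or from object storage via a provider callable.
import os
import tempfile
from typing import Callable, Optional


def resolve_checkpoint(
    checkpoint: str,
    download: Optional[Callable[[str, str], None]] = None,
) -> str:
    """Return a local filesystem path for `checkpoint`.

    `download(remote_name, local_path)` stands in for the object-store
    provider; when it is None, `checkpoint` is assumed to already be a
    path on local disk and is returned unchanged.
    """
    if download is None:
        return checkpoint
    # Download the remote object into a fresh temporary directory first.
    local_path = os.path.join(tempfile.mkdtemp(), os.path.basename(checkpoint))
    download(checkpoint, local_path)
    return local_path
```

Keeping the provider behind a callable (or hparams object, as in the PR) means the trainer init only ever sees a local path, which is why the change could propagate through with minimal trainer-side modification.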
```python
return self.sum_loss / self.total_items  # type: ignore (third-party)


# TODO (Moin): write tests for this!
```
Contributor

Ah last thing (I think), could you remove these TODOs?

Contributor Author

ooh good catch! Let me actually grep for TODOs overall

Contributor Author

Done!

@Landanjs
Contributor

Mostly reviewed new changes to metrics. LGTM!

@moinnadeem moinnadeem merged commit 0d6b3af into dev Jan 11, 2022
@moinnadeem moinnadeem deleted the moin/bert branch January 11, 2022 00:13
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
* Load Checkpoints from Cloud Storage

Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes mosaicml#192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which would be used to download the checkpoint from storage.
- Updated the trainer init to propagate through this change.

* Libcloud intersphinx

* rebasing off of dev

* starting an LR sweep

* adding proper dataset and batch size

* 2.0e-3 LR causes NaNs, lowering lr

* changing adam

* adding SST-2

* adding validation tracking

* adding SST-2 -- training but not at the right accuracy

* cleaning up code & debugging why training loss is so large

* finalized YAML for SST-2, starting hparam sweeps

* updating hparams to sweep:

* finalized current setup for SST-2

* starting hparam sweeps on RTE

* adding support for warmup_ratio

* adding non-standard metrics

* adding support for duration as a time abstraction

* adding compatibility with DataloaderSpec changes

* adding a linear learning rate decay

* adding linear LR warmup

* finalizing GLUE

* refactoring implementation to add regression tasks

* fixing checkpoint bug

* finalizing fine-tuning a checkpointed model

* fixing checkpoint bug

* adding validation

* adding mid-training

* starting LR sweep

* adding checkpointing feedback part 1

* fix validation interval

* address PR feedback

* address PR feedback

* adding save_checkpoint and load_checkpoint hparams interface

* adding save_checkpoint and load_checkpoint hparams interface

* yapf & pyright

* fixed error with logging pre-training validation loss

* cleaning up model forward pass

* cleaning up custom metrics

* renaming Checkpointer -> CheckpointSaver

* addressing pyright

* adding tests

* moving commits to BERT branch

* changing folder to be relative to run dir

* formatting

* adding tests

* adding initial YAML changes

* removing a copy of outdated files

* adding GLUE default params

* addressing pyright

* finalizing task-specific YAMLs

* code cleanup

* yapf

* adding license

* addressing tests

* formatting

* adding tests for the duration abstraction

* can i sue pyright for emotional damages?

* final formatting

* adding in finalized pre-training hyperparameters

* Update composer/models/bert/bert_hparams.py

Co-authored-by: Abhi Venigalla <[email protected]>

* Load Checkpoints from Cloud Storage

Added support to load checkpoints stored in object storage (rather than just on the local disk). Closes mosaicml#192.

- Refactored the run directory uploader to separate out object store related utilities to composer.utils.object_store (and added test coverage).
- Updated the checkpointer hparams to optionally take `composer.utils.object_store.ObjectStoreProviderHparams`, which would be used to download the checkpoint from storage.
- Updated the trainer init to propagate through this change.

* Libcloud intersphinx

* addressing PR feedback

* changing checkpoints into a cloud URL

* addressing Landan's feedback

* filepath -> checkpoint in the YAMLs

* Fixed merge

* Removed auto-parsing s3 and gs urls, as libcloud requires authentication. Fixed tests.

* Flattened run directory uploader hparams

* Fixed object store provider hparams

* updating sampler to be composer.dist

* Added tqdm progress bars and chunk sizing parameterization
Refactored checkpoint storage

* Fix pyright

* Fixed timeout

* Fix checkpointing

* Fixed deepspeed checkpoints

* Cleaned up PR

* finalized checkpointing loading

* refactored metric to avoid lists

* addressing pyright

* updating YAMLs with checkpoints

* final change

* adding unit tests

* adding LICENSE

* addressing conflicts & tests

* isort

* removing finished TODOs

* adding new GPT-2 YAMLs

Co-authored-by: Ravi Rahman <[email protected]>
Co-authored-by: Moin Nadeem <[email protected]>
Co-authored-by: Abhi Venigalla <[email protected]>
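Several of the commits above add linear LR warmup (including `warmup_ratio` support) followed by linear decay, the standard schedule for BERT fine-tuning. A minimal sketch of such a schedule, illustrative rather than Composer's actual scheduler (the function name and the 0.06 default are assumptions):

```python
# Hypothetical sketch of a linear warmup + linear decay LR schedule.
# Returns a multiplier in [0, 1] to scale the base learning rate.

def linear_warmup_decay(step: int, total_steps: int, warmup_ratio: float = 0.06) -> float:
    """Ramp linearly from 0 to 1 over the warmup fraction of training,
    then decay linearly back to 0 by total_steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: fraction of warmup completed.
        return step / max(1, warmup_steps)
    # Decay phase: fraction of post-warmup steps remaining.
    remaining = total_steps - warmup_steps
    return max(0.0, (total_steps - step) / max(1, remaining))
```

With `total_steps=100` and `warmup_ratio=0.1`, the multiplier rises from 0.0 at step 0 to 1.0 at step 10, then falls linearly to 0.0 at step 100, matching the `warmup_ratio`-style configuration the commits describe.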