
Conversation

@abhi-mosaic (Contributor) commented Jan 8, 2022

This is a WIP but should work for some basic use cases.

Successfully used the streaming C4 dataset for GPT-2 pretraining, confirmed that multi-epoch training works, and visually inspected tensors for MLM modeling. Will defer BERT+C4 testing to @moinnadeem.

I added a file to tests/datasets/... but it probably needs to be refactored / removed for now. I think we'll need a pretty cohesive set of tests for StreamingLM in the future.

Known issues:

  • Determining the length of a new dataset is not implemented (right now the info for C4 is cached).
  • Epoch-end handling:
    • If world_size > n_shards and max_samples is not set, there could be a bug at the end of the epoch.
    • drop_last is currently required to be True until this bug is fixed.
    • Basically, set max_samples to be safe; future updates will remove this requirement.
  • If there are multiple devices and we are not iterating over the whole dataset, there are very few guarantees about the exact subset of data we will see. A single-GPU run would see shard 1 only, but an 8-GPU run would see 1/8th of 8 shards. This could be a problem for reproducibility, which is why I also included the max_shards parameter as a way to deterministically subsample a dataset.
  • I am quite nervous about how shuffling is handled overall, and whether the streaming dataset's shard order will or will not get reshuffled at the end of each epoch when max_samples is set. I am also worried about how shuffling is handled when multiple devices are looking at the same shard: they should be careful to subsample first and then shuffle, or else data could be duplicated (see the sketch after this list). These concerns can be addressed in future PRs; right now I expect this class to be used mostly for large single-epoch pretraining.
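To make the subsample-then-shuffle concern concrete, here is a minimal sketch (my illustration, not this PR's code; `shards_for_rank` and the shard filenames are made up) of how each rank could pick a disjoint, deterministic shard subset before any shuffling happens:

```python
import random
from typing import List

def shards_for_rank(shards: List[str], rank: int, world_size: int,
                    max_shards: int = -1, seed: int = 17) -> List[str]:
    """Pick this rank's shards deterministically, then shuffle only within them."""
    if max_shards > 0:
        shards = shards[:max_shards]        # deterministic subsample of the whole dataset
    my_shards = shards[rank::world_size]    # disjoint per-rank subsets -> no duplicated data
    rng = random.Random(seed + rank)
    rng.shuffle(my_shards)                  # shuffle after subsampling, never before
    return my_shards

# Example: 8 shards over 4 ranks; each rank sees 2 distinct shards.
all_shards = [f"c4-train.{i:05d}-of-01024.json.gz" for i in range(8)]
print(shards_for_rank(all_shards, rank=0, world_size=4))
```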

@abhi-mosaic abhi-mosaic self-assigned this Jan 8, 2022
Comment on lines 3 to 30
train_dataset:
  streaming_lm:
    dataset_name: c4
    dataset_config_name: en
    split: train
    max_shards: -1
    max_samples: 7168000
    max_seq_len: 1024
    group_method: concat
    tokenizer_name: gpt2
    use_masked_lm: false
    seed: 17
    shuffle: true
    drop_last: true
val_dataset:
  streaming_lm:
    dataset_name: c4
    dataset_config_name: en
    split: validation
    max_shards: -1
    max_samples: 128000
    max_seq_len: 1024
    group_method: concat
    tokenizer_name: gpt2
    use_masked_lm: false
    seed: 17
    shuffle: false
    drop_last: true
@abhi-mosaic (Contributor Author)

Just want to highlight what the new YAML looks like.
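(For a sense of scale: max_samples: 7168000 with max_seq_len: 1024 works out to 7,168,000 × 1,024 ≈ 7.34B tokens per training epoch, assuming the concat group method emits full-length sequences.)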

@abhi-mosaic (Contributor Author)

For @moinnadeem you should use group_method: truncate for MLM.
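For anyone reading along, a rough sketch of the difference (function names and details are illustrative, not the PR's exact implementation): concat packs the whole token stream into full-length blocks for causal LM, while truncate keeps documents separate and clips each one, which is the usual setup for MLM.

```python
from typing import List

def group_concat(token_stream: List[int], max_seq_len: int) -> List[List[int]]:
    """Concatenate all tokens and chop into full-length blocks (causal LM pretraining)."""
    n_blocks = len(token_stream) // max_seq_len
    return [token_stream[i * max_seq_len:(i + 1) * max_seq_len] for i in range(n_blocks)]

def group_truncate(docs: List[List[int]], max_seq_len: int) -> List[List[int]]:
    """Keep documents separate and truncate each to max_seq_len (typical for MLM)."""
    return [doc[:max_seq_len] for doc in docs]
```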

@moinnadeem (Contributor)

To be clear, do we want to actually merge this, or is this a "draft PR"?

@abhi-mosaic abhi-mosaic marked this pull request as draft January 8, 2022 22:05
@abhi-mosaic (Contributor Author)

> To be clear, do we want to actually merge this, or is this a "draft PR"?

Good call, just made it a draft :) Needs a bit more work.

Revert changes to last known stable setting.
 deepspeed_config["fp16"] = {
     "enabled": True,
-    "initial_scale_power": 16,
+    "initial_scale_power": 0,
Contributor

For reference, this is the change that was causing issues training models last week. An initial scale power of 0 means the loss is scaled by 2^0 = 1, i.e. loss scaling is effectively disabled. 32 is DeepSpeed's default; 16 is consistent with PyTorch's default and seems to work well for the scale of models we work with.
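To spell out the arithmetic (the fp16 keys below are standard DeepSpeed config fields; the comments are mine): the initial dynamic loss scale is 2 ** initial_scale_power.

```python
deepspeed_config = {
    "fp16": {
        "enabled": True,
        # initial loss scale = 2 ** 16 = 65536, matching PyTorch GradScaler's default init_scale
        "initial_scale_power": 16,
        # "initial_scale_power": 0 would give 2 ** 0 = 1, i.e. effectively no loss scaling
    }
}
```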

@abhi-mosaic (Contributor Author)

Closing, newer PR here: #489

@abhi-mosaic abhi-mosaic deleted the abhi/lm_streaming branch May 25, 2022 21:37