
Conversation

@abhi-mosaic (Contributor) commented Jan 8, 2022

This is a WIP but should work for some basic use cases.

Successfully used the streaming C4 dataset for GPT-2 pretraining, confirmed that multi-epoch training works, and visually inspected tensors for MLM modeling. Will defer BERT+C4 testing to @moinnadeem.

I added a file to tests/datasets/... but it probably needs to be refactored / removed for now. I think we'll need a pretty cohesive set of tests for StreamingLM in the future.

Known issues:

  • Determining the length of a new dataset is not implemented (right now the info for C4 is cached).
  • Epoch-end handling:
    • If world_size > n_shards and max_samples is not set, there could be a bug at the end of the epoch.
    • drop_last is currently required to be True until this bug is fixed.
    • Basically, set max_samples to be safe; future updates will remove this requirement.
  • If there are multiple devices and we are not iterating over the whole dataset, there are very few guarantees about the exact subset of data we will see. A single-GPU run would see shard 1 only, but an 8-GPU run would see 1/8th of 8 shards. This could be a problem for reproducibility, which is why I also included the max_shards parameter as a way to deterministically subsample a dataset.
  • I am quite nervous about how shuffling is handled overall, and whether the streaming dataset's shard order will or will not get reshuffled at the end of each epoch when max_samples is set. I am also worried about how shuffling is handled when multiple devices are looking at the same shard: they should be careful to subsample first and then shuffle, or else data could be duplicated (see the sketch after this list). These concerns can be addressed in future PRs; right now I expect this class to be used mostly for large single-epoch pretraining.
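To make the subsample-then-shuffle concern concrete, here is a minimal sketch (my illustration, not this PR's code; `shards_for_rank` and the shard filenames are made up) of how each rank could pick a disjoint, deterministic shard subset before any shuffling happens:

```python
import random
from typing import List

def shards_for_rank(shards: List[str], rank: int, world_size: int,
                    max_shards: int = -1, seed: int = 17) -> List[str]:
    """Pick this rank's shards deterministically, then shuffle only within them."""
    if max_shards > 0:
        shards = shards[:max_shards]        # deterministic subsample of the whole dataset
    my_shards = shards[rank::world_size]    # disjoint per-rank subsets -> no duplicated data
    rng = random.Random(seed + rank)
    rng.shuffle(my_shards)                  # shuffle after subsampling, never before
    return my_shards

# Example: 8 shards over 4 ranks; each rank sees 2 distinct shards.
all_shards = [f"c4-train.{i:05d}-of-01024.json.gz" for i in range(8)]
print(shards_for_rank(all_shards, rank=0, world_size=4))
```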

@abhi-mosaic abhi-mosaic self-assigned this Jan 8, 2022
Comment on lines 3 to 30
train_dataset:
  streaming_lm:
    dataset_name: c4
    dataset_config_name: en
    split: train
    max_shards: -1
    max_samples: 7168000
    max_seq_len: 1024
    group_method: concat
    tokenizer_name: gpt2
    use_masked_lm: false
    seed: 17
    shuffle: true
    drop_last: true
val_dataset:
  streaming_lm:
    dataset_name: c4
    dataset_config_name: en
    split: validation
    max_shards: -1
    max_samples: 128000
    max_seq_len: 1024
    group_method: concat
    tokenizer_name: gpt2
    use_masked_lm: false
    seed: 17
    shuffle: false
    drop_last: true
@abhi-mosaic (Contributor Author)

Just want to highlight what the new YAML looks like.
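(For a sense of scale: max_samples: 7168000 with max_seq_len: 1024 works out to 7,168,000 × 1,024 ≈ 7.34B tokens per training epoch, assuming the concat group method emits full-length sequences.)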

@abhi-mosaic (Contributor Author)

For @moinnadeem you should use group_method: truncate for MLM.
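For anyone reading along, a rough sketch of the difference (function names and details are illustrative, not the PR's exact implementation): concat packs the whole token stream into full-length blocks for causal LM, while truncate keeps documents separate and clips each one, which is the usual setup for MLM.

```python
from typing import List

def group_concat(token_stream: List[int], max_seq_len: int) -> List[List[int]]:
    """Concatenate all tokens and chop into full-length blocks (causal LM pretraining)."""
    n_blocks = len(token_stream) // max_seq_len
    return [token_stream[i * max_seq_len:(i + 1) * max_seq_len] for i in range(n_blocks)]

def group_truncate(docs: List[List[int]], max_seq_len: int) -> List[List[int]]:
    """Keep documents separate and truncate each to max_seq_len (typical for MLM)."""
    return [doc[:max_seq_len] for doc in docs]
```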

@moinnadeem (Contributor)

To be clear, do we want to actually merge this, or is this a "draft PR"?

@abhi-mosaic abhi-mosaic marked this pull request as draft January 8, 2022 22:05
@abhi-mosaic (Contributor Author)

> To be clear, do we want to actually merge this, or is this a "draft PR"?

Good call, just made it a draft :) Needs a bit more work.

Revert changes to last known stable setting.
 deepspeed_config["fp16"] = {
     "enabled": True,
-    "initial_scale_power": 16,
+    "initial_scale_power": 0,
Contributor

For reference, this is the change that was causing issues training models last week. An initial scale power of 0 means the loss is scaled by 2^0 = 1, i.e. loss scaling is effectively disabled. 32 is DeepSpeed's default; 16 is consistent with PyTorch's default and seems to work well for the scale of models we work with.
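To spell out the arithmetic (the fp16 keys below are standard DeepSpeed config fields; the comments are mine): the initial dynamic loss scale is 2 ** initial_scale_power.

```python
deepspeed_config = {
    "fp16": {
        "enabled": True,
        # initial loss scale = 2 ** 16 = 65536, matching PyTorch GradScaler's default init_scale
        "initial_scale_power": 16,
        # "initial_scale_power": 0 would give 2 ** 0 = 1, i.e. effectively no loss scaling
    }
}
```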

@abhi-mosaic (Contributor Author)

Closing, newer PR here: #489

@abhi-mosaic abhi-mosaic deleted the abhi/lm_streaming branch May 25, 2022 21:37