LM Streaming Dataset #211
train_dataset:
  streaming_lm:
    dataset_name: c4
    dataset_config_name: en
    split: train
    max_shards: -1
    max_samples: 7168000
    max_seq_len: 1024
    group_method: concat
    tokenizer_name: gpt2
    use_masked_lm: false
    seed: 17
    shuffle: true
    drop_last: true
val_dataset:
  streaming_lm:
    dataset_name: c4
    dataset_config_name: en
    split: validation
    max_shards: -1
    max_samples: 128000
    max_seq_len: 1024
    group_method: concat
    tokenizer_name: gpt2
    use_masked_lm: false
    seed: 17
    shuffle: false
    drop_last: true
Just want to highlight what the new YAML looks like.
For @moinnadeem: you should use group_method: truncate for MLM.
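For anyone skimming, here is a rough sketch of how I read the two group_method options in the YAML above (illustrative only; the function name and details are my assumptions, not the actual dataset code):

import itertools

def group_texts(token_lists, max_seq_len, method="concat"):
    # concat: join every tokenized example into one long stream and slice it
    # into fixed max_seq_len chunks (typical for causal LM pretraining).
    # truncate: keep examples separate and cut each at max_seq_len
    # (typical for MLM, where examples should stay document-aligned).
    if method == "concat":
        flat = list(itertools.chain.from_iterable(token_lists))
        return [flat[i:i + max_seq_len] for i in range(0, len(flat), max_seq_len)]
    if method == "truncate":
        return [tokens[:max_seq_len] for tokens in token_lists]
    raise ValueError(f"unknown group_method: {method}")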
To be clear, do we want to actually merge this, or is this a "draft PR"?
Good call, just made it a draft :) Needs a little bit more work.
f266c48 to c19fc86: Revert changes to last known stable setting.
deepspeed_config["fp16"] = { | ||
"enabled": True, | ||
"initial_scale_power": 16, | ||
"initial_scale_power": 0, |
For reference, this is the change that was causing issues training models last week. An initial scale power of 0 means we multiply gradients by 2^0 = 1, i.e. loss scaling is effectively disabled. 32 is DeepSpeed's default; 16 is consistent with PyTorch's defaults and seems to work well for the scale of model we work with.
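To make the arithmetic concrete, here is a minimal sketch (not the actual config-building code in this PR) of how initial_scale_power maps to the starting loss scale:

# The initial dynamic loss scale is 2 ** initial_scale_power.
deepspeed_config = {}
deepspeed_config["fp16"] = {
    "enabled": True,
    "initial_scale_power": 16,  # starting loss scale = 2 ** 16 = 65536,
                                # the same as torch.cuda.amp.GradScaler's default
}
# A power of 0 gives a scale of 2 ** 0 = 1, i.e. effectively no loss scaling.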
Closing, newer PR here: #489
This is a WIP but should work for some basic use cases.
Successfully used streaming C4 dataset for GPT2 pretraining, confirmed that multi-epoch works, and visually inspected tensors for MLM modeling. Will defer BERT+C4 testing to @moinnadeem.
I added a file to tests/datasets/... but it probably needs to be refactored / removed for now. I think we'll need a pretty cohesive set of tests for StreamingLM in the future.

Known issues:
- If world_size > n_shards and max_samples is not set, there could be a bug at the end of the epoch.
- drop_last is currently required to be True; until this bug is fixed, set max_samples to be safe. Future updates will remove this requirement.
- The max_shards parameter is provided as a way to deterministically subsample a dataset.
- Shuffling needs more thought when max_samples is set. I am also worried about how shuffling is handled when multiple devices are looking at the same shard... they should be careful to subsample first and then shuffle, or else data could be duplicated (see the sketch below). These concerns can be addressed in future PRs; right now I think this class will mostly be used for large single-epoch pretraining.
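To illustrate the "subsample first, then shuffle" point from the last bullet, here is a minimal, hypothetical sketch (the names and structure are my own, not part of this PR): each rank filters the stream down to its own slice by index before any shuffling happens, so the per-rank slices are disjoint by construction and a local shuffle buffer can never duplicate data across devices.

import random

def per_rank_stream(examples, rank, world_size, seed, buffer_size=10_000):
    """Yield this rank's examples in a locally shuffled order.

    Subsampling happens FIRST (every world_size-th example belongs to exactly
    one rank), so shuffling afterwards only reorders a rank's own slice and
    cannot duplicate samples seen by another device.
    """
    rng = random.Random(seed + rank)
    buffer = []
    for i, example in enumerate(examples):
        if i % world_size != rank:
            continue  # another rank owns this example
        if len(buffer) < buffer_size:
            buffer.append(example)  # fill the shuffle buffer
            continue
        j = rng.randrange(buffer_size)
        yield buffer[j]             # emit a random buffered example...
        buffer[j] = example         # ...and replace it with the new one
    rng.shuffle(buffer)             # flush whatever remains at the end
    yield from buffer

Doing it in the opposite order (shuffle the shared shard, then subsample per rank) would let two ranks draw the same example, which is the duplication concern above.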