-
Notifications
You must be signed in to change notification settings - Fork 454
Add C4 Streaming dataset #489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
684becb
to
aa68b9c
Compare
This PR has been refactored, so instead of a general HF streaming adapter (which is a bit unfeasible, given that HF datasets are user-generated and do not have a consistent implementation of shards or sample dicts), this PR now adds a specific The intent is for In the future, we can try to abstract some things away, and/or coordinate with HF upstream to build better sharding support, but right now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did mostly a code style/ readability review. Will have to rely on more knowledgable (e.g. @moinnadeem ) to check the functional correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments, correctness looks good to me from what I can tell! Good work Abhi!
cb45757
to
49f9cc8
Compare
Hey @hanlint @moinnadeem , sorry for the delay on this PR. I think I should have addressed all the comments. Hope we can get this merged today! And then I can work with @moinnadeem to update the BERT yaml to use C4, as well as convert some of the large GPT2 yamls (> 125M) to GPT3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from code quality stand point, defer to @moinnadeem for NLP correctness
Also, I'm seeing a lot of new pyright errors.. but I don't think there are any related to |
Turns out the self documenting f-string syntax |
7d06c64
to
5691840
Compare
This PR adds an adapter for HF streaming datasets, so that we can train with DDP and sharding. There is also a small QoL fix for DeepSpeed.
I've also added a new GPT3-125m YAML which can be the start of our new GPT3 family of benchmarks. I think we should deprecate the GPT2 collection of YAMLs (-52m, -83m -125m) in the near future, say Composer v0.5.
Some TODOs, which can be follow-up PRs:
LMDataset
object