abhi-mosaic
Contributor

This PR adds an adapter for HF streaming datasets, so that we can train with DDP and sharding. There is also a small QoL fix for DeepSpeed.

I've also added a new GPT3-125m YAML, which can be the start of our new GPT3 family of benchmarks. I think we should deprecate the GPT2 collection of YAMLs (-52m, -83m, -125m) in the near future, say Composer v0.5.

Some TODOs, which can be follow-up PRs:

  • Add the rest of the GPT3 family YAMLs
  • Update the GPT2 model with QoL fixes
  • Update the BERT-base benchmark to use C4
  • Deprecate / combine the old LMDataset object

@abhi-mosaic abhi-mosaic added the enhancement New (engineering) enhancements, such as features or API changes. label Feb 16, 2022
@abhi-mosaic abhi-mosaic self-assigned this Feb 16, 2022
@abhi-mosaic abhi-mosaic mentioned this pull request Feb 16, 2022
@abhi-mosaic
Contributor Author

This PR has been refactored: instead of a general HF streaming adapter (which is not really feasible, given that HF datasets are user-generated and do not have a consistent implementation of shards or sample dicts), it now adds a specific C4Dataset that happens to be backed by HF Datasets.

The intent is for C4Dataset to be an example for users who want to train with their own HF datasets, but they will have to construct their CustomDataset class independently. This is similar to how our ImageNet dataset is an example of a vision dataset that happens to be backed by torchvision.datasets.ImageFolder.

In the future, we can try to abstract some things away, and/or coordinate with HF upstream to build better sharding support, but right now C4Dataset is just a standalone example.
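For readers following along, here is a minimal sketch of the kind of per-rank sharding a streaming dataset needs under DDP. It is not the C4Dataset code from this PR: `shard_stream` and `fake_stream` are hypothetical names, and the strided split is just one simple scheme, assumed here for illustration.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def shard_stream(samples: Iterable[Dict], rank: int, world_size: int) -> Iterator[Dict]:
    """Yield only the samples that belong to `rank` (hypothetical helper)."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"invalid rank {rank} for world_size {world_size}")
    # Strided split: rank r sees samples r, r + world_size, r + 2 * world_size, ...
    return islice(samples, rank, None, world_size)


# Stand-in for a streamed text dataset: any iterable of sample dicts works.
fake_stream = [{"text": f"doc {i}"} for i in range(10)]

# Simulate a 4-process job by sharding the same stream once per rank.
shards = [list(shard_stream(iter(fake_stream), r, 4)) for r in range(4)]
```

In a real DDP job, `rank` and `world_size` would typically come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`. Note that the shards above have unequal lengths (3, 3, 2, 2), which a real implementation would likely need to handle so that every rank sees the same number of batches.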

Contributor

@hanlint hanlint left a comment

Did mostly a code style / readability review. Will have to rely on someone more knowledgeable (e.g. @moinnadeem) to check the functional correctness.

Contributor

@moinnadeem moinnadeem left a comment

Added a few comments; correctness looks good to me from what I can tell! Good work, Abhi!

@abhi-mosaic
Contributor Author

Hey @hanlint @moinnadeem, sorry for the delay on this PR. I think I have addressed all the comments. Hope we can get this merged today! Then I can work with @moinnadeem to update the BERT YAML to use C4, as well as convert some of the large GPT2 YAMLs (> 125M) to GPT3.

Contributor

@hanlint hanlint left a comment

LGTM from a code quality standpoint; defer to @moinnadeem for NLP correctness.

@abhi-mosaic
Contributor Author

Also, I'm seeing a lot of new pyright errors, but I don't think any of them are related to c4.py. Please let me know if this is expected and what I can do.

@ravi-mosaicml ravi-mosaicml added this to the v0.5 milestone Feb 28, 2022
@abhi-mosaic
Contributor Author

abhi-mosaic commented Mar 1, 2022

Turns out the self-documenting f-string syntax f"{var=}" did not appear until Python 3.8, so our Python 3.7 tests were failing 🙃. Reverted.
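As a small illustration of the incompatibility (not the actual code that was reverted; `var` is a placeholder name), the `=` specifier can be spelled out by hand in a form that also parses on Python 3.7:

```python
var = 42

# Python 3.8+ only: f"{var=}" expands to the expression text, "=", and repr(var).
# On Python 3.7 that spelling is a SyntaxError, so it is left in a comment here.

# Portable equivalent: write the name by hand and request repr() with !r.
message = f"var={var!r}"
print(message)
```

The `!r` conversion matters for strings: with `var = "c4"`, the portable form produces `var='c4'`, matching what the `=` specifier would print.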

@abhi-mosaic abhi-mosaic changed the title Add HF Streaming dataset Add C4 Streaming dataset Mar 2, 2022
@abhi-mosaic abhi-mosaic merged commit 7d1d801 into dev Mar 2, 2022
@abhi-mosaic abhi-mosaic deleted the abhi/hf_streaming branch March 2, 2022 19:37
5 participants