v0.31.0
What's New
1. PyTorch 2.7.0 Compatibility (#3850)
We've added support for PyTorch 2.7.0 and created a Dockerfile to support PyTorch 2.7.0 + CUDA 12.8. The current Composer image supports PyTorch 2.7.0 + CUDA 12.6.3.
2. Experimental FSDP2 support has been added to Trainer (#3852)
Experimental FSDP2 support was added to Trainer with:
- auto_wrap based on _fsdp_wrap_fn and/or _fsdp_wrap attributes within the model (#3826)
- Activation checkpointing and CPU offloading (#3832)
- Meta initialization (#3852)
Note: Not all features are supported yet (e.g., automicrobatching and monolithic checkpointing).
Usage:
Add FSDP_VERSION=2 as an environment variable and set your FSDP2 config (parallelism_config) as desired. The full set of available attributes can be found here.
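The usage steps might look like the following minimal sketch. It assumes the FSDP2 settings are passed under the fsdp key of parallelism_config, as with the existing FSDP config. The two keys shown (activation_checkpointing, activation_cpu_offload) are illustrative assumptions based on the features listed in these notes, and build_trainer is a hypothetical helper, not part of Composer.

```python
import os

# Opt in to the experimental FSDP2 path. The release notes only say to add
# FSDP_VERSION=2 as an environment variable; setting it in-process before the
# Trainer is constructed is an assumption (exporting it in the shell also works).
os.environ["FSDP_VERSION"] = "2"

# Illustrative FSDP2 settings. These key names are assumptions; check the
# FSDP2 config attributes referenced in the release notes for the real set.
fsdp2_config = {
    "activation_checkpointing": True,
    "activation_cpu_offload": False,
}


def build_trainer(model, train_dataloader):
    """Hypothetical helper: build a Trainer that uses the config above."""
    # Imported lazily so this sketch can be loaded without Composer installed.
    from composer import Trainer

    return Trainer(
        model=model,
        train_dataloader=train_dataloader,
        max_duration="1ep",
        # Assumption: FSDP2 settings go under the "fsdp" key.
        parallelism_config={"fsdp": fsdp2_config},
    )
```

Because FSDP2 support is experimental, it is worth confirming on a small run that features you rely on (for example automicrobatching) are not among the unsupported ones.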
Bug Fixes
- Resolve a memory hang issue in the MLflow monitor process by switching it from fork to spawn (#3830)
What's Changed
- Bump Composer 0.31.0.dev0 by @KuuCi in #3808
- Update Checkpoint Back-Compatibility Test by @KuuCi in #3810
- Extend docker build matrix to add an entry for pytorch2.6+cu126 by @sirejdua-db in #3805
- Bump databricks-sdk from 0.47.0 to 0.49.0 by @dependabot in #3814
- Bump pypandoc from 1.14 to 1.15 by @dependabot in #3813
- Update google-cloud-storage requirement from <3.0,>=2.0.0 to >=2.0.0,<4.0 by @dependabot in #3812
- Update setuptools version by @irenedea in #3816
- Kickstart FSDP2 by @bowenyang008 in #3806
- Remove network calls to HF in CI by @dakinggg in #3817
- Update psutil requirement from <7,>=5.8.0 to >=5.8.0,<8 by @dependabot in #3818
- [FSDP2] Init FSDP2 based checkpointing by @bowenyang008 in #3824
- Update torchmetrics requirement from <1.6.1,>=1.0 to >=1.0,<1.7.2 by @dependabot in #3829
- Bump coverage[toml] from 7.6.8 to 7.8.0 by @dependabot in #3827
- Bump yamllint from 1.35.1 to 1.37.0 by @dependabot in #3820
- Update numpy requirement from <2.2.0,>=1.21.5 to >=1.21.5,<2.3.0 by @dependabot in #3828
- Update optimizer params for fsdp2 by @rithwik-db in #3822
- Change Mlflow monitor process from fork to spawn to reduce memory usage by @dakinggg in #3830
- Ignore mlflow warning in test by @dakinggg in #3831
- Bump HF hub version by @dakinggg in #3839
- Bump databricks-sdk from 0.49.0 to 0.50.0 by @dependabot in #3834
- Update transformers requirement from !=4.34.0,<4.51,>=4.11 to >=4.11,!=4.34.0,<4.52 by @dependabot in #3838
- Eliminate dead code before torch version 2.4 by @bowenyang008 in #3833
- Support submodule wrapping for FSDP2 according to model definition (with _fsdp_wrap and fsdp_wrap_fn) by @rithwik-db in #3826
- Activation Checkpointing and Offloading for FSDP2 by @rithwik-db in #3832
- Pin EFA installer version by @dakinggg in #3842
- Add two legacy torch images to the container build matrix by @asfandyarq in #3841
- Bump yamllint from 1.37.0 to 1.37.1 by @dependabot in #3845
- Update packaging requirement from <24.3,>=21.3.0 to >=21.3.0,<25.1 by @dependabot in #3846
- Bump cryptography from 44.0.0 to 44.0.3 by @dependabot in #3848
- Upgrade yapf version by @dakinggg in #3840
- Bump ipython from 8.11.0 to 8.36.0 by @dependabot in #3847
- Update huggingface-hub requirement from <0.31,>=0.21.2 to >=0.21.2,<0.32 by @dependabot in #3851
- Update EFA installer version by @dakinggg in #3844
- Fix typos by @omahs in #3853
- Integrate FSDP2 wrapper into Trainer by @bowenyang008 in #3852
- Deprecate code eval utils by @dakinggg in #3854
- FSDP2 time and verbose logging by @bowenyang008 in #3856
- Fix RDMA installation by @dakinggg in #3857
- Update ci-testing version to latest by @dakinggg in #3859
- Updating composer to support Torch 2.7 by @rithwik-db in #3850
- Cleanup version gating pre-2.6.0 by @rithwik-db in #3863
New Contributors
- @sirejdua-db made their first contribution in #3805
- @asfandyarq made their first contribution in #3841
- @omahs made their first contribution in #3853
Full Changelog: v0.30.0...v0.31.0