Releases: mosaicml/composer
v0.32.1
What's Changed
- Removed extraneous usage of `fsdp_config.load_monolith_rank0_only` since that's unreliable by @rithwik-db in #3901
- Fixed automicrobatching issue for FSDP1 by @rithwik-db in #3909
- reverted mlflow upgrade due to slowdowns by @ethantang-db in #3910
Full Changelog: v0.32.0...v0.32.1
v0.32.0
What's Changed
- Update FSDP checkpointing test to use UC Volumes and updated dockerfile for new composer version by @rithwik-db in #3865
- Using cu128 instead of cu126 for pr-gpu and daily tests by @rithwik-db in #3867
- Refactored auto-microbatching hook handles for FSDP by @rithwik-db in #3843
- Removed most s3 bucket based tests (replaced with UC Volumes) by @rithwik-db in #3869
- Supporting Mixed Init on FSDP2 by @rithwik-db in #3872
- Documentation Improvements: Clarify Explanations in Gated Linear Units and Squeeze-Excite README Files by @leopardracer in #3875
- Mlflow move to cpu by @dakinggg in #3878
- FSDP2 mixed init fixes by @rithwik-db in #3882
- Remove sklearn dep by @dakinggg in #3883
- Monolithic checkpointing by @rithwik-db in #3876
- Update docs conf.py copyright to 2025 by @jacobfulano in #3751
- Fix Typos in Comments for activation_monitor.py and mlperf.py by @kilavvy in #3877
- Mixed Precision for FSDP2 by @rithwik-db in #3884
- Add h200 to flops dict by @dakinggg in #3889
- updated fsdp2 config by @rithwik-db in #3896
- Supporting peft for FSDP2 by @rithwik-db in #3897
New Contributors
- @leopardracer made their first contribution in #3875
- @kilavvy made their first contribution in #3877
Full Changelog: v0.31.0...v0.32.0
v0.31.0
What's New
1. PyTorch 2.7.0 Compatibility (#3850)
We've added support for PyTorch 2.7.0 and created a Dockerfile to support PyTorch 2.7.0 + CUDA 12.8. The current Composer image supports PyTorch 2.7.0 + CUDA 12.6.3.
2. Experimental FSDP2 support has been added to Trainer (#3852)
Experimental FSDP2 support was added to Trainer with:
- `auto_wrap` based on `fsdp_wrap_fn` and/or `_fsdp_wrap` attributes within the model (#3826)
- Activation checkpointing and CPU offloading (#3832)
- Meta initialization (#3852)
Note: Not all features are supported yet (e.g. automicrobatching, monolithic checkpointing).
Usage:
Add `FSDP_VERSION=2` as an environment variable and set your FSDP2 config (`parallelism_config`) as desired. The full set of available attributes can be found here.
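As a minimal sketch of opting in, assuming the environment variable is read before the Trainer is constructed (the model and dataloader are placeholders, and the `fsdp` keys are illustrative rather than the definitive attribute set):

import os
os.environ['FSDP_VERSION'] = '2'  # opt in to the experimental FSDP2 path

from composer import Trainer

trainer = Trainer(
    model=model,  # placeholder: your ComposerModel
    train_dataloader=train_dataloader,  # placeholder
    parallelism_config={
        'fsdp': {
            # FSDP2 config attributes go here; see the link above for the full set
        },
    },
)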
Bug Fixes
- Resolved a memory hang issue in the MLflow monitor process (#3830)
What's Changed
- Bump Composer 0.31.0.dev0 by @KuuCi in #3808
- Update Checkpoint Back-Compatibility Test by @KuuCi in #3810
- Extend docker build matrix to add an entry for pytorch2.6+cu126 by @sirejdua-db in #3805
- Bump databricks-sdk from 0.47.0 to 0.49.0 by @dependabot in #3814
- Bump pypandoc from 1.14 to 1.15 by @dependabot in #3813
- Update google-cloud-storage requirement from <3.0,>=2.0.0 to >=2.0.0,<4.0 by @dependabot in #3812
- Update setuptools version by @irenedea in #3816
- Kickstart FSDP2 by @bowenyang008 in #3806
- Remove network calls to HF in CI by @dakinggg in #3817
- Update psutil requirement from <7,>=5.8.0 to >=5.8.0,<8 by @dependabot in #3818
- [FSDP2] Init FSDP2 based checkpointing by @bowenyang008 in #3824
- Update torchmetrics requirement from <1.6.1,>=1.0 to >=1.0,<1.7.2 by @dependabot in #3829
- Bump coverage[toml] from 7.6.8 to 7.8.0 by @dependabot in #3827
- Bump yamllint from 1.35.1 to 1.37.0 by @dependabot in #3820
- Update numpy requirement from <2.2.0,>=1.21.5 to >=1.21.5,<2.3.0 by @dependabot in #3828
- Update optimizer params for fsdp2 by @rithwik-db in #3822
- Change Mlflow monitor process from fork to spawn to reduce memory usage by @dakinggg in #3830
- Ignore mlflow warning in test by @dakinggg in #3831
- Bump HF hub version by @dakinggg in #3839
- Bump databricks-sdk from 0.49.0 to 0.50.0 by @dependabot in #3834
- Update transformers requirement from !=4.34.0,<4.51,>=4.11 to >=4.11,!=4.34.0,<4.52 by @dependabot in #3838
- Eliminate dead code before torch version 2.4 by @bowenyang008 in #3833
- Support submodule wrapping for FSDP2 according to model definition (with `_fsdp_wrap` and `fsdp_wrap_fn`) by @rithwik-db in #3826
- Activation Checkpointing and Offloading for FSDP2 by @rithwik-db in #3832
- Pin EFA installer version by @dakinggg in #3842
- Add two legacy torch images to the container build matrix by @asfandyarq in #3841
- Bump yamllint from 1.37.0 to 1.37.1 by @dependabot in #3845
- Update packaging requirement from <24.3,>=21.3.0 to >=21.3.0,<25.1 by @dependabot in #3846
- Bump cryptography from 44.0.0 to 44.0.3 by @dependabot in #3848
- Upgrade yapf version by @dakinggg in #3840
- Bump ipython from 8.11.0 to 8.36.0 by @dependabot in #3847
- Update huggingface-hub requirement from <0.31,>=0.21.2 to >=0.21.2,<0.32 by @dependabot in #3851
- Update EFA installer version by @dakinggg in #3844
- Fix typos by @omahs in #3853
- Integrate FSDP2 wrapper into Trainer by @bowenyang008 in #3852
- Deprecate code eval utils by @dakinggg in #3854
- FSDP2 time and verbose logging by @bowenyang008 in #3856
- Fix RDMA installation by @dakinggg in #3857
- Update ci-testing version to latest by @dakinggg in #3859
- Updating composer to support Torch 2.7 by @rithwik-db in #3850
- Cleanup version gating pre-2.6.0 by @rithwik-db in #3863
New Contributors
- @sirejdua-db made their first contribution in #3805
- @asfandyarq made their first contribution in #3841
- @omahs made their first contribution in #3853
Full Changelog: v0.30.0...v0.31.0
v0.30.0
What's New
1. Python 3.12 Bump (#3783)
We've added support for Python 3.12 and deprecated Python 3.9 support.
What's Changed
- Updated `test_fsdp_load_old_checkpoint` with 0.29.0 by @rithwik-db in #3771
- Mlflow rocm error by @KuuCi in #3775
- Update docker to have FA==2.7.4.post1 by @KuuCi in #3772
- [GRT-3415] Remove dead code for peft logging by @bowenyang008 in #3777
- Patch Mflow .trash directories by @KuuCi in #3778
- Remove TE ONNX Export Context to Enable TE FusedAttention on AMD Hardware by @jjuvonen-amd in #3779
- Update Makefile to use WORLD_SIZE by @irenedea in #3781
- Bump gitpython from 3.1.43 to 3.1.44 by @dependabot in #3785
- deprecate gcs test by @ethantang-db in #3791
- Update mosaicml-cli requirement from <0.7,>=0.5.25 to >=0.5.25,<0.8 by @dependabot in #3742
- Bump databricks-sdk from 0.44.1 to 0.47.0 by @dependabot in #3786
- deprecate ghcr by @KevDevSha in #3790
- Bump transformers by @dakinggg in #3793
- Bump Python 3.12 by @KuuCi in #3783
- Fix checkpoint loading in Pytorch 2.6.0 for ckpts exported before Pytorch 2.1.0 by @ethantang-db in #3792
- Update huggingface-hub requirement from <0.27,>=0.21.2 to >=0.21.2,<0.30 by @dependabot in #3795
- Update pytest-httpserver requirement from <1.1,>=1.0.4 to >=1.0.4,<1.2 by @dependabot in #3796
- Update scikit-learn requirement from <1.6,>=1.2.0 to >=1.2.0,<1.7 by @dependabot in #3799
- Bump Release Ref 0.3.3 by @KuuCi in #3804
- Remove huggyllama fixture by @dakinggg in #3807
- Fix release docker with 3.10 by @KuuCi in #3809
New Contributors
- @bowenyang008 made their first contribution in #3777
- @jjuvonen-amd made their first contribution in #3779
Full Changelog: v0.29.0...v0.30.0
v0.29.0
Deprecations
1. `device_transforms` param in `DataSpec` has been deprecated (#3770)
Composer no longer supports the `device_transforms` parameter in `DataSpec`. Instead, `DataSpec` supports `batch_transforms` for batch-level transformations on CPU and `microbatch_transforms` for microbatch-level transformations on the target device.
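As a minimal sketch, assuming each of the new parameters accepts a callable over the batch (the transform bodies and dataloader are placeholders):

from composer.core import DataSpec

def normalize_batch(batch):
    # placeholder: runs once per batch, on CPU, before device movement
    return batch

def augment_microbatch(batch):
    # placeholder: runs once per microbatch, on the target device
    return batch

data_spec = DataSpec(
    dataloader=train_dataloader,  # placeholder
    batch_transforms=normalize_batch,
    microbatch_transforms=augment_microbatch,
)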
What's Changed
- Add checkpoint BC tests for 0.27.0 and 0.28.0 by @snarayan21 in #3735
- Address sklearn device issues by @snarayan21 in #3748
- Update FAQ with hf-transfer info by @KuuCi in #3745
- Fix MLFlow logger CI error by ignoring UserWarning by @j316chuck in #3758
- Bump ci to v0.3.3 by @j316chuck in #3759
- Fix order of arguments to `loss` by @gsganden in #3754
- fix: make JSONTraceHandler.batch_end robust to /tmp/ being on diff mount to dest by @thundergolfer in #3766
- Bump pytorch to 2.6.0 by @rithwik-db in #3763
- Bump databricks-sdk from 0.38.0 to 0.44.1 by @dependabot in #3765
- Version bump to v0.30.0.dev0 by @rithwik-db in #3770
New Contributors
- @gsganden made their first contribution in #3754
- @thundergolfer made their first contribution in #3766
- @rithwik-db made their first contribution in #3763
Full Changelog: v0.28.0...v0.29.0
v0.28.0
Deprecations
1. Deepspeed Deprecation (#3732)
Composer no longer supports the Deepspeed deep learning library. Support has shifted exclusively to PyTorch-native solutions such as FSDP and DDP. Please use Composer v0.27.0 or earlier to continue using Deepspeed!
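If you still depend on Deepspeed, pin an older release, e.g. pip install 'mosaicml<0.28'.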
What's Changed
- Fix composer gpu daily test to use torch 2.5.1 by @j316chuck in #3712
- Bump coverage[toml] from 7.6.4 to 7.6.7 by @dependabot in #3713
- Update torchmetrics requirement from <1.5.3,>=1.0 to >=1.0,<1.6.1 by @dependabot in #3714
- Bump ubuntu 22.04 + fix CI mlflow tests by @KuuCi in #3716
- Bump databricks-sdk from 0.36.0 to 0.37.0 by @dependabot in #3715
- Bump mosaicml/pytorch images to use new mosaicml/pytorch images with updated ubuntu 22.04 by @KuuCi in #3718
- migrated all possible assets from GCP to repo by @ethantang-db in #3717
- Bump databricks-sdk from 0.37.0 to 0.38.0 by @dependabot in #3720
- Bump coverage[toml] from 7.6.7 to 7.6.8 by @dependabot in #3721
- Expose `DistributedSampler` RNG seed argument by @janEbert in #3724
- Fix netifaces install in Dockerfile by @j316chuck in #3726
- Update protobuf requirement from <5.29 to <5.30 by @dependabot in #3728
- Bump cryptography from 43.0.3 to 44.0.0 by @dependabot in #3731
- Speed up CI tests :) by @KuuCi in #3727
- Remove deepspeed completely by @snarayan21 in #3732
- Fix daily test failures by @snarayan21 in #3733
- Version bump to v0.29.0.dev0 by @snarayan21 in #3734
Full Changelog: v0.27.0...v0.28.0
v0.27.0
What's New
1. Torch 2.5.1 Compatibility (#3701)
We've added support for torch 2.5.1, including checkpointing bug fixes from PyTorch.
2. Add batch/microbatch transforms (#3703)
Sped up device transformations by performing batch transforms on CPU and microbatch transforms on GPU.
Deprecations and Breaking Changes
1. MLFlow Metrics Deduplication (#3678)
We added a metric de-duplication feature for the MLflow logger in Composer. Metrics that remain unchanged since the last step are not logged again; by default, an unchanged metric is still written once every 100 duplicated steps so that sampling stays intact. This optimizes logging storage by reducing redundant entries, balancing detailed sampling with efficiency.
Example:
`MLFlowLogger(..., log_duplicated_metric_every_n_steps=100)`
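As a minimal sketch of wiring this into a Trainer (the experiment name and model are placeholders):

from composer import Trainer
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name='my-experiment',  # placeholder
    log_duplicated_metric_every_n_steps=100,  # the default; raise it to deduplicate more aggressively
)

trainer = Trainer(
    model=model,  # placeholder: your ComposerModel
    loggers=[mlflow_logger],
)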
What's Changed
- Metrics dedup for MLflow logger by @chenmoneygithub in #3678
- Bump databricks-sdk from 0.33.0 to 0.36.0 by @dependabot in #3686
- Update pillow requirement from <11,>=10.3.0 to >=10.3.0,<12 by @dependabot in #3684
- Lower min torchmetrics version by @mvpatel2000 in #3691
- Private link error handling by @nancyhung in #3689
- Update checkpoint tests to use new version 0.26.0 by @irenedea in #3683
- Bump coverage[toml] from 7.6.3 to 7.6.4 by @dependabot in #3694
- Pin checkpoint state dict flattening patch by @b-chu in #3700
- Torch bump to 2.5.1 by @mvpatel2000 in #3701
- Fix typo in trainer doc by @XiaohanZhangCMU in #3702
- Update packaging requirement from <24.2,>=21.3.0 to >=21.3.0,<24.3 by @dependabot in #3707
- Update torchmetrics requirement from <1.4.1,>=1.0 to >=1.0,<1.5.3 by @dependabot in #3706
- Add batch/microbatch transforms by @mvpatel2000 in #3703
- Bump version to 0.28.0.dev0 by @j316chuck in #3709
- Add torch 2.5.1 composer tests by @j316chuck in #3710
Full Changelog: v0.26.1...v0.27.0
v0.26.1
v0.26.0
What's New
1. Torch 2.5.0 Compatibility (#3609)
We've added support for torch 2.5.0, including necessary patches to Torch.
Deprecations and Breaking Changes
1. FSDP Configuration Changes (#3681)
We no longer support passing `fsdp_config` and `fsdp_auto_wrap` directly to `Trainer`.
If you'd like to specify an FSDP config and configure FSDP auto wrapping, you should use `parallelism_config`:
trainer = Trainer(
    parallelism_config={
        'fsdp': {
            'auto_wrap': True,
            # ...
        },
    },
)
2. Removal of Pytorch Legacy Sharded Checkpoint Support (#3631)
PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have removed support for this format. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Add backward compatibility checkpoint tests for v0.25.0 by @dakinggg in #3635
- Don't use TP when `tensor_parallel_degree` is 1 by @eitanturok in #3636
- Update huggingface-hub requirement from <0.25,>=0.21.2 to >=0.21.2,<0.26 by @dependabot in #3637
- Update transformers requirement from !=4.34.0,<4.45,>=4.11 to >=4.11,!=4.34.0,<4.46 by @dependabot in #3638
- Bump databricks-sdk from 0.32.0 to 0.33.0 by @dependabot in #3639
- Remove Legacy Checkpointing by @mvpatel2000 in #3631
- Surface UC permission error by @b-chu in #3642
- Tensor Parallelism Tests by @eitanturok in #3620
- Switch to log.info for deterministic mode by @mvpatel2000 in #3643
- Update pre-commit requirement from <4,>=3.4.0 to >=3.4.0,<5 by @dependabot in #3645
- Update peft requirement from <0.13,>=0.10.0 to >=0.10.0,<0.14 by @dependabot in #3646
- Create callback to load checkpoint by @irenedea in #3641
- Bump jupyter from 1.0.0 to 1.1.1 by @dependabot in #3595
- Fix DB SDK Import by @mvpatel2000 in #3648
- Bump coverage[toml] from 7.6.0 to 7.6.3 by @dependabot in #3651
- Bump pypandoc from 1.13 to 1.14 by @dependabot in #3652
- Replace list with Sequence by @KuuCi in #3654
- Add better error handling for non-rank 0 during Monolithic Checkpoint Loading by @j316chuck in #3647
- Raising a better warning if train or eval did not process any data. by @ethantang-db in #3656
- Fix Logo by @XiaohanZhangCMU in #3659
- Update huggingface-hub requirement from <0.26,>=0.21.2 to >=0.21.2,<0.27 by @dependabot in #3668
- Bump cryptography from 42.0.8 to 43.0.3 by @dependabot in #3667
- Bump pytorch to 2.5.0 by @b-chu in #3663
- Don't overwrite sys.excepthook in mlflow logger by @dakinggg in #3675
- Fix pull request target by @b-chu in #3676
- Use a temp path to save local checkpoints for remote save path by @irenedea in #3673
- Loss gen tokens by @dakinggg in #3677
- Refactor `maybe_create_object_store_from_uri` by @irenedea in #3679
- Don't error if some batch slice has no loss generating tokens by @dakinggg in #3682
- Bump version to 0.27.0.dev0 by @irenedea in #3681
New Contributors
- @ethantang-db made their first contribution in #3656
Full Changelog: v0.25.0...v0.26.0
v0.25.0
What's New
1. Torch 2.4.1 Compatibility (#3609)
We've added support for torch 2.4.1, including necessary patches to Torch.
Deprecations and breaking changes
1. Microbatch device movement (#3567)
Instead of moving the entire batch to device at once, we now move each microbatch to device. This saves memory for large inputs, e.g. multimodal data, when training with many microbatches.
This change may affect callbacks that run operations on the batch and require it to already be on an accelerator, such as the two changed in this PR; a hedged sketch of the workaround follows below. There shouldn't be many such callbacks, so we anticipate this change will be relatively safe.
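As a hedged sketch, assuming the Callback event API and the `Device.batch_to_device` helper, an affected callback can move the batch explicitly (the callback itself is illustrative):

from composer.core import Callback, State
from composer.loggers import Logger

class BatchInspector(Callback):  # placeholder callback
    def after_dataloader(self, state: State, logger: Logger) -> None:
        # With microbatch device movement, state.batch may still be on CPU at
        # this point, so move it explicitly before any device-side operations.
        state.batch = state.device.batch_to_device(state.batch)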
2. DeepSpeed deprecation version (#3634)
We have updated the Composer version in which we will remove support for DeepSpeed to 0.27.0. Please reach out on GitHub if you have any concerns about this.
3. PyTorch legacy sharded checkpoint format
PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have continued to support loading legacy format checkpoints for a while, but we will likely be removing support for this format entirely in an upcoming release. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Set dev version back to 0.25.0.dev0 by @snarayan21 in #3582
- Microbatch Device Movement by @mvpatel2000 in #3567
- Init Dist Default None by @mvpatel2000 in #3585
- Explicit None Check in get_device by @mvpatel2000 in #3586
- Update protobuf requirement from <5.28 to <5.29 by @dependabot in #3591
- Bump databricks-sdk from 0.30.0 to 0.31.1 by @dependabot in #3592
- Update ci-testing to 0.2.2 by @dakinggg in #3590
- Bump Mellanox Tools by @mvpatel2000 in #3597
- Roll back ci-testing for daillies by @mvpatel2000 in #3598
- Revert driver changes by @mvpatel2000 in #3599
- Remove step in log_image for MLFlow by @mvpatel2000 in #3601
- Reduce system metrics logging frequency by @chenmoneygithub in #3604
- Bump databricks-sdk from 0.31.1 to 0.32.0 by @dependabot in #3608
- torch2.4.1 by @bigning in #3609
- Test with torch2.4.1 image by @bigning in #3610
- fix 2.4.1 test by @bigning in #3612
- Remove tensor option for _global_exception_occured by @irenedea in #3611
- Update error message for overwrite to be more user friendly by @mvpatel2000 in #3619
- Update wandb requirement from <0.18,>=0.13.2 to >=0.13.2,<0.19 by @dependabot in #3615
- Fix RNG key checking by @dakinggg in #3623
- Update datasets requirement from <3,>=2.4 to >=2.4,<4 by @dependabot in #3626
- Disable exceptions for MosaicML Logger by @mvpatel2000 in #3627
- Fix CPU dailies by @mvpatel2000 in #3628
- fix 2.4.1ckpt by @bigning in #3629
- More checkpoint debug logs by @mvpatel2000 in #3632
- Lower DeepSpeed deprecation version by @mvpatel2000 in #3634
- Bump version 25 by @dakinggg in #3633
Full Changelog: v0.24.1...v0.25.0