Releases: mosaicml/composer
v0.32.1
What's Changed
- Removed extraneous usage of `fsdp_config.load_monolith_rank0_only` since that's unreliable by @rithwik-db in #3901
- Fixed automicrobatching issue for FSDP1 by @rithwik-db in #3909
- reverted mlflow upgrade due to slowdowns by @ethantang-db in #3910
Full Changelog: v0.32.0...v0.32.1
v0.32.0
What's Changed
- Update FSDP checkpointing test to use UC Volumes and updated dockerfile for new composer version by @rithwik-db in #3865
- Using cu128 instead of cu126 for pr-gpu and daily tests by @rithwik-db in #3867
- Refactored auto-microbatching hook handles for FSDP by @rithwik-db in #3843
- Removed most s3 bucket based tests (replaced with UC Volumes) by @rithwik-db in #3869
- Supporting Mixed Init on FSDP2 by @rithwik-db in #3872
- Documentation Improvements: Clarify Explanations in Gated Linear Units and Squeeze-Excite README Files by @leopardracer in #3875
- Mlflow move to cpu by @dakinggg in #3878
- FSDP2 mixed init fixes by @rithwik-db in #3882
- Remove sklearn dep by @dakinggg in #3883
- Monolithic checkpointing by @rithwik-db in #3876
- Update docs conf.py copyright to 2025 by @jacobfulano in #3751
- Fix Typos in Comments for activation_monitor.py and mlperf.py by @kilavvy in #3877
- Mixed Precision for FSDP2 by @rithwik-db in #3884
- Add h200 to flops dict by @dakinggg in #3889
- updated fsdp2 config by @rithwik-db in #3896
- Supporting peft for FSDP2 by @rithwik-db in #3897
New Contributors
- @leopardracer made their first contribution in #3875
- @kilavvy made their first contribution in #3877
Full Changelog: v0.31.0...v0.32.0
v0.31.0
What's New
1. PyTorch 2.7.0 Compatibility (#3850)
We've added support for PyTorch 2.7.0 and created a Dockerfile to support PyTorch 2.7.0 + CUDA 12.8. The current Composer image supports PyTorch 2.7.0 + CUDA 12.6.3.
2. Experimental FSDP2 support has been added to Trainer (#3852)
Experimental FSDP2 support was added to Trainer with:
- `auto_wrap` based on `fsdp_wrap_fn` and/or `_fsdp_wrap` attributes within the model (#3826)
- Activation checkpointing and CPU offloading (#3832)
- Meta initialization (#3852)
Note: Not all features are supported yet (e.g. automicrobatching, monolithic checkpointing).
Usage:
Add `FSDP_VERSION=2` as an environment variable and set your FSDP2 config (`parallelism_config`) as desired. The full set of available attributes can be found here.
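As a minimal sketch of opting in, assuming the environment variable is read before the Trainer is constructed (the model and dataloader are placeholders, and the `fsdp` keys are illustrative rather than the definitive attribute set):

import os
os.environ['FSDP_VERSION'] = '2'  # opt in to the experimental FSDP2 path

from composer import Trainer

trainer = Trainer(
    model=model,  # placeholder: your ComposerModel
    train_dataloader=train_dataloader,  # placeholder
    parallelism_config={
        'fsdp': {
            # FSDP2 config attributes go here; see the link above for the full set
        },
    },
)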
Bug Fixes
- Resolved a memory hang issue in the MLflow monitor process (#3830)
What's Changed
- Bump Composer 0.31.0.dev0 by @KuuCi in #3808
- Update Checkpoint Back-Compatibility Test by @KuuCi in #3810
- Extend docker build matrix to add an entry for pytorch2.6+cu126 by @sirejdua-db in #3805
- Bump databricks-sdk from 0.47.0 to 0.49.0 by @dependabot in #3814
- Bump pypandoc from 1.14 to 1.15 by @dependabot in #3813
- Update google-cloud-storage requirement from <3.0,>=2.0.0 to >=2.0.0,<4.0 by @dependabot in #3812
- Update setuptools version by @irenedea in #3816
- Kickstart FSDP2 by @bowenyang008 in #3806
- Remove network calls to HF in CI by @dakinggg in #3817
- Update psutil requirement from <7,>=5.8.0 to >=5.8.0,<8 by @dependabot in #3818
- [FSDP2] Init FSDP2 based checkpointing by @bowenyang008 in #3824
- Update torchmetrics requirement from <1.6.1,>=1.0 to >=1.0,<1.7.2 by @dependabot in #3829
- Bump coverage[toml] from 7.6.8 to 7.8.0 by @dependabot in #3827
- Bump yamllint from 1.35.1 to 1.37.0 by @dependabot in #3820
- Update numpy requirement from <2.2.0,>=1.21.5 to >=1.21.5,<2.3.0 by @dependabot in #3828
- Update optimizer params for fsdp2 by @rithwik-db in #3822
- Change Mlflow monitor process from fork to spawn to reduce memory usage by @dakinggg in #3830
- Ignore mlflow warning in test by @dakinggg in #3831
- Bump HF hub version by @dakinggg in #3839
- Bump databricks-sdk from 0.49.0 to 0.50.0 by @dependabot in #3834
- Update transformers requirement from !=4.34.0,<4.51,>=4.11 to >=4.11,!=4.34.0,<4.52 by @dependabot in #3838
- Eliminate dead code before torch version 2.4 by @bowenyang008 in #3833
- Support submodule wrapping for FSDP2 according to model definition (with `_fsdp_wrap` and `fsdp_wrap_fn`) by @rithwik-db in #3826
- Activation Checkpointing and Offloading for FSDP2 by @rithwik-db in #3832
- Pin EFA installer version by @dakinggg in #3842
- Add two legacy torch images to the container build matrix by @asfandyarq in #3841
- Bump yamllint from 1.37.0 to 1.37.1 by @dependabot in #3845
- Update packaging requirement from <24.3,>=21.3.0 to >=21.3.0,<25.1 by @dependabot in #3846
- Bump cryptography from 44.0.0 to 44.0.3 by @dependabot in #3848
- Upgrade yapf version by @dakinggg in #3840
- Bump ipython from 8.11.0 to 8.36.0 by @dependabot in #3847
- Update huggingface-hub requirement from <0.31,>=0.21.2 to >=0.21.2,<0.32 by @dependabot in #3851
- Update EFA installer version by @dakinggg in #3844
- Fix typos by @omahs in #3853
- Integrate FSDP2 wrapper into Trainer by @bowenyang008 in #3852
- Deprecate code eval utils by @dakinggg in #3854
- FSDP2 time and verbose logging by @bowenyang008 in #3856
- Fix RDMA installation by @dakinggg in #3857
- Update ci-testing version to latest by @dakinggg in #3859
- Updating composer to support Torch 2.7 by @rithwik-db in #3850
- Cleanup version gating pre-2.6.0 by @rithwik-db in #3863
New Contributors
- @sirejdua-db made their first contribution in #3805
- @asfandyarq made their first contribution in #3841
- @omahs made their first contribution in #3853
Full Changelog: v0.30.0...v0.31.0
v0.30.0
What's New
1. Python 3.12 Bump (#3783)
We've added support for Python 3.12 and deprecated Python 3.9 support.
What's Changed
- Updated `test_fsdp_load_old_checkpoint` with 0.29.0 by @rithwik-db in #3771
- Mlflow rocm error by @KuuCi in #3775
- Update docker to have FA==2.7.4.post1 by @KuuCi in #3772
- [GRT-3415] Remove dead code for peft logging by @bowenyang008 in #3777
- Patch Mflow .trash directories by @KuuCi in #3778
- Remove TE ONNX Export Context to Enable TE FusedAttention on AMD Hardware by @jjuvonen-amd in #3779
- Update Makefile to use WORLD_SIZE by @irenedea in #3781
- Bump gitpython from 3.1.43 to 3.1.44 by @dependabot in #3785
- deprecate gcs test by @ethantang-db in #3791
- Update mosaicml-cli requirement from <0.7,>=0.5.25 to >=0.5.25,<0.8 by @dependabot in #3742
- Bump databricks-sdk from 0.44.1 to 0.47.0 by @dependabot in #3786
- deprecate ghcr by @KevDevSha in #3790
- Bump transformers by @dakinggg in #3793
- Bump Python 3.12 by @KuuCi in #3783
- Fix checkpoint loading in Pytorch 2.6.0 for ckpts exported before Pytorch 2.1.0 by @ethantang-db in #3792
- Update huggingface-hub requirement from <0.27,>=0.21.2 to >=0.21.2,<0.30 by @dependabot in #3795
- Update pytest-httpserver requirement from <1.1,>=1.0.4 to >=1.0.4,<1.2 by @dependabot in #3796
- Update scikit-learn requirement from <1.6,>=1.2.0 to >=1.2.0,<1.7 by @dependabot in #3799
- Bump Release Ref 0.3.3 by @KuuCi in #3804
- Remove huggyllama fixture by @dakinggg in #3807
- Fix release docker with 3.10 by @KuuCi in #3809
New Contributors
- @bowenyang008 made their first contribution in #3777
- @jjuvonen-amd made their first contribution in #3779
Full Changelog: v0.29.0...v0.30.0
v0.29.0
Deprecations
1. `device_transforms` param in `DataSpec` has been deprecated (#3770)
Composer no longer supports the `device_transforms` parameter in `DataSpec`. Instead, `DataSpec` supports `batch_transforms` for batch-level transformations on CPU and `microbatch_transforms` for microbatch-level transformations on the target device.
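As a minimal sketch, assuming each of the new parameters accepts a callable over the batch (the transform bodies and dataloader are placeholders):

from composer.core import DataSpec

def normalize_batch(batch):
    # placeholder: runs once per batch, on CPU, before device movement
    return batch

def augment_microbatch(batch):
    # placeholder: runs once per microbatch, on the target device
    return batch

data_spec = DataSpec(
    dataloader=train_dataloader,  # placeholder
    batch_transforms=normalize_batch,
    microbatch_transforms=augment_microbatch,
)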
What's Changed
- Add checkpoint BC tests for 0.27.0 and 0.28.0 by @snarayan21 in #3735
- Address sklearn device issues by @snarayan21 in #3748
- Update FAQ with hf-transfer info by @KuuCi in #3745
- Fix MLFlow logger CI error by ignoring UserWarning by @j316chuck in #3758
- Bump ci to v0.3.3 by @j316chuck in #3759
- Fix order of arguments to `loss` by @gsganden in #3754
- fix: make JSONTraceHandler.batch_end robust to /tmp/ being on diff mount to dest by @thundergolfer in #3766
- Bump pytorch to 2.6.0 by @rithwik-db in #3763
- Bump databricks-sdk from 0.38.0 to 0.44.1 by @dependabot in #3765
- Version bump to v0.30.0.dev0 by @rithwik-db in #3770
New Contributors
- @gsganden made their first contribution in #3754
- @thundergolfer made their first contribution in #3766
- @rithwik-db made their first contribution in #3763
Full Changelog: v0.28.0...v0.29.0
v0.28.0
Deprecations
1. Deepspeed Deprecation (#3732)
Composer no longer supports the Deepspeed deep learning library. Support has shifted exclusively to PyTorch-native solutions such as FSDP and DDP. Please use Composer v0.27.0 or earlier to continue using Deepspeed!
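If you still depend on Deepspeed, pin an older release, e.g. pip install 'mosaicml<0.28'.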
What's Changed
- Fix composer gpu daily test to use torch 2.5.1 by @j316chuck in #3712
- Bump coverage[toml] from 7.6.4 to 7.6.7 by @dependabot in #3713
- Update torchmetrics requirement from <1.5.3,>=1.0 to >=1.0,<1.6.1 by @dependabot in #3714
- Bump ubuntu 22.04 + fix CI mlflow tests by @KuuCi in #3716
- Bump databricks-sdk from 0.36.0 to 0.37.0 by @dependabot in #3715
- Bump mosaicml/pytorch images to use new mosaicml/pytorch images with updated ubuntu 22.04 by @KuuCi in #3718
- migrated all possible assets from GCP to repo by @ethantang-db in #3717
- Bump databricks-sdk from 0.37.0 to 0.38.0 by @dependabot in #3720
- Bump coverage[toml] from 7.6.7 to 7.6.8 by @dependabot in #3721
- Expose `DistributedSampler` RNG seed argument by @janEbert in #3724
- Fix netifaces install in Dockerfile by @j316chuck in #3726
- Update protobuf requirement from <5.29 to <5.30 by @dependabot in #3728
- Bump cryptography from 43.0.3 to 44.0.0 by @dependabot in #3731
- Speed up CI tests :) by @KuuCi in #3727
- Remove deepspeed completely by @snarayan21 in #3732
- Fix daily test failures by @snarayan21 in #3733
- Version bump to v0.29.0.dev0 by @snarayan21 in #3734
Full Changelog: v0.27.0...v0.28.0
v0.27.0
What's New
1. Torch 2.5.1 Compatibility (#3701)
We've added support for torch 2.5.1, including checkpointing bug fixes from PyTorch.
2. Add batch/microbatch transforms (#3703)
Sped up device transformations by performing batch transforms on CPU and microbatch transforms on GPU.
Deprecations and Breaking Changes
1. MLFlow Metrics Deduplication (#3678)
We added a metric de-duplication feature for the MLflow logger in Composer. Metrics that remain unchanged since the last step are not logged again; by default, an unchanged metric is still written once every 100 duplicated steps so that sampling stays intact. This optimizes logging storage by reducing redundant entries, balancing detailed sampling with efficiency.
Example:
`MLFlowLogger(..., log_duplicated_metric_every_n_steps=100)`
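As a minimal sketch of wiring this into a Trainer (the experiment name and model are placeholders):

from composer import Trainer
from composer.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name='my-experiment',  # placeholder
    log_duplicated_metric_every_n_steps=100,  # the default; raise it to deduplicate more aggressively
)

trainer = Trainer(
    model=model,  # placeholder: your ComposerModel
    loggers=[mlflow_logger],
)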
What's Changed
- Metrics dedup for MLflow logger by @chenmoneygithub in #3678
- Bump databricks-sdk from 0.33.0 to 0.36.0 by @dependabot in #3686
- Update pillow requirement from <11,>=10.3.0 to >=10.3.0,<12 by @dependabot in #3684
- Lower min torchmetrics version by @mvpatel2000 in #3691
- Private link error handling by @nancyhung in #3689
- Update checkpoint tests to use new version 0.26.0 by @irenedea in #3683
- Bump coverage[toml] from 7.6.3 to 7.6.4 by @dependabot in #3694
- Pin checkpoint state dict flattening patch by @b-chu in #3700
- Torch bump to 2.5.1 by @mvpatel2000 in #3701
- Fix typo in trainer doc by @XiaohanZhangCMU in #3702
- Update packaging requirement from <24.2,>=21.3.0 to >=21.3.0,<24.3 by @dependabot in #3707
- Update torchmetrics requirement from <1.4.1,>=1.0 to >=1.0,<1.5.3 by @dependabot in #3706
- Add batch/microbatch transforms by @mvpatel2000 in #3703
- Bump version to 0.28.0.dev0 by @j316chuck in #3709
- Add torch 2.5.1 composer tests by @j316chuck in #3710
Full Changelog: v0.26.1...v0.27.0
v0.26.1
v0.26.0
What's New
1. Torch 2.5.0 Compatibility (#3609)
We've added support for torch 2.5.0, including necessary patches to Torch.
Deprecations and Breaking Changes
1. FSDP Configuration Changes (#3681)
We no longer support passing `fsdp_config` and `fsdp_auto_wrap` directly to `Trainer`.
If you'd like to specify an FSDP config and configure FSDP auto wrapping, you should use `parallelism_config`:
trainer = Trainer(
    parallelism_config={
        'fsdp': {
            'auto_wrap': True,
            # ...
        },
    },
)
2. Removal of Pytorch Legacy Sharded Checkpoint Support (#3631)
PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have removed support for this format. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Add backward compatibility checkpoint tests for v0.25.0 by @dakinggg in #3635
- Don't use TP when `tensor_parallel_degree` is 1 by @eitanturok in #3636
- Update huggingface-hub requirement from <0.25,>=0.21.2 to >=0.21.2,<0.26 by @dependabot in #3637
- Update transformers requirement from !=4.34.0,<4.45,>=4.11 to >=4.11,!=4.34.0,<4.46 by @dependabot in #3638
- Bump databricks-sdk from 0.32.0 to 0.33.0 by @dependabot in #3639
- Remove Legacy Checkpointing by @mvpatel2000 in #3631
- Surface UC permission error by @b-chu in #3642
- Tensor Parallelism Tests by @eitanturok in #3620
- Switch to log.info for deterministic mode by @mvpatel2000 in #3643
- Update pre-commit requirement from <4,>=3.4.0 to >=3.4.0,<5 by @dependabot in #3645
- Update peft requirement from <0.13,>=0.10.0 to >=0.10.0,<0.14 by @dependabot in #3646
- Create callback to load checkpoint by @irenedea in #3641
- Bump jupyter from 1.0.0 to 1.1.1 by @dependabot in #3595
- Fix DB SDK Import by @mvpatel2000 in #3648
- Bump coverage[toml] from 7.6.0 to 7.6.3 by @dependabot in #3651
- Bump pypandoc from 1.13 to 1.14 by @dependabot in #3652
- Replace list with Sequence by @KuuCi in #3654
- Add better error handling for non-rank 0 during Monolithic Checkpoint Loading by @j316chuck in #3647
- Raising a better warning if train or eval did not process any data. by @ethantang-db in #3656
- Fix Logo by @XiaohanZhangCMU in #3659
- Update huggingface-hub requirement from <0.26,>=0.21.2 to >=0.21.2,<0.27 by @dependabot in #3668
- Bump cryptography from 42.0.8 to 43.0.3 by @dependabot in #3667
- Bump pytorch to 2.5.0 by @b-chu in #3663
- Don't overwrite sys.excepthook in mlflow logger by @dakinggg in #3675
- Fix pull request target by @b-chu in #3676
- Use a temp path to save local checkpoints for remote save path by @irenedea in #3673
- Loss gen tokens by @dakinggg in #3677
- Refactor `maybe_create_object_store_from_uri` by @irenedea in #3679
- Don't error if some batch slice has no loss generating tokens by @dakinggg in #3682
- Bump version to 0.27.0.dev0 by @irenedea in #3681
New Contributors
- @ethantang-db made their first contribution in #3656
Full Changelog: v0.25.0...v0.26.0
v0.25.0
What's New
1. Torch 2.4.1 Compatibility (#3609)
We've added support for torch 2.4.1, including necessary patches to Torch.
Deprecations and breaking changes
1. Microbatch device movement (#3567)
Instead of moving the entire batch to device at once, we now move each microbatch to device. This saves memory for large inputs, e.g. multimodal data, when training with many microbatches.
This change may affect callbacks that run operations on the batch and require it to already be on an accelerator, such as the two changed in this PR; a hedged sketch of the workaround follows below. There shouldn't be many such callbacks, so we anticipate this change will be relatively safe.
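As a hedged sketch, assuming the Callback event API and the `Device.batch_to_device` helper, an affected callback can move the batch explicitly (the callback itself is illustrative):

from composer.core import Callback, State
from composer.loggers import Logger

class BatchInspector(Callback):  # placeholder callback
    def after_dataloader(self, state: State, logger: Logger) -> None:
        # With microbatch device movement, state.batch may still be on CPU at
        # this point, so move it explicitly before any device-side operations.
        state.batch = state.device.batch_to_device(state.batch)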
2. DeepSpeed deprecation version (#3634)
We have updated the Composer version in which we will remove support for DeepSpeed to 0.27.0. Please reach out on GitHub if you have any concerns about this.
3. PyTorch legacy sharded checkpoint format
PyTorch briefly used a different sharded checkpoint format than the current one, which was quickly deprecated by PyTorch. We have continued to support loading legacy format checkpoints for a while, but we will likely be removing support for this format entirely in an upcoming release. We initially removed support for saving in this format in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.
What's Changed
- Set dev version back to 0.25.0.dev0 by @snarayan21 in #3582
- Microbatch Device Movement by @mvpatel2000 in #3567
- Init Dist Default None by @mvpatel2000 in #3585
- Explicit None Check in get_device by @mvpatel2000 in #3586
- Update protobuf requirement from <5.28 to <5.29 by @dependabot in #3591
- Bump databricks-sdk from 0.30.0 to 0.31.1 by @dependabot in #3592
- Update ci-testing to 0.2.2 by @dakinggg in #3590
- Bump Mellanox Tools by @mvpatel2000 in #3597
- Roll back ci-testing for daillies by @mvpatel2000 in #3598
- Revert driver changes by @mvpatel2000 in #3599
- Remove step in log_image for MLFlow by @mvpatel2000 in #3601
- Reduce system metrics logging frequency by @chenmoneygithub in #3604
- Bump databricks-sdk from 0.31.1 to 0.32.0 by @dependabot in #3608
- torch2.4.1 by @bigning in #3609
- Test with torch2.4.1 image by @bigning in #3610
- fix 2.4.1 test by @bigning in #3612
- Remove tensor option for _global_exception_occured by @irenedea in #3611
- Update error message for overwrite to be more user friendly by @mvpatel2000 in #3619
- Update wandb requirement from <0.18,>=0.13.2 to >=0.13.2,<0.19 by @dependabot in #3615
- Fix RNG key checking by @dakinggg in #3623
- Update datasets requirement from <3,>=2.4 to >=2.4,<4 by @dependabot in #3626
- Disable exceptions for MosaicML Logger by @mvpatel2000 in #3627
- Fix CPU dailies by @mvpatel2000 in #3628
- fix 2.4.1ckpt by @bigning in #3629
- More checkpoint debug logs by @mvpatel2000 in #3632
- Lower DeepSpeed deprecation version by @mvpatel2000 in #3634
- Bump version 25 by @dakinggg in #3633
Full Changelog: v0.24.1...v0.25.0