Mlflow move to cpu #3878

dakinggg · 2025-06-14T00:28:17Z

What does this PR do?

Moves the MLflow logger's metrics cache to CPU so that it no longer holds on to CUDA tensors, which it has no need for. In particular, this allows the logger to be sent to a subprocess, which previously caused problems on some systems when expandable_segments was set to True.
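The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `MetricsCache` and `to_cpu_value` are hypothetical names, and the sketch relies only on tensor-like values exposing `detach()` and `cpu()`, as `torch.Tensor` does. It uses duck typing so that it runs without torch installed.

```python
from typing import Any, Dict


def to_cpu_value(value: Any) -> Any:
    """Return a host-side copy of a tensor-like value; pass anything else through."""
    # torch.Tensor exposes detach() and cpu(); checking for them by name
    # (duck typing) is an assumption made to keep this sketch torch-free.
    if hasattr(value, "detach") and hasattr(value, "cpu"):
        return value.detach().cpu()
    return value


class MetricsCache:
    """Hypothetical metrics cache that never stores device-resident values."""

    def __init__(self) -> None:
        self._latest: Dict[str, Any] = {}

    def update(self, metrics: Dict[str, Any]) -> None:
        # Convert at insertion time so the cache itself never references
        # GPU memory, whatever the caller passes in.
        for name, value in metrics.items():
            self._latest[name] = to_cpu_value(value)

    def snapshot(self) -> Dict[str, Any]:
        # Shallow copy so callers cannot mutate the cache's own dict.
        return dict(self._latest)
```

Converting on insertion rather than on read means any object that owns the cache holds only host-side data at all times, which is the property that matters when that object has to be handed to another process.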

Before (model registration fails): ipc-register-before-1-ilQ0bO
After (model registration succeeds): ipc-register-after-1-WSlNIv

Metrics and throughput are unchanged, in case they were somehow affected.

Daniel King added 5 commits June 13, 2025 17:20

do it

6af39b4

better fix

638b5cb

clean up

39cae70

fix

2b72fe8

pc

d9df39c

dakinggg marked this pull request as ready for review June 14, 2025 00:52

dakinggg requested a review from a team as a code owner June 14, 2025 00:52

dakinggg requested review from irenedea, rithwik-db and ethantang-db June 14, 2025 00:52

dakinggg enabled auto-merge (squash) June 14, 2025 00:52

bowenyang008 approved these changes Jun 16, 2025


dakinggg merged commit 44cda1f into mosaicml:main Jun 16, 2025
13 checks passed
