@dakinggg dakinggg commented Jun 14, 2025

What does this PR do?

Changes the MLflow logger's metrics cache so that it no longer keeps CUDA tensors around, since it doesn't need them. In particular, this allows the logger to be sent to a subprocess, which was previously an issue on some systems when expandable_segments was set to True.
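For readers unfamiliar with the pattern, here is a minimal sketch of the general idea, not the actual diff in this PR. It assumes the cached metrics are scalars: anything exposing `.item()` (as a one-element `torch.Tensor` does) is converted to a plain Python number before it is stored, so the cache holds no device memory and pickles cleanly. `MetricsCache` and `to_cacheable` are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def to_cacheable(value: Any) -> Any:
    """Return a plain Python number for scalar tensor-like values.

    Anything exposing a callable ``.item()`` (for example a one-element
    torch.Tensor) is converted; for a CUDA tensor this copies the value
    to host memory. Other values are returned unchanged.
    """
    item = getattr(value, "item", None)
    return item() if callable(item) else value


class MetricsCache:
    """Hypothetical stand-in for a logger's metrics cache."""

    def __init__(self) -> None:
        self._latest: Dict[str, Any] = {}

    def update(self, metrics: Dict[str, Any]) -> None:
        # Convert on the way in, so no device tensor is ever stored.
        for name, value in metrics.items():
            self._latest[name] = to_cacheable(value)

    def latest(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the cache.
        return dict(self._latest)
```

With a cache like this, calling `cache.update({"loss": loss_tensor})` stores a float rather than the tensor, so an object holding the cache can be pickled and handed to a subprocess without involving CUDA IPC.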

Before (model registration fails): ipc-register-before-1-ilQ0bO
After (model registration succeeds): ipc-register-after-1-WSlNIv

Metrics and throughput are unchanged, in case they were somehow affected:
Screenshot 2025-06-13 at 5 43 47 PM
Screenshot 2025-06-13 at 5 44 03 PM

@dakinggg dakinggg marked this pull request as ready for review June 14, 2025 00:52
@dakinggg dakinggg requested a review from a team as a code owner June 14, 2025 00:52
@dakinggg dakinggg enabled auto-merge (squash) June 14, 2025 00:52
@dakinggg dakinggg merged commit 44cda1f into mosaicml:main Jun 16, 2025
13 checks passed