Skip to content

Conversation

dakinggg
Copy link
Contributor

@dakinggg dakinggg commented Apr 23, 2025

What does this PR do?

The spawned mlflow monitor process was hanging on to memory for the whole model (initialized in the main process, and not freed until FSDP wrapping, which occurs after the monitor process is started). This resolves that by using spawn instead of fork for the monitor process.

Before and after memory usage:
Screenshot 2025-04-22 at 8 39 28 PM

Large model run that was hanging previously now works: 405b-mlf-after-1-Y26NwD
Manual test of error run: 70b-mlf-after-2-8OZEl7

Statuses all look correct:
Screenshot 2025-04-23 at 10 01 53 AM

@dakinggg dakinggg requested a review from Copilot April 23, 2025 17:03
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR resolves a memory hang issue in the MLflow monitor process by switching from fork to spawn, ensuring that the full model memory is released at the correct time.

  • Updated process creation to use spawn context via spawn_context.Process.
  • Replaced multiprocessing.Event with threading.Event in the run method.
  • Revised signal handling by introducing SIGUSR1 and SIGUSR2 for normal and crash exits, respectively.

@dakinggg dakinggg marked this pull request as ready for review April 23, 2025 17:27
@dakinggg dakinggg requested a review from a team as a code owner April 23, 2025 17:27
@dakinggg dakinggg changed the title Mlflow hang Change Mlflow monitor process from fork to spawn to reduce memory usage Apr 23, 2025
Copy link
Contributor

@irenedea irenedea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks, spawn ftw

@dakinggg dakinggg merged commit 4447b29 into mosaicml:main Apr 23, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants