Change Mlflow monitor process from fork to spawn to reduce memory usage #3830

dakinggg · 2025-04-23T01:55:37Z

What does this PR do?

The MLflow monitor process, previously created with fork, was holding on to memory for the whole model (initialized in the main process, and not freed until FSDP wrapping, which occurs after the monitor process is started). This PR resolves that by using spawn instead of fork for the monitor process.
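As a rough illustration of the difference (a minimal sketch, not the code in this PR): a forked child starts as a copy-on-write copy of the parent's address space, so anything the parent has already allocated remains referenced by the child, whereas a spawned child is a fresh interpreter that receives only the pickled target and its arguments. The names `monitor_loop` and `make_monitor_process`, and the polling parameters, are hypothetical.

```python
import multiprocessing
import time


def monitor_loop(poll_interval: float, max_polls: int) -> None:
    """Hypothetical stand-in for a monitor: sleeps between polls, then exits."""
    for _ in range(max_polls):
        time.sleep(poll_interval)


def make_monitor_process(poll_interval: float = 0.01, max_polls: int = 3):
    """Build (but do not start) a monitor process using the spawn start method."""
    # A spawn context starts a fresh interpreter for the child, so the child
    # does not inherit the parent's heap the way a forked child would.
    spawn_context = multiprocessing.get_context("spawn")
    return spawn_context.Process(
        target=monitor_loop,
        args=(poll_interval, max_polls),
    )


if __name__ == "__main__":
    # Assumption: this buffer plays the role of a large model that was
    # initialized in the main process before the monitor was started.
    large_buffer = bytearray(16 * 1024 * 1024)

    proc = make_monitor_process()
    proc.start()
    proc.join()
    print(f"monitor exited with code {proc.exitcode}")
```

The `if __name__ == "__main__":` guard matters with spawn, because the child re-imports the main module and must not start another monitor while doing so.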

Before and after memory usage:

Large model run that was hanging previously now works: 405b-mlf-after-1-Y26NwD
Manual test of error run: 70b-mlf-after-2-8OZEl7

Statuses all look correct:

Copilot

Pull Request Overview

This PR fixes excess memory usage in the MLflow monitor process, and the resulting hang on large model runs, by switching its start method from fork to spawn so that the monitor no longer holds on to the full model's memory.

- Updated process creation to use the spawn context via `spawn_context.Process`.
- Replaced `multiprocessing.Event` with `threading.Event` in the `run` method.
- Revised signal handling by introducing `SIGUSR1` and `SIGUSR2` for normal and crash exits, respectively.
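The signal-based exit described above might look roughly like the following sketch. This is an assumption about the general pattern, not the code in `mlflow_logger.py`: the class name `MonitorSketch`, its methods, and the returned status strings are all hypothetical. The idea is that the handlers run inside the monitor process itself, so a process-local `threading.Event` is enough to wake its loop. It is POSIX only, since `SIGUSR1` and `SIGUSR2` are not available on Windows.

```python
import os
import signal
import threading


class MonitorSketch:
    """Hypothetical monitor loop; not Composer's actual implementation."""

    def __init__(self) -> None:
        # The event is set and read inside one process, so threading.Event
        # suffices; nothing has to be shared across the process boundary.
        self.exit_event = threading.Event()
        self.crashed = False

    def handle_normal_exit(self, signum, frame) -> None:
        # Assumed convention: SIGUSR1 means the run finished normally.
        self.crashed = False
        self.exit_event.set()

    def handle_crash_exit(self, signum, frame) -> None:
        # Assumed convention: SIGUSR2 means the run crashed.
        self.crashed = True
        self.exit_event.set()

    def install_handlers(self) -> None:
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGUSR1, self.handle_normal_exit)
        signal.signal(signal.SIGUSR2, self.handle_crash_exit)

    def run(self, poll_interval: float = 0.1, max_polls: int = 50) -> str:
        """Poll until an exit signal arrives; return a hypothetical status."""
        for _ in range(max_polls):
            if self.exit_event.wait(timeout=poll_interval):
                return "FAILED" if self.crashed else "FINISHED"
        return "TIMED_OUT"


if __name__ == "__main__":
    monitor = MonitorSketch()
    monitor.install_handlers()
    # Simulate the main process reporting a crash to the monitor.
    os.kill(os.getpid(), signal.SIGUSR2)
    print(monitor.run())
```

In a real setup the parent would send the signal to the monitor's PID rather than to itself; the self-signal here only keeps the sketch in a single process.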

composer/loggers/mlflow_logger.py

irenedea

LGTM! thanks, spawn ftw

composer/loggers/mlflow_logger.py

Daniel King added 8 commits April 22, 2025 10:16

- full logs (4a7f32b)
- more logs (92c0424)
- no monitor (7cc4b3a)
- no monitor (5b220fb)
- put it back (a6c83d5)
- spawn (d4f5566)
- maybe (695df96)
- pc (9ce079a)

dakinggg force-pushed the mlflow-stuck branch from 6f75b52 to 9ce079a Compare April 23, 2025 02:55

Daniel King added 4 commits April 22, 2025 20:07

- remove logs (4ba8003)
- pyright (fbb6a4a)
- cause an error (fea4594)
- remove error (396853b)

dakinggg requested a review from Copilot April 23, 2025 17:03

Copilot AI reviewed Apr 23, 2025

dakinggg marked this pull request as ready for review April 23, 2025 17:27

dakinggg requested a review from a team as a code owner April 23, 2025 17:27

dakinggg requested review from bowenyang008 and irenedea April 23, 2025 17:27

dakinggg changed the title ~~Mlflow hang~~ Change Mlflow monitor process from fork to spawn to reduce memory usage Apr 23, 2025

irenedea approved these changes Apr 23, 2025

dakinggg merged commit 4447b29 into mosaicml:main Apr 23, 2025
14 checks passed
