Training Loop Profiler #97

ravi-mosaicml · 2021-11-20T00:20:44Z

The Training Loop Profiler breaks down how much time is spent in each part of the training loop, and in each algorithm and callback for each event. It is implemented in the engine, which wraps the calls made for each event. It can also profile system performance metrics through psutil, and it incorporates the PyTorch profiler.
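As a rough illustration of the "wrap each event" idea, here is a minimal sketch. It is not the code in this PR: `TimingRecorder`, `ToyEngine`, and the toy components are hypothetical names, and the use of `time.perf_counter_ns()` is an assumption made only for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TimingRecorder:
    """Hypothetical recorder: accumulates nanoseconds per (event, component)."""

    def __init__(self):
        self.durations = defaultdict(int)

    @contextmanager
    def span(self, event, name):
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped call raises.
            self.durations[(event, name)] += time.perf_counter_ns() - start


class ToyEngine:
    """Hypothetical engine: times the whole event and each component in it."""

    def __init__(self, algorithms, callbacks, recorder):
        self.algorithms = algorithms
        self.callbacks = callbacks
        self.recorder = recorder

    def run_event(self, event, state):
        with self.recorder.span(event, "<total>"):
            for component in [*self.algorithms, *self.callbacks]:
                with self.recorder.span(event, type(component).__name__):
                    component.run_event(event, state)


class CountingAlgorithm:
    def run_event(self, event, state):
        state["calls"] = state.get("calls", 0) + 1


class SleepyCallback:
    def run_event(self, event, state):
        time.sleep(0.01)


recorder = TimingRecorder()
engine = ToyEngine([CountingAlgorithm()], [SleepyCallback()], recorder)
state = {}
engine.run_event("batch_start", state)

for (event, name), nanos in sorted(recorder.durations.items()):
    print(f"{event:>12} {name:<20} {nanos / 1e6:.3f} ms")
```

Keying the durations by `(event, component)` is what allows a per-event, per-callback breakdown; the `<total>` entry gives the time for the event as a whole.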

TODO:

Examples of the visualization

Closes #11. This PR helps clean up some of the tests and rank-zero callbacks, and will be used by future profiling work.

#65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__. This change will help with dataloader profiling, since the dataloader is now bound to the state immediately.

Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Closes #98. Depends on #85 and (for tests) #92.

2. Renamed the `MosaicProfiler` to `Profiler`

The PyTorch profiler uses `os.getpid()` for the thread ID, so the training loop profiler now does the same, which lets the events from both profilers interleave. The merge script now replaces the PID with the global rank. This ensures that GPU streams show up under the correct rank, since PyTorch by default uses the local GPU rank as the PID. It also ensures that traces merge properly across nodes, where PIDs could conflict.
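The PID rewrite can be sketched roughly as follows. This is an illustration, not the merge script from this PR: `merge_traces` and the sample events are hypothetical, and the only assumption about the file layout is the Chrome Trace Event object format (a top-level `traceEvents` list whose events carry `pid`, `tid`, and `ts`).

```python
import json


def merge_traces(traces_by_rank):
    """Merge per-rank traces, rewriting every event's pid to the global rank.

    traces_by_rank maps a global rank to a trace in object format, i.e.
    {"traceEvents": [...]}. The function name and signature are hypothetical.
    """
    merged = []
    for rank, trace in sorted(traces_by_rank.items()):
        for event in trace["traceEvents"]:
            event = dict(event)   # copy so the input trace is not mutated
            event["pid"] = rank   # the global rank is unique across nodes
            merged.append(event)
    # Metadata events may have no timestamp, so default to 0 when sorting.
    merged.sort(key=lambda e: e.get("ts", 0))
    return {"traceEvents": merged}


# Two ranks on different nodes whose processes both report the same local pid.
rank0 = {"traceEvents": [
    {"name": "batch", "ph": "X", "ts": 100, "dur": 50, "pid": 0, "tid": 4242},
    {"name": "backward", "ph": "X", "ts": 120, "dur": 20, "pid": 0, "tid": 4242},
]}
rank1 = {"traceEvents": [
    {"name": "batch", "ph": "X", "ts": 110, "dur": 45, "pid": 0, "tid": 4242},
]}

merged = merge_traces({0: rank0, 1: rank1})
print(json.dumps(merged, indent=2))
```

Without the rewrite, both ranks in this example would be drawn as one process in the trace viewer; with it, each rank gets its own row and the events stay ordered by timestamp.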

bandish-shah · 2022-01-11T00:40:27Z

Tried running as follows with the torch profiler disabled:
python3 -m composer.cli.launcher -n 1 examples/run_mosaic_profiler.py -f composer/yamls/models/classify_mnist.yaml --profiler true --profiler.profilers dataloader system --datadir ~/datasets

Getting the following error:

Epoch 1: 100%|██████████| 9/9 [00:00<00:00, 10.57it/s, loss/train=1.8302]
Epoch 2: 100%|██████████| 9/9 [00:00<00:00, 18.80it/s, loss/train=0.6712]
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 752, in answer_challenge
    message = connection.recv_bytes(256) # reject large message
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Waiting up to 30 seconds for all training processes to terminate...

Works fine if the torch profiler is enabled.

ravi-mosaicml · 2022-01-18T23:37:28Z

Confirmed that the dataloader error @bandish-shah saw is an issue with the PyTorch dataloader. I suspect we hit it with the profiler because we aren't spinning the dataloaders for as long. See pytorch/pytorch#1551 (comment).

bandish-shah

LGTM, thanks for addressing the various issues and debugging the dataloader error!

…aicml#105) Before mosaicml#65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algorithms, etc... to use DDP (such as barriers and reductions) as needed. mosaicml#97 and mosaicml#101 depend on this functionality. Also removed DDP from the state, as that is available globally.

* Added `run_event` to callback
  Closes #11. This PR helps clean up some of the tests and rank-zero callbacks, and will be used by future profiling work.
* Removed callback helper methods
* Fixed tests
* Formatting
* Addressed PR feedback
* Fixed tests
* Formatting
* Fixed _run_event
* Formatting
* Removed ip
* Instrumentation WIP
* Stash
* Create dataloader on trainer __init__()
  mosaicml#65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__. This change will help with dataloader profiling, since the dataloader is now bound to the state immediately.
* Stash
* Added JSON trace handler
* Formatting
* Fixed trace generation
* Prettified memory
* Fixed setup.py
* Changed setup.py
* testing
* Removed prepare
* Run Directory Uploader
  Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Closes mosaicml#98. Depends on mosaicml#85 and (for tests) mosaicml#92.
* Supporting both styles for callbacks
  Removed deferred logging since rank is now known at the init event
* Minimizing Diff
* Fixed tests
* Added fasteners
* Fixed tests
* Formatting
* Lazy population of kwargs
* 1. Added object_name_prefix
  2. Tested on Google Cloud Storage
  3. Added exponential backoff and retrying for transient errors
* Addressed PR feedback
* Remove the composer.trainer.ddp class
  Before mosaicml#65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algorithms, etc... to use DDP (such as barriers and reductions) as needed. mosaicml#97 and mosaicml#101 depend on this functionality. Also removed DDP from the state, as that is available globally.
* Added in DDP barrier
* Fixed tests
* Update composer/utils/ddp.py
* Update composer/utils/ddp.py
* Switched tqdm to using callback hooks
  Added test case for TQDM
* Fixed pyright
* Fixed DDP barriers
* Increased timeout for run directory uploader
* Switched callback format for run directory uploader
* Replaced `atexit` with cleanup methods
  When running the trainer multiple times, such as in interactive environments, `atexit` does not fire. Instead, replaced it with `.close()` and `.post_close()` hooks on callbacks. `.close()` can be used to write and flush files. `.post_close()` can be used to back up the run directory and capture any changes that may have been made on `.close()`
* Uncommented code
* Running callbacks before algorithms for the INIT event in the engine
  * For the INIT event, run the callbacks first to initialize the loggers.
  * For other events, run the algorithms first, so the callbacks have the state after algorithms modify it.
* Fixed tests
* Addressed PR feedback
* Added in the scheduler
* Added instant events
* Fixes
* Fixed profile scheduling
* Added decorator option
* Formatting
* Added documentation for the profiler
* 1. Added test cases
  2. Fixed trace files to be proper JSON on successful training runs
* Profiler entry point
* Ravi/instrumentation point (mosaicml#140)
  1. Using `os.getpid()` for process IDs to enable synchronization with the PyTorch profiler
  2. Switched to using object format instead of array format for the traces
  3. Added in extra metadata such as global rank and timestamps for clock syncing
* Writing metadata to a separate file
* Fixed tests
* Removed the perf counter
* Recording IO stats
* Log global rank in each torch profiler file
* Merging process traces (mosaicml#144)
* Refactor the system profiler and dataloader profiler into callbacks
  Configuring the PyTorch profiler based off of the mosaic profiler hparams
* 1. Updated the merge script to merge PyTorch trace files
  2. Renamed the `MosaicProfiler` to `Profiler`
* Increased timeout
* Formatting
* Fixed the `run_mosaic_profiler`
* Added detailed option
* Added sort index
* Setting `pid` to global rank and `tid` to `os.getpid()`
  The PyTorch profiler uses `os.getpid()` for the thread ID. Updating the training loop profiler to be consistent so the events will interleave. Updated the merge script to replace the PID with the global rank. This ensures that GPU streams will show up under the correct rank, since PyTorch by default uses the local GPU rank as the PID. This change also ensures that traces will merge properly across nodes where PIDs could conflict.
* Simplifying diff
* Put the backwards thread second
* Thread sorting in trace
* Fix
* Fixes
* Fixed tests
* Fixed the profiler
* Fixes

Co-authored-by: Jamie Bloxham
Co-authored-by: Bandish Shah
Co-authored-by: anisehsani

ravi-mosaicml added 18 commits November 15, 2021 14:33

Added run_event to callback

6357f2e

Closes #11. This PR helps clean up some of the tests and rank-zero callbacks, and will be used by future profiling work.

Removed callback helper methods

f395df4

Fixed tests

0f1aa69

Formatting

06cac4b

Addressed PR feedback

d886af6

Fixed tests

9644ad9

Formatting

cf5e533

Fixed _run_event

b1bf400

Merge branch 'dev' into ravi/run_event

9bffe3b

Formatting

4ed9f4f

Removed ip

75944eb

Instrumentation WIP

c5141c8

Stash

c052736

Merge branch 'ravi/create_dataloaders_in_init' into ravi/instrumentat…

21e4f19

…ion_point

Stash

e44c1a7

Added JSON trace handler

bf98e10

Formatting

3338cda

ravi-mosaicml requested a review from anisehsani November 20, 2021 00:20

ravi-mosaicml added 6 commits November 19, 2021 16:38

Fixed trace generation

726e8aa

Prettified memory

b645d93

Fixed setup.py

0c9bf46

Changed setup.py

1c899eb

testing

e077733

Removed prepare

60b3a6a

ravi-mosaicml requested a review from bandish-shah November 20, 2021 17:58

ravi-mosaicml added 4 commits November 22, 2021 08:43

Merge branch 'ravi/run_event' into ravi/libcloud

f2f4ede

Merge branch 'ravi/create_dataloaders_in_init' into ravi/libcloud

8bf1c67

Run Directory Uploader

8b3563e

Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Closes #98. Depends on #85 and (for tests) #92.

Merge branch 'dev' into ravi/run_event

c8ccb49

ravi-mosaicml added 12 commits December 10, 2021 08:28

1. Updated the merge script to merge pytorch trace files

f0844c2

2. Renamed the `MosaicProfiler` to `Profiler`

Increased timeout

8f9fd8f

Formatting

5b25de5

Fixed the run_mosaic_profiler

0c0fad9

Added detailed option

5f6ac8d

Added sort index

fc8aa07

Merge branch 'dev' into ravi/instrumentation_point

7441ad5

Merge branch 'dev' into ravi/instrumentation_point

57ed361

Simplifying diff

2532793

Put the backwards thread second

3d59c8a

Thread sorting in trace

965c53d

bandish-shah requested a review from dskhudia January 10, 2022 22:37

ravi-mosaicml added 9 commits January 11, 2022 14:20

Merge branch 'dev' into ravi/instrumentation_point

d43ad17

Fix

7b445d7

Fixes

b885a85

Fixed tests

f26d74e

Merge branch 'dev' into ravi/instrumentation_point

562d966

Fixed the profiler

d35307c

Merge branch 'dev' into ravi/instrumentation_point

6619acc

Merge branch 'dev' into ravi/instrumentation_point

998d5a3

Fixes

6e78a5d

bandish-shah approved these changes Jan 19, 2022

Merge branch 'dev' into ravi/instrumentation_point

3a37d06

hanlint merged commit de39bf6 into dev Jan 19, 2022

hanlint deleted the ravi/instrumentation_point branch January 19, 2022 23:05
