Conversation

@ravi-mosaicml (Contributor) commented Jul 7, 2022

Overview

  • The ProgressBarLogger had a few bugs which caused progress bars to jump around in the terminal. This was because `close` closed the training bar on eval end, and the position argument was set incorrectly for epoch-wise evaluation.
  • The dataloader label is included as part of the progress bar output. This is helpful if using multiple evaluators.
  • If evaluating mid-epoch, the batch number is included as part of the progress bar label (see example below).
  • Cleaned up the implementation to remove the `self.is_train` variable and `self._current_pbar`, since these likely contributed to the jumping.
  • Use `dynamic_ncols` only if local, since k8s doesn't know the output terminal size. If on k8s, limit the max width to 120 characters. This ensures that the right-most remnants of a previous progress bar are not left behind when position=1 is reused for a new progress bar (a sketch of this configuration follows the list).
  • Verified that progress bars display correctly both when running locally and when running over mcli (this only took 5 hours to find a combination of parameters that worked for both lol)
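
For reference, a minimal sketch of the width/position handling described in the list above. The `build_pbar` helper and the `is_local` flag are illustrative assumptions, not the PR's actual code:

```python
from tqdm import tqdm

def build_pbar(total: int, label: str, position: int, is_local: bool) -> tqdm:
    """Hypothetical helper illustrating the width/position handling above."""
    if is_local:
        # Locally, let tqdm track the terminal width as it changes.
        return tqdm(total=total, desc=label, position=position,
                    unit='ba', dynamic_ncols=True, leave=True)
    # On k8s, the output terminal size is unknown, so cap the width at 120
    # columns; otherwise a wider, earlier bar can leave stale characters
    # behind when a new bar is drawn at position=1.
    return tqdm(total=total, desc=label, position=position,
                unit='ba', ncols=120, leave=True)
```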

I didn't include test cases to verify the fix, as it's hard to verify visual improvements.

Examples

The static examples below don't do the change full justice, since they only show how the progress bars appear at the end. I'd recommend cloning this fork and running the YAMLs to verify that all of the in-place updating formats nicely.

Max duration in epochs

YAML:

train_dataset:
  &train_dataset
  mnist:
    use_synthetic: true
evaluators:
  eval1:
    label: eval/batch
    eval_interval: 1ba
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
  eval2:
    label: eval/ep
    eval_interval: 1ep
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
model:
  mnist_classifier:
    num_classes: 10
max_duration: 2ep
train_batch_size: 2000
eval_batch_size: 2000
dataloader:
  pin_memory: true
  timeout: 0
  prefetch_factor: 2
  persistent_workers: true
  num_workers: 1
train_subset_num_batches: 2

Output:

eval/batch     Batch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.65ba/s]          
eval/batch     Batch   2:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.66ba/s]          
train          Epoch   0:  100%|█████████████████████████| 2/2 [00:01<00:00,  1.41ba/s, loss/trai
eval/ep        Epoch   0:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.33ba/s]          
eval/batch     Batch   3:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.74ba/s]          
eval/batch     Batch   4:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.23ba/s]          
train          Epoch   1:  100%|█████████████████████████| 2/2 [00:01<00:00,  1.53ba/s, loss/trai
eval/ep        Epoch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.97ba/s] 

Max duration in batches

YAML

train_dataset:
  &train_dataset
  mnist:
    use_synthetic: true
evaluators:
  eval1:
    label: eval/batch
    eval_interval: 4ba
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
  eval2:
    label: eval/ep
    eval_interval: 1ep
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
model:
  mnist_classifier:
    num_classes: 10
max_duration: 20ba
train_batch_size: 2000
eval_batch_size: 2000
dataloader:
  pin_memory: true
  timeout: 0
  prefetch_factor: 2
  persistent_workers: true
  num_workers: 1
train_subset_num_batches: 10

Output:

eval/batch     Batch      4:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.18ba/s]       
eval/batch     Batch      8:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.23ba/s]       
eval/ep        Batch     10:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.30ba/s]       
eval/batch     Batch     12:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.16ba/s]       
eval/batch     Batch     16:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.73ba/s]       
eval/batch     Batch     20:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.53ba/s]       
eval/ep        Batch     20:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.10ba/s]       
train                         100%|█████████████████████████| 20/20 [00:10<00:00,  2.08ba/s, loss

Max duration in samples (and eval at fit end)

YAML:

train_dataset:
  &train_dataset
  mnist:
    use_synthetic: true
evaluators:
  eval1:
    label: eval/batch
    eval_interval: 1ba
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
  eval2:
    label: eval/ep
    eval_interval: 1ep
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
model:
  mnist_classifier:
    num_classes: 10
max_duration: 6000sp
train_batch_size: 2000
eval_batch_size: 2000
dataloader:
  pin_memory: true
  timeout: 0
  prefetch_factor: 2
  persistent_workers: true
  num_workers: 1
train_subset_num_batches: 2

Output:

eval/batch     Sample  2000:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.03ba/s]                                                                                                           
eval/batch     Sample  4000:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.17ba/s]                                                                                                           
eval/ep        Sample  4000:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.26ba/s]                                                                                                           
eval/batch     Sample  6000:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.23ba/s]                                                                                                           
eval/ep        Sample  6000:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.10ba/s]                                                                                                           
train                         100%|█████████████████████████| 6000/6000 [00:03<00:00, 1798.40sp/s, loss/train=2.2975]

Closes https://mosaicml.atlassian.net/browse/CO-633

TQDM progress bars streamed over the network do not display until a `\n` is written. In effect, this caused the progress bars not to show until they were finished, which defeats the purpose of progress bars. This issue has been documented here: tqdm/tqdm#1319

This PR fixes this issue by attempting to detect whether we are in a K8S environment and, if so, automatically writing a `\n` each time the progress bar is updated.
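
A minimal sketch of this workaround, assuming a simple environment-variable heuristic for k8s detection and a hypothetical stream wrapper; the PR's actual detection and flushing logic may differ:

```python
import os
import sys

def _is_k8s() -> bool:
    # Illustrative heuristic: Kubernetes injects this variable into every pod.
    return 'KUBERNETES_SERVICE_HOST' in os.environ

class NewlineFlushingStream:
    """Wraps a stream so each progress bar update ends with a newline,
    forcing network-streamed output (e.g. k8s log streaming) to display."""

    def __init__(self, stream):
        self._stream = stream

    def write(self, text: str) -> int:
        written = self._stream.write(text)
        if _is_k8s() and not text.endswith('\n'):
            written += self._stream.write('\n')
        return written

    def flush(self) -> None:
        self._stream.flush()

# Usage sketch: tqdm accepts a file-like object via `file=`.
# pbar = tqdm(total=10, file=NewlineFlushingStream(sys.stderr), ncols=120)
```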

@hanlint (Contributor) left a comment

Looks good -- can you run the same progress bar examples as here (#1190) to check for edge cases?

@hanlint (Contributor) commented Jul 7, 2022

Epoch     0 eval/batch      100%|█████████████████████████| 4/4 [00:00<00:00,  5.12ba/s]                                                                                                                        
Epoch     0 train           100%|█████████████████████████| 3/3 [00:03<00:00,  1.32s/ba, loss/train=2.2866]                                                                                                     
Epoch     0 eval/ep         100%|█████████████████████████| 4/4 [00:00<00:00,  4.73ba/s]

This is a bit counter-intuitive to me... can we group all the eval bars together?

@ravi-mosaicml ravi-mosaicml requested a review from moinnadeem July 7, 2022 06:36
@ravi-mosaicml ravi-mosaicml marked this pull request as ready for review July 7, 2022 06:37
@ravi-mosaicml ravi-mosaicml requested a review from hanlint July 7, 2022 07:20

@siriuslee (Contributor) left a comment

LGTM. Really great work! Tested on MNIST in K8s and it works perfectly.

@hanlint (Contributor) left a comment

LGTM, one UX question:

when eval_interval = 1ba, I think the user expects the timestamp to be ba1, ba2, ba3..., not the batch_in_epoch your current example emits:

eval/batch     Epoch   0, Batch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.71ba/s]                                                                                                   
eval/batch     Epoch   0, Batch   2:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.45ba/s]                                                                                                   
train          Epoch   0:             100%|█████████████████████████| 2/2 [00:01<00:00,  1.22ba/s, loss/train=2.3006]                                       

The common use case here is max_duration=1ep, eval_interval=1ba, which would lead to repeated Epoch 0 cluttering the progress bar.

Could this be simplified (and also lead to a less wide pbar)?

@ravi-mosaicml (Contributor, Author)

> LGTM, one UX question:
>
> when eval_interval = 1ba, I think the user expects the timestamp to be ba1, ba2, ba3..., not the batch_in_epoch your current example emits:
>
> eval/batch     Epoch   0, Batch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.71ba/s]
> eval/batch     Epoch   0, Batch   2:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.45ba/s]
> train          Epoch   0:             100%|█████████████████████████| 2/2 [00:01<00:00,  1.22ba/s, loss/train=2.3006]
>
> The common use case here is max_duration=1ep, eval_interval=1ba, which would lead to repeated Epoch 0 cluttering the progress bar.
>
> Could this be simplified (and also lead to a less wide pbar)?

Good point; showing the global batch count instead.
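
For illustration, a hypothetical sketch of how the bar's left-hand label could be built from the global counters. The function name, the `timestamp` attributes, and the field widths are assumptions, not the PR's exact code:

```python
def eval_pbar_prefix(dataloader_label: str, timestamp, unit: str) -> str:
    """Illustrative only: choose the counter shown next to the evaluator label.

    `timestamp` is assumed to expose epoch/batch/sample counters, similar to
    Composer's Timestamp; the real implementation may differ."""
    if unit == 'ep':
        counter = f'Epoch {int(timestamp.epoch):>4}'
    elif unit == 'sp':
        counter = f'Sample {int(timestamp.sample):>6}'
    else:
        # Per the discussion above: show the global batch count, not
        # batch-in-epoch, so max_duration=1ep / eval_interval=1ba does not
        # print a repeated "Epoch 0" for every eval.
        counter = f'Batch {int(timestamp.batch):>6}'
    return f'{dataloader_label:<15}{counter}:'
```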

@ravi-mosaicml ravi-mosaicml enabled auto-merge (squash) July 7, 2022 21:21
@ravi-mosaicml ravi-mosaicml merged commit e2e94c8 into mosaicml:dev Jul 7, 2022
@ravi-mosaicml ravi-mosaicml deleted the CO-633 branch July 7, 2022 22:21
ravi-mosaicml added a commit that referenced this pull request Jul 16, 2022
* The ProgressBarLogger had a few bugs which caused progress bars to jump around in the terminal. This was because `close` closed the training bar on eval end, and the position argument was set incorrectly for epoch-wise evaluation.
* The dataloader label is included as part of the progress bar output. This is helpful if using multiple evaluators.
* If evaluating mid-epoch, the batch number is included as part of the progress bar label (see example below).
* Cleaned up the implementation to remove the `self.is_train` variable and `self._current_pbar`, since these likely contributed to the jumping.
* Use `dynamic_ncols` only if local, since k8s doesn't know the output terminal size. If on k8s, limit the max width to 120 characters. This ensures that the right-most remnants of a previous progress bar are not left behind when position=1 is reused for a new progress bar.
* Verified that progress bars display correctly both when running locally and when running over mcli (this only took 5 hours to find a combination of parameters that worked for both lol)
ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this pull request Jul 25, 2022
mosaicml#1264 broke the progress bars in notebooks. It screwed up the formatting and caused an `io.UnsupportedOperation` error in Colab when calling `sys.stderr.fileno()`.

This PR fixes these issues.

Closes mosaicml#1312
Closes https://mosaicml.atlassian.net/browse/CO-770
ravi-mosaicml added a commit that referenced this pull request Jul 25, 2022
#1264 broke the progress bars in notebooks. It screwed up the formatting and caused an `io.UnsupportedOperation` error in Colab when calling `sys.stderr.fileno()`.

This PR fixes these issues.

Closes #1312
Closes https://mosaicml.atlassian.net/browse/CO-770