Conversation

@ravi-mosaicml (Contributor) commented Jul 7, 2022

Overview

  • The ProgressBarLogger had a few bugs which caused progress bars to jump around in the terminal. This was because `close` closed the training bar on eval end, and the position argument was set incorrectly for epoch-wise evaluation.
  • The dataloader label is included as part of the progress bar output. This is helpful if using multiple evaluators.
  • If evaluating mid-epoch, the batch number is included as part of the progress bar label (see example below).
  • Cleaned up the implementation to remove the `self.is_train` variable and `self._current_pbar`, since these likely contributed to the jumping.
  • Use `dynamic_ncols` only if local, since k8s doesn't know the output terminal size. If on k8s, limit the max width to 120 characters. This ensures that the right-most remnants of a previous progress bar are not left behind when position=1 is reused for a new progress bar (a sketch of this configuration follows the list).
  • Verified that progress bars display correctly both when running locally and when running over mcli (this only took 5 hours to find a combination of parameters that worked for both lol)
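
For reference, a minimal sketch of the width/position handling described in the list above. The `build_pbar` helper and the `is_local` flag are illustrative assumptions, not the PR's actual code:

```python
from tqdm import tqdm

def build_pbar(total: int, label: str, position: int, is_local: bool) -> tqdm:
    """Hypothetical helper illustrating the width/position handling above."""
    if is_local:
        # Locally, let tqdm track the terminal width as it changes.
        return tqdm(total=total, desc=label, position=position,
                    unit='ba', dynamic_ncols=True, leave=True)
    # On k8s, the output terminal size is unknown, so cap the width at 120
    # columns; otherwise a wider, earlier bar can leave stale characters
    # behind when a new bar is drawn at position=1.
    return tqdm(total=total, desc=label, position=position,
                unit='ba', ncols=120, leave=True)
```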

I didn't include test cases to verify the fix, as it's hard to verify visual improvements.

Examples

The static examples below don't do the change full justice, since they only show how the progress bars appear at the end. I'd recommend cloning this fork and running the YAMLs to verify that all of the in-place updating formats nicely.

Max duration in epochs

YAML:

train_dataset:
  &train_dataset
  mnist:
    use_synthetic: true
evaluators:
  eval1:
    label: eval/batch
    eval_interval: 1ba
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
  eval2:
    label: eval/ep
    eval_interval: 1ep
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
model:
  mnist_classifier:
    num_classes: 10
max_duration: 2ep
train_batch_size: 2000
eval_batch_size: 2000
dataloader:
  pin_memory: true
  timeout: 0
  prefetch_factor: 2
  persistent_workers: true
  num_workers: 1
train_subset_num_batches: 2

Output:

eval/batch     Batch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.65ba/s]          
eval/batch     Batch   2:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.66ba/s]          
train          Epoch   0:  100%|█████████████████████████| 2/2 [00:01<00:00,  1.41ba/s, loss/trai
eval/ep        Epoch   0:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.33ba/s]          
eval/batch     Batch   3:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.74ba/s]          
eval/batch     Batch   4:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.23ba/s]          
train          Epoch   1:  100%|█████████████████████████| 2/2 [00:01<00:00,  1.53ba/s, loss/trai
eval/ep        Epoch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.97ba/s] 

Max duration in batches

YAML

train_dataset:
  &train_dataset
  mnist:
    use_synthetic: true
evaluators:
  eval1:
    label: eval/batch
    eval_interval: 4ba
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
  eval2:
    label: eval/ep
    eval_interval: 1ep
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
model:
  mnist_classifier:
    num_classes: 10
max_duration: 20ba
train_batch_size: 2000
eval_batch_size: 2000
dataloader:
  pin_memory: true
  timeout: 0
  prefetch_factor: 2
  persistent_workers: true
  num_workers: 1
train_subset_num_batches: 10

Output:

eval/batch     Batch      4:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.18ba/s]       
eval/batch     Batch      8:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.23ba/s]       
eval/ep        Batch     10:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.30ba/s]       
eval/batch     Batch     12:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.16ba/s]       
eval/batch     Batch     16:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.73ba/s]       
eval/batch     Batch     20:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.53ba/s]       
eval/ep        Batch     20:  100%|█████████████████████████| 2/2 [00:00<00:00,  6.10ba/s]       
train                         100%|█████████████████████████| 20/20 [00:10<00:00,  2.08ba/s, loss

Max duration in samples (and eval at fit end)

YAML:

train_dataset:
  &train_dataset
  mnist:
    use_synthetic: true
evaluators:
  eval1:
    label: eval/batch
    eval_interval: 1ba
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
  eval2:
    label: eval/ep
    eval_interval: 1ep
    subset_num_batches: 2
    eval_dataset:
      <<: *train_dataset
model:
  mnist_classifier:
    num_classes: 10
max_duration: 6000sp
train_batch_size: 2000
eval_batch_size: 2000
dataloader:
  pin_memory: true
  timeout: 0
  prefetch_factor: 2
  persistent_workers: true
  num_workers: 1
train_subset_num_batches: 2

Output:

eval/batch     Sample  2000:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.03ba/s]                                                                                                           
eval/batch     Sample  4000:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.17ba/s]                                                                                                           
eval/ep        Sample  4000:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.26ba/s]                                                                                                           
eval/batch     Sample  6000:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.23ba/s]                                                                                                           
eval/ep        Sample  6000:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.10ba/s]                                                                                                           
train                         100%|█████████████████████████| 6000/6000 [00:03<00:00, 1798.40sp/s, loss/train=2.2975]

Closes https://mosaicml.atlassian.net/browse/CO-633

TQDM progress bars streamed over the network do not display until a `\n` is written. In effect, this caused the progress bars not to show until they were finished, which defeats the purpose of progress bars. This issue has been documented here: tqdm/tqdm#1319

This PR fixes this issue by attempting to detect whether we are in a K8S environment and, if so, automatically writing a `\n` each time the progress bar is updated.
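
A minimal sketch of this workaround, assuming a simple environment-variable heuristic for k8s detection and a hypothetical stream wrapper; the PR's actual detection and flushing logic may differ:

```python
import os
import sys

def _is_k8s() -> bool:
    # Illustrative heuristic: Kubernetes injects this variable into every pod.
    return 'KUBERNETES_SERVICE_HOST' in os.environ

class NewlineFlushingStream:
    """Wraps a stream so each progress bar update ends with a newline,
    forcing network-streamed output (e.g. k8s log streaming) to display."""

    def __init__(self, stream):
        self._stream = stream

    def write(self, text: str) -> int:
        written = self._stream.write(text)
        if _is_k8s() and not text.endswith('\n'):
            written += self._stream.write('\n')
        return written

    def flush(self) -> None:
        self._stream.flush()

# Usage sketch: tqdm accepts a file-like object via `file=`.
# pbar = tqdm(total=10, file=NewlineFlushingStream(sys.stderr), ncols=120)
```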

@hanlint (Contributor) left a comment

Looks good -- can you run the same progress bar examples as here (#1190) to check for edge cases?

@hanlint (Contributor) commented Jul 7, 2022

Epoch     0 eval/batch      100%|█████████████████████████| 4/4 [00:00<00:00,  5.12ba/s]                                                                                                                        
Epoch     0 train           100%|█████████████████████████| 3/3 [00:03<00:00,  1.32s/ba, loss/train=2.2866]                                                                                                     
Epoch     0 eval/ep         100%|█████████████████████████| 4/4 [00:00<00:00,  4.73ba/s]

This is a bit counter-intuitive to me... can we group all the eval bars together?

@ravi-mosaicml ravi-mosaicml requested a review from moinnadeem July 7, 2022 06:36
@ravi-mosaicml ravi-mosaicml marked this pull request as ready for review July 7, 2022 06:37
@ravi-mosaicml ravi-mosaicml requested a review from hanlint July 7, 2022 07:20

@siriuslee (Contributor) left a comment

LGTM. Really great work! Tested on MNIST in K8s and it works perfectly.

@hanlint (Contributor) left a comment

LGTM, one UX question:

when eval_interval = 1ba, I think the user expects the timestamp to be ba1, ba2, ba3..., not the batch_in_epoch your current example emits:

eval/batch     Epoch   0, Batch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.71ba/s]                                                                                                   
eval/batch     Epoch   0, Batch   2:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.45ba/s]                                                                                                   
train          Epoch   0:             100%|█████████████████████████| 2/2 [00:01<00:00,  1.22ba/s, loss/train=2.3006]                                       

The common use case here is max_duration=1ep, eval_interval=1ba, which would lead to repeated Epoch 0 cluttering the progress bar.

Could this be simplified (and also lead to a less wide pbar)?

@ravi-mosaicml (Contributor, Author)

> LGTM, one UX question:
>
> when eval_interval = 1ba, I think the user expects the timestamp to be ba1, ba2, ba3..., not the batch_in_epoch your current example emits:
>
> eval/batch     Epoch   0, Batch   1:  100%|█████████████████████████| 2/2 [00:00<00:00,  4.71ba/s]
> eval/batch     Epoch   0, Batch   2:  100%|█████████████████████████| 2/2 [00:00<00:00,  5.45ba/s]
> train          Epoch   0:             100%|█████████████████████████| 2/2 [00:01<00:00,  1.22ba/s, loss/train=2.3006]
>
> The common use case here is max_duration=1ep, eval_interval=1ba, which would lead to repeated Epoch 0 cluttering the progress bar.
>
> Could this be simplified (and also lead to a less wide pbar)?

Good point; showing the global batch count instead.
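
For illustration, a hypothetical sketch of how the bar's left-hand label could be built from the global counters. The function name, the `timestamp` attributes, and the field widths are assumptions, not the PR's exact code:

```python
def eval_pbar_prefix(dataloader_label: str, timestamp, unit: str) -> str:
    """Illustrative only: choose the counter shown next to the evaluator label.

    `timestamp` is assumed to expose epoch/batch/sample counters, similar to
    Composer's Timestamp; the real implementation may differ."""
    if unit == 'ep':
        counter = f'Epoch {int(timestamp.epoch):>4}'
    elif unit == 'sp':
        counter = f'Sample {int(timestamp.sample):>6}'
    else:
        # Per the discussion above: show the global batch count, not
        # batch-in-epoch, so max_duration=1ep / eval_interval=1ba does not
        # print a repeated "Epoch 0" for every eval.
        counter = f'Batch {int(timestamp.batch):>6}'
    return f'{dataloader_label:<15}{counter}:'
```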

@ravi-mosaicml ravi-mosaicml enabled auto-merge (squash) July 7, 2022 21:21
@ravi-mosaicml ravi-mosaicml merged commit e2e94c8 into mosaicml:dev Jul 7, 2022
@ravi-mosaicml ravi-mosaicml deleted the CO-633 branch July 7, 2022 22:21
ravi-mosaicml added a commit that referenced this pull request Jul 16, 2022
* The ProgressBarLogger had a few bugs which caused progress bars to jump around in the terminal. This was because `close` closed the training bar on eval end, and the position argument was set incorrectly for epoch-wise evaluation.
* The dataloader label is included as part of the progress bar output. This is helpful if using multiple evaluators.
* If evaluating mid-epoch, the batch number is included as part of the progress bar label (see example below).
* Cleaned up the implementation to remove the `self.is_train` variable and `self._current_pbar`, since these likely contributed to the jumping.
* Use `dynamic_ncols` only if local, since k8s doesn't know the output terminal size. If on k8s, limit the max width to 120 characters. This ensures that the right-most remnants of a previous progress bar are not left behind when position=1 is reused for a new progress bar.
* Verified that progress bars display correctly both when running locally and when running over mcli (this only took 5 hours to find a combination of parameters that worked for both lol)
ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this pull request Jul 25, 2022
mosaicml#1264 broke the progress bars in notebooks. It screwed up the formatting and caused an `io.UnsupportedOperation` error in Colab when calling `sys.stderr.fileno()`.

This PR fixes these issues.

Closes mosaicml#1312
Closes https://mosaicml.atlassian.net/browse/CO-770
ravi-mosaicml added a commit that referenced this pull request Jul 25, 2022
#1264 broke the progress bars in notebooks. It screwed up the formatting and caused an `io.UnsupportedOperation` error in Colab when calling `sys.stderr.fileno()`.

This PR fixes these issues.

Closes #1312
Closes https://mosaicml.atlassian.net/browse/CO-770