-
Notifications
You must be signed in to change notification settings - Fork 454
ProgressBarLogger UX Enhancements #1264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
TQDM Progress Bars streamed over the network do not display until a `\n` is written. In effect, this caused the progress bars not to show, until they were finished. This behavior defats the purpose of progress bars. This issue has been documented here: tqdm/tqdm#1319 This PR fixes this issue by attempting to detect whether we are in a K8S environment, and if so, then automatically write `\n` each time the progress bar is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- can you run the same progress bar examples as here: #1190
to check for edge cases?
This is a bit counter-intutive to me.. can we group all the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Really great work! Tested on MNIST in K8s and it works perfectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one UX question:
when eval_interval = 1ba
, i think the user expects the timestamp to be ba1, ba2, ba3.. not the batch_in_epoch
your current example emits:
eval/batch Epoch 0, Batch 1: 100%|█████████████████████████| 2/2 [00:00<00:00, 4.71ba/s]
eval/batch Epoch 0, Batch 2: 100%|█████████████████████████| 2/2 [00:00<00:00, 5.45ba/s]
train Epoch 0: 100%|█████████████████████████| 2/2 [00:01<00:00, 1.22ba/s, loss/train=2.3006]
The common use case here is max_duration=1ep, eval_interval=1ba
, which would lead to repeated Epoch 0
cluttering the progress bar.
could this be simplified (and also lead to a less wide pbar)?
Good point; showing the global batch count instead. |
* The ProgressBarLogger had a few bugs which caused progress bars to jump around on the terminal. This was because the `close` closed the training bar on eval end, and the position argument was set incorrectly for epoch-wise evaluation. * The dataloader label is included as part of the progress bar output. This is helpful if using multiple evaluators. * If evaluating mid-epoch, including the batch number as part of the progress bar label (see example below) * Cleaned up the implementation to remove the `self.is_train` variable and `self._current_pbar`, since these likely lead to problems with the jumping. * Use `dynamic_ncols` only if local, since k8s doesn't know the output terminal size. If on k8s, limit the max width to 120 characters. This ensures that the right-most remnants of a previous progress bar are not left over overriding position=1 with a new progress bar. * Verified that progress bars display correctly both when running locally and when running over mcli (this only took 5 hours to find a combination of parameters that worked for both lol)
mosaicml#1264 broke the progress bars in notebooks. It screwed up the formatting and caused an `io.UnsupportedOperation` error in Colab when calling `sys.stderr.fileno()`. This PR fixes these issues. Closes mosaicml#1312 Closes https://mosaicml.atlassian.net/browse/CO-770
#1264 broke the progress bars in notebooks. It screwed up the formatting and caused an io.UnsupportedOperation error in Colab when calling sys.stderr.fileno(). This PR fixes these issues. Closes #1312 Closes https://mosaicml.atlassian.net/browse/CO-770
Overview
close
closed the training bar on eval end, and the position argument was set incorrectly for epoch-wise evaluation.self.is_train
variable andself._current_pbar
, since these likely lead to problems with the jumping.dynamic_ncols
only if local, since k8s doesn't know the output terminal size. If on k8s, limit the max width to 120 characters. This ensures that the right-most remnants of a previous progress bar are not left over overriding position=1 with a new progress bar.I didn't include test cases to verify the fix, as it's hard to verify visual improvements.
Examples
The static examples below don't do full justice, since they show how the progress bars appear at the end. I'd recommend cloning this fork and running the yamls to verify that all the updating formats nicely.
Max duration in epochs
YAML:
Output:
Max duration in batches
YAML
Output:
Max duration in samples (and eval at fit end)
YAML:
Output:
Closes https://mosaicml.atlassian.net/browse/CO-633