Skip to content

Conversation

ravi-mosaicml
Copy link
Contributor

@ravi-mosaicml ravi-mosaicml commented Jun 3, 2022

The speed monitor sometimes returns negative times (example: https://wandb.ai/mosaic-ml/regression-v0.7.0-RC1/table?workspace=user-ravimosaicml)

  • Fixed by re-implementing the speed monitor to use the wall clock tracking built into state.timestamp and state.eval_timestamp. Using these variables ensures that all ranks have consistent timing information and simplifies the speed monitor implementation, as timestamp variables are already in duration units rather than wall clock timestamps
  • Added a validation dataloader and checks to ensure no negative values in the test
  • Logging wall clock time on every batch, rather than every epoch, to support NLP and single-epoch jobs
  • (Coming along for the ride): Added better tests to catch downstream effects of how loggers handle callbacks that log data.

The speed monitor sometimes returns negative times.

* Fixed by re-implementing the speed monitor to use the wall clock tracking built into `state.timestamp` and `state.eval_timestamp`. Using these variables ensures that all ranks have consistent timing information and simplifies the speed monitor implementation, as timestamp variables are already in duration units rather than wall clock timestamps
* Added a validation dataloader and checks to ensure no negative values in the test
* Logging wall clock time on every batch, rather than every epoch, to support NLP and single-epoch jobs
Copy link
Contributor

@moinnadeem moinnadeem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravi-mosaicml Good stuff! Do you have a example WandB report of what the speed monitor looks like now? It'll help evaluate the PR pretty easily -- the code looks good to me, but I want to check for downstream effects.

Copy link
Contributor

@hanlint hanlint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to approve, most comments above are on improving code readability.

@moinnadeem
Copy link
Contributor

@ravi-mosaicml Glad I checked for downstream effects -- I get the following error when testing this PR:

Traceback (most recent call last):
  File "/root/composer/examples/run_composer_trainer.py", line 70, in <module>
    _main()
  File "/root/composer/examples/run_composer_trainer.py", line 66, in _main
    trainer.fit()
  File "/root/composer/composer/trainer/trainer.py", line 1293, in fit
    self._train_loop()
  File "/root/composer/composer/trainer/trainer.py", line 1507, in _train_loop
    self.engine.run_event(Event.BATCH_END)
  File "/root/composer/composer/core/engine.py", line 241, in run_event
    self._run_callbacks(event)
  File "/root/composer/composer/core/engine.py", line 361, in _run_callbacks
    cb.run_event(event, self.state, self.logger)
  File "/root/composer/composer/core/callback.py", line 97, in run_event
    return event_cb(state, logger)
  File "/root/composer/composer/callbacks/speed_monitor.py", line 151, in batch_end
    logger.data_batch({
  File "/root/composer/composer/loggers/logger.py", line 208, in data_batch
    self.data(LogLevel.BATCH, data)
  File "/root/composer/composer/loggers/logger.py", line 133, in data
    destination.log_data(self._state, log_level, data)
  File "/root/composer/composer/loggers/wandb_logger.py", line 106, in log_data
    wandb.log(data, step=int(state.timestamp.batch))
  File "/usr/local/lib/python3.9/dist-packages/wandb/sdk/wandb_run.py", line 256, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/wandb/sdk/wandb_run.py", line 222, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/wandb/sdk/wandb_run.py", line 1548, in log
    self._log(data=data, step=step, commit=commit)
  File "/usr/local/lib/python3.9/dist-packages/wandb/sdk/wandb_run.py", line 1339, in _log
    self._partial_history_callback(data, step, commit)
  File "/usr/local/lib/python3.9/dist-packages/wandb/sdk/wandb_run.py", line 1228, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/usr/local/lib/python3.9/dist-packages/wandb/sdk/interface/interface.py", line 548, in publish_partial_history
    item.value_json = json_dumps_safer_history(v)
  File "/usr/local/lib/python3.9/dist-packages/wandb/util.py", line 820, in json_dumps_safer_history
    return json.dumps(obj, cls=WandBHistoryJSONEncoder, **kwargs)
  File "/usr/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/usr/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.9/dist-packages/wandb/util.py", line 787, in default
    return json.JSONEncoder.default(self, obj)
  File "/usr/lib/python3.9/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type timedelta is not JSON serializable

@moinnadeem moinnadeem self-requested a review June 7, 2022 16:32
Copy link
Contributor

@moinnadeem moinnadeem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, it works! Bellissimo, I like this behavior a lot better too.

@ravi-mosaicml ravi-mosaicml enabled auto-merge (squash) June 7, 2022 19:23
@ravi-mosaicml ravi-mosaicml merged commit feacfb7 into mosaicml:dev Jun 7, 2022
@ravi-mosaicml ravi-mosaicml deleted the fix_speed_monitor branch June 8, 2022 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants