
Conversation

@ravi-mosaicml (Contributor) commented Nov 30, 2021

1. Before, the `dataloader_spec`, `batch_size`, and `dataloader_hparams` were passed as arguments into the trainer. Now, the trainer is initialized with a dataloader (or a dataloader, `split_fn`, and `preprocessing_fn` tuple). This change makes the `DataloaderSpec` optional and hidden from the user for simple datasets that do not require custom preprocessing or split functions.

2. Removed `dataloader_to_device` and replaced it with explicit calls in the training loop to 1) move the data onto the device and 2) execute the preprocessing fn, which is renamed to the device transformation fn. Removed the option to execute the device transformation fn in a CUDA stream, since that did not add any performance improvement. When memory pinning is used, `batch_to_device` should be a no-op, since the dataloader will have already moved the data onto the GPU. A rough sketch of both changes follows below.
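A rough sketch of both changes, assuming hypothetical names (`DataloaderArg`, `unpack_dataloader`, `run_batch`, and a `device` object with a `batch_to_device` method) rather than the final signatures:

```python
from typing import Callable, Optional, Tuple, Union

from torch.utils.data import DataLoader

# The trainer now accepts either a bare dataloader or a
# (dataloader, split_fn, device_transform_fn) tuple.
DataloaderArg = Union[DataLoader, Tuple[DataLoader, Callable, Callable]]


def unpack_dataloader(arg: DataloaderArg):
    """Normalize the argument into (dataloader, split_fn, device_transform_fn)."""
    if isinstance(arg, tuple):
        return arg
    return arg, None, None


def run_batch(batch, device, device_transform_fn: Optional[Callable]):
    """The two formerly-implicit per-batch steps, now explicit in the training loop."""
    # 1) Move the batch onto the device. Per the description above, this should be
    #    effectively a no-op when memory pinning is used.
    batch = device.batch_to_device(batch)
    # 2) Apply the optional device transformation fn on the device.
    if device_transform_fn is not None:
        batch = device_transform_fn(batch)
    return batch
```

Passing a bare `DataLoader` covers the simple case, while the tuple form keeps custom split and device transform behavior available.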

TODO:

- [ ] Regression test on the ResNet baseline to ensure no throughput or accuracy degradations

@jbloxham (Contributor) left a comment:

This looks good; I'm very happy to see `DataloaderSpec` losing favor. The only thing I wonder is whether `split_fn` and `device_transform_fn` could be removed. The former, I think, is unnecessary if we just load N microbatches instead of 1 batch, and the latter could be replaced with explicit augmentations?
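For context, a default `split_fn` would roughly just chunk a batch into microbatches. A minimal sketch, assuming an (inputs, targets) batch format and the hypothetical name `default_split_fn`:

```python
import torch


def default_split_fn(batch, num_microbatches: int):
    """Chunk an (inputs, targets) batch into num_microbatches microbatches."""
    inputs, targets = batch
    return list(zip(inputs.chunk(num_microbatches), targets.chunk(num_microbatches)))


# e.g. a batch of 256 samples -> 4 microbatches of 64 for gradient accumulation
batch = (torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
microbatches = default_split_fn(batch, num_microbatches=4)
assert len(microbatches) == 4 and microbatches[0][0].shape[0] == 64
```

Loading N microbatches directly from the dataloader, as suggested above, would make this step unnecessary.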

@ravi-mosaicml (Contributor, Author) commented:

Manually tested on a GPU instance. Throughput for a 2-wide box was 1290.

@abhi-mosaic (Contributor) commented Dec 3, 2021

I'm getting a bit confused about how `train_batch_size` is computed and saved; just to clarify:

- `Trainer` no longer knows the batch size at init. Instead, it looks at its device's dataloader and the world size, computes `train_batch_size`, and then creates `State` with this value.
- `TrainerHparams`, which has a field for `train_batch_size`, must carefully create each device dataloader with a device batch size of `train_batch_size / world_size`, so that when the value is re-derived by `Trainer` it comes out correct (a small sketch of this bookkeeping is below).

Also, can we rename `total_batch_size` -> `train_batch_size`? It's always weirded me out.
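A small sketch of that bookkeeping, with made-up numbers:

```python
world_size = 8                        # number of devices / training processes
train_batch_size = 2048               # global batch size from TrainerHparams
device_batch_size = train_batch_size // world_size  # 256 per device dataloader

# At init, the trainer only sees its device dataloader plus the world size,
# so it re-derives the global value that goes into State:
assert device_batch_size * world_size == train_batch_size
```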

@abhi-mosaic (Contributor) left a comment:

Discussed with @ravi-mosaicml and I think this looks good to me pending throughput sanity checks on ImageNet.

@ravi-mosaicml ravi-mosaicml merged commit 686aab9 into dev Dec 3, 2021
@ravi-mosaicml ravi-mosaicml deleted the ravi/dataloaders_in_trainer branch December 3, 2021 01:12
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022