
Conversation

Contributor

@ravi-mosaicml ravi-mosaicml commented Jan 13, 2022

1. Removed the data spec from state! Instead, the trainer sets `batch_num_samples`, `batch_num_tokens`, `microbatches`, and `microbatch_idx`, so these fields are accessible to algorithms that need the pertinent information the data spec previously provided.
2. Removed `last_batch_size` from state. Replaced it with `dist.all_reduce(state.batch_num_samples)` where it was used.
3. Removed `train_batch_size` and `eval_batch_size` from state, as algorithms should not depend on constant batch sizing. Replaced them with `state.train_dataloader.batch_size * dist.get_world_size()` in the few places where they were used.
4. Added a `get_device_of_batch` helper function, which is required for part 2, since `dist.all_reduce` requires tensors to be placed on the device that `torch.distributed` was initialized with; see the sketch after this list.
5. Fixed the type annotations for `ensure_tuple` to support `get_device_of_batch`.
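For reference, here is a minimal sketch of what items 2-4 look like in isolation. It assumes `torch.distributed` is used directly rather than Composer's `dist` helpers, and `global_last_batch_size` is a hypothetical name made up for this example, not something the PR adds:

```python
from typing import Any

import torch
import torch.distributed as dist


def get_device_of_batch(batch: Any) -> torch.device:
    """Return the device of the first tensor found in ``batch``.

    ``batch`` may be a tensor or a (possibly nested) tuple/list/dict of
    tensors. Simplified stand-in for the helper added in item 4.
    """
    if isinstance(batch, torch.Tensor):
        return batch.device
    if isinstance(batch, dict):
        batch = batch.values()
    for item in batch:
        return get_device_of_batch(item)
    raise ValueError("no tensors found in batch")


def global_last_batch_size(state: Any) -> int:
    """Item 2: replace the removed ``state.last_batch_size`` with an
    all-reduce of the per-rank sample count. The tensor must live on the
    device that the distributed backend was initialized with."""
    num_samples = torch.tensor(
        [state.batch_num_samples],
        dtype=torch.int64,
        device=get_device_of_batch(state.batch),
    )
    dist.all_reduce(num_samples, op=dist.ReduceOp.SUM)
    return int(num_samples.item())


# Item 3: where a constant global batch size is still needed, compute it as
# state.train_dataloader.batch_size * dist.get_world_size()
```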

Contributor

@jbloxham jbloxham left a comment


Looks good to me! Glad to see the state getting simplified.

@ravi-mosaicml ravi-mosaicml merged commit 0da611f into dev Jan 18, 2022
@ravi-mosaicml ravi-mosaicml deleted the ravi/cleanup_state branch January 18, 2022 17:29
ravi-mosaicml added a commit that referenced this pull request Jan 20, 2022
1. #223 introduced a bug where algorithms that ran on the `AFTER_DATALOADER` event and replaced `state.batch` (rather than modifying it in place) did not also update `state.microbatches` (which was what training actually used), so these algorithms were effectively ignored. Fixed this bug by computing the microbatches AFTER the `Event.AFTER_DATALOADER` event; see the sketch after this list.

2. Removed `microbatches` and `microbatch_idx` from the state. Instead, algorithms that need to run on smaller batch sizes should use the `Event.BATCH_START` event instead of `Event.AFTER_DATALOADER`, since `Event.BATCH_START` will see the forward-pass-sized batch.
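For reference, a minimal sketch of the corrected ordering in item 1. The `run_batch` helper, `engine`, `split_batch`, and `grad_accum` names below are simplified stand-ins for illustration, not Composer's actual trainer code:

```python
from typing import Any, Callable, Sequence


def run_batch(
    state: Any,
    engine: Any,
    batch: Any,
    split_batch: Callable[[Any, int], Sequence[Any]],
    grad_accum: int,
) -> Sequence[Any]:
    """Order of operations after the fix in item 1.

    The AFTER_DATALOADER algorithms run first, and only afterwards is
    state.batch split into microbatches, so algorithms that replace
    state.batch (rather than mutating it in place) are no longer ignored.
    """
    state.batch = batch

    # Algorithms may return a brand-new state.batch on this event.
    engine.run_event("AFTER_DATALOADER")

    # Split AFTER the event, so any replacement of state.batch is picked up.
    return split_batch(state.batch, grad_accum)
```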
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022