`composer/algorithms/seq_length_warmup/README.md` (20 changes: 10 additions & 10 deletions)
Sequence Length Warmup linearly increases the sequence length (number of tokens per training example) over the course of training.
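
As a rough illustration of the linear schedule (the helper below and its argument names are hypothetical, not Composer's implementation), the sequence length at any point in the warmup can be computed from the fraction of the warmup that has elapsed:

```python
def warmup_seq_length(progress: float,
                      min_seq_length: int = 8,
                      max_seq_length: int = 1024,
                      step_multiple: int = 8) -> int:
    """Linearly interpolate the sequence length for a warmup progress in [0, 1],
    rounding down to a multiple of ``step_multiple`` (illustrative helper)."""
    progress = min(max(progress, 0.0), 1.0)
    seq_length = min_seq_length + progress * (max_seq_length - min_seq_length)
    seq_length = int(seq_length) // step_multiple * step_multiple
    return max(min_seq_length, min(max_seq_length, seq_length))

# Halfway through the warmup, sequences are roughly half the maximum length.
assert warmup_seq_length(0.5) == 512
```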

### Functional Interface

Composer's functional API exposes `cf.set_batch_sequence_length`, which shortens each batch to a target sequence length. The example below calls it inside a standard training loop, growing the sequence length by a fixed step on every batch until the maximum sequence length is reached.

```python
import torch
import torch.nn.functional as F

from composer import functional as cf

def training_loop(model, train_loader):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = F.cross_entropy
    model.train()

    num_epochs = 1
    max_seq_length = 1024
    curr_seq_length = 8
    seq_length_step_size = 8

    # In this example, the warmup schedule increases the sequence length by 8 on
    # every batch, capping it once it reaches the maximum sequence length.
    for epoch in range(num_epochs):
        for X, y in train_loader:
            curr_seq_length = min(max_seq_length, curr_seq_length + seq_length_step_size)
            X = cf.set_batch_sequence_length(X, curr_seq_length)
            y_hat = model(X)
            loss = loss_fn(y_hat, y)
            loss.backward()
            opt.step()
            opt.zero_grad()
```

### Composer Trainer

To apply Sequence Length Warmup with the Composer Trainer, pass a `SeqLengthWarmup` instance to the trainer's `algorithms` argument; the trainer then runs the warmup schedule automatically during `fit()`.

```python
from composer.trainer import Trainer
from composer.algorithms import SeqLengthWarmup

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  max_duration='1ep',
                  algorithms=[SeqLengthWarmup()])

trainer.fit()
```
We implement this as a pre-processing step that modifies each batch during the forward pass at training time.
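
As a sketch of what that pre-processing might look like (the helper below is illustrative and assumes the batch is a dict of equal-length token tensors; it is not Composer's actual implementation):

```python
import torch

def shorten_batch(batch: dict, curr_seq_length: int) -> dict:
    """Illustrative pre-processing: truncate every 2-D tensor in the batch to
    the current warmup sequence length before the forward pass."""
    return {key: (value[:, :curr_seq_length] if value.ndim > 1 else value)
            for key, value in batch.items()}

# A toy batch of 4 sequences of 1024 tokens, shortened to 256 tokens each.
batch = {
    'input_ids': torch.randint(0, 50257, (4, 1024)),
    'attention_mask': torch.ones(4, 1024, dtype=torch.long),
}
assert shorten_batch(batch, 256)['input_ids'].shape == (4, 256)
```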

## Suggested Hyperparameters

We found that running Sequence Length Warmup for 30% of training (i.e., setting `duration=0.3`) provided the largest speedup that could still maintain full model quality on GPT-2 125M. We also recommend keeping the sequence length a multiple of eight at every point in the warmup in order to take advantage of hardware acceleration, such as Tensor Cores.
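
As a concrete example, a trainer configured with those recommendations might look like the following sketch (the constructor arguments `duration`, `min_seq_length`, and `max_seq_length` reflect our reading of the algorithm's API and should be checked against the current documentation; `model` and `train_dataloader` are assumed to be defined as above):

```python
from composer.trainer import Trainer
from composer.algorithms import SeqLengthWarmup

# Warm up over the first 30% of training, stepping through sequence lengths
# that stay multiples of eight (8, 16, ..., 1024).
seq_length_warmup = SeqLengthWarmup(duration=0.3,
                                    min_seq_length=8,
                                    max_seq_length=1024)

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  max_duration='1ep',
                  algorithms=[seq_length_warmup])
trainer.fit()
```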

## Technical Details

There are two options for reducing a training example to the desired sequence length (sketched in the code after this list):
* Truncating the sentence, discarding everything beyond the desired sequence length.
* Segmenting the sentence, breaking it up into segments of the desired sequence length and turning each segment into a separate training example.

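As a rough sketch of the difference between the two options (both helpers operate on a plain list of token IDs and are illustrative only, not Composer's implementation):

```python
from typing import List

def truncate(tokens: List[int], seq_length: int) -> List[int]:
    """Keep only the first ``seq_length`` tokens; the rest are discarded."""
    return tokens[:seq_length]

def segment(tokens: List[int], seq_length: int) -> List[List[int]]:
    """Split the tokens into chunks of ``seq_length``; each chunk becomes its
    own training example, so no tokens are thrown away."""
    return [tokens[i:i + seq_length] for i in range(0, len(tokens), seq_length)]

tokens = list(range(20))               # a toy "sentence" of 20 token IDs
assert truncate(tokens, 8) == tokens[:8]
assert len(segment(tokens, 8)) == 3    # chunks of 8, 8, and 4 tokens
```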

## Attribution

[*Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training*](https://arxiv.org/abs/2108.06084) by Conglong Li, Minjia Zhang, and Yuxiong He. Posted to arXiv in 2021.