
Commit 6af76cc

Author: Moin Nadeem
Update README.md (#770)
Added docs for sequential length warmup
1 parent a3739ed commit 6af76cc

File tree

  • composer/algorithms/seq_length_warmup

1 file changed: +10 -10 lines changed


composer/algorithms/seq_length_warmup/README.md

Lines changed: 10 additions & 10 deletions
@@ -15,18 +15,23 @@ Sequence Length Warmup linearly increases the sequence length (number of tokens

### Functional Interface

-TODO(Moin): Fill this in and add comments as appropriate to describe what's happening.
-
```python
from composer import functional as cf

def training_loop(model, train_loader):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = F.cross_entropy
    model.train()
+    max_seq_length = 1024
+    curr_seq_length = 8
+    seq_length_step_size = 8

+    # in this example, we define a warmup schedule that increases the
+    # sequence length by 8 at every step until it reaches the maximum sequence length
    for epoch in range(num_epochs):
        for X, y in train_loader:
+            curr_seq_length = min(max_seq_length, curr_seq_length + seq_length_step_size)
+            X = cf.set_batch_sequence_length(X, curr_seq_length)
            y_hat = model(X)
            loss = loss_fn(y_hat, y)
            loss.backward()
@@ -36,15 +41,14 @@ def training_loop(model, train_loader):

### Composer Trainer

-TODO(Moin): Fill this in and add comments as appropriate to describe what's happening.
-
```python
from composer.trainer import Trainer
+from composer.algorithms import SeqLengthWarmup

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  max_duration='1ep',
-                  algorithms=[])
+                  algorithms=[SeqLengthWarmup()])

trainer.fit()
```
@@ -55,9 +59,7 @@ We implement this as a pre-processing step during the forward pass when training

## Suggested Hyperparameters

-We found that running Sequence Length Warmup for 30% of training (i.e., setting `duration=0.3`) provided the largest speedup that could still maintain full model quality on GPT2-52M.
-
-TODO(Moin): Provide insights into the other hyperparameter choices.
+We found that running Sequence Length Warmup for 30% of training (i.e., setting `duration=0.3`) provided the largest speedup that could still maintain full model quality on GPT-2 125M. We also recommend always ensuring that the sequence length is a multiple of eight in order to take advantage of hardware acceleration such as Tensor Cores.

## Technical Details

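For reference, here is a minimal sketch of how the suggested hyperparameters might be wired into the Trainer example above. Only `duration=0.3` comes from the recommendation in this section; the remaining argument names (`min_seq_length`, `max_seq_length`, `step_size`) are assumptions about the `SeqLengthWarmup` constructor, so check the class docstring for the exact signature.

```python
from composer.trainer import Trainer
from composer.algorithms import SeqLengthWarmup

# Sketch only: `duration=0.3` follows the recommendation above; the other
# argument names are assumptions about the SeqLengthWarmup constructor.
seq_length_warmup = SeqLengthWarmup(
    duration=0.3,         # warm up over the first 30% of training
    min_seq_length=8,     # start from a short sequence length...
    max_seq_length=1024,  # ...and grow to the model's full context length
    step_size=8,          # keep every intermediate length a multiple of eight
)

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  max_duration='1ep',
                  algorithms=[seq_length_warmup])
trainer.fit()
```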
@@ -87,8 +89,6 @@ There are two options for doing so:
* Truncating the sentence, discarding everything beyond the desired sequence length.
* Segmenting the sentence, breaking it up into segments of the desired sequence length and making all segments into separate training examples (see the sketch below).

-Jonathan to pick up here.
-
## Attribution

[*Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training*](https://arxiv.org/abs/2108.06084) by Conglong Li, Minjia Zhang, and Yuxiong He. Posted to arXiv in 2021.
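To make the two options above concrete, here is a short illustrative sketch. The helper names are hypothetical (not part of Composer's API), and it assumes the batch is a tensor of token IDs with shape `(batch_size, seq_len)`.

```python
import torch

def truncate_batch(input_ids: torch.Tensor, seq_length: int) -> torch.Tensor:
    """Truncate: keep the first `seq_length` tokens and discard the rest."""
    return input_ids[:, :seq_length]

def segment_batch(input_ids: torch.Tensor, seq_length: int) -> torch.Tensor:
    """Segment: split each sequence into chunks of `seq_length` and treat
    each chunk as a separate training example."""
    batch_size, total_len = input_ids.shape
    usable_len = (total_len // seq_length) * seq_length  # drop the ragged tail
    return input_ids[:, :usable_len].reshape(-1, seq_length)

# A batch of 2 sequences of length 32, warmed up to sequence length 8:
input_ids = torch.randint(0, 50257, (2, 32))
print(truncate_batch(input_ids, 8).shape)  # torch.Size([2, 8]) -- same number of examples, less data
print(segment_batch(input_ids, 8).shape)   # torch.Size([8, 8]) -- more examples, no data discarded
```

Truncation keeps the number of examples fixed but discards tokens, while segmentation keeps all tokens at the cost of turning each original sentence into several shorter examples.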
