Fix topology agnostic loading #68
Closed
When loading a checkpoint with a tp degree that differs from the configured tp degree, the following error is raised:
This happens only for the `model.lm_head.pp_block.weight` parameter. I assume this is because the optimizer states for this parameter are stored under the tied `model.token_position_embeddings.pp_block.token_embedding.weight` parameter. This PR fixes the issue by skipping the attempt to load optimizer states for the lm_head. This mirrors weight loading, where the `model.token_position_embeddings.pp_block.token_embedding.weight` weights are loaded for `model.lm_head.pp_block.weight` (see https://github.com/huggingface/nanotron/blob/main/src/nanotron/serialize/weights.py#L347), but I think the optimizer states can simply be skipped.
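The diff itself is not reproduced here; the sketch below only illustrates the idea of skipping the tied lm_head entry when collecting optimizer states to load. The lm_head and embedding names are taken from the description above, the decoder-layer name is purely illustrative, and the helper (`names_to_load`) and its signature are hypothetical, not nanotron's actual API.

```python
# Minimal sketch of the fix, not the actual nanotron code: when deciding which
# parameters to load optimizer states for, drop the lm_head weight, because it
# is tied to the token embedding and its states only exist in the checkpoint
# under the embedding's name.

# Hypothetical constant used for illustration only.
TIED_LM_HEAD_NAME = "model.lm_head.pp_block.weight"


def names_to_load(param_names: list[str]) -> list[str]:
    """Return the parameter names whose optimizer states should be loaded."""
    return [name for name in param_names if name != TIED_LM_HEAD_NAME]


if __name__ == "__main__":
    params = [
        "model.token_position_embeddings.pp_block.token_embedding.weight",
        "model.lm_head.pp_block.weight",
        "model.decoder.0.pp_block.attn.qkv_proj.weight",  # illustrative name
    ]
    # The lm_head entry is dropped; its optimizer states live under the tied
    # embedding parameter, which is still loaded as usual.
    print(names_to_load(params))
```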
To reproduce:

Set up the config files:
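The config files themselves are not included here; as a rough sketch (assuming the usual nanotron config layout, with paths and intervals chosen arbitrarily), the two files only need to differ in the tp degree and in resuming from the first run's checkpoint:

```yaml
# examples/debug_topology_agnostic.yaml (first run; only the fields relevant
# here are shown, the rest is a normal nanotron training config)
checkpoints:
  checkpoints_path: checkpoints/debug_topology_agnostic  # arbitrary path
  checkpoint_interval: 10
parallelism:
  dp: 1
  pp: 1
  tp: 4

# examples/debug_topology_agnostic_continue.yaml (resume with a smaller tp degree)
checkpoints:
  checkpoints_path: checkpoints/debug_topology_agnostic
  checkpoint_interval: 10
  resume_checkpoint_path: checkpoints/debug_topology_agnostic
parallelism:
  dp: 1
  pp: 1
  tp: 2
```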
Train first using `tp=4`:

`CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 run_train.py --config-file examples/debug_topology_agnostic.yaml`
Then continue with `tp=2`:

`CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=2 run_train.py --config-file examples/debug_topology_agnostic_continue.yaml`
On main, this will lead to the above error.