Clarify the model checkpoint arguments #4335

@ananthsub

🐛 Proposals

This is not so much a bug report as an RFC to clarify the ModelCheckpoint callback arguments:

  • save_last: to me, this means that whenever we save a checkpoint, we also save it under the filename "last.ckpt". This provides a predetermined checkpoint name, which is very helpful for resuming from failures. Importantly, it should not determine when checkpoints are saved. Currently it's easy to misread this parameter as "save the checkpoint after the last epoch," which I think should be split out as a separate argument. This distinction would also clarify the typing and validation: there's no need for it to be an Optional[bool], since either we save a checkpoint as "last.ckpt" or we don't. It could be a regular bool.

  • There's an inefficiency right now where we generate the checkpoint dict twice if save_last=True. For techniques like ZeRO that deal with sharded optimizer states, each checkpoint dict creation triggers communication across all ranks. Instead, we should gather the checkpoint dict once and then save it to the different file paths accordingly. (cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @ninginthecloud @jjenniferdai @SeanNaren @blefaudeux)

  • save_top_k: since monitor is None by default, this should force save_top_k to be -1. The counterargument is that this can cause storage concerns, but I think that is easy to correct on the user side: configure both save_top_k and monitor.

  • period: we should rename this to every_n_epochs. This opens up extensions for checkpointing every_n_steps during training and for checkpointing after a specified time interval. With those extensions in mind, period is ambiguous. Another request here is to change the default filename from "{epoch}" to "{epoch}-{global_step}" to better support mid-epoch checkpointing.
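
To make the "gather once, save to several paths" proposal concrete, here is a rough, framework-free sketch. Everything in it is an assumption for illustration: save_checkpoint_once and fake_build are hypothetical names, pickle stands in for whatever serializer the trainer actually uses, and plain str.format stands in for the callback's own filename formatting. It is not the ModelCheckpoint implementation.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, Dict, List


def save_checkpoint_once(
    build_checkpoint: Callable[[], Dict[str, Any]],
    filepaths: List[str],
) -> Dict[str, Any]:
    """Build the checkpoint dict a single time, then write it to every path."""
    # The expensive step: with sharded optimizer state this is where
    # cross-rank communication would happen, so it runs exactly once.
    checkpoint = build_checkpoint()
    for path in filepaths:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(checkpoint, f)  # stand-in serializer (assumption)
    return checkpoint


# Hypothetical builder that counts how many times it is invoked.
calls = {"n": 0}


def fake_build() -> Dict[str, Any]:
    calls["n"] += 1
    return {"epoch": 3, "global_step": 1200, "state_dict": {"w": [0.1, 0.2]}}


ckpt_dir = tempfile.mkdtemp()
# Proposed default template, formatted here with plain str.format.
name = "{epoch}-{global_step}".format(epoch=3, global_step=1200) + ".ckpt"
paths = [os.path.join(ckpt_dir, name), os.path.join(ckpt_dir, "last.ckpt")]
save_checkpoint_once(fake_build, paths)
```

With this shape, save_last only adds one more entry to the list of paths. It never triggers a second dict creation, and it has no say in when a checkpoint is taken.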

cc @awaelchli @carmocca
