
Conversation

ksivaman
Member

Description

The amax history should be rolled only for modules that ran and thus produced a non-zero amax. The bug of rolling the history unconditionally was introduced in #575.

Fixes #814
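
Conceptually, the fix amounts to masking the roll with a "did this tensor record an amax this step" check. Below is a minimal sketch of that idea (not the actual TransformerEngine code), assuming `amax_history` has shape `[history_len, num_tensors]` with row 0 holding the current step's amaxes:

```python
import torch

def roll_amax_history(amax_history: torch.Tensor) -> torch.Tensor:
    """Roll the FP8 amax history, but only for tensors whose module ran.

    Sketch only: assumes amax_history has shape [history_len, num_tensors]
    and that row 0 holds the amaxes recorded during the current step.
    """
    # Tensors that recorded a non-zero amax this step, i.e. whose module ran.
    updated = amax_history[0] > 0.0                      # shape: [num_tensors]

    # Rolled copy: shift the window and clear the slot that will receive
    # the next step's amaxes.
    rolled = torch.roll(amax_history, shifts=-1, dims=0)
    rolled[0].zero_()

    # Keep the old history untouched for modules that did not run.
    return torch.where(updated, rolled, amax_history)
```

`torch.where` broadcasts the per-tensor mask across the history dimension, so idle modules keep their history intact while active ones advance normally.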

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
@ksivaman ksivaman added the 1.6.0 label Apr 30, 2024
@ksivaman ksivaman requested review from ptrendx and cyanguwa April 30, 2024 20:36
@ksivaman ksivaman self-assigned this Apr 30, 2024
@ksivaman
Member Author

/te-ci pytorch

@ksivaman ksivaman merged commit a817868 into NVIDIA:main Apr 30, 2024
ptrendx pushed a commit that referenced this pull request Apr 30, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
@timmoon10
Collaborator

This fixes the bug discussed in #786 (review). Previously, if the amax history length didn't match the number of grad accumulation steps, the FP8 scaling factors could change in a step where the FP8 data did not change.
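
To make the failure mode concrete, here is a toy example with hypothetical numbers (not TransformerEngine code): unconditionally rolling the history of a module that did not run flushes its recorded amax out of the window, so the scale derived from the history maximum changes even though no new FP8 data was produced.

```python
import torch

# Amax history of length 2 for one FP8 tensor whose module runs only once
# every few micro-batches (hypothetical values).
history = torch.tensor([[3.5], [2.0]])   # [history_len, num_tensors]
print(history.max())                     # tensor(3.5000) -> drives the FP8 scale

# Two idle steps with an unconditional roll: no new amax was recorded,
# yet the window keeps shifting and a slot is zeroed each step.
for _ in range(2):
    history = torch.roll(history, -1, 0)
    history[0].zero_()

print(history.max())                     # tensor(0.) -> scale changes spuriously
```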

pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 15, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 16, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 23, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>

Successfully merging this pull request may close these issues:

v1.6: FP8GlobalStateManager seems to be preserving state in distributed setting (#814)