
Conversation

ksivaman
Member

Description

The amax history should be rolled only for modules that ran and thus produced a non-zero amax. The bug of rolling the history unconditionally was introduced in #575.

Fixes #814
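
Conceptually, the fix amounts to masking the roll with a "did this tensor record an amax this step" check. Below is a minimal sketch of that idea (not the actual TransformerEngine code), assuming `amax_history` has shape `[history_len, num_tensors]` with row 0 holding the current step's amaxes:

```python
import torch

def roll_amax_history(amax_history: torch.Tensor) -> torch.Tensor:
    """Roll the FP8 amax history, but only for tensors whose module ran.

    Sketch only: assumes amax_history has shape [history_len, num_tensors]
    and that row 0 holds the amaxes recorded during the current step.
    """
    # Tensors that recorded a non-zero amax this step, i.e. whose module ran.
    updated = amax_history[0] > 0.0                      # shape: [num_tensors]

    # Rolled copy: shift the window and clear the slot that will receive
    # the next step's amaxes.
    rolled = torch.roll(amax_history, shifts=-1, dims=0)
    rolled[0].zero_()

    # Keep the old history untouched for modules that did not run.
    return torch.where(updated, rolled, amax_history)
```

`torch.where` broadcasts the per-tensor mask across the history dimension, so idle modules keep their history intact while active ones advance normally.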

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
@ksivaman ksivaman added the 1.6.0 label Apr 30, 2024
@ksivaman ksivaman requested review from ptrendx and cyanguwa April 30, 2024 20:36
@ksivaman ksivaman self-assigned this Apr 30, 2024
@ksivaman
Member Author

/te-ci pytorch

@ksivaman ksivaman merged commit a817868 into NVIDIA:main Apr 30, 2024
ptrendx pushed a commit that referenced this pull request Apr 30, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
@timmoon10
Collaborator

This fixes the bug discussed in #786 (review). Previously, if the amax history length didn't match the number of grad accumulation steps, the FP8 scaling factors could change in a step where the FP8 data did not change.
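
To make the failure mode concrete, here is a toy example with hypothetical numbers (not TransformerEngine code): unconditionally rolling the history of a module that did not run flushes its recorded amax out of the window, so the scale derived from the history maximum changes even though no new FP8 data was produced.

```python
import torch

# Amax history of length 2 for one FP8 tensor whose module runs only once
# every few micro-batches (hypothetical values).
history = torch.tensor([[3.5], [2.0]])   # [history_len, num_tensors]
print(history.max())                     # tensor(3.5000) -> drives the FP8 scale

# Two idle steps with an unconditional roll: no new amax was recorded,
# yet the window keeps shifting and a slot is zeroed each step.
for _ in range(2):
    history = torch.roll(history, -1, 0)
    history[0].zero_()

print(history.max())                     # tensor(0.) -> scale changes spuriously
```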

pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 15, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 16, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 23, 2024
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>

Successfully merging this pull request may close these issues:

v1.6: FP8GlobalStateManager seems to be preserving state in distributed setting (#814)