
Conversation

ptrendx
Member

@ptrendx ptrendx commented Mar 13, 2025

Description

The return_bias option was silently ignored in LayerNormLinear and LayerNormMLP after the 2.0 refactor.
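A hedged sketch of how the symptom can be checked (the modules and the return_bias flag are TE's real API; the sizes and the reference comparison are illustrative):

```python
import torch
import transformer_engine.pytorch as te

torch.manual_seed(0)
layer = te.LayerNormLinear(64, 64, bias=True, return_bias=True, device="cuda")
ref = te.LayerNormLinear(64, 64, bias=True, return_bias=False, device="cuda")
ref.load_state_dict(layer.state_dict())  # same weights; ref applies its bias internally

x = torch.randn(4, 64, device="cuda")
out, bias = layer(x)  # with return_bias=True, the bias is returned, not applied

# When return_bias is honored, adding the returned bias once matches the
# reference; with the regression, the module applied the bias anyway.
torch.testing.assert_close(out + bias, ref(x))
```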

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Do not apply the bias when apply_bias is False in LayerNormLinear and LayerNormMLP
  • Fix the backward pass and dbias calculation for LayerNormMLP
  • Fix the GeLU auxiliary output type
  • Remove the use_fc1_bias option
  • Improve tests and clean up the bias-handling logic

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ptrendx
Member Author

ptrendx commented Mar 13, 2025

/te-ci pytorch

@ptrendx
Member Author

ptrendx commented Mar 13, 2025

/te-ci pytorch

@ptrendx ptrendx requested review from ksivaman and timmoon10 March 13, 2025 21:24
@ptrendx ptrendx marked this pull request as ready for review March 13, 2025 21:27
Collaborator

@timmoon10 timmoon10 left a comment


Looks correct, especially as a quick bugfix.

return_bias is spaghetti code, so I'll summarize my understanding for future reference. Mcore performs some kernel fusions like bias-dropout-add and bias-add, which involve separating the bias compute from the linear GEMM. For an example, see this Transformer layer residual connection:
https://github.com/NVIDIA/Megatron-LM/blob/34013dbd37e94e17f1434d71b6ce9705e0acf6ca/megatron/core/transformer/transformer_layer.py#L451-L453
However, Mcore would like to stay consistent with PyTorch's convention of including the bias with the linear module (I presume for checkpointing). return_bias tells TE to not actually apply the bias, but to return it for use by Mcore. For some reason we still pass the bias into the autograd function, but then have a bunch of checks everywhere so that it isn't ever used in the forward or backward pass. In the future we should just pass bias=None into the autograd function if return_bias=True, so we can remove all the redundant logic with use_bias=False.
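For future reference, a minimal sketch of that calling pattern (the module and return_bias are TE's real API; the fusion is written out naively here rather than with Mcore's fused bias-dropout-add kernel):

```python
import torch
import transformer_engine.pytorch as te

proj = te.LayerNormLinear(1024, 1024, bias=True, return_bias=True, device="cuda")
x = torch.randn(8, 1024, device="cuda")
residual = torch.randn_like(x)

# TE skips the bias in its GEMM epilogue and hands it back to the caller...
out, bias = proj(x)

# ...so the caller can fuse it into a later kernel, as Mcore does with
# bias-dropout-add on the residual connection.
out = torch.nn.functional.dropout(out + bias, p=0.1, training=True) + residual
```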

@ptrendx
Member Author

ptrendx commented Mar 15, 2025

/te-ci pytorch

Collaborator

@timmoon10 timmoon10 left a comment


Moving the return_bias out of the autograd function is a big improvement.
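A hypothetical sketch of that refactor with a toy autograd function, to illustrate the pattern (none of these names are TE's actual internals):

```python
import torch

class _ToyLinearFunc(torch.autograd.Function):
    # Toy stand-in for TE's fused autograd function.
    @staticmethod
    def forward(ctx, inp, weight, bias):
        ctx.save_for_backward(inp, weight)
        ctx.has_bias = bias is not None
        out = inp @ weight.t()
        if bias is not None:  # bias handled only when actually passed in
            out = out + bias
        return out

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        grad_bias = grad_out.sum(0) if ctx.has_bias else None
        return grad_out @ weight, grad_out.t() @ inp, grad_bias

class ToyLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, return_bias=False):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        self.bias = torch.nn.Parameter(torch.zeros(out_features))
        self.return_bias = return_bias

    def forward(self, inp):
        # Pass bias=None into the autograd function when the caller wants
        # the bias back, instead of threading use_bias=False checks through
        # the forward and backward passes.
        bias = None if self.return_bias else self.bias
        out = _ToyLinearFunc.apply(inp, self.weight, bias)
        # When returned, the bias gets its gradient through the caller's add.
        return (out, self.bias) if self.return_bias else out
```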

Revert "Tightened test tolerances a little"

This reverts commit 2e20a92.

Signed-off-by: Przemek Tredak <[email protected]>
Member

@ksivaman ksivaman left a comment


LGTM

@ptrendx ptrendx added the 2.2.0 label Mar 17, 2025
@ptrendx
Member Author

ptrendx commented Mar 17, 2025

/te-ci pytorch

ptrendx and others added 4 commits March 17, 2025 11:16

  • Update tests/pytorch/test_numerics.py (Co-authored-by: Tim Moon <[email protected]>, Signed-off-by: Przemyslaw Tredak <[email protected]>)
  • Fix the Gelu Aux type (Signed-off-by: Przemek Tredak <[email protected]>)
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • Remove use_fc1_bias option (Signed-off-by: Przemek Tredak <[email protected]>)
@ptrendx
Member Author

ptrendx commented Mar 18, 2025

/te-ci pytorch

@ptrendx
Member Author

ptrendx commented Mar 18, 2025

/te-ci jax

@ptrendx
Member Author

ptrendx commented Mar 18, 2025

The CI failures are the expected failures coming from paged attention with cuDNN 9.8. Merging.

@ptrendx ptrendx merged commit 99f4067 into NVIDIA:main Mar 18, 2025
18 of 23 checks passed
@Marks101
Contributor

Marks101 commented Apr 7, 2025

Hi @ptrendx and @ksivaman, could you maybe mention this in the known issues for versions 2.0 and 2.1? We ran into this and it caused us some trouble 🙈

lhb8125 pushed a commit to lhb8125/TransformerEngine that referenced this pull request Apr 8, 2025
* Do not apply bias when apply_bias is False

Signed-off-by: Przemek Tredak <[email protected]>

* Bwd fix for LNMLP and tests

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for the dbias calculation

Signed-off-by: Przemek Tredak <[email protected]>

* Improve tests and cleaning the logic

Signed-off-by: Przemek Tredak <[email protected]>

* Tightened test tolerances a little

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Tightened test tolerances a little"

This reverts commit 2e20a92.

Signed-off-by: Przemek Tredak <[email protected]>

* Update tests/pytorch/test_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>

* Fix the Gelu Aux type

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove use_fc1_bias option

Signed-off-by: Przemek Tredak <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
@ptrendx
Member Author

ptrendx commented Apr 10, 2025

Hi @Marks101, my apologies about that :-(. I just added a note about it to the 2.0/2.1 release notes on GitHub.

@pggPL
Collaborator

pggPL commented Aug 6, 2025

@ptrendx I noticed "/opt/transformerengine/tests/pytorch/test_numerics.py:1147: PytestCollectionWarning: cannot collect test class 'TestReturnBiasModule' because it has a `__init__` constructor (from: tests/pytorch/test_numerics.py)" in the test logs.
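If that class is a helper module rather than an actual test case, pytest's `__test__ = False` convention should silence the collection attempt (alternatively, rename it so it doesn't start with `Test`). A hypothetical sketch, with an illustrative class body:

```python
import torch

class TestReturnBiasModule(torch.nn.Module):
    # Helper wrapper, not a pytest test case; __test__ = False tells pytest
    # not to collect it (collection warns because nn.Module defines __init__).
    __test__ = False

    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return self.linear(x)
```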
