Remove unexpected kwargs passing to flce #651
Conversation
Signed-off-by: Tcc0403 <[email protected]>
Investigating why the gemma3 multimodal test failed. @eljandoubi should we relax the tolerance further?
target,
reduction=reduction,
ignore_index=ignore_index,
**kwargs,
Overall looks good; I don't understand why we are not passing kwargs to liger flce anymore, though.
Mainly because `**kwargs` might contain `FlashAttentionKwargs`, we might accidentally pass them to flce when users set `_attn_implementation` to `flash_attention_2`.

We already capture the required args by declaring their names explicitly in the function signature, so the rest isn't needed. flce has many features (weight, label_smoothing, z-loss, ...) that I thought were good to have, but most of them aren't supported in transformers, so I'm just removing them for now in case future changes to transformers break things again.
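
A minimal, self-contained sketch of the failure mode and the fix described above; the function names and signatures are illustrative stand-ins, not the actual Liger-Kernel or transformers code:

```python
# Illustrative sketch only: stand-ins for the real APIs discussed in this PR.

def flce(hidden, weight, target, ignore_index=-100, reduction="mean"):
    """Stand-in for a fused linear cross entropy (flce) call with a fixed signature."""
    return 0.0


def loss_forwarding_kwargs(hidden, weight, target, ignore_index=-100, **kwargs):
    # Risky: with _attn_implementation="flash_attention_2", **kwargs may carry
    # FlashAttentionKwargs entries (e.g. cu_seq_lens_q), which flce does not accept.
    return flce(hidden, weight, target, ignore_index=ignore_index, **kwargs)


def loss_explicit_args(hidden, weight, target, ignore_index=-100, **kwargs):
    # Safer: only explicitly named arguments reach flce; everything else in
    # **kwargs is absorbed here and never forwarded.
    return flce(hidden, weight, target, ignore_index=ignore_index)


# loss_forwarding_kwargs(h, w, t, cu_seq_lens_q=...)  -> TypeError: unexpected keyword argument
# loss_explicit_args(h, w, t, cu_seq_lens_q=...)      -> runs fine; the extra kwarg is dropped
```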
Cool, thanks for the explanation
lgtm!
Summary
Resolves #650

`**kwargs` might contain `FlashAttentionKwargs`, so we should avoid passing it to flce directly. This PR also adopts the changes to `ForCausalLMLoss` and `fixed_cross_entropy` from huggingface/transformers, and renames `softcap` to `final_logit_softcapping` to match the naming.
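
As a hedged illustration of the rename (the helper name and signature below are assumptions, not the actual Liger-Kernel code), the renamed argument carries the usual tanh soft-capping value:

```python
import torch

# Assumed sketch: the argument formerly called `softcap` is now
# `final_logit_softcapping`, matching the Gemma-style config naming.
def soft_cap_logits(logits: torch.Tensor, final_logit_softcapping: float | None) -> torch.Tensor:
    if final_logit_softcapping is None:
        return logits
    # Standard tanh soft-capping: squashes logits into (-cap, +cap).
    return final_logit_softcapping * torch.tanh(logits / final_logit_softcapping)
```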
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence