
[RFC] Logits numerical issues in convergence test #742

@Tcc0403

Description


🐛 Describe the bug

This issue is to discuss how we should modify our convergence tests to handle the numerical issues of logits.

Context

In #704, we gave FusedLinearCrossEntropy (flce) patched models the freedom to switch between materializing logits and skipping them. This allows us to compare logits from the last step again in the flce convergence tests, test/convergence/(bf16|fp32)/test_mini_models.py.

However, this change also exposed hidden numerical issues that we hadn't taken care of.

The reasons why tests failed can be categorized into two types:

  1. numerical instability of floating-point arithmetic, due to the limits of bf16 representation, rounding errors, the non-associativity of floating-point operations, etc.
  2. actual bugs in our implementation.

Ideally, we want to ignore type 1 errors and catch type 2 errors. But in reality, the best we can do to ignore type 1 errors is to increase tolerances, which also makes it harder to catch type 2 errors, the actual bugs in our kernels.
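A type 1 error needs no kernel to reproduce; even plain double-precision Python shows that reordering a sum changes the result (bf16, with far fewer mantissa bits, only amplifies the effect). A minimal sketch:

```python
# Floating-point addition is not associative: two mathematically
# equivalent reductions of the same three numbers disagree.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # b cancels a first, so c survives
right = a + (b + c)  # c is absorbed by a's rounding before b cancels it

print(left, right)  # 1.0 0.0
```

Neither ordering is "wrong"; they simply accumulate rounding error differently, which is exactly why two correct kernel implementations can produce diverging logits.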

Existing methods

Before discussing how we can modify our convergence tests to address the issue, let's take a look at how other mainstream libraries handle it:

sglang

check_close_model_outputs()

techniques:
  • calculate the ROUGE-L score
  • compare top logprobs
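A rough sketch of the top-logprobs idea (the helper below is hypothetical, not sglang's actual API; sglang's check_close_model_outputs additionally mixes in the ROUGE-L score of generated text):

```python
import math

def log_softmax(logits):
    # numerically stable log-softmax over a 1-D list of logits
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def topk_logprobs_close(ref_logits, out_logits, k=2, atol=0.1):
    # hypothetical helper: compare only the k largest logprobs of each
    # distribution, ignoring the long tail where low-precision noise dominates
    ref = sorted(log_softmax(ref_logits), reverse=True)[:k]
    out = sorted(log_softmax(out_logits), reverse=True)[:k]
    return all(abs(r - o) <= atol for r, o in zip(ref, out))

print(topk_logprobs_close([3.0, 1.0, -2.0], [3.02, 0.98, -1.5]))  # True
```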

vllm

check_logprobs_close()

techniques:
  • compare logprobs only when token ids mismatch
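A simplified sketch of that idea (hypothetical helper, not vllm's actual signature): token ids must match position by position, and a mismatch is tolerated only when each side's chosen token still appears in the other side's top-logprob candidate set.

```python
def ids_match_with_logprob_fallback(ref_ids, out_ids,
                                    ref_top_ids, out_top_ids):
    # ref_top_ids / out_top_ids: per-position sets of token ids that
    # appear among each model's top logprobs (hypothetical parameters)
    for pos, (r, o) in enumerate(zip(ref_ids, out_ids)):
        if r == o:
            continue  # exact match, no logprob check needed
        # on mismatch, each token must be a plausible pick for the other side
        if o not in ref_top_ids[pos] or r not in out_top_ids[pos]:
            return False
    return True

print(ids_match_with_logprob_fallback([1, 2], [1, 3],
                                      [{1}, {2, 3}], [{1}, {2, 3}]))  # True
```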

verl

test_hf_causal_models()

techniques:
  • set num_hidden_layers=1 to reduce discrepancy
  • compare mean_of_logprob instead of only the top logprob (edit: it's closer to the cross-entropy loss)
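The mean-of-logprob comparison can be sketched as follows (hypothetical helper; its negation is the mean cross-entropy loss, which is why the edit above calls it that):

```python
import math

def mean_target_logprob(logits_rows, target_ids):
    # average logprob assigned to the target token at each position;
    # equals minus the mean cross-entropy loss, so per-token noise is
    # averaged out rather than compared token by token
    total = 0.0
    for logits, tid in zip(logits_rows, target_ids):
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[tid] - lse  # log-softmax at the target index
    return total / len(target_ids)

print(mean_target_logprob([[0.0, 0.0]], [0]))  # log(1/2) ~ -0.6931
```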

Discussion

From the examples above, we can see that comparing in logprob space is more common than comparing raw logits. We should definitely do the same.

The other options are open for discussion:

  1. mean logprob or top logprob?
  2. setting num_hidden_layers=1 can speed up testing while minimizing the effect of numerical issues, but is one layer sufficient to represent the model?
  3. should we force the top token to match?
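One concrete argument for logprob space, whichever option we pick: log-softmax is invariant to adding a constant to every logit, so any uniform drift between the two implementations vanishes before comparison. A minimal sketch:

```python
import math

def log_softmax(logits):
    # numerically stable log-softmax over a 1-D list of logits
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

a = [1.0, 2.0, 3.0]
b = [x + 100.0 for x in a]  # same distribution, uniformly shifted logits

# raw logits differ by 100, but the logprobs agree to rounding error
print(max(abs(x - y) for x, y in zip(log_softmax(a), log_softmax(b))))
```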

Goal

A reliable convergence test based on logits comparison, so we can distinguish whether a failure is due to numerical issues or an actual bug, without having to adjust tolerances frequently.

Known failure models (models that pass only with relaxed tolerances are marked *)

Feel free to report any model you find with a logits mismatch, and I'll add it to the list.

  • gemma
  • gemma2
  • gemma3*
  • granite3*
  • qwen3_moe*
  • qwen2_vl*
  • qwen2_5_vl*
  • mixtral (disabled for a long time)

cc @lancerts @yundai424 @qingquansong @shivam15s @shimizust @vaibhavjindal @austin362667

Reproduce

No response

Versions

liger_kernel==4a2da99
