@cmpatino cmpatino commented Sep 29, 2025

Revert try_extract_without_anchor to True in IndicesExtractionConfig to recover behavior in gpqa:diamond eval.

The setting was changed to try_extract_without_anchor = False to avoid false positives in GPQA. However, we ran an experiment and noticed that answers in the format \boxed{X} were being graded as incorrect even if the model gave the correct answer.

GPQA evaluates a model's knowledge of science and math, so answers in the \boxed{X} format should also be accepted as valid.
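To make the behavior concrete, here is a minimal sketch of anchored versus anchor-free letter extraction. It is not lighteval's implementation: the regexes, the `extract_choice` helper, and its `try_without_anchor` argument are hypothetical stand-ins for the effect of `try_extract_without_anchor`, assuming a four-option A-D question.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; lighteval's real regexes differ.
# Anchored: a choice letter that directly follows "answer is" or "answer:".
ANCHORED = re.compile(r"(?i:answer\s*(?:is|:))\s*\(?([A-D])\)?(?![A-Za-z])")
# Anchor-free: any standalone capital A-D, such as the C inside \boxed{C}.
BARE_LETTER = re.compile(r"(?<![A-Za-z])([A-D])(?![A-Za-z])")


def extract_choice(text: str, try_without_anchor: bool) -> Optional[str]:
    """Return the extracted choice letter, or None if nothing matches."""
    anchored = ANCHORED.search(text)
    if anchored:
        return anchored.group(1)
    if try_without_anchor:
        # Fallback: take the last standalone letter in the response.
        letters = BARE_LETTER.findall(text)
        if letters:
            return letters[-1]
    return None


boxed = r"Therefore \boxed{C}"
print(extract_choice(boxed, try_without_anchor=False))  # no anchor phrase, nothing extracted
print(extract_choice(boxed, try_without_anchor=True))   # fallback picks up the boxed letter
```

The same fallback is what produces false positives: in this sketch a response such as "A full derivation is out of reach here." yields "A" when the fallback is on, even though the model never committed to an answer.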

Below is a summary of the effects of the change in Qwen and SmolLM3 models.

| Model | Mode | try_extract_without_anchor: True | try_extract_without_anchor: False |
|---|---|---|---|
| Qwen_Qwen3-0.6B_main | /no_think | 26.26 | 15.97 |
| Qwen_Qwen3-1.7B_main | /no_think | 31.76 | 24.31 |
| Qwen_Qwen3-4B_main | /no_think | 44.38 | 46.59 |
| Qwen_Qwen3-0.6B_main | /think | 28.16 | 22.54 |
| Qwen_Qwen3-1.7B_main | /think | 39.90 | 38.26 |
| Qwen_Qwen3-4B_main | /think | 55.30 | 53.66 |
| HuggingFaceTB_SmolLM3-3B_main | /think | 41.70 | 28.70 |
| HuggingFaceTB_SmolLM3-3B_main | /no_think | 35.70 | 22.35 |
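
For a quick read of the table above, the per-row difference (score with True minus score with False) can be computed with a short script. The numbers are copied from the table; the variable names are only illustrative.

```python
# (model, mode) -> (score with True, score with False), copied from the table.
scores = {
    ("Qwen_Qwen3-0.6B_main", "/no_think"): (26.26, 15.97),
    ("Qwen_Qwen3-1.7B_main", "/no_think"): (31.76, 24.31),
    ("Qwen_Qwen3-4B_main", "/no_think"): (44.38, 46.59),
    ("Qwen_Qwen3-0.6B_main", "/think"): (28.16, 22.54),
    ("Qwen_Qwen3-1.7B_main", "/think"): (39.90, 38.26),
    ("Qwen_Qwen3-4B_main", "/think"): (55.30, 53.66),
    ("HuggingFaceTB_SmolLM3-3B_main", "/think"): (41.70, 28.70),
    ("HuggingFaceTB_SmolLM3-3B_main", "/no_think"): (35.70, 22.35),
}

# Positive delta: the run with try_extract_without_anchor=True scored higher.
deltas = {key: round(on - off, 2) for key, (on, off) in scores.items()}

# Print rows from the smallest to the largest difference.
for (model, mode), delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{model:32s} {mode:10s} {delta:+.2f}")
```

Seven of the eight rows score higher with the setting at True; the exception is Qwen_Qwen3-4B_main in /no_think mode, which scores 2.21 points lower.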

Revert `try_extract_without_anchor` to True in `IndicesExtractionConfig` to avoid issues in `gpqa:diamond` eval
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun lewtun left a comment

LGTM but let's wait for approval from a core maintainer before merging

@clefourrier

Can you change it in GPQA specifically? We observed we got a number of false positives in other metrics with this setting

@clefourrier clefourrier left a comment

LGTM, you can merge once tests pass

@clefourrier

(you will have to update the tests for GPQA since you're changing the metric)

@cmpatino cmpatino merged commit b1d45e3 into huggingface:main Sep 30, 2025
4 checks passed
@cmpatino cmpatino deleted the indices-extraction-setting branch September 30, 2025 10:59