@rintaro121 commented Aug 17, 2025

What does this PR do?

Fixes #3216

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Motivation

Currently, when passing a 2D tensor ([num_queries, num_documents]) to retrieval_normalized_dcg, the function flattens both preds and target and computes DCG/IDCG on the concatenated list.
This treats all queries as a single large ranking problem.

# To reproduce
from torchmetrics.functional.retrieval import retrieval_normalized_dcg
import torch

# Query 1
p1 = retrieval_normalized_dcg(torch.tensor([0.1, 0.2, 0.3]), torch.tensor([0, 1, 0]))
print(p1)  # tensor(0.6309)

# Query 2
p2 = retrieval_normalized_dcg(torch.tensor([0.8, 0.1, 0.05]), torch.tensor([1, 0, 0]))
print(p2)  # tensor(1.0000)

print("Mean per-query NDCG:", (p1 + p2) / 2)
# tensor(0.8155)

# Batched input (2D)
p_batch = retrieval_normalized_dcg(
    torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.05]]),
    torch.tensor([[0, 1, 0], [1, 0, 0]]),
)
print("Batch NDCG:", p_batch)
# tensor(0.9197) <-- Not the mean per-query value
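
For reference, the batch value 0.9197 is exactly what you get by ranking all six documents as a single list. A minimal sketch (not the library code) that reproduces the flattened computation by hand:

import math
import torch

# Flatten both queries into one ranking, as the current implementation does.
preds = torch.tensor([0.1, 0.2, 0.3, 0.8, 0.1, 0.05])
target = torch.tensor([0, 1, 0, 1, 0, 0])

order = preds.argsort(descending=True)
gains = target[order].float()  # relevance in ranked order: [1, 0, 1, 0, 0, 0]

dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
ideal = gains.sort(descending=True).values
idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
print(dcg / idcg)  # tensor(0.9197)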

Changes

Fixes to retrieval_normalized_dcg:

  • Updated the logic to correctly handle 2D tensors as batches of samples.
  • Added a new argument empty_target_action to specify the behavior when a target contains no positive labels, since positives are not guaranteed to be present in every query (see the sketch after this list):
    • "skip" (default): Exclude the sample from the average computation.
    • "pos": Treat the score as 1.0.
    • "neg": Treat the score as 0.0.

Test additions and modifications:

  • Added new test cases covering the above fixes and feature additions.
  • Added tests to verify that each option of empty_target_action ("skip", "pos", "neg") works correctly.
  • Defined input tensors for use in the new tests.
  • Adjusted the arguments (e.g., k=...) passed to scikit-learn's ndcg_score to align conditions with our implementation.

For the reviewer

Feedback on the naming and default behavior of the new empty_target_action argument would be greatly appreciated. Please also let us know if any test coverage is missing.

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--3229.org.readthedocs.build/en/3229/

@Borda changed the title feat: Compute per-query average for 2D retrieval_normalized_dcg Aug 28, 2025
@rintaro121 (Author) commented Aug 31, 2025

Hi, @Borda

Although this PR focuses on fixing retrieval_normalized_dcg, the same bug is also present in other retrieval metrics, including MRR, HitRate, and several others. The root cause is that _check_retrieval_target_and_prediction_types currently flattens both preds and target before returning them.
As a result, when a 2D tensor is passed (representing a batch of queries), all queries are concatenated into a single ranking problem. This produces incorrect scores across all retrieval metrics.

For example, with Average Precision:

from torchmetrics.functional.retrieval import retrieval_average_precision
import torch

# Query 1
p1 = retrieval_average_precision(torch.tensor([0.1, 0.2, 0.3]), torch.tensor([0, 1, 0]))
print(p1)  # tensor(0.5000)

# Query 2
p2 = retrieval_average_precision(torch.tensor([0.8, 0.1, 0.05]), torch.tensor([1, 0, 0]))
print(p2)  # tensor(1.0000)

print("Mean per-query Precision:", (p1 + p2) / 2)  
# tensor(0.7500)

# Batched input (2D)
p_batch = retrieval_average_precision(
    torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.05]]),
    torch.tensor([[0, 1, 0], [1, 0, 0]]),
)
print("Batch Precision:", p_batch)
# tensor(0.8333) <-- Not the mean per-query value (same as the NDCG example)

Here, the batched input is incorrectly treated as one large query with num_queries * num_documents documents, which inflates the reported metric compared to the mean per-query value. Concretely, in the flattened six-document ranking the two relevant documents land at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 ≈ 0.8333 instead of the per-query mean of 0.7500.

To keep the scope manageable, my current plan is to limit this PR to NDCG only, adding tests that demonstrate the correct per-query behavior, so it is easier to review.
In a follow-up PR, I would then:

  • Update the common utilities (e.g., _check_retrieval_target_and_prediction_types) so that 2D inputs are handled correctly without flattening across queries (rough sketch after this list).
  • Apply the fix consistently across other retrieval metrics (MRR, HitRate, Precision, Recall, etc.).
  • Add shared tests to ensure consistent batched-input behavior across all retrieval metrics.
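
To make the follow-up concrete, here is a minimal, hypothetical sketch of the per-query behavior the utilities should converge on (per_query_mean is an illustrative name, not part of the codebase):

import torch
from torchmetrics.functional.retrieval import retrieval_average_precision

def per_query_mean(metric_fn, preds, target):
    """Apply a 1D retrieval metric to each row of a 2D batch, then average."""
    if preds.ndim == 2:
        return torch.stack([metric_fn(p, t) for p, t in zip(preds, target)]).mean()
    return metric_fn(preds, target)

# Mean per-query AP for the batch above -> tensor(0.7500)
print(per_query_mean(
    retrieval_average_precision,
    torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.05]]),
    torch.tensor([[0, 1, 0], [1, 0, 0]]),
))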

Do you think this level of granularity makes sense, or would you prefer the broader retrieval-wide fix to be included directly in this PR?

codecov bot commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 68%. Comparing base (e08e009) to head (134d112).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #3229    +/-   ##
=======================================
- Coverage      69%     68%    -0%     
=======================================
  Files         364     349    -15     
  Lines       20096   19923   -173     
=======================================
- Hits        13790   13605   -185     
- Misses       6306    6318    +12     

@Borda requested a review from Copilot September 19, 2025 19:49
Copilot AI left a comment

Pull Request Overview

This PR fixes the NDCG computation for 2D tensor inputs to correctly compute per-query averages instead of treating all queries as a single ranking problem. Previously, 2D inputs were flattened and computed as one large ranking, but now each query is processed separately and averaged.

  • Updated retrieval_normalized_dcg to handle 2D tensors as batches of samples with per-query computation
  • Added empty_target_action parameter to handle queries with no positive labels
  • Enhanced test infrastructure to support 2D tensor testing and the new parameter

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Changed files:
  • src/torchmetrics/functional/retrieval/ndcg.py: Core logic changes for 2D tensor handling and empty target handling
  • tests/unittests/retrieval/_inputs.py: Added 2D test input data structure
  • tests/unittests/retrieval/helpers.py: Updated test helpers to support 2D inputs and new metric parameter
  • tests/unittests/retrieval/test_ndcg.py: Added test coverage for the new empty_target_action parameter


Comment on lines +134 to +142
original_shape = preds.shape
preds, target = _check_retrieval_functional_inputs(preds, target, allow_non_binary_target=True)

top_k = preds.shape[-1] if top_k is None else top_k
# reshape back if input was 2D
if len(original_shape) == 2:
    preds = preds.view(original_shape)
    target = target.view(original_shape)
else:
    preds = preds.unsqueeze(0)

Copilot AI Sep 19, 2025


Line 143 should use target.unsqueeze(0) instead of target.view(original_shape). The else branch handles 1D inputs which need to be unsqueezed to match the preds tensor on line 142.
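
That is, the else branch would presumably become:

else:
    preds = preds.unsqueeze(0)
    target = target.unsqueeze(0)  # suggested fix: unsqueeze, not view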


Successfully merging this pull request may close these issues.

retrieval_normalized_dcg should compute per-query average when given 2D inputs (IR-standard behavior)