@rintaro121 commented Aug 17, 2025

What does this PR do?

Fixes #3216

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Motivation

Currently, when passing a 2D tensor ([num_queries, num_documents]) to retrieval_normalized_dcg, the function flattens both preds and target and computes DCG/IDCG on the concatenated list.
This treats all queries as a single large ranking problem.

# To reproduce
from torchmetrics.functional.retrieval import retrieval_normalized_dcg
import torch

# Query 1
p1 = retrieval_normalized_dcg(torch.tensor([0.1, 0.2, 0.3]), torch.tensor([0, 1, 0]))
print(p1)  # tensor(0.6309)

# Query 2
p2 = retrieval_normalized_dcg(torch.tensor([0.8, 0.1, 0.05]), torch.tensor([1, 0, 0]))
print(p2)  # tensor(1.0000)

print("Mean per-query NDCG:", (p1 + p2) / 2)
# tensor(0.8155)

# Batched input (2D)
p_batch = retrieval_normalized_dcg(
    torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.05]]),
    torch.tensor([[0, 1, 0], [1, 0, 0]]),
)
print("Batch NDCG:", p_batch)
# tensor(0.9197) <-- Not the mean per-query value
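
For reference, the batch value 0.9197 is exactly what you get by ranking all six documents as a single list. A minimal sketch (not the library code) that reproduces the flattened computation by hand:

import math
import torch

# Flatten both queries into one ranking, as the current implementation does.
preds = torch.tensor([0.1, 0.2, 0.3, 0.8, 0.1, 0.05])
target = torch.tensor([0, 1, 0, 1, 0, 0])

order = preds.argsort(descending=True)
gains = target[order].float()  # relevance in ranked order: [1, 0, 1, 0, 0, 0]

dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
ideal = gains.sort(descending=True).values
idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
print(dcg / idcg)  # tensor(0.9197)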

Changes

Fixes to retrieval_normalized_dcg:

  • Updated the logic to correctly handle 2D tensors as batches of samples.
  • Added a new argument empty_target_action to specify the behavior when a target contains no positive labels, since positives are not guaranteed to be present in every query (see the sketch after this list):
    • "skip" (default): Exclude the sample from the average computation.
    • "pos": Treat the score as 1.0.
    • "neg": Treat the score as 0.0.

Test additions and modifications:

  • Added new test cases covering the above fixes and feature additions.
  • Added tests to verify that each option of empty_target_action ("skip", "pos", "neg") works correctly.
  • Defined input tensors for use in the new tests.
  • Adjusted the arguments (e.g., k=...) passed to scikit-learn's ndcg_score to align conditions with our implementation.

For the reviewer

Feedback on the naming and default behavior of the new empty_target_action argument would be greatly appreciated. Please also let us know if any test coverage is missing.

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--3229.org.readthedocs.build/en/3229/

@Borda changed the title feat: Compute per-query average for 2D retrieval_normalized_dcg Aug 28, 2025
@rintaro121 (Author) commented Aug 31, 2025

Hi, @Borda

Although this PR focuses on fixing retrieval_normalized_dcg, the same bug is also present in other retrieval metrics, including MRR, HitRate, and several others. The root cause is that _check_retrieval_target_and_prediction_types currently flattens both preds and target before returning them.
As a result, when a 2D tensor is passed (representing a batch of queries), all queries are concatenated into a single ranking problem. This produces incorrect scores across all retrieval metrics.

For example, with Average Precision:

from torchmetrics.functional.retrieval import retrieval_average_precision
import torch

# Query 1
p1 = retrieval_average_precision(torch.tensor([0.1, 0.2, 0.3]), torch.tensor([0, 1, 0]))
print(p1)  # tensor(0.5000)

# Query 2
p2 = retrieval_average_precision(torch.tensor([0.8, 0.1, 0.05]), torch.tensor([1, 0, 0]))
print(p2)  # tensor(1.0000)

print("Mean per-query Precision:", (p1 + p2) / 2)  
# tensor(0.7500)

# Batched input (2D)
p_batch = retrieval_average_precision(
    torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.05]]),
    torch.tensor([[0, 1, 0], [1, 0, 0]]),
)
print("Batch Precision:", p_batch)
# tensor(0.8333) <-- Not the mean per-query value (same as the NDCG example)

Here, the batched input is incorrectly treated as one large query with num_queries * num_documents documents, which inflates the reported metric compared to the mean per-query value. Concretely, in the flattened six-document ranking the two relevant documents land at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 ≈ 0.8333 instead of the per-query mean of 0.7500.

To keep the scope manageable, my current plan is to limit this PR to NDCG only, adding tests that demonstrate the correct per-query behavior, so it is easier to review.
In a follow-up PR, I would then:

  • Update the common utilities (e.g., _check_retrieval_target_and_prediction_types) so that 2D inputs are handled correctly without flattening across queries (rough sketch after this list).
  • Apply the fix consistently across other retrieval metrics (MRR, HitRate, Precision, Recall, etc.).
  • Add shared tests to ensure consistent batched-input behavior across all retrieval metrics.
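
To make the follow-up concrete, here is a minimal, hypothetical sketch of the per-query behavior the utilities should converge on (per_query_mean is an illustrative name, not part of the codebase):

import torch
from torchmetrics.functional.retrieval import retrieval_average_precision

def per_query_mean(metric_fn, preds, target):
    """Apply a 1D retrieval metric to each row of a 2D batch, then average."""
    if preds.ndim == 2:
        return torch.stack([metric_fn(p, t) for p, t in zip(preds, target)]).mean()
    return metric_fn(preds, target)

# Mean per-query AP for the batch above -> tensor(0.7500)
print(per_query_mean(
    retrieval_average_precision,
    torch.tensor([[0.1, 0.2, 0.3], [0.8, 0.1, 0.05]]),
    torch.tensor([[0, 1, 0], [1, 0, 0]]),
))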

Do you think this level of granularity makes sense, or would you prefer the broader retrieval-wide fix to be included directly in this PR?

codecov bot commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 68%. Comparing base (e08e009) to head (134d112).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #3229    +/-   ##
=======================================
- Coverage      69%     68%    -0%     
=======================================
  Files         364     349    -15     
  Lines       20096   19923   -173     
=======================================
- Hits        13790   13605   -185     
- Misses       6306    6318    +12     

@Borda requested a review from Copilot September 19, 2025 19:49
Copilot AI left a comment

Pull Request Overview

This PR fixes the NDCG computation for 2D tensor inputs to correctly compute per-query averages instead of treating all queries as a single ranking problem. Previously, 2D inputs were flattened and computed as one large ranking, but now each query is processed separately and averaged.

  • Updated retrieval_normalized_dcg to handle 2D tensors as batches of samples with per-query computation
  • Added empty_target_action parameter to handle queries with no positive labels
  • Enhanced test infrastructure to support 2D tensor testing and the new parameter

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Changed files:
  • src/torchmetrics/functional/retrieval/ndcg.py: Core logic changes for 2D tensor handling and empty target handling
  • tests/unittests/retrieval/_inputs.py: Added 2D test input data structure
  • tests/unittests/retrieval/helpers.py: Updated test helpers to support 2D inputs and new metric parameter
  • tests/unittests/retrieval/test_ndcg.py: Added test coverage for the new empty_target_action parameter


Comment on lines +134 to +142
original_shape = preds.shape
preds, target = _check_retrieval_functional_inputs(preds, target, allow_non_binary_target=True)

top_k = preds.shape[-1] if top_k is None else top_k
# reshape back if input was 2D
if len(original_shape) == 2:
    preds = preds.view(original_shape)
    target = target.view(original_shape)
else:
    preds = preds.unsqueeze(0)

Copilot AI Sep 19, 2025


Line 143 should use target.unsqueeze(0) instead of target.view(original_shape). The else branch handles 1D inputs which need to be unsqueezed to match the preds tensor on line 142.
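
That is, the else branch would presumably become:

else:
    preds = preds.unsqueeze(0)
    target = target.unsqueeze(0)  # suggested fix: unsqueeze, not view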


Successfully merging this pull request may close these issues.

retrieval_normalized_dcg should compute per-query average when given 2D inputs (IR-standard behavior)