Skip to content

Conversation

thomasht86
Copy link
Collaborator

@thomasht86 thomasht86 commented Sep 26, 2025

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Background:
VespaMatchEvaluator tries to exploit the recall query-parameter to find out whether or not a docid is matched by a query.
I took this statement too literally:

Note that the parameter recall only works if the document is matched by the query, which is exactly the behavior we want in this case.

Hence, the previous implementation was made under the naive assumption that the query and the recall-parameter are independent of each other, which does not hold true, at least for NN queries, as the recall-parameter will be translated to IN term and added to the query, and add a strong filter to the query.

To illustrate, consider the following queries:

A. select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q));
B. select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q)) AND docid IN(“id01”, “id02” ..);
(There can be many ids)

We want to make sure that all of the matched documents for query B will also have been matched by query A.

By setting the following parameters:

ranking.matching.postFilterThreshold: 0.0, # We always want postfiltering. Post-filtering is chosen when the estimated filter hit ratio of the query is larger than this threshold.
ranking.matching.approximateThreshold: 0.0, # We never want fallback to exact search. The fallback to exact search is chosen when the estimated filter hit ratio of the query is less than this threshold.

it will work for all cases like the above, but setting these parameters themselves will change the behavior of NN-queries, if the original queries themselves would have filters, which should not be postfiltered or which should actually fall back to exact search)

Maybe there is a better way of doing it..?

@thomasht86
Copy link
Collaborator Author

This is only a draft workaround for now, as we should probably do this automatically, but then also be very clear on the limitations.
@glebashnik

@glebashnik
Copy link
Member

glebashnik commented Sep 26, 2025

We also need target-hits-max-adjustment-factor: 1.0 to avoid target hits adjustment which will not reflect production use. Possibly we also need approximate:true, this shouldn't be necessary when setting approximate-threshold:0 but I have anecdotal experience that it might be needed too.

@thomasht86
Copy link
Collaborator Author

@glebashnik
I don't think setting target-hits-max-adjustment-factor: 1.0 would make sense for this, as it would change the behavior as you say, making the comparison less relevant. Of course that would mean that it might be adjusted, but then the results would also reflect the same adjustment, which is more important imho.

@thomasht86
Copy link
Collaborator Author

Re: approximate: true, this should be set to true by default if using hnsw-index. Do you know why/when it might be necessary to set it anyway?
Not super-trivial to add as it would require modifying the user-specified YQL to add the annotation, in contrast to setting a rank-parameter.

@thomasht86 thomasht86 marked this pull request as ready for review September 29, 2025 05:37
@thomasht86 thomasht86 marked this pull request as draft September 29, 2025 07:58
@thomasht86
Copy link
Collaborator Author

Let's wait a bit with this and see whether grouping approach is better!

@thomasht86 thomasht86 marked this pull request as ready for review September 30, 2025 07:15
@thomasht86
Copy link
Collaborator Author

@boeker fyi

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug in the VespaMatchEvaluator class when used with nearestNeighbor queries. The issue was with the previous naive assumption that query filters and recall parameters were independent, which doesn't hold for NN queries where the recall parameter gets translated to an IN term that can significantly alter query behavior. The fix replaces the recall-based approach with a grouping-based method that properly handles nearestNeighbor queries.

Key changes:

  • Replaces recall parameter with grouping queries to check document matches
  • Makes id_field parameter required with proper validation
  • Removes dual query approach (limit + recall) in favor of single grouping query
  • Updates all tests to work with the new grouping-based approach

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

File Description
vespa/evaluation.py Main implementation changes - switches from recall to grouping approach, adds required id_field validation, implements new grouping filter and ID extraction methods
tests/unit/test_evaluator.py Updates all unit tests to work with new grouping response structure, adds comprehensive tests for new methods, removes dual-query test scenarios
tests/integration/test_integration_evaluation.py Updates integration tests to include required id_field parameter, adds new test cases for small targetHits scenarios
docs/sphinx/source/evaluating-vespa-application-cloud.ipynb Updates example notebook to include required id_field parameter in VespaMatchEvaluator usage

Comment on lines +12 to +13
from typing import Iterable
import re
Copy link
Preview

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import of re and Iterable should be added to the existing import block rather than creating a new one. Consider organizing imports by grouping standard library imports together at the top.

Copilot uses AI. Check for mistakes.

Comment on lines +918 to +920
def create_grouping_filter(
self, yql: str, id_field: str, relevant_ids: Union[str, Iterable]
) -> str:
Copy link
Preview

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hint Union[str, Iterable] is too broad as str is also iterable. Consider using Union[str, Iterable[str]] or Union[str, List[str]] for better type safety and clarity.

Copilot uses AI. Check for mistakes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think List[str] is even better.

Comment on lines +817 to +843
def create_mock_response_json(matched_doc_id, searchtime, total_count):
return {
"root": {
"children": [
{
"id": "group:root:0",
"relevance": 1.0,
"continuation": {"this": ""},
"children": [
{
"id": "group:string:id",
"children": [
{
"id": f"group:string:{matched_doc_id}",
"relevance": 0.4395870752632618,
"value": matched_doc_id,
"fields": {"count()": 1},
}
],
}
],
}
],
"fields": {"totalCount": total_count},
},
"timing": {"searchtime": searchtime},
}
Copy link
Preview

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested JSON structure is repeated across multiple test methods with slight variations. Consider extracting this into a reusable helper method or test fixture to reduce code duplication and improve maintainability.

Copilot uses AI. Check for mistakes.

@glebashnik
Copy link
Member

@thomasht86 How this handles query timeouts and other errors during evaluation?
From my anecdotal experience it is easy to ignore query timeouts and generate stats any way.
This can be important for real-wold benchmarking on large datasets.
Maybe consider test for this.

Copy link
Member

@glebashnik glebashnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

logger.info(f"Wrote verbose match evaluation results to {csv_path}")

def create_grouping_filter(
self, yql: str, id_field: str, relevant_ids: Union[str, Iterable]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is unnecessary to allow both str and Iterable for relevant_ids because it can be interpreted as string with several ids, e.g. "d1,d2" etc. I would suggest to keep only list of strings as an option - as specified in docstring.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, it will actually be either a str or Set[str] if we look at what should be passed as relevant_docs to the init method.
using that and fixing docstring. agree?

modified_yql = yql.strip().rstrip(";")
return modified_yql + grouping_clause

def extract_matched_ids(self, resp: VespaQueryResponse, id_field: str) -> Set[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a static method.

Comment on lines +918 to +920
def create_grouping_filter(
self, yql: str, id_field: str, relevant_ids: Union[str, Iterable]
) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think List[str] is even better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants