(bugfix)`VespaMatchEvaluator` with nearestNeighbor #1125

thomasht86 · 2025-09-26T04:42:39Z

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Background:
VespaMatchEvaluator tries to exploit the recall query-parameter to find out whether or not a docid is matched by a query.
I took this statement too literally:

Note that the parameter recall only works if the document is matched by the query, which is exactly the behavior we want in this case.

Hence, the previous implementation rested on the naive assumption that the query and the recall parameter are independent of each other. That does not hold, at least for NN queries: the recall parameter is translated to an IN term and added to the query, where it acts as a strong filter.

To illustrate, consider the following queries:

A. select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q));
B. select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q)) AND docid IN("id01", "id02" ..);
(There can be many ids)

We want to make sure that all of the matched documents for query B will also have been matched by query A.
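A toy model may make the problem concrete. The sketch below is not how Vespa executes nearestNeighbor; it uses brute-force distances and a hypothetical `top_k` helper, with made-up ids and distances, only to show why applying the id filter before the neighbor search can report a document as matched even though query A would never have returned it.

```python
from typing import Dict, Optional, Set


def top_k(distances: Dict[str, float], k: int, allowed: Optional[Set[str]] = None) -> Set[str]:
    """Return the ids of the k closest documents, optionally restricted to `allowed`."""
    candidates = [
        (distance, doc_id)
        for doc_id, distance in distances.items()
        if allowed is None or doc_id in allowed
    ]
    return {doc_id for _, doc_id in sorted(candidates)[:k]}


# Hypothetical distances from the query vector to four documents.
distances = {"id01": 0.1, "id02": 0.2, "id03": 0.3, "id04": 0.9}
relevant = {"id02", "id04"}

# Query A: plain nearest-neighbor search with targetHits = 2.
matched_a = top_k(distances, k=2)

# Query B, filter applied BEFORE the neighbor search (pre-filtering):
# the search only sees the relevant ids, so a distant document can slip in.
matched_b_prefilter = top_k(distances, k=2, allowed=relevant)

# Query B, filter applied AFTER the neighbor search (post-filtering):
# the result is always a subset of what query A matched.
matched_b_postfilter = matched_a & relevant

print(matched_b_prefilter - matched_a)  # documents wrongly counted as matched
```

In this toy setup the pre-filtered variant reports `id04` as matched although it is not among the two nearest neighbors, while the post-filtered variant stays within query A's result.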

By setting the following parameters:

ranking.matching.postFilterThreshold: 0.0, # We always want postfiltering. Post-filtering is chosen when the estimated filter hit ratio of the query is larger than this threshold.
ranking.matching.approximateThreshold: 0.0, # We never want fallback to exact search. The fallback to exact search is chosen when the estimated filter hit ratio of the query is less than this threshold.

it will work for all cases like the above, but setting these parameters will itself change the behavior of NN queries if the original queries have filters that should not be post-filtered or that should actually fall back to exact search.
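As a rough sketch of the workaround described above, the two parameters could be merged into a query body as defaults. The helper name `with_nn_defaults` is hypothetical and not part of pyvespa; letting caller-supplied values win is an assumption made here because of the caveat about queries that carry their own filters.

```python
from typing import Any, Dict, Optional

# Defaults taken from the description above.
NN_MATCH_DEFAULTS: Dict[str, Any] = {
    # Always post-filter.
    "ranking.matching.postFilterThreshold": 0.0,
    # Never fall back to exact search.
    "ranking.matching.approximateThreshold": 0.0,
}


def with_nn_defaults(yql: str, user_params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a query body where explicit user parameters override the defaults."""
    body: Dict[str, Any] = {"yql": yql}
    body.update(NN_MATCH_DEFAULTS)
    body.update(user_params or {})  # assumption: the caller's settings take precedence
    return body


body = with_nn_defaults(
    "select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q))"
)
print(body)
```

Keeping the defaults overridable does not remove the limitation: any query whose own filters rely on pre-filtering or exact fallback would still behave differently under evaluation than in production.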

Maybe there is a better way of doing it..?

thomasht86 · 2025-09-26T04:52:55Z

This is only a draft workaround for now, as we should probably do this automatically, but then also be very clear on the limitations.
@glebashnik

glebashnik · 2025-09-26T07:20:34Z

We also need target-hits-max-adjustment-factor: 1.0 to avoid target-hits adjustment, which will not reflect production use. Possibly we also need approximate:true. This shouldn't be necessary when setting approximate-threshold:0, but I have anecdotal experience that it might be needed too.

thomasht86 · 2025-09-26T08:57:54Z

@glebashnik
I don't think setting target-hits-max-adjustment-factor: 1.0 would make sense for this, as it would change the behavior as you say, making the comparison less relevant. Of course that would mean that it might be adjusted, but then the results would also reflect the same adjustment, which is more important imho.

thomasht86 · 2025-09-26T10:58:04Z

Re: approximate: true, this should be set to true by default if using hnsw-index. Do you know why/when it might be necessary to set it anyway?
Not super-trivial to add as it would require modifying the user-specified YQL to add the annotation, in contrast to setting a rank-parameter.
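To give a feel for why this is less simple than setting a rank parameter, here is a minimal sketch of rewriting user-specified YQL to add the annotation. The helper `force_approximate` is hypothetical and deliberately naive: it assumes every nearestNeighbor operator already has a `{...}` annotation block directly in front of it, and it does not parse YQL properly.

```python
import re

# Matches an annotation block immediately followed by nearestNeighbor(.
_NN_ANNOTATION = re.compile(r"\{([^{}]*)\}(\s*nearestNeighbor\s*\()")


def force_approximate(yql: str) -> str:
    """Add approximate:true to annotated nearestNeighbor operators that lack it."""

    def _add(match: "re.Match[str]") -> str:
        annotations = match.group(1).strip()
        if re.search(r"\bapproximate\s*:", annotations):
            return match.group(0)  # respect an explicit user setting
        joined = f"{annotations}, approximate:true" if annotations else "approximate:true"
        return "{" + joined + "}" + match.group(2)

    return _NN_ANNOTATION.sub(_add, yql)


print(force_approximate(
    "select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q));"
))
```

Operators without any annotation block, nested annotations, and string literals that happen to contain `nearestNeighbor` are all left unhandled here, which is part of what makes a robust version non-trivial.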

thomasht86 · 2025-09-29T07:59:39Z

Let's wait a bit with this and see whether the grouping approach is better!

thomasht86 · 2025-09-30T10:07:41Z

@boeker fyi

Copilot

Pull Request Overview

This PR fixes a bug in the VespaMatchEvaluator class when used with nearestNeighbor queries. The issue was with the previous naive assumption that query filters and recall parameters were independent, which doesn't hold for NN queries where the recall parameter gets translated to an IN term that can significantly alter query behavior. The fix replaces the recall-based approach with a grouping-based method that properly handles nearestNeighbor queries.

Key changes:

Replaces recall parameter with grouping queries to check document matches
Makes id_field parameter required with proper validation
Removes dual query approach (limit + recall) in favor of single grouping query
Updates all tests to work with the new grouping-based approach

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

File	Description
vespa/evaluation.py	Main implementation changes - switches from recall to grouping approach, adds required id_field validation, implements new grouping filter and ID extraction methods
tests/unit/test_evaluator.py	Updates all unit tests to work with new grouping response structure, adds comprehensive tests for new methods, removes dual-query test scenarios
tests/integration/test_integration_evaluation.py	Updates integration tests to include required id_field parameter, adds new test cases for small targetHits scenarios
docs/sphinx/source/evaluating-vespa-application-cloud.ipynb	Updates example notebook to include required id_field parameter in VespaMatchEvaluator usage

Copilot · 2025-09-30T12:18:10Z

vespa/evaluation.py

+from typing import Iterable
+import re


The import of re and Iterable should be added to the existing import block rather than creating a new one. Consider organizing imports by grouping standard library imports together at the top.

Copilot · 2025-09-30T12:18:10Z

vespa/evaluation.py

+    def create_grouping_filter(
+        self, yql: str, id_field: str, relevant_ids: Union[str, Iterable]
+    ) -> str:


The type hint Union[str, Iterable] is too broad as str is also iterable. Consider using Union[str, Iterable[str]] or Union[str, List[str]] for better type safety and clarity.

I think List[str] is even better.

Copilot · 2025-09-30T12:18:10Z

tests/unit/test_evaluator.py

+        def create_mock_response_json(matched_doc_id, searchtime, total_count):
+            return {
+                "root": {
+                    "children": [
+                        {
+                            "id": "group:root:0",
+                            "relevance": 1.0,
+                            "continuation": {"this": ""},
+                            "children": [
+                                {
+                                    "id": "group:string:id",
+                                    "children": [
+                                        {
+                                            "id": f"group:string:{matched_doc_id}",
+                                            "relevance": 0.4395870752632618,
+                                            "value": matched_doc_id,
+                                            "fields": {"count()": 1},
+                                        }
+                                    ],
+                                }
+                            ],
+                        }
+                    ],
+                    "fields": {"totalCount": total_count},
+                },
+                "timing": {"searchtime": searchtime},
+            }


This nested JSON structure is repeated across multiple test methods with slight variations. Consider extracting this into a reusable helper method or test fixture to reduce code duplication and improve maintainability.
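One possible shape for such a helper is sketched below. It mirrors the mock structure shown in the diff but accepts any number of matched ids; the name `make_grouping_response` and the `id_field` default are assumptions for illustration, not code from this PR.

```python
from typing import Any, Dict, Iterable


def make_grouping_response(
    matched_doc_ids: Iterable[str],
    searchtime: float,
    total_count: int,
    id_field: str = "id",
) -> Dict[str, Any]:
    """Build a mock grouping response with one group per matched document id."""
    groups = [
        {
            "id": f"group:string:{doc_id}",
            "relevance": 1.0,
            "value": doc_id,
            "fields": {"count()": 1},
        }
        for doc_id in matched_doc_ids
    ]
    return {
        "root": {
            "children": [
                {
                    "id": "group:root:0",
                    "relevance": 1.0,
                    "continuation": {"this": ""},
                    "children": [
                        {"id": f"group:string:{id_field}", "children": groups}
                    ],
                }
            ],
            "fields": {"totalCount": total_count},
        },
        "timing": {"searchtime": searchtime},
    }


mock = make_grouping_response(["doc1", "doc2"], searchtime=0.01, total_count=2)
```

Individual tests could then vary only the ids, timing, and total count instead of repeating the nested literal.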

glebashnik · 2025-09-30T12:52:44Z

@thomasht86 How does this handle query timeouts and other errors during evaluation?
From my anecdotal experience, it is easy to ignore query timeouts and generate stats anyway.
This can be important for real-world benchmarking on large datasets.
Maybe consider a test for this.
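A sketch of what such a guard might look like, working on plain response dictionaries. The field names `root.errors` and `root.coverage.degraded.timeout` reflect my understanding of Vespa's default JSON result format and should be verified against the documentation; the helper names are hypothetical and nothing here is taken from this PR.

```python
from typing import Any, Dict, List, Tuple


def is_reliable(response_json: Dict[str, Any]) -> bool:
    """Return False if the response reports errors or a timeout-degraded result."""
    root = response_json.get("root", {})
    if root.get("errors"):  # assumed location of the error list
        return False
    degraded = root.get("coverage", {}).get("degraded", {})
    return not degraded.get("timeout", False)  # assumed timeout flag


def split_responses(
    responses: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate responses that are safe to include in stats from failed ones."""
    ok: List[Dict[str, Any]] = []
    failed: List[Dict[str, Any]] = []
    for response in responses:
        (ok if is_reliable(response) else failed).append(response)
    return ok, failed


responses = [
    {"root": {"fields": {"totalCount": 3}}},
    {"root": {"errors": [{"code": 12, "message": "Timed out"}]}},
    {"root": {"coverage": {"degraded": {"timeout": True}}}},
]
ok, failed = split_responses(responses)
print(f"{len(ok)} usable, {len(failed)} excluded from stats")
```

Reporting the number of excluded queries alongside the metrics would make it visible when a benchmark run was affected, rather than silently averaging over partial results.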

glebashnik

Some minor comments.

glebashnik · 2025-09-30T11:44:23Z

vespa/evaluation.py

        logger.info(f"Wrote verbose match evaluation results to {csv_path}")

+    def create_grouping_filter(
+        self, yql: str, id_field: str, relevant_ids: Union[str, Iterable]


I think it is unnecessary to allow both str and Iterable for relevant_ids, because a str can be interpreted as a string containing several ids, e.g. "d1,d2". I would suggest keeping only a list of strings as an option, as specified in the docstring.

So, it will actually be either a str or Set[str] if we look at what should be passed as relevant_docs to the init method.
I'll use that and fix the docstring. Agree?
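A small sketch of the normalization this implies, assuming the input is either a single id or a set of ids as described above. The function name `normalize_relevant_ids` is hypothetical; sorting is only there to make the output deterministic.

```python
from typing import List, Set, Union


def normalize_relevant_ids(relevant_ids: Union[str, Set[str]]) -> List[str]:
    """Turn a single id or a set of ids into a sorted list of ids."""
    if isinstance(relevant_ids, str):
        # A str is itself iterable, so it must be checked first;
        # otherwise "d1" would be split into the characters "d" and "1".
        return [relevant_ids]
    return sorted(relevant_ids)


print(normalize_relevant_ids("d1,d2"))
print(normalize_relevant_ids({"d2", "d1"}))
```

Note that a string such as "d1,d2" is treated as one id here and is not split on commas, which is the ambiguity raised in the review.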

glebashnik · 2025-09-30T11:46:45Z

vespa/evaluation.py

+        modified_yql = yql.strip().rstrip(";")
+        return modified_yql + grouping_clause
+
+    def extract_matched_ids(self, resp: VespaQueryResponse, id_field: str) -> Set[str]:


This is a static method.
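For illustration, an extractor that needs no instance state could look like the sketch below. It works on the plain response dictionary rather than on `VespaQueryResponse`, follows the nesting used in the unit-test mock in this PR, and uses a hypothetical class name; it is not the implementation under review.

```python
from typing import Any, Dict, Set


class GroupingResponseParser:
    """Hypothetical holder for helpers that do not depend on instance state."""

    @staticmethod
    def extract_matched_ids(response_json: Dict[str, Any]) -> Set[str]:
        """Collect the `value` of every group found under the grouping root."""
        matched: Set[str] = set()
        root = response_json.get("root", {})
        for child in root.get("children", []):
            # Skip anything that is not a grouping root (e.g. regular hits).
            if not str(child.get("id", "")).startswith("group:root"):
                continue
            for group_list in child.get("children", []):
                for group in group_list.get("children", []):
                    value = group.get("value")
                    if value is not None:
                        matched.add(str(value))
        return matched


sample = {
    "root": {
        "children": [
            {
                "id": "group:root:0",
                "children": [
                    {
                        "id": "group:string:id",
                        "children": [
                            {"id": "group:string:doc1", "value": "doc1", "fields": {"count()": 1}}
                        ],
                    }
                ],
            }
        ],
        "fields": {"totalCount": 1},
    }
}
print(GroupingResponseParser.extract_matched_ids(sample))
```

Because the method never touches `self`, it can be called on the class directly, which also makes it easy to unit test in isolation.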

glebashnik · 2025-09-30T12:50:32Z

vespa/evaluation.py

+    def create_grouping_filter(
+        self, yql: str, id_field: str, relevant_ids: Union[str, Iterable]
+    ) -> str:


I think List[str] is even better.

integration test for smaller targethits

f274531

thomasht86 added 2 commits September 26, 2025 12:04

adding default nn-parameters and adding caveats

5c3669d

remove parameters in test (as they are now added by default)

66cc6f3

thomasht86 marked this pull request as ready for review September 29, 2025 05:37

thomasht86 marked this pull request as draft September 29, 2025 07:58

thomasht86 added 11 commits September 29, 2025 10:23

revert workaround

13c0532

no limit query in unit test mock

a3b5ada

unit tests for new methods

9f6f133

add id field to integration test

01c729c

correct structure for responses in unit tests

5bf0881

grouping query approach

d5a1dae

integration test for up to 5000 ids

fef3213

dont assert on hits present

e4be396

increase timeout for testcase with 5000 relevant docs

0d09091

only 1000 for many relevant docs

157fc0b

increase timeout

dd515dc

thomasht86 marked this pull request as ready for review September 30, 2025 07:15

update tables and graphs in notebook

a3c9dd7

lrjball mentioned this pull request Sep 30, 2025

Fixed hardcoded id_field in VespaMatchEvaluator #1128

Closed

thomasht86 added 3 commits September 30, 2025 11:52

add id field to rag-blueprint code

c949c67

make id field mandatory

0abb49f

Merge branch 'master' into thomasht86/fix-nearestneighbor-matchevaluator

5683988

make test have id_field

5fc0e68

thomasht86 mentioned this pull request Sep 30, 2025

Add id field ragblueprint vespa-engine/sample-apps#1781

Open

thomasht86 requested a review from glebashnik September 30, 2025 11:43

glebashnik requested a review from Copilot September 30, 2025 12:16

Copilot AI reviewed Sep 30, 2025

glebashnik requested changes Sep 30, 2025
