Close block series client at the end to not reuse chunk buf #7915
In Cortex we run query fuzz tests to make sure query results are compatible with the latest Cortex or Prometheus release.
I noticed a strange query failure when comparing Cortex query results with the results from the latest Prometheus v2.55.1: https://github.com/cortexproject/cortex/actions/runs/11858364637/job/33061424720?pr=6340. The Cortex block is loaded in the Store Gateway and the same block is loaded in Prometheus. While trying to reproduce the issue locally, I found that occasionally (roughly 1 run in 1000) the Cortex query returns only 2 series rather than the expected 3.
I added some logs to print out the chunk content for the series that goes missing. It seems that the chunk buf gets reused before the series response is sent over gRPC.
The issue seems to be with https://github.com/thanos-io/thanos/pull/7821/files#diff-3e2896fafa6ff73509c77df2c4389b68828e02575bb4fb78b6c34bcfb922a7ceR3357: the block chunk reader is closed when the loser tree closes a response series set. This doesn't guarantee that the block chunk reader is closed at the end of the Series call, because when a response series set is exhausted it is closed first, at https://github.com/thanos-io/thanos/blob/main/pkg/losertree/tree.go#L66, and after that the chunk buffer can be reused.
Changes
Since the original intent of #7821 was to fix an issue for the in-process store client, I reverted the code for the block series client. It now closes the block series client at the end of the Series function to make sure the chunk buffer is not reused before the function returns.
Verification
I don't have a unit test that reproduces this bug, so I re-ran the same test case I used to reproduce the fuzz test failure. Without the fix the test fails reproducibly: not on every run, but roughly 5 out of 10000 queries fail. With the fix it always succeeds.