Skip to content

Conversation

kdn36
Copy link
Collaborator

@kdn36 kdn36 commented Sep 23, 2025

fixes #24566

This PR replaces extends the internal handling of rolling groups to dynamic_group_by. It does so by replacing the rolling attribute in AggregationContext with overlapping and initializing it accordingly.

By implicaiton, this fixes an issue related to unroll on aggregation expressions.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Sep 23, 2025
{
// panic so we find cases where we accidentally explode overlapping groups
// we don't want this as this can create a lot of data
if let GroupsType::Slice { rolling: true, .. } = self.groups.as_ref().as_ref() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst this check has merit, it cannot be completely avoided (e.g. when evaluating sort_by with pl.arange(pl.len()) as the arguments in rolling context), hence removing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we decide to not flatten if we have this state?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed a design goal to avoid memory explosion (AggregatedList aggstate) whenever we have overlapping/rolling groups, when we can avoid it.

However, we may not be able to always avoid it; the current implementation paths require refactoring; and even then, it may still be preferable to call flat_naive() when this aggstate was unavoidable in the first place.

Diving deeper, as I understand currently (still investigating, correct me I'm wrong):

  • First, I don't think this can be avoided entirely, e.g. when unique_counts or shuffle is called. Note that the underlying memory explosion happens earlier than flat_naive(), i.e. when an AggregatedList is created in combination with overlapping groups, e.g. by calling aggregated().
  • Second, there is implementation work: not every expression currently implementation has (a) a group_aware path, that (b) does not call aggregated(), in combination with (c) overlapping-aware. For example: SortBy or ApplyExpr::apply_single_group_aware. This is being worked.
  • Third, there are aggstate combinations where it is fine to call flat_naive with overlapping groups, e.g. when all inputs are AgggregatedList already, and the group lengths match, with an elementwise function. (See also proposed dispatch in the comments here PR#24520).
  • Fourth, there are cases where aggregated() is called when it shouldn't, without calling flat_naive(). It would be nice to catch these as well. Ex: ApplyExpr::apply_single_group_aware.

The following is an example where 2x AggregatedList is created and there is no group_aware implementation today, so it panics today in debug:

df = pl.DataFrame(
    {
        "time": [0, 6, 12],
        "val": [1, 2, 3],
        "by": [8, 6, 7],
    }
)
q = (
    df.lazy()
    .rolling(
        index_column="time",
        period="10i",
    )
    .agg(pl.int_range(pl.len()).sort_by(pl.int_range(pl.len()))) # panic in debug
    # .agg(pl.col("val").shuffle().sort_by(pl.col("by").shuffle())) # panic in debug
    # .agg(pl.col("val").unique_counts().sort_by(pl.col("by").unique_counts())) # panic in debug
)
out = q.collect()

I am still thinking about another way to accomplish the goal (my 2c - it should be at aggregation time, perhaps by resetting the overlapping flag when the state is inevitable, but that also has downsides). Any input is welcome.

Copy link

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 96.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.82%. Comparing base (c4bda4d) to head (699c2e5).
⚠️ Report is 28 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-core/src/frame/group_by/position.rs 80.00% 1 Missing ⚠️
crates/polars-expr/src/expressions/mod.rs 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #24577      +/-   ##
==========================================
+ Coverage   81.78%   81.82%   +0.04%     
==========================================
  Files        1685     1688       +3     
  Lines      229717   229966     +249     
  Branches     2954     2974      +20     
==========================================
+ Hits       187865   188177     +312     
+ Misses      41110    41036      -74     
- Partials      742      753      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kdn36 kdn36 marked this pull request as draft September 25, 2025 10:21
@kdn36
Copy link
Collaborator Author

kdn36 commented Sep 25, 2025

Draft pending change: overlapping should be reset to false following invocation of aggregated, as the groups will no longer be overlapping after such an event, and this may impact downstream implementation dispatch logic.

@kdn36 kdn36 marked this pull request as ready for review September 25, 2025 14:12
@ritchie46 ritchie46 force-pushed the main branch 3 times, most recently from ddf5907 to d0914d4 Compare September 27, 2025 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect result with cum_sum in dynamic_group_by ternary expression
2 participants