refactor: Extend and rename rolling groups to overlapping #24577
{
    // panic so we find cases where we accidentally explode overlapping groups
    // we don't want this as this can create a lot of data
    if let GroupsType::Slice { rolling: true, .. } = self.groups.as_ref().as_ref() {
Whilst this check has merit, it cannot be completely avoided (e.g. when evaluating sort_by with pl.arange(pl.len()) as the arguments in rolling context), hence removing it.
Can't we decide to not flatten if we have this state?
It is indeed a design goal to avoid memory explosion (the AggregatedList aggstate) whenever we have overlapping/rolling groups and it can be avoided.
However, we may not always be able to avoid it; the current implementation paths require refactoring; and even then, it may still be preferable to call flat_naive() when this aggstate was unavoidable in the first place.
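To make the memory concern concrete, here is a minimal pure-Python sketch. It is a conceptual model, not Polars internals, and the helper names are hypothetical. Overlapping groups stored as (offset, length) slices all reference one shared buffer, whereas materializing every group into its own list copies each row once per window that contains it.

```python
def rolling_slices(n: int, window: int) -> list[tuple[int, int]]:
    """Hypothetical helper: (offset, length) slices for a trailing window of `window` rows."""
    slices = []
    for end in range(1, n + 1):
        start = max(0, end - window)
        slices.append((start, end - start))
    return slices


def materialized_len(slices: list[tuple[int, int]]) -> int:
    """Number of values held once every group is copied into its own list."""
    return sum(length for _, length in slices)


values = list(range(1_000))
slices = rolling_slices(len(values), window=100)

# Slice representation: one shared buffer plus one (offset, length) pair per group.
print(len(values), len(slices))
# Exploded representation: every group owns a copy of its rows.
print(materialized_len(slices))
```

In this toy setup the exploded form holds roughly `window` times as many values as the shared buffer, which is the growth the debug check was meant to surface.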
Diving deeper, as I currently understand it (still investigating, correct me if I'm wrong):
- First, I don't think this can be avoided entirely, e.g. when unique_counts or shuffle is called. Note that the underlying memory explosion happens earlier than flat_naive(), i.e. when an AggregatedList is created in combination with overlapping groups, e.g. by calling aggregated().
- Second, there is implementation work: not every expression currently has (a) a group_aware path that (b) does not call aggregated() and (c) is overlapping-aware. For example: SortBy or ApplyExpr::apply_single_group_aware. This is being worked on.
- Third, there are aggstate combinations where it is fine to call flat_naive with overlapping groups, e.g. when all inputs are AggregatedList already and the group lengths match, with an elementwise function. (See also the proposed dispatch in the comments in PR #24520.)
- Fourth, there are cases where aggregated() is called when it shouldn't be, without calling flat_naive(). It would be nice to catch these as well. Example: ApplyExpr::apply_single_group_aware.
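The third point can be illustrated with a small pure-Python sketch. This is one reading of the argument, not the Polars code path, and `explode` and `flatten` are hypothetical stand-ins for creating an AggregatedList and for flat_naive respectively. Once both inputs have already been materialized with identical group lengths, applying an elementwise function to the flattened buffers gives the same values as applying it group by group.

```python
from itertools import chain


def explode(values, slices):
    """Copy each (offset, length) group out of the shared buffer."""
    return [values[offset:offset + length] for offset, length in slices]


def flatten(groups):
    """Concatenate the groups back into one flat list."""
    return list(chain.from_iterable(groups))


slices = [(0, 1), (0, 2), (1, 2)]  # overlapping groups over three rows
a = explode([1, 2, 3], slices)
b = explode([8, 6, 7], slices)

# Elementwise addition evaluated per group ...
per_group = [[x + y for x, y in zip(ga, gb)] for ga, gb in zip(a, b)]
# ... and evaluated once on the flattened buffers.
flat = [x + y for x, y in zip(flatten(a), flatten(b))]

print(flatten(per_group) == flat)
```

The equivalence relies on both inputs sharing the same group boundaries; a function that is not elementwise (a sort, for instance) would not commute with flattening in this way.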
The following is an example where two AggregatedList states are created and there is no group_aware implementation today, so it panics in debug:
import polars as pl

df = pl.DataFrame(
    {
        "time": [0, 6, 12],
        "val": [1, 2, 3],
        "by": [8, 6, 7],
    }
)
q = (
    df.lazy()
    .rolling(
        index_column="time",
        period="10i",
    )
    .agg(pl.int_range(pl.len()).sort_by(pl.int_range(pl.len())))  # panic in debug
    # .agg(pl.col("val").shuffle().sort_by(pl.col("by").shuffle()))  # panic in debug
    # .agg(pl.col("val").unique_counts().sort_by(pl.col("by").unique_counts()))  # panic in debug
)
out = q.collect()
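For reference, the groups this query operates on can be worked out by hand. The sketch below is pure Python with a hypothetical helper, and it assumes the default closed="right" behaviour of rolling, i.e. the window for each row is (time - period, time].

```python
def rolling_windows(times, period):
    """Hypothetical helper: (offset, length) per row for windows (t - period, t] over sorted times."""
    out = []
    for i, t in enumerate(times):
        start = i
        # Extend the window backwards while the previous row is still inside (t - period, t].
        while start > 0 and times[start - 1] > t - period:
            start -= 1
        out.append((start, i - start + 1))
    return out


windows = rolling_windows([0, 6, 12], period=10)
print(windows)
```

Under that assumption the rows at time 0 and time 6 each fall into two windows, so the groups overlap and any step that materializes them duplicates those rows.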
I am still thinking about another way to accomplish the goal (my 2c: it should be at aggregation time, perhaps by resetting the overlapping flag when the state is inevitable, but that also has downsides). Any input is welcome.
Codecov Report
@@            Coverage Diff             @@
##             main   #24577      +/-   ##
==========================================
+ Coverage   81.78%   81.82%   +0.04%
==========================================
  Files        1685     1688       +3
  Lines      229717   229966     +249
  Branches     2954     2974      +20
==========================================
+ Hits       187865   188177     +312
+ Misses      41110    41036      -74
- Partials      742      753      +11
Draft pending change:
Force-pushed from ddf5907 to d0914d4.
fixes #24566
This PR extends the internal handling of rolling groups to dynamic_group_by. It does so by replacing the rolling attribute in AggregationContext with overlapping and initializing it accordingly.
By implication, this fixes an issue related to unroll on aggregation expressions.
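As an illustration of why dynamic_group_by can need the same treatment as rolling, its windows overlap whenever period exceeds every. The sketch below is a simplified conceptual model, not the PR's implementation: the helpers are hypothetical, the index is a non-empty sorted list of integers, and windows are assumed to be [start, start + period) with starts at multiples of every.

```python
def dynamic_windows(times, every, period):
    """Hypothetical model: (offset, length) slices for windows [s, s + period), s = 0, every, 2*every, ..."""
    slices = []
    start = 0
    while start <= times[-1]:
        members = [i for i, t in enumerate(times) if start <= t < start + period]
        if members:  # skip empty windows
            slices.append((members[0], len(members)))
        start += every
    return slices


def is_overlapping(slices):
    """True if any group starts before the previous group ends."""
    return any(
        next_offset < offset + length
        for (offset, length), (next_offset, _) in zip(slices, slices[1:])
    )


times = [0, 1, 2, 3, 4, 5]
print(is_overlapping(dynamic_windows(times, every=2, period=2)))  # tumbling windows
print(is_overlapping(dynamic_windows(times, every=2, period=4)))  # period > every
```

A flag derived like this is what a name such as overlapping would describe in this model: it depends on how the slices relate to each other, not on whether they came from rolling or from dynamic_group_by.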