-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Make Categorize usable in aggs when identical to a grouping #117835
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 5 out of 7 changed files in this pull request and generated no suggestions.
Files not reviewed (2)
- x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec: Language not supported
- x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java: Evaluated as low risk
aggregateFunction -> aggregateFunction.forEachDown( | ||
Categorize.class, | ||
categorize -> failures.add( | ||
fail(categorize, "cannot use CATEGORIZE grouping function [{}] within an aggregation", categorize.sourceText()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this specific to CATEGORIZE? Or should this apply to all grouping functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only categorize; BUCKET does work, as it works technically as a normal scalar function.
I guess, if another stateful grouping function appears, we may either add a StatefulGroupingFunction
superclass, or change Bucket to be a normal function with exceptions (maybe?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I have one question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ivancea ! This essentially LGTM - I'd only ask to please make the code change in ReplaceAggregateAggExpressionWithEval.java
specific to the Categorize
class to not make it seem like Bucket
also uses this code path - see below.
Otherwise, only minor comments :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test cases!
|
||
// Forbid CATEGORIZE being referenced in the aggregation functions | ||
// Forbid CATEGORIZE being referenced as a child of an aggregation function | ||
Map<NameId, Categorize> categorizeByAliasId = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of scope nit, but I just realized this could also just be an AttributeSet
.
error("FROM test | STATS COUNT(`CATEGORIZE(first_name)`) BY CATEGORIZE(first_name)") | ||
); | ||
|
||
assertEquals( | ||
"1:28: can only use grouping function [CATEGORIZE(last_name)] part of the BY clause", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of scope nit:
"1:28: can only use grouping function [CATEGORIZE(last_name)] part of the BY clause", | |
"1:28: can only use grouping function [CATEGORIZE(last_name)] as part of the BY clause", |
@@ -51,6 +54,14 @@ protected LogicalPlan rule(Aggregate aggregate) { | |||
AttributeMap<Expression> aliases = new AttributeMap<>(); | |||
aggregate.forEachExpressionUp(Alias.class, a -> aliases.put(a.toAttribute(), a.child())); | |||
|
|||
// build grouping functions map | |||
Map<GroupingFunction, Attribute> groupingAttributes = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be specific for Categorize
, instead of applying this code to all grouping functions?
The added code in this file seems correct to me, but it has a chance to introduce edge cases to BUCKET
that we simply have no tests against.
And even if it doesn't, it means it doesn't have an effect on BUCKET
and is thus needlessly general because I can remove it without consequences for BUCKET
but that's absolutely not evident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I changed the types to Categorize, and added a comment explaining why Bucket doesn't apply. Eventually, this should be change to a StatefulGroupingFunction
or whatever we create, if we do so. With all the other usages of Categorize
around the rules
required_capability: categorize_v5 | ||
|
||
FROM sample_data | ||
| STATS COUNT(), x = MV_APPEND(CATEGORIZE(message), `CATEGORIZE(message)`) BY CATEGORIZE(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I don't think we have a corresponding test for BUCKET
. Do you mind adding one so the tests ensure consistency between the two?
The closest we have is bucket.reuseGroupingFunctionWithExpression
, but this doesn't refer to the same grouping function in 2 ways, it refers to 2 different grouping functions in 2 ways.
required_capability: categorize_v5 | ||
|
||
FROM sample_data | ||
| STATS COUNT(), x = MV_APPEND(CATEGORIZE(message), category) BY category=CATEGORIZE(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, I'd love a test for BUCKET
that uses it both with and without its alias inside a STAT
for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have this (The bucket.reuseGroupingFunctionWithExpression
you mentioned has the same components). So I just added another one with the implicit alias
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 5 out of 8 changed files in this pull request and generated no suggestions.
Files not reviewed (3)
- x-pack/plugin/esql/qa/testFixtures/src/main/resources/bucket.csv-spec: Language not supported
- x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec: Language not supported
- x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java: Evaluated as low risk
# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
💚 Backport successful
|
…astic#117835) Cases like `STATS MV_APPEND(cat, CATEGORIZE(x)) BY cat=CATEGORIZE(x)` should work, as they're moved to an EVAL by a rule. Also, these cases were discarded, as they fail because of other verifications (Which also fail for BUCKET): ``` STATS x = category BY category=CATEGORIZE(message) STATS x = CATEGORIZE(message) BY CATEGORIZE(message) STATS x = CATEGORIZE(message) BY category=CATEGORIZE(message)
…17835) (#117895) Cases like `STATS MV_APPEND(cat, CATEGORIZE(x)) BY cat=CATEGORIZE(x)` should work, as they're moved to an EVAL by a rule. Also, these cases were discarded, as they fail because of other verifications (Which also fail for BUCKET): ``` STATS x = category BY category=CATEGORIZE(message) STATS x = CATEGORIZE(message) BY CATEGORIZE(message) STATS x = CATEGORIZE(message) BY category=CATEGORIZE(message)
@@ -407,7 +407,7 @@ public enum Cap { | |||
/** | |||
* Supported the text categorization function "CATEGORIZE". | |||
*/ | |||
CATEGORIZE_V4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, now that CATEGORIZE
is available in tech preview, it's important we do not replace the capability anymore - instead, any change that does not work on old nodes requires that we add a new capability and mark it as requirements for the corresponding tests.
Otherwise, bwc tests (esp. rolling upgrade tests for e.g. Serverless) will not be run.
Cases like
STATS MV_APPEND(cat, CATEGORIZE(x)) BY cat=CATEGORIZE(x)
should work, as they're moved to an EVAL by a rule.Also, these cases were discarded, as they fail because of other verifications (Which also fail for BUCKET):