Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Make Categorize usable in aggs when identical to a grouping #117835

Merged
merged 8 commits into from
Dec 3, 2024

Conversation

ivancea
Copy link
Contributor

@ivancea ivancea commented Dec 2, 2024

Cases like STATS MV_APPEND(cat, CATEGORIZE(x)) BY cat=CATEGORIZE(x) should work, as they're moved to an EVAL by a rule.

Also, these cases were discarded, as they fail because of other verifications (Which also fail for BUCKET):

STATS x = category BY category=CATEGORIZE(message)
STATS x = CATEGORIZE(message) BY CATEGORIZE(message)
STATS x = CATEGORIZE(message) BY category=CATEGORIZE(message)

@ivancea ivancea added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v9.0.0 v8.18.0 labels Dec 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 7 changed files in this pull request and generated no suggestions.

Files not reviewed (2)
  • x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec: Language not supported
  • x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java: Evaluated as low risk
aggregateFunction -> aggregateFunction.forEachDown(
Categorize.class,
categorize -> failures.add(
fail(categorize, "cannot use CATEGORIZE grouping function [{}] within an aggregation", categorize.sourceText())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this specific to CATEGORIZE? Or should this apply to all grouping functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only categorize; BUCKET does work, as it works technically as a normal scalar function.
I guess, if another stateful grouping function appears, we may either add a StatefulGroupingFunction superclass, or change Bucket to be a normal function with exceptions (maybe?)

Copy link
Contributor

@jan-elastic jan-elastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I have one question

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ivancea ! This essentially LGTM - I'd only ask to please make the code change in ReplaceAggregateAggExpressionWithEval.java specific to the Categorize class to not make it seem like Bucket also uses this code path - see below.

Otherwise, only minor comments :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test cases!


// Forbid CATEGORIZE being referenced in the aggregation functions
// Forbid CATEGORIZE being referenced as a child of an aggregation function
Map<NameId, Categorize> categorizeByAliasId = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of scope nit, but I just realized this could also just be an AttributeSet.

error("FROM test | STATS COUNT(`CATEGORIZE(first_name)`) BY CATEGORIZE(first_name)")
);

assertEquals(
"1:28: can only use grouping function [CATEGORIZE(last_name)] part of the BY clause",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of scope nit:

Suggested change
"1:28: can only use grouping function [CATEGORIZE(last_name)] part of the BY clause",
"1:28: can only use grouping function [CATEGORIZE(last_name)] as part of the BY clause",

@@ -51,6 +54,14 @@ protected LogicalPlan rule(Aggregate aggregate) {
AttributeMap<Expression> aliases = new AttributeMap<>();
aggregate.forEachExpressionUp(Alias.class, a -> aliases.put(a.toAttribute(), a.child()));

// build grouping functions map
Map<GroupingFunction, Attribute> groupingAttributes = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be specific for Categorize, instead of applying this code to all grouping functions?

The added code in this file seems correct to me, but it has a chance to introduce edge cases to BUCKET that we simply have no tests against.

And even if it doesn't, it means it doesn't have an effect on BUCKET and is thus needlessly general because I can remove it without consequences for BUCKET but that's absolutely not evident.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I changed the types to Categorize, and added a comment explaining why Bucket doesn't apply. Eventually, this should be change to a StatefulGroupingFunction or whatever we create, if we do so. With all the other usages of Categorize around the rules

required_capability: categorize_v5

FROM sample_data
| STATS COUNT(), x = MV_APPEND(CATEGORIZE(message), `CATEGORIZE(message)`) BY CATEGORIZE(message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I don't think we have a corresponding test for BUCKET. Do you mind adding one so the tests ensure consistency between the two?

The closest we have is bucket.reuseGroupingFunctionWithExpression, but this doesn't refer to the same grouping function in 2 ways, it refers to 2 different grouping functions in 2 ways.

required_capability: categorize_v5

FROM sample_data
| STATS COUNT(), x = MV_APPEND(CATEGORIZE(message), category) BY category=CATEGORIZE(message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, I'd love a test for BUCKET that uses it both with and without its alias inside a STAT for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have this (The bucket.reuseGroupingFunctionWithExpression you mentioned has the same components). So I just added another one with the implicit alias

@ivancea ivancea requested review from alex-spies and Copilot December 3, 2024 11:27

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 8 changed files in this pull request and generated no suggestions.

Files not reviewed (3)
  • x-pack/plugin/esql/qa/testFixtures/src/main/resources/bucket.csv-spec: Language not supported
  • x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec: Language not supported
  • x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java: Evaluated as low risk
# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
@ivancea ivancea merged commit 03a71d2 into elastic:main Dec 3, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

ivancea added a commit to ivancea/elasticsearch that referenced this pull request Dec 3, 2024
…astic#117835)

Cases like `STATS MV_APPEND(cat, CATEGORIZE(x)) BY cat=CATEGORIZE(x)` should work, as they're moved to an EVAL by a rule.

Also, these cases were discarded, as they fail because of other verifications (Which also fail for BUCKET):
```
STATS x = category BY category=CATEGORIZE(message)
STATS x = CATEGORIZE(message) BY CATEGORIZE(message)
STATS x = CATEGORIZE(message) BY category=CATEGORIZE(message)
elasticsearchmachine pushed a commit that referenced this pull request Dec 3, 2024
…17835) (#117895)

Cases like `STATS MV_APPEND(cat, CATEGORIZE(x)) BY cat=CATEGORIZE(x)` should work, as they're moved to an EVAL by a rule.

Also, these cases were discarded, as they fail because of other verifications (Which also fail for BUCKET):
```
STATS x = category BY category=CATEGORIZE(message)
STATS x = CATEGORIZE(message) BY CATEGORIZE(message)
STATS x = CATEGORIZE(message) BY category=CATEGORIZE(message)
@@ -407,7 +407,7 @@ public enum Cap {
/**
* Supported the text categorization function "CATEGORIZE".
*/
CATEGORIZE_V4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya, now that CATEGORIZE is available in tech preview, it's important we do not replace the capability anymore - instead, any change that does not work on old nodes requires that we add a new capability and mark it as requirements for the corresponding tests.

Otherwise, bwc tests (esp. rolling upgrade tests for e.g. Serverless) will not be run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants