Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49565][SQL] Improve auto-generated expression aliases with pipe SQL operators #49245

Closed
wants to merge 8 commits into from

Conversation

dtenedor
Copy link
Contributor

@dtenedor dtenedor commented Dec 19, 2024

What changes were proposed in this pull request?

This RP improves auto-generated expression aliases with pipe SQL operators.

For example, consider the pipe SQL syntax query:

table t
|> extend 1

Previously, the analyzed plan was:

Project [x#x, y#x, pipeexpression(1, false, EXTEND) AS pipeexpression(1)#x]
+- SubqueryAlias spark_catalog.default.t
   +- Relation spark_catalog.default.t[x#x,y#x] csv

After this PR, it is:

Project [x#x, y#x, pipeexpression(1, false, EXTEND) AS 1#x]
+- SubqueryAlias spark_catalog.default.t
   +- Relation spark_catalog.default.t[x#x,y#x] csv

Note that the output aliases visible in the resulting DataFrame for the query derive from the AS <alias> part of the analyzed plans shown.

Why are the changes needed?

This improves the user experience with pipe SQL syntax.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

Existing golden file tests update to show the improved aliases.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 19, 2024
@dtenedor dtenedor marked this pull request as ready for review December 19, 2024 18:42
@dtenedor
Copy link
Contributor Author

cc @cloud-fan @gengliangwang

@cloud-fan
Copy link
Contributor

Shall we make AliasResolution to properly support PipeExpression?

@dtenedor
Copy link
Contributor Author

dtenedor commented Dec 20, 2024

Shall we make AliasResolution to properly support PipeExpression?

@cloud-fan I tried this, but the PipeExpression needs to check whether its child expression tree contains any aggregate functions, and this needs its child tree to be fully resolved first (which may not have happened yet by the time we run the ResolveAliases rule).

@cloud-fan
Copy link
Contributor

@dtenedor after a few iterations, ResolveAliases should see PipeExpression with its child resolved, right? The thing is that we should have consistent auto alias generation behavior between normal and pipe SQL syntaxes, e.g.

      case ne: NamedExpression => ne
      case go @ GeneratorOuter(g: Generator) if g.resolved => MultiAlias(go, Nil)
      case e if !e.resolved => u
      case g: Generator => MultiAlias(g, Nil)
      case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
      case e: ExtractValue if extractOnly(e) => Alias(e, toPrettySQL(e))()

@dtenedor
Copy link
Contributor Author

dtenedor commented Jan 6, 2025

@cloud-fan I looked again and I was able to make this work in ResolveAliases by moving the check for the PipeExpression until after the existing check that the child of the UnresolvedAlias is itself resolved.

@cloud-fan
Copy link
Contributor

@dtenedor there are more test failures in the golden files.

@dtenedor dtenedor requested a review from cloud-fan January 7, 2025 20:28
@dtenedor dtenedor requested a review from cloud-fan January 8, 2025 18:53
@@ -503,7 +504,8 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
*/
object ResolveAliases extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan =
plan.resolveOperatorsUpWithPruning(_.containsPattern(UNRESOLVED_ALIAS), ruleId) {
plan.resolveOperatorsUpWithPruning(
_.containsAnyPattern(UNRESOLVED_ALIAS, PIPE_EXPRESSION), ruleId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, reverted this change.

@@ -538,6 +540,13 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
if c.child.resolved && AliasResolution.hasUnresolvedAlias(c.metrics) =>
c.copy(metrics = AliasResolution.assignAliases(c.metrics))
}

private def removePipeExpressions(exprs: Seq[NamedExpression]): Seq[NamedExpression] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, reverted this change.

case node: LogicalPlan =>
node.resolveExpressions {
case p: PipeExpression if p.child.resolved =>
p.checkInvariantsAndRemove
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we remove this lazy val and inline the validation code here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, done.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a8bec11 Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants