forked from apache/spark
Commit
[SPARK-44901][SQL] Add API in Python UDTF 'analyze' method to return partitioning/ordering expressions

### What changes were proposed in this pull request?

This PR adds an API in the Python UDTF 'analyze' method to require partitioning/ordering properties from the input relation. Catalyst then performs any repartitioning and/or sorting needed to fulfill the requested properties.

For example, the following property asks Catalyst to behave as if the UDTF call included `PARTITION BY partition_col ORDER BY input`:

```
from pyspark.sql.functions import AnalyzeResult, OrderingColumn, PartitioningColumn, udtf
from pyspark.sql.types import IntegerType, StructType

@udtf
class MyUDTF:
    @staticmethod
    def analyze(self):
        return AnalyzeResult(
            schema=StructType()
                .add("partition_col", IntegerType())
                .add("count", IntegerType())
                .add("total", IntegerType())
                .add("last", IntegerType()),
            partition_by=[
                PartitioningColumn("partition_col")
            ],
            order_by=[
                OrderingColumn("input")
            ])
    ...
```

Or, the following property asks Catalyst to behave as if the UDTF call included `WITH SINGLE PARTITION`:

```
from pyspark.sql.functions import AnalyzeResult, udtf
from pyspark.sql.types import IntegerType, StructType

@udtf
class MyUDTF:
    @staticmethod
    def analyze(self):
        return AnalyzeResult(
            schema=StructType()
                .add("partition_col", IntegerType())
                .add("count", IntegerType())
                .add("total", IntegerType())
                .add("last", IntegerType()),
            with_single_partition=True)
    ...
```

### Why are the changes needed?

This gives Python UDTF authors the ability to write table functions that can assume constraints about which rows are consumed by which instances of the UDTF class.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds unit test coverage in Scala and Python.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#42595 from dtenedor/anlayze-result.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
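As a rough illustration of the guarantee described in the commit message (each UDTF instance consumes the rows of exactly one partition, in the requested order), here is a minimal plain-Python sketch that does not use Spark at all. The class `CountTotalLast` and the helper `run_partitioned` are hypothetical names invented for this sketch; they only mimic the `eval`/`terminate` shape of a Python UDTF and are not how Catalyst implements the repartitioning and sorting.

```python
from itertools import groupby
from operator import itemgetter


class CountTotalLast:
    """Hypothetical UDTF body: one instance sees exactly one partition."""

    def __init__(self):
        self.partition_col = None
        self.count = 0
        self.total = 0
        self.last = None

    def eval(self, row):
        # Rows are assumed to arrive sorted by "input" within the partition.
        self.partition_col = row["partition_col"]
        self.count += 1
        self.total += row["input"]
        self.last = row["input"]

    def terminate(self):
        # Emit one summary row: (partition_col, count, total, last).
        yield (self.partition_col, self.count, self.total, self.last)


def run_partitioned(rows, udtf_cls, partition_key, order_key):
    """Simulate PARTITION BY partition_key ORDER BY order_key (assumption:
    a simple sort-then-group stands in for Catalyst's shuffle and sort)."""
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        instance = udtf_cls()  # fresh instance per partition
        for row in group:
            instance.eval(row)
        out.extend(instance.terminate())
    return out


rows = [
    {"partition_col": 1, "input": 5},
    {"partition_col": 0, "input": 3},
    {"partition_col": 1, "input": 2},
    {"partition_col": 0, "input": 7},
]
result = run_partitioned(rows, CountTotalLast, "partition_col", "input")
print(result)  # [(0, 2, 10, 7), (1, 2, 7, 5)]
```

Because every instance only ever sees one partition, `last` can simply be the most recent value passed to `eval`; without the ordering guarantee the function would have to track the maximum itself.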
Showing 17 changed files with 1,716 additions and 55 deletions.