Fix inconsistent array type for binary numerical operators result between array and scalar #6269

Merged: 4 commits, May 9, 2023

Member

@viirya viirya commented May 6, 2023

Which issue does this PR close?

Closes #6243.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the labels core (Core DataFusion crate), physical-expr (Physical Expressions), and sqllogictest (SQL Logic Tests (.slt)) on May 6, 2023
@viirya viirya changed the title from "Cast binary numerical operators result between array and scalar to primitive array" to "Fix inconsistent array type for binary numerical operators result between array and scalar" on May 6, 2023
Contributor

@alamb alamb left a comment

Given that queries that used to fail now pass with this change, I think it is a step in the right direction.

As I understand it, this code will force the output of an expression back to a non-dictionary type at query time.

Now that some of the kernels actually support dictionary encoded arrays natively, I wonder if it would be possible to maintain the dictionary encoding as part of the coercion rules (rather than coercing the output to a primitive type)?

https://github.com/apache/arrow-datafusion/blob/2e9beeba01b85afb6d4f6557201e673008ea9edd/datafusion/expr/src/type_coercion/binary.rs#L475-L483

@alamb
Contributor

alamb commented May 8, 2023

Thank you @viirya

@viirya
Member Author

viirya commented May 8, 2023

> Now that some of the kernels actually support dictionary encoded arrays natively, I wonder if it would be possible to maintain the dictionary encoding as part of the coercion rules (rather than coercing the output to a primitive type)?

For the mathematical numerical kernels, the result of an operation on two dictionary input arrays is a primitive array. So for mathematics_numerical_coercion, the current rule still looks correct.
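To make the point about the declared result type concrete, here is a minimal sketch using a toy type enum. `ToyType`, `value_type`, `widen`, and `math_result_type` are hypothetical names invented for illustration; this is not the body of mathematics_numerical_coercion, and the widening rule is deliberately simplified. It only shows the idea that the declared result of a math operation is based on the dictionary's value type, not on the dictionary type itself.

```rust
// Hypothetical, simplified stand-in for a logical type enum (not arrow's DataType).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Int64,
    Float64,
    // Only the value type is modelled; the key type is omitted for brevity.
    Dictionary(Box<ToyType>),
}

/// Strip one level of dictionary encoding, leaving the value type.
fn value_type(t: &ToyType) -> &ToyType {
    match t {
        ToyType::Dictionary(inner) => &**inner,
        other => other,
    }
}

/// Toy widening rule between plain numeric types (an assumption, not the real rule).
fn widen(l: &ToyType, r: &ToyType) -> Option<ToyType> {
    use ToyType::*;
    match (l, r) {
        // Nested dictionaries are out of scope for this sketch.
        (Dictionary(_), _) | (_, Dictionary(_)) => None,
        (Float64, _) | (_, Float64) => Some(Float64),
        (Int64, _) | (_, Int64) => Some(Int64),
        (Int32, Int32) => Some(Int32),
    }
}

/// Declared result type of a math operation: always a primitive type,
/// even when one or both inputs are dictionary-encoded.
fn math_result_type(l: &ToyType, r: &ToyType) -> Option<ToyType> {
    widen(value_type(l), value_type(r))
}

fn main() {
    let dict_i32 = ToyType::Dictionary(Box::new(ToyType::Int32));
    let dict_i64 = ToyType::Dictionary(Box::new(ToyType::Int64));
    println!("{:?}", math_result_type(&dict_i32, &dict_i64));
    println!("{:?}", math_result_type(&dict_i32, &ToyType::Int32));
}
```

Under this model, a kernel that hands back a dictionary array for either input combination would disagree with the declared type.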

@alamb
Contributor

alamb commented May 8, 2023

> For the mathematical numerical kernels, the result of an operation on two dictionary input arrays is a primitive array. So for mathematics_numerical_coercion, the current rule still looks correct.

Yeah, I guess I was thinking it would be nice to avoid unpacking the dictionary result into a primitive array (when possible).

@viirya
Member Author

viirya commented May 8, 2023

> Yeah, I guess I was thinking it would be nice to avoid unpacking the dictionary result into a primitive array (when possible).

I meant that for the mathematical numerical kernels (e.g. add, subtract), the result of an operation between two dictionary arrays is a primitive array. We don't unpack a dictionary array into a primitive array. This is why the coercion rule specifies the result type of such an operation as the primitive type rather than a dictionary of it.

But for the same operation between a dictionary and a scalar, the result is a dictionary array, because the operation can simply be applied to the dictionary values. That is not possible in the dictionary/dictionary case above. This inconsistency (primitive for dictionary/dictionary, dictionary for dictionary/scalar) leads to the bug we saw.

We can either change the primitive result of dictionary/dictionary operations to dictionary, or change the dictionary result of dictionary/scalar operations to primitive. This PR takes the latter as the fix. One reason is that it is simple to apply now and, I think, has less impact on performance. Another is that I'm not sure packing the dictionary/dictionary result as a dictionary makes sense. It is doable, but dictionary-encoding the result during a mathematical numerical operation might introduce a performance penalty. I'll find some time to try that.
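The inconsistency described above can be sketched with a toy model in plain Rust. Everything here (`DictArray`, `Column`, `add_scalar`, `add_dict`, `to_primitive`) is a hypothetical stand-in written for illustration, not the arrow-rs kernels or the code in this PR. It assumes a dictionary array is just a vector of keys indexing into a vector of distinct values.

```rust
// Toy model only: these types are hypothetical stand-ins, not arrow-rs APIs.

/// A dictionary-encoded column: `keys[i]` indexes into `values`.
#[derive(Debug, Clone, PartialEq)]
struct DictArray {
    keys: Vec<usize>,
    values: Vec<i64>,
}

/// A kernel result is either plain ("primitive") or dictionary-encoded.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Primitive(Vec<i64>),
    Dictionary(DictArray),
}

impl DictArray {
    /// Decode into a plain vector by looking up every key.
    fn unpack(&self) -> Vec<i64> {
        self.keys.iter().map(|&k| self.values[k]).collect()
    }
}

/// dictionary + scalar: only the distinct values are touched, so the keys
/// are reused and the result stays dictionary-encoded.
fn add_scalar(lhs: &DictArray, rhs: i64) -> Column {
    Column::Dictionary(DictArray {
        keys: lhs.keys.clone(),
        values: lhs.values.iter().map(|v| v + rhs).collect(),
    })
}

/// dictionary + dictionary: the two key sets are unrelated, so the sum is
/// computed row by row and comes back as a primitive column.
fn add_dict(lhs: &DictArray, rhs: &DictArray) -> Column {
    Column::Primitive(
        lhs.unpack().iter().zip(rhs.unpack()).map(|(a, b)| a + b).collect(),
    )
}

/// The chosen fix in miniature: normalize a dictionary result to primitive
/// so that both paths produce the same kind of column.
fn to_primitive(col: Column) -> Column {
    match col {
        Column::Dictionary(d) => Column::Primitive(d.unpack()),
        primitive => primitive,
    }
}

fn main() {
    let a = DictArray { keys: vec![0, 1, 0], values: vec![10, 20] };
    let b = DictArray { keys: vec![1, 1, 0], values: vec![1, 2] };

    println!("{:?}", add_scalar(&a, 1)); // dictionary-encoded result
    println!("{:?}", add_dict(&a, &b)); // primitive result
    println!("{:?}", to_primitive(add_scalar(&a, 1))); // normalized to primitive
}
```

In this model the alternative fix would mean re-encoding the row-by-row output of `add_dict` as a dictionary, which is the extra work the comment above suspects could cost performance.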

@alamb
Contributor

alamb commented May 9, 2023

I agree then that this solution makes sense

@viirya viirya merged commit 1dd3674 into apache:main May 9, 2023
@viirya
Member Author

viirya commented May 9, 2023

Thanks @alamb. I will find some time to look at the possibility of packing the primitive result of math kernels on dictionary/dictionary as a dictionary.

Development

Successfully merging this pull request may close these issues.

Aggregation with group by cannot work for Dictionary array