-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow String View Type #10481
Arrow String View Type #10481
Conversation
@bkietz you might be interested to take a look |
@pdet apache/arrow#39852 just merged. With a development build of PyArrow, now you can create Arrow objects using the StringView type like this: import pyarrow as pa
pa.array(['foo', 'bar'], type=pa.string_view()) |
D_ASSERT(!IsInline()); | ||
return ref.offset; | ||
} | ||
static constexpr uint8_t max_inlined_bytes = 12; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX_INLINED_BYTES
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, everything looks correct to me
Just a couple of clarity and readability related comments
One curiosity is that it seems that the last version of Polars can read string views produced by DuckDB (I'll add a proper test for that tomorrow), but I'm not yet sure what is happening internally in Polars when calling I've seen a blog post that they changed their internal representation to string views, but I'm not sure what operations they do yet, maybe would be interesting to do a quick benchmark on that. :-) |
…iffer in the C Data Interface (#40156) ### Rationale for this change Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout. Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment) Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…iffer in the C Data Interface (apache#40156) ### Rationale for this change Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout. Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment) Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…iffer in the C Data Interface (apache#40156) ### Rationale for this change Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout. Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment) Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This PR implements the production and consumption of Arrow String Views.
By default, we will still produce regular strings unless
produce_arrow_string_view
is set.In that case, all strings produced will be string views.
If necessary, we could offer users the option to specify which columns they would prefer to be produced in one way or another. However, it currently seems to me that if a system implements operations on string views, it will most likely always prefer that.
Another point to consider is that the production of string views, especially for strings that are not inlined, can be performed over multiple data buffers. Perhaps, at some point, we might also want to allow users to specify this if it could be advantageous for them. Currently, we use only one data buffer for all strings.
Because PyArrow is still in the process of implementing support for creating String Views in their API (see apache/arrow#39633), we only test over string views that we produce ourselves. Since it is much more relevant to verify our compliance with the Arrow Spec using their own strings, maybe we should wait to consider merging this PR until we can properly test our interoperability with PyArrow String Views.
cc @ianmcook