Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow String View Type #10481

Merged
merged 22 commits into from
Feb 20, 2024
Merged

Arrow String View Type #10481

merged 22 commits into from
Feb 20, 2024

Conversation

pdet
Copy link
Contributor

@pdet pdet commented Feb 6, 2024

This PR implements the production and consumption of Arrow String Views.

By default, we will still produce regular strings unless produce_arrow_string_view is set.

 SET produce_arrow_string_view=True

In that case, all strings produced will be string views.

If necessary, we could offer users the option to specify which columns they would prefer to be produced in one way or another. However, it currently seems to me that if a system implements operations on string views, it will most likely always prefer that.

Another point to consider is that the production of string views, especially for strings that are not inlined, can be performed over multiple data buffers. Perhaps, at some point, we might also want to allow users to specify this if it could be advantageous for them. Currently, we use only one data buffer for all strings.

Because PyArrow is still in the process of implementing support for creating String Views in their API (see apache/arrow#39633), we only test over string views that we produce ourselves. Since it is much more relevant to verify our compliance with the Arrow Spec using their own strings, maybe we should wait to consider merging this PR until we can properly test our interoperability with PyArrow String Views.

cc @ianmcook

@github-actions github-actions bot marked this pull request as draft February 6, 2024 15:57
@ianmcook
Copy link

ianmcook commented Feb 6, 2024

@bkietz you might be interested to take a look

@ianmcook
Copy link

ianmcook commented Feb 7, 2024

@pdet apache/arrow#39852 just merged. With a development build of PyArrow, now you can create Arrow objects using the StringView type like this:

import pyarrow as pa
pa.array(['foo', 'bar'], type=pa.string_view())

@Mytherin Mytherin added the Draft label Feb 7, 2024
D_ASSERT(!IsInline());
return ref.offset;
}
static constexpr uint8_t max_inlined_bytes = 12;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_INLINED_BYTES ?

Copy link
Contributor

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, everything looks correct to me
Just a couple of clarity and readability related comments

@pdet
Copy link
Contributor Author

pdet commented Feb 14, 2024

One curiosity is that it seems that the last version of Polars can read string views produced by DuckDB (I'll add a proper test for that tomorrow), but I'm not yet sure what is happening internally in Polars when calling to_arrow from a Polars Dataframe constructed by a DuckDB string view, we get regular strings.

I've seen a blog post that they changed their internal representation to string views, but I'm not sure what operations they do yet, maybe would be interesting to do a quick benchmark on that. :-)

@pdet pdet marked this pull request as ready for review February 15, 2024 16:41
@github-actions github-actions bot marked this pull request as draft February 19, 2024 11:56
@Mytherin Mytherin marked this pull request as ready for review February 19, 2024 14:05
@Mytherin Mytherin merged commit dec38d6 into duckdb:feature Feb 20, 2024
46 checks passed
jorisvandenbossche added a commit to apache/arrow that referenced this pull request Feb 27, 2024
…iffer in the C Data Interface (#40156)

### Rationale for this change

Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout.

Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…iffer in the C Data Interface (apache#40156)

### Rationale for this change

Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout.

Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…iffer in the C Data Interface (apache#40156)

### Rationale for this change

Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout.

Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@pdet pdet deleted the arrow_string_view branch June 25, 2024 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants