Arrow String View Type #10481

pdet · 2024-02-06T15:55:26Z

This PR implements the production and consumption of Arrow String Views.

By default, we will still produce regular strings unless produce_arrow_string_view is set.

 SET produce_arrow_string_view=True

In that case, all strings produced will be string views.

If necessary, we could offer users the option to specify which columns they would prefer to be produced in one way or another. However, it currently seems to me that if a system implements operations on string views, it will most likely always prefer that.

Another point to consider is that the production of string views, especially for strings that are not inlined, can be performed over multiple data buffers. Perhaps, at some point, we might also want to allow users to specify this if it could be advantageous for them. Currently, we use only one data buffer for all strings.
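To make the inline versus non-inline distinction concrete, here is a minimal sketch of the 16-byte view layout described in the Arrow columnar format, using a single data buffer as this PR currently does. The helper names `encode_view` and `decode_view` are hypothetical and are not DuckDB or PyArrow APIs. The 12-byte inline limit matches the `max_inlined_bytes` constant in this PR's diff.

```python
import struct

MAX_INLINED_BYTES = 12  # strings up to this length are stored inside the view itself
VIEW_SIZE = 16          # every view is a fixed-size 16-byte struct


def encode_view(value: bytes, data_buffer: bytearray, buffer_index: int = 0) -> bytes:
    """Encode one string as a 16-byte view (hypothetical helper, for illustration).

    Long strings are appended to data_buffer; short ones never touch it.
    """
    length = len(value)
    if length <= MAX_INLINED_BYTES:
        # Inline layout: int32 length, then up to 12 data bytes, zero-padded.
        return struct.pack("<i12s", length, value)
    # Non-inline layout: int32 length, 4-byte prefix, int32 buffer index, int32 offset.
    offset = len(data_buffer)
    data_buffer.extend(value)
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)


def decode_view(view: bytes, data_buffers: list) -> bytes:
    """Read a string back from its view and the list of data buffers."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= MAX_INLINED_BYTES:
        return view[4:4 + length]
    # Skip the length (4 bytes) and the prefix (4 bytes).
    buffer_index, offset = struct.unpack_from("<ii", view, 8)
    return bytes(data_buffers[buffer_index][offset:offset + length])
```

Spreading strings over several data buffers would only change which `buffer_index` the producer writes into each view; the consumer side of this sketch already indexes into a list of buffers.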

Because PyArrow is still in the process of implementing support for creating String Views in its API (see apache/arrow#39633), we only test against string views that we produce ourselves. Since it is much more relevant to verify our compliance with the Arrow spec using PyArrow's own strings, maybe we should wait to consider merging this PR until we can properly test our interoperability with PyArrow String Views.

cc @ianmcook

ianmcook · 2024-02-06T16:35:05Z

@bkietz you might be interested in taking a look

ianmcook · 2024-02-07T14:57:49Z

@pdet apache/arrow#39852 just merged. With a development build of PyArrow, now you can create Arrow objects using the StringView type like this:

import pyarrow as pa
pa.array(['foo', 'bar'], type=pa.string_view())

Tishj · 2024-02-14T15:50:36Z

src/include/duckdb/common/types/arrow_string_view_type.hpp

+		D_ASSERT(!IsInline());
+		return ref.offset;
+	}
+	static constexpr uint8_t max_inlined_bytes = 12;


MAX_INLINED_BYTES ?

Tishj

Thanks, everything looks correct to me.
Just a couple of clarity- and readability-related comments.

pdet · 2024-02-14T17:06:10Z

One curiosity: the latest version of Polars seems to be able to read string views produced by DuckDB (I'll add a proper test for that tomorrow). However, I'm not yet sure what happens internally in Polars: when calling to_arrow on a Polars DataFrame constructed from a DuckDB string view, we get regular strings.

I've seen a blog post saying that they changed their internal representation to string views, but I'm not sure yet which operations they run on them. It might be interesting to do a quick benchmark on that. :-)

…iffer in the C Data Interface (#40156)

### Rationale for this change

Attempt to draw more attention to the fact that the buffer listing / number of buffers differ between the main Format spec and the C Data Interface, for the Binary View layout. Triggered by feedback from implementing this in duckdb at duckdb/duckdb#10481 (comment)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
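The buffer-count difference mentioned in this commit message can be sketched as follows. This is an illustration based on my reading of the Arrow documentation, not code from either project, and the function names are hypothetical: in the columnar format a Binary View array has a validity buffer, a views buffer, and any number of variadic data buffers, while the C Data Interface appends one more buffer of int64 values holding the size of each data buffer.

```python
import struct


def buffers_in_format_spec(num_data_buffers: int) -> int:
    # Validity bitmap + views buffer + the variadic data buffers.
    return 2 + num_data_buffers


def buffers_in_c_data_interface(num_data_buffers: int) -> int:
    # Same as above, plus one trailing buffer with the data buffer sizes.
    return buffers_in_format_spec(num_data_buffers) + 1


def variadic_sizes_buffer(data_buffers: list) -> bytes:
    # One little-endian int64 per data buffer, in the same order as the buffers.
    sizes = [len(b) for b in data_buffers]
    return struct.pack(f"<{len(sizes)}q", *sizes)
```

With the single data buffer this PR produces, a consumer reading through the C Data Interface would therefore expect four buffers rather than three.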

pdet added 13 commits January 24, 2024 16:02

basics of arrow strign view

4556a7c

more on arrow string views

88561f4

Add set option to produce string views in arrow

a047727

More on produce_arrow_string_view propagation

08b939f

Make the string view produce small strings

aeac533

Got small strings to work

32f47bf

produce not inline strings

05c3c8a

Can write and read big strings

a27eac1

Make this test go over vector sizes

dece9c4

merge

952fda1

small adjustment

21a9e79

this should be true

87a73fd

Accidental commit of merge gunk

d05abf4

github-actions bot marked this pull request as draft February 6, 2024 15:57

Mytherin added the Draft label Feb 7, 2024

bkietz reviewed Feb 7, 2024

src/include/duckdb/main/settings.hpp (outdated, resolved)

src/include/duckdb/common/arrow/appender/varchar_data.hpp (outdated, resolved)

ianmcook reviewed Feb 8, 2024

src/function/table/arrow_conversion.cpp (resolved)

pdet added 4 commits February 13, 2024 11:14

use union of arrow string view in consumption

c34ebc6

Also using string_view_t in producer

700d75c

Adding first pyarrow string_view test, skip if pyarrow version is not sufficiently high

4053649

correctly setting the result_idx

796f437

bkietz suggested changes Feb 13, 2024

src/include/duckdb/common/arrow/appender/varchar_data.hpp (outdated, resolved)

src/include/duckdb/common/arrow/appender/varchar_data.hpp (outdated, resolved)

pdet added 2 commits February 14, 2024 14:13

Creating extra buffer, bits of cleanup, more pyarrow - stringview testing

8da231d

Add a simple test to verify the reads work from multiple data buffers

9112a2b