UNNEST is very slow on a struct with high number of keys over arrow data #9628
Closed
Description
What happens?
Running UNNEST function over a single row table containing a struct with 350 keys is taking more than 5 seconds on my machine. When I analyze how UNNEST function behaves, I see that the time it takes for the function to run increases exponentially with the number of keys in the data. I also added a comparison to simulate the same behavior by extracting the data, unnesting in python ingesting back on duckdb and reading it back. I would expect for both functions to behave similarly but I see a very large divergence. You can see below the comparison between these two methods of unnest.
To Reproduce
import duckdb
import pyarrow as pa
import time
from matplotlib import pyplot as plt
import timeit
from functools import partial
def get_table(col_size):
"Creates a one row arrow table with the given column size"
row = {f"col_{idx}": f"my_value_{idx}" for idx in range(col_size)}
return pa.Table.from_pydict({"data": pa.array([row])})
def run_unnest(col_size):
"Run unnest natively"
with duckdb.connect() as con:
con.register("my_view", get_table(col_size))
start_time = time.time()
con.sql("select unnest(data) from my_view").fetchall()
end_time = time.time()
return end_time-start_time
def run_python_unnest(col_size):
"Read the data unnest and put it back, then read again"
with duckdb.connect() as con:
con.register("my_view", get_table(col_size))
start_time = time.time()
data = con.sql("select data from my_view").fetchall()
con.register("my_new_view", pa.Table.from_pylist([k[0] for k in data]))
con.execute("select * from my_new_view").fetchall()
end_time = time.time()
return end_time - start_time
def time_function(fn, col_size):
"Take an average of a 5 run. Return milliseconds"
return timeit.timeit(stmt=partial(fn, col_size=col_size), number=5)/5*1000
x = list(range(1,402,20))
y1 = [time_function(run_unnest, col_size=col_size) for col_size in x]
y2 = [time_function(run_python_unnest, col_size=col_size) for col_size in x]
plt.plot(x, y1, "r--", label="duckdb unnest")
plt.plot(x, y2, "b-", label="python unnest")
plt.legend(loc="upper left")
plt.xlabel("Number of columns")
plt.ylabel("Milliseconds to run")
plt.show()
OS:
MacOS 13.2.1 aarch64
DuckDB Version:
0.9.1
DuckDB Client:
Python
Full Name:
Izel Levy
Affiliation:
Explorium
Have you tried this on the latest main
branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have
Activity
l1t1 commentedon Nov 10, 2023
izellevy commentedon Nov 10, 2023
@l1t1 sorry for the confusion. Now I replaced the title to be more clear. I meant unnesting a struct with high number of keys over arrow data. I realized that when I create a table in duckdb from arrow data and running over it, it is 5x faster. But it is still relatively slow.
For a struct with 350 keys on my machine:
Unnest over struct (arrow): 5s
Unnest over struct (duckdb table): 1s
Unnest over struct (python unnest): 0.05s
Avsha-Chai commentedon Feb 14, 2024
I'm experiencing the same thing with every variation of json_extract and json_extract_string (including -> and ->> operators)
dgrozenok commentedon Mar 11, 2024
Experiencing similar issue for even fewer number of fields. Was trying to define a macro as a unit of functionality that accepts a struct with about 50 fields created from a table's columns and returning a struct with about 15 fields that are converted to columns using unnest. Seeing about 30% decrease in performance compared to just having all the expressions as part of the SQL statement explicitly.
github-actions commentedon Jul 4, 2024
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
github-actions commentedon Aug 4, 2024
This issue was closed because it has been stale for 30 days with no activity.
kumarspark commentedon Aug 28, 2024
This is still an issue. UNNEST on a struct with high number of keys doesn't finish executing / extremely slow
5 remaining items