UNNEST is very slow on a struct with high number of keys over arrow data

### What happens?

Running UNNEST function over a single row table containing a struct with 350 keys is taking more than 5 seconds on my machine. When I analyze how UNNEST function behaves, I see that the time it takes for the function to run increases exponentially with the number of keys in the data. I also added a comparison to simulate the same behavior by extracting the data, unnesting in python ingesting back on duckdb and reading it back. I would expect for both functions to behave similarly but I see a very large divergence. You can see below the comparison between these two methods of unnest.

<img width="617" alt="Screenshot 2023-11-09 at 18 01 46" src="https://github.com/duckdb/duckdb/assets/7451457/b969efa1-1b71-43ba-97dd-7a654634f4e4">




### To Reproduce

```python

import duckdb
import pyarrow as pa
import time
from matplotlib import pyplot as plt
import timeit
from functools import partial

def get_table(col_size):
    "Creates a one row arrow table with the given column size"
    row = {f"col_{idx}": f"my_value_{idx}" for idx in range(col_size)}
    return pa.Table.from_pydict({"data": pa.array([row])})

def run_unnest(col_size):    
    "Run unnest natively"
    with duckdb.connect() as con:
        con.register("my_view", get_table(col_size))
        start_time = time.time()
        con.sql("select unnest(data) from my_view").fetchall()
        end_time = time.time()
    return end_time-start_time

def run_python_unnest(col_size):
    "Read the data unnest and put it back, then read again"
    with duckdb.connect() as con:
        con.register("my_view", get_table(col_size))
        start_time = time.time()
        data = con.sql("select data from my_view").fetchall()
        con.register("my_new_view", pa.Table.from_pylist([k[0] for k in data]))
        con.execute("select * from my_new_view").fetchall()
        end_time = time.time()
    return end_time - start_time

def time_function(fn, col_size):
    "Take an average of a 5 run. Return milliseconds"
    return timeit.timeit(stmt=partial(fn, col_size=col_size), number=5)/5*1000

x = list(range(1,402,20))
y1 = [time_function(run_unnest, col_size=col_size) for col_size in x]
y2 = [time_function(run_python_unnest, col_size=col_size) for col_size in x]

plt.plot(x, y1, "r--", label="duckdb unnest")
plt.plot(x, y2, "b-", label="python unnest")
plt.legend(loc="upper left")
plt.xlabel("Number of columns")
plt.ylabel("Milliseconds to run")
plt.show()

```


### OS:

MacOS 13.2.1 aarch64

### DuckDB Version:

0.9.1

### DuckDB Client:

Python

### Full Name:

Izel Levy

### Affiliation:

Explorium

### Have you tried this on the latest `main` branch?

I have tested with a main build

### Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

- [X] Yes, I have

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UNNEST is very slow on a struct with high number of keys over arrow data #9628

What happens?

To Reproduce

OS:

DuckDB Version:

DuckDB Client:

Full Name:

Affiliation:

Have you tried this on the latest `main` branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

5 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

UNNEST is very slow on a struct with high number of keys over arrow data #9628

Description

What happens?

To Reproduce

OS:

DuckDB Version:

DuckDB Client:

Full Name:

Affiliation:

Have you tried this on the latest main branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Activity

l1t1 commented on Nov 10, 2023

izellevy commented on Nov 10, 2023

Avsha-Chai commented on Feb 14, 2024

dgrozenok commented on Mar 11, 2024

github-actions commented on Jul 4, 2024

github-actions commented on Aug 4, 2024

kumarspark commented on Aug 28, 2024