Skip to content

UNNEST is very slow on a struct with high number of keys over arrow data #9628

Closed
@izellevy

Description

What happens?

Running UNNEST function over a single row table containing a struct with 350 keys is taking more than 5 seconds on my machine. When I analyze how UNNEST function behaves, I see that the time it takes for the function to run increases exponentially with the number of keys in the data. I also added a comparison to simulate the same behavior by extracting the data, unnesting in python ingesting back on duckdb and reading it back. I would expect for both functions to behave similarly but I see a very large divergence. You can see below the comparison between these two methods of unnest.

Screenshot 2023-11-09 at 18 01 46

To Reproduce

import duckdb
import pyarrow as pa
import time
from matplotlib import pyplot as plt
import timeit
from functools import partial

def get_table(col_size):
    "Creates a one row arrow table with the given column size"
    row = {f"col_{idx}": f"my_value_{idx}" for idx in range(col_size)}
    return pa.Table.from_pydict({"data": pa.array([row])})

def run_unnest(col_size):    
    "Run unnest natively"
    with duckdb.connect() as con:
        con.register("my_view", get_table(col_size))
        start_time = time.time()
        con.sql("select unnest(data) from my_view").fetchall()
        end_time = time.time()
    return end_time-start_time

def run_python_unnest(col_size):
    "Read the data unnest and put it back, then read again"
    with duckdb.connect() as con:
        con.register("my_view", get_table(col_size))
        start_time = time.time()
        data = con.sql("select data from my_view").fetchall()
        con.register("my_new_view", pa.Table.from_pylist([k[0] for k in data]))
        con.execute("select * from my_new_view").fetchall()
        end_time = time.time()
    return end_time - start_time

def time_function(fn, col_size):
    "Take an average of a 5 run. Return milliseconds"
    return timeit.timeit(stmt=partial(fn, col_size=col_size), number=5)/5*1000

x = list(range(1,402,20))
y1 = [time_function(run_unnest, col_size=col_size) for col_size in x]
y2 = [time_function(run_python_unnest, col_size=col_size) for col_size in x]

plt.plot(x, y1, "r--", label="duckdb unnest")
plt.plot(x, y2, "b-", label="python unnest")
plt.legend(loc="upper left")
plt.xlabel("Number of columns")
plt.ylabel("Milliseconds to run")
plt.show()

OS:

MacOS 13.2.1 aarch64

DuckDB Version:

0.9.1

DuckDB Client:

Python

Full Name:

Izel Levy

Affiliation:

Explorium

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Activity

l1t1

l1t1 commented on Nov 10, 2023

@l1t1
changed the title UNNEST is very slow on data with high number of columns UNNEST is very slow on struct with high number of keys on Nov 10, 2023
changed the title UNNEST is very slow on struct with high number of keys UNNEST is very slow on struct with high number of keys over arrow data on Nov 10, 2023
izellevy

izellevy commented on Nov 10, 2023

@izellevy
Author

@l1t1 sorry for the confusion. Now I replaced the title to be more clear. I meant unnesting a struct with high number of keys over arrow data. I realized that when I create a table in duckdb from arrow data and running over it, it is 5x faster. But it is still relatively slow.

For a struct with 350 keys on my machine:

Unnest over struct (arrow): 5s
Unnest over struct (duckdb table): 1s
Unnest over struct (python unnest): 0.05s

changed the title UNNEST is very slow on struct with high number of keys over arrow data UNNEST is very slow on a struct with high number of keys over arrow data on Nov 10, 2023
Avsha-Chai

Avsha-Chai commented on Feb 14, 2024

@Avsha-Chai

I'm experiencing the same thing with every variation of json_extract and json_extract_string (including -> and ->> operators)

dgrozenok

dgrozenok commented on Mar 11, 2024

@dgrozenok

Experiencing similar issue for even fewer number of fields. Was trying to define a macro as a unit of functionality that accepts a struct with about 50 fields created from a table's columns and returning a struct with about 15 fields that are converted to columns using unnest. Seeing about 30% decrease in performance compared to just having all the expressions as part of the SQL statement explicitly.

github-actions

github-actions commented on Jul 4, 2024

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions

github-actions commented on Aug 4, 2024

@github-actions

This issue was closed because it has been stale for 30 days with no activity.

kumarspark

kumarspark commented on Aug 28, 2024

@kumarspark

This is still an issue. UNNEST on a struct with high number of keys doesn't finish executing / extremely slow

5 remaining items

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      UNNEST is very slow on a struct with high number of keys over arrow data · Issue #9628 · duckdb/duckdb