Skip to content

Issues with list_reduce() #13397

Open
Open
@pkoppstein

Description

In a nutshell:

  1. The documentation of list_reduce should clarify how the existing implementation initializes the accumulator.

  2. DuckDB should be enhanced so that, in effect, it is possible to specify the initial value of the accumulator.

What happens?

Consider the task of counting the number of 0s in a list efficiently (i.e., without constructing an intermediate list).
For example, the following correctly produces 1:

select list_reduce([0,1,2], (sum,x) -> sum + CASE WHEN x = 1 then 1 ELSE 0 END) as sum; 

But the following produces 0:

select list_reduce([0,1,2], (sum,x) -> sum + CASE WHEN x = 0 then 1 ELSE 0 END) as sum; 

This is evidently because the reduction algorithm is initially using the first item in the list as sum, in accordance with the javascript specification of list reduction. The javascript specification, however, also allows the user to specify the initial value of the accumulator. Perhaps therefore the "bug" here should be understood as an oversight in the specification of DuckDB's "list_reduce".

DuckDB versions:

v1.0.0
v1.0.1-dev4052

To Reproduce

select list_reduce([0,1,2], (sum,x) -> sum + CASE WHEN x = 0 then 1 ELSE 0 END) as sum; 

Note that the problem cannot be explained away by SUM initially being NULL as the following also produces 0:

select list_reduce([0,1,2], 
  (sum,x)  -> (CASE WHEN sum IS NULL THEN 0 else sum END) +
                      (CASE WHEN x = 0 then 1 ELSE 0 END) ) as sum; 

OS:

macOS

DuckDB Version:

1.0.0 and 1.0.1

DuckDB Client:

CLI

Full Name:

Peter Koppstein

Affiliation:

Princeton University

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Assignees

No one assigned

    Labels

    Needs DocumentationUse for issues or PRs that require changes in the documentationreproduced

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions