Description
In a nutshell:
-
The documentation of list_reduce should clarify how the existing implementation initializes the accumulator.
-
DuckDB should be enhanced so that, in effect, it is possible to specify the initial value of the accumulator.
What happens?
Consider the task of counting the number of 0s in a list efficiently (i.e., without constructing an intermediate list).
For example, the following correctly produces 1
:
select list_reduce([0,1,2], (sum,x) -> sum + CASE WHEN x = 1 then 1 ELSE 0 END) as sum;
But the following produces 0
:
select list_reduce([0,1,2], (sum,x) -> sum + CASE WHEN x = 0 then 1 ELSE 0 END) as sum;
This is evidently because the reduction algorithm is initially using the first item in the list as sum
, in accordance with the javascript
specification of list reduction. The javascript specification, however, also allows the user to specify the initial value of the accumulator. Perhaps therefore the "bug" here should be understood as an oversight in the specification of DuckDB's "list_reduce".
DuckDB versions:
v1.0.0
v1.0.1-dev4052
To Reproduce
select list_reduce([0,1,2], (sum,x) -> sum + CASE WHEN x = 0 then 1 ELSE 0 END) as sum;
Note that the problem cannot be explained away by SUM initially being NULL as the following also produces 0:
select list_reduce([0,1,2],
(sum,x) -> (CASE WHEN sum IS NULL THEN 0 else sum END) +
(CASE WHEN x = 0 then 1 ELSE 0 END) ) as sum;
OS:
macOS
DuckDB Version:
1.0.0 and 1.0.1
DuckDB Client:
CLI
Full Name:
Peter Koppstein
Affiliation:
Princeton University
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a source build
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have