BUG: sparse.hstack returns incorrect result when the stack would result in indices too large for np.int32
#16569
Closed
Description
Describe your issue.
Over in scikit-learn we ran into this bug while working on this PR. Essentially, when using sparse.hstack on a collection of sparse (CSR) matrices whose individual indices arrays contain values no greater than the maximum for np.int32, but whose stacked result needs column indices larger than that maximum, the operation produces incorrect results.
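The wraparound itself can be seen without SciPy. This is a minimal sketch, assuming the column offset of the first block is added to the second block's indices while both are still stored as np.int32. The variable names are illustrative and are not taken from SciPy's internals.

```python
import numpy as np

max_int32 = np.iinfo(np.int32).max

# Column index of the single nonzero in the second block (fits in int32).
indices_2 = np.array([1], dtype=np.int32)
# Offset = number of columns in the first block (also fits in int32).
offset = np.array([max_int32], dtype=np.int32)

# int32 array arithmetic wraps silently on overflow.
wrapped = indices_2 + offset

# Promoting to int64 before adding gives the intended column index.
correct = indices_2.astype(np.int64) + offset.astype(np.int64)

print(wrapped.dtype, wrapped)
print(correct.dtype, correct)
```

Each operand fits in int32 on its own, so a check that inspects only the inputs would not flag the problem.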
I believe the problem is within sparse._construct._stack_along_minor_axis. In particular, its dtype resolution for indices and indptr misses this edge case (and perhaps some others).
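One way to cover this edge case is to base the index dtype on the largest value the output will hold, rather than on the values in the inputs. The sketch below is hypothetical: resolve_index_dtype and its parameters are invented for illustration and are not SciPy's actual implementation or fix.

```python
import numpy as np


def resolve_index_dtype(n_cols_per_block, nnz_total):
    """Hypothetical helper: choose int32 only if every output index fits.

    For a CSR result, `indices` holds column indices up to
    sum(n_cols_per_block) - 1 and `indptr` holds values up to nnz_total.
    """
    max_int32 = np.iinfo(np.int32).max
    largest_col_index = sum(n_cols_per_block) - 1
    largest_value = max(largest_col_index, nnz_total)
    return np.int32 if largest_value <= max_int32 else np.int64


max_int32 = np.iinfo(np.int32).max

# Shapes from the reproducer: blocks with max_int32 and 2 columns, 2 nonzeros.
dtype_large = resolve_index_dtype([max_int32, 2], nnz_total=2)

# A small stack still fits comfortably in int32.
dtype_small = resolve_index_dtype([10, 20], nnz_total=5)
```

With the reproducer's shapes this selects np.int64, so the stacked column index ind_1 + ind_2 - 1 would be representable.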
Reproducing Code Example
from scipy import sparse
import numpy as np
data = [1.0]
row = [0]
max_int32 = np.iinfo(np.int32).max
ind_1 = max_int32
ind_2 = 2
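# Failure condition 1: the largest stacked column index does not fit in int32 ...
assert ind_1 + ind_2 - 1 > max_int32
# Failure condition 2: ... while each input's own column index still does.
assert max(ind_1 - 1, ind_2 - 1) < max_int32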
col_1 = [ind_1 - 1]
col_2 = [ind_2 - 1]
X_1 = sparse.csr_matrix((data, (row, col_1)))
X_2 = sparse.csr_matrix((data, (row, col_2)))
Z = sparse.hstack([X_1, X_2], format="csr")
print(Z.indices) # [65534 -2147450882]
assert Z.indices.max() == ind_1 + ind_2 - 1
Error message
N/A
SciPy/NumPy/Python version information
1.10.0.dev0+0.df3fe4e 1.24.0.dev0+449.g353fea031 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)