BUG: sparse.hstack returns incorrect result when the stack would result in indices too large for np.int32 #16569

Closed
@Micky774

Description

Describe your issue.

Over in scikit-learn we ran into this bug in the pursuit of this PR. Essentially, when sparse.hstack is applied to a collection of sparse (CSR) matrices whose individual indices arrays each contain values no greater than the maximum for np.int32, but whose stacked result would need column indices larger than that maximum, the operation produces incorrect results.

I believe the problem is within sparse._construct._stack_along_minor_axis. In particular, its dtype resolution for indices and indptr misses this edge case (and perhaps some others).
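To illustrate the kind of check the dtype resolution would need, here is a minimal sketch using only NumPy. It is not SciPy's actual code: the helper name index_dtype_for_hstack and its parameters are hypothetical, and it only assumes that the index dtype has to accommodate the largest column index of the stacked result rather than the largest index of any single input.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max


def index_dtype_for_hstack(widths, nnz):
    """Hypothetical helper (not part of SciPy): pick an index dtype
    wide enough for the horizontally stacked result.

    widths: number of columns of each input block
    nnz:    total number of stored entries across all blocks
    """
    total_cols = sum(int(w) for w in widths)
    largest_index = max(total_cols - 1, 0)
    # Both the largest column index and the largest indptr value (nnz)
    # must be representable in the chosen dtype.
    if largest_index > INT32_MAX or int(nnz) > INT32_MAX:
        return np.int64
    return np.int32


# Two blocks that each fit in int32 on their own, as in the report.
dtype = index_dtype_for_hstack([INT32_MAX, 2], nnz=2)
print(dtype)

# If the column offset is instead added while still in int32,
# the arithmetic silently wraps around to a negative index.
wrapped = np.array([1], dtype=np.int32) + np.int32(INT32_MAX)
print(wrapped)
```

The point of the sketch is that the decision depends on the sum of the block widths, so checking each input's own indices is not sufficient.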

Reproducing Code Example

from scipy import sparse 
import numpy as np

data = [1.0]
row = [0]

max_int32 = np.iinfo(np.int32).max
ind_1 = max_int32
ind_2 = 2
assert ind_1 + ind_2 - 1 > max_int32  # failure condition: largest stacked index exceeds int32
assert max(ind_1 - 1, ind_2 - 1) < max_int32  # failure condition: each input index fits in int32

col_1 = [ind_1 - 1] 
col_2 = [ind_2 - 1]
X_1 = sparse.csr_matrix((data, (row, col_1)))
X_2 = sparse.csr_matrix((data, (row, col_2)))
Z = sparse.hstack([X_1, X_2], format="csr")

print(Z.indices) # [65534 -2147450882]
assert Z.indices.max() == ind_1 + ind_2 - 1

Error message

N/A

SciPy/NumPy/Python version information

1.10.0.dev0+0.df3fe4e 1.24.0.dev0+449.g353fea031 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)
