draft: symmetric sparse matrix support #22200
Conversation
+1 for the direction.

Nice! Is this faster than just using a regular sparse matrix wrapped in `Symmetric`? Also, would it make sense to store only the upper or lower triangle of A, instead of the full A?
That's fair, but I am assuming that there are two reasons to have an optimized sparse symmetric implementation: storage and speed. I was curious about this because I use Hermitian sparse matrices, so an optimized type is something I am interested in. I made a quick benchmark comparing the regular sparse matrix multiplication with this one, and with one where the storage comprises only the upper triangle. Code is there, and here are the results for a density of 0.32 and square matrices of different sizes:

```
SymmetricSparseTests.runtests(5)
Normal sparse: Trial(146.313 ns)
Symmetric sparse: Trial(164.220 ns)
Symmetric sparse (triu only): Trial(161.990 ns)
Symmetric sparse (triu only) optimized: Trial(152.664 ns)
SymmetricSparseTests.runtests(10)
Normal sparse: Trial(613.115 ns)
Symmetric sparse: Trial(632.982 ns)
Symmetric sparse (triu only): Trial(605.547 ns)
Symmetric sparse (triu only) optimized: Trial(562.321 ns)
SymmetricSparseTests.runtests(25)
Normal sparse: Trial(5.992 μs)
Symmetric sparse: Trial(6.255 μs)
Symmetric sparse (triu only): Trial(6.196 μs)
Symmetric sparse (triu only) optimized: Trial(5.992 μs)
SymmetricSparseTests.runtests(50)
Normal sparse: Trial(37.411 μs)
Symmetric sparse: Trial(41.211 μs)
Symmetric sparse (triu only): Trial(40.626 μs)
Symmetric sparse (triu only) optimized: Trial(37.119 μs)
SymmetricSparseTests.runtests(200)
Normal sparse: Trial(2.304 ms)
Symmetric sparse: Trial(2.766 ms)
Symmetric sparse (triu only): Trial(2.678 ms)
Symmetric sparse (triu only) optimized: Trial(2.422 ms)
SymmetricSparseTests.runtests(500)
Normal sparse: Trial(34.553 ms)
Symmetric sparse: Trial(45.762 ms)
Symmetric sparse (triu only): Trial(44.226 ms)
Symmetric sparse (triu only) optimized: Trial(40.840 ms)
```
You can always create …
I understand that, but if the wrapper cannot assume that only the triu part of the matrix is stored, there still need to be checks on which triangle should be looked at, and which entries should be looked at.
There were some benchmarks in the Discourse thread using a large sparse matrix. You can see them in this gist: https://gist.github.com/dpo/481b0c03dd08d26af342573df98ddc21. Profiling reveals that index tests inside the multiplication kernel do consume substantial time.
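To make the structure concrete, here is a minimal self-contained reconstruction of such a triu-stored symmetric matvec kernel, assembled from the fragments quoted in this thread; names and details are assumptions, not the exact PR code. The per-entry `row > col` test below is the kind of index check that shows up in the profile:

```julia
# Sketch (not the PR's code): y = M*x where M is symmetric and only its
# upper triangle (including the diagonal) is stored in the CSC matrix A.
function sym_matvec(A::SparseMatrixCSC, x::AbstractVector)
    colptr, rowval, nzval = A.colptr, A.rowval, A.nzval
    y = zeros(promote_type(eltype(A), eltype(x)), A.n)
    @inbounds for col = 1:A.n
        xj = x[col]
        tmp = zero(eltype(y))
        for j = colptr[col]:colptr[col+1]-1
            row = rowval[j]
            row > col && break                 # assume sorted indices; triu only
            a = nzval[j]
            y[row] += a * xj                   # upper-triangle contribution
            row == col || (tmp += a * x[row])  # mirrored lower-triangle part
        end
        y[col] += tmp
    end
    y
end
```

The `row > col && break` check runs once per stored entry, which is why the thread discusses whether it can be hoisted or avoided when the storage is known to be triangular.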
```julia
row > col && break # assume indices are sorted
a = nzval[j]
C[row, k] += a * αxj
row == col || (tmp += a * B[row, k])
```
I think this should have an `alpha`.
Of course, thanks. I chose to multiply `tmp` by `alpha` after the loop to save a few cycles.
ps: should the kernels be exported?
I don't think so.
base/sparse/symmetric.jl (Outdated)

```julia
function A_mul_B_L_kernel!(α::Number, A::Symmetric{TA,SparseMatrixCSC{TA,S}}, B::StridedVecOrMat, β::Number, C::StridedVecOrMat) where {TA,S}

    colptr = A.data.colptr
```
4 space indent, and shouldn't start a function body with a blank line
Should be fixed. Besides doing the same job for `Hermitian`, what other methods not yet covered would users find crucial?
Is there anything that can be done to make it faster than the regular sparse type? Right now this is slower, which seems undesirable.
base/sparse/symmetric.jl (Outdated)

```julia
@@ -0,0 +1,72 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

function Symmetric(A::SparseMatrixCSC, uplo::Symbol=:U)
```
This is not needed since it is equivalent to the default constructor (julia/base/linalg/symmetric.jl, line 43 in 063e8f1):

```julia
Symmetric(A::AbstractMatrix, uplo::Symbol=:U) = (checksquare(A); Symmetric{eltype(A),typeof(A)}(A, char_uplo(uplo)))
```
Right. Fixed.
```julia
    A_mul_B!(one(T), A, B, zero(T), similar(B, T, (A.data.n, size(B, 2))))
end

function A_mul_B_U_kernel!(α::Number, A::Symmetric{TA,SparseMatrixCSC{TA,S}}, B::StridedVecOrMat, β::Number, C::StridedVecOrMat) where {TA,S}
```
There are just 2 lines that differ between the `U`/`L` kernels. I think we can merge them and add a `Val{:U}`/`Val{:L}` as the last argument to the kernel? Something like:
```julia
function A_mul_B_UL_kernel!(α::Number, A::Symmetric{TA,SparseMatrixCSC{TA,S}},
                            B::StridedVecOrMat, β::Number, C::StridedVecOrMat,
                            ::Type{Val{uplo}}) where {TA,S,uplo}
    colptr = A.data.colptr
    rowval = A.data.rowval
    nzval = A.data.nzval
    if β != 1
        β != 0 ? scale!(C, β) : fill!(C, zero(eltype(C)))
    end
    @inbounds for k = 1 : size(C, 2)
        for col = 1 : A.data.n
            αxj = α * B[col, k]
            tmp = TA(0)
            # For :U, scan the column forward and stop below the diagonal;
            # for :L, scan it backward and stop above the diagonal.
            for j = (uplo == :U ? (colptr[col] : (colptr[col + 1] - 1)) :
                     ((colptr[col + 1] - 1) : -1 : colptr[col]))
                row = rowval[j]
                uplo == :U ? (row > col && break) : (row < col && break) # assume indices are sorted
                a = nzval[j]
                C[row, k] += a * αxj
                row == col || (tmp += a * B[row, k])
            end
            C[col, k] += α * tmp
        end
    end
    C
end
```
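A hedged usage sketch of the proposed merged kernel (assumes `A_mul_B_UL_kernel!` as discussed above has been defined; the setup names here are mine, not from the PR):

```julia
# Sketch: drive the merged kernel via Val-based dispatch on the stored triangle.
S = sprand(100, 100, 0.1)
U = Symmetric(triu(S), :U)     # wrapper over triu-only storage
b = rand(100, 3)
c = zeros(100, 3)
A_mul_B_UL_kernel!(1.0, U, b, 0.0, c, Val{:U})   # c ← U*b, upper-triangle path
```

Since `uplo` is a type parameter, the `uplo == :U` branches should be resolved at compile time, which is the motivation for the `Val` argument.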
There's a performance impact for that, which is why I separated them.
In fact, there's a performance impact for allowing `C` to be general, which makes me think that a separate version where `C` is just a vector is desirable.
> There's a performance impact for that, which is why I separated them.

That should be compiled away.

> In fact, there's a performance impact for allowing `C` to be general, which makes me think that a separate version where `C` is just a vector is desirable.

If the impact is noticeable it makes sense to have a separate kernel for matvec.
Using the matrix in my gist, here's the benchmark of `A * b`:
```
BenchmarkTools.Trial:
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     21.475 ms (0.00% GC)
  median time:      22.218 ms (0.00% GC)
  mean time:        22.425 ms (1.17% GC)
  maximum time:     27.263 ms (5.92% GC)
  --------------
  samples:          223
  evals/sample:     1
```
Now `U * b` and `L * b` using separate kernels:
```
BenchmarkTools.Trial:
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     28.259 ms (0.00% GC)
  median time:      29.112 ms (0.00% GC)
  mean time:        29.481 ms (0.88% GC)
  maximum time:     35.077 ms (0.00% GC)
  --------------
  samples:          170
  evals/sample:     1

BenchmarkTools.Trial:
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     33.965 ms (0.00% GC)
  median time:      35.219 ms (0.00% GC)
  mean time:        35.797 ms (0.92% GC)
  maximum time:     40.802 ms (0.00% GC)
  --------------
  samples:          140
  evals/sample:     1
```
and `U * b` and `L * b` using the joint kernel (after obvious corrections):
```
BenchmarkTools.Trial:
  memory estimate:  3.85 MiB
  allocs estimate:  4
  --------------
  minimum time:     33.787 ms (0.00% GC)
  median time:      34.744 ms (0.00% GC)
  mean time:        35.347 ms (0.88% GC)
  maximum time:     40.835 ms (0.00% GC)
  --------------
  samples:          142
  evals/sample:     1

BenchmarkTools.Trial:
  memory estimate:  3.85 MiB
  allocs estimate:  4
  --------------
  minimum time:     33.890 ms (0.00% GC)
  median time:      34.753 ms (0.00% GC)
  mean time:        35.240 ms (0.89% GC)
  maximum time:     41.668 ms (0.00% GC)
  --------------
  samples:          142
  evals/sample:     1
```
I didn't know about `Val`. It improves over my initial joint version somewhat, but there'll still be a noticeable difference over large numbers of matrix-vector products.
Try some larger matrices; a 500×500 matrix with 0.32 density is not really representative of a real use case of sparse matrices. For the cases in your benchmark post above, calling BLAS is more than 13 times faster on my computer.
I have tried larger and sparser matrices with similar results. I think the cost of accessing the dense array out of column order is high:

```
SymmetricSparseTests.run_tests(800,0.05)
Normal sparse: Trial(25.844 ms)
Symmetric sparse: Trial(32.871 ms)
Symmetric sparse (triu only): Trial(30.654 ms)
Symmetric sparse (triu only) optimized: Trial(27.898 ms)
SymmetricSparseTests.run_tests(1500,0.05)
Normal sparse: Trial(159.849 ms)
Symmetric sparse: Trial(205.979 ms)
Symmetric sparse (triu only): Trial(194.236 ms)
Symmetric sparse (triu only) optimized: Trial(179.533 ms)
```
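For anyone wanting to reproduce a comparison along these lines, here is a hedged sketch using BenchmarkTools. It is my own setup, not the `SymmetricSparseTests` module quoted above, and it uses the post-0.7 `SparseArrays` names, while this PR itself targets `base/sparse`:

```julia
# Sketch: compare a plain sparse matvec with the Symmetric-wrapper path.
using BenchmarkTools, SparseArrays, LinearAlgebra

n, d = 1500, 0.05
A = sprand(n, n, d)
A = A + A'                         # make it symmetric (density roughly doubles)
b = rand(n)

@btime $A * $b                     # "normal sparse" product
@btime Symmetric($A) * $b          # product through the Symmetric wrapper
@btime Symmetric(triu($A)) * $b    # triu-only storage
```

Interpolating with `$` avoids benchmarking global-variable access, so the timings reflect the kernels themselves.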
I just did a small test using MKL with … We could consider adding the check

```julia
all(1:size(A,2)) do j
    i = last(colptr[j]:colptr[j+1] - 1)
    return i > 0 ? rowval[i] < j : true
end
```

to determine if only the triangle is stored. It's an O(n) check before an O(nnz) operation, and a few examples suggest that it might be worth it, although it complicates the code a bit.
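For illustration, the check above can be wrapped into a helper. The name `is_triu` is mine, not from the PR, and I use `<= j` rather than `< j` so that stored diagonal entries are admitted too (since sorted column indices make the last stored row the largest, it suffices to test that one):

```julia
# Hypothetical helper around the proposed O(n) triangle check.
using SparseArrays

function is_triu(S::SparseMatrixCSC)
    colptr, rowval = S.colptr, S.rowval
    all(1:size(S, 2)) do j
        i = last(colptr[j]:colptr[j+1] - 1)
        return i > 0 ? rowval[i] <= j : true
    end
end

S = sprand(10, 10, 0.3)
is_triu(triu(S))   # true: only upper-triangle entries are stored
is_triu(S + S')    # false whenever some column's last entry lies below the diagonal
```

Because only one comparison is done per column, the check costs O(n) regardless of nnz, which is what makes the "check first, then use the fast kernel" strategy plausible.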
Option 3 in my gist above adds O(n) storage to basically recover the performance of general matrix-vector multiplication. The disadvantage is that a lot of methods involving sparse …
Perhaps the index vector in that option can be computed in …
Is there a plan for implementing more sophisticated algorithms for sparse-dense matrix multiplication?
You could, but it would be very inefficient to reconstruct the same array over and over each time you multiply. In iterative methods for linear systems, you may need tens or hundreds of products.
Variant 3 in the gist above, with O(n) extra storage, does lend itself to a "fused" implementation using …
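Purely as illustration of what an O(n) auxiliary index vector could look like (the actual Variant 3 in the gist may differ), one can precompute, per column, where the diagonal would sit, so the kernel can split each column without per-entry comparisons:

```julia
# Hypothetical helper: for each column j, record the first stored index whose
# row is >= j. Built once in O(nnz), kept in O(n) storage; the kernel can then
# treat colptr[j]:dptr[j]-1 as the strictly-upper entries of column j.
using SparseArrays

function diag_pointers(S::SparseMatrixCSC)
    dptr = Vector{Int}(undef, S.n)
    for j = 1:S.n
        dptr[j] = S.colptr[j+1]            # default: no entry at or below the diagonal
        for i = S.colptr[j]:S.colptr[j+1]-1
            if S.rowval[i] >= j
                dptr[j] = i
                break
            end
        end
    end
    dptr
end
```

Computing this once and reusing it across many matvecs is what avoids paying the index-test cost inside every product, which matters for Krylov methods.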
I wonder if it wouldn't make more sense for …

A problem might be that …

Sure, but what do we do if the test isn't satisfied? If …

Just fall back on the slower version that checks the indices. In most cases, users will provide a genuinely triangular matrix anyway, and we could document that if it isn't already then using …
(Tangentially, the above (and related past) discussion highlights that introducing a separate class of matrix annotations might be worthwhile at some future point: in some cases you need an annotation that asserts some property of the wrapped storage's contents (for example that some …
@dpo Would you be able to revisit this? It would be great to have.

@andreasnoack Sure thing. I'll try to get to it this week.

@andreasnoack Is this what you have in mind: https://gist.github.com/dpo/481b0c03dd08d26af342573df98ddc21#file-symmetric_matvec_v4-jl? Here's the expected behavior: the performance will be good if the user supplies a triangular matrix and poor if they don't.

I'd actually prefer the version that uses the current …

You don't want to do that each time you compute a product. It would completely defeat the savings of storing and using a sparse matrix. Krylov methods would grind to a halt.

That check is presumably very cheap in comparison to the actual product, no?
I just did some timings. I thought the check would be cheaper, so I now agree with @dpo's conclusion and we might need a different approach. I can see two solutions: …
Out of curiosity, how expensive is the check compared to the multiply?

About 50% for a tridiagonal and 35-40% for a pentadiagonal matvec.

In my view, that's the whole point of symmetric sparse matrices. More than just promising that the underlying matrix is symmetric (as in the dense case), you want to save storage and preserve efficient matvecs.
Ref. #17367 for discussion of mutating constructors (…).
Should be obsolete with #30018. Please comment if not.
Are people still open to …

Could you remind us, what would be the benefit? To get faster symmetric multiplication, you'd need the guarantee that the storage is triangular, right?

Right, …

So the only benefit would be storage, right? In that case, I think it would be more transparent to ask people to write …

AFAIU from the discussion above (e.g. #22200 (comment) and the following comments), the benefit would be that we could know that no entries were stored in the other triangle and not have to check it, which wouldn't be the case with …

...but then …

Yea. Maybe we could specialize on … Edit: Doesn't work because …
There is background information for this, including benchmarks of a draft version of this code, at https://discourse.julialang.org/t/slow-sparse-matrix-vector-product-with-symmetric-matrices. The more general implementation in this PR, which follows the code for the generic sparse matrix-vector product, is a bit slower than what's reported in the thread.

I'd like to gather some feedback and, assuming this is of interest, I'll be happy to implement what's missing and do the same for `Hermitian`. @andreasnoack