LinearAlgebra: improve type-inference in Symmetric/Hermitian matmul #54303

Merged
merged 15 commits into from
May 7, 2024
Use all(map(...)) instead of all_in
jishnub committed May 2, 2024
commit 541a76d6f24e943c98869a73559eedaf11bac9d6
34 changes: 14 additions & 20 deletions stdlib/LinearAlgebra/src/matmul.jl
@@ -360,23 +360,15 @@ julia> lmul!(F.Q, B)
 """
 lmul!(A, B)

-# unroll the in(a, b) computation to enable constant propagation
-# This is a 2-valued in implementation that doesn't account for missing values
-_in(t::AbstractChar, ::Tuple{}) = false
-function _in(t::AbstractChar, chars::Tuple{Vararg{AbstractChar}})
-    return t == first(chars) || _in(t, Base.tail(chars))
-end
-all_in(chars, (tA, tB)) = _in(tA, chars) && _in(tB, chars)
-
 # THE one big BLAS dispatch
 # aggressive constant propagation makes mul!(C, A, B) invoke gemm_wrapper! directly
 Base.@constprop :aggressive function generic_matmatmul!(C::StridedMatrix{T}, tA, tB, A::StridedVecOrMat{T}, B::StridedVecOrMat{T},
                                     _add::MulAddMul=MulAddMul()) where {T<:BlasFloat}
-    # if all(in(('N', 'T', 'C')), (tA, tB)), but we unroll the implementation to enable constprop
     # We convert the chars to uppercase to potentially unwrap a WrapperChar,
     # and extract the char corresponding to the wrapper type
     tA_uc, tB_uc = uppercase(tA), uppercase(tB)
-    if all_in(('N', 'T', 'C'), map(uppercase, (tA_uc, tB_uc)))
+    # the map in all ensures constprop by acting on tA and tB individually, instead of looping over them.
Member

If this is true, this should be a giant contribution to the reduction of compile times, right? If we land in this branch, then we don't need to compile symm and hemm, or in the other case syrk/herk/gemm_wrapper.

Contributor Author

There is indeed some compile-time improvement, although it's not dramatic.
Each timing below was measured in a separate session:

julia> A = rand(2,2); B = rand(2,2); C = zeros(2,2);

julia> @time mul!(C, A, B);
  0.847057 seconds (3.39 M allocations: 171.963 MiB, 24.97% gc time, 100.00% compilation time) # nightly
  0.757433 seconds (3.94 M allocations: 202.922 MiB, 4.65% gc time, 100.00% compilation time) # This PR

julia> A = rand(2,2); B = Symmetric(rand(2,2)); C = zeros(2,2);

julia> @time mul!(C, A, B);
  1.098831 seconds (3.68 M allocations: 189.159 MiB, 24.52% gc time, 99.99% compilation time) # nightly
  0.687847 seconds (4.72 M allocations: 238.864 MiB, 7.04% gc time, 99.99% compilation time) # This PR

Descending into generic_matmatmul! with Cthulhu does seem to indicate that unused branches are eliminated: in the first case only gemm_wrapper! is compiled, and in the second only BLAS.symm! is compiled.
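
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch that separates the first call (which includes compilation in a fresh session) from a second call. The variable names `t_first` and `t_second` are only illustrative, and the numbers will vary by machine and Julia version.

```julia
using LinearAlgebra

A = rand(2, 2); B = Symmetric(rand(2, 2)); C = zeros(2, 2)

# In a fresh session, the first call pays for compilation of the mul! path.
t_first = @elapsed mul!(C, A, B)

# The second call reuses the compiled code, so it reflects run time only.
t_second = @elapsed mul!(C, A, B)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Running this once per session on each build gives figures comparable to the `@time` output above.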

Contributor Author

@jishnub jishnub May 7, 2024

The code_typed for the first case (gemm) is identical between this PR and nightly:

julia> A = rand(2,2); B = rand(2,2); C = zeros(2,2);

julia> @code_typed mul!(C, A, B)
CodeInfo(
1 ─ %1 = invoke LinearAlgebra.gemm_wrapper!(C::Matrix{Float64}, 'N'::Char, 'N'::Char, A::Matrix{Float64}, B::Matrix{Float64}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{Float64}
└──      return %1
) => Matrix{Float64}

I'm not certain why there's a compile-time improvement here (perhaps noise?). In this case, the all is already being folded, despite the loop over the characters. I suspect the loop is being unrolled entirely, as the characters are all Chars that are fully known at compile time.
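
One way to check this kind of folding on a toy case is to inspect the typed code of a function whose flags are literal Chars. This is only a sketch: `check_literals` is a hypothetical name, not something in LinearAlgebra, and what the printed IR looks like depends on the Julia version.

```julia
# Hypothetical toy function: both flags are literal Chars known at compile time.
check_literals() = all(in(('N', 'T', 'C')), ('N', 'N'))

# code_typed returns a vector of `CodeInfo => return type` pairs,
# one per matching method; there is exactly one here.
ci, rt = only(code_typed(check_literals, Tuple{}))

# Print the typed IR to see whether the body reduces to a constant.
println(ci)
println(rt)
```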

The second case (symm) is where the major improvement comes in:

julia> A = rand(2,2); B = Symmetric(rand(2,2)); C = zeros(2,2);

julia> @code_typed mul!(C, A, B)
CodeInfo(
1 ── %1  = Base.getfield(B, :uplo)::Char
│    %2  = Base.bitcast(Base.UInt32, %1)::UInt32
│    %3  = Base.bitcast(Base.UInt32, 'U')::UInt32
│    %4  = (%2 === %3)::Bool
│    %5  = Base.getfield(B, :data)::Matrix{Float64}
└───       goto #3 if not %4
2 ──       goto #4
3 ──       goto #4
4 ┄─ %9  = φ (#2 => 'S', #3 => 's')::Char
│    %10 = Base.bitcast(Base.UInt32, %9)::UInt32
│    %11 = Base.bitcast(Base.UInt32, 'S')::UInt32
│    %12 = (%10 === %11)::Bool
└───       goto #5
5 ──       goto #7 if not %12
6 ──       goto #8
7 ──       nothing::Nothing
8 ┄─ %17 = φ (#6 => 'U', #7 => 'L')::Char
│    %18 = invoke LinearAlgebra.BLAS.symm!('R'::Char, %17::Char, 1.0::Float64, %5::Matrix{Float64}, A::Matrix{Float64}, 0.0::Float64, C::Matrix{Float64})::Matrix{Float64}
└───       goto #9
9 ──       goto #10
10 ─       goto #11
11 ─       return %18
) => Matrix{Float64}

The BLAS.symm! branch that is being followed is "inlined" now. This is the case where the loop is not ordinarily unrolled, but using the all(map(...)) combination permits constant propagation.
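
As a small sanity check that the two spellings agree on the result, here is a sketch comparing the predicate form of `all` with the `all(map(...))` form on every pair of flags. The names `BLAS_FLAGS`, `loop_form` and `map_form` are hypothetical and only for illustration; the sketch checks values, not what inference does with them.

```julia
# Hypothetical names for illustration; not part of LinearAlgebra.
const BLAS_FLAGS = ('N', 'T', 'C')

# Predicate form: `all` iterates over the tuple (tA, tB) and tests each element.
loop_form(tA, tB) = all(in(BLAS_FLAGS), (tA, tB))

# Form used in this commit: `map` applies the test to tA and tB individually,
# producing a tuple of Bools that `all` then reduces.
map_form(tA, tB) = all(map(in(BLAS_FLAGS), (tA, tB)))

# Both forms should return the same Bool for every flag combination.
for tA in ('N', 'T', 'C', 'S', 'H'), tB in ('N', 'T', 'C', 'S', 'H')
    @assert loop_form(tA, tB) == map_form(tA, tB)
end
```

Comparing `@code_typed` for callers of each form is then a way to see whether the branch is folded on a given Julia version.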

+    if all(map(in(('N', 'T', 'C')), (tA_uc, tB_uc)))
         if tA_uc == 'T' && tB_uc == 'N' && A === B
             return syrk_wrapper!(C, 'T', A, _add)
         elseif tA_uc == 'N' && tB_uc == 'T' && A === B
@@ -407,10 +399,11 @@ end
 # Complex matrix times (transposed) real matrix. Reinterpret the first matrix to real for efficiency.
 Base.@constprop :aggressive function generic_matmatmul!(C::StridedVecOrMat{Complex{T}}, tA, tB, A::StridedVecOrMat{Complex{T}}, B::StridedVecOrMat{T},
                                     _add::MulAddMul=MulAddMul()) where {T<:BlasReal}
-    # if all(in(('N', 'T', 'C')), (tA, tB)), but we unroll the implementation to enable constprop
     # We convert the chars to uppercase to potentially unwrap a WrapperChar,
     # and extract the char corresponding to the wrapper type
-    if all_in(('N', 'T', 'C'), map(uppercase, (tA, tB)))
+    tA_uc, tB_uc = uppercase(tA), uppercase(tB)
+    # the map in all ensures constprop by acting on tA and tB individually, instead of looping over them.
+    if all(map(in(('N', 'T', 'C')), (tA_uc, tB_uc)))
         gemm_wrapper!(C, tA, tB, A, B, _add)
     else
         _generic_matmatmul!(C, wrap(A, tA), wrap(B, tB), _add)
@@ -453,15 +446,15 @@ Base.@constprop :aggressive function gemv!(y::StridedVector{T}, tA::AbstractChar
     if alpha isa Union{Bool,T} && beta isa Union{Bool,T} &&
         stride(A, 1) == 1 && abs(stride(A, 2)) >= size(A, 1) &&
         !iszero(stride(x, 1)) && # We only check input's stride here.
-        if _in(tA_uc, ('N', 'T', 'C'))
+        if tA_uc in ('N', 'T', 'C')
             return BLAS.gemv!(tA, alpha, A, x, beta, y)
         elseif tA_uc == 'S'
             return BLAS.symv!(tA == 'S' ? 'U' : 'L', alpha, A, x, beta, y)
         elseif tA_uc == 'H'
             return BLAS.hemv!(tA == 'H' ? 'U' : 'L', alpha, A, x, beta, y)
         end
     end
-    if _in(tA_uc, ('S', 'H'))
+    if tA_uc in ('S', 'H')
         # re-wrap again and use plain ('N') matvec mul algorithm,
         # because _generic_matvecmul! can't handle the HermOrSym cases specifically
         return _generic_matvecmul!(y, 'N', wrap(A, tA), x, MulAddMul(α, β))
@@ -488,7 +481,7 @@ Base.@constprop :aggressive function gemv!(y::StridedVector{Complex{T}}, tA::Abs
         BLAS.gemv!(tA, alpha, reinterpret(T, A), x, beta, reinterpret(T, y))
         return y
     else
-        Anew, ta = _in(tA_uc, ('S', 'H')) ? (wrap(A, tA), oftype(tA, 'N')) : (A, tA)
+        Anew, ta = tA_uc in ('S', 'H') ? (wrap(A, tA), oftype(tA, 'N')) : (A, tA)
         return _generic_matvecmul!(y, ta, Anew, x, MulAddMul(α, β))
     end
 end
@@ -507,13 +500,13 @@ Base.@constprop :aggressive function gemv!(y::StridedVector{Complex{T}}, tA::Abs
     tA_uc = uppercase(tA) # potentially convert a WrapperChar to a Char
     @views if alpha isa Union{Bool,T} && beta isa Union{Bool,T} &&
         stride(A, 1) == 1 && abs(stride(A, 2)) >= size(A, 1) &&
-        !iszero(stride(x, 1)) && _in(tA_uc, ('N', 'T', 'C'))
+        !iszero(stride(x, 1)) && tA_uc in ('N', 'T', 'C')
         xfl = reinterpret(reshape, T, x) # Use reshape here.
         yfl = reinterpret(reshape, T, y)
         BLAS.gemv!(tA, alpha, A, xfl[1, :], beta, yfl[1, :])
         BLAS.gemv!(tA, alpha, A, xfl[2, :], beta, yfl[2, :])
         return y
-    elseif _in(tA_uc, ('S', 'H'))
+    elseif tA_uc in ('S', 'H')
         # re-wrap again and use plain ('N') matvec mul algorithm,
         # because _generic_matvecmul! can't handle the HermOrSym cases specifically
         return _generic_matvecmul!(y, 'N', wrap(A, tA), x, MulAddMul(α, β))
@@ -613,10 +606,11 @@ Base.@constprop :aggressive function gemm_wrapper(tA::AbstractChar, tB::Abstract
     mA, nA = lapack_size(tA, A)
     mB, nB = lapack_size(tB, B)
     C = similar(B, T, mA, nB)
-    # if all(in(('N', 'T', 'C')), (tA, tB)), but we unroll the implementation to enable constprop
     # We convert the chars to uppercase to potentially unwrap a WrapperChar,
     # and extract the char corresponding to the wrapper type
-    if all_in(('N', 'T', 'C'), map(uppercase, (tA, tB)))
+    tA_uc, tB_uc = uppercase(tA), uppercase(tB)
+    # the map in all ensures constprop by acting on tA and tB individually, instead of looping over them.
+    if all(map(in(('N', 'T', 'C')), (tA_uc, tB_uc)))
         gemm_wrapper!(C, tA, tB, A, B)
     else
         _generic_matmatmul!(C, wrap(A, tA), wrap(B, tB), _add)
@@ -789,7 +783,7 @@ end
 @inline function generic_matvecmul!(C::AbstractVector, tA, A::AbstractVecOrMat, B::AbstractVector,
                                     _add::MulAddMul = MulAddMul())
     tA_uc = uppercase(tA) # potentially convert a WrapperChar to a Char
-    Anew, ta = _in(tA_uc, ('S', 'H')) ? (wrap(A, tA), oftype(tA, 'N')) : (A, tA)
+    Anew, ta = tA_uc in ('S', 'H') ? (wrap(A, tA), oftype(tA, 'N')) : (A, tA)
     return _generic_matvecmul!(C, ta, Anew, B, _add)
 end
