Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory corruption in 0.9.3 #183

Closed
kbarros opened this issue Jan 23, 2024 · 12 comments · Fixed by #184
Closed

Memory corruption in 0.9.3 #183

kbarros opened this issue Jan 23, 2024 · 12 comments · Fixed by #184
Assignees
Labels
bug Something isn't working

Comments

@kbarros
Copy link
Contributor

kbarros commented Jan 23, 2024

We have noticed intermittent crashes since upgrading from Spglib 0.9.2 to 0.9.3. Looks like the only real difference between these versions is the switch from enums to SumTypes: 819b7f6.

It appears that there is memory corruption when reading an spglib_jll response into a sum type, and this has been confirmed @MasonProtter on the SumTypes repo: MasonProtter/SumTypes.jl#71

This isn't a bug with SumTypes, it's just being used in a completely invalid way by that library. It's asserting to julia that it can interpret the return type from a ccall as a sum type, which encloses a union, so yeah that's almost guaranteed to cause memory corruption.

The error could appear as a segfault or some other error. First found on our Github actions CI with Julia 1.9.4, Ubuntu, x86: https://github.com/SunnySuite/Sunny.jl/actions/runs/7629267203/job/20782289111?pr=217

I haven't seen the crashes yet on my Mac. On Linux/x86, reproducing might be as simple as ] add Sunny#spglib_crash and then using Sunny.

@kbarros kbarros added the bug Something isn't working label Jan 23, 2024
@MasonProtter
Copy link

To explain a bit more what I meant by it being invalid, this line:

get_error_code() = @ccall libsymspg.spg_get_error_code()::SpglibReturnCode

is taking something (presumably an integer?) emitted from a C library and then trying to interpret it as an SpglibReturnCode. But the implementation of SpglibReturnCode does not contain an integer, it actually contains a union, it basically looks like this:

struct SpglibReturnCode
    data::Union{Variant{:SUCCESS, (), Tuple{}}, Variant{:SPACEGROUP_SEARCH_FAILED, (), Tuple{}},  ...}
end

where each of those Variants are singleton structs. So you can't just take an integer and turn it into one of these. What you could do is something like

function SpglibReturnCode(i::Int)
    if i == 0
        SUCCESS
    elseif i == 1
        SPACEGROUP_SEARCH_FAILED
    elseif ...
        ...
     else
        error("invalid return code")
    end
end

and then you'd have a sum type you can work with I guess.

@kbarros
Copy link
Contributor Author

kbarros commented Jan 24, 2024

Many thanks Mason, and also @singularitti for providing this very useful package!

@singularitti
Copy link
Owner

Thank you for reporting this issue @kbarros and thank you for your guidance @MasonProtter! I probably misunderstood how to use your package. However, I am not sure what the bug is: the code actually work for my tests. Could you shed light on me what specific example is breaking? @kbarros

@MasonProtter
Copy link

Yeah, that's because julia appears to use an Int under the hood in its implementation of union types, but that's not a stable thing and it probably is causing undefined behaviour:

julia> let r = Ref(0)
           p::Ptr{SpglibReturnCode} = pointer_from_objref(r)
           unsafe_load(p)
       end
SUCCESS::SpglibReturnCode

julia> let r = Ref(1)
           p::Ptr{SpglibReturnCode} = pointer_from_objref(r)
           unsafe_load(p)
       end
SPACEGROUP_SEARCH_FAILED::SpglibReturnCode

julia> let r = Ref(2)
           p::Ptr{SpglibReturnCode} = pointer_from_objref(r)
           unsafe_load(p)
       end
CELL_STANDARDIZATION_FAILED::SpglibReturnCode

julia> let r = Ref(3)
           p::Ptr{SpglibReturnCode} = pointer_from_objref(r)
           unsafe_load(p)
       end
SYMMETRY_OPERATION_SEARCH_FAILED::SpglibReturnCode

This works locally, but it's definitely possible for this to cause problems in unexpected ways

@CedricTravelletti
Copy link

CedricTravelletti commented Jan 24, 2024

We at DFTK.jl are also noticing crashes since upgrading Spglib to 0.9.3 (see JuliaMolSim/DFTK.jl#947 (comment)). So the issues is not specific to @kbarros 's example.

@kbarros
Copy link
Contributor Author

kbarros commented Jan 24, 2024

@singularitti, I have reached out on the Julia Slack server. Id be happy to work together with you to resolve this, since it might be affecting quite a lot of users.

@kbarros
Copy link
Contributor Author

kbarros commented Jan 24, 2024

This fork has contains the fix proposed by @MasonProtter: https://github.com/kbarros/Spglib.jl. Specifically, here is the revised function:

function get_error_code()
    # The return value for the C function `spg_get_error_code` is a C enum
    # called SpglibError. The C compiler guarantees that enums are internally
    # stored as type `int`; its size is "implementation dependent". Typically 4
    # or 8 bytes, but this could depend on the architecture and compiler details
    # https://stackoverflow.com/a/366033/500314. Julia provides `CInt` to
    # guarantee an integer that has the same size as the C `int` value. Manually
    # map this enum tag to variants of the `@enum SpglibReturnCode`. The latter
    # is defined as a Union using the Julia SumTypes library, and its
    # representation details are also implementation dependent (e.g., the "tag"
    # of this Union need not have the same size as a `CInt`). Note that a
    # mistake here could cause memory corruption, as discussed in
    # https://github.com/singularitti/Spglib.jl/issues/183.
    code = @ccall libsymspg.spg_get_error_code()::CInt
    if code == 0
        return SUCCESS
    elseif code == 1
        return SPACEGROUP_SEARCH_FAILED
    elseif code == 2
        return CELL_STANDARDIZATION_FAILED
    elseif code == 3
        return SYMMETRY_OPERATION_SEARCH_FAILED
    elseif code == 4
        return ATOMS_TOO_CLOSE
    elseif code == 5
        return POINTGROUP_NOT_FOUND
    elseif code == 6
        return NIGGLI_FAILED
    elseif code == 7
        return DELAUNAY_FAILED
    elseif code == 8
        return ARRAY_SIZE_SHORTAGE
    elseif code == 9
        return NONE
    end
end

CI was already not passing, but I verified that this change does not cause any new test failures. I can't reproduce the crash on my Mac.

@kbarros
Copy link
Contributor Author

kbarros commented Jan 24, 2024

It's not ready yet, I need to update other places where SpglibReturnCode is used.

@MasonProtter
Copy link

Given that this package is only using these return codes to pass values back and forth from a C library, it seems better to just revert the change and use @enum, SumTypes doesn't appear to be adding any value

@kbarros
Copy link
Contributor Author

kbarros commented Jan 24, 2024

Agreed. Updated my fork to remove SumTypes and label return types explicitly. Please see: https://github.com/kbarros/Spglib.jl/blob/main/src/error.jl

Note that Spglib <= v0.9.2 also seemed to be dependent on undefined behavior (casting Cint into a Julia @enum). I think the correct thing is to explicitly mark the C enum as a Cint.

@singularitti
Copy link
Owner

singularitti commented Jan 24, 2024

Thank you for your help @kbarros @MasonProtter! And I apologize all the trouble this bug have caused @CedricTravelletti. The only reason I want to use SumTypes.jl is that I want to switch to a mechanism where SpglibReturnCode is not dependent on number enumeration, but it seems that it is not what I have imagined. I will just revert it back or use @kbarros's PR immediately.

@kbarros
Copy link
Contributor Author

kbarros commented Jan 24, 2024

Testing the fork is tricky because the memory corruption only causes problems on certain machines (not my Mac, for example). We have just now tested the fork on a colleague's machine and verified that it seems to work. I will open a pull request. (Now opened: #184).

@CedricTravelletti would you like to test the fork before merging?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants