
Internals MoarVM Error messages are LTA #1850

Open · 2 tasks · timo opened this issue Oct 24, 2024 · 4 comments

Comments

@timo (Member) commented Oct 24, 2024

For example, #1849

When you corrupt MoarVM's memory by doing thread-unsafe things, you get one of a variety of "MoarVM panic: Internal ..." messages, which give you zero clue about what might actually be going wrong.

In this issue I would like to collect ideas for improving error messages.

  • [easy] add a short hint to these error messages that you need to check your code for thread-unsafe things
  • [easy] if running under valgrind memcheck, see if a valgrind client request can give useful information about some of the pointers involved (a rough C sketch follows this list)
    • see if the "search for pointers at this address" facility (memcheck's who_points_at monitor command) is available from a client request and actually useful here
    • see if we can do the same thing with ASan, even though having to rebuild moar with ASan is a bit of a hassle
  • [easy] does outputting more information about the object we're looking at at the moment of the crash help at all?
    • at the very least we should be able to output the debugname, if the STable (st) pointer is valid (though if it's bogus, trying to access it will cause a segfault)
  • [medium] [work-intensive] write up a bunch of documentation for crash hunting
    • valgrind or asan for finding bogus memory accesses
    • investigate how well Helgrind and TSan play with moar; last I checked, there were boatloads of false positives from our lock-free code
    • look again at MVM_CROSS_THREAD_WRITE_LOG; has it bit-rotted? Does it generally produce many false positives?
    • rr for reliable reproduction and sharing of crash recordings, and the incredible power of "watchpoint and reverse-continue". but knowing what exactly to watchpoint is the tricky bit.
  • [medium] especially for "invalid thread ID in work pass" and similar, can we use the heap snapshot machinery to find out what exactly we are looking at?
    • with work items we only get a pointer into the inside of an object, or worse, into a blob of memory that lives somewhere, for example an MVMArray's storage.
    • But with even a partial heap snapshot, we should at least get the object that "owns" the pointer that the work item relates to.
    • With bad luck, the object we're looking at will already have some unpredictable nonsense in it by that point.
  • [medium] investigate creating a compile-time #define to turn on extra checks that make moar slower, but can find problems?
  • [medium/hard] since a core dump already contains all of memory, why not write some human-readable information to a well-known / easily discoverable memory location when we reach a crash? "blackbox style" (roughly sketched after this list)
    • Can we interrupt all threads and get a MVM_dump_backtrace for all of them?
    • Can we dump bytecode, locals, and lexicals as well?
    • the functionality of the debugserver to tell all threads to stop should be usable here, but if we crash inside the garbage collector for example, there's not yet any way to tell all threads to stop what they're doing ASAP
      • if one thread crashes, it's not unlikely that the others will soon find something to crash over as well
      • we don't want multiple threads to fight over who gets to "crash properly" and get in each other's way
  • [hard] investigate what we could do when a segfault happens by handling SIGSEGV in a signal handler (the blackbox sketch after the list touches on this)
    • famously, a bunch of stdlib functions are not safe to use in a signal handler, especially memory allocations.
    • depending on what caused the segfault, and whether we are in the middle of something, looking at moar's internal structures or at objects can be problematic
    • a mode that does extra work on segv may want to be opt-in, by env var or compile time flag
    • is it better to just leave all of these tasks to a gdb session that we instruct the user to attach?
      • annoying if the problem is difficult to reproduce (a long time after start, only once in fifty runs, etc.)
    • [research task] check how other language VMs and related software react to segfaults
  • [hard,fun?] when something goes wrong in a way that leaves us with one or a few objects of particular interest (such as a hash access through a stale pointer, for example "moarvm crash on parallel usage of http useragent" rakudo/rakudo#5683), we could look for names that refer to the given object
    • WIP / experimental pull request with a possible implementation: "when a hash oopses, heap snapshot and look for refs" #1862
    • in the example, the content of the lexical variable %rfc_grammar in the current thread's cur_frame -> outer -> outer would be one interesting name (though reporting just %rfc_grammar would suffice, or even be better)
    • if we can stop all other threads quickly enough, then the other thread that was concurrently updating the hash in the example would also still have a frame on its stack that has that lexical available
      • though just having the lexical in scope is only a weak signal for problematic concurrent access, since any stack with the parse or subparse methods from that module in it would also report that
        • so if we want to go the extra mile, maybe inspect the frame's locals? inspect the frame's code for getlex/bindlex?
    • this vaguely mirrors the idea above related to using heap snapshots to figure out what might cause "invalid thread id in work item pass"
      • duplicating the work of heap snapshot creation for a special-purpose "identify an object" feature would likely be a bad idea. making "only output the most relevant paths to the object" be a kind of filter on all paths to the given object may be the way to go.
    • is it desirable to do a lot of work in case of a crash?
      • how long does it take to record a heap snapshot, depending on the number of objects in use?
      • if we leave out a lot of the permanent roots and such, can we reduce the heap down to a more manageable portion?
        • can we add roots one at a time until the object we're interested in has shown up for the first time, then stop?
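
As a rough illustration of the valgrind client-request idea above, here is a minimal C sketch of a helper that could run at the panic site. The helper name and its hypothetical call from the panic path are invented; RUNNING_ON_VALGRIND, VALGRIND_MONITOR_COMMAND, the memcheck check macros, and the who_points_at monitor command are existing valgrind facilities, and the client requests cost almost nothing when not running under valgrind.

```c
/* Hypothetical helper, not part of MoarVM: ask memcheck what it knows
 * about a suspicious pointer right before we panic. */
#include <stdio.h>
#include <valgrind/valgrind.h>
#include <valgrind/memcheck.h>

static void describe_suspect_pointer(const void *p, size_t len) {
    if (!RUNNING_ON_VALGRIND)
        return;

    /* Mark the panic site in the valgrind log. */
    VALGRIND_PRINTF("MoarVM panic: inspecting suspect pointer %p\n", p);

    /* Emit a memcheck error (with backtrace and block description)
     * if the range is unaddressable or undefined. */
    VALGRIND_CHECK_MEM_IS_ADDRESSABLE(p, len);
    VALGRIND_CHECK_MEM_IS_DEFINED(p, len);

    /* Run memcheck's "who_points_at" monitor command to search for
     * other pointers to this address, which often identifies the
     * owning block and its allocation stack. */
    char cmd[64];
    snprintf(cmd, sizeof cmd, "who_points_at %p %zu", p, len);
    VALGRIND_MONITOR_COMMAND(cmd);
}
```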

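A similarly rough sketch of the "blackbox" and SIGSEGV-handler ideas, again with entirely invented names (MoarVM has no such blackbox buffer today). It deliberately sidesteps the hard parts listed above: stopping the other threads, dumping bytecode/locals/lexicals, and deciding how much work is safe, given that only a small set of functions is async-signal-safe.

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical blackbox: a static buffer with a recognisable magic
 * prefix, easy to find in a core dump with a debugger or with
 * `strings core | grep MVMBLACKBOX`. */
static char blackbox[4096] = "MVMBLACKBOX: ";

static void record_crash_note(const char *note) {
    /* Plain byte copies only: nothing in here may allocate. */
    size_t used = strlen(blackbox);
    size_t room = sizeof blackbox - used - 1;
    size_t len  = strlen(note);
    if (len > room)
        len = room;
    memcpy(blackbox + used, note, len);
    blackbox[used + len] = '\0';
}

static void segv_handler(int sig) {
    static const char msg[] =
        "MoarVM panic: segmentation fault; this often means memory was\n"
        "corrupted, e.g. by unsynchronised access from multiple threads.\n";
    /* write() is async-signal-safe; fprintf and malloc are not. */
    write(STDERR_FILENO, msg, sizeof msg - 1);
    record_crash_note("SIGSEGV\n");
    /* Restore the default action and re-raise, so a normal core dump
     * (containing the blackbox buffer) is still produced. */
    signal(sig, SIG_DFL);
    raise(sig);
}

static void install_crash_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);
}
```
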
Tasks that should be done:

@dontlaugh commented Nov 11, 2024

Clang has a couple of annotations that can be added to C code to say, "This variable must be guarded by this mutex". This is compile-time static analysis; they call it Thread Safety Analysis.

Thread safety analysis works very much like a type system for multi-threaded programs. In addition to declaring the type of data (e.g. int, float, etc.), the programmer can (optionally) declare how access to that data is controlled in a multi-threaded environment. For example, if foo is guarded by the mutex mu, then the analysis will issue a warning whenever a piece of code reads or writes to foo without first locking mu. Similarly, if there are particular routines that should only be called by the GUI thread, then the analysis will warn if other threads call those routines.

Whether this is immediately useful to MoarVM depends on

  • Whether these annotations will work for the libuv mutex types, which are widely used
  • Whether we need bleeding edge LLVM for our code.
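
To make this concrete, here is a minimal sketch of the annotations in plain C, wrapping a pthread mutex and compiled with `clang -Wthread-safety`. All names are invented for illustration; whether the same wrapper approach is practical around libuv's uv_mutex_t is exactly the first open question above.

```c
#include <pthread.h>

/* Macro spellings follow the Clang Thread Safety Analysis docs. */
#define CAPABILITY(x) __attribute__((capability(x)))
#define GUARDED_BY(x) __attribute__((guarded_by(x)))
#define REQUIRES(x)   __attribute__((requires_capability(x)))
#define ACQUIRE(x)    __attribute__((acquire_capability(x)))
#define RELEASE(x)    __attribute__((release_capability(x)))

/* The analysis tracks "capabilities"; a wrapper struct turns a plain
 * mutex into one. */
typedef struct CAPABILITY("mutex") Mutex {
    pthread_mutex_t m;
} Mutex;

static void mutex_lock(Mutex *mu) ACQUIRE(mu);
static void mutex_unlock(Mutex *mu) RELEASE(mu);
static void mutex_lock(Mutex *mu)   { pthread_mutex_lock(&mu->m); }
static void mutex_unlock(Mutex *mu) { pthread_mutex_unlock(&mu->m); }

static Mutex counter_lock = { PTHREAD_MUTEX_INITIALIZER };
static int guarded_counter GUARDED_BY(counter_lock);

/* Clang warns at compile time if this is reached without the lock. */
static void bump_counter(void) REQUIRES(counter_lock);
static void bump_counter(void) { guarded_counter++; }

static void caller(void) {
    mutex_lock(&counter_lock);
    bump_counter();              /* fine: counter_lock is held */
    mutex_unlock(&counter_lock);
}
```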

@dontlaugh

> investigate how well Helgrind and TSan play with moar; last I checked, there were boatloads of false positives from our lock-free code

I wasn't aware of that, but Clang has an annotation to suppress these reports: https://clang.llvm.org/docs/ThreadSanitizer.html#attribute-no-sanitize-thread
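
For illustration, the attribute from that page looks like this in C; the function is an invented example of a deliberately unsynchronised read, not MoarVM code:

```c
/* Disables ThreadSanitizer instrumentation for this one function, so
 * TSan stops reporting the intentional race inside it. It also hides
 * any real races in the function, so it should be applied sparingly,
 * only to code that is lock-free by design. */
__attribute__((no_sanitize("thread")))
static int peek_flag_unsynchronised(const int *flag) {
    return *flag;   /* deliberately racy fast-path read */
}
```

TSan already understands C11/__atomic operations, so where a lock-free path can be expressed with explicit atomics, that tends to reduce false positives without blanket suppressions.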


I think runtime tools will be essential, but perhaps focusing on refactorings that support static analysis will yield good results in the long run, for both static and runtime analysis.

@timo (Member, Author) commented Nov 15, 2024

I've added a section to the end of the original post on identifying objects by paths from roots, using the heap snapshot machinery. It was inspired by a recent Rakudo issue report that also provides a useful real-world test case for triggering concurrent hash writes.

@timo (Member, Author) commented Nov 15, 2024

I ran the example code from that same issue with CROSS_THREAD_WRITE_LOG and got these results: https://gist.github.com/timo/a938a7d4fb9f07bd008b3ba33e7b3f73. Verdict: at the moment it's extremely spammy, with false positives from Lock::Async (likely caused by the internals of supply and react blocks), and we want to find a way to not report assignments to a different thread pool worker's attributes. Maybe by changing the owning thread of the object to the worker's thread?
