
Internals MoarVM Error messages are LTA #1850

Open · 2 tasks · timo opened this issue Oct 24, 2024 · 4 comments

Comments

@timo (Member) commented Oct 24, 2024

For example, #1849

When you corrupt MoarVM's memory by doing thread-unsafe things, you get one of a variety of "MoarVM panic: Internal ..." messages, which give you zero clue about what might actually be going wrong.

In this issue I would like to collect ideas for improving error messages.

  • [easy] add a short hint to these error messages that you need to check your code for thread-unsafe things
  • [easy] if running under valgrind memcheck, see if a valgrind client request can give useful information about some of the pointers involved (a rough C sketch follows this list)
    • see if the "search for pointers at this address" facility (memcheck's who_points_at monitor command) is available from a client request and actually useful here
    • see if we can do the same thing with ASan, even though having to rebuild moar with ASan is a bit of a hassle
  • [easy] does outputting more information about the object we're looking at at the moment of the crash help at all?
    • at the very least we should be able to output the debugname, if the STable (st) pointer is valid (though if it's bogus, trying to access it will cause a segfault)
  • [medium] [work-intensive] write up a bunch of documentation for crash hunting
    • valgrind or asan for finding bogus memory accesses
    • investigate how well Helgrind and TSan play with moar; last I checked, there were boatloads of false positives from our lock-free code
    • look again at MVM_CROSS_THREAD_WRITE_LOG; has it bit-rotted? Does it generally produce many false positives?
    • rr for reliable reproduction and sharing of crash recordings, and the incredible power of "watchpoint and reverse-continue". but knowing what exactly to watchpoint is the tricky bit.
  • [medium] especially for "invalid thread ID in work pass" and similar, can we use the heap snapshot machinery to find out what exactly we are looking at?
    • with work items we only get a pointer into the inside of an object, or worse, into a blob of memory that lives somewhere, for example an MVMArray's storage.
    • But with even a partial heap snapshot, we should at least get the object that "owns" the pointer that the work item relates to.
    • With bad luck, the object we're looking at will already have some unpredictable nonsense in it by that point.
  • [medium] investigate creating a compile-time #define to turn on extra checks that make moar slower, but can find problems?
  • [medium/hard] since a core dump already contains all of memory, why not write some human-readable information to a well-known / easily discoverable memory location when we reach a crash? "blackbox style" (roughly sketched after this list)
    • Can we interrupt all threads and get a MVM_dump_backtrace for all of them?
    • Can we dump bytecode, locals, and lexicals as well?
    • the functionality of the debugserver to tell all threads to stop should be usable here, but if we crash inside the garbage collector for example, there's not yet any way to tell all threads to stop what they're doing ASAP
      • if one thread crashes, it's not unlikely that the others will soon find something to crash over as well
      • we don't want multiple threads to fight over who gets to "crash properly" and get in each other's way
  • [hard] investigate what we could do when a segfault happens by handling SIGSEGV in a signal handler (the blackbox sketch after the list touches on this)
    • famously, a bunch of stdlib functions are not safe to use in a signal handler, especially memory allocations.
    • depending on what caused the segfault, and whether we are in the middle of something, looking at moar's internal structures or at objects can be problematic
    • a mode that does extra work on segv may want to be opt-in, by env var or compile time flag
    • is it better to just leave all of these tasks to a gdb session that we instruct the user to attach?
      • annoying if the problem is difficult to reproduce (a long time after start, only once in fifty runs, etc.)
    • [research task] check how other language VMs and related software react to segfaults
  • [hard,fun?] when something goes wrong in a way that leaves us with one or a few objects of particular interest (such as a hash access through a stale pointer, for example "moarvm crash on parallel usage of http useragent" rakudo/rakudo#5683), we could look for names that refer to the given object
    • WIP / experimental pull request with a possible implementation: "when a hash oopses, heap snapshot and look for refs" #1862
    • in the example, the content of the lexical variable %rfc_grammar in the current thread's cur_frame -> outer -> outer would be one interesting name (though reporting just %rfc_grammar would suffice, or even be better)
    • if we can stop all other threads quickly enough, then the other thread that was concurrently updating the hash in the example would also still have a frame on its stack that has that lexical available
      • though just having the lexical in scope is only a weak signal for problematic concurrent access, since any stack with the parse or subparse methods from that module in it would also report that
        • so if we want to go the extra mile, maybe inspect the frame's locals? inspect the frame's code for getlex/bindlex?
    • this vaguely mirrors the idea above related to using heap snapshots to figure out what might cause "invalid thread id in work item pass"
      • duplicating the work of heap snapshot creation for a special-purpose "identify an object" feature would likely be a bad idea. making "only output the most relevant paths to the object" be a kind of filter on all paths to the given object may be the way to go.
    • is it desirable to do a lot of work in case of a crash?
      • how long does it take to record a heap snapshot, depending on the number of objects in use?
      • if we leave out a lot of the permanent roots and such, can we reduce the heap down to a more manageable portion?
        • can we add roots one at a time until the object we're interested in has shown up for the first time, then stop?
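
As a rough illustration of the valgrind client-request idea above, here is a minimal C sketch of a helper that could run at the panic site. The helper name and its hypothetical call from the panic path are invented; RUNNING_ON_VALGRIND, VALGRIND_MONITOR_COMMAND, the memcheck check macros, and the who_points_at monitor command are existing valgrind facilities, and the client requests cost almost nothing when not running under valgrind.

```c
/* Hypothetical helper, not part of MoarVM: ask memcheck what it knows
 * about a suspicious pointer right before we panic. */
#include <stdio.h>
#include <valgrind/valgrind.h>
#include <valgrind/memcheck.h>

static void describe_suspect_pointer(const void *p, size_t len) {
    if (!RUNNING_ON_VALGRIND)
        return;

    /* Mark the panic site in the valgrind log. */
    VALGRIND_PRINTF("MoarVM panic: inspecting suspect pointer %p\n", p);

    /* Emit a memcheck error (with backtrace and block description)
     * if the range is unaddressable or undefined. */
    VALGRIND_CHECK_MEM_IS_ADDRESSABLE(p, len);
    VALGRIND_CHECK_MEM_IS_DEFINED(p, len);

    /* Run memcheck's "who_points_at" monitor command to search for
     * other pointers to this address, which often identifies the
     * owning block and its allocation stack. */
    char cmd[64];
    snprintf(cmd, sizeof cmd, "who_points_at %p %zu", p, len);
    VALGRIND_MONITOR_COMMAND(cmd);
}
```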

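A similarly rough sketch of the "blackbox" and SIGSEGV-handler ideas, again with entirely invented names (MoarVM has no such blackbox buffer today). It deliberately sidesteps the hard parts listed above: stopping the other threads, dumping bytecode/locals/lexicals, and deciding how much work is safe, given that only a small set of functions is async-signal-safe.

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical blackbox: a static buffer with a recognisable magic
 * prefix, easy to find in a core dump with a debugger or with
 * `strings core | grep MVMBLACKBOX`. */
static char blackbox[4096] = "MVMBLACKBOX: ";

static void record_crash_note(const char *note) {
    /* Plain byte copies only: nothing in here may allocate. */
    size_t used = strlen(blackbox);
    size_t room = sizeof blackbox - used - 1;
    size_t len  = strlen(note);
    if (len > room)
        len = room;
    memcpy(blackbox + used, note, len);
    blackbox[used + len] = '\0';
}

static void segv_handler(int sig) {
    static const char msg[] =
        "MoarVM panic: segmentation fault; this often means memory was\n"
        "corrupted, e.g. by unsynchronised access from multiple threads.\n";
    /* write() is async-signal-safe; fprintf and malloc are not. */
    write(STDERR_FILENO, msg, sizeof msg - 1);
    record_crash_note("SIGSEGV\n");
    /* Restore the default action and re-raise, so a normal core dump
     * (containing the blackbox buffer) is still produced. */
    signal(sig, SIG_DFL);
    raise(sig);
}

static void install_crash_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);
}
```
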
Tasks that should be done:

@dontlaugh commented Nov 11, 2024

Clang has a couple of annotations that can be added to C code to say, "This variable must be guarded by this mutex". This is compile-time static analysis; they call it Thread Safety Analysis.

Thread safety analysis works very much like a type system for multi-threaded programs. In addition to declaring the type of data (e.g. int, float, etc.), the programmer can (optionally) declare how access to that data is controlled in a multi-threaded environment. For example, if foo is guarded by the mutex mu, then the analysis will issue a warning whenever a piece of code reads or writes to foo without first locking mu. Similarly, if there are particular routines that should only be called by the GUI thread, then the analysis will warn if other threads call those routines.

Whether this is immediately useful to MoarVM depends on

  • Whether these annotations will work for the libuv mutex types, which are widely used
  • Whether we need bleeding edge LLVM for our code.
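
To make this concrete, here is a minimal sketch of the annotations in plain C, wrapping a pthread mutex and compiled with `clang -Wthread-safety`. All names are invented for illustration; whether the same wrapper approach is practical around libuv's uv_mutex_t is exactly the first open question above.

```c
#include <pthread.h>

/* Macro spellings follow the Clang Thread Safety Analysis docs. */
#define CAPABILITY(x) __attribute__((capability(x)))
#define GUARDED_BY(x) __attribute__((guarded_by(x)))
#define REQUIRES(x)   __attribute__((requires_capability(x)))
#define ACQUIRE(x)    __attribute__((acquire_capability(x)))
#define RELEASE(x)    __attribute__((release_capability(x)))

/* The analysis tracks "capabilities"; a wrapper struct turns a plain
 * mutex into one. */
typedef struct CAPABILITY("mutex") Mutex {
    pthread_mutex_t m;
} Mutex;

static void mutex_lock(Mutex *mu) ACQUIRE(mu);
static void mutex_unlock(Mutex *mu) RELEASE(mu);
static void mutex_lock(Mutex *mu)   { pthread_mutex_lock(&mu->m); }
static void mutex_unlock(Mutex *mu) { pthread_mutex_unlock(&mu->m); }

static Mutex counter_lock = { PTHREAD_MUTEX_INITIALIZER };
static int guarded_counter GUARDED_BY(counter_lock);

/* Clang warns at compile time if this is reached without the lock. */
static void bump_counter(void) REQUIRES(counter_lock);
static void bump_counter(void) { guarded_counter++; }

static void caller(void) {
    mutex_lock(&counter_lock);
    bump_counter();              /* fine: counter_lock is held */
    mutex_unlock(&counter_lock);
}
```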

@dontlaugh

> investigate how well Helgrind and TSan play with moar; last I checked, there were boatloads of false positives from our lock-free code

I wasn't aware of that, but Clang has an annotation to suppress these reports: https://clang.llvm.org/docs/ThreadSanitizer.html#attribute-no-sanitize-thread
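
For illustration, the attribute from that page looks like this in C; the function is an invented example of a deliberately unsynchronised read, not MoarVM code:

```c
/* Disables ThreadSanitizer instrumentation for this one function, so
 * TSan stops reporting the intentional race inside it. It also hides
 * any real races in the function, so it should be applied sparingly,
 * only to code that is lock-free by design. */
__attribute__((no_sanitize("thread")))
static int peek_flag_unsynchronised(const int *flag) {
    return *flag;   /* deliberately racy fast-path read */
}
```

TSan already understands C11/__atomic operations, so where a lock-free path can be expressed with explicit atomics, that tends to reduce false positives without blanket suppressions.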


I think runtime tools will be essential, but perhaps focusing on refactorings that support static analysis will yield good results in the long run, for both static and runtime analysis.

@timo (Member, Author) commented Nov 15, 2024

I've added a section to the end of the original post on identifying objects by paths from roots, using the heap snapshot machinery. It was inspired by a recent Rakudo issue report that also provides a useful real-world test case for triggering concurrent hash writes.

@timo (Member, Author) commented Nov 15, 2024

I ran the example code from that same issue with CROSS_THREAD_WRITE_LOG and got these results: https://gist.github.com/timo/a938a7d4fb9f07bd008b3ba33e7b3f73. Verdict: at the moment it's extremely spammy, with false positives from Lock::Async (likely caused by the internals of supply and react blocks), and we want to find a way to not report assignments to a different thread pool worker's attributes. Maybe by changing the owning thread of the object to the worker's thread?
