Race condition in SUIT around threads terminating #19195
Description
Description
I think that the code around starting and stopping the worker thread in SUIT (sys/suit/transport/worker.c
lines 156 and 162, around locking and unlocking _worker_lock
) rely on subtle architecture and optimization details to work.
Say, the trigger is on a high priority thread, then
- it is waiting for the _worker_lock.
- the worker terminates, unlocking the mutex. We're still one line above return NULL, which may or may not be an instruction that modifies the stack, depending on the precise architecture and inlining decisions.
- The high-priority thread becomes runnable, the mutex lock completes, and it starts initializing the stack (which is still in use by the old thread)
- That thread runs up to its first blocking point.
- The scheduler schedules the old worker thread, which has still not terminated, and returns NULL, thereby subtly thrashing the new thread's stack.
Granted, on most architectures the return NULL won't do anything, or will just set a return value on the stack which the new thread will not use until it returns on its own -- but we can't know.
Also, the TCB is allocated inside the stack, so that might cause some mixup.
As pointed out by @maribu, "Due to the TCB being allocated at the same position in RAM, the new thread also overrides the state of the old thread. If the old thread gets scheduled, the new thread context will run. I'm sure that this will eventually cause issues, but likely even harder to debug ones :)"