Fix scheduler thread wakeup bug #2926
Conversation
Bug summary: For platforms such as Linux that do not define the CPP macro USE_SCHEDULER_SCALING_PTHREADS, it's possible for a scheduler thread to go to sleep, wake up, and then get stuck in the bottom half of `suspend_scheduler()` because the boolean `scheduler_count_changing` is always `true`.

Bug fix: Maintain the invariant mentioned in the comments at the top of `suspend_scheduler()`: when USE_SCHEDULER_SCALING_PTHREADS is not defined, then the `scheduler_count_changing` flag must be `false` to "unlock" the critical section. This patch adds code to unlock prior to a `return` near the top of `suspend_scheduler()`.

Background: We've seen intermittent problems with Pony runtime shutdown, for example, this CircleCI test failure for ponyc, https://circleci.com/gh/ponylang/ponyc/19395, where the runtime doesn't shut down completely after the unit tests have finished executing, and the test is killed due to 10 minutes of stdout/stderr inactivity.

On an aarch64 platform, I managed to catch a `stdlib` unit test process while stuck after running the 303 unit tests. The process was consuming approx. 100% CPU time (1 CPU's/thread's worth) but hadn't exited after 20+ minutes of runtime. With LLDB attached, I witnessed one of the 4 scheduler threads stuck in `suspend_scheduler()`, busy looping through the `while (true)` loop at the bottom of the function. The value of `scheduler_count_changing` was always true. I set a watchpoint to catch when that value might be set to false by some other thread, but the watchpoint never tripped: the other three schedulers were usually asleep in `ponyint_cpu_core_pause()` while attempting to `steal()` work.

I checked the other functions in scheduler.c that also change `scheduler_count_changing`'s value but didn't find a similar flow-control error of forgetting to unlock when locked. Reviewers: please triple-check. ^_^
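For readers less familiar with the scheduler internals, here is a minimal standalone C11 sketch of the invariant being restored, not the actual patch: the function name `suspend_scheduler_sketch()` and the `must_bail_out_early` parameter are hypothetical, and the real control flow in scheduler.c is more involved. The point is only that every exit path that "locked" the flag must clear it, including the early `return` near the top.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the flag in scheduler.c; a sketch of the
 * pattern described above, not the actual patch. */
static atomic_bool scheduler_count_changing;

static void suspend_scheduler_sketch(bool must_bail_out_early)
{
  /* "Lock": this thread announces that the scheduler count is changing. */
  atomic_store_explicit(&scheduler_count_changing, true, memory_order_release);

  if(must_bail_out_early)
  {
#if !defined(USE_SCHEDULER_SCALING_PTHREADS)
    /* The fix: restore the invariant before the early return, so that a
     * scheduler thread spinning in the while(true) loop at the bottom of
     * the real suspend_scheduler() eventually sees the flag go false. */
    atomic_store_explicit(&scheduler_count_changing, false,
      memory_order_release);
#endif
    return;
  }

  /* ... suspend/resume work elided ... */

  /* The normal exit path also clears the flag. */
  atomic_store_explicit(&scheduler_count_changing, false,
    memory_order_release);
}
```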
@slfritchie Thanks for finding/fixing this bug that I introduced in #2641. Much appreciated.
Makes sense to me!

One small suggestion from someone new to this part of the runtime: what about encapsulating the reads and writes of `scheduler_count_changing`, along with the `#if defined` ceremony, in separate functions to make the code overall more readable and comprehensible?

But first of all: thanks for investigating, finding, and fixing this!
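For illustration, a minimal sketch of that kind of encapsulation, assuming a C11 `atomic_bool` flag; the helper names `sched_count_changing_store()` and `sched_count_changing_load()` are hypothetical and not part of the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the flag in scheduler.c, used only for this sketch. */
static atomic_bool scheduler_count_changing;

/* Hypothetical helpers: all writes to the flag funnel through one place. */
static void sched_count_changing_store(bool value)
{
  atomic_store_explicit(&scheduler_count_changing, value,
    memory_order_release);
}

static bool sched_count_changing_load(void)
{
  return atomic_load_explicit(&scheduler_count_changing,
    memory_order_acquire);
}
```

The `#if defined(USE_SCHEDULER_SCALING_PTHREADS)` branches could then live inside helpers like these instead of being repeated at every call site, which would also make a missed "unlock" easier to spot.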
I am super excited about this one and want it to be merged today. I looked at the file 3 times; I think we're good here.
Thanks to @slfritchie's hard work in ponylang#2926 and ponylang#2561 to find and fix the bugs in the dynamic scheduler scaling logic, scheduler thread 0 now also suspends and resumes correctly without issues (in my testing). This commit changes the default for `--ponyminthreads` from `1` to `0`. The last time @slfritchie ran into issues was in March and left a comment with the commands he used at: ponylang#2561 (comment) I ran the commands from @slfritchie's comment (with minor changes to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours without any issues.