Fix scheduler thread wakeup bug #2926
Conversation
Bug summary: For platforms such as Linux that do not define the CPP macro USE_SCHEDULER_SCALING_PTHREADS, it's possible for a scheduler thread to go to sleep, wake up, and then get stuck in the bottom half of `suspend_scheduler()` because the boolean `scheduler_count_changing` is always `true`.

Bug fix: Maintain the invariant mentioned in the comments at the top of `suspend_scheduler()`: when USE_SCHEDULER_SCALING_PTHREADS is not defined, then the `scheduler_count_changing` flag must be `false` to "unlock" the critical section. This patch adds code to unlock prior to a `return` near the top of `suspend_scheduler()`.

Background: We've seen intermittent problems with Pony runtime shutdown, for example, this CircleCI test failure for ponyc, https://circleci.com/gh/ponylang/ponyc/19395, where the runtime doesn't shut down completely after the unit tests have finished executing, and the test is killed due to 10 minutes of stdout/stderr inactivity.

On an aarch64 platform, I managed to catch a `stdlib` unit test process while stuck after running the 303 unit tests. The process was consuming approx. 100% CPU time (1 CPU's/thread's worth) but hadn't exited after 20+ minutes of runtime. With LLDB attached, I witnessed one of the 4 scheduler threads stuck in `suspend_scheduler()`, busy looping through the `while (true)` loop at the bottom of the function. The value of `scheduler_count_changing` was always true. I set a watchpoint to catch when that value might be set to false by some other thread, but the watchpoint never tripped: the other three schedulers were usually asleep in `ponyint_cpu_core_pause()` while attempting to `steal()` work.

I checked the other functions in scheduler.c that also change `scheduler_count_changing`'s value but didn't find a similar flow-control error of forgetting to unlock when locked. Reviewers: please triple-check. ^_^
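For readers less familiar with the scheduler internals, here is a minimal standalone C11 sketch of the invariant being restored, not the actual patch: the function name `suspend_scheduler_sketch()` and the `must_bail_out_early` parameter are hypothetical, and the real control flow in scheduler.c is more involved. The point is only that every exit path that "locked" the flag must clear it, including the early `return` near the top.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the flag in scheduler.c; a sketch of the
 * pattern described above, not the actual patch. */
static atomic_bool scheduler_count_changing;

static void suspend_scheduler_sketch(bool must_bail_out_early)
{
  /* "Lock": this thread announces that the scheduler count is changing. */
  atomic_store_explicit(&scheduler_count_changing, true, memory_order_release);

  if(must_bail_out_early)
  {
#if !defined(USE_SCHEDULER_SCALING_PTHREADS)
    /* The fix: restore the invariant before the early return, so that a
     * scheduler thread spinning in the while(true) loop at the bottom of
     * the real suspend_scheduler() eventually sees the flag go false. */
    atomic_store_explicit(&scheduler_count_changing, false,
      memory_order_release);
#endif
    return;
  }

  /* ... suspend/resume work elided ... */

  /* The normal exit path also clears the flag. */
  atomic_store_explicit(&scheduler_count_changing, false,
    memory_order_release);
}
```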
@slfritchie Thanks for finding/fixing this bug that I introduced in #2641. Much appreciated.
Makes sense to me!

One small suggestion from someone new to this part of the runtime: what about encapsulating the reads and writes of `scheduler_count_changing`, along with the `#if defined` ceremony, in separate functions to make the code overall more readable and comprehensible?

But first of all: thanks for investigating, finding, and fixing this!
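For illustration, a minimal sketch of that kind of encapsulation, assuming a C11 `atomic_bool` flag; the helper names `sched_count_changing_store()` and `sched_count_changing_load()` are hypothetical and not part of the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the flag in scheduler.c, used only for this sketch. */
static atomic_bool scheduler_count_changing;

/* Hypothetical helpers: all writes to the flag funnel through one place. */
static void sched_count_changing_store(bool value)
{
  atomic_store_explicit(&scheduler_count_changing, value,
    memory_order_release);
}

static bool sched_count_changing_load(void)
{
  return atomic_load_explicit(&scheduler_count_changing,
    memory_order_acquire);
}
```

The `#if defined(USE_SCHEDULER_SCALING_PTHREADS)` branches could then live inside helpers like these instead of being repeated at every call site, which would also make a missed "unlock" easier to spot.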
I am super excited about this one and want it to be merged today. I looked at the file 3 times; I think we're good here.
Thanks to @slfritchie's hard work in ponylang#2926 and ponylang#2561 to find and fix the bugs in the dynamic scheduler scaling logic, scheduler thread 0 now also suspends and resumes correctly without issues (in my testing). This commit changes the default for `--ponyminthreads` from `1` to `0`. The last time @slfritchie ran into issues was in March and left a comment with the commands he used at: ponylang#2561 (comment) I ran the commands from @slfritchie's comment (with minor changes to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours without any issues.