Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix scheduler thread wakeup bug #2926

Merged
merged 1 commit into from
Oct 26, 2018
Merged

Fix scheduler thread wakeup bug #2926

merged 1 commit into from
Oct 26, 2018

Conversation

slfritchie
Copy link
Contributor

Bug summary:

For platforms such as Linux that do not define the CPP macro
USE_SCHEDULER_SCALING_PTHREADS, it's possible for a scheduler
thread to go to sleep, wake up, and then get stuck in the
bottom half of suspend_scheduler() because the boolean
scheduler_count_changing is always true

Bug fix:

Maintain the invariant mentioned in the comments at the top of
suspend_scheduler(): when USE_SCHEDULER_SCALING_PTHREADS is
not defined, then the scheduler_count_changing flag must be
false to "unlock" the critical section. This patch adds code
to unlock prior to a return near the top of
suspend_scheduler().

Background:

We've seen intermittent problems with Pony runtime shutdown, for example,
this CircleCI test failure for ponyc,
https://circleci.com/gh/ponylang/ponyc/19395, where the runtime doesn't
shut down completely after the unit tests have finished executing, and
the test is killed due to 10 minutes of stdout/stderr inactivity.

On an aarch64 platform, I managed to catch a stdlib unit test
process while stuck after running the 303 unit tests. The process
was consuming approx. 100% CPU time (1 CPU's/thread's worth) but
hadn't exited after 20+ minutes of runtime. With LLDB attached, I
witnessed one of the 4 scheduler threads stuck in
suspend_scheduler(), busy looping through the while (true) loop
at the bottom of the function. The value of scheduler_count_changing
was always true. I set a watchpoint to catch when that value
might be set to false by some other thread, but the watchpoint never
tripped: the other three schedulers were usually asleep in
ponyint_cpu_core_pause() while attempting to steal() work.

I checked the other functions in scheduler.c that also change
scheduler_count_changing's value but didn't find a similar
flow control error of forgetting to unlock when locked.
Reviewers: please triple-check. ^_^

Bug summary:

For platforms such as Linux that do not define the CPP macro
USE_SCHEDULER_SCALING_PTHREADS, it's possible for a scheduler
thread to go to sleep, wake up, and then get stuck in the
bottom half of `suspend_scheduler()` because the boolean
`scheduler_count_changing` is always `true`

Bug fix:

Maintain the invariant mentioned in the comments at the top of
`suspend_scheduler()`: when USE_SCHEDULER_SCALING_PTHREADS is
not defined, then the `scheduler_count_changing` flag must be
`false` to "unlock" the critical section.  This patch adds code
to unlock prior to a `return` near the top of
`suspend_scheduler()`.

Background:

We've seen intermittent problems with Pony runtime shutdown, for example,
this CircleCI test failure for ponyc,
https://circleci.com/gh/ponylang/ponyc/19395, where the runtime doesn't
shut down completely after the unit tests have finished executing, and
the test is killed due to 10 minutes of stdout/stderr inactivity.

On an aarch64 platform, I managed to catch a `stdlib` unit test
process while stuck after running the 303 unit tests.  The process
was consuming approx. 100% CPU time (1 CPU's/thread's worth) but
hadn't exited after 20+ minutes of runtime.  With LLDB attached, I
witnessed one of the 4 scheduler threads stuck in
`suspend_scheduler()`, busy looping through the `while (true)` loop
at the bottom of the function.  The value of `scheduler_count_changing`
was always true.  I set a watchpoint to catch when that value
might be set to false by some other thread, but the watchpoint never
tripped: the other three schedulers were usually asleep in
`ponyint_cpu_core_pause()` while attempting to `steal()` work.

I checked the other functions in scheduler.c that also change
`scheduler_count_changing`'s value but didn't find a similar
flow control error of forgetting to unlock when locked.
Reviewers: please triple-check.  ^_^
@slfritchie slfritchie added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Oct 24, 2018
@slfritchie slfritchie requested a review from dipinhora October 24, 2018 19:36
Copy link
Contributor

@dipinhora dipinhora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slfritchie Thanks for finding/fixing this bug that I introduced in #2641. Much appreciated.

Copy link
Contributor

@mfelsche mfelsche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes pretty much sense to me!

One small suggestion from someone new to this part of the runtime: What about encapsulating reading and writing to scheduler_count_changing and the #if defined ceremony in separate functions to make the code overall more readable and comprehensible?

@mfelsche
Copy link
Contributor

But first of all: thanks for investigating, finding and fixing this!

@mfelsche
Copy link
Contributor

I am super excited about this one and want this to be merged today, i looked at the file 3 times. I think we good here.

@mfelsche mfelsche merged commit e13f83c into master Oct 26, 2018
ponylang-main added a commit that referenced this pull request Oct 26, 2018
@mfelsche mfelsche deleted the suspend_scheduler-wakeup branch October 26, 2018 14:23
dipinhora added a commit to dipinhora/ponyc that referenced this pull request Feb 23, 2019
Thanks to @slfritchie's hard work in ponylang#2926 and ponylang#2561 to find and
fix the bugs in the dynamic scheduler scaling logic, scheduler
thread 0 now also suspends and resumes correctly without issues
(in my testing).

This commit changes the default for `--ponyminthreads` from `1` to
`0`. The last time @slfritchie ran into issues was in March and
left a comment with the commands he used at:
ponylang#2561 (comment)

I ran the commands from @slfritchie's comment (with minor changes
to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours
without any issues.
dipinhora added a commit to dipinhora/ponyc that referenced this pull request Feb 26, 2019
Thanks to @slfritchie's hard work in ponylang#2926 and ponylang#2561 to find and
fix the bugs in the dynamic scheduler scaling logic, scheduler
thread 0 now also suspends and resumes correctly without issues
(in my testing).

This commit changes the default for `--ponyminthreads` from `1` to
`0`. The last time @slfritchie ran into issues was in March and
left a comment with the commands he used at:
ponylang#2561 (comment)

I ran the commands from @slfritchie's comment (with minor changes
to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours
without any issues.
SeanTAllen pushed a commit that referenced this pull request Feb 27, 2019
Thanks to @slfritchie's hard work in #2926 and #2561 to find and
fix the bugs in the dynamic scheduler scaling logic, scheduler
thread 0 now also suspends and resumes correctly without issues
(in my testing).

This commit changes the default for `--ponyminthreads` from `1` to
`0`. The last time @slfritchie ran into issues was in March and
left a comment with the commands he used at:
#2561 (comment)

I ran the commands from @slfritchie's comment (with minor changes
to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours
without any issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants