Await comply with reactive streams Subscriber rule 2.7 #3360

EgorKulbachka · 2022-07-07T09:09:19Z

Rule 2.7 requires Subscription methods to be executed serially.

There is a possibility of race between Subscription.request and Subscription.cancel methods since cancellation handler could be executed in a separate thread.

In particular this bug leads to database connection leaks if used with jooq subscription as cancel can happen while request being initialized and hence it misses to close connection.

There is a possibility of race of Subscription.request and Subscription.cancel methods since cancellation handler could be executed in a separate thread. Rule [2.7](https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.3/README.md#2-subscriber-code) requires Subscription methods to be executed serially.

dkhalanskyjb · 2022-07-07T10:35:23Z

Thank you! This is indeed a problem, and I don't think your solution is enough to fix it. Looks like it is possible that cont.cancel also happens in parallel with a cont.cancel in the onNext implementation. I think one can't get around having to use locking here.

EgorKulbachka · 2022-07-07T11:02:17Z

Thank you for the quick feedback!

I'm not sure if parallel cancel executions might be the problem, as specified in subscription rules 3.5 cancel method itself should be thread-safe (i.e. without requiring external synchronization to ensure program correctness.)

I can try myself to introduce lock here but I feel it might be excessive as well might have additional performance implications which is hard for me to estimate.

dkhalanskyjb · 2022-07-07T12:25:14Z

Good point about 3.5, but still, I think we should comply with 2.7 to the letter. For example, see its proposed interpretation:

The intent of this rule is to permit the calling of the request and cancel methods (including from multiple threads) if and only if a happens-before relation between each of the calls is established.

I don't think there's room for ambiguity here: the calls must have a happen-before link between them.

I can try myself to introduce lock here but I feel it might be excessive as well might have additional performance implications which is hard for me to estimate.

I would estimate the implications as insignificant. @qwwdfsad has more in-depth knowledge of this and can correct me if I'm wrong, but as I understand it, the performance impact of synchronized is twofold:

To establish happens-before relations, it must synchronize the state to the main memory,
Some heavy machinery is used orchestrate threads in case of contention.

1 is insignificant, because using coroutines already causes the state to synchronize to the main memory on dispatches, and 2 is insignificant because, when threads don't fight for the right to enter a synchronized block, the orchestration is trivial and fast, but if they do, well, that's what we want to guard against.

…hods

EgorKulbachka · 2022-07-07T14:52:14Z

Makes sense to me.

I've added synchronization to the subscription methods. Unfortunately I couldn't figure out how to avoid delay in the test with this solution.

dkhalanskyjb · 2022-07-07T15:32:27Z

Great, thanks! If, at any point, you feel like you don't want to continue working on this pull request, please say so, and we will polish it ourselves, because we want it to land in the next release, which should happen soon. You will be credited in any case.

Regarding the test: yes, I think we need to fix it, but more radically.

Currently, the test is structured in such a way that it will pass most of the time even if the bug is not fixed: I removed the @Synchronized annotation, but the test still passes successfully. So, it clearly doesn't test that there is no bug.

What do we want to test here? Judging by the name of the pull request, we want to test that rule 2.7 is upheld even in the presence of cancellation and multithreading, that is, that the operations on a subscriber are not entered in parallel. This feels like a good job for a stress test. We have plenty of those (see PublisherRequestStressTest.kt for an example), and they are a type of test where you spin some code in a loop for many iterations and check that what you want to uphold was not violated in any of them.

So, in this case, a good test, I think, would be something like this: create a subscriber that sets some variable to 1 when request or cancel is entered, and to 0 when the execution leaves the operation. Crash if, on trying to call request or cancel, that variable is already 1. Then, in parallel do awaitOne on such a subscriber and cancellation of the scope in which the awaitOne happens. Do all of the above many times.

If you do write this test, please also check that, when the bug is present, the test finds it consistently.

reactive/kotlinx-coroutines-reactive/src/Await.kt

EgorKulbachka · 2022-07-07T17:26:35Z

Thanks for the suggestions!

I've added suggested test, it fails for me locally only when I have sleep in request method. Please let me know if this is the right direction you had in mind.

dkhalanskyjb · 2022-07-08T09:08:28Z

testPublisher is almost exactly what I had in mind.

sleep is a sign that something is heavily dependent on the execution time, which is not good. More on that below.
This publisher never sends any values, so it can't check that there is no race between cancellation and calls to sub.cancel in onNext.

Now, the test procedure itself could be improved significantly. Let's look at what happens currently.

First, jobsToRun jobs are created and launched. On my machine, this process takes about 40 ms, give or take. These jobs don't wait for the moment cancellation is allowed to start to happen, they just start working immediately, so the first job has about 40 ms to run before job.cancel happens.
Those 40 ms are more than enough for the request in the first job to finish, so this is why you need sleep: otherwise, there is no race and, by the time cancellation happens, the publisher is just waiting to be cancelled.
While cancellation of the first n publishers happens, the subsequent publishers are running as well, so every publisher suffers from this issue. Increasing their number doesn't help, it just ensures that each publisher has even more time to execute and thus avoid the race.

The fact that publishers are running in parallel also obscures the results: there aren't that many cores in the typical CPU, so, despite thousands of coroutines launching, less than 20 of them will typically execute in parallel.

What can be done here to improve the situation: instead of launching everything and only then cancelling everything, you can perform many self-contained iterations that don't influence one another in any way. PublisherCompletionStressTest, for example, follows this general pattern. In each iteration, launch a single coroutine and immediately try to cancel it in parallel. This way, cancellations should have a good chance to race with calls to request, even without sleep.

EgorKulbachka · 2022-07-08T10:19:23Z

I've rewritten the test addressing your comments, please let me know what you think.

When I've added onNext to the test it reproduces the issue better but I faced another problem: we technically still call cancel from within request (when Subscription calls onNext inside request) and hence counter checks do not work for test anymore, I've changed that to use Lock instead to make sure no concurrent threads execute methods in parallel.

Do you think this is good enough to consider cancel and request are executed serially?
I find a lot of Subscriber implementations follow same pattern (example project reactor) but I guess it's not really serial execution since we run cancel method before request method return. Curious to hear your opinion.

Thanks for the patience and explanations btw.

dkhalanskyjb · 2022-07-08T11:39:06Z

This test doesn't actually check the synchronization in onNext, because request is itself already synchronized and cancellation can't happen in parallel with it. This is unfortunate, but, looking at this, I only see ways to test this that involve fairly advanced tricks, which may just not be worth it.

So, I think this is good to go. Good work, thank you!

sadensmol · 2022-07-12T13:06:17Z

Really nice fix, we are struggling from it for a long time! Thank you @EgorKulbachka

dkhalanskyjb · 2022-07-12T13:08:53Z

@sadensmol we take such issues seriously, so if you experience something else of the sort, please feel free to report it.

There is a possibility of a race between Subscription.request and Subscription.cancel methods since cancellation handler could be executed in a separate thread. Rule [2.7](https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.3/README.md#2-subscriber-code) requires Subscription methods to be executed serially.

…to baxter/upstream-flow-timeout * origin/baxter/upstream-flow-timeout: (328 commits) Commit API dump Cleanup API, update knit Fix typo in runTest method docs (Kotlin#3417) Update coroutines-and-channels.md (Kotlin#3410) chore: update the website's release step (Kotlin#3397) ktl-695 chore: support Dokka HTML customization (Kotlin#3388) update: KT-50122 adding kotlinx.dependencies Improve bump-version.sh (Kotlin#3365) Fix documentation for `DEBUG_PROPERTY_VALUE_OFF` (Kotlin#3389) feat: moving coroutines hands-on to docs (Kotlin#3369) Version 1.6.4 Improve CoroutineDispatcher documentation (Kotlin#3359) Update binary compatibility validator to 0.11.0 (Kotlin#3362) Add a scope for launching background work in tests (Kotlin#3348) Fix debug module publication with shadow plugin (Kotlin#3357) Comply with Subscriber rule 2.7 in the `await*` impl (Kotlin#3360) Update readme (Kotlin#3343) Reduce reachable references of disposed invokeOnTimeout handle (Kotlin#3353) breakleg; knit validation fix Additional comment in CoroutineScheduler ... # Conflicts: # README.md

qwwdfsad requested a review from dkhalanskyjb July 7, 2022 10:42

Use synchronized block to ensure serial execution of subscription met…

4db2131

…hods

dkhalanskyjb reviewed Jul 7, 2022

View reviewed changes

reactive/kotlinx-coroutines-reactive/src/Await.kt Outdated Show resolved Hide resolved

EgorKulbachka added 2 commits July 7, 2022 18:50

Stress test for await cancel race condition

90045c4

Revert await test changes

179bf3f

Rename test method

e7a342b

Rewrite stress test to sequential iterations and including onNext

947adb8

dkhalanskyjb merged commit 8b6473d into Kotlin:develop Jul 8, 2022

EgorKulbachka deleted the cancellation-race branch July 8, 2022 12:01

dkhalanskyjb added the reactive label Jul 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Await comply with reactive streams Subscriber rule 2.7 #3360

Await comply with reactive streams Subscriber rule 2.7 #3360

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 7, 2022

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 7, 2022

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 7, 2022 •

edited

Loading

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 8, 2022 •

edited

Loading

EgorKulbachka commented Jul 8, 2022

dkhalanskyjb commented Jul 8, 2022

sadensmol commented Jul 12, 2022

dkhalanskyjb commented Jul 12, 2022

Await comply with reactive streams Subscriber rule 2.7 #3360

Await comply with reactive streams Subscriber rule 2.7 #3360

Conversation

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 7, 2022

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 7, 2022

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 7, 2022 • edited Loading

EgorKulbachka commented Jul 7, 2022

dkhalanskyjb commented Jul 8, 2022 • edited Loading

EgorKulbachka commented Jul 8, 2022

dkhalanskyjb commented Jul 8, 2022

sadensmol commented Jul 12, 2022

dkhalanskyjb commented Jul 12, 2022

dkhalanskyjb commented Jul 7, 2022 •

edited

Loading

dkhalanskyjb commented Jul 8, 2022 •

edited

Loading