Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race between periodic sync and delete event. #26167

Closed
wants to merge 1 commit into from

Conversation

jsafrane
Copy link
Member

Controller should ignore "Sync" event on deleted objects. When controller receives 'object deleted' event from etcd and at the same time it starts periodic sync, "Sync" event can be enqueued after "Delete" event. Processing "Deleted" event, controller removes the object from its cache. It should not create the object there when processing "Sync" event after that.

Fixes #26082

Please review carefully, I'm not sure if it's the right thing to do - this must have been tested to death, everybody uses NewInformer()...

Also, an unit test would be nice, however all controller tests are in framework_test package and I can't call controller.config.Queue.Resync() from there to simulate periodic resync (controller.config is not exported). Is there a reliable way to wait for periodic sync in a AddFunc callback? I could sleep, but that's flaky...

Controller should ignore "Sync" event on deleted objects. When controller
receives 'object deleted' event from etcd and at the same time it starts
periodic sync, "Sync" event can be enqueued after "Delete" event.
Processing "Deleted" event, controller removes the object from its cache.
It should not create the object there when processing "Sync" event after that.
@jsafrane
Copy link
Member Author

@lavalamp, PTAL

@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-label-needed labels May 24, 2016
@k8s-bot
Copy link

k8s-bot commented May 24, 2016

GCE e2e build/test failed for commit a1feee8.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@adohe-zz
Copy link

@jsafrane I am just a little bit confused about this race. IIUC you mean that there is race condition between Delete and Resync, I think after Delete, the item should be deleted, then Resync will not see that item.

@jsafrane
Copy link
Member Author

@adohe, resync happens when Delete event is in queue and not processed yet. Then you get both Delete and Sync events in the queue and you hit the bug.

@jsafrane
Copy link
Member Author

Notice that Delete does not remove the object being deleted from f.knownObjects, so subsequent Resync can still find it there. (umm, that could mean we could fix it in DeltaFIFO and not on the controller level...)

@jsafrane
Copy link
Member Author

oops, unit test did not pass, I'll look at it tomorrow. Anyway, you can fix it by yourself, the PR here is just to show what's probably wrong.

@adohe-zz
Copy link

@jsafrane I think the Resync method could be improved to fix this.

@lavalamp
Copy link
Member

Thanks for the PR-- I need to think about this carefully to understand. I'm swamped, it may take a bit before I have a chance.

@adohe-zz
Copy link

@jsafrane I think this PR can fix the race condition you describe. I like this fix. 👍

@jsafrane
Copy link
Member Author

The tests are failing because the initial sync is done as Sync events which are ignored because of this PR. I think the real fix will be few lines changed here and there, however I am pretty swamped too :to investigate how these reflectors and informers and queues work together.

@adohe-zz
Copy link

Also I will take a look at the failed test.

@hongchaodeng
Copy link
Contributor

hongchaodeng commented Jun 8, 2016

When controller receives 'object deleted' event from etcd and at the same time it starts periodic sync, "Sync" event can be enqueued after "Delete" event.

How is it possible?
If the queue gets Delete 'A' event first, 'A' should be deleted from remote store. If we get Sync, it means reflector will List from remote store and return the dataset that doesn't contain 'A'. In this case, syncWith() wouldn't return Sync event anyway. Right?

@caesarxuchao
Copy link
Member

caesarxuchao commented Jun 8, 2016

Resync and Relist are separated now. Resync is from the local store, so the race described is possible.

This problem and the #27004 are exposed by the separation of resync and relist.

@hongchaodeng
Copy link
Contributor

I see.
Thanks @caesarxuchao for the explanation.

@caesarxuchao
Copy link
Member

@hongchaodeng perhaps the cure for this issue and #27004 is to wait for the queue to be drained before reading the store (i.e., before the resyncing in this issue, and before checking if an object exists in store in #27004).

@saad-ali
Copy link
Member

saad-ali commented Jun 8, 2016

@lavalamp is this something we want for 1.3?

@lavalamp
Copy link
Member

lavalamp commented Jun 8, 2016

Yes, if the list/sync split exacerbated this then we should fix that.

On Wed, Jun 8, 2016 at 11:35 AM, Saad Ali notifications@github.com wrote:

@lavalamp https://github.com/lavalamp is this something we want for 1.3?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#26167 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAnglhv7jZ_IIJ6dgJVCma7BEellcy7kks5qJwtjgaJpZM4Ilh2H
.

@saad-ali saad-ali added this to the v1.3 milestone Jun 8, 2016
@jsafrane jsafrane added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jun 9, 2016
@jsafrane
Copy link
Member Author

jsafrane commented Jun 9, 2016

I'm marking it as 'do not merge', my fix is wrong. I'm leaving it open just for documentation and search purposes, feel free to close this PR when you have a better one.

@dchen1107
Copy link
Member

I am closing the pr per author's comment above. Please document this as known issue in #27076.

@dchen1107 dchen1107 closed this Jun 10, 2016
@jsafrane jsafrane deleted the fix-controller-race branch August 19, 2016 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants