Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Racy controller from framework.NewInformer? #26082

Closed
jsafrane opened this issue May 23, 2016 · 5 comments
Closed

Racy controller from framework.NewInformer? #26082

jsafrane opened this issue May 23, 2016 · 5 comments
Assignees
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@jsafrane
Copy link
Member

When I stress-test PersistentVolume controller I sometimes get (correct) "PersistentVolumeClaim deleted" event followed by (wrong) "PersistentVolumeClaim added". This happens very rarely (<0.01%) when I delete a PersistentVolumeClaim and the controller is under heavy load.

I use framework.NewInformer (link ) like everybody else. Reading its code, what happens when the controller is processing 'PVC deleted' event at line 251, i.e just before Delete() is called and at the same time periodic sync starts? The PVC that is being deleted is still in the cache and a 'PVC Sync' will be added into Queue and my controller gets OnDelete() callback followed by OnAdd().

My PersistentVolume controller then sees an old claim and tries to do stuff with it and sometimes it updates the claim in etcd. So the claim gets 'resurrected' in etcd even though user deleted it.

It is possible that my controller does something terribly wrong - I checked it twice, but it's pretty complicated piece of code. I am 100% sure that OnAdd() is called as reaction to cache.Sync event, I did a debug log in Process function of Informer.

@jsafrane
Copy link
Member Author

@lavalamp, you're author of the potentially racy code, can you please take a look?

@jsafrane
Copy link
Member Author

I have reproduced of the observed behavior. Get it at reproduce-informer-race branch of my Kubernetes fork, patch jsafrane@0b5bc1d

Compile it:

cd test/integration/
godep go test -tags 'integration no-docker' -c

Run etcd somewhere and then run the test:

./integration.test -test.run TestInformerRace -v 2

(with -v 3 you get some logs I added to the informer and reflector)

The test spawns a very simple controller with AddFunc, ModifyFunc and DeleteFunc and creates, modifies and deletes one PV. ModifyFunc is intentionally slow to trigger periodic sync when the PV was already deleted by the client, but this delete event was not yet processed by the controller - the volume is still in the controller cache. This means in the controller queue is cache.Deleted and cache.Sync after that. cache.Deleted event removes the volume from cache and cache.Sync does not find it there, therefore invokes AddFunc callback and adds the PV to the cache -> troubles.

@jsafrane
Copy link
Member Author

jsafrane commented May 24, 2016

I have sort of a fix, please review very carefully #26167

@lavalamp lavalamp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels May 24, 2016
@lavalamp
Copy link
Member

I was pretty sure there was a race in there, but never able to repro, thanks.

@jsafrane
Copy link
Member Author

Fixed by #27341

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Projects
None yet
Development

No branches or pull requests

2 participants