
Race condition in Informer #27004

Closed
hongchaodeng opened this issue Jun 7, 2016 · 40 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Comments

@hongchaodeng
Contributor

hongchaodeng commented Jun 7, 2016

What's the issue?

The Informer can end up in an incorrect state.

Example?

I have been stress testing an RC with 100 replicas: I keep deleting pods and letting the RC recover back to 100. The test program calls informer.store.List() to check that there are 100 pods. But in the end it returned more than 100 (e.g. 131), while listing directly from the apiserver returned 100.

Why?

Informer has a queue and a store (aka indexer) internally.

  • The store relies on the queue to do a blocking pop() and then updates itself in Process().
  • The queue (DeltaFIFO) queues events from the watcher. For each operation seen by the reflector, e.g. Delete, we enqueue an event onto the queue.
  • However, DeltaFIFO.Delete() first checks the store to decide whether it needs to send a delete event. This logic is wrong because the queue and the store run concurrently: there could be an ADD event not yet processed by the store, in which case the queue won't enqueue the DELETE event (see the sketch below).
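
For reference, a simplified sketch of the check being described; field and method names approximate the DeltaFIFO of that time, not an exact copy:

// Sketch of the problematic check: knownObjects is the informer's store,
// which only reflects deltas the controller has already popped and processed.
func (f *DeltaFIFO) Delete(obj interface{}) error {
    id, err := f.KeyOf(obj)
    if err != nil {
        return KeyError{obj, err}
    }
    f.lock.Lock()
    defer f.lock.Unlock()
    // If an ADD for this object is still queued in f.items (not yet applied
    // to the store), the object looks "unknown" here and the DELETE is
    // silently dropped -- the store never learns about the deletion.
    if _, exists, err := f.knownObjects.GetByKey(id); err == nil && !exists {
        return nil
    }
    return f.queueActionLocked(Deleted, obj)
}
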
@hongchaodeng
Contributor Author

I'm able to reproduce this reliably in an integration test.
I'm going to submit a unit test shortly.

@hongchaodeng
Contributor Author

@kubernetes/sig-api-machinery @kubernetes/sig-scalability @xiang90 @philips

@hongchaodeng hongchaodeng changed the title Race condition of Informer and DeltaFIFO Race condition in Informer and DeltaFIFO Jun 7, 2016
@hongchaodeng hongchaodeng changed the title Race condition in Informer and DeltaFIFO Race condition in Informer Jun 7, 2016
@lavalamp
Member

lavalamp commented Jun 7, 2016

I think there's another issue out there for this race already...


@xiang90
Contributor

xiang90 commented Jun 7, 2016

@lavalamp You mean #26082? I would wait for @hongchaodeng's repro.

@lavalamp
Member

lavalamp commented Jun 7, 2016

Yes, thanks for linking.


@hongchaodeng
Contributor Author

I have some doubts about #26082's description of the event sequence. See #26167 (comment)

Let's keep it separated from this issue.

@caesarxuchao
Member

@saad-ali this may be related to the flake you are working on.

@wojtek-t wojtek-t added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jun 8, 2016
@lavalamp
Member

lavalamp commented Jun 8, 2016

The problem is a lack of synchronization between the store and the queue. If you can make a test case that reproduces an error, that would be great. I didn't manage that the last time I messed with that code.

@hongchaodeng
Contributor Author

hongchaodeng commented Jun 8, 2016

Reproduced in a unit test. See #27064

@hongchaodeng
Contributor Author

@lavalamp
I totally agree.
I think we should keep all the checking and syncing in one place -- the store. Let me sort it out and submit the code shortly.

@saad-ali
Member

saad-ali commented Jun 8, 2016

@lavalamp This should be triaged for 1.3

@hongchaodeng
Contributor Author

I read through the informer code more thoroughly today and figured out there are more problems beyond missing Delete events: Resync can return inconsistent data, Replace (on relist) can return inconsistent data, and so can Delete. These three functions are the ones calling the store functions ListKeys and GetByKey.

@wojtek-t wojtek-t added this to the v1.3 milestone Jun 9, 2016
@wojtek-t
Member

wojtek-t commented Jun 9, 2016

@hongchaodeng - what do you mean by "inconsistent data" above? Can you please clarify?

@hongchaodeng
Contributor Author

Currently the store is used in DeltaFIFO like this:

Replace, Resync:
    for _, k := range f.knownObjects.ListKeys() {
        ...
        deletedObj, exists, err := f.knownObjects.GetByKey(k)

In this code, new events could be processed and the data could change between ListKeys() and the iteration, and between successive GetByKey() calls.

Things can get weird. For example, with the event sequence A=1, B=1, B=2, A=2, I could end up getting B=1 and A=2, while I would definitely expect B=2 from a List(). Maybe that particular mix doesn't break a given use case, but without a strong consistency guarantee we end up with a lot of workarounds and complicated code...

Another problem is the interleaving of Resync with the normal flow in the Reflector's ListAndWatch(), i.e. Replace -> Add/Update/Delete. #26966 is an example of that.

Delete:
    if _, exists, err := f.knownObjects.GetByKey(id); err == nil && !exists {
        return nil
    }

This is what the issue is about.
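
To make the race window concrete, here is a minimal sketch of the Resync-style loop quoted above, annotated with the two places where the concurrently updated store can change underneath it (simplified; the names follow the snippets above rather than the exact upstream code):

// knownObjects is the informer's store; the controller goroutine keeps
// applying popped deltas to it while this loop runs.
func (f *DeltaFIFO) Resync() error {
    f.lock.Lock()
    defer f.lock.Unlock()
    // Window 1: ListKeys snapshots the store's keys under the store's own
    // lock and releases it; a delta popped earlier may be applied to the
    // store right after this call returns.
    for _, k := range f.knownObjects.ListKeys() {
        // Window 2: each GetByKey locks the store separately, so successive
        // keys can be read from different store states (e.g. B read before
        // B=2 was applied, A read after A=2 was applied).
        obj, exists, err := f.knownObjects.GetByKey(k)
        if err != nil || !exists {
            continue
        }
        if err := f.queueActionLocked(Sync, obj); err != nil {
            return err
        }
    }
    return nil
}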

@wojtek-t
Member

wojtek-t commented Jun 9, 2016

I don't understand - the whole Replace method is under lock, as well as Add, Update, ...

These cannot interrupt each other...

@hongchaodeng
Contributor Author

hongchaodeng commented Jun 9, 2016

Replace method is under lock, as well as Add, Update, ...

The race is not between methods of DeltaFIFO itself.
The race is that events are pop()-ed from the DeltaFIFO but processed concurrently by the informer's controller, so interleaved sequences of operations can occur (see the sketch below).
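
For illustration, a minimal sketch of that window, assuming the controller's processing loop of the time (names and signatures are illustrative, not exact):

// A delta leaves the DeltaFIFO in (1), but the store (DeltaFIFO's
// knownObjects) is only updated in (2). Between the two, DeltaFIFO methods
// that consult knownObjects see a stale view.
for {
    obj := queue.Pop()   // (1) delta removed from the DeltaFIFO
    process(obj)         // (2) store.Add/Update/Delete + handler callbacks
}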

@wojtek-t
Member

wojtek-t commented Jun 9, 2016

Sorry - I still don't understand - can you please give an exact example? Like, what needs to happen to trigger the race (e.g. on Replace)?

@caesarxuchao
Member

@wojtek-t I think @hongchaodeng is talking about this bug:
#26167 (comment)


@fejta
Contributor

fejta commented Jun 11, 2016

Randomly assigning people per thread participation. Please correct assignees as necessary.

@wojtek-t
Member

I thought about it a bit and to be honest, I don't think we can do a smart small change to fix that.

I think that there are basically two options possible here:

  1. spread some really complicated locking all over the code to lock the "knownObjects" (which is unfortunately not owned by the DeltaFIFO in most cases)
  2. don't pass "knownObjects" to DeltaFIFO as an argument and instead have DeltaFIFO store its own copy of "knownObjects" internally. Proper locking would then be quite simple, but it would have a significant memory overhead.

To be honest, I think 1 is not maintainable, so I would vote for 2, but that's also far from an ideal solution...

@smarterclayton
Contributor

All accessors to the store are locked today, right? So could we allow DeltaFIFO to start a critical section that also holds the lock? That's kind of like 1, but doesn't DeltaFIFO have a relatively small critical section on the store it needs to borrow?

@wojtek-t
Member

@smarterclayton - I'm not sure I understand.
The real issue here is that the "controller framework" runs "DeltaFIFO::Pop()" in a loop and processes what it gets by updating its internal storage (which is also the knownObjects in DeltaFIFO).
So basically the state of the DeltaFIFO and of knownObjects is inconsistent during the period from when Pop() has been called to the moment that information is applied to knownObjects in the controller.

So the only possibility I can see here is that Pop() also returns an "unlock function" which the caller invokes once the event returned from Pop has been processed, and which in turn unlocks the DeltaFIFO (but this isn't a nice solution...)

@smarterclayton
Contributor

This is a perfect opportunity to use a PopLockAndDrop() function... :) Jokes aside, isn't there a need anyway for Pop to allow another caller to properly hold a lock? We've had this use case in a lot of places where we wanted:

fifo.PopObject(func(object interface{}) error {
  // do something with object
})

that operates under the store lock. The downside is that we'd have to provide the store to the function without locks (since they aren't reentrant):

fifo.PopObject(func(store cache.Store, object interface{}) error {
  // store assumes the locks are already held
  // do something with object
  // do something lock safe with store
})

Wouldn't that allow us to atomically apply changes and talk back to the store?
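
A minimal sketch of what such a Pop could look like, assuming the DeltaFIFO holds its own lock while the caller's callback runs (names are illustrative, not a final API):

// Illustrative only: Pop holds the queue's lock while the callback runs, so
// the caller's store update happens atomically with removing the item from
// the queue.
type PopProcessFunc func(obj interface{}) error

func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
    f.lock.Lock()
    defer f.lock.Unlock()
    for len(f.queue) == 0 {
        f.cond.Wait() // releases f.lock while waiting for an item
    }
    id := f.queue[0]
    f.queue = f.queue[1:]
    item := f.items[id]
    delete(f.items, id)
    // Because f.lock is still held here, Delete/Resync/Replace cannot consult
    // knownObjects until the caller has finished applying this item.
    return item, process(item)
}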

@wojtek-t
Member

Yeah - I think that passing a function to Pop() should work. And it should be a big change too.

@lavalamp
Member

Back when I last thought about this problem, I was imagining having some mechanism for Informer to get a Locker interface from the store passed in. Then I think you just need a mechanism so the store doesn't double-lock when Informer e.g. calls Add. I agree it's not a trivial change.


@wojtek-t
Member

Sorry - I actually wanted to write that passing a function to Pop should NOT be a big change...

@lavalamp do you think passing a function to Pop() is a bad idea?
Your suggestion seems much more difficult (and would require much more work).

@lavalamp
Member

I guess I don't know that just changing Pop can fix this -- I thought the problem was with putting stuff into the queue, not taking it out. If you can fix it by passing a function into Pop then by all means go for it. I think @hongchaodeng has a repro test case you can use to see if it works.


@hongchaodeng
Contributor Author

hongchaodeng commented Jun 13, 2016

I have tried several alternatives to solve the problem, including all kinds of weird locking mechanisms. It only makes the codebase super complicated and couples the DeltaFIFO queue and the store even more.

There is a fundamental design flaw in the current informer.
[diagram: informer 2]
As shown above, the controller gets events from the DeltaFIFO and processes them to update the store and notify handlers. But there is another place handling events as well: the DeltaFIFO also has logic that (re)lists the store, checks whether items are in the store, etc. This design doesn't make sense if we want serializability of operations on a single store!

The design should be:
[diagram: informer]
The queue should just queue events; no store operations should be done there.
The controller should be the single owner that controls the queue and the store, and serializes all operations there (see the sketch below).

I have tried a prototype that rewrites the informer this way. It's a joyful experience with more maintainable code. (I have also found design problems in the reflector, but that's another issue.)
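
A rough sketch of the proposed shape, where the controller's process loop is the only code path that touches the store (all names here are illustrative):

// The queue only delivers deltas; the controller applies them to the store
// and notifies handlers, so every store operation is serialized in one place.
func (c *controller) processLoop(queue Queue, store Store, h ResourceEventHandler) {
    for {
        queue.Pop(func(obj interface{}) error {
            for _, d := range obj.(Deltas) {
                switch d.Type {
                case Added, Updated, Sync:
                    if old, exists, err := store.Get(d.Object); err == nil && exists {
                        store.Update(d.Object)
                        h.OnUpdate(old, d.Object)
                    } else {
                        store.Add(d.Object)
                        h.OnAdd(d.Object)
                    }
                case Deleted:
                    store.Delete(d.Object)
                    h.OnDelete(d.Object)
                }
            }
            return nil
        })
    }
}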

@wojtek-t
Member

#27341 is supposed to fix this issue

@caesarxuchao
Member

@wojtek-t I think #27341 solves a different problem from the one described by @hongchaodeng. The original problem reported in this issue is that some code in DeltaFIFO is written under the impression that DeltaFIFO.knownObjects reflects the latest watch events, but in fact DeltaFIFO.knownObjects only reflects the objects that have already been processed, i.e. it does not include the deltas that are still in DeltaFIFO.items.

We know this causes at least two issues:

  1. this issue: the check in DeltaFIFO.Delete() should also check DeltaFIFO.items to see whether the object still exists after all the queued deltas are applied (a sketch of this check follows the list).
  2. Fix race between periodic sync and delete event. #26167: in DeltaFIFO.Resync(), we should only sync the objects that still exist after all the deltas in DeltaFIFO.items are applied.
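
For point 1, a minimal sketch of the adjusted check, assuming Delete also consults the pending items map (simplified; not the exact upstream patch):

// Only drop the Deleted delta when the object is absent from both the
// processed store (knownObjects) and the queue of pending deltas (items).
func (f *DeltaFIFO) Delete(obj interface{}) error {
    id, err := f.KeyOf(obj)
    if err != nil {
        return KeyError{obj, err}
    }
    f.lock.Lock()
    defer f.lock.Unlock()
    _, exists, err := f.knownObjects.GetByKey(id)
    _, pending := f.items[id] // deltas queued but not yet processed
    if err == nil && !exists && !pending {
        // Neither the store nor the pending queue knows this object.
        return nil
    }
    return f.queueActionLocked(Deleted, obj)
}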

@hongchaodeng
Contributor Author

hongchaodeng commented Jun 14, 2016

Here is the interaction flow between the queue and store:
[diagram: informer store]

The concerns @caesarxuchao raises are valid. Basically, in Relist and Resync there are still interactions with the store (mostly reads).

@timothysc
Member

Having dredged through that code during our resync adventure, I'd give a +1 to @hongchaodeng's idea. But I'm not in favor of making such a change until 1.4...

@wojtek-t
Member

@caesarxuchao - thanks for the comments; those are valid points.
However, those issues are easy to fix - they can be fixed in the DeltaFIFO itself. I can send a second PR tomorrow that will fix both.

k8s-github-robot pushed a commit that referenced this issue Jun 14, 2016
Automatic merge from submit-queue

Fix race in informer

Fix #27004

@smarterclayton @hongchaodeng @caesarxuchao
@wojtek-t wojtek-t reopened this Jun 14, 2016
@caesarxuchao
Member

Thanks @wojtek-t!

@hongchaodeng
Contributor Author

@timothysc
We have been digging into the controller and informer to do profiling. This, along with other optimizations, is also part of our v1.4 goals.

@wojtek-t
Member

@caesarxuchao - in fact, those two issues you described were bugs, not even races.
Sending a PR that fixes them now.

@timothysc
Member

/cc @sdminonne
