Race condition in Informer #27004
Comments
I'm able to reproduce this reliably in an integration test.
I think there's another issue out there for this race already...
@lavalamp You mean #26082? I would wait for @hongchaodeng's reproduction.
Yes, thanks for linking.
I have some doubts about #26082's description of the event sequence. See #26167 (comment). Let's keep it separate from this issue.
@saad-ali this may be related to the flake you are working on.
The problem is a lack of synchronization between the store and the queue. If you can make a test case that reproduces an error, that would be great. I didn't manage that the last time I messed with that code.
Reproduced in a unit test. See #27064
@lavalamp
@lavalamp This should be triaged for 1.3.
I read through the informer code more thoroughly today. I figured out there are more problems beyond missing Delete events: Resync can return inconsistent data, Replace (on relist) can return inconsistent data, and so can Delete. These three functions are the ones calling the store functions ListKeys and GetByKey.
@hongchaodeng - what do you mean by "inconsistent data" above? Can you please clarify?
Currently the store is used in DeltaFIFO like this:
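Roughly, the pattern looks like the following (a minimal, illustrative sketch, not the actual DeltaFIFO code; the interface mirrors the ListKeys/GetByKey store methods mentioned above, and the names are made up):

```go
// Illustrative sketch only, not the actual DeltaFIFO code.
// KeyListerGetter mirrors the minimal store interface mentioned above.
type KeyListerGetter interface {
	ListKeys() []string
	GetByKey(key string) (item interface{}, exists bool, err error)
}

// resyncFromStore walks the store key by key. The store has its own lock,
// so its contents can change between ListKeys() and the iteration, and
// between successive GetByKey() calls.
func resyncFromStore(store KeyListerGetter, enqueue func(obj interface{})) error {
	for _, key := range store.ListKeys() {
		obj, exists, err := store.GetByKey(key)
		if err != nil {
			return err
		}
		if !exists {
			continue // the object was removed after ListKeys()
		}
		enqueue(obj) // e.g. queue a Sync delta for this object
	}
	return nil
}
```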
In this code, new events can be processed and the data can change between ListKeys() and the iteration, and between each GetByKey() call. Things can get weird. For example, with the event sequence A=1, B=1, B=2, A=2, I could end up reading B=1 and A=2, but I would definitely expect B=2 from a List(). My use case might get wrong results. Without a strong consistency guarantee, we end up with a lot of workarounds and complicated code... Another problem is the interleaving of Resync with the normal flow in the Reflector's ListAndWatch(), i.e. Replace -> Add/Update/Delete. #26966 is an example of that.
This is what the issue is about.
I don't understand - the whole Replace method is under the lock, as are Add, Update, ... These cannot interrupt each other...
Not between methods of DeltaFIFO.
Sorry - I still don't understand - can you please give an exact example? Like what needs to happen to trigger the race (e.g. on Replace)?
@wojtek-t I think @hongchaodeng is talking about this bug:
Randomly assigning people per thread participation. Please correct assignees as necessary.
I thought about it a bit and, to be honest, I don't think we can fix this with a small, smart change. I think there are basically two options here:
To be honest, I think option 1 is not maintainable, so I would vote for option 2, but that's also far from an ideal solution...
All accessors to the store are locked today, right? So could we allow DeltaFIFO to start a critical section that also holds the lock? That's kind of like option 1, but doesn't DeltaFIFO have a relatively small critical section on the store it needs to borrow?
@smarterclayton - I'm not sure I understand. The only possibility I can see here is that Pop() also returns an "unlock function" which the caller calls once the event returned from Pop has been processed, and which in fact unlocks the DeltaFIFO (but this isn't a nice solution...)
This is a perfect opportunity to use a callback that operates under the store lock. The downside is that we'd have to provide the function with the store without locks (since the locks aren't reentrant):
Wouldn't that allow us to atomically apply changes and talk back to the store?
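As an illustration only, a trimmed-down sketch of what "passing a function to Pop()" could look like (the types, names, and signature here are made up for the sketch, not the real client code):

```go
package cache // hypothetical package name for this sketch

import "sync"

// PopProcessFunc is the callback invoked by Pop while the queue lock is held.
type PopProcessFunc func(item interface{}) error

// fifo is a trimmed-down queue used only to illustrate the locking idea.
type fifo struct {
	lock  sync.Mutex
	cond  sync.Cond
	queue []string               // keys in FIFO order
	items map[string]interface{} // key -> pending item
}

func newFIFO() *fifo {
	f := &fifo{items: map[string]interface{}{}}
	f.cond.L = &f.lock
	return f
}

// Pop removes the oldest item and calls process on it while still holding
// f.lock, so no other queue operation (Add/Update/Delete/Replace/Resync in
// the real code) can interleave with the caller updating its store.
func (f *fifo) Pop(process PopProcessFunc) (interface{}, error) {
	f.lock.Lock()
	defer f.lock.Unlock()
	for {
		for len(f.queue) == 0 {
			f.cond.Wait() // releases f.lock while waiting
		}
		key := f.queue[0]
		f.queue = f.queue[1:]
		item, ok := f.items[key]
		if !ok {
			continue // removed while queued; try the next key
		}
		delete(f.items, key)
		return item, process(item)
	}
}

// The producer side (not shown) would append to f.queue, store into f.items,
// and f.cond.Signal() under f.lock.
```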
Yeah - I think that passing a function to Pop() should work. And it should be a big change too. |
Back when I last thought about this problem, I was imagining having some...
Sorry - I actually wanted to write that passing a function to Pop should NOT be a big change... @lavalamp do you think passing a function to Pop() is a bad idea?
I guess I don't know that just changing Pop can fix this -- I thought the...
#27341 is supposed to fix this issue |
@wojtek-t I think #27341 is solving a different problem from the one described by @hongchaodeng. The original problem reported in this issue is that some code in DeltaFIFO is written under the impression that DeltaFIFO.knownObjects reflects the latest watch events, but in fact DeltaFIFO.knownObjects only reflects the objects that have already been processed, i.e. it does not include the deltas that are still in DeltaFIFO.items. We know this causes at least two issues:
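One of the problems discussed earlier in the thread, the missed Delete events, can be sketched roughly like this (simplified and illustrative, not the actual Replace() code):

```go
// Illustrative sketch, not the actual Replace() code: on a relist, deletions
// are synthesized by diffing the freshly listed keys against the keys the
// store already knows about. But the store only knows about deltas that have
// been processed, not the ones still sitting in the queue.
func syntheticDeletes(knownKeys []string, relisted map[string]bool, queueDelete func(key string)) {
	for _, key := range knownKeys {
		if !relisted[key] {
			queueDelete(key) // the object disappeared between the old and new list
		}
	}
	// An object whose Add delta is still queued is not in knownKeys yet, so
	// if it is also missing from the relist, no Delete is ever queued for it
	// and it lingers in the downstream store.
}
```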
What @caesarxuchao is raising are valid concerns. Basically, in Relist and Resync there are still interactions with the store (mostly gets). Here is the interaction flow between the queue and the store:
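Roughly (a simplified, illustrative sketch rather than the original flow from the comment):

```go
// Simplified, illustrative flow:
//
//   watch / relist / resync  ->  queue deltas   (Replace and Resync also
//                                                read the store here)
//   process loop             ->  pop a delta -> apply it to the store -> handlers
//
// queuedDelta and applyDelta model only the "store side" of that flow; the
// store therefore always lags whatever deltas are still waiting in the queue.
type queuedDelta struct {
	verb string // "add", "update", "delete" or "sync"
	key  string
	obj  interface{}
}

func applyDelta(store map[string]interface{}, d queuedDelta) {
	switch d.verb {
	case "add", "update", "sync":
		store[d.key] = d.obj
	case "delete":
		delete(store, d.key)
	}
	// Event handlers run after the store has been updated.
}
```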
Having dredged through that code during our resync adventure, I'd give a +1 to @hongchaodeng's idea. But I'm not in favor of making such a change until 1.4...
@caesarxuchao - thanks for the comments; those are valid points.
Automatic merge from submit-queue: Fix race in informer. Fixes #27004. @smarterclayton @hongchaodeng @caesarxuchao
Thanks @wojtek-t!
@timothysc
@caesarxuchao - in fact, those two issues that you described were bugs, not even races.
/cc @sdminonne
What's the issue?
The informer could end up in a wrong state.
Example?
I have been stress testing an RC with 100 replicas. I kept deleting pods and letting the RC recover back to 100. The program does informer.store.List() to check that there are 100 pods, but eventually it returned more than 100 (e.g. 131), while listing directly from the apiserver returned 100.
Why?
The informer has a queue and a store (aka indexer) internally.
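A much-simplified, illustrative picture of those two pieces (the real types are the DeltaFIFO and a thread-safe indexer; the names below are made up):

```go
// Illustrative only: List() reads the store, and the store reflects only the
// deltas that have already been popped from the queue and applied, which is
// why it can disagree with a direct list from the apiserver.
type simpleInformer struct {
	pending []interface{}          // deltas from watch/relist, not yet applied
	store   map[string]interface{} // what informer.store.List() reads
}

func (i *simpleInformer) List() []interface{} {
	out := make([]interface{}, 0, len(i.store))
	for _, obj := range i.store {
		out = append(out, obj)
	}
	return out
}
```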