-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race in informer #27341
Fix race in informer #27341
Conversation
@@ -22,12 +22,15 @@ import ( | |||
"k8s.io/kubernetes/pkg/util/sets" | |||
) | |||
|
|||
type PopProcessFunc func(interface{}) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go doc defining what this should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@smarterclayton - comments answered - PTAL |
Pop() interface{} | ||
// Pop blocks until it has something to process. | ||
// It returns the object that was process and the result of processing. | ||
Pop(PopProcessFunc) (interface{}, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the intention. However, I dislike this interface change. I would still keep the Queue is a simple data structure without the ability to process the out-coming data. Is there anything prevent us from doing this outside the queue (holding some locks)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately that's not going to happen. Basically the problem is that we need to sync between DeltaFIFO and "knownKeys" that is passed to it. If we don't change the interface like that, the locking will be spreaded all over the code everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the intention. However, as far as I can see we do not really need to change a lot of places if we try to approach this problem in a different way that solving it outside the queue. I may be wrong. And I am OK with this solution as a temp one.
So here is my thought:
We now try to force the dequeue operation and the processing of the dequeued item protected by a shared queue lock. To be more concrete, for the service controller it is its cache.
First, I do not believe this is not a good abstraction. Queue now needs to worry about the synchronization between internal and external objects. Second, this does not prevent the synchronization problem to happen. The knowobject cache still can be accessed externally asynchronously. This solves today's issue in a not elegant and sustainable way I think.
Thus I dislike it. And I suggest to clean this whole thing up after 1.3 if we want to go this approach right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree that it's not an ideal solution. But fixing it properly can't be done within an hour or - it would require days, which we don't have before 1.3. And this is fixing the problem that we are facing now.
And I completely agree that fixing it properly after 1.3 is a good thing to do.
@smarterclayton - comments applied PTAL |
As a minimal change to fix the p0 for 1.3 I'm ok with this, although I think we can potentially clean up the API to address @xiang90's concerns (i think forcing all the dependent code to change via an interface change is a good forcing function - the most important need right now is to stomp the race). |
LGTM. |
+100 |
The change should fix the problem. Thanks @wojtek-t @smarterclayton @xiang90 for all the reasoning on this. The plan SGTM. |
@hongchaodeng Have you tried to apply this patch to the test that can reproduce the failure? |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
@xiang90 |
@@ -39,6 +44,16 @@ type Queue interface { | |||
HasSynced() bool | |||
} | |||
|
|||
// Helper function for popping from Queue. | |||
func Pop(queue Queue) interface{} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how anyone using this helper function will not have the problem you're trying to prevent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you're only using it for tests. Please update this comment or people are going to get bitten by this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lavalamp - since this is already merged and we probably need to fix some more issues - see #27004 (comment) (but those one are very local changes) I will update the comment in that PR tomorrow.
This sort of defeats the purpose of Pop, which is to keep the lock held as little as possible. I'm not sure how much contention this adds, or if our queues will still be able to keep up. If the scalability tests go red I will probably roll this back... |
GCE e2e build/test passed for commit 5d702a3. |
Automatic merge from submit-queue |
@lavalamp @caesarxuchao - I looked into performance results after merging this PR and there is no effect. So this seems to be completely fine. |
Great, thanks! On Wed, Jun 15, 2016 at 7:32 AM, Wojciech Tyczynski <
|
Fix #27004
@smarterclayton @hongchaodeng @caesarxuchao