Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race in informer #27341

Merged
merged 1 commit into from
Jun 14, 2016
Merged

Fix race in informer #27341

merged 1 commit into from
Jun 14, 2016

Conversation

wojtek-t
Copy link
Member

@wojtek-t wojtek-t added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. labels Jun 14, 2016
@wojtek-t wojtek-t added this to the v1.3 milestone Jun 14, 2016
@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 14, 2016
@wojtek-t
Copy link
Member Author

@k8s-bot unit test this please, issue: #27344

@@ -22,12 +22,15 @@ import (
"k8s.io/kubernetes/pkg/util/sets"
)

type PopProcessFunc func(interface{}) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doc defining what this should do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wojtek-t
Copy link
Member Author

@smarterclayton - comments answered - PTAL

Pop() interface{}
// Pop blocks until it has something to process.
// It returns the object that was process and the result of processing.
Pop(PopProcessFunc) (interface{}, error)
Copy link
Contributor

@xiang90 xiang90 Jun 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the intention. However, I dislike this interface change. I would still keep the Queue is a simple data structure without the ability to process the out-coming data. Is there anything prevent us from doing this outside the queue (holding some locks)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that's not going to happen. Basically the problem is that we need to sync between DeltaFIFO and "knownKeys" that is passed to it. If we don't change the interface like that, the locking will be spreaded all over the code everywhere.

Copy link
Contributor

@xiang90 xiang90 Jun 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojtek-t

I understand the intention. However, as far as I can see we do not really need to change a lot of places if we try to approach this problem in a different way that solving it outside the queue. I may be wrong. And I am OK with this solution as a temp one.

So here is my thought:
We now try to force the dequeue operation and the processing of the dequeued item protected by a shared queue lock. To be more concrete, for the service controller it is its cache.

First, I do not believe this is not a good abstraction. Queue now needs to worry about the synchronization between internal and external objects. Second, this does not prevent the synchronization problem to happen. The knowobject cache still can be accessed externally asynchronously. This solves today's issue in a not elegant and sustainable way I think.

Thus I dislike it. And I suggest to clean this whole thing up after 1.3 if we want to go this approach right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree that it's not an ideal solution. But fixing it properly can't be done within an hour or - it would require days, which we don't have before 1.3. And this is fixing the problem that we are facing now.
And I completely agree that fixing it properly after 1.3 is a good thing to do.

@wojtek-t
Copy link
Member Author

@smarterclayton - comments applied

PTAL

@smarterclayton
Copy link
Contributor

As a minimal change to fix the p0 for 1.3 I'm ok with this, although I think we can potentially clean up the API to address @xiang90's concerns (i think forcing all the dependent code to change via an interface change is a good forcing function - the most important need right now is to stomp the race).

@smarterclayton smarterclayton added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2016
@smarterclayton
Copy link
Contributor

LGTM.

@wojtek-t
Copy link
Member Author

As a minimal change to fix the p0 for 1.3 I'm ok with this, although I think we can potentially clean up the API to address @xiang90's concerns (i think forcing all the dependent code to change via an interface change is a good forcing function - the most important need right now is to stomp the race).

+100

@hongchaodeng
Copy link
Contributor

The change should fix the problem.

Thanks @wojtek-t @smarterclayton @xiang90 for all the reasoning on this. The plan SGTM.

@xiang90
Copy link
Contributor

xiang90 commented Jun 14, 2016

@hongchaodeng Have you tried to apply this patch to the test that can reproduce the failure?

@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@hongchaodeng
Copy link
Contributor

@xiang90
The test won't work with this change, because everything is blocked and sleep. But that's expected and a good sign. If needed, I can redesign and write more tests to cover any potential races.

@@ -39,6 +44,16 @@ type Queue interface {
HasSynced() bool
}

// Helper function for popping from Queue.
func Pop(queue Queue) interface{} {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how anyone using this helper function will not have the problem you're trying to prevent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you're only using it for tests. Please update this comment or people are going to get bitten by this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lavalamp - since this is already merged and we probably need to fix some more issues - see #27004 (comment) (but those one are very local changes) I will update the comment in that PR tomorrow.

@lavalamp
Copy link
Member

This sort of defeats the purpose of Pop, which is to keep the lock held as little as possible. I'm not sure how much contention this adds, or if our queues will still be able to keep up. If the scalability tests go red I will probably roll this back...

@k8s-bot
Copy link

k8s-bot commented Jun 14, 2016

GCE e2e build/test passed for commit 5d702a3.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit cc5e159 into kubernetes:master Jun 14, 2016
@wojtek-t
Copy link
Member Author

@lavalamp @caesarxuchao - I looked into performance results after merging this PR and there is no effect. So this seems to be completely fine.

@lavalamp
Copy link
Member

Great, thanks!

On Wed, Jun 15, 2016 at 7:32 AM, Wojciech Tyczynski <
notifications@github.com> wrote:

@lavalamp https://github.com/lavalamp @caesarxuchao
https://github.com/caesarxuchao - I looked into performance results
after merging this PR and there is no effect. So this seems to be
completely fine.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#27341 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAnglgXxLeEUxD7fF35eBkSGE_fiuQJNks5qMA0CgaJpZM4I1Dzl
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants