Fix race condition in setting node statusUpdateNeeded flag #32807

Merged (1 commit) on Sep 23, 2016

Contributor

@jingxu97 jingxu97 commented Sep 15, 2016

This PR fixes a race condition in setting the node statusUpdateNeeded flag
in the master's attachdetach controller. The flag indicates whether a
node's status has already been updated by the node_status_updater. When
the updater finishes updating a node's status, the flag is set to false.
When the node's status changes, for example when a volume is detached or
a new volume is attached to the node, the flag is set to true so that the
updater updates the status again. The previous workflow has the following
race condition:

  1. The updater gets the list of currently attached volumes for the node
    that needs to be updated.
  2. A new volume A is attached to the same node right after step 1, which
    sets the flag to TRUE.
  3. The updater writes the node's attached volume list (which does not
    include volume A) and then sets the flag to FALSE.

The result is that volume A is never added to the attached volume list,
so from the node's side this volume is never attached.

In this PR, the flag is set to FALSE at the moment the updater gets the
attached volume list (both happen in one atomic operation). In the
example above, step 2 then sets the flag back to TRUE, and in step 3 the
updater does not touch the flag if the update succeeds. The flag
therefore stays TRUE, and the node status is updated again in the next
round.
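
The ordering described above can be illustrated with a minimal Go sketch. It is not the controller's code: `stateOfWorld`, `nodeState`, `AttachVolume`, `VolumesToReport`, and `NeedsUpdate` are hypothetical names chosen for this illustration, and a single mutex stands in for whatever locking the real cache uses. The point it tries to show is that taking the snapshot and clearing the flag under the same lock lets a later attach set the flag again without being overwritten.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeState is a hypothetical, simplified stand-in for the per-node data
// kept by the controller's cache.
type nodeState struct {
	attachedVolumes    []string
	statusUpdateNeeded bool
}

// stateOfWorld is a hypothetical cache guarded by one mutex.
type stateOfWorld struct {
	mu    sync.Mutex
	nodes map[string]*nodeState
}

// AttachVolume records a newly attached volume and marks the node as
// needing a status update.
func (s *stateOfWorld) AttachVolume(node, volume string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	n, ok := s.nodes[node]
	if !ok {
		n = &nodeState{}
		s.nodes[node] = n
	}
	n.attachedVolumes = append(n.attachedVolumes, volume)
	n.statusUpdateNeeded = true
}

// VolumesToReport copies the volume list of every node whose flag is set
// and clears the flag while still holding the lock. An attach that lands
// after this call sets the flag again, so it is picked up next round.
func (s *stateOfWorld) VolumesToReport() map[string][]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string][]string)
	for name, n := range s.nodes {
		if n.statusUpdateNeeded {
			out[name] = append([]string(nil), n.attachedVolumes...)
			n.statusUpdateNeeded = false
		}
	}
	return out
}

// NeedsUpdate reports the current flag value; unknown nodes report false.
func (s *stateOfWorld) NeedsUpdate(node string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	n, ok := s.nodes[node]
	return ok && n.statusUpdateNeeded
}

func main() {
	s := &stateOfWorld{nodes: map[string]*nodeState{}}
	s.AttachVolume("node-1", "vol-0")

	snapshot := s.VolumesToReport()   // step 1: snapshot and clear the flag
	s.AttachVolume("node-1", "vol-A") // step 2: an attach races in
	// step 3: the updater writes the snapshot and leaves the flag alone.
	fmt.Println(snapshot["node-1"], s.NeedsUpdate("node-1")) // prints "[vol-0] true"

	// The next round sees the flag and reports both volumes.
	fmt.Println(s.VolumesToReport()["node-1"]) // prints "[vol-0 vol-A]"
}
```

In this sketch the updater never writes FALSE after the snapshot has been taken, so the write in step 3 cannot hide the attach from step 2.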

@saad-ali
Member

Removing this from 1.4 milestone per offline discussion. We will get it merged, give it time to bake, and merge it into a 1.4.x release.

@saad-ali saad-ali removed this from the v1.4 milestone Sep 15, 2016
@jingxu97 jingxu97 changed the title Fix race conditino in setting node statusUpdateNeeded flag Fix race condition in setting node statusUpdateNeeded flag Sep 15, 2016
@jingxu97 jingxu97 force-pushed the stateupdateNeeded-9-15 branch 2 times, most recently from dfeb259 to 25a0678 Compare September 16, 2016 00:07
@k8s-bot

k8s-bot commented Sep 16, 2016

GCE e2e build/test passed for commit 25a0678.

ResetNodeStatusUpdateNeeded(nodeName string) error
// node to true indicating the AttachedVolume field of the Node's Status
// object needs to be updated by the node updater again.
ResetNodeStatusUpdateNeeded(nodeName string)
Member

Revert the "If no node with the..." portion of the comment, since it is still applicable.

Contributor Author

I removed the returned error here because I could not see any use for it.

Member

This method can result in a "failed to setNodeStatusUpdateNeeded" error if nodeName does not exist. I like documentation comments to capture 1) what the input is, 2) what the normal output is, and 3) what results in an error. So I would leave the comment as is. That said, I'll leave it up to you to decide.

Contributor Author

Updated the comments.

ResetNodeStatusUpdateNeeded(nodeName string) error
// node to true indicating the AttachedVolume field of the Node's Status
// object needs to be updated by the node updater again.
ResetNodeStatusUpdateNeeded(nodeName string)
Member

Since this method now sets the value of statusUpdateNeeded to true instead of false, change the name to reflect the behavior: SetNodeStatusUpdateNeeded(...)

Contributor Author

done

Member

@saad-ali saad-ali left a comment

A couple minor comments, otherwise LGTM

@jingxu97
Contributor Author

@saad-ali PTAL

@jingxu97 jingxu97 force-pushed the stateupdateNeeded-9-15 branch from 25a0678 to d696d45 Compare September 22, 2016 18:27
"failed to ResetNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",
// should not happen
glog.Errorf(
"failed to setNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",
Member

Add needed bool to error message for clarity.

Contributor Author

done
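
As a rough illustration of the request in this thread, the sketch below builds a message that carries both the node name and the `needed` value. The helper name `missingNodeMessage` and the exact wording are assumptions made for this example; the diff context above uses `glog.Errorf`, while the sketch uses `fmt.Sprintf` from the standard library so that it runs without extra dependencies.

```go
package main

import "fmt"

// missingNodeMessage is a hypothetical helper that formats the message for
// the "node not found" case, including the value the caller tried to set.
func missingNodeMessage(nodeName string, needed bool) string {
	return fmt.Sprintf(
		"failed to set statusUpdateNeeded to needed=%t for nodeName=%q: nodeName does not exist",
		needed, nodeName)
}

func main() {
	fmt.Println(missingNodeMessage("node-1", true))
}
```

With the boolean in the message, a reader of the log can tell whether the failed call was trying to set or clear the flag.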

Member

@saad-ali saad-ali left a comment

A few more comments

// Update the flag statusUpdateNeeded to indicate whether node status is already updated or
// needs to be updated again by the node status updater.
// This is an internal function and caller should acquire and release the lock
func (asw *actualStateOfWorld) setNodeStatusUpdateNeeded(nodeName string, needed bool) {
Member

nit: to avoid confusing this with the SetNodeStatusUpdateNeeded which sets the value to true, maybe rename it to modifyNodeStatusUpdateNeeded?

Contributor Author

done
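
The split discussed in this thread, an exported setter that takes the lock and an unexported helper that expects the lock to be held, might look roughly like the sketch below. The `actualState` type and its `needsUpdate` map are hypothetical simplifications, the helper uses the name suggested in the review (the final name in the merged code is not shown here), and both functions return an error purely to keep the example easy to check, whereas the exported method in the diff context above has no return value.

```go
package main

import (
	"fmt"
	"sync"
)

// actualState is a hypothetical, reduced stand-in for the controller's
// cache: one flag per node, guarded by an embedded RWMutex.
type actualState struct {
	sync.RWMutex
	needsUpdate map[string]bool
}

// SetNodeStatusUpdateNeeded marks the node as needing a status update.
// It acquires the lock itself, so callers must not hold it.
func (a *actualState) SetNodeStatusUpdateNeeded(nodeName string) error {
	a.Lock()
	defer a.Unlock()
	return a.modifyNodeStatusUpdateNeeded(nodeName, true)
}

// modifyNodeStatusUpdateNeeded sets the flag to the given value.
// It is internal: the caller must already hold the write lock.
func (a *actualState) modifyNodeStatusUpdateNeeded(nodeName string, needed bool) error {
	if _, ok := a.needsUpdate[nodeName]; !ok {
		return fmt.Errorf("nodeName=%q does not exist (needed=%t)", nodeName, needed)
	}
	a.needsUpdate[nodeName] = needed
	return nil
}

func main() {
	a := &actualState{needsUpdate: map[string]bool{"node-1": false}}
	fmt.Println(a.SetNodeStatusUpdateNeeded("node-1"), a.needsUpdate["node-1"])
	fmt.Println(a.SetNodeStatusUpdateNeeded("node-2"))
}
```

Giving the two functions distinct names makes it harder to call the lock-free helper by accident from code that does not hold the lock.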

This PR also changes a unit test to match the new workflow.
@jingxu97 jingxu97 force-pushed the stateupdateNeeded-9-15 branch from d696d45 to 14cad20 Compare September 22, 2016 21:02
@jingxu97
Contributor Author

@saad-ali PTAL

Member

@saad-ali saad-ali left a comment

LGTM

@saad-ali saad-ali added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed release-note-label-needed labels Sep 22, 2016
@saad-ali
Member

This PR should be cherry-picked to v1.4.1.

@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Sep 22, 2016
@jingxu97 jingxu97 added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Sep 22, 2016
@jingxu97 jingxu97 added this to the v1.4 milestone Sep 22, 2016
@jingxu97 jingxu97 removed the release-note-none Denotes a PR that doesn't merit a release note. label Sep 22, 2016
@jingxu97 jingxu97 removed this from the v1.4 milestone Sep 22, 2016
@jingxu97 jingxu97 added the release-note-none Denotes a PR that doesn't merit a release note. label Sep 22, 2016
@k8s-ci-robot
Contributor

Jenkins GCE e2e failed for commit 14cad20. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@jingxu97
Contributor Author

@k8s-bot gce e2e test this

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 0a4316f into kubernetes:master Sep 23, 2016
@saad-ali saad-ali added this to the v1.4 milestone Sep 26, 2016
@saad-ali
Member

Adding cherrypick-candidate and v1.4 milestone to have this picked up for v1.4.1

@jessfraz jessfraz added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 4, 2016
k8s-github-robot pushed a commit that referenced this pull request Oct 4, 2016
…07-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of #32807

Cherry pick of #32807 on release-1.4.
@k8s-cherrypick-bot

Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@saad-ali
Member

saad-ali commented Oct 4, 2016

Now that this has been cherry-picked to the 1.4 branch (for 1.4.1), let's also cherry-pick it to the 1.3 branch (for 1.3.9).

@saad-ali saad-ali modified the milestones: v1.3, v1.4 Oct 4, 2016
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ck-of-#32807-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of kubernetes#32807

Cherry pick of kubernetes#32807 on release-1.4.
@k8s-cherrypick-bot

Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
