Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around for PDs stop mounting after a few hours issue #10169

Merged
merged 1 commit into from
Jun 30, 2015

Conversation

saad-ali
Copy link
Member

This is a temporary work around for #7972 and fixes #9994 (and is a follow-up to PR #9929).

The problem was that if we repeatedly create, attach, mount, unmount, detach, and delete a PD from a GCE instance (PD soak test did this), after a few hours the GCE instance gets corrupted and stops attaching PDs correctly.

The real fix lies on the GCE side, in the meantime, they suggest calling udevadm trigger as a work around. @quinton-hoole dug in to udevadm trigger and suggested that we target the trigger more narrowly to the device being mounted/unmounted to prevent any unintended side effects.

In addition to calling udevadm trigger on the device being attached/detached, this adds logic to verify that a disk detach completed (for #9994) and adds robust retry logic for both attaches and detaches.

Tested this locally by running PD tests back to back for hours. Although it is much more stable than before, the tests still flake out every few hours (disk doesn't attach or detach properly). But when this happens it doesn't leave the machine unable to attach any new disks (as was the case before this fix).

CC @quinton-hoole @brendanburns @dchen1107

@k8s-bot
Copy link

k8s-bot commented Jun 22, 2015

GCE e2e build/test passed for commit 5f2a9a5315e07fd82ae623da1d2827f60d2cca33.

@j3ffml j3ffml added this to the v1.0 milestone Jun 22, 2015
@saad-ali
Copy link
Member Author

Per new merge policy, risk assessment for this PR: this PR touches GCE Volume PD attach and detach code. I'd consider this a critical bug fix, that'll improve system stability--without this PR, GCE Volume PD attachment may stop working after some time on some nodes.

// Calls "udevadm trigger --action=change" on the specified drive.
// drivePath must be the the block device path to trigger on, in the format "/dev/sd*", or a symlink to it.
// This is workaround for Issue #7972. Once the underlying issue has been resolved, this may be removed.
func udevadmChangeToDrive(drivePath string) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets try to split this function up, it does too many things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to leave this as one function since it is implementing temporary behavior that I'd like to keep all together (and remove all together when the time comes). Let me know if you feel strongly otherwise.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can put the split functions into a separate source file if you want to make it easy to delete them together at a later date. I actually suspect that this code might be required for some time still, given that the underlying GCE fix does not in fact directly address the fact that udev events are not being processed strictly correctly (which this PR works around).

I will leave the decision to you, but if there are bugs in this code, we're going to need to split it up to debug effectively.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you're definitely going to need to write unit and integration tests for this code, which is going to be easier if you split up the function.

@ghost
Copy link

ghost commented Jun 22, 2015

More to come on this review. I'll get back on it tomorrow - gotta run, sorry.

@k8s-bot
Copy link

k8s-bot commented Jun 23, 2015

GCE e2e build/test passed for commit f4bb795f59a67293b279c9666289b9f298c449be.

@ghost ghost assigned ghost and unassigned dchen1107 Jun 23, 2015
glog.Errorf("Error filepath.Glob(\"/dev/sd*\"): %v\r\n", err)
}
udevadmChangeToNewDrives(sdBefore, sdAfter)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not new code below, but this causes kubelet to go to sleep for 1 second, perhaps 10 times in a row, as far as I understand. Is this in a separate go rountine?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchen1107 @lavalamp I'm being a bit lazy here, but what is the preferred way of doing these sorts of retries inside kubelet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should block only one pod and not the entire queue. I'm not clear on why this is a for { loop rather than a for numTries := 0; numTries < 10; numTries++ { loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @quinton-hoole's intuition is right here. This could cause kubelet's syncPods loop to sleep for an additional 10 seconds, see kubelet.go#L1328, since volume teardown is synchronous (unlike volume creation which is asynchronous, see pkg/kubelet/pod_workers.go#L151). Any opposition to spawning of a new thread (go routine) to execute vol.TearDown() in kubelet.go:cleanupOrphanedVolumes(...)?

@ghost
Copy link

ghost commented Jun 23, 2015

OK, review done. If you can strip all of the stuff I suggested out of udevadmChangeToNewDrives() and udevadmChangeToDrive() then there's no need to split up udevadmChangeToDrive(). Ideally I'd like unit and integration tests for this, but it's going to be tricky to stub out udevadm, and given how simple the code should become if you implement the suggested changes, I think that it's OK to merge without these tests. The e2e tests should catch any bad breakages in future, albeit at more test execution time.

@k8s-bot
Copy link

k8s-bot commented Jun 24, 2015

GCE e2e build/test failed for commit 95d694245aa0a760ae530d4eca5581ddaf1fa7f4.

@saad-ali
Copy link
Member Author

@k8s-bot retest this please

@k8s-bot
Copy link

k8s-bot commented Jun 24, 2015

GCE e2e build/test failed for commit 95d694245aa0a760ae530d4eca5581ddaf1fa7f4.

@saad-ali
Copy link
Member Author

Thanks for the feedback @quinton-hoole
PTAL

if err != nil {
glog.Errorf("Could not tear down volume %q: %v", name, err)
}
go func(volumeName string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove TODO. @dchen1107 is it OK to start a random goroutine here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g., what if it comes through next time and it's still working on tearing this down and it tears it down again.

@ghost
Copy link

ghost commented Jun 25, 2015

@saad-ali Let me know when you're ready for another round of reviews on this one. Assigning to you until then.

@ghost ghost assigned saad-ali and unassigned ghost Jun 25, 2015

// Veify the specified persistent disk device has been succesfully detached, and retries if it fails.
func verifyDetached(pd *gcePersistentDisk, gce cloudprovider.Interface) {
ch := detachChanManager.CreateAndAddChan(pd.pdName, 0 /* bufferSize */)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment to the effect:

bufferSize being 0 is very important, because it means that when senders send, they are blocked until we recieve; this avoids having to have a separate "did you exit yet" check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lavalamp
Copy link
Member

removing ok-to-merge since it looks like saad has unpushed changes ;)

@k8s-bot
Copy link

k8s-bot commented Jun 30, 2015

GCE e2e build/test failed for commit 1215b17ec0374e582a45b484ebd544913cf2ccff.

@lavalamp
Copy link
Member

LGTM

@k8s-bot
Copy link

k8s-bot commented Jun 30, 2015

GCE e2e build/test failed for commit c952ee2.

@saad-ali
Copy link
Member Author

@k8s-bot retest this please

@saad-ali
Copy link
Member Author

@k8s-bot test this please

@k8s-bot
Copy link

k8s-bot commented Jun 30, 2015

GCE e2e build/test failed for commit c952ee2.

@ghost
Copy link

ghost commented Jun 30, 2015

Looks like github is flaky today. This is the second error accessing github that I've seen in the past hour. Retrying...

Unable to query GitHub for status of PullRequestjava.io.IOException: Server returned HTTP response code: 401 for URL: https://api.github.com/repos/GoogleCloudPlatform/kubernetes/pulls/10169
    at sun.reflect.GeneratedConstructorAccessor107.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1676)
    at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1674)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1672)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1245)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at org.kohsuke.github.Requester.parse(Requester.java:451)
    at org.kohsuke.github.Requester._to(Requester.java:224)
    at org.kohsuke.github.Requester.to(Requester.java:198)
    at org.kohsuke.github.GHPullRequest.populate(GHPullRequest.java:196)
    at org.kohsuke.github.GHPullRequest.getMergeable(GHPullRequest.java:169)
    at org.jenkinsci.plugins.ghprb.GhprbBuilds.onStarted(GhprbBuilds.java:72)
    at org.jenkinsci.plugins.ghprb.GhprbBuildListener.onStarted(GhprbBuildListener.java:19)
    at org.jenkinsci.plugins.ghprb.GhprbBuildListener.onStarted(GhprbBuildListener.java:12)
    at hudson.model.listeners.RunListener.fireStarted(RunListener.java:215)
    at hudson.model.Run.execute(Run.java:1740)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:374)
Caused by: java.io.IOException: Server returned HTTP response code: 401 for URL: https://api.github.com/repos/GoogleCloudPlatform/kubernetes/pulls/10169
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1627)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
    at org.kohsuke.github.Requester.parse(Requester.java:447)

@ghost
Copy link

ghost commented Jun 30, 2015

@k8s-bot test this please

@ghost
Copy link

ghost commented Jun 30, 2015

Oh wait. It's a 401 (unauthorized) error. I wonder whether the problem is on our Jenkins side with credentials? Checking...

@ghost
Copy link

ghost commented Jun 30, 2015

@ixdy Tells me that these 401 errors have been happening intermittently for a while now, and are mostly innocuous.

@ghost
Copy link

ghost commented Jun 30, 2015

The new e2e run seems to be succeeding...

@saad-ali
Copy link
Member Author

Thanks for taking a look @quinton-hoole

@k8s-bot
Copy link

k8s-bot commented Jun 30, 2015

GCE e2e build/test passed for commit c952ee2.

@ghost
Copy link

ghost commented Jun 30, 2015

e2e Passed. LGTM, ok-to-merge

@ghost ghost added the ok-to-merge label Jun 30, 2015
zmerlynn added a commit that referenced this pull request Jun 30, 2015
Work around for PDs stop mounting after a few hours issue
@zmerlynn zmerlynn merged commit 7df8d76 into kubernetes:master Jun 30, 2015
@saad-ali saad-ali deleted the fixPDIssue2 branch July 1, 2015 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Persistent Disk mount test sometimes fails to delete disk
8 participants