Work around for PDs stop mounting after a few hours issue #10169

saad-ali · 2015-06-22T02:50:20Z

This is a temporary work around for #7972 and fixes #9994 (and is a follow-up to PR #9929).

The problem was that if we repeatedly create, attach, mount, unmount, detach, and delete a PD from a GCE instance (PD soak test did this), after a few hours the GCE instance gets corrupted and stops attaching PDs correctly.

The real fix lies on the GCE side. In the meantime, they suggest calling udevadm trigger as a workaround. @quinton-hoole dug into udevadm trigger and suggested that we target the trigger more narrowly, at the device being mounted/unmounted, to prevent any unintended side effects.

In addition to calling udevadm trigger on the device being attached/detached, this adds logic to verify that a disk detach completed (for #9994) and adds robust retry logic for both attaches and detaches.
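
As a rough illustration of the narrowly targeted trigger described above, here is a minimal Go sketch. It is not the code from this PR: the helper names `udevadmTriggerArgs` and `triggerChange` are hypothetical, and narrowing the trigger with `--property-match=DEVNAME=...` is an assumption about how the targeting could be done. Only `--action=change` and the "/dev/sd*" path format come from the diff in this thread.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
)

// udevadmTriggerArgs builds the arguments for a "change" trigger that is
// limited to a single block device. Matching on the DEVNAME property is an
// assumption of this sketch, not something confirmed by the PR text.
func udevadmTriggerArgs(devicePath string) []string {
	return []string{
		"trigger",
		"--action=change",
		"--property-match=DEVNAME=" + devicePath,
	}
}

// triggerChange resolves drivePath (which may be a symlink, for example under
// /dev/disk/by-id) and runs udevadm against the resolved device only.
func triggerChange(drivePath string) error {
	resolved, err := filepath.EvalSymlinks(drivePath)
	if err != nil {
		return fmt.Errorf("resolving %q: %v", drivePath, err)
	}
	if !strings.HasPrefix(resolved, "/dev/sd") {
		return fmt.Errorf("%q resolves to %q, which is not a /dev/sd* device", drivePath, resolved)
	}
	out, err := exec.Command("udevadm", udevadmTriggerArgs(resolved)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("udevadm trigger failed: %v (output: %q)", err, out)
	}
	return nil
}

func main() {
	// Only print the command line here; actually running it needs a Linux
	// host with udev and sufficient privileges.
	fmt.Println("udevadm", strings.Join(udevadmTriggerArgs("/dev/sdb"), " "))
}
```

Keeping the argument construction separate from the command execution makes the targeting logic checkable without a real udev daemon.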

Tested this locally by running PD tests back to back for hours. Although it is much more stable than before, the tests still flake out every few hours (disk doesn't attach or detach properly). But when this happens it doesn't leave the machine unable to attach any new disks (as was the case before this fix).

CC @quinton-hoole @brendanburns @dchen1107

k8s-bot · 2015-06-22T03:05:26Z

GCE e2e build/test passed for commit 5f2a9a5315e07fd82ae623da1d2827f60d2cca33.

saad-ali · 2015-06-22T20:04:45Z

Per the new merge policy, risk assessment for this PR: it touches the GCE Volume PD attach and detach code. I'd consider this a critical bug fix that will improve system stability. Without this PR, GCE Volume PD attachment may stop working after some time on some nodes.

ghost · 2015-06-22T23:21:23Z

pkg/volume/gce_pd/gce_util.go

+// Calls "udevadm trigger --action=change" on the specified drive.
+// drivePath must be the block device path to trigger on, in the format "/dev/sd*", or a symlink to it.
+// This is a workaround for Issue #7972. Once the underlying issue has been resolved, this may be removed.
+func udevadmChangeToDrive(drivePath string) {


Let's try to split this function up; it does too many things.

I would prefer to leave this as one function since it is implementing temporary behavior that I'd like to keep all together (and remove all together when the time comes). Let me know if you feel strongly otherwise.

I think that you can put the split functions into a separate source file if you want to make it easy to delete them together at a later date. I actually suspect that this code might be required for some time still, given that the underlying GCE fix does not in fact directly address the fact that udev events are not being processed strictly correctly (which this PR works around).

I will leave the decision to you, but if there are bugs in this code, we're going to need to split it up to debug effectively.

And you're definitely going to need to write unit and integration tests for this code, which is going to be easier if you split up the function.

ghost · 2015-06-22T23:27:25Z

More to come on this review. I'll get back on it tomorrow - gotta run, sorry.

k8s-bot · 2015-06-23T00:11:44Z

GCE e2e build/test passed for commit f4bb795f59a67293b279c9666289b9f298c449be.

ghost · 2015-06-23T20:13:16Z

pkg/volume/gce_pd/gce_util.go

+			glog.Errorf("Error filepath.Glob(\"/dev/sd*\"): %v\r\n", err)
+		}
+		udevadmChangeToNewDrives(sdBefore, sdAfter)
+


Not new code below, but this causes kubelet to go to sleep for 1 second, perhaps 10 times in a row, as far as I understand. Is this in a separate goroutine?

@dchen1107 @lavalamp I'm being a bit lazy here, but what is the preferred way of doing these sorts of retries inside kubelet?

It should block only one pod and not the entire queue. I'm not clear on why this is a for { loop rather than a for numTries := 0; numTries < 10; numTries++ { loop.

Actually @quinton-hoole's intuition is right here. This could cause kubelet's syncPods loop to sleep for an additional 10 seconds, see kubelet.go#L1328, since volume teardown is synchronous (unlike volume creation, which is asynchronous, see pkg/kubelet/pod_workers.go#L151). Any opposition to spawning a new goroutine to execute vol.TearDown() in kubelet.go:cleanupOrphanedVolumes(...)?

ghost · 2015-06-23T21:57:21Z

OK, review done. If you can strip all of the stuff I suggested out of udevadmChangeToNewDrives() and udevadmChangeToDrive() then there's no need to split up udevadmChangeToDrive(). Ideally I'd like unit and integration tests for this, but it's going to be tricky to stub out udevadm, and given how simple the code should become if you implement the suggested changes, I think that it's OK to merge without these tests. The e2e tests should catch any bad breakages in future, albeit at more test execution time.

k8s-bot · 2015-06-24T04:15:55Z

GCE e2e build/test failed for commit 95d694245aa0a760ae530d4eca5581ddaf1fa7f4.

saad-ali · 2015-06-24T04:33:31Z

@k8s-bot retest this please

k8s-bot · 2015-06-24T05:27:49Z

GCE e2e build/test failed for commit 95d694245aa0a760ae530d4eca5581ddaf1fa7f4.

saad-ali · 2015-06-24T05:30:15Z

Thanks for the feedback @quinton-hoole
PTAL

lavalamp · 2015-06-24T18:25:51Z

pkg/kubelet/kubelet.go

-			if err != nil {
-				glog.Errorf("Could not tear down volume %q: %v", name, err)
-			}
+			go func(volumeName string) {


Remove TODO. @dchen1107 is it OK to start a random goroutine here?

E.g., what if it comes through next time and it's still working on tearing this down and it tears it down again.
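
One possible way to address the double-teardown concern is to record which volumes already have a teardown goroutine running and skip them on the next pass. The Go sketch below is only an illustration under that assumption: the `inFlight` type and `tryStart` method are hypothetical names, and this is not necessarily how the PR resolved the question.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight records the names of volumes whose teardown goroutine is still
// running, so that a later sync pass does not start a second one.
type inFlight struct {
	mu    sync.Mutex
	names map[string]bool
}

func newInFlight() *inFlight {
	return &inFlight{names: make(map[string]bool)}
}

// tryStart runs fn in a new goroutine unless a teardown for name is already
// in progress. It reports whether a goroutine was started.
func (f *inFlight) tryStart(name string, fn func()) bool {
	f.mu.Lock()
	if f.names[name] {
		f.mu.Unlock()
		return false
	}
	f.names[name] = true
	f.mu.Unlock()

	go func() {
		// Clear the entry once fn returns so a later pass can retry.
		defer func() {
			f.mu.Lock()
			delete(f.names, name)
			f.mu.Unlock()
		}()
		fn()
	}()
	return true
}

func main() {
	tracker := newInFlight()
	release := make(chan struct{})
	done := make(chan struct{})

	// The first teardown blocks until released, simulating a slow detach.
	first := tracker.tryStart("pd-1", func() { <-release; close(done) })
	// A second pass arrives while the first is still running.
	second := tracker.tryStart("pd-1", func() {})
	fmt.Println(first, second)

	close(release)
	<-done
}
```

The name is registered before the goroutine is launched, so a second caller cannot slip in between the check and the start.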

ghost · 2015-06-25T21:02:03Z

@saad-ali Let me know when you're ready for another round of reviews on this one. Assigning to you until then.

lavalamp · 2015-06-30T19:33:31Z

pkg/volume/gce_pd/gce_util.go

+
+// Verifies that the specified persistent disk device has been successfully detached, and retries if it fails.
+func verifyDetached(pd *gcePersistentDisk, gce cloudprovider.Interface) {
+	ch := detachChanManager.CreateAndAddChan(pd.pdName, 0 /* bufferSize */)


Please add a comment to the effect:

bufferSize being 0 is very important, because it means that when senders send, they are blocked until we receive; this avoids having to have a separate "did you exit yet" check.

lavalamp · 2015-06-30T19:52:23Z

removing ok-to-merge since it looks like saad has unpushed changes ;)

k8s-bot · 2015-06-30T20:25:37Z

GCE e2e build/test failed for commit 1215b17ec0374e582a45b484ebd544913cf2ccff.

lavalamp · 2015-06-30T20:26:09Z

LGTM

k8s-bot · 2015-06-30T20:50:18Z

GCE e2e build/test failed for commit c952ee2.

saad-ali · 2015-06-30T20:52:27Z

@k8s-bot retest this please

saad-ali · 2015-06-30T21:14:03Z

@k8s-bot test this please

k8s-bot · 2015-06-30T21:29:20Z

GCE e2e build/test failed for commit c952ee2.

ghost · 2015-06-30T21:41:37Z

Looks like github is flaky today. This is the second error accessing github that I've seen in the past hour. Retrying...

Unable to query GitHub for status of PullRequest
java.io.IOException: Server returned HTTP response code: 401 for URL: https://api.github.com/repos/GoogleCloudPlatform/kubernetes/pulls/10169
    at sun.reflect.GeneratedConstructorAccessor107.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1676)
    at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1674)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1672)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1245)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at org.kohsuke.github.Requester.parse(Requester.java:451)
    at org.kohsuke.github.Requester._to(Requester.java:224)
    at org.kohsuke.github.Requester.to(Requester.java:198)
    at org.kohsuke.github.GHPullRequest.populate(GHPullRequest.java:196)
    at org.kohsuke.github.GHPullRequest.getMergeable(GHPullRequest.java:169)
    at org.jenkinsci.plugins.ghprb.GhprbBuilds.onStarted(GhprbBuilds.java:72)
    at org.jenkinsci.plugins.ghprb.GhprbBuildListener.onStarted(GhprbBuildListener.java:19)
    at org.jenkinsci.plugins.ghprb.GhprbBuildListener.onStarted(GhprbBuildListener.java:12)
    at hudson.model.listeners.RunListener.fireStarted(RunListener.java:215)
    at hudson.model.Run.execute(Run.java:1740)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:374)
Caused by: java.io.IOException: Server returned HTTP response code: 401 for URL: https://api.github.com/repos/GoogleCloudPlatform/kubernetes/pulls/10169
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1627)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
    at org.kohsuke.github.Requester.parse(Requester.java:447)

ghost · 2015-06-30T21:41:58Z

@k8s-bot test this please

ghost · 2015-06-30T21:43:59Z

Oh wait. It's a 401 (unauthorized) error. I wonder whether the problem is on our Jenkins side with credentials? Checking...

ghost · 2015-06-30T21:58:40Z

@ixdy tells me that these 401 errors have been happening intermittently for a while now, and are mostly innocuous.

ghost · 2015-06-30T21:59:17Z

The new e2e run seems to be succeeding...

saad-ali · 2015-06-30T22:05:38Z

Thanks for taking a look @quinton-hoole

k8s-bot · 2015-06-30T22:06:33Z

GCE e2e build/test passed for commit c952ee2.

ghost · 2015-06-30T22:09:26Z

e2e Passed. LGTM, ok-to-merge

Work around for PDs stop mounting after a few hours issue

googlebot added the cla: yes label Jun 22, 2015

j3ffml added this to the v1.0 milestone Jun 22, 2015

j3ffml assigned dchen1107 Jun 22, 2015

ghost reviewed Jun 22, 2015

saad-ali force-pushed the fixPDIssue2 branch from 5f2a9a5 to f4bb795 Compare June 22, 2015 23:51

ghost assigned ghost and unassigned dchen1107 Jun 23, 2015

ghost reviewed Jun 23, 2015

lavalamp reviewed Jun 24, 2015

saad-ali mentioned this pull request Jun 24, 2015

Persistent Disk mount tests start failing a few hours after a cluster is created #7972

Closed

ghost assigned saad-ali and unassigned ghost Jun 25, 2015

lavalamp reviewed Jun 30, 2015

lavalamp removed the ok-to-merge label Jun 30, 2015

saad-ali force-pushed the fixPDIssue2 branch from 48a7d2b to 1215b17 Compare June 30, 2015 20:24

Work around for PDs stop mounting after a few hours issue

c952ee2

saad-ali force-pushed the fixPDIssue2 branch from 1215b17 to c952ee2 Compare June 30, 2015 20:31

ghost added the ok-to-merge label Jun 30, 2015

zmerlynn added a commit that referenced this pull request Jun 30, 2015

Merge pull request #10169 from saad-ali/fixPDIssue2

7df8d76

Work around for PDs stop mounting after a few hours issue

zmerlynn merged commit 7df8d76 into kubernetes:master Jun 30, 2015

saad-ali deleted the fixPDIssue2 branch July 1, 2015 18:04

saad-ali mentioned this pull request Jul 1, 2015

Enable readonly PD tests for Jenkins GCE E2E run #10633

Merged

saad-ali mentioned this pull request Jul 14, 2015

Kubernetes/GCE corrupted PD volume #11231

Closed

saad-ali unassigned lavalamp Aug 12, 2015
