Allow mounts to run in parallel for non-attachable volumes #28939
Conversation
Force-pushed from 8abe1a3 to ed26598
@@ -390,7 +401,7 @@ func (oe *operationExecutor) UnmountDevice(
	}

	return oe.pendingOperations.Run(
-		string(deviceToDetach.VolumeName), unmountDeviceFunc)
+		string(deviceToDetach.VolumeName), "" /* operationSubName */, unmountDeviceFunc)
@saad-ali: (trying to learn) So the empty subname "" means that all the UnmountDevice operations will run one after another, right?
Correct.
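For anyone following along, here is a minimal, self-contained sketch (not the actual nestedpendingoperations code; names and behavior are simplified) of why an empty operationSubName serializes operations that share a name: the pending-operation key degenerates to the name alone, so a second UnmountDevice for the same volume is rejected until the first completes.

```go
package main

import (
	"fmt"
	"sync"
)

type pendingOps struct {
	mu      sync.Mutex
	running map[string]bool // key: operationName + "/" + operationSubName
}

// Run starts fn in the background unless an operation with the same
// name/subName pair is already executing. With subName == "", every call for
// the same name collapses to one key and therefore serializes.
func (p *pendingOps) Run(name, subName string, fn func() error) error {
	key := name + "/" + subName
	p.mu.Lock()
	if p.running[key] {
		p.mu.Unlock()
		return fmt.Errorf("operation %q is already pending", key)
	}
	p.running[key] = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			delete(p.running, key)
			p.mu.Unlock()
		}()
		_ = fn()
	}()
	return nil
}

func main() {
	ops := &pendingOps{running: map[string]bool{}}

	// Same name, empty subName: the second call is rejected while the first
	// is in flight, i.e. UnmountDevice calls for one volume run one at a time.
	fmt.Println(ops.Run("vol-1", "", func() error { select {} }))
	fmt.Println(ops.Run("vol-1", "", func() error { return nil }))

	// Distinct subNames (e.g. pod names): both calls are accepted and may run
	// in parallel.
	fmt.Println(ops.Run("vol-2", "pod-a", func() error { return nil }))
	fmt.Println(ops.Run("vol-2", "pod-b", func() error { return nil }))
}
```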
Force-pushed from ed26598 to d5631db
PTAL
// operationName/operationSubName is removed from the list of executing
// operations allowing a new operation to be started with the same name
// without error.
Run(operationName, operationSubName string, operationFunc func() error) error
@saad-ali @smarterclayton The Run interface was a nice and simple interface. We are burdening callers who don't care with the complexity of the "sub" logic. We should add a new routine for the sub logic that allows running on two dimensions as opposed to one. That also means callers don't have to pass a "" second argument that is just confusing.
How about
Run(target string, operationFunc func() error) error
RunSubTargets(target string, targetSub string, operationFunc func() error) error
Minor nit here - we usually call function variables fooFn or fn, and function type aliases FooFunc.
Agree on splitting. Alternatively, you could reorder the function like:
Run(fn func() error, targets ...string)
and have identical targets serialize (you might have to join the targets to do that or create a special map key type, so it's not hugely important to me). Either way, making it clear that target has to be unique.
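Purely as an illustration of the joined-key idea (not anything in this PR; names are made up), a variadic-target Run could derive its serialization key like this, with identical target tuples mapping to the same pending entry:

```go
package main

import (
	"fmt"
	"strings"
)

// operationKey joins the targets with a separator that is unlikely to appear
// in volume or pod names. Identical tuples produce identical keys and would
// therefore serialize; different tuples could run in parallel.
func operationKey(targets ...string) string {
	return strings.Join(targets, "\x00")
}

func main() {
	fmt.Println(operationKey("vol-1") == operationKey("vol-1"))          // true: same key, serialized
	fmt.Println(operationKey("vol-1", "pod-a") == operationKey("vol-1")) // false: independent operations
}
```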
The latency of mounting volumes improves noticeably with this fix. The new result of re-running the test in #28616 (create 30 pods) is shown in the attached screenshot: https://cloud.githubusercontent.com/assets/11655397/16859576/db9205da-49e4-11e6-9497-f27310f235c0.png. The latency of the volume manager handling 30 pods is reduced to ~6s (down from ~24s before).
@saad-ali @thockin @smarterclayton I spent some time reviewing this today and I talked to Saad. I think the high level goals of this CL are spot on. But the modifications to the goRoutineMap seem a bit too intrusive. We are overloading an elegant data structure with new functionality that doesn't seem generic. The system as it stands is simple and generic and serves a good purpose. It looks like we are grafting on complexity that could be better served by simply writing a storage-specific wrapper around a slice. Then the code can be more clear and maintainable. Saad and I ran some experiments to make sure that the slice will be able to perform well with 1,000,000 operations and it looks like it can do it in about 1ms, so pursuing this alternative should yield a simpler and less intrusive solution.
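To make that alternative concrete, here is a rough sketch, under assumed names, of what a storage-specific wrapper around a slice could look like; this is not the code that eventually landed, just an illustration of the linear-scan idea measured above:

```go
package main

import (
	"fmt"
	"sync"
)

// operation describes one in-flight volume operation. An empty podName means
// the operation applies to the volume as a whole (e.g. attach/global mount).
type operation struct {
	volumeName string
	podName    string
}

// operationList is the hypothetical slice-based wrapper: pending operations
// live in a plain slice and duplicate checks are a linear scan, which the
// experiment above suggests is fast enough even for very large counts.
type operationList struct {
	mu  sync.Mutex
	ops []operation
}

// isPending reports whether a conflicting operation is already executing for
// the given volume (and pod, if set).
func (l *operationList) isPending(volumeName, podName string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, op := range l.ops {
		if op.volumeName != volumeName {
			continue
		}
		// A volume-wide operation conflicts with everything for that volume;
		// otherwise only the same pod conflicts.
		if op.podName == "" || podName == "" || op.podName == podName {
			return true
		}
	}
	return false
}

func main() {
	l := &operationList{ops: []operation{{volumeName: "vol-1", podName: "pod-a"}}}
	fmt.Println(l.isPending("vol-1", "pod-a")) // true: duplicate
	fmt.Println(l.isPending("vol-1", "pod-b")) // false: may run in parallel
	fmt.Println(l.isPending("vol-1", ""))      // true: volume-wide op conflicts
}
```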
PTAL
func (grm *goRoutineMap) operationComplete(
	volumeName api.UniqueVolumeName, podName types.UniquePodName, err *error) {
	defer grm.cond.Signal()
My brain twisted a bit at this - just to verify, we're going to take and hold lock, but cond will be signaled inside the lock? So lock dominates cond here as well as in wait?
@smarterclayton @saad-ali So defer order, according to https://blog.golang.org/defer-panic-and-recover, is last-in, first-out. Which means that what we are doing here, as I understand it, is:
- Lock the structure.
- Do all the work required in the data structure.
- Unlock the structure.
- Wake up anything potentially waiting before this routine took the lock, or while the lock was held here. There is a window between the defers where we release the lock and then do the signal, but in that case the Wait routine should perform the correct action, since the length of the slice has already been updated.
Saad, is that correct? I recommend commenting the desired order of the defers and the reasoning behind them.
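A tiny, standalone illustration of that defer ordering (not the kubelet code itself): because defers run last-in, first-out, registering cond.Signal() before Lock()/Unlock() means the mutex is released before waiters are woken, and a woken waiter re-checks state under the lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	done := false

	go func() {
		defer cond.Signal() // registered first, runs last: wake waiters after the unlock below
		mu.Lock()
		defer mu.Unlock() // registered second, runs first: release the lock before signaling
		time.Sleep(100 * time.Millisecond)
		done = true // all state changes happen while the lock is held
	}()

	mu.Lock()
	for !done { // waiters always re-check the condition under the lock
		cond.Wait()
	}
	mu.Unlock()
	fmt.Println("done =", done)
}
```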
My brain twisted a bit at this - just to verify, we're going to take and hold lock, but cond will be signaled inside the lock? So lock dominates cond here as well as in wait?
Cond will be signaled outside the lock here. But the wait is protected by the lock.
So pretty much what @matchstick said, the last operation to occur is signaling any waiting routines.
The wait code (introduced in #25399 and #28153) is all existing code, so I didn't give it much thought in this PR. Looking at it again, I think it's ok as is. Will add comments to explain the order of execution.
No other comments
Allow mount volume operations to run in parallel for non-attachable volume plugins. Allow unmount volume operations to run in parallel for all volume plugins.
Force-pushed from 9063e3c to 88d4950
// initialDurationBeforeRetry is the amount of time after an error occurs
// that GoroutineMap will refuse to allow another operation to start with | ||
// the same target (if exponentialBackOffOnError is enabled). Each | ||
// successive error results in a wait 2x times the previous. |
I think there might be a performance issue with the current exponential backoff design. Consider a scenario where we create an RC consisting of N pods which all reference the same disk. After the desired state of the world populator adds the volume and pod information, the reconciler starts issuing waitForAttach/mount requests periodically. In the first round, the reconciler scans all the pod volumes and issues N requests at almost the same time. With goroutinemap, only one of them can succeed and some others will get ExponentialBackoffError, which causes them to wait the same amount of time (double the current loop duration) to retry. After passing the durationBeforeRetry, the reconciler will issue N or N-1 (if the first finishes) requests again at around the same time, causing them to double the same wait time again. The same steps repeat in the following rounds: each time only one request can be granted, and the waiting time to retry keeps doubling until reaching maxDurationBeforeRetry. From my initial experiments, it looks like the mount operation finishes very fast, which reduces the above problem. But the wait for attach operation seems to take longer. One possible solution is to introduce randomness into the algorithm.
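As one possible shape of that randomness idea (purely a sketch; the constants and the ±25% spread are assumptions, not what the PR or 1.3 ships), a jittered backoff could look like:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	initialDurationBeforeRetry = 500 * time.Millisecond // assumed value, illustrative only
	maxDurationBeforeRetry     = 2 * time.Minute        // assumed value, illustrative only
)

// nextBackoff doubles the previous wait, caps it, and then spreads callers
// out by roughly +/-25% so retries from many pods referencing the same volume
// are staggered instead of landing at the same instant.
func nextBackoff(prev time.Duration) time.Duration {
	next := prev * 2
	if next <= 0 {
		next = initialDurationBeforeRetry
	}
	if next > maxDurationBeforeRetry {
		next = maxDurationBeforeRetry
	}
	jitter := time.Duration(rand.Int63n(int64(next)/2)) - next/4 // uniform in [-25%, +25%)
	return next + jitter
}

func main() {
	d := time.Duration(0)
	for i := 0; i < 5; i++ {
		d = nextBackoff(d)
		fmt.Println(d)
	}
}
```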
With goroutinemap, only one of them can succeed
With this change all of them will be able to run mount in parallel.
some others will have ExponentialBackoffError
ExponentialBackoff does not apply when an operation is rejected because there is an existing operation (only when there is an execution error).
Here I am focusing more on the operations for attachable volumes. As you explained, if ExponentialBackoff only applies to execution errors, it may not cause a serious performance issue. But it may still end up spending more time than necessary. For example, if the volume becomes attachable in 1 min, depending on the timing, it is possible for all N pods to wait for more than 3 mins before starting mount operations. There may also be a higher chance that an operation is rejected because there is an existing operation. What will happen if there is a hanging operation in a goroutine?
For volumes that are not attachable, since they can run in parallel without problems, why do they need to be run through a goroutine?
This feels like a 1.4 conversation if the exponential behaviour has not changed. I am not saying Jing Xu's ideas don't have merit, but we do need the fix in 1.3, and if this is how it worked in 1.2 let's decouple this conversation.
mrubin
if the volume becomes attachable in 1 min, depending on the timing, it is possible for all N pods to wait for more than 3 mins before starting mount operations
That's not correct. If multiple pods on a node are referencing the same attachable volume, the attach and mount device steps are serialized and only happen once for all the pods. Once a device is attached and globally mounted, subsequent pods only need to do the bind mount (which, with this change, is parallelized).
Also it may have higher chances of operation is rejected because there is an existing operation.
That makes little difference in execution speed. The reconciler retries very quickly.
What will happen if there is a hanging operation in a goroutine?
If an attachable volume fails to attach or complete global mount, regardless of whether the operations fail or hang, they will block bind mounting. That is necessary and by design; an attachable volume cannot be bind mounted until these operations succeed.
For volumes that are not attachable, since they can run in parallel without problems, why do they need to be run through a goroutine?
They must be run in a goroutine so that the caller is not blocked. They are run using nestedpendingoperations to prevent multiple operations from being triggered on the same pod/volume.
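A simplified sketch of that keying rule (not the actual operation_executor code; the function and names below are illustrative): for attachable plugins the pending-operation key is the volume alone, so attach/global mount serialize per volume, while for non-attachable plugins the key also includes the pod, so mounts of the same volume into different pods can run in parallel but a duplicate request for the same pod/volume is still deduplicated.

```go
package main

import "fmt"

// mountOperationKey illustrates how the serialization key could be chosen:
// attachable volumes serialize per volume; non-attachable volumes are keyed
// per volume+pod so different pods proceed concurrently.
func mountOperationKey(volumeName, podName string, attachable bool) string {
	if attachable {
		return volumeName // one key per volume: all pods serialize
	}
	return volumeName + "/" + podName // parallel across pods, deduped per pod
}

func main() {
	fmt.Println(mountOperationKey("nfs-vol", "pod-a", false)) // nfs-vol/pod-a
	fmt.Println(mountOperationKey("nfs-vol", "pod-b", false)) // nfs-vol/pod-b (can run concurrently)
	fmt.Println(mountOperationKey("gce-pd", "pod-a", true))   // gce-pd
	fmt.Println(mountOperationKey("gce-pd", "pod-b", true))   // gce-pd (same key: serialized)
}
```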
GCE e2e build/test passed for commit 88d4950.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 88d4950.
Automatic merge from submit-queue
@saad-ali can you backport this to 1.3?
@ncdc @kubernetes/sig-storage The plan is to backport this to 1.3 and release it in 1.3.4 in a week. Before merging into the 1.3 release branch we want to let it soak in master to undergo some exposure in e2e testing and such. Please also feel free to test this; the more coverage the better. Hope that is ok.
Since we'd be soaking against 1.3 we just wanted to avoid soaking a
…ck-of-#28153-kubernetes#28409-kubernetes#28939-upstream-release-1.3 Automated cherry pick of kubernetes#28153 kubernetes#28409 kubernetes#28939 upstream release 1.3
This PR: