Fix goroutine leak of wait.poller #70277

kdada · 2018-10-26T08:47:40Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a bug in wait.poller(). wait.poller() returns a function of type WaitFunc. That function starts a goroutine, and the goroutine exits only when after fires or done is closed.

In cache.WaitForCacheSync, after is nil and done is never closed, so the goroutine never stops.

So I added a cancel func to WaitFunc. When wait.WaitFor() returns, it calls the cancel function, which stops the goroutine created by wait.poller().
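
To make the leak concrete, here is a minimal sketch, not the PR's actual code. It assumes a simplified poller with no timeout (so there is no after case, matching the after == nil situation) and a done channel that never closes. The names waitFunc, closedWithin, and demo are hypothetical; the (ch, cancel) return shape follows the WaitFunc change proposed in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// waitFunc is a hypothetical local type that mirrors the proposed
// signature: a tick channel plus a cancel func.
type waitFunc func(done <-chan struct{}) (ch <-chan struct{}, cancel func())

// poller is a simplified stand-in for wait.poller without timeout handling.
func poller(interval time.Duration) waitFunc {
	return func(done <-chan struct{}) (<-chan struct{}, func()) {
		ch := make(chan struct{})
		cancelCh := make(chan struct{})
		go func() {
			defer close(ch) // a closed ch tells the caller the goroutine has exited
			tick := time.NewTicker(interval)
			defer tick.Stop()
			for {
				select {
				case <-tick.C:
					select {
					case ch <- struct{}{}:
					default: // receiver not ready, drop this tick
					}
				case <-done: // never ready when done is nil or never closed
					return
				case <-cancelCh:
					return
				}
			}
		}()
		// In this sketch cancel must be called at most once.
		return ch, func() { close(cancelCh) }
	}
}

// closedWithin drains ticks and reports whether ch is closed within d.
func closedWithin(ch <-chan struct{}, d time.Duration) bool {
	deadline := time.After(d)
	for {
		select {
		case _, open := <-ch:
			if !open {
				return true
			}
		case <-deadline:
			return false
		}
	}
}

// demo starts a poller whose done channel never closes and observes
// whether its goroutine exits before and after cancel is called.
func demo() (exitedBeforeCancel, exitedAfterCancel bool) {
	ch, cancel := poller(time.Millisecond)(nil)
	exitedBeforeCancel = closedWithin(ch, 50*time.Millisecond)
	cancel()
	exitedAfterCancel = closedWithin(ch, time.Second)
	return exitedBeforeCancel, exitedAfterCancel
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // prints "false true"
}
```

Without the cancel call, nothing in this sketch would ever make the goroutine return, which is the situation described for cache.WaitForCacheSync.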

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

kdada · 2018-10-27T05:49:05Z

@liggitt @derekwaynecarr please take a look.

kdada · 2018-10-29T02:42:08Z

@deads2k

idealhack · 2018-10-29T10:16:12Z

/ok-to-test

roycaihw · 2018-10-29T20:08:47Z

/assign @caesarxuchao
cc @cheftako

kdada · 2018-11-01T09:32:48Z

@caesarxuchao @liggitt Please take a look.

idealhack · 2018-11-01T09:44:18Z

cc @kubernetes/sig-api-machinery-pr-reviews

lavalamp · 2018-11-01T16:15:54Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go

@@ -312,7 +312,7 @@ func PollImmediateUntil(interval time.Duration, condition ConditionFunc, stopCh

 // WaitFunc creates a channel that receives an item every time a test
 // should be executed and is closed when the last test should be invoked.
-type WaitFunc func(done <-chan struct{}) <-chan struct{}
+type WaitFunc func(done <-chan struct{}) (ch <-chan struct{}, cancel func())


Update documentation.

I'm honestly astonished that the tests continue to pass everywhere considering this breaks the public interface.

That means nobody uses WaitFor() directly. :)

Let's move it to private then.

lavalamp · 2018-11-01T16:17:28Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go

 		ch := make(chan struct{})

+		mu := sync.Mutex{}
+		canceled := false


sync.Once probably would make for more concise code.
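
For illustration, a minimal sketch of the sync.Once approach, replacing a mutex plus a canceled flag. newCancel and isClosed are hypothetical helpers, not part of the wait package.

```go
package main

import (
	"fmt"
	"sync"
)

// newCancel returns a channel and a cancel func that closes the channel
// at most once, no matter how many times cancel is called.
func newCancel() (<-chan struct{}, func()) {
	cancelCh := make(chan struct{})
	var once sync.Once
	return cancelCh, func() { once.Do(func() { close(cancelCh) }) }
}

// isClosed reports, without blocking, whether ch has been closed.
func isClosed(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	ch, cancel := newCancel()
	fmt.Println(isClosed(ch)) // false
	cancel()
	cancel() // second call is a no-op instead of a double-close panic
	fmt.Println(isClosed(ch)) // true
}
```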

lavalamp · 2018-11-01T16:23:03Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go

@@ -379,6 +393,8 @@ func poller(interval, timeout time.Duration) WaitFunc {
 					case ch <- struct{}{}:
 					default:
 					}
+				case <-cancelCh:
+					return


Which goroutine was being leaked, and by what path? The done channel is already getting closed, and should cause this goroutine to exit.

Can you make a test to demonstrate the leak (and verify it's not leaked any more)?

Sorry, I just now read the PR description. Shouldn't WaitForCacheSync just be fixed to close(done) instead? I don't see how this change helps, given that WaitForCacheSync still isn't calling the cancel function.

I don't think users should need to know how to prevent the wait package from leaking goroutines.

Another way to fix this issue: WaitFor() creates a new channel, passes it to the WaitFunc, and closes it when WaitFor() returns.
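
A rough sketch of that alternative, under stated assumptions: waitFor here takes no caller-supplied done channel (the real WaitFor does), poller is a simplified stand-in without a timeout, and the exited channel exists only so the demo can observe the goroutine returning. All names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errWaitTimeout = errors.New("timed out waiting for the condition")

type waitFunc func(done <-chan struct{}) <-chan struct{}

// poller ticks on the returned channel until done is closed.
// exited is closed when the polling goroutine returns (observation only).
func poller(interval time.Duration, exited chan<- struct{}) waitFunc {
	return func(done <-chan struct{}) <-chan struct{} {
		ch := make(chan struct{})
		go func() {
			defer close(exited)
			defer close(ch)
			tick := time.NewTicker(interval)
			defer tick.Stop()
			for {
				select {
				case <-tick.C:
					select {
					case ch <- struct{}{}:
					default: // receiver not ready, drop this tick
					}
				case <-done:
					return
				}
			}
		}()
		return ch
	}
}

// waitFor owns the channel it hands to the wait func and closes it on
// every return path, so callers cannot forget to stop the poller.
func waitFor(wait waitFunc, fn func() (bool, error)) error {
	stopCh := make(chan struct{})
	defer close(stopCh)
	c := wait(stopCh)
	for {
		_, open := <-c
		ok, err := fn()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if !open {
			// Reached only if the wait func closes its channel on its own.
			return errWaitTimeout
		}
	}
}

// demo waits for a condition that succeeds on its third check, then
// reports whether the poller goroutine exited after waitFor returned.
func demo() (pollerExited bool, err error) {
	exited := make(chan struct{})
	calls := 0
	err = waitFor(poller(time.Millisecond, exited), func() (bool, error) {
		calls++
		return calls >= 3, nil
	})
	select {
	case <-exited:
		pollerExited = true
	case <-time.After(time.Second):
	}
	return pollerExited, err
}

func main() {
	exited, err := demo()
	fmt.Println(exited, err) // prints "true <nil>"
}
```

The design point is that the cleanup lives inside waitFor, so the exported WaitFunc signature does not have to change.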

kdada · 2018-11-05T03:47:01Z

@lavalamp PTAL.

caesarxuchao · 2018-12-08T02:10:09Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go

+				return nil
+			}
+			if !open {
+				break FOR


Return ErrWaitTimeout here; then we don't need the FOR label.

caesarxuchao · 2018-12-08T02:18:12Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go

@@ -312,7 +312,7 @@ func PollImmediateUntil(interval time.Duration, condition ConditionFunc, stopCh

 // WaitFunc creates a channel that receives an item every time a test
 // should be executed and is closed when the last test should be invoked.
-type WaitFunc func(done <-chan struct{}) <-chan struct{}
+type WaitFunc func(done <-chan struct{}) (ch <-chan struct{}, cancel func())


Let's move it to private then.

caesarxuchao

The fix seems reasonable. I can't understand the test easily. I'll take another look next week.

caesarxuchao · 2018-12-12T01:30:16Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait_test.go

+		t.Errorf("expected ErrWaitTimeout from WaitFunc")
+	}
+}
+
 func TestWaitForWithDelay(t *testing.T) {


IIUC, you are trying to verify that if WaitFor's stopCh is closed, then the done channel passed to the waitFunc is also closed. If so, can you rename the function and variables to make that clearer?

How about TestInternalChannelOfWaitFor?

kdada · 2018-12-17T05:58:44Z

@caesarxuchao PTAL

caesarxuchao · 2018-12-18T01:33:21Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait_test.go

+		return false, nil
+	}, stopCh)
+	duration := time.Now().Sub(start)
+	// The WaitFor should returns immediately. So the duration is closed to 0s.


nits:
s/returns/return
s/closed/close
s/. So/, so

caesarxuchao · 2018-12-18T01:36:16Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait_test.go

+	duration := time.Now().Sub(start)
+	// The WaitFor should returns immediately. So the duration is closed to 0s.
+	// This condition ensures that if the WaitFor returns error caused by poller rather
+	// than stopCh, it will trigger an error.


Sorry, I don't follow this comment. Are you trying to explain why you did the check in line 472?

I've split these comments into two parts:

// The WaitFor should return immediately, so the duration is close to 0s.
if duration >= ForeverTestTimeout/2 {
	t.Errorf("expected short timeout duration")
}
// The interval of the poller is ForeverTestTimeout, so the WaitFor should always return ErrWaitTimeout.
if err != ErrWaitTimeout {
	t.Errorf("expected ErrWaitTimeout from WaitFunc")
}

caesarxuchao · 2018-12-18T01:47:21Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait_test.go

+	}
+}
+
+func TestInternalChannelOfWaitFor(t *testing.T) {


How about

// TestWaitForClosesStopCh verifies that after the condition func returns true, WaitFor()
// closes the stop channel it supplies to the WaitFunc.
TestWaitForClosesStopCh ...

caesarxuchao · 2018-12-18T01:48:16Z

Also please squash :)

kdada · 2018-12-18T02:29:07Z

@caesarxuchao PTAL

kdada · 2018-12-18T02:46:12Z

/retest

caesarxuchao · 2018-12-18T18:42:09Z

/lgtm

Thanks, @kdada.

caesarxuchao · 2018-12-18T18:42:21Z

/retest

kdada · 2018-12-25T03:47:10Z

@caesarxuchao this needs an approved label.

caesarxuchao · 2018-12-26T22:00:03Z

/approve

k8s-ci-robot · 2018-12-26T22:02:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, kdada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apimachinery/pkg/OWNERS~~ [caesarxuchao]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cheftako · 2018-12-26T22:51:41Z

staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go

+				return ErrWaitTimeout
+			}
+		case <-done:
+			closeCh()


I believe this is wrong. I do not think this will break you out of the for loop, which I believe is the desired behavior. If we just return here, that will have us exit the for loop and then the deferred close will be called. Otherwise we are relying on waiting for the timeout which seems wrong. We should make sure we are properly testing this case.

I had the same doubt but then decided the code in this PR was proper due to backward compatibility concerns.

The current behavior lets the wait function decide how to react to the done channel; see

kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go, line 362 in dfaf10f:

c := wait(done)

So the behavior in this PR is consistent with the current behavior, while calling return here would be a behavior change.

Also note that the doc comment on WaitFor doesn't say whether WaitFor should stop waiting when the done channel is closed. I think we should explicitly document what role the done channel plays here.

If we decide that WaitFor should stop waiting once done channel is closed, regardless of the wait func, then I suggest that we apply @cheftako's comment in a different PR, because that's a behavior change while this PR is fixing a goroutine leak.

I'd be ok with creating an issue to make sure this does not get dropped. Also we should properly document the parameters when we make that change.

As @caesarxuchao said, it just goes through the loop once more and runs the final test.

This behavior is described in the doc comment:

// WaitFor gets a channel from 'wait()', and then invokes 'fn' once for every value
// placed on the channel and once more when the channel is closed.

once more when the channel is closed

The "channel" here means the channel returned by "wait()", not the "done" channel. The doc doesn't mention the "done" channel at all, which leaves room for uncertainty.

What cheftako suggested (closing done channel should terminate WaitFor) was what I expected at first. Can you open an issue to track @cheftako's request?

Nvm, created #72357.
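
The two behaviors debated in this thread can be compared with a small sketch. It is not the code in this PR: idleWait, the returnOnDone flag, and finalChecks are hypothetical, and returning errWaitTimeout from the early-return branch is an assumption made for the comparison.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errWaitTimeout = errors.New("timed out waiting for the condition")

// idleWait never ticks; it only closes its channel once stop is closed.
func idleWait(stop <-chan struct{}) <-chan struct{} {
	ch := make(chan struct{})
	go func() {
		defer close(ch)
		<-stop
	}()
	return ch
}

// waitFor forwards a closed done channel to the wait func by default.
// With returnOnDone set, it returns as soon as done is closed instead.
func waitFor(fn func() (bool, error), done <-chan struct{}, returnOnDone bool) error {
	stopCh := make(chan struct{})
	var once sync.Once
	closeCh := func() { once.Do(func() { close(stopCh) }) }
	defer closeCh()
	c := idleWait(stopCh)
	for {
		select {
		case _, open := <-c:
			ok, err := fn()
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
			if !open {
				return errWaitTimeout
			}
		case <-done:
			if returnOnDone {
				return errWaitTimeout // assumed error value for this sketch
			}
			closeCh() // forward the signal, then loop so the final check runs
		}
	}
}

// finalChecks calls waitFor with an already-closed done channel and a
// condition that never succeeds, and counts how often the condition ran.
func finalChecks(returnOnDone bool) (calls int, err error) {
	done := make(chan struct{})
	close(done)
	err = waitFor(func() (bool, error) {
		calls++
		return false, nil
	}, done, returnOnDone)
	return calls, err
}

func main() {
	forwardCalls, forwardErr := finalChecks(false)
	fmt.Println(forwardCalls, forwardErr) // 1 timed out waiting for the condition
	returnCalls, returnErr := finalChecks(true)
	fmt.Println(returnCalls, returnErr) // 0 timed out waiting for the condition
}
```

In the forwarding variant the condition still gets its "once more when the channel is closed" check; in the early-return variant it does not, which is why the thread treats it as a behavior change.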

liggitt · 2018-12-26T22:56:47Z

/hold
for comments

kdada · 2018-12-27T03:22:49Z

@cheftako @liggitt Could this PR be merged?

liggitt · 2018-12-27T03:36:52Z

/hold cancel
per
#70277 (comment)

kdada · 2018-12-27T05:43:34Z

/retest

markmandel · 2018-12-27T06:38:41Z

Side question: is this fix going to be backported to older versions of apimachinery?

We've manually patched our 1.11.5 vendored library - but it would be good to have the official fix in place.

Thanks! 👍 🤸‍♂️

k8s-ci-robot added release-note-none kind/bug size/M cncf-cla: yes needs-sig labels Oct 26, 2018

k8s-ci-robot requested review from derekwaynecarr and liggitt October 26, 2018 08:47

k8s-ci-robot added needs-ok-to-test sig/api-machinery and removed needs-sig labels Oct 26, 2018

k8s-ci-robot removed the needs-ok-to-test label Oct 29, 2018

k8s-ci-robot assigned caesarxuchao Oct 29, 2018

lavalamp reviewed Nov 1, 2018

kdada force-pushed the master branch from f1c23f5 to e3d65ac Compare November 5, 2018 03:42

kdada force-pushed the master branch from e3d65ac to c48bba0 Compare November 5, 2018 03:50

caesarxuchao reviewed Dec 8, 2018

caesarxuchao reviewed Dec 12, 2018

aLekSer mentioned this pull request Dec 12, 2018

CPU/MEMORY leak in agones controller container googleforgames/agones#414

Closed

caesarxuchao reviewed Dec 18, 2018

kdada force-pushed the master branch from f7fe58a to ba34e03 Compare December 18, 2018 02:22

stop goroutine of poller when WaitFor returns

2306eb4

kdada force-pushed the master branch from ba34e03 to 2306eb4 Compare December 18, 2018 02:59

liggitt removed their request for review December 18, 2018 16:47

k8s-ci-robot added the lgtm label Dec 18, 2018

k8s-ci-robot added the approved label Dec 26, 2018

cheftako reviewed Dec 26, 2018

k8s-ci-robot added the do-not-merge/hold label Dec 26, 2018

k8s-ci-robot removed the do-not-merge/hold label Dec 27, 2018

k8s-ci-robot merged commit 81a1f12 into kubernetes:master Dec 27, 2018

kdada mentioned this pull request Dec 27, 2018

WaitFor returns immediately when done is closed #72364

Merged

aLekSer mentioned this pull request Feb 27, 2019

Move to go modules googleforgames/agones#625

Closed
