Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix goroutine leak of wait.poller #70277

Merged
merged 1 commit into from
Dec 27, 2018
Merged

Conversation

kdada
Copy link
Contributor

@kdada kdada commented Oct 26, 2018

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

This PR fix a bug of wait.poller(). wait.poller() returns a function with type WaitFunc. the function creates a goroutine and the goroutine only quits when after or done closed.

In cache.WaitForCacheSync, after is nil and done is never closed. Then the goroutine never stops.

So I add a cancel func for WaitFunc. If wait.WaitFor() returns, it will call the cancel function and stop the goroutine which created by wait.poller()

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 26, 2018
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 26, 2018
@kdada
Copy link
Contributor Author

kdada commented Oct 27, 2018

@liggitt @derekwaynecarr please take a look.

@kdada
Copy link
Contributor Author

kdada commented Oct 29, 2018

@deads2k

@idealhack
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 29, 2018
@roycaihw
Copy link
Member

/assign @caesarxuchao
cc @cheftako

@kdada
Copy link
Contributor Author

kdada commented Nov 1, 2018

@caesarxuchao @liggitt Please take a look.

@idealhack
Copy link
Member

cc @kubernetes/sig-api-machinery-pr-reviews

@@ -312,7 +312,7 @@ func PollImmediateUntil(interval time.Duration, condition ConditionFunc, stopCh

// WaitFunc creates a channel that receives an item every time a test
// should be executed and is closed when the last test should be invoked.
type WaitFunc func(done <-chan struct{}) <-chan struct{}
type WaitFunc func(done <-chan struct{}) (ch <-chan struct{}, cancel func())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update documentation.

I'm honestly astonished that the tests continue to pass everywhere considering this breaks the public interface.

Copy link
Contributor Author

@kdada kdada Nov 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means nobody uses WaitFor() directly. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move it to private then.

ch := make(chan struct{})

mu := sync.Mutex{}
canceled := false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync.Once probably would make for more concise code.

@@ -379,6 +393,8 @@ func poller(interval, timeout time.Duration) WaitFunc {
case ch <- struct{}{}:
default:
}
case <-cancelCh:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which goroutine was being leaked, and by what path? The done channel is already getting closed, and should cause this goroutine to exit.

Can you make a test to demonstrate the leak (and verify it's not leaked any more)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just now read the PR description. Shouldn't WaitForCacheSync just be fixed to close(done) instead? I don't see how this change helps, given that WaitForCacheSync still isn't calling the cancel function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think users should know how to prevent wait package from leaking goroutines.

Another way to fix this issue: the WaitFor() function creates a new channel and pass it to WaitFunc. Then close it after WaitFor() finishing.

@kdada
Copy link
Contributor Author

kdada commented Nov 5, 2018

@lavalamp PTAL.

return nil
}
if !open {
break FOR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return ErrWaitTimeout here, then we don't need the FOR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -312,7 +312,7 @@ func PollImmediateUntil(interval time.Duration, condition ConditionFunc, stopCh

// WaitFunc creates a channel that receives an item every time a test
// should be executed and is closed when the last test should be invoked.
type WaitFunc func(done <-chan struct{}) <-chan struct{}
type WaitFunc func(done <-chan struct{}) (ch <-chan struct{}, cancel func())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move it to private then.

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix seems reasonable. I can't understand the test easily. I'll take another look next week.

t.Errorf("expected ErrWaitTimeout from WaitFunc")
}
}

func TestWaitForWithDelay(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, you are trying to verify that if the WaitFor's stopCh is closed, then the done channel pass to the waitFunc is also closed. If so, can you rename the function and variable names to be more clear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about TestInternalChannelOfWaitFor?

@kdada
Copy link
Contributor Author

kdada commented Dec 17, 2018

@caesarxuchao PTAL

return false, nil
}, stopCh)
duration := time.Now().Sub(start)
// The WaitFor should returns immediately. So the duration is closed to 0s.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:
s/returns/return
s/closed/close
s/. So/, so

duration := time.Now().Sub(start)
// The WaitFor should returns immediately. So the duration is closed to 0s.
// This condition ensures that if the WaitFor returns error caused by poller rather
// than stopCh, it will trigger an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't follow this comment. Are you trying to explain why you did the check in line 472?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've splitted these comments into two paragraphs:

 	// The WaitFor should return immediately, so the duration is close to 0s.
 	if duration >= ForeverTestTimeout/2 {
 		t.Errorf("expected short timeout duration")
 	}
 	// The interval of the poller is ForeverTestTimeout, so the WaitFor should always return ErrWaitTimeout.
 	if err != ErrWaitTimeout {
 		t.Errorf("expected ErrWaitTimeout from WaitFunc")
 	}

}
}

func TestInternalChannelOfWaitFor(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

// TestWaitForClosesStopCh verifies that after the condition func returns true, WaitFor() closes the stop channel it supplies to the WaitFunc.
TestWaitForClosesStopCh
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@caesarxuchao
Copy link
Member

Also please squash :)

@kdada
Copy link
Contributor Author

kdada commented Dec 18, 2018

@caesarxuchao PTAL

@kdada
Copy link
Contributor Author

kdada commented Dec 18, 2018

/retest

@liggitt liggitt removed their request for review December 18, 2018 16:47
@caesarxuchao
Copy link
Member

/lgtm

Thanks, @kdada.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2018
@caesarxuchao
Copy link
Member

/retest

@kdada
Copy link
Contributor Author

kdada commented Dec 25, 2018

@caesarxuchao need an approved label.

@caesarxuchao
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, kdada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 26, 2018
return ErrWaitTimeout
}
case <-done:
closeCh()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is wrong. I do not think this will break you out of the for loop, which I believe is the desired behavior. If we just return here, that will have us exit the for loop and then the deferred close will be called. Otherwise we are relying on waiting for the timeout which seems wrong. We should make sure we are properly testing this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same doubt but then decided the code in this PR was proper due to backward compatibility concerns.

The current behavior of the function is letting the wait function to decide how to react to the done channel, see

. So the behavior in this PR is consistent with the current behavior, while calling return here would be a behavior change.

Also note that the comment of the WaitFor function didn't say if the WaitFor should stop waiting when the done channel is closed. I think we should explicitly point out what role the done channels plays here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide that WaitFor should stop waiting once done channel is closed, regardless of the wait func, then I suggest that we apply @cheftako's comment in a different PR, because that's a behavior change while this PR is fixing a goroutine leak.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be ok with creating an issue to make sure this does not get dropped. Also we should properly document the parameters when we make that change.

Copy link
Contributor Author

@kdada kdada Dec 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @caesarxuchao said, It just goes to another loop and runs the final test.

This behavior is described in the documents.

// WaitFor gets a channel from 'wait()'', and then invokes 'fn' once for every value
// placed on the channel and once more when the channel is closed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once more when the channel is closed

The "channel" here means the channel returned by "wait()", not the "done" channel. The doc doesn't mention the 'done' channel at all, which leaves space for uncertainties..

What cheftako suggested (closing done channel should terminate WaitFor) was what I expected at first. Can you open an issue to track @cheftako's request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, created #72357.

@liggitt
Copy link
Member

liggitt commented Dec 26, 2018

/hold
for comments

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 26, 2018
@kdada
Copy link
Contributor Author

kdada commented Dec 27, 2018

@cheftako @liggitt Could this PR be merged?

@liggitt
Copy link
Member

liggitt commented Dec 27, 2018

/hold cancel
per
#70277 (comment)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 27, 2018
@kdada
Copy link
Contributor Author

kdada commented Dec 27, 2018

/retest

@k8s-ci-robot k8s-ci-robot merged commit 81a1f12 into kubernetes:master Dec 27, 2018
@markmandel
Copy link
Contributor

Side question: is this fix going to be backported to older version of apimachinery?

We've manually patched our 1.11.5 vendored library - but it would be good to have the official fix in place.

Thanks! 👍 🤸‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants