
Controller-manager sees higher mem-usage when load test runs before density #61041

Closed
shyamjvs opened this issue Mar 12, 2018 · 13 comments

Labels: kind/bug, sig/scalability, lifecycle/rotten

Comments

@shyamjvs (Member) commented Mar 12, 2018

I accidentally turned off our load test in PR #60973. But thanks to that, I noticed this pattern in our controller-manager memory usage during the density test:

[Graph: controller-manager memory usage variance across load/density test runs]

You can see the jump after run 11479, when I re-enabled the load test. All subsequent spikes are in runs where the density test was preceded by the load test. We were seeing similar issues in the past, but IIRC that was for kube-proxies. My feeling is that this is related to an endpoints-controller processing backlog - but that needs to be confirmed.

@wojtek-t - Is this something we have already observed in the past? Do we consider it WAI, or should we try to fix it?

@kubernetes/sig-scalability-bugs
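As a side note on how such a pattern can be tracked between runs, here is a minimal Go sketch that scrapes a component's /metrics endpoint and prints its resident memory. The URL is a placeholder (in clusters of that era, kube-controller-manager typically served metrics on insecure port 10252), and process_resident_memory_bytes is the standard Prometheus process-collector gauge - not necessarily the exact metric behind the graph above.

```go
// Minimal sketch: scrape a component's /metrics endpoint and print its
// resident memory. The URL below is a placeholder; adjust it for your setup
// (the controller-manager metrics port is usually only reachable from the
// master node).
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical endpoint; kube-controller-manager served /metrics on
	// insecure port 10252 in clusters of that era.
	const metricsURL = "http://127.0.0.1:10252/metrics"

	resp, err := http.Get(metricsURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// process_resident_memory_bytes comes from the Prometheus Go client's
	// process collector and tracks the component's RSS.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "process_resident_memory_bytes") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```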

@k8s-ci-robot added the sig/scalability and kind/bug labels on Mar 12, 2018
@shyamjvs (Member, Author)

So this is also seen in the apiserver:

[Graph: apiserver memory usage variance across load/density test runs]

@wojtek-t (Member)

It hopefully shouldn't be the endpoint controller. Note that we don't start the second test before all namespaces from the previous one are deleted. That means the endpoint controller wouldn't be able to update endpoints objects, because the namespace no longer exists.
I would hope we'd observe a higher number of errors somewhere if that were the case.

@shyamjvs (Member, Author) commented Mar 12, 2018

I looked into the apiserver logs and can't find any calls that mention 'load' after the first call that mentions 'density'. This could potentially mean the memory usage is coming from watches (though not the ones created by the e2e test, as IIUC those are closed after the load test finishes). So maybe watches from kube-proxies or kubelets?
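For reference, the kind of log check described above could be scripted roughly as follows; the log path is a placeholder, and matching on the bare substrings 'load' and 'density' is just the informal heuristic used here, not an official naming convention.

```go
// Rough sketch of the log check described above: count apiserver log lines
// mentioning "load" vs "density", remember where the first "density" line
// appears, and see whether any load-test traffic shows up after it.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Placeholder path; point this at a kube-apiserver log dump.
	f, err := os.Open("kube-apiserver.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var loadAfterDensity, densityMentions int
	firstDensityLine := -1

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // apiserver lines can be long
	for lineNo := 1; scanner.Scan(); lineNo++ {
		line := scanner.Text()
		if strings.Contains(line, "density") {
			densityMentions++
			if firstDensityLine < 0 {
				firstDensityLine = lineNo
			}
		}
		if firstDensityLine >= 0 && strings.Contains(line, "load") {
			loadAfterDensity++
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("first 'density' mention at line %d; 'load' mentions after it: %d (total 'density' mentions: %d)\n",
		firstDensityLine, loadAfterDensity, densityMentions)
}
```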

@shyamjvs (Member, Author)

I have one idea for checking whether it's something around services: let's disable them in our load test for this job and see.

@shyamjvs (Member, Author)

It hopefully shouldn't be the endpoint controller. Note that we don't start the second test before all namespaces from the previous one are deleted. That means the endpoint controller wouldn't be able to update endpoints objects, because the namespace no longer exists.

That's true. But IIUC it's still possible that the endpoints-controller is using memory, e.g. to process watch events for endpoints updates (coming from the load test's deletion phase)?
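If one wanted to gauge that churn, a sketch like the following could watch Endpoints cluster-wide through an informer (the same mechanism the endpoints controller uses) and count update/delete events. The kubeconfig path is a placeholder, and this is only an illustration, not the controller's own instrumentation.

```go
// Sketch: watch Endpoints cluster-wide via an informer and count
// update/delete events, to gauge how much churn the load test's deletion
// phase generates.
package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().Endpoints().Informer()

	// Count endpoints update/delete events as they arrive.
	var updates, deletes int64
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) { atomic.AddInt64(&updates, 1) },
		DeleteFunc: func(obj interface{}) { atomic.AddInt64(&deletes, 1) },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Print the running totals once a minute.
	for range time.Tick(time.Minute) {
		fmt.Printf("endpoints updates=%d deletes=%d\n",
			atomic.LoadInt64(&updates), atomic.LoadInt64(&deletes))
	}
}
```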

@wojtek-t (Member)

I don't think it's watch-related.
Yes - the controller manager may accumulate some memory because it allocated a lot in the past, and without memory pressure not everything may have been cleared yet.

I don't think we should spend too much time on it now.
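For context on why resident memory can stay elevated after the work is done: the Go runtime keeps freed heap around as idle spans and returns it to the OS only gradually (in the Go versions of that era, via a scavenger that ran minutes later). A standalone sketch, not controller-manager code, that makes this visible:

```go
// Standalone sketch: allocate a large amount of memory, drop it, force a GC,
// and observe that the runtime still holds much of the heap as idle spans
// rather than returning it to the OS immediately.
package main

import (
	"fmt"
	"runtime"
)

func printMem(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-10s HeapInuse=%4d MiB  HeapIdle=%4d MiB  HeapReleased=%4d MiB  Sys=%4d MiB\n",
		label, m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
}

func main() {
	printMem("start")

	// Simulate a burst of allocations, like processing a large event backlog.
	buf := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		b := make([]byte, 1<<20) // 1 MiB each, ~1 GiB total
		for j := range b {
			b[j] = 1 // touch every page so the memory is really backed
		}
		buf = append(buf, b)
	}
	printMem("allocated")

	// Drop the references and collect. HeapInuse falls, but HeapIdle grows
	// correspondingly and Sys (memory obtained from the OS) barely moves:
	// the process footprint stays elevated until idle pages are released.
	buf = nil
	runtime.GC()
	printMem("after GC")
}
```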

@shyamjvs (Member, Author)

Makes sense.

I don't think we should spend too much time on it now.

Sure... But this is only taking a very small amount of time while I'm running the bisection for the pod-startup regression in the background :)

@jberkus commented Mar 12, 2018

@shyamjvs, @wojtek-t - does this look more like a problem with the test, or like a real, user-affecting performance problem related to the other performance issues?

@shyamjvs (Member, Author)

This needs more digging, but AFAIU it's not a regression - I think we've been seeing it for a while already.
With respect to users, this may not be much of a problem, but I'd like to hear from @wojtek-t on this.

@shyamjvs (Member, Author)

related to the other performance issues?

By "other performance issues" if you mean the recent ones that I've been hunting down (#60500, #60589), then I don't think so. Not 100% sure - but I believe they're different.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 10, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 10, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
