Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating selectors is expensive in EndpointsController #73527

Closed
wojtek-t opened this issue Jan 30, 2019 · 23 comments · Fixed by #84280
Closed

Creating selectors is expensive in EndpointsController #73527

wojtek-t opened this issue Jan 30, 2019 · 23 comments · Fixed by #84280
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Copy link
Member

When profiling scalability tests at large scale, it appears that the function SelectorFromValidatedSet can consume even 10% of cpu.
In the profiles it's coming purely from the GetPodServices in EndpointsController.

The reason for that is for every single call of GetPodServices:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go#L170
for every single service in that namespace we build the selector from scratch:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/listers/core/v1/service_expansion.go#L49

Even though a single such operation is not very expensive, the amount of that is resulting in significant percentage of overall cpu usage.

Longer-term solution is to do exactly what was done e.g. for ReplicaSet:
https://github.com/kubernetes/api/blob/master/apps/v1/types.go#L723

  • use appropriate type for the selector once we create v2 Service API.

That would mean that store-in-memory services would have the selector already created, so it would eliminate that completely.

I'm wondering if there is something we can do here more short-term.

@kubernetes/sig-network-misc @freehan
@kubernetes/sig-scalability-misc

@wojtek-t wojtek-t added kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jan 30, 2019
@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@thockin thockin added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. and removed triage/unresolved Indicates an issue that can not or will not be resolved. labels Mar 21, 2019
@freehan
Copy link
Contributor

freehan commented Mar 21, 2019

IIUC, It sounds like a simple short-term fix can be adding an in-memory cache of selector per service. That would avoid recreating the selector every time.

This would be a good PR for new contributor. If anyone wants to send an PR, please assign it to me. I am willing to shepherd it.

@freehan freehan added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 21, 2019
@wojtek-t
Copy link
Member Author

IIUC, It sounds like a simple short-term fix can be adding an in-memory cache of selector per service. That would avoid recreating the selector every time.

Yeah probably. Though it obviously complicates the code, so my main goal was to ensure that the new API will not have this problem.

@yashbhutwala
Copy link

@freehan @wojtek-t I would like to work on this! As a first time contributor, I would appreciate some help.

@wojtek-t
Copy link
Member Author

@yashbhutwala - nice to meet you.
From my comment above it may not be clear, but I'm not 100% convinced that it's a good first task.

But to describe in more details:

What @freehan is suggesting that, we may want to cache "service->selector" mapping inside endpoints controller and use that instead of recomputing that every single time.

I'm a bit worried that it may complicate the code, but I think I may be too pesimistic here. So it's probably worth trying.

So what you would need to do is:

  • add this mapping service->selector in endpoints controller
  • add logic that is keeping this mapping up-to-date
  • instead of using GetPodServices(), we will be "listing all services from that namespace" and then filtering that ourservles
    Shouldn't be that much work...

@yashbhutwala
Copy link

yashbhutwala commented Mar 27, 2019

@shinytang6
Copy link
Contributor

Hey @yashbhutwala , what's the status of this issue? Still work on that?

@yashbhutwala
Copy link

@shinytang6 @wojtek-t @freehan I made a PR here: #75965. Review/feedback/suggestions appreciated! :)

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2019
@wojtek-t
Copy link
Member Author

wojtek-t commented Jul 1, 2019

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 1, 2019
@tedyu
Copy link
Contributor

tedyu commented Jul 1, 2019

@wojtek-t
SelectorFromValidatedSet calls sort. However, the goal for creating the selector is for calling selector.Matches where the sorting can be skipped.

		selector := labels.Set(service.Spec.Selector).AsSelectorPreValidated()

How about passing a flag to AsSelectorPreValidated() which skips the sorting step ?

@wojtek-t
Copy link
Member Author

wojtek-t commented Jul 1, 2019

How about passing a flag to AsSelectorPreValidated() which skips the sorting step ?

AsSelectorPreValidated() is already a bit of a hack by iself (I know I'm the one who added it). And I'm bit reluctant of making it even more strange looking...

@tedyu
Copy link
Contributor

tedyu commented Jul 1, 2019

Can SelectorFromValidatedSet skip calling sort ?
I would think that the selectors have equal weight.

@wojtek-t
Copy link
Member Author

wojtek-t commented Jul 1, 2019

Can SelectorFromValidatedSet skip calling sort ?

It can't - it would e.g. break the assumption that you can compare them.
Or will be printed randomly when you print them.
The potential drawbacks are larger than the gain from it (i.e. majority of selectors are fairly small and memory allocations/gc connected to building selector are usually more expensive the this sort).

@k8s-ci-robot k8s-ci-robot removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. labels Oct 13, 2019
@bowei
Copy link
Member

bowei commented Oct 15, 2019

/assign @robscott

@k8s-ci-robot
Copy link
Contributor

@bowei:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Oct 15, 2019
@gongguan
Copy link
Contributor

gongguan commented Oct 17, 2019

  • instead of using GetPodServices(), we will be "listing all services from that namespace" and then filtering that ourservles

@wojtek-t it seems selector-cache solution has to rewrite GetPodServices function, is it feasible to change so many code?

@wojtek-t
Copy link
Member Author

Not necessary - see #73527

@wojtek-t
Copy link
Member Author

I meant #75965

@gongguan
Copy link
Contributor

gongguan commented Oct 18, 2019

I meant #75965

yes, most logic of GetPodServices has been rewrite in #75965

@sarajmunjal
Copy link

@wojtek-t I'd like to work on this as a first issue if you don't mind. Should we iterate on #75965 or try afresh?

@gongguan
Copy link
Contributor

gongguan commented Oct 23, 2019

@wojtek-t does the same problem exist in endpointslice controller?

seems #75965 has something wrong and no one maintained for a long time, I raised a new pr

@gongguan
Copy link
Contributor

gongguan commented Oct 25, 2019

@wojtek-t Excuse me, I have one concern about the service selector: should cache updated while service selector changed?

@wojtek-t
Copy link
Member Author

@wojtek-t Excuse me, I have one concern about the service selector: should cache updated while service selector changed?

yes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Projects
None yet