
Rewrite service controller to apply best controller pattern #25189

Conversation

@mfanjie mfanjie commented May 5, 2016

This PR is a long-term solution for #21625:
We apply the same pattern as the replication controller to the service controller, to avoid potential processing-order problems in the service controller. The changes include (a rough sketch of the pattern follows the list):

  1. introduce an informer controller to watch service changes from kube-apiserver, so that every change to the same service is kept in serviceStore as a single element.
  2. put the name of each service to be processed onto a work queue.
  3. when processing a service, always get the info from serviceStore to ensure it is up to date.
  4. keep the retry mechanism: sleep for a certain interval and add the key back to the queue.
  5. remove the logic that re-read the last service info from kube-apiserver before processing the LB info, as we trust the info from serviceStore.
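
For readers less familiar with the pattern, here is a minimal sketch of the informer plus work-queue flow described in items 1-4. It is written against present-day client-go package names (tools/cache, util/workqueue) rather than the 1.3-era pkg/controller/framework code this PR actually modifies, so every identifier below (serviceController, newServiceController, enqueue, worker) is an illustrative assumption, not the PR's diff.

package servicecontroller

import (
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
)

type serviceController struct {
    store cache.Store                     // serviceStore: the single, always-latest copy of each service
    queue workqueue.RateLimitingInterface // holds namespace/name keys waiting to be processed
}

func newServiceController(client kubernetes.Interface) (*serviceController, cache.Controller) {
    sc := &serviceController{
        queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
    }
    lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "services", metav1.NamespaceAll, fields.Everything())
    // Item 1: an informer watches services and keeps exactly one element per service in the store.
    store, informer := cache.NewInformer(lw, &v1.Service{}, 30*time.Second, cache.ResourceEventHandlerFuncs{
        AddFunc:    sc.enqueue,
        UpdateFunc: func(old, cur interface{}) { sc.enqueue(cur) },
        DeleteFunc: sc.enqueue,
    })
    sc.store = store
    return sc, informer
}

// Item 2: only the service key goes onto the work queue.
func (sc *serviceController) enqueue(obj interface{}) {
    key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
    if err != nil {
        return
    }
    sc.queue.Add(key)
}

// Item 3: the worker always re-reads the latest object from the store when it
// processes a key, so intermediate stale updates are never acted on.
func (sc *serviceController) worker() {
    for {
        key, quit := sc.queue.Get()
        if quit {
            return
        }
        obj, exists, err := sc.store.GetByKey(key.(string))
        if err == nil && exists {
            fmt.Printf("would sync load balancer for %s (type %s)\n", key, obj.(*v1.Service).Spec.Type)
        }
        // Item 4: on a transient failure the key would be re-queued after a delay,
        // e.g. sc.queue.AddRateLimited(key); on success, sc.queue.Forget(key).
        sc.queue.Done(key)
    }
}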

The unit tests pass, and a manual test passed after I hardcoded the cloud provider to FakeCloud; however, I am not able to boot a k8s cluster with any available cloud provider, so an e2e test has not been run.

Submitting this PR first for review and to trigger an e2e test.

k8s-bot commented May 5, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.


@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels May 5, 2016

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch from cff0711 to af5ce26 Compare May 5, 2016 08:07

mfanjie commented May 5, 2016

// concurrentServiceSyncs is the number of services that are
// allowed to sync concurrently. Larger number = more responsive service
// management, but more CPU (and network) load.
ConcurrentServiceSyncs int32 `json:"concurrentServiceSyncs"`
Contributor

You are adding one new field; do you need to run ./hack/update-all.sh?

Author

Ah, right, thanks for pointing that out; I will run it and add the changes later.

Author

addressed

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/old-docs labels May 5, 2016

mfanjie commented May 6, 2016

@davidopp @justinsb @bprashanth @thockin @a-robinson could you take some time to review this change, please? Thanks.


cachedService = sc.cache.getOrCreate(namespacedName.String())
}

err, retryDelay := sc.processDelta(cachedService, deltaService, namespacedName)
Contributor

Can you add a comment on why we really need a delta? On every service update, can we just get the latest service from the store and verify that it has the expected load balancers? That would make this controller level-triggered.

Author

I just kept the old variable names; the logic is not actually processing the delta. The service store holds the latest service info, and when a service is updated we simply add it to the queue and check whether it needs an update in createLoadBalancerIfNeeded(). I will change the variable and function names to avoid confusion, and add a check to the controller like:

UpdateFunc: func(old, cur interface{}) {
    if sc.needsUpdate(old, cur) {
        sc.enqueueService(cur)
    }
},
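
As a rough illustration only (not the implementation this PR ended up with), a needsUpdate filter might compare a handful of load-balancer-relevant fields before enqueueing; v1.Service and reflect.DeepEqual stand in for the 1.3-era api types, and the exact field set checked here is an assumption:

// Hypothetical filter: skip enqueueing when an update cannot possibly affect
// the cloud load balancer. The fields compared here are illustrative only.
func (sc *ServiceController) needsUpdate(oldObj, curObj interface{}) bool {
    oldSvc, okOld := oldObj.(*v1.Service)
    curSvc, okCur := curObj.(*v1.Service)
    if !okOld || !okCur {
        return true // cannot interpret the objects, so process them to be safe
    }
    // Neither the old nor the new spec wants a load balancer: nothing to do.
    if oldSvc.Spec.Type != v1.ServiceTypeLoadBalancer &&
        curSvc.Spec.Type != v1.ServiceTypeLoadBalancer {
        return false
    }
    // A type flip, port change, or external-IP change requires a cloud sync.
    return oldSvc.Spec.Type != curSvc.Spec.Type ||
        !reflect.DeepEqual(oldSvc.Spec.Ports, curSvc.Spec.Ports) ||
        !reflect.DeepEqual(oldSvc.Spec.ExternalIPs, curSvc.Spec.ExternalIPs)
}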

@nikhiljindal
Contributor

@k8s-bot ok to test

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch from d5412f6 to 661b748 Compare May 10, 2016 04:44

mfanjie commented May 10, 2016

@bprashanth addressed your comments.

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch 3 times, most recently from 9c62386 to 87f6aab Compare May 11, 2016 05:10
@bprashanth
Contributor

@mfanjie What is the fix? It's hard to tell with one giant commit. Can you please isolate the change in a separate commit, or comment around the fix so I can focus on that part?


mfanjie commented Aug 4, 2016

@bprashanth I squashed the commits, as there were several minor changes to test.
The change is not big. Originally the Run func of the service controller was func (s *ServiceController) Run(workers int, stopCh <-chan struct{}) (similar to resource_quota_controller.go), but in controllermanager.go this method was not invoked with go (as a goroutine) because it returns an error, so the <-stopCh blocked the thread and the following controllers could not be loaded.
I changed Run to drop the stopCh <-chan struct{} parameter, so the blocking <-stopCh is removed as well, similar to the node controller. (A rough before/after sketch follows.)
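
To make the shape of that change concrete, here is a hedged before/after sketch; the s.init and s.worker helpers and the use of wait.Until from k8s.io/apimachinery/pkg/util/wait are illustrative assumptions, not the literal diff:

// Before: Run took a stop channel and blocked on it; since it returns an
// error, controllermanager.go called it synchronously, so <-stopCh stalled
// the startup path and later controllers never loaded.
//
//    func (s *ServiceController) Run(workers int, stopCh <-chan struct{}) error {
//        ... start workers ...
//        <-stopCh
//        return nil
//    }

// After: Run only starts its workers and returns, like the node controller,
// so the controller manager keeps going and loads the remaining controllers.
func (s *ServiceController) Run(workers int) error {
    if err := s.init(); err != nil { // hypothetical setup helper
        return err
    }
    for i := 0; i < workers; i++ {
        // wait.Until is from k8s.io/apimachinery/pkg/util/wait
        go wait.Until(s.worker, time.Second, wait.NeverStop)
    }
    return nil
}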

In addition, what kind of cloud resource leak did you mean? LB IPs? How can I ensure there is no leak? This is the first time I have used GCE to boot a k8s cluster.
I had the e2e job run overnight, but checking the logs, I see:

Aug  3 12:15:20.783: INFO: Waiting up to 20m0s for service "mutability-test" to have a LoadBalancer
Aug  3 12:35:20.949: INFO: Timeout waiting for service "mutability-test" to have a load balancer

From the code, it seems a known problem exists with LB IP allocation?

// How long to wait for a load balancer to be created/modified.
// TODO: once support ticket 21807001 is resolved, reduce this timeout back to something reasonable
loadBalancerCreateTimeout = 20 * time.Minute

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch from 42cd970 to 5fa6404 Compare August 4, 2016 01:29
k8s-bot commented Aug 4, 2016

GCE e2e build/test passed for commit 5fa6404.


mfanjie commented Aug 4, 2016

All settled from my perspective, waiting for LGTM. @bprashanth

@bprashanth
Contributor

From the code, it seems a known problem exists with LB IP allocation?

Yeah, that 20m timeout was put in to mitigate the issue; it shouldn't take that long (i.e. the test shouldn't time out, and if it did, the submit queue would be closed). I haven't seen that timeout in a while. Can you investigate why the test failed? How long did it take to create the LB? How many successful runs did you get?

You can check for leaked resources by going to the GCE console (console.cloud.google.com) and checking your quota (Compute Engine > Quotas). Typically leaked addresses show up in gcloud (gcloud compute addresses list, gcloud compute target-pools list, etc.).


mfanjie commented Aug 4, 2016

@bprashanth The timeout error happened occasionally last night; the e2e test is still ongoing and has run fine so far. I do not see the error now, and here are the successful logs:

STEP: waiting for the TCP service to have a load balancer
Aug  4 00:46:59.688: INFO: Waiting up to 20m0s for service "mutability-test" to have a LoadBalancer
Aug  4 00:49:03.795: INFO: TCP load balancer: 104.197.122.110
STEP: demoting the static IP to ephemeral
I0804 00:49:03.795490   15443 google_compute.go:89] Deleting static IP with name "e2e-external-lb-test-cf7be64d-5a16-11e6-bd46-74dbd180286e" in project "nds1-1340" in region "us-central1"
STEP: waiting for the UDP service to have a load balancer
Aug  4 00:49:12.730: INFO: Waiting up to 20m0s for service "mutability-test" to have a LoadBalancer
Aug  4 00:50:06.837: INFO: UDP load balancer: 146.148.43.188

From the GCE console, it shows 5 target pools and 4 in-use addresses. I did not see a leak for the time being.

@bprashanth
Contributor

I guess a nightly back-to-back run is fine. I assume the 5 target pools and 4 in-use addresses are from the currently running test? Otherwise a leak at that rate would close the submit queue in 10 days, because the address quota is ~50, I think.


mfanjie commented Aug 4, 2016

Yes, the test is still ongoing, and the 5 target pools and 4 addresses are in use by the tests. I did not see any abnormal leak from the long run.

Please consider whether it is OK to merge. I have no further action on this PR, except to check the result again during the next daytime; I do not expect any problem though.


@bprashanth
Contributor

Thanks for running the test, LGTM

@bprashanth bprashanth added release-note Denotes a PR that will be considered when it comes time to generate release notes. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Aug 4, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-bot commented Aug 4, 2016

GCE e2e build/test passed for commit 5fa6404.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit c41c3d4 into kubernetes:master Aug 4, 2016

mfanjie commented Aug 23, 2016

@anguslees Can you explain in more detail? Here is a brief flow from reading the code through; please check it and point out any logic error. Thanks.

initial state
111. s.knownHosts == empty array
623. nodes == all nodes that meet the condition
628. newHosts == string array of node names

first loop
629. stringSlicesEqual(newHosts, s.knownHosts) == false
639. servicesToUpdate = all services
640. numServices = number of all services
641. servicesToUpdate = servicesToRetry
645. s.knownHosts = newHosts

next loop (no new nodes)
629. stringSlicesEqual(newHosts, s.knownHosts) == true
632. retry on servicesToUpdate, which is servicesToRetry

next loop (with new nodes)
same as first loop

@anguslees
Member

On the "next loop(no new nodes)" case, servicesToUpdate is always a new/empty list (from line 622).

So if the service failed to update fully on the first loop (stringSlicesEqual(newHosts, s.knownHosts) == false), then the service will not be retried on later loops (stringSlicesEqual(..) == true). The error will persist until the list of hosts changes again and we trigger the allServices() case again (stringSlicesEqual(..) == false)

Note that the old version did not have this problem because the servicesToUpdate value persists across iterations of the sync loop in (old version) line 687. I believe the fix for the new version is to persist servicesToUpdate in the ServiceController object in some form, much like was done with prevHosts => s.knownHosts as part of this commit.
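
A condensed sketch of the bug and of the suggested fix; the helper names (nodeNames, allServices), the plain service slice, and the assumption that updateLoadBalancerHosts returns the services that still need a retry are all simplifications for illustration, and the actual fix landed in the follow-up commit referenced below:

// Buggy shape: servicesToUpdate is re-created on every pass through the node
// sync loop, so services that failed to update are forgotten unless the host
// list changes again.
func (s *ServiceController) nodeSyncPassBuggy() {
    var servicesToUpdate []*v1.Service // fresh (empty) every pass
    newHosts := s.nodeNames()          // hypothetical helper
    if stringSlicesEqual(newHosts, s.knownHosts) {
        // Hosts unchanged: only previously failed services need a retry,
        // but the local slice above is empty again, so nothing is retried.
        servicesToUpdate = s.updateLoadBalancerHosts(servicesToUpdate, newHosts)
        return
    }
    servicesToUpdate = s.allServices() // hosts changed: reconcile everything
    servicesToUpdate = s.updateLoadBalancerHosts(servicesToUpdate, newHosts)
    s.knownHosts = newHosts
}

// Fixed shape: the retry set lives on the controller, mirroring how prevHosts
// became s.knownHosts in this PR, so failures survive across passes.
func (s *ServiceController) nodeSyncPassFixed() {
    newHosts := s.nodeNames()
    if stringSlicesEqual(newHosts, s.knownHosts) {
        s.servicesToUpdate = s.updateLoadBalancerHosts(s.servicesToUpdate, newHosts)
        return
    }
    s.servicesToUpdate = s.allServices()
    s.servicesToUpdate = s.updateLoadBalancerHosts(s.servicesToUpdate, newHosts)
    s.knownHosts = newHosts
}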


mfanjie commented Aug 23, 2016

@anguslees Ah, yes, servicesToUpdate is re-declared inside the loop, so we need to persist the latest value somewhere outside the loop. I will send a fix.

k8s-github-robot pushed a commit that referenced this pull request Aug 26, 2016
Automatic merge from submit-queue

persist services need to be retried in service controller cache.

fix issue reported by @anguslees
more detail on #25189
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Jun 23, 2020
…source

UPSTREAM: <carry>: apiserver: log source in LateConnections event

Origin-commit: 22a0724bb9b0d19dd955227126bce72378e2d45f
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.