
Rewrite service controller to apply best controller pattern #25189

Conversation

@mfanjie mfanjie commented May 5, 2016

This PR is a long-term solution for #21625:
We apply the same pattern as the replication controller to the service controller, to avoid potential processing-order problems in the service controller. The changes include (a rough sketch of the pattern follows the list):

  1. introduce an informer controller to watch service changes from kube-apiserver, so that every change to the same service is kept in serviceStore as a single element.
  2. put the name of each service to be processed onto a work queue.
  3. when processing a service, always get the info from serviceStore to ensure it is up to date.
  4. keep the retry mechanism: sleep for a certain interval and add the key back to the queue.
  5. remove the logic that re-read the last service info from kube-apiserver before processing the LB info, as we trust the info from serviceStore.
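
For readers less familiar with the pattern, here is a minimal sketch of the informer plus work-queue flow described in items 1-4. It is written against present-day client-go package names (tools/cache, util/workqueue) rather than the 1.3-era pkg/controller/framework code this PR actually modifies, so every identifier below (serviceController, newServiceController, enqueue, worker) is an illustrative assumption, not the PR's diff.

package servicecontroller

import (
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
)

type serviceController struct {
    store cache.Store                     // serviceStore: the single, always-latest copy of each service
    queue workqueue.RateLimitingInterface // holds namespace/name keys waiting to be processed
}

func newServiceController(client kubernetes.Interface) (*serviceController, cache.Controller) {
    sc := &serviceController{
        queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
    }
    lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "services", metav1.NamespaceAll, fields.Everything())
    // Item 1: an informer watches services and keeps exactly one element per service in the store.
    store, informer := cache.NewInformer(lw, &v1.Service{}, 30*time.Second, cache.ResourceEventHandlerFuncs{
        AddFunc:    sc.enqueue,
        UpdateFunc: func(old, cur interface{}) { sc.enqueue(cur) },
        DeleteFunc: sc.enqueue,
    })
    sc.store = store
    return sc, informer
}

// Item 2: only the service key goes onto the work queue.
func (sc *serviceController) enqueue(obj interface{}) {
    key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
    if err != nil {
        return
    }
    sc.queue.Add(key)
}

// Item 3: the worker always re-reads the latest object from the store when it
// processes a key, so intermediate stale updates are never acted on.
func (sc *serviceController) worker() {
    for {
        key, quit := sc.queue.Get()
        if quit {
            return
        }
        obj, exists, err := sc.store.GetByKey(key.(string))
        if err == nil && exists {
            fmt.Printf("would sync load balancer for %s (type %s)\n", key, obj.(*v1.Service).Spec.Type)
        }
        // Item 4: on a transient failure the key would be re-queued after a delay,
        // e.g. sc.queue.AddRateLimited(key); on success, sc.queue.Forget(key).
        sc.queue.Done(key)
    }
}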

The unit tests pass, and a manual test passed after I hardcoded the cloud provider to FakeCloud; however, I am not able to boot a k8s cluster with any available cloud provider, so an e2e test has not been run.

Submitting this PR first for review and to trigger an e2e test.

k8s-bot commented May 5, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.


@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels May 5, 2016

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch from cff0711 to af5ce26 Compare May 5, 2016 08:07

mfanjie commented May 5, 2016

// concurrentServiceSyncs is the number of services that are
// allowed to sync concurrently. Larger number = more responsive service
// management, but more CPU (and network) load.
ConcurrentServiceSyncs int32 `json:"concurrentServiceSyncs"`
Contributor

You are adding one new field; do you need to run ./hack/update-all.sh?

Author

Ah, right, thanks for pointing that out; I will run it and add the changes later.

Author

addressed

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/old-docs labels May 5, 2016

mfanjie commented May 6, 2016

@davidopp @justinsb @bprashanth @thockin @a-robinson could you take some time to review this change, please? Thanks.


cachedService = sc.cache.getOrCreate(namespacedName.String())
}

err, retryDelay := sc.processDelta(cachedService, deltaService, namespacedName)
Contributor

Can you add a comment on why we really need a delta? On every service update, can we just get the latest service from the store and verify that it has the expected load balancers? That would make this controller level-triggered.

Author

I just kept the old variable names; the logic is not actually processing the delta. The service store holds the latest service info, and when a service is updated we simply add it to the queue and check whether it needs an update in createLoadBalancerIfNeeded(). I will change the variable and function names to avoid confusion, and add a check to the controller like:

UpdateFunc: func(old, cur interface{}) {
    if sc.needsUpdate(old, cur) {
        sc.enqueueService(cur)
    }
},
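
As a rough illustration only (not the implementation this PR ended up with), a needsUpdate filter might compare a handful of load-balancer-relevant fields before enqueueing; v1.Service and reflect.DeepEqual stand in for the 1.3-era api types, and the exact field set checked here is an assumption:

// Hypothetical filter: skip enqueueing when an update cannot possibly affect
// the cloud load balancer. The fields compared here are illustrative only.
func (sc *ServiceController) needsUpdate(oldObj, curObj interface{}) bool {
    oldSvc, okOld := oldObj.(*v1.Service)
    curSvc, okCur := curObj.(*v1.Service)
    if !okOld || !okCur {
        return true // cannot interpret the objects, so process them to be safe
    }
    // Neither the old nor the new spec wants a load balancer: nothing to do.
    if oldSvc.Spec.Type != v1.ServiceTypeLoadBalancer &&
        curSvc.Spec.Type != v1.ServiceTypeLoadBalancer {
        return false
    }
    // A type flip, port change, or external-IP change requires a cloud sync.
    return oldSvc.Spec.Type != curSvc.Spec.Type ||
        !reflect.DeepEqual(oldSvc.Spec.Ports, curSvc.Spec.Ports) ||
        !reflect.DeepEqual(oldSvc.Spec.ExternalIPs, curSvc.Spec.ExternalIPs)
}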

@nikhiljindal
Contributor

@k8s-bot ok to test

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch from d5412f6 to 661b748 Compare May 10, 2016 04:44

mfanjie commented May 10, 2016

@bprashanth addressed your comments.

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch 3 times, most recently from 9c62386 to 87f6aab Compare May 11, 2016 05:10
@bprashanth
Contributor

@mfanjie What is the fix? It's hard to tell with one giant commit. Can you please isolate the change in a separate commit, or comment around the fix so I can focus on that part?


mfanjie commented Aug 4, 2016

@bprashanth I squashed the commits, as there were several minor changes to test.
The change is not big. Originally the Run func of the service controller was func (s *ServiceController) Run(workers int, stopCh <-chan struct{}) (similar to resource_quota_controller.go), but in controllermanager.go this method was not invoked with go (as a goroutine) because it returns an error, so the <-stopCh blocked the thread and the following controllers could not be loaded.
I changed Run to drop the stopCh <-chan struct{} parameter, so the blocking <-stopCh is removed as well, similar to the node controller. (A rough before/after sketch follows.)
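
To make the shape of that change concrete, here is a hedged before/after sketch; the s.init and s.worker helpers and the use of wait.Until from k8s.io/apimachinery/pkg/util/wait are illustrative assumptions, not the literal diff:

// Before: Run took a stop channel and blocked on it; since it returns an
// error, controllermanager.go called it synchronously, so <-stopCh stalled
// the startup path and later controllers never loaded.
//
//    func (s *ServiceController) Run(workers int, stopCh <-chan struct{}) error {
//        ... start workers ...
//        <-stopCh
//        return nil
//    }

// After: Run only starts its workers and returns, like the node controller,
// so the controller manager keeps going and loads the remaining controllers.
func (s *ServiceController) Run(workers int) error {
    if err := s.init(); err != nil { // hypothetical setup helper
        return err
    }
    for i := 0; i < workers; i++ {
        // wait.Until is from k8s.io/apimachinery/pkg/util/wait
        go wait.Until(s.worker, time.Second, wait.NeverStop)
    }
    return nil
}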

In addition, what kind of cloud resource leak did you mean? LB IPs? How can I ensure there is no leak? This is the first time I have used GCE to boot a k8s cluster.
I had the e2e job run overnight, but checking the logs, I see:

Aug  3 12:15:20.783: INFO: Waiting up to 20m0s for service "mutability-test" to have a LoadBalancer
Aug  3 12:35:20.949: INFO: Timeout waiting for service "mutability-test" to have a load balancer

From the code, it seems a known problem exists with LB IP allocation?

// How long to wait for a load balancer to be created/modified.
// TODO: once support ticket 21807001 is resolved, reduce this timeout back to something reasonable
loadBalancerCreateTimeout = 20 * time.Minute

@mfanjie mfanjie force-pushed the kube-service-controller-rewritten branch from 42cd970 to 5fa6404 Compare August 4, 2016 01:29
k8s-bot commented Aug 4, 2016

GCE e2e build/test passed for commit 5fa6404.


mfanjie commented Aug 4, 2016

All settled from my perspective, waiting for LGTM. @bprashanth

@bprashanth
Contributor

From the code, it seems a known problem exists with LB IP allocation?

Yeah, that 20m timeout was put in to mitigate the issue; it shouldn't take that long (i.e. the test shouldn't time out, and if it did, the submit queue would be closed). I haven't seen that timeout in a while. Can you investigate why the test failed? How long did it take to create the LB? How many successful runs did you get?

You can check for leaked resources by going to the GCE console (console.cloud.google.com) and checking your quota (Compute Engine > Quotas). Typically leaked addresses show up in gcloud (gcloud compute addresses list, gcloud compute target-pools list, etc.).


mfanjie commented Aug 4, 2016

@bprashanth The timeout error happened occasionally last night; the e2e test is still ongoing and has run fine so far. I do not see the error now, and here are the successful logs:

STEP: waiting for the TCP service to have a load balancer
Aug  4 00:46:59.688: INFO: Waiting up to 20m0s for service "mutability-test" to have a LoadBalancer
Aug  4 00:49:03.795: INFO: TCP load balancer: 104.197.122.110
STEP: demoting the static IP to ephemeral
I0804 00:49:03.795490   15443 google_compute.go:89] Deleting static IP with name "e2e-external-lb-test-cf7be64d-5a16-11e6-bd46-74dbd180286e" in project "nds1-1340" in region "us-central1"
STEP: waiting for the UDP service to have a load balancer
Aug  4 00:49:12.730: INFO: Waiting up to 20m0s for service "mutability-test" to have a LoadBalancer
Aug  4 00:50:06.837: INFO: UDP load balancer: 146.148.43.188

From the GCE console, it shows 5 target pools and 4 in-use addresses. I did not see a leak for the time being.

@bprashanth
Contributor

I guess a nightly back-to-back run is fine. I assume the 5 target pools and 4 in-use addresses are from the currently running test? Otherwise a leak at that rate would close the submit queue in 10 days, because the address quota is ~50, I think.


mfanjie commented Aug 4, 2016

Yes, the test is still ongoing, and the 5 target pools and 4 addresses are in use by the tests. I did not see any abnormal leak from the long run.

Please consider whether it is OK to merge. I have no further action on this PR, except to check the result again during the next daytime; I do not expect any problem though.


@bprashanth
Contributor

Thanks for running the test, LGTM

@bprashanth bprashanth added release-note Denotes a PR that will be considered when it comes time to generate release notes. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Aug 4, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-bot commented Aug 4, 2016

GCE e2e build/test passed for commit 5fa6404.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit c41c3d4 into kubernetes:master Aug 4, 2016

mfanjie commented Aug 23, 2016

@anguslees Can you explain in more detail? Here is a brief flow from reading the code through; please check it and point out any logic error. Thanks.

initial state
111. s.knownHosts == empty array
623. nodes == all nodes that meet the condition
628. newHosts == string array of node names

first loop
629. stringSlicesEqual(newHosts, s.knownHosts) == false
639. servicesToUpdate = all services
640. numServices = number of all services
641. servicesToUpdate = servicesToRetry
645. s.knownHosts = newHosts

next loop (no new nodes)
629. stringSlicesEqual(newHosts, s.knownHosts) == true
632. retry on servicesToUpdate, which is servicesToRetry

next loop (with new nodes)
same as first loop

@anguslees
Member

On the "next loop(no new nodes)" case, servicesToUpdate is always a new/empty list (from line 622).

So if the service failed to update fully on the first loop (stringSlicesEqual(newHosts, s.knownHosts) == false), then the service will not be retried on later loops (stringSlicesEqual(..) == true). The error will persist until the list of hosts changes again and we trigger the allServices() case again (stringSlicesEqual(..) == false)

Note that the old version did not have this problem because the servicesToUpdate value persists across iterations of the sync loop in (old version) line 687. I believe the fix for the new version is to persist servicesToUpdate in the ServiceController object in some form, much like was done with prevHosts => s.knownHosts as part of this commit.
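
A condensed sketch of the bug and of the suggested fix; the helper names (nodeNames, allServices), the plain service slice, and the assumption that updateLoadBalancerHosts returns the services that still need a retry are all simplifications for illustration, and the actual fix landed in the follow-up commit referenced below:

// Buggy shape: servicesToUpdate is re-created on every pass through the node
// sync loop, so services that failed to update are forgotten unless the host
// list changes again.
func (s *ServiceController) nodeSyncPassBuggy() {
    var servicesToUpdate []*v1.Service // fresh (empty) every pass
    newHosts := s.nodeNames()          // hypothetical helper
    if stringSlicesEqual(newHosts, s.knownHosts) {
        // Hosts unchanged: only previously failed services need a retry,
        // but the local slice above is empty again, so nothing is retried.
        servicesToUpdate = s.updateLoadBalancerHosts(servicesToUpdate, newHosts)
        return
    }
    servicesToUpdate = s.allServices() // hosts changed: reconcile everything
    servicesToUpdate = s.updateLoadBalancerHosts(servicesToUpdate, newHosts)
    s.knownHosts = newHosts
}

// Fixed shape: the retry set lives on the controller, mirroring how prevHosts
// became s.knownHosts in this PR, so failures survive across passes.
func (s *ServiceController) nodeSyncPassFixed() {
    newHosts := s.nodeNames()
    if stringSlicesEqual(newHosts, s.knownHosts) {
        s.servicesToUpdate = s.updateLoadBalancerHosts(s.servicesToUpdate, newHosts)
        return
    }
    s.servicesToUpdate = s.allServices()
    s.servicesToUpdate = s.updateLoadBalancerHosts(s.servicesToUpdate, newHosts)
    s.knownHosts = newHosts
}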


mfanjie commented Aug 23, 2016

@anguslees Ah, yes, servicesToUpdate is re-declared inside the loop, so we need to persist the latest value somewhere outside the loop. I will send a fix.

k8s-github-robot pushed a commit that referenced this pull request Aug 26, 2016
Automatic merge from submit-queue

persist services need to be retried in service controller cache.

fix issue reported by @anguslees
more detail on #25189
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Jun 23, 2020
…source

UPSTREAM: <carry>: apiserver: log source in LateConnections event

Origin-commit: 22a0724bb9b0d19dd955227126bce72378e2d45f
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.