Rewrite service controller to apply best controller pattern #25189
Conversation
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy.
Force-pushed cff0711 to af5ce26.
// concurrentServiceSyncs is the number of services that are
// allowed to sync concurrently. Larger number = more responsive service
// management, but more CPU (and network) load.
ConcurrentServiceSyncs int32 `json:"concurrentServiceSyncs"`
You are adding one new field; you need to run ./hack/update-all.sh.
Ah, right, thanks for pointing that out. I will run it and add the changes later.
addressed
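
(For context on the ConcurrentServiceSyncs field quoted above: a knob like this is usually consumed at controller start-up to decide how many workers drain a shared work queue. The following is a minimal, hypothetical sketch rather than this PR's code; the type and field names are made up and the import paths are present-day client-go ones.)

package sketch

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/util/workqueue"
)

// serviceController stands in for the real controller type.
type serviceController struct {
    queue workqueue.RateLimitingInterface
}

// Run starts concurrentServiceSyncs workers, each draining the shared queue.
// More workers means services are reconciled more concurrently, at the cost
// of more CPU and more calls out to the cloud provider.
func (s *serviceController) Run(concurrentServiceSyncs int, stopCh <-chan struct{}) {
    defer s.queue.ShutDown()
    for i := 0; i < concurrentServiceSyncs; i++ {
        go wait.Until(s.worker, time.Second, stopCh)
    }
    <-stopCh
}

// worker is left as a stub here; a sketch of the full work-queue loop appears
// near the end of this thread.
func (s *serviceController) worker() {}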
@davidopp @justinsb @bprashanth @thockin @a-robinson Can you take some time to review this change, please? Thanks.
cachedService = sc.cache.getOrCreate(namespacedName.String())
}

err, retryDelay := sc.processDelta(cachedService, deltaService, namespacedName)
Can you add a comment on why we really need a delta? On every service update, can we just get the latest service from the store and verify that it has the expected load balancers? That would make this controller level-triggered.
I just kept the old variable names; the logic is not actually processing the delta. The service store holds the latest service info, and when a service is updated we simply add it to the queue and check whether it needs an update in createLoadBalancerIfNeeded(). I will change the variable and function names to avoid confusion, and add a check to the controller like:
UpdateFunc: func(old, cur interface{}) {
if sc.needsUpdate(old, cur) {
sc.enqueueService(cur)
}
},
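
(For reference, a rough sketch of how that level-triggered wiring commonly looks with client-go informer event handlers feeding a work queue. This is not the PR's actual code: the needsUpdate criteria, the field names, and the present-day client-go import paths are all illustrative assumptions.)

package sketch

import (
    "reflect"

    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
)

type serviceController struct {
    queue workqueue.RateLimitingInterface
}

// enqueueService puts the namespace/name key on the queue; the worker later
// reads the latest Service back from the store, so processing stays
// level-triggered instead of acting on a possibly stale event payload.
func (s *serviceController) enqueueService(obj interface{}) {
    key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
    if err != nil {
        return
    }
    s.queue.Add(key)
}

// needsUpdate is an illustrative filter: only enqueue when fields that can
// affect the load balancer have changed.
func needsUpdate(old, cur *v1.Service) bool {
    return old.Spec.Type != cur.Spec.Type ||
        !reflect.DeepEqual(old.Spec.Ports, cur.Spec.Ports) ||
        !reflect.DeepEqual(old.Spec.ExternalIPs, cur.Spec.ExternalIPs)
}

func (s *serviceController) eventHandlers() cache.ResourceEventHandlerFuncs {
    return cache.ResourceEventHandlerFuncs{
        AddFunc: s.enqueueService,
        UpdateFunc: func(old, cur interface{}) {
            if needsUpdate(old.(*v1.Service), cur.(*v1.Service)) {
                s.enqueueService(cur)
            }
        },
        DeleteFunc: s.enqueueService,
    }
}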
@k8s-bot ok to test
Force-pushed d5412f6 to 661b748.
@bprashanth Addressed your comments.
Force-pushed 9c62386 to 87f6aab.
Force-pushed 045bbd1 to 8646dc4.
@mfanjie What is the fix? It's hard to tell with one giant commit. Can you please isolate the change in another commit, or comment around the fix, so I can focus on that part?
@bprashanth I squashed the commits because there are several minor changes to test. In addition, what kind of cloud resource leak did you mean? The LB IP? How can I ensure there is no leak for it? This is the first time I have used GCE to boot a k8s cluster.
From the code, it seems a known problem already exists with LB IP allocation?
Force-pushed 42cd970 to 5fa6404.
GCE e2e build/test passed for commit 5fa6404.
All settled from my perspective, waiting for LGTM. @bprashanth
Yeah, that 20m timeout was put in to mitigate the issue; it shouldn't take that long (i.e. the test shouldn't time out, or if it did the submit queue would be closed). I haven't seen that timeout in a while. Can you investigate why the test failed? How long did it take to create the LB? How many successful runs did you get? You can check for leaked resources by going to the GCE console (console.cloud.google.com) and checking your quota (Compute Engine > Quotas). Typically leaked addresses show up in gcloud (gcloud compute addresses list, target-pools list, etc.).
@bprashanth The timeout error happened occasionally last night; the e2e test is still ongoing and has run fine so far. I do not see the error now, and here are the successful logs.
The GCE console shows 5 target pools and 4 in-use addresses. I have not seen a leak so far.
I guess a nightly back-to-back run is fine. I assume the 5 target pools and 4 in-use addresses are from the currently running test? Otherwise a leak at that rate would close the SQ in 10 days, because the address quota is ~50, I think.
Yes, the test is still ongoing, and the 5 target pools and 4 addresses are in use by the tests. I did not see any abnormal leak during the long run. Please consider whether it is OK to merge. I have no further action on this PR except to check the result again during the next daytime; I do not expect any problems though.
Thanks for running the test, LGTM
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 5fa6404.
Automatic merge from submit-queue
@anguslees Can you explain in more detail? Here is a brief flow from reading the code through; please check it and point out if there is a logic error. Thanks.
On the "next loop(no new nodes)" case, So if the service failed to update fully on the first loop ( Note that the old version did not have this problem because the |
@anguslees Ah, yes, servicesToUpdate is re-declared inside the loop; we need to persist the latest value somewhere outside the loop. Will send a fix.
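
(A sketch of the shape such a fix could take, with hypothetical names rather than the actual patch: keep the pending-service list on the controller so that failed updates are retried on the next node-sync pass even when the node list has not changed.)

package sketch

import "reflect"

type serviceKey string

type controller struct {
    knownHosts       []string
    servicesToUpdate []serviceKey // lives on the controller, not inside the loop body
}

// nodeSyncPass runs once per tick of the node-sync loop.
func (c *controller) nodeSyncPass(currentHosts []string, allLBServices func() []serviceKey) {
    if !reflect.DeepEqual(currentHosts, c.knownHosts) {
        // Node set changed: every load-balancer service needs its hosts refreshed.
        c.servicesToUpdate = allLBServices()
        c.knownHosts = currentHosts
    }
    // Attempt the pending updates and keep only the failures for the next pass.
    c.servicesToUpdate = c.updateLoadBalancerHosts(c.servicesToUpdate, currentHosts)
}

// updateLoadBalancerHosts returns the services whose update failed.
func (c *controller) updateLoadBalancerHosts(services []serviceKey, hosts []string) []serviceKey {
    var failed []serviceKey
    for _, svc := range services {
        if err := updateHostsForService(svc, hosts); err != nil {
            failed = append(failed, svc)
        }
    }
    return failed
}

// updateHostsForService stands in for the cloud-provider call.
func updateHostsForService(svc serviceKey, hosts []string) error { return nil }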
Automatic merge from submit-queue: persist services that need to be retried in the service controller cache. Fixes the issue reported by @anguslees; more detail on #25189.
This PR is a long-term solution for #21625:
We apply the same pattern as the replication controller to the service controller, to avoid potential processing-order messes in the service controller. The changes include:
The unit tests pass, and a manual test passed after I hardcoded the cloud provider as FakeCloud; however, I am not able to boot a k8s cluster with any available cloud provider, so the e2e test has not been done.
Submitting this PR first for review and to trigger an e2e test.
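
(To make the "same pattern as the replication controller" concrete, here is a generic sketch of the work-queue worker side that such a rewrite typically converges on. It is not this PR's actual code: syncService, the field names, and the present-day client-go import path are assumptions; failed keys are simply requeued with backoff.)

package sketch

import "k8s.io/client-go/util/workqueue"

type serviceController struct {
    queue workqueue.RateLimitingInterface
    // syncService looks up the latest Service by namespace/name key and
    // reconciles its load balancer with the cloud provider.
    syncService func(key string) error
}

// worker drains the queue until it is shut down.
func (s *serviceController) worker() {
    for s.processNextWorkItem() {
    }
}

// processNextWorkItem handles one key; failed keys are requeued with backoff,
// which replaces hand-rolled retryDelay bookkeeping.
func (s *serviceController) processNextWorkItem() bool {
    key, quit := s.queue.Get()
    if quit {
        return false
    }
    defer s.queue.Done(key)

    if err := s.syncService(key.(string)); err != nil {
        s.queue.AddRateLimited(key)
        return true
    }
    s.queue.Forget(key)
    return true
}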