e2e loadbalancer remove after each cleanup #113562
Conversation
@aojea: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @bowei @cezarygerard @code-elinka @panslava |
/sig network |
/test |
@aojea: The `/test` command needs one or more targets.
The following commands are available to trigger optional jobs: …
Use `/test all` to run all jobs.
In response to this: /test
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-kubernetes-e2e-gci-gce-ingress |
/assign @bowei |
Something is wrong with the cloud loadbalancer controller: the test starts, then ends because of a 5 min timeout, but according to https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/113562/pull-kubernetes-e2e-gci-gce-ingress/1587933663377494016/artifacts/bootstrap-e2e-master/kube-controller-manager.log the GCE controller doesn't start processing events until after that.
|
Never mind, it is processed in the GLB controller, but I don't understand the workflow of that controller either: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/113562/pull-kubernetes-e2e-gci-gce-ingress/1587933663377494016/artifacts/bootstrap-e2e-master/glbc.log
/test pull-kubernetes-e2e-gci-gce-ingress
Anyway, this PR doesn't touch the ingress tests, only the loadbalancer ones |
/assign @MrHohn you may have context here too |
/test pull-kubernetes-e2e-gci-gce-ingress |
I may be wrong, but from what I see GCE is not actually deleting the load balancer. It is trying to ensure it; it is just deleting target pools and health checks, because it wants to change the health checks from shared to node-local: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/artifacts/bootstrap-e2e-master/kube-controller-manager.log
And it feels like it actually races with the "cleanup hook" in the tests: GCE is trying to ensure the load balancer and reserves an address, the cleanup hook deletes it, so GCE fails to ensure the load balancer and resyncs the service. The question is: who triggers the LB update? From what I see, the failing test is "should handle updates to ExternalTrafficPolicy field", and it is an interesting test. What I don't understand is this defer (kubernetes/test/e2e/network/loadbalancer.go, line 1227 in 924b467). Why does it change the service type to ClusterIP, and then why does jig.ChangeServiceType delete the load balancer (kubernetes/test/e2e/framework/service/jig.go, line 171 in 924b467)?
err := cs.CoreV1().Services(svc.Namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{})
(kubernetes/test/e2e/network/loadbalancer.go, line 1229 in 924b467)
I think this is the source of the flake and the races. |
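For reference, here is a paraphrase in Go of the deferred cleanup under discussion, reconstructed from the snippets quoted above; the helper name and exact signatures are assumptions, not the verbatim test code:

```go
// Hypothetical paraphrase of the defer at loadbalancer.go:1227 (924b467);
// not the verbatim test code.
package network

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
	e2eservice "k8s.io/kubernetes/test/e2e/framework/service"
)

// cleanupLoadBalancerService mirrors the cleanup hook: first flip the
// Service back to ClusterIP (ChangeServiceType records the ingress IP and
// waits for the cloud load balancer behind it to be torn down), then
// delete the Service object itself.
func cleanupLoadBalancerService(cs clientset.Interface, jig *e2eservice.TestJig, svc *v1.Service) {
	jig.ChangeServiceType(v1.ServiceTypeClusterIP, e2eservice.GetServiceLoadBalancerCreationTimeout(cs))

	// The line quoted in the comment above: delete the Service object directly.
	err := cs.CoreV1().Services(svc.Namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{})
	framework.ExpectNoError(err)
}
```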
It seems that it is doing an ordered cleanup of the loadbalancer deletion.
Let's first try removing the hook, as in this PR; we can iterate later. WDYT? |
What do you mean by "remove all the loadbalancer"? It is just deleting cloud resources |
I mean all the GCE resources associated with the Service of type LoadBalancer, since it is no longer a LoadBalancer service |
Can I have an lgtm to iterate on this? |
/lgtm |
+1 - with the finalizer mechanism the load balancers should not be leaked even without these extra cleanup steps.
/lgtm
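For context, the finalizer referred to here is `service.kubernetes.io/load-balancer-cleanup`: the service controller keeps it on the Service until the cloud resources are released, so a test only needs to delete the Service and wait for the object to disappear. A minimal sketch, assuming a hypothetical helper name, poll interval, and timeout (none of this is code from this PR):

```go
package cleanup

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// deleteServiceAndWait deletes the Service and polls until it is gone.
// While the "service.kubernetes.io/load-balancer-cleanup" finalizer is
// present, the API server keeps the object around, so the Service only
// disappears once the cloud-provider has released the load balancer.
func deleteServiceAndWait(ctx context.Context, cs clientset.Interface, namespace, name string) error {
	if err := cs.CoreV1().Services(namespace).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := cs.CoreV1().Services(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // finalizer removed, cloud resources released
		}
		return false, err
	})
}
```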
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, MrHohn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files: …
Approvers can indicate their approval by writing `/approve` in a comment. |
Still too soon to celebrate (https://testgrid.k8s.io/google-gce#gci-gce-ingress&width=5), but no failures since this merged 4 days ago |
No more failures since this merged https://testgrid.k8s.io/google-gce#gci-gce-ingress&width=5 |
The cloud-provider and the e2e test were racing to delete the cloud resources.
Also, the cloud-provider should not leave orphan resources; those will be detected by the job and cause it to fail, so we should not have additional cleanup logic masking these errors.
/kind flake
Fixes: #107530
Looking at the jobs in https://testgrid.k8s.io/google-gce#gci-gce-ingress: failing jobs orphan IPs, while successful jobs don't present that error.
In addition, checking the e2e output of the failing job (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/build-log.txt) we can see that the e2e.test is deleting the GCE objects.
In parallel, the cloud-provider seems to be deleting the same objects (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/artifacts/bootstrap-e2e-master/kube-controller-manager.log), but at this point it fails and decides to recreate the load balancer, failing to assign the IP that was already orphaned?
There seems to be a bug in the cloud-provider?
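To make the suspected race concrete, here is a minimal, self-contained Go sketch (hypothetical types, not the real cloud-provider or e2e code): a resync loop keeps re-reserving an address for the load balancer while an external cleanup hook releases it behind its back, so each actor undoes the other's work and addresses can end up orphaned:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type fakeCloud struct {
	mu sync.Mutex
	ip string // reserved static IP, "" when released
}

// ensureLB models the cloud-provider resync: if the IP is gone, it
// reserves a new one — which is what orphans addresses when the Service
// is meanwhile being torn down by the test's cleanup hook.
func (c *fakeCloud) ensureLB() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ip == "" {
		c.ip = fmt.Sprintf("ip-%d", time.Now().UnixNano())
		fmt.Println("controller: reserved", c.ip)
	}
}

// releaseLB models the e2e cleanup hook deleting cloud resources
// directly, without coordinating with the controller.
func (c *fakeCloud) releaseLB() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ip != "" {
		fmt.Println("cleanup: released", c.ip)
		c.ip = ""
	}
}

func main() {
	cloud := &fakeCloud{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // cloud-provider resync loop
		defer wg.Done()
		for i := 0; i < 5; i++ {
			cloud.ensureLB()
			time.Sleep(10 * time.Millisecond)
		}
	}()
	go func() { // test cleanup hook racing with the controller
		defer wg.Done()
		for i := 0; i < 5; i++ {
			cloud.releaseLB()
			time.Sleep(10 * time.Millisecond)
		}
	}()
	wg.Wait()
}
```

Running it prints alternating "reserved"/"released" lines: neither side ever converges, which matches the resync behavior visible in the kube-controller-manager log above.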