e2e loadbalancer remove after each cleanup #113562
Conversation
@aojea: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @bowei @cezarygerard @code-elinka @panslava |
/sig network |
/test |
@aojea: The `/test` command needs one or more targets.
The following commands are available to trigger optional jobs: …
Use `/test all` to run all jobs.
In response to this: /test
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-kubernetes-e2e-gci-gce-ingress |
/assign @bowei |
Something is wrong with the cloud loadbalancer controller: the test starts, then ends because of a 5 min timeout, but according to https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/113562/pull-kubernetes-e2e-gci-gce-ingress/1587933663377494016/artifacts/bootstrap-e2e-master/kube-controller-manager.log the GCE controller doesn't start processing events until after that.
|
Never mind, it is processed in the GLB controller, but I don't understand the workflow of that controller either: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/113562/pull-kubernetes-e2e-gci-gce-ingress/1587933663377494016/artifacts/bootstrap-e2e-master/glbc.log
/test pull-kubernetes-e2e-gci-gce-ingress
Anyway, this PR doesn't touch the ingress tests, only the loadbalancer ones |
/assign @MrHohn you may have context here too |
/test pull-kubernetes-e2e-gci-gce-ingress |
I may be wrong, but from what I see GCE is not actually deleting the load balancer. It is trying to ensure it; it is just deleting target pools and health checks, because it wants to change the health checks from shared to node-local: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/artifacts/bootstrap-e2e-master/kube-controller-manager.log
And it feels like it actually races with the "cleanup hook" in the tests: GCE is trying to ensure the load balancer and reserves an address, the cleanup hook deletes it, so GCE fails to ensure the load balancer and resyncs the service. The question is: who triggers the LB update? From what I see, the failing test is "should handle updates to ExternalTrafficPolicy field", and it is an interesting test. What I don't understand is this defer (kubernetes/test/e2e/network/loadbalancer.go, line 1227 in 924b467). Why does it change the service type to ClusterIP, and then why does jig.ChangeServiceType delete the load balancer (kubernetes/test/e2e/framework/service/jig.go, line 171 in 924b467)?
err := cs.CoreV1().Services(svc.Namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{})
(kubernetes/test/e2e/network/loadbalancer.go, line 1229 in 924b467)
I think this is the source of the flake and the races. |
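For reference, here is a paraphrase in Go of the deferred cleanup under discussion, reconstructed from the snippets quoted above; the helper name and exact signatures are assumptions, not the verbatim test code:

```go
// Hypothetical paraphrase of the defer at loadbalancer.go:1227 (924b467);
// not the verbatim test code.
package network

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
	e2eservice "k8s.io/kubernetes/test/e2e/framework/service"
)

// cleanupLoadBalancerService mirrors the cleanup hook: first flip the
// Service back to ClusterIP (ChangeServiceType records the ingress IP and
// waits for the cloud load balancer behind it to be torn down), then
// delete the Service object itself.
func cleanupLoadBalancerService(cs clientset.Interface, jig *e2eservice.TestJig, svc *v1.Service) {
	jig.ChangeServiceType(v1.ServiceTypeClusterIP, e2eservice.GetServiceLoadBalancerCreationTimeout(cs))

	// The line quoted in the comment above: delete the Service object directly.
	err := cs.CoreV1().Services(svc.Namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{})
	framework.ExpectNoError(err)
}
```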
It seems that it is doing an ordered cleanup of the loadbalancer deletion.
Let's first try removing the hook, as in this PR; we can iterate later. WDYT? |
What do you mean by "remove all the loadbalancer"? It is just deleting cloud resources |
I mean all the GCE resources associated with the Service of type LoadBalancer, since it is no longer a LoadBalancer service |
Can I have an lgtm to iterate on this? |
/lgtm |
+1 - with the finalizer mechanism the load balancers should not be leaked even without these extra cleanup steps.
/lgtm
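For context, the finalizer referred to here is `service.kubernetes.io/load-balancer-cleanup`: the service controller keeps it on the Service until the cloud resources are released, so a test only needs to delete the Service and wait for the object to disappear. A minimal sketch, assuming a hypothetical helper name, poll interval, and timeout (none of this is code from this PR):

```go
package cleanup

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// deleteServiceAndWait deletes the Service and polls until it is gone.
// While the "service.kubernetes.io/load-balancer-cleanup" finalizer is
// present, the API server keeps the object around, so the Service only
// disappears once the cloud-provider has released the load balancer.
func deleteServiceAndWait(ctx context.Context, cs clientset.Interface, namespace, name string) error {
	if err := cs.CoreV1().Services(namespace).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := cs.CoreV1().Services(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // finalizer removed, cloud resources released
		}
		return false, err
	})
}
```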
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, MrHohn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files: …
Approvers can indicate their approval by writing `/approve` in a comment. |
Still too soon to celebrate (https://testgrid.k8s.io/google-gce#gci-gce-ingress&width=5), but no failures since this merged 4 days ago |
No more failures since this merged https://testgrid.k8s.io/google-gce#gci-gce-ingress&width=5 |
The cloud-provider and the e2e test were racing to delete the cloud resources.
Also, the cloud-provider should not leave orphan resources; those will be detected by the job and cause it to fail, so we should not have additional cleanup logic masking these errors.
/kind flake
Fixes: #107530
Looking at the jobs in https://testgrid.k8s.io/google-gce#gci-gce-ingress: failing jobs orphan IPs, while successful jobs don't present that error.
In addition, checking the e2e output of the failing job (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/build-log.txt) we can see that the e2e.test is deleting the GCE objects.
In parallel, the cloud-provider seems to be deleting the same objects (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/artifacts/bootstrap-e2e-master/kube-controller-manager.log), but at this point it fails and decides to recreate the load balancer, failing to assign the IP that was already orphaned?
There seems to be a bug in the cloud-provider?
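To make the suspected race concrete, here is a minimal, self-contained Go sketch (hypothetical types, not the real cloud-provider or e2e code): a resync loop keeps re-reserving an address for the load balancer while an external cleanup hook releases it behind its back, so each actor undoes the other's work and addresses can end up orphaned:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type fakeCloud struct {
	mu sync.Mutex
	ip string // reserved static IP, "" when released
}

// ensureLB models the cloud-provider resync: if the IP is gone, it
// reserves a new one — which is what orphans addresses when the Service
// is meanwhile being torn down by the test's cleanup hook.
func (c *fakeCloud) ensureLB() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ip == "" {
		c.ip = fmt.Sprintf("ip-%d", time.Now().UnixNano())
		fmt.Println("controller: reserved", c.ip)
	}
}

// releaseLB models the e2e cleanup hook deleting cloud resources
// directly, without coordinating with the controller.
func (c *fakeCloud) releaseLB() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ip != "" {
		fmt.Println("cleanup: released", c.ip)
		c.ip = ""
	}
}

func main() {
	cloud := &fakeCloud{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // cloud-provider resync loop
		defer wg.Done()
		for i := 0; i < 5; i++ {
			cloud.ensureLB()
			time.Sleep(10 * time.Millisecond)
		}
	}()
	go func() { // test cleanup hook racing with the controller
		defer wg.Done()
		for i := 0; i < 5; i++ {
			cloud.releaseLB()
			time.Sleep(10 * time.Millisecond)
		}
	}()
	wg.Wait()
}
```

Running it prints alternating "reserved"/"released" lines: neither side ever converges, which matches the resync behavior visible in the kube-controller-manager log above.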