e2e loadbalancer remove after each cleanup #113562

Merged 1 commit into kubernetes:master on Nov 4, 2022

Conversation

@aojea (Member) commented Nov 2, 2022

The cloud-provider and the e2e test were racing on deleting the cloud resources.

Also, the cloud-provider should not leave orphaned resources; those are detected by the job and cause it to fail, so we should not have additional cleanup logic that masks these errors.
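As an illustration of the pattern being removed, here is a minimal, hypothetical sketch of a per-test cleanup hook that duplicates the cloud-provider's own teardown; the hook body and helper name are illustrative, not the actual diff:

package loadbalancer_test

import "github.com/onsi/ginkgo/v2"

// Hypothetical sketch of the kind of after-each cleanup this PR removes.
var _ = ginkgo.AfterEach(func() {
	// Deleting the cloud resources here races with the cloud-provider's
	// service controller, which is already tearing down the forwarding
	// rule, target pool, health checks, and IP for the deleted Service.
	cleanupLoadBalancerCloudResources()
})

// cleanupLoadBalancerCloudResources stands in for the provider-specific
// cleanup that the test no longer performs itself.
func cleanupLoadBalancerCloudResources() {}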

/kind flake

Release note: NONE

Fixes: #107530

Looking at the jobs in https://testgrid.k8s.io/google-gce#gci-gce-ingress:

Failing jobs orphan IPs:

W1101 17:03:45.329009 9 gce_loadbalancer_external.go:147] ensureExternalLoadBalancer(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Orphaning static IP 34.168.75.161 in region us-west1: googleapi: Error 404: The resource 'projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/a966e0e50f726482a91a69dca3b8e1a2' was not found, notFound.

Successful jobs don't show that error.

In addition, looking at the e2e output of the failing job (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/build-log.txt) we can see that e2e.test is deleting the GCE objects:

grep a966e0e50f726482a91a69dca3b8e1a2 build-log.txt
I1101 17:03:14.036] Nov  1 17:03:14.035: INFO: cleaning load balancer resource for a966e0e50f726482a91a69dca3b8e1a2
I1101 17:03:25.800] I1101 17:03:25.800642   10220 gce_loadbalancer_external.go:385] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Target pool already deleted. Continuing to delete other resources.
I1101 17:03:25.801] I1101 17:03:25.800668   10220 gce_loadbalancer_external.go:402] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Deleting health check k8s-a835f4d3b1f613eb-node.
I1101 17:03:26.345] W1101 17:03:26.345494   10220 gce_loadbalancer_external.go:409] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Failed to delete health check k8s-a835f4d3b1f613eb-node: googleapi: Error 400: The resource 'projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/k8s-a835f4d3b1f613eb-node' is not ready, resourceNotReady.
I1101 17:03:29.917] I1101 17:03:29.917619   10220 gce_loadbalancer_external.go:385] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Target pool already deleted. Continuing to delete other resources.
I1101 17:03:29.918] I1101 17:03:29.917643   10220 gce_loadbalancer_external.go:402] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Deleting health check k8s-a835f4d3b1f613eb-node.
I1101 17:03:30.378] I1101 17:03:30.377812   10220 gce_loadbalancer_external.go:424] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Deleting health check firewall k8s-a835f4d3b1f613eb-node-http-hc.
I1101 17:03:35.019] I1101 17:03:35.018925   10220 gce_loadbalancer_external.go:385] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Target pool already deleted. Continuing to delete other resources.
I1101 17:03:35.019] I1101 17:03:35.018958   10220 gce_loadbalancer_external.go:402] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Deleting health check k8s-a835f4d3b1f613eb-node.
I1101 17:03:35.449] I1101 17:03:35.448935   10220 gce_loadbalancer_external.go:424] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(/)): Deleting health check firewall k8s-a835f4d3b1f613eb-node-http-hc.

In parallel, the cloud-provider seems to be deleting the same objects (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/artifacts/bootstrap-e2e-master/kube-controller-manager.log), but at this point it fails, decides to recreate the load balancer, and then fails to assign the IP that was already orphaned:

I1101 17:03:05.852473       9 gen.go:16624] GCEHttpHealthChecks.Get(context.Background.WithDeadline(2022-11-01 18:03:05.646296407 +0000 UTC m=+6793.291264222 [59m59.793814122s]), Key{"a966e0e50f726482a91a69dca3b8e1a2"}) = <nil>, googleapi: Error 404: The resource 'projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/a966e0e50f726482a91a69dca3b8e1a2' was not found, notFound
I1101 17:03:05.852571       9 gce_loadbalancer_external.go:226] ensureExternalLoadBalancer(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Service needs local traffic health checks on: 31123/healthz.
I1101 17:03:05.852582       9 gce_loadbalancer_external.go:231] ensureExternalLoadBalancer(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Updating from nodes health checks to local traffic health checks.
I1101 17:03:22.419771       9 gen.go:12224] GCEForwardingRules.Delete(context.Background.WithDeadline(2022-11-01 18:03:05.852589313 +0000 UTC m=+6793.497557147 [59m43.432807764s]), Key{"a966e0e50f726482a91a69dca3b8e1a2", region: "us-west1"}) = <nil>
I1101 17:03:22.419846       9 gce_loadbalancer_external.go:267] ensureExternalLoadBalancer(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Deleted forwarding rule.
I1101 17:03:24.948520       9 gen.go:36248] GCETargetPools.Delete(context.Background.WithDeadline(2022-11-01 18:03:22.419856979 +0000 UTC m=+6810.064824806 [59m57.471327112s]), Key{"a966e0e50f726482a91a69dca3b8e1a2", region: "us-west1"}) = <nil>
I1101 17:03:24.948571       9 gce_loadbalancer_external.go:402] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Deleting health check k8s-a835f4d3b1f613eb-node.
I1101 17:03:29.354394       9 gce_loadbalancer_external.go:424] DeleteExternalTargetPoolAndChecks(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Deleting health check firewall k8s-a835f4d3b1f613eb-node-http-hc.
I1101 17:03:33.611054       9 gce_loadbalancer_external.go:505] ensureTargetPoolAndHealthCheck(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Deleted target pool.
I1101 17:03:33.804177       9 gen.go:10056] GCEFirewalls.Get(context.Background.WithDeadline(2022-11-01 18:03:33.611062158 +0000 UTC m=+6821.256029988 [59m59.806877216s]), Key{"k8s-a966e0e50f726482a91a69dca3b8e1a2-http-hc"}) = <nil>, googleapi: Error 404: The resource 'projects/k8s-infra-e2e-boskos-014/global/firewalls/k8s-a966e0e50f726482a91a69dca3b8e1a2-http-hc' was not found, notFound
I1101 17:03:33.804245       9 gce_loadbalancer_external.go:931] Creating firewall k8s-a966e0e50f726482a91a69dca3b8e1a2-http-hc for health checks.
I1101 17:03:39.575874       9 gen.go:10132] GCEFirewalls.Insert(context.Background.WithDeadline(2022-11-01 18:03:33.804255907 +0000 UTC m=+6821.449223737 [59m54.228372326s]), Key{"k8s-a966e0e50f726482a91a69dca3b8e1a2-http-hc"}, &{Allowed:[0xc002c19500] CreationTimestamp: Denied:[] Description:{"kubernetes.io/service-name":"esipp-7190/external-local-update", "kubernetes.io/service-ip":"34.168.75.161"} DestinationRanges:[34.168.75.161] Direction: Disabled:false Id:0 Kind: LogConfig:<nil> Name:k8s-a966e0e50f726482a91a69dca3b8e1a2-http-hc Network:https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-014/global/networks/bootstrap-e2e Priority:0 SelfLink: SourceRanges:[130.211.0.0/22 35.191.0.0/16 209.85.152.0/22 209.85.204.0/22] SourceServiceAccounts:[] SourceTags:[] TargetServiceAccounts:[] TargetTags:[bootstrap-e2e-minion] ServerResponse:{HTTPStatusCode:0 Header:map[]} ForceSendFields:[] NullFields:[]}) = <nil>
I1101 17:03:39.575987       9 gce_loadbalancer_external.go:935] Created firewall k8s-a966e0e50f726482a91a69dca3b8e1a2-http-hc for health checks.
I1101 17:03:39.755795       9 gen.go:16624] GCEHttpHealthChecks.Get(context.Background.WithDeadline(2022-11-01 18:03:39.575998684 +0000 UTC m=+6827.220966514 [59m59.820194454s]), Key{"a966e0e50f726482a91a69dca3b8e1a2"}) = <nil>, googleapi: Error 404: The resource 'projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/a966e0e50f726482a91a69dca3b8e1a2' was not found, notFound
I1101 17:03:39.755876       9 gce_loadbalancer_external.go:717] Did not find health check a966e0e50f726482a91a69dca3b8e1a2, creating port 31123 path /healthz
I1101 17:03:41.588316       9 gen.go:16700] GCEHttpHealthChecks.Insert(context.Background.WithDeadline(2022-11-01 18:03:39.755882277 +0000 UTC m=+6827.400850102 [59m58.167553232s]), Key{"a966e0e50f726482a91a69dca3b8e1a2"}, &{CheckIntervalSec:8 CreationTimestamp: Description:{"kubernetes.io/service-name":"a966e0e50f726482a91a69dca3b8e1a2"} HealthyThreshold:1 Host: Id:0 Kind: Name:a966e0e50f726482a91a69dca3b8e1a2 Port:31123 RequestPath:/healthz SelfLink: TimeoutSec:1 UnhealthyThreshold:3 ServerResponse:{HTTPStatusCode:0 Header:map[]} ForceSendFields:[] NullFields:[]}) = <nil>
I1101 17:03:41.847558       9 gen.go:16624] GCEHttpHealthChecks.Get(context.Background.WithDeadline(2022-11-01 18:03:41.588385956 +0000 UTC m=+6829.233353771 [59m59.740819823s]), Key{"a966e0e50f726482a91a69dca3b8e1a2"}) = &{CheckIntervalSec:8 CreationTimestamp:2022-11-01T10:03:40.169-07:00 Description:{"kubernetes.io/service-name":"a966e0e50f726482a91a69dca3b8e1a2"} HealthyThreshold:1 Host: Id:615235441063161731 Kind:compute#httpHealthCheck Name:a966e0e50f726482a91a69dca3b8e1a2 Port:31123 RequestPath:/healthz SelfLink:https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/a966e0e50f726482a91a69dca3b8e1a2 TimeoutSec:1 UnhealthyThreshold:3 ServerResponse:{HTTPStatusCode:200 Header:map[Cache-Control:[private] Content-Type:[application/json; charset=UTF-8] Date:[Tue, 01 Nov 2022 17:03:41 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} ForceSendFields:[] NullFields:[]}, <nil>
I1101 17:03:41.847663       9 gce_loadbalancer_external.go:726] Created HTTP health check a966e0e50f726482a91a69dca3b8e1a2 healthCheckNodePort: 31123
I1101 17:03:41.847705       9 gce_loadbalancer_external.go:576] Creating targetpool a966e0e50f726482a91a69dca3b8e1a2 with 1 healthchecks
I1101 17:03:44.560988       9 gen.go:36215] GCETargetPools.Insert(context.Background.WithDeadline(2022-11-01 18:03:41.847713095 +0000 UTC m=+6829.492680925 [59m57.286707546s]), Key{"a966e0e50f726482a91a69dca3b8e1a2", region: "us-west1"}, &{BackupPool: CreationTimestamp: Description:{"kubernetes.io/service-name":"esipp-7190/external-local-update"} FailoverRatio:0 HealthChecks:[https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/a966e0e50f726482a91a69dca3b8e1a2] Id:0 Instances:[/zones/us-west1-b/instances/bootstrap-e2e-master /zones/us-west1-b/instances/bootstrap-e2e-minion-group-t21s /zones/us-west1-b/instances/bootstrap-e2e-minion-group-9bk2 /zones/us-west1-b/instances/bootstrap-e2e-minion-group-tb53] Kind: Name:a966e0e50f726482a91a69dca3b8e1a2 Region: SelfLink: SessionAffinity:NONE ServerResponse:{HTTPStatusCode:0 Header:map[]} ForceSendFields:[] NullFields:[]}) = <nil>
I1101 17:03:44.561084       9 gce_loadbalancer_external.go:518] ensureTargetPoolAndHealthCheck(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Created health checks a966e0e50f726482a91a69dca3b8e1a2.
I1101 17:03:44.561092       9 gce_loadbalancer_external.go:521] ensureTargetPoolAndHealthCheck(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Created target pool.
I1101 17:03:44.561105       9 gce_loadbalancer_external.go:275] ensureExternalLoadBalancer(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Creating forwarding rule, IP 34.168.75.161 (tier: Premium).
I1101 17:03:45.328911       9 gen.go:12186] GCEForwardingRules.Insert(context.Background.WithDeadline(2022-11-01 18:03:44.561115144 +0000 UTC m=+6832.206082971 [59m59.232188705s]), Key{"a966e0e50f726482a91a69dca3b8e1a2", region: "us-west1"}, ...) = googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.168.75.161'. Invalid IP address specified., invalid
W1101 17:03:45.329009       9 gce_loadbalancer_external.go:147] ensureExternalLoadBalancer(a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): Orphaning static IP 34.168.75.161 in region us-west1: googleapi: Error 404: The resource 'projects/k8s-infra-e2e-boskos-014/global/httpHealthChecks/a966e0e50f726482a91a69dca3b8e1a2' was not found, notFound.
E1101 17:03:45.329024       9 gce_loadbalancer.go:174] Failed to EnsureLoadBalancer(bootstrap-e2e, esipp-7190, external-local-update, a966e0e50f726482a91a69dca3b8e1a2, us-west1), err: failed to create forwarding rule for load balancer (a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.168.75.161'. Invalid IP address specified., invalid
E1101 17:03:45.329072       9 controller.go:289] error processing service esipp-7190/external-local-update (will retry): failed to ensure load balancer: failed to create forwarding rule for load balancer (a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.168.75.161'. Invalid IP address specified., invalid
I1101 17:03:45.331490       9 event.go:294] "Event occurred" object="esipp-7190/external-local-update" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.168.75.161'. Invalid IP address specified., invalid"
E1101 17:03:45.336717       9 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"external-local-update.172384bae6da1d34", GenerateName:"", Namespace:"esipp-7190", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Service", Namespace:"esipp-7190", Name:"external-local-update", UID:"966e0e50-f726-482a-91a6-9dca3b8e1a27", APIVersion:"v1", ResourceVersion:"8542", FieldPath:""}, Reason:"SyncLoadBalancerFailed", Message:"Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (a966e0e50f726482a91a69dca3b8e1a2(esipp-7190/external-local-update)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.168.75.161'. Invalid IP address specified., invalid", Source:v1.EventSource{Component:"service-controller", Host:""}, FirstTimestamp:time.Date(2022, time.November, 1, 17, 3, 45, 329052980, time.Local), LastTimestamp:time.Date(2022, time.November, 1, 17, 3, 45, 329052980, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "external-local-update.172384bae6da1d34" is forbidden: unable to create new content in namespace esipp-7190 because it is being terminated' (will not retry!)
I1101 17:04:21.178898       9 gen.go:12115] GCEForwardingRules.Get(context.Background.WithDeadline(2022-11-01 18:04:21.120154883 +0000 UTC m=+6868.765122710 [59m59.941248789s]), Key{"a966e0e50f726482a91a69dca3b8e1a2", region: "us-west1"}) = <nil>, googleapi: Error 404: The resource 'projects/k8s-infra-e2e-boskos-014/regions/us-west1/forwardingRules/a966e0e50f726482a91a69dca3b8e1a2' was not found, notFound

There seems to be a bug in the cloud-provider?

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 2, 2022
@k8s-ci-robot (Contributor) commented:

@aojea: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 2, 2022
@aojea (Member, Author) commented Nov 2, 2022

/assign @bowei @cezarygerard @code-elinka @panslava

@k8s-ci-robot k8s-ci-robot added area/test sig/network Categorizes an issue or PR as relevant to SIG Network. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2022
@aojea (Member, Author) commented Nov 2, 2022

/sig network

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 2, 2022
@aojea (Member, Author) commented Nov 2, 2022

/test

@k8s-ci-robot (Contributor) commented:

@aojea: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-coverage-unit
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-no-stage
  • /test pull-kubernetes-e2e-gce-scale-performance-manual
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd-canary
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary
  • /test pull-kubernetes-verify-govet-levee

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-ci-kubernetes-unit-windows
  • /test pull-e2e-gce-cloud-provider-disabled
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-ha-control-plane
  • /test pull-kubernetes-e2e-capz-windows-containerd
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-gce-alpha-features
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-kubetest2
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd-serial
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kops-aws
  • /test pull-kubernetes-e2e-kubelet-credential-provider
  • /test pull-kubernetes-e2e-ubuntu-gce-network-policies
  • /test pull-kubernetes-kind-dra
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-alpha-features
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-credential-provider
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-memoryqos-cgrpv2
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-unit-experimental
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-conformance-kind-ipv6-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-gce-100-performance
  • pull-kubernetes-e2e-gce-ubuntu-containerd
  • pull-kubernetes-e2e-gci-gce-ingress
  • pull-kubernetes-e2e-gci-gce-ipvs
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-e2e-ubuntu-gce-network-policies
  • pull-kubernetes-integration
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify
  • pull-kubernetes-verify-govet-levee

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea (Member, Author) commented Nov 2, 2022

/test pull-kubernetes-e2e-gci-gce-ingress

@aojea (Member, Author) commented Nov 2, 2022

/assign @bowei

The cloud-provider and the e2e test were racing on deleting the
cloud resources.

Also, the cloud-provider should not leave orphan resources, that will
be detected by the job and fail, thus we should not have additional
logic to cleanup masking these errors.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 2, 2022
@aojea (Member, Author) commented Nov 3, 2022

Something is wrong with the cloud load-balancer controller. The test starts at

STEP: Create a basic HTTP ingress using NEG 11/02/22 23:59:13.553

and ends because of a 5-minute timeout at

Nov 3 00:04:44.596: FAIL: timed out waiting for the condition

but the GCE controller doesn't start processing events until (https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/113562/pull-kubernetes-e2e-gci-gce-ingress/1587933663377494016/artifacts/bootstrap-e2e-master/kube-controller-manager.log):

I1103 00:05:49.418559 8 gce_loadbalancer.go:137] EnsureLoadBalancer(bootstrap-e2e, loadbalancers-1795, mutability-test, a1212145f157a40a998fa59e2f116858, us-west1): ensure EXTERNAL loadbalancer

@aojea (Member, Author) commented Nov 3, 2022

Never mind, it is processed in the GLB controller, but I don't understand that controller's workflow either: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/113562/pull-kubernetes-e2e-gci-gce-ingress/1587933663377494016/artifacts/bootstrap-e2e-master/glbc.log

I1103 00:00:09.201169       1 healthchecks.go:236] DefaultNEGHealthCheck(HTTP)
I1103 00:00:09.201187       1 healthchecks.go:324] Getting Health Check, name: k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299, version: beta, scope: global
I1103 00:00:09.394564       1 healthchecks.go:124] Existing HC = &{ForNEG:false ForILB:false HTTPHealthCheck:{Host: Port:0 PortName: PortSpecification:USE_SERVING_PORT ProxyHeader:NONE RequestPath:/ Response: ForceSendFields:[] NullFields:[]} HealthCheck:{CheckIntervalSec:15 CreationTimestamp:2022-11-02T16:59:15.063-07:00 Description:Default kubernetes L7 Loadbalancing health check for NEG. GrpcHealthCheck:<nil> HealthyThreshold:1 Http2HealthCheck:<nil> HttpHealthCheck:<nil> HttpsHealthCheck:<nil> Id:4261368500172274364 Kind:compute#healthCheck LogConfig:<nil> Name:k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299 Region: SelfLink:https://www.googleapis.com/compute/beta/projects/k8s-jkns-e2e-gce-ubuntu-slow/global/healthChecks/k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299 SelfLinkWithId: SslHealthCheck:<nil> TcpHealthCheck:<nil> TimeoutSec:15 Type:HTTP UdpHealthCheck:<nil> UnhealthyThreshold:2 ServerResponse:{HTTPStatusCode:0 Header:map[]} ForceSendFields:[] NullFields:[]}}
I1103 00:00:09.394614       1 healthchecks.go:125] HC before merge = &{ForNEG:true ForILB:false HTTPHealthCheck:{Host: Port:32450 PortName: PortSpecification:USE_SERVING_PORT ProxyHeader: RequestPath:/ Response: ForceSendFields:[] NullFields:[]} HealthCheck:{CheckIntervalSec:15 CreationTimestamp: Description:Default kubernetes L7 Loadbalancing health check for NEG. GrpcHealthCheck:<nil> HealthyThreshold:1 Http2HealthCheck:<nil> HttpHealthCheck:<nil> HttpsHealthCheck:<nil> Id:0 Kind: LogConfig:<nil> Name:k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299 Region: SelfLink: SelfLinkWithId: SslHealthCheck:<nil> TcpHealthCheck:<nil> TimeoutSec:15 Type:HTTP UdpHealthCheck:<nil> UnhealthyThreshold:2 ServerResponse:{HTTPStatusCode:0 Header:map[]} ForceSendFields:[] NullFields:[]}}
I1103 00:00:09.394644       1 healthchecks.go:126] Resulting HC = &{ForNEG:true ForILB:false HTTPHealthCheck:{Host: Port:0 PortName: PortSpecification:USE_SERVING_PORT ProxyHeader:NONE RequestPath:/ Response: ForceSendFields:[] NullFields:[]} HealthCheck:{CheckIntervalSec:15 CreationTimestamp: Description:Default kubernetes L7 Loadbalancing health check for NEG. GrpcHealthCheck:<nil> HealthyThreshold:1 Http2HealthCheck:<nil> HttpHealthCheck:<nil> HttpsHealthCheck:<nil> Id:0 Kind: LogConfig:<nil> Name:k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299 Region: SelfLink: SelfLinkWithId: SslHealthCheck:<nil> TcpHealthCheck:<nil> TimeoutSec:15 Type:HTTP UdpHealthCheck:<nil> UnhealthyThreshold:2 ServerResponse:{HTTPStatusCode:0 Header:map[]} ForceSendFields:[] NullFields:[]}}
I1103 00:00:09.394662       1 healthchecks.go:144] Health check "k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299" already exists and needs no update
I1103 00:00:09.394734       1 gen.go:4952] Getting ga zonal NetworkEndpointGroup k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299
I1103 00:00:09.466697       1 gen.go:3896] Getting ga BackendService k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299
I1103 00:00:09.739906       1 gen.go:6636] Getting ga UrlMap k8s-um-ingress-743-hostname--6ccc0461bdfb9376
I1103 00:00:09.893021       1 gen.go:6039] Getting ga TargetHttpProxy k8s-tp-ingress-743-hostname--6ccc0461bdfb9376
I1103 00:00:10.116082       1 gen.go:4210] Getting ga ForwardingRule k8s-fw-ingress-743-hostname--6ccc0461bdfb9376
I1103 00:00:10.318264       1 gen.go:3896] Getting ga BackendService k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299
I1103 00:00:10.899664       1 backends.go:204] backend service "k8s1-6ccc0461-ingress-743-hostname-80-d8c9a299" does not have health status: []
I1103 00:00:10.899745       1 gen.go:6633] Getting ga region UrlMap k8s-um-ingress-743-hostname--6ccc0461bdfb9376
I1103 00:00:10.971614       1 sync.go:104] Using algorithm NoCleanUpNeeded to GC frontend of ingress ingress-743/hostname
I1103 00:00:10.971685       1 gen.go:3940] Listing ga region BackendService
I1103 00:00:11.050154       1 gen.go:3943] Listing ga BackendService
I1103 00:00:35.009983       1 gen.go:3896] Getting ga BackendService k8s-ingress-svc-acct-permission-check-probe
I1103 00:00:45.605925       1 manager.go:282] Start NEG garbage collection.
I1103 00:00:45.605967       1 manager.go:497] Deletion candidate ingress-4703/k8s1-6ccc0461-ingress-4703-hostname-80-49baa8d3 has 1 NEG references

/test pull-kubernetes-e2e-gci-gce-ingress

Anyway, this PR doesn't touch the ingress tests, only the load-balancer ones.

@aojea (Member, Author) commented Nov 3, 2022

/assign @MrHohn

you may have context here too

@aojea (Member, Author) commented Nov 3, 2022

/test pull-kubernetes-e2e-gci-gce-ingress

@panslava (Contributor) commented Nov 3, 2022

I could be wrong, but from what I see:

GCE is not actually deleting the load balancer. It is trying to ensure it; it is just deleting the target pools and health checks because it wants to change the health checks from shared to node-local. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-ingress/1587657000760643584/artifacts/bootstrap-e2e-master/kube-controller-manager.log

I1102 05:15:49.083597       9 gce_loadbalancer_external.go:226] ensureExternalLoadBalancer(a37d5fcfd34ad46b2b5e38e4fccf8abe(esipp-8909/external-local-update)): Service needs local traffic health checks on: 32529/healthz.
I1102 05:15:49.083612       9 gce_loadbalancer_external.go:231] ensureExternalLoadBalancer(a37d5fcfd34ad46b2b5e38e4fccf8abe(esipp-8909/external-local-update)): Updating from nodes health checks to local traffic health checks

I1102 05:16:05.321505       9 gce_loadbalancer_external.go:267] ensureExternalLoadBalancer(a37d5fcfd34ad46b2b5e38e4fccf8abe(esipp-8909/external-local-update)): Deleted forwarding rule.
I1102 05:16:06.013657       9 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="312.194µs" userAgent="kube-probe/1.26+" audit-ID="" srcIP="127.0.0.1:44776" resp=200
I1102 05:16:07.882693       9 gen.go:36248] GCETargetPools.Delete(context.Background.WithDeadline(2022-11-02 06:16:05.321520537 +0000 UTC m=+7545.195692136 [59m57.438791026s]), Key{"a37d5fcfd34ad46b2b5e38e4fccf8abe", region: "us-west1"}) = <nil>
I1102 05:16:07.882831       9 gce_loadbalancer_external.go:402] DeleteExternalTargetPoolAndChecks(a37d5fcfd34ad46b2b5e38e4fccf8abe(esipp-8909/external-local-update)): Deleting health check k8s-674fcf47d69408d1-node.
I1102 05:16:14.778806       9 gce_loadbalancer_external.go:505] ensureTargetPoolAndHealthCheck(a37d5fcfd34ad46b2b5e38e4fccf8abe(esipp-8909/external-local-update)): Deleted target pool.

It feels like this actually races with the "cleanup hook" in the tests: GCE is trying to ensure the load balancer and reserves an address, the cleanup hook deletes the address, so GCE fails to ensure the load balancer and resyncs the service.

The question is: who triggers the LB update? From what I see, the failing test is "should handle updates to ExternalTrafficPolicy field", and it is an interesting test.

What I don't understand is this defer:

err = jig.ChangeServiceType(v1.ServiceTypeClusterIP, loadBalancerCreateTimeout)

Why does it change the service type to ClusterIP, and why is jig.ChangeServiceType deleting the load balancer?

func (j *TestJig) ChangeServiceType(newType v1.ServiceType, timeout time.Duration) error {

which calls CleanupServiceResources. And why, after this, do we again call:

err := cs.CoreV1().Services(svc.Namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{})

And after that we also had a cleanup hook (before this PR) which also does CleanupServiceResources, the same as in jig.ChangeServiceType.

I think this is the source of the flakes and races.
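A hedged reconstruction of that teardown sequence, for reference; the ChangeServiceType and Delete calls are the ones quoted above, while the wrapper function and error handling are illustrative:

package loadbalancer_test

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
	e2eservice "k8s.io/kubernetes/test/e2e/framework/service"
)

// teardownSequence shows the three overlapping cleanups described above.
func teardownSequence(cs clientset.Interface, jig *e2eservice.TestJig, svc *v1.Service, loadBalancerCreateTimeout time.Duration) error {
	// (1) Deferred in the test: flip the Service back to ClusterIP, which
	// internally calls CleanupServiceResources and releases the cloud LB.
	if err := jig.ChangeServiceType(v1.ServiceTypeClusterIP, loadBalancerCreateTimeout); err != nil {
		return err
	}

	// (2) Then the Service object itself is deleted, which triggers the
	// cloud-provider's own teardown of the same cloud resources.
	if err := cs.CoreV1().Services(svc.Namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// (3) Before this PR, a per-test cleanup hook then ran
	// CleanupServiceResources a third time, racing with (1) and (2).
	return nil
}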

@aojea (Member, Author) commented Nov 3, 2022

What I don't understand is this defer

It seems it is doing an ordered cleanup of the load-balancer deletion:

  1. mutate the service and remove all the loadbalancer, checking that the Service is correctly updated
  2. delete the Service definitively

And after that we also had a cleanup hook (before this PR) which also does CleanupServiceResources, the same as in jig.ChangeServiceType.
I think this is the source of the flakes and races.

Let's first try removing the hook, as in this PR; we can iterate later. WDYT?

@panslava (Contributor) commented Nov 3, 2022

mutate the service and remove all the loadbalancer, checking that the Service is correctly updated

What do you mean by "remove all the loadbalancer"? It is just deleting cloud resources.

@aojea (Member, Author) commented Nov 3, 2022

What do you mean by "remove all the loadbalancer"? It is just deleting cloud resources.

I mean all the GCE resources associated with the Service of type LoadBalancer, since it is no longer a LoadBalancer Service.
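Concretely, the set of GCE objects tied to a Service of type LoadBalancer looks like this, judging from the kube-controller-manager logs quoted above (an illustrative enumeration, not an exhaustive API list):

package loadbalancer_test

// GCE resources created for a Service of type LoadBalancer, as seen in
// the logs quoted in this thread.
var gceLoadBalancerResources = []string{
	"forwarding rule",
	"target pool",
	"HTTP health check",
	"health-check firewall rule",
	"static (or ephemeral) IP address",
}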

@aojea (Member, Author) commented Nov 3, 2022

Can I get an lgtm so I can iterate on this?

@panslava (Contributor) commented Nov 4, 2022

/lgtm
but it will probably need more "iterations"

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2022
@MrHohn (Member) left a review comment:

+1 - with the finalizer mechanism the load balancers should not be leaked even without these extra cleanup steps.
/lgtm
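For context, a minimal sketch of the finalizer mechanism referenced here; the finalizer name matches the in-tree service controller's cleanup finalizer, while the helper is illustrative:

package loadbalancer_test

import v1 "k8s.io/api/core/v1"

// The service controller keeps this finalizer on LoadBalancer Services so
// the API object cannot go away until the cloud resources are released.
const serviceLoadBalancerCleanupFinalizer = "service.kubernetes.io/load-balancer-cleanup"

// hasLoadBalancerCleanupFinalizer reports whether the cloud-provider still
// owes a teardown for this Service's load balancer.
func hasLoadBalancerCleanupFinalizer(svc *v1.Service) bool {
	for _, f := range svc.Finalizers {
		if f == serviceLoadBalancerCleanupFinalizer {
			return true
		}
	}
	return false
}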

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, MrHohn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 54d3de0 into kubernetes:master Nov 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 4, 2022
@aojea (Member, Author) commented Nov 7, 2022

Still too soon to celebrate (https://testgrid.k8s.io/google-gce#gci-gce-ingress&width=5), but no failures since this merged 4 days ago.

@aojea (Member, Author) commented Nov 11, 2022

No more failures since this merged: https://testgrid.k8s.io/google-gce#gci-gce-ingress&width=5

aojea added a commit to aojea/kubernetes that referenced this pull request Nov 25, 2022
ivelichkovich pushed a commit to ivelichkovich/kubernetes that referenced this pull request Dec 20, 2022
jaehnri pushed a commit to jaehnri/kubernetes that referenced this pull request Jan 3, 2023
Successfully merging this pull request may close: [Flaky test] gci-gce-ingress, diffResource (#107530)