
Endpoints do not reconcile with EndpointSlices for Services with selector #127370

Closed
mbrancato opened this issue Sep 15, 2024 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@mbrancato

mbrancato commented Sep 15, 2024

What happened?

#125675 was merged into 1.31 and backported to older release branches, regressing the following releases:

  • 1.31.0+
  • 1.30.3+
  • 1.29.7+
  • 1.28.12+

After upgrading to 1.28 (1.28.13), we have had significant problems with Services that use a selector to target Pods in a Deployment. This likely happens only with very large Services, with 1000-2000 backing Pods. The Endpoints eventually appear to be "stuck" with a very large number of IPs as targets (these are large services), and old Pod IPs are not removed from the Endpoints object, even long after the Pod is gone.

These services are Knative Services, and are operating under frequent and wide scaling. It does seem like Knative is still reading from the Endpoints API.

Deleting the Endpoints object causes it to be recreated, which instantly resolves the problems for downstream components that rely on the Endpoints API.

I can show the behavior, but it is difficult to reproduce this without mimicking the scale and churn (scale out and scale in) of a real service.

Here I have a service, my-app, in the failed state where Endpoints are not being updated. Note that the Service we are concerned with here is my-app-00112-private.

% kubectl -n app get endpointslices | grep my-app-00               
my-app-00112-4vk84                   IPv4          8012,8112,8012               10.32.2.22,10.32.5.21,10.32.31.20 + 997 more...         40m
my-app-00112-private-5662t           IPv4          8012,8022,8112 + 3 more...   10.32.86.67,10.32.86.68                                 34m
my-app-00112-private-9mdgr           IPv4          8012,8022,8112 + 3 more...   10.32.210.18,10.32.210.12,10.32.210.32 + 3 more...      36m
my-app-00112-private-kzkxb           IPv4          8012,8022,8112 + 3 more...   10.32.222.22,10.32.222.21,10.32.222.29                  36m
my-app-00112-private-mrpt4           IPv4          8012,8022,8112 + 3 more...   10.32.217.26,10.32.86.63,10.32.86.64 + 12 more...       36m
my-app-00112-private-qnd6m           IPv4          8012,8022,8112 + 3 more...   10.32.139.22,10.32.139.16,10.32.139.20                  37m
my-app-00112-private-swrhv           IPv4          8012,8022,8112 + 3 more...   10.32.85.54,10.32.85.57,10.32.85.55                     34m
my-app-00112-private-xm2w7           IPv4          8012,8022,8112 + 3 more...   10.32.10.180,10.32.96.167,10.32.200.143 + 4 more...     40m
my-app-00112-private-zlp44           IPv4          8012,8022,8112 + 3 more...   10.32.217.21,10.32.217.29,10.32.217.16 + 6 more...      36m

And if we look at the scale of that Deployment:

% kubectl -n app get deploy my-app-00112-deployment                
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
my-app-00112-deployment   28/28   28           28          40m

This seems correct. But when we look at the Endpoints, it is a much different story:

% kubectl -n app get endpoints my-app-00112-private
NAME                      ENDPOINTS                                                            AGE
my-app-00112-private   10.32.0.70:9091,10.32.10.180:9091,10.32.101.10:9091 + 5907 more...   40m

Nearly 6000 endpoint entries - at about 6 ports per pod (5 of which come from Knative), that is roughly 1000 pods, the over-capacity limit for the Endpoints API. In fact, we DO see this annotation:

Annotations:  endpoints.kubernetes.io/over-capacity: truncated

But that annotation is supposed to be removed when the pod count falls back below 1000. Since this Service uses a selector, and is therefore managed by the Endpoints / EndpointSlices controllers, the Endpoints object comes right back if it is deleted, effectively forcing a fresh reconciliation. And that is exactly what happens.

% kubectl -n app delete endpoints my-app-00112-private
endpoints "my-app-00112-private" deleted
% kubectl -n app get endpoints my-app-00112-private     
NAME                      ENDPOINTS                                                           AGE
my-app-00112-private   10.32.0.70:9091,10.32.10.180:9091,10.32.160.111:9091 + 81 more...   4s

So it does seem like EndpointSlices -> Endpoints reconciliation is broken in some fashion, under these conditions.

What did you expect to happen?

Endpoints should be updated when EndpointSlices are changed, even on scale-in operations where Pods are removed.

How can we reproduce it (as minimally and precisely as possible)?

It is very difficult for me to provide a precise reproducer, but it happens with large Knative services, which are just autoscaled Deployments backing a Service.

Anything else we need to know?

I see audit logs where endpoint-controller/kube-system makes frequent use of the io.k8s.core.v1.endpoints.update permission on that Endpoints resource, but then the audit logs abruptly stop, which seems to indicate that the controller is no longer even attempting to update the Endpoints resource. This appears to happen after a particularly rapid set of calls to update the Endpoints - anywhere from 500ms to 5s apart, about 15-20 times in a minute.

Kubernetes version

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.13-gke.1119000

Cloud provider

GKE

OS version

Google COS

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mbrancato mbrancato added the kind/bug Categorizes issue or PR as related to a bug. label Sep 15, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 15, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 15, 2024
@mbrancato
Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 15, 2024
@mbrancato
Author

This was cherry-picked into 1.28 and released in 1.28.12. I confirmed this was not in a prior working version, 1.27.11.
#125675

@aojea
Member

aojea commented Sep 15, 2024

cc: @tnqn @robscott

@mbrancato so, to see if I understand correctly: once the Endpoints object reaches its limit, and the number of endpoints then goes beyond this limit, the system never reconciles?

@aojea
Member

aojea commented Sep 15, 2024

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 15, 2024
@aojea
Member

aojea commented Sep 15, 2024

I can show the behavior, but it is difficult to reproduce this without mimicking the scale and churn (scale out and scale in) of a real service.

Ok, it seems it is not easy to reproduce; the same scenario written as an integration test does not fail on master:

diff --git a/test/integration/endpoints/endpoints_test.go b/test/integration/endpoints/endpoints_test.go
index 2dc71791d77..b22e7d74514 100644
--- a/test/integration/endpoints/endpoints_test.go
+++ b/test/integration/endpoints/endpoints_test.go
@@ -35,6 +35,7 @@ import (
        "k8s.io/kubernetes/pkg/controller/endpoint"
        "k8s.io/kubernetes/test/integration/framework"
        "k8s.io/kubernetes/test/utils/ktesting"
+       netutils "k8s.io/utils/net"
 )
 
 func TestEndpointUpdates(t *testing.T) {
@@ -576,6 +577,177 @@ func TestEndpointWithTerminatingPod(t *testing.T) {
        }
 }
 
+func TestEndpointTruncate(t *testing.T) {
+       // Disable ServiceAccount admission plugin as we don't have serviceaccount controller running.
+       server := kubeapiservertesting.StartTestServerOrDie(t, nil, framework.DefaultTestServerFlags(), framework.SharedEtcd())
+       defer server.TearDownFn()
+
+       client, err := clientset.NewForConfig(server.ClientConfig)
+       if err != nil {
+               t.Fatalf("Error creating clientset: %v", err)
+       }
+
+       informers := informers.NewSharedInformerFactory(client, 0)
+
+       tCtx := ktesting.Init(t)
+       epController := endpoint.NewEndpointController(
+               tCtx,
+               informers.Core().V1().Pods(),
+               informers.Core().V1().Services(),
+               informers.Core().V1().Endpoints(),
+               client,
+               0)
+
+       // Start informer and controllers
+       informers.Start(tCtx.Done())
+       go epController.Run(tCtx, 1)
+
+       // Create namespace
+       ns := framework.CreateNamespaceOrDie(client, "test-endpoints-truncate", t)
+       defer framework.DeleteNamespaceOrDie(client, ns, t)
+
+       // Create a pod with labels
+       basePod := &v1.Pod{
+               ObjectMeta: metav1.ObjectMeta{
+                       Name:   "test-pod",
+                       Labels: labelMap(),
+               },
+               Spec: v1.PodSpec{
+                       NodeName: "fake-node",
+                       Containers: []v1.Container{
+                               {
+                                       Name:  "fakename",
+                                       Image: "fakeimage",
+                                       Ports: []v1.ContainerPort{
+                                               {
+                                                       Name:          "port-443",
+                                                       ContainerPort: 443,
+                                               },
+                                       },
+                               },
+                       },
+               },
+               Status: v1.PodStatus{
+                       Phase: v1.PodRunning,
+                       Conditions: []v1.PodCondition{
+                               {
+                                       Type:   v1.PodReady,
+                                       Status: v1.ConditionTrue,
+                               },
+                       },
+                       PodIP: "10.0.0.1",
+                       PodIPs: []v1.PodIP{
+                               {
+                                       IP: "10.0.0.1",
+                               },
+                       },
+               },
+       }
+
+       // create 1001 Pods to reach endpoint max capacity that is set to 1000
+       baseIP := netutils.BigForIP(netutils.ParseIPSloppy("10.0.0.1"))
+       for i := 0; i < 1001; i++ {
+               pod := basePod.DeepCopy()
+               pod.Name = fmt.Sprintf("%s-%d", basePod.Name, i)
+               podIP := netutils.AddIPOffset(baseIP, i).String()
+               pod.Status.PodIP = podIP
+               pod.Status.PodIPs[0] = v1.PodIP{IP: podIP}
+               createdPod, err := client.CoreV1().Pods(ns.Name).Create(tCtx, pod, metav1.CreateOptions{})
+               if err != nil {
+                       t.Fatalf("Failed to create pod %s: %v", pod.Name, err)
+               }
+
+               createdPod.Status = pod.Status
+               _, err = client.CoreV1().Pods(ns.Name).UpdateStatus(tCtx, createdPod, metav1.UpdateOptions{})
+               if err != nil {
+                       t.Fatalf("Failed to update status of pod %s: %v", pod.Name, err)
+               }
+       }
+
+       // Create a service associated to the pod
+       svc := &v1.Service{
+               ObjectMeta: metav1.ObjectMeta{
+                       Name:      "test-service",
+                       Namespace: ns.Name,
+                       Labels: map[string]string{
+                               "foo": "bar",
+                       },
+               },
+               Spec: v1.ServiceSpec{
+                       Selector: map[string]string{
+                               "foo": "bar",
+                       },
+                       Ports: []v1.ServicePort{
+                               {Name: "port-443", Port: 443, Protocol: "TCP", TargetPort: intstr.FromInt32(443)},
+                       },
+               },
+       }
+       _, err = client.CoreV1().Services(ns.Name).Create(tCtx, svc, metav1.CreateOptions{})
+       if err != nil {
+               t.Fatalf("Failed to create service %s: %v", svc.Name, err)
+       }
+
+       // poll until associated Endpoints to the previously created Service exists
+       if err := wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
+               endpoints, err := client.CoreV1().Endpoints(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
+               if err != nil {
+                       return false, nil
+               }
+
+               numEndpoints := 0
+               for _, subset := range endpoints.Subsets {
+                       numEndpoints += len(subset.Addresses)
+               }
+
+               if numEndpoints != 1000 {
+                       return false, nil
+               }
+
+               truncated, ok := endpoints.Annotations[v1.EndpointsOverCapacity]
+               if !ok || truncated != "truncated" {
+                       return false, nil
+               }
+               return true, nil
+       }); err != nil {
+               t.Fatalf("endpoints not found: %v", err)
+       }
+
+       // delete 501 Pods
+       for i := 500; i < 1001; i++ {
+               podName := fmt.Sprintf("%s-%d", basePod.Name, i)
+               err = client.CoreV1().Pods(ns.Name).Delete(tCtx, podName, metav1.DeleteOptions{})
+               if err != nil {
+                       t.Fatalf("error deleting test pod: %v", err)
+               }
+       }
+
+       // poll until endpoints for deleted Pod are no longer in Endpoints.
+       if err := wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
+               endpoints, err := client.CoreV1().Endpoints(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
+               if err != nil {
+                       return false, nil
+               }
+
+               numEndpoints := 0
+               for _, subset := range endpoints.Subsets {
+                       numEndpoints += len(subset.Addresses)
+               }
+
+               if numEndpoints != 500 {
+                       return false, nil
+               }
+
+               truncated, ok := endpoints.Annotations[v1.EndpointsOverCapacity]
+               if ok || truncated == "truncated" {
+                       return false, nil
+               }
+
+               return true, nil
+       }); err != nil {
+               t.Fatalf("error checking for no endpoints with terminating pods: %v", err)
+       }
+}
+

We need to be able to reproduce this condition, but #125675 does indeed seem a good candidate. @mbrancato can you please ping me on Slack (@aojea) to see if we can get more details from the GKE side?

@mbrancato
Author

@mbrancato so, to see if I understand correctly: once the Endpoints object reaches its limit, and the number of endpoints then goes beyond this limit, the system never reconciles?

@aojea I actually think the rapid updates might be what causes it to stop updating. Following the RBAC audit log, after a rapid set of updates to the Endpoints, the controller stops updating them altogether. However, the EndpointSlices stay up to date.

My current workaround is a CronJob that simply deletes the Endpoints resource every 5 minutes, but that is not ideal.

@aojea
Member

aojea commented Sep 16, 2024

@mbrancato can you please look for these messages in the logs?

return fmt.Errorf("endpoints informer cache is out of date, resource version %s already processed for endpoints %s", currentEndpoints.ResourceVersion, key)

logger.V(2).Info("Error syncing endpoints, retrying", "service", klog.KRef(ns, name), "err", err)

logger.Info("Dropping service out of the queue", "service", klog.KRef(ns, name), "err", err)

@tnqn
Member

tnqn commented Sep 16, 2024

This appears to be another scenario where an endpoints update request is a no-op, resulting in no change to the resourceVersion upon successful execution. It's because, when the number of endpoints exceeds 1000, the following check always returns false; newEndpoints then gets truncated and becomes identical to currentEndpoints.

endpointSubsetsEqualIgnoreResourceVersion(currentEndpoints.Subsets, subsets) &&
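
To make the mechanism concrete, here is a minimal, hedged sketch of the apiserver behaviour this hinges on (illustrative only, not from this thread; it assumes a reachable cluster and a default kubeconfig, and reuses the namespace and object name from the report above): writing an object back unchanged is accepted, but resourceVersion is not bumped, so informers never see a new event and the controller's "already processed this resource version" check can never pass again.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes ~/.kube/config points at a cluster where this Endpoints object exists.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ep, err := client.CoreV1().Endpoints("app").Get(context.TODO(), "my-app-00112-private", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Write the object back without modifying it: the update succeeds, but the
	// apiserver detects no change and keeps the old resourceVersion, so no
	// watch event is delivered to informers.
	updated, err := client.CoreV1().Endpoints("app").Update(context.TODO(), ep, metav1.UpdateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("resourceVersion unchanged:", ep.ResourceVersion == updated.ResourceVersion) // expected: true
}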

@aojea this can be reproduced by inserting dummy Service updates before "delete 501 Pods" in the above integration test.
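
A hypothetical sketch of what such dummy updates could look like, dropped into the test above just before the "delete 501 Pods" loop (the exact patch may differ; the annotation key here is made up):

	// Dummy, annotation-only Service updates: the desired endpoints do not
	// change, but with >1000 addresses the controller's subsets comparison
	// still reports a difference, so each update triggers an Endpoints write
	// that the apiserver treats as a no-op.
	for i := 0; i < 5; i++ {
		svcCopy, err := client.CoreV1().Services(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
		if err != nil {
			t.Fatalf("Failed to get service %s: %v", svc.Name, err)
		}
		if svcCopy.Annotations == nil {
			svcCopy.Annotations = map[string]string{}
		}
		svcCopy.Annotations["dummy-update"] = fmt.Sprintf("%d", i) // illustrative key
		if _, err := client.CoreV1().Services(ns.Name).Update(tCtx, svcCopy, metav1.UpdateOptions{}); err != nil {
			t.Fatalf("Failed to update service %s: %v", svc.Name, err)
		}
		time.Sleep(time.Second)
	}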

#127103 should fix it, and may need to be accelerated.

@aojea
Member

aojea commented Sep 16, 2024

@tnqn can you provide @M00nF1sh with the patch containing your modifications to the integration test, as a reproducer?

@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Sep 16, 2024
@mbrancato
Author

@aojea Unfortunately, I do not think GKE exposes the logs from the endpoint-controller - outside of the RBAC usage of its service account, no info or error logs from those files are being captured by Stackdriver / Cloud Logging (and I don't think we have any exclusion filter).

I was able to adjust your test and reproduce the issue, and the error below, locally by doing the following:

  • create 1000 pods synchronously (async seems to overload the api server / etcd in my local tests)
  • asynchronously add / remove a portion of the pods, delaying 500ms - 3s between attempts
  • scale the pods back down to a reasonable number (<1000)
  • attempt to wait for the endpoints to update correctly (they never do)

endpoints_controller.go:353: I0916 07:53:18.628099] Error syncing endpoints, retrying service="test-endpoints-large-rapid-changes/test-service" err="endpoints informer cache is out of date, resource version 2560 already processed for endpoints test-endpoints-large-rapid-changes/test-service"

The goal of the async changes is to mimic what ReplicaSets do, since that is ultimately the resource in play here.

mbrancato@56f20fd?diff=unified&w=0
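
For reference, the churn goroutine has roughly this shape (the actual change is in the commit above; the counts, delays, and index range here are illustrative, and the pod status update is omitted for brevity):

	// Mimic ReplicaSet churn while the Service is over the 1000-address
	// Endpoints capacity: repeatedly delete and recreate a slice of the pods
	// with 500ms-3s pauses. Requires "math/rand" in addition to the imports
	// already used by the test above.
	stopChurn := make(chan struct{})
	go func() {
		for {
			select {
			case <-stopChurn:
				return
			default:
			}
			for i := 900; i < 950; i++ {
				name := fmt.Sprintf("%s-%d", basePod.Name, i)
				_ = client.CoreV1().Pods(ns.Name).Delete(tCtx, name, metav1.DeleteOptions{})
				time.Sleep(time.Duration(500+rand.Intn(2500)) * time.Millisecond)

				pod := basePod.DeepCopy()
				pod.Name = name
				pod.Status.PodIP = fmt.Sprintf("10.0.10.%d", i-900) // illustrative IPs
				pod.Status.PodIPs = []v1.PodIP{{IP: pod.Status.PodIP}}
				if _, err := client.CoreV1().Pods(ns.Name).Create(tCtx, pod, metav1.CreateOptions{}); err == nil {
					time.Sleep(time.Duration(500+rand.Intn(2500)) * time.Millisecond)
				}
			}
		}
	}()
	// ...then delete pods back below 1000, close(stopChurn), and poll the
	// Endpoints object as in TestEndpointTruncate above; on affected versions
	// it never converges.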

@tnqn
Member

tnqn commented Sep 16, 2024

@tnqn can you provide @M00nF1sh with the patch containing your modifications to the integration test, as a reproducer?

Done: #127103 (comment)

@aojea
Member

aojea commented Sep 18, 2024

This is fixed by #127417

@aojea
Member

aojea commented Sep 18, 2024

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
