
Endpoints do not reconcile with EndpointSlices for Services with selector #127370

Closed
mbrancato opened this issue Sep 15, 2024 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@mbrancato

mbrancato commented Sep 15, 2024

What happened?

#125675 was merged into 1.31 and backported to older release branches, regressing the following releases:

  • 1.31.0+
  • 1.30.3+
  • 1.29.7+
  • 1.28.12+

After upgrading to 1.28 (1.28.13), we have had significant problems with Services that use a selector to target Pods in a Deployment. This likely happens only with very large Services, with 1000-2000 backing Pods. The Endpoints eventually appear to be "stuck" with a very large number of IPs as targets (these are large services), and old Pod IPs are not removed from the Endpoints object, even long after the Pod is gone.

These services are Knative Services, and are operating under frequent and wide scaling. It does seem like Knative is still reading from the Endpoints API.

Deleting the Endpoints object causes it to be recreated, which instantly resolves the problems for downstream components that rely on the Endpoints API.

I can show the behavior, but it is difficult to reproduce this without mimicking the scale and churn (scale out and scale in) of a real service.

Here I have a service, my-app, in the failed state where Endpoints are not being updated. Note that the Service we are concerned with here is my-app-00112-private.

% kubectl -n app get endpointslices | grep my-app-00               
my-app-00112-4vk84                   IPv4          8012,8112,8012               10.32.2.22,10.32.5.21,10.32.31.20 + 997 more...         40m
my-app-00112-private-5662t           IPv4          8012,8022,8112 + 3 more...   10.32.86.67,10.32.86.68                                 34m
my-app-00112-private-9mdgr           IPv4          8012,8022,8112 + 3 more...   10.32.210.18,10.32.210.12,10.32.210.32 + 3 more...      36m
my-app-00112-private-kzkxb           IPv4          8012,8022,8112 + 3 more...   10.32.222.22,10.32.222.21,10.32.222.29                  36m
my-app-00112-private-mrpt4           IPv4          8012,8022,8112 + 3 more...   10.32.217.26,10.32.86.63,10.32.86.64 + 12 more...       36m
my-app-00112-private-qnd6m           IPv4          8012,8022,8112 + 3 more...   10.32.139.22,10.32.139.16,10.32.139.20                  37m
my-app-00112-private-swrhv           IPv4          8012,8022,8112 + 3 more...   10.32.85.54,10.32.85.57,10.32.85.55                     34m
my-app-00112-private-xm2w7           IPv4          8012,8022,8112 + 3 more...   10.32.10.180,10.32.96.167,10.32.200.143 + 4 more...     40m
my-app-00112-private-zlp44           IPv4          8012,8022,8112 + 3 more...   10.32.217.21,10.32.217.29,10.32.217.16 + 6 more...      36m

And if we look at the scale of that Deployment:

% kubectl -n app get deploy my-app-00112-deployment                
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
my-app-00112-deployment   28/28   28           28          40m

This seems correct. But when we look at the Endpoints, it is a much different story:

% kubectl -n app get endpoints my-app-00112-private
NAME                      ENDPOINTS                                                            AGE
my-app-00112-private   10.32.0.70:9091,10.32.10.180:9091,10.32.101.10:9091 + 5907 more...   40m

Nearly 6000 endpoint entries - at about 6 ports per pod (5 of which come from Knative), that is roughly 1000 pods, the over-capacity limit for the Endpoints API. In fact, we DO see this annotation:

Annotations:  endpoints.kubernetes.io/over-capacity: truncated

But that annotation is supposed to be removed when the pod count falls back below 1000. Since this Service uses a selector, and is therefore managed by the Endpoints / EndpointSlices controllers, the Endpoints object comes right back if it is deleted, effectively forcing a fresh reconciliation. And that is exactly what happens.

% kubectl -n app delete endpoints my-app-00112-private
endpoints "my-app-00112-private" deleted
% kubectl -n app get endpoints my-app-00112-private     
NAME                      ENDPOINTS                                                           AGE
my-app-00112-private   10.32.0.70:9091,10.32.10.180:9091,10.32.160.111:9091 + 81 more...   4s

So it does seem like EndpointSlices -> Endpoints reconciliation is broken in some fashion, under these conditions.

What did you expect to happen?

Endpoints should be updated when EndpointSlices are changed, even on scale-in operations where Pods are removed.

How can we reproduce it (as minimally and precisely as possible)?

It is very difficult for me to provide a precise reproducer, but it happens with large Knative services, which are just autoscaled Deployments backing a Service.

Anything else we need to know?

I see audit logs where endpoint-controller/kube-system makes frequent use of the io.k8s.core.v1.endpoints.update permission on that Endpoints resource, but then the audit logs abruptly stop, which seems to indicate that the controller is no longer even attempting to update the Endpoints resource. This appears to happen after a particularly rapid set of calls to update the Endpoints - anywhere from 500ms to 5s apart, about 15-20 times in a minute.

Kubernetes version

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.13-gke.1119000

Cloud provider

GKE

OS version

Google COS

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mbrancato mbrancato added the kind/bug Categorizes issue or PR as related to a bug. label Sep 15, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 15, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 15, 2024
@mbrancato
Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 15, 2024
@mbrancato
Author

This was cherry-picked into 1.28 and released in 1.28.12. I confirmed this was not in a prior working version, 1.27.11.
#125675

@aojea
Member

aojea commented Sep 15, 2024

cc: @tnqn @robscott

@mbrancato so, to see if I understand correctly: once the Endpoints object reaches its limit, and the number of endpoints then goes beyond this limit, the system never reconciles?

@aojea
Member

aojea commented Sep 15, 2024

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 15, 2024
@aojea
Member

aojea commented Sep 15, 2024

I can show the behavior, but it is difficult to reproduce this without mimicking the scale and churn (scale out and scale in) of a real service.

Ok, it seems it is not easy to reproduce; the same scenario written as an integration test does not fail on master:

diff --git a/test/integration/endpoints/endpoints_test.go b/test/integration/endpoints/endpoints_test.go
index 2dc71791d77..b22e7d74514 100644
--- a/test/integration/endpoints/endpoints_test.go
+++ b/test/integration/endpoints/endpoints_test.go
@@ -35,6 +35,7 @@ import (
        "k8s.io/kubernetes/pkg/controller/endpoint"
        "k8s.io/kubernetes/test/integration/framework"
        "k8s.io/kubernetes/test/utils/ktesting"
+       netutils "k8s.io/utils/net"
 )
 
 func TestEndpointUpdates(t *testing.T) {
@@ -576,6 +577,177 @@ func TestEndpointWithTerminatingPod(t *testing.T) {
        }
 }
 
+func TestEndpointTruncate(t *testing.T) {
+       // Disable ServiceAccount admission plugin as we don't have serviceaccount controller running.
+       server := kubeapiservertesting.StartTestServerOrDie(t, nil, framework.DefaultTestServerFlags(), framework.SharedEtcd())
+       defer server.TearDownFn()
+
+       client, err := clientset.NewForConfig(server.ClientConfig)
+       if err != nil {
+               t.Fatalf("Error creating clientset: %v", err)
+       }
+
+       informers := informers.NewSharedInformerFactory(client, 0)
+
+       tCtx := ktesting.Init(t)
+       epController := endpoint.NewEndpointController(
+               tCtx,
+               informers.Core().V1().Pods(),
+               informers.Core().V1().Services(),
+               informers.Core().V1().Endpoints(),
+               client,
+               0)
+
+       // Start informer and controllers
+       informers.Start(tCtx.Done())
+       go epController.Run(tCtx, 1)
+
+       // Create namespace
+       ns := framework.CreateNamespaceOrDie(client, "test-endpoints-truncate", t)
+       defer framework.DeleteNamespaceOrDie(client, ns, t)
+
+       // Create a pod with labels
+       basePod := &v1.Pod{
+               ObjectMeta: metav1.ObjectMeta{
+                       Name:   "test-pod",
+                       Labels: labelMap(),
+               },
+               Spec: v1.PodSpec{
+                       NodeName: "fake-node",
+                       Containers: []v1.Container{
+                               {
+                                       Name:  "fakename",
+                                       Image: "fakeimage",
+                                       Ports: []v1.ContainerPort{
+                                               {
+                                                       Name:          "port-443",
+                                                       ContainerPort: 443,
+                                               },
+                                       },
+                               },
+                       },
+               },
+               Status: v1.PodStatus{
+                       Phase: v1.PodRunning,
+                       Conditions: []v1.PodCondition{
+                               {
+                                       Type:   v1.PodReady,
+                                       Status: v1.ConditionTrue,
+                               },
+                       },
+                       PodIP: "10.0.0.1",
+                       PodIPs: []v1.PodIP{
+                               {
+                                       IP: "10.0.0.1",
+                               },
+                       },
+               },
+       }
+
+       // create 1001 Pods to reach endpoint max capacity that is set to 1000
+       baseIP := netutils.BigForIP(netutils.ParseIPSloppy("10.0.0.1"))
+       for i := 0; i < 1001; i++ {
+               pod := basePod.DeepCopy()
+               pod.Name = fmt.Sprintf("%s-%d", basePod.Name, i)
+               podIP := netutils.AddIPOffset(baseIP, i).String()
+               pod.Status.PodIP = podIP
+               pod.Status.PodIPs[0] = v1.PodIP{IP: podIP}
+               createdPod, err := client.CoreV1().Pods(ns.Name).Create(tCtx, pod, metav1.CreateOptions{})
+               if err != nil {
+                       t.Fatalf("Failed to create pod %s: %v", pod.Name, err)
+               }
+
+               createdPod.Status = pod.Status
+               _, err = client.CoreV1().Pods(ns.Name).UpdateStatus(tCtx, createdPod, metav1.UpdateOptions{})
+               if err != nil {
+                       t.Fatalf("Failed to update status of pod %s: %v", pod.Name, err)
+               }
+       }
+
+       // Create a service associated to the pod
+       svc := &v1.Service{
+               ObjectMeta: metav1.ObjectMeta{
+                       Name:      "test-service",
+                       Namespace: ns.Name,
+                       Labels: map[string]string{
+                               "foo": "bar",
+                       },
+               },
+               Spec: v1.ServiceSpec{
+                       Selector: map[string]string{
+                               "foo": "bar",
+                       },
+                       Ports: []v1.ServicePort{
+                               {Name: "port-443", Port: 443, Protocol: "TCP", TargetPort: intstr.FromInt32(443)},
+                       },
+               },
+       }
+       _, err = client.CoreV1().Services(ns.Name).Create(tCtx, svc, metav1.CreateOptions{})
+       if err != nil {
+               t.Fatalf("Failed to create service %s: %v", svc.Name, err)
+       }
+
+       // poll until associated Endpoints to the previously created Service exists
+       if err := wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
+               endpoints, err := client.CoreV1().Endpoints(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
+               if err != nil {
+                       return false, nil
+               }
+
+               numEndpoints := 0
+               for _, subset := range endpoints.Subsets {
+                       numEndpoints += len(subset.Addresses)
+               }
+
+               if numEndpoints != 1000 {
+                       return false, nil
+               }
+
+               truncated, ok := endpoints.Annotations[v1.EndpointsOverCapacity]
+               if !ok || truncated != "truncated" {
+                       return false, nil
+               }
+               return true, nil
+       }); err != nil {
+               t.Fatalf("endpoints not found: %v", err)
+       }
+
+       // delete 501 Pods
+       for i := 500; i < 1001; i++ {
+               podName := fmt.Sprintf("%s-%d", basePod.Name, i)
+               err = client.CoreV1().Pods(ns.Name).Delete(tCtx, podName, metav1.DeleteOptions{})
+               if err != nil {
+                       t.Fatalf("error deleting test pod: %v", err)
+               }
+       }
+
+       // poll until endpoints for deleted Pod are no longer in Endpoints.
+       if err := wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
+               endpoints, err := client.CoreV1().Endpoints(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
+               if err != nil {
+                       return false, nil
+               }
+
+               numEndpoints := 0
+               for _, subset := range endpoints.Subsets {
+                       numEndpoints += len(subset.Addresses)
+               }
+
+               if numEndpoints != 500 {
+                       return false, nil
+               }
+
+               truncated, ok := endpoints.Annotations[v1.EndpointsOverCapacity]
+               if ok || truncated == "truncated" {
+                       return false, nil
+               }
+
+               return true, nil
+       }); err != nil {
+               t.Fatalf("error checking for no endpoints with terminating pods: %v", err)
+       }
+}
+

We need to be able to reproduce this condition, but #125675 does indeed seem a good candidate. @mbrancato can you please ping me on Slack (@aojea) to see if we can get more details from the GKE side?

@mbrancato
Author

@mbrancato so, to see if I understand correctly: once the Endpoints object reaches its limit, and the number of endpoints then goes beyond this limit, the system never reconciles?

@aojea I actually think the rapid updates might be what causes it to stop updating. Following the RBAC audit log, after a rapid set of updates to the Endpoints, the controller stops updating them altogether. However, the EndpointSlices stay up to date.

My current workaround is a CronJob that simply deletes the Endpoints resource every 5 minutes, but that is not ideal.

@aojea
Member

aojea commented Sep 16, 2024

@mbrancato can you please look for these messages in the logs?

return fmt.Errorf("endpoints informer cache is out of date, resource version %s already processed for endpoints %s", currentEndpoints.ResourceVersion, key)

logger.V(2).Info("Error syncing endpoints, retrying", "service", klog.KRef(ns, name), "err", err)

logger.Info("Dropping service out of the queue", "service", klog.KRef(ns, name), "err", err)

@tnqn
Member

tnqn commented Sep 16, 2024

This appears to be another scenario where an endpoints update request is a no-op, resulting in no change to the resourceVersion upon successful execution. It's because, when the number of endpoints exceeds 1000, the following check always returns false; newEndpoints then gets truncated and becomes identical to currentEndpoints.

endpointSubsetsEqualIgnoreResourceVersion(currentEndpoints.Subsets, subsets) &&
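
To make the mechanism concrete, here is a minimal, hedged sketch of the apiserver behaviour this hinges on (illustrative only, not from this thread; it assumes a reachable cluster and a default kubeconfig, and reuses the namespace and object name from the report above): writing an object back unchanged is accepted, but resourceVersion is not bumped, so informers never see a new event and the controller's "already processed this resource version" check can never pass again.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes ~/.kube/config points at a cluster where this Endpoints object exists.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ep, err := client.CoreV1().Endpoints("app").Get(context.TODO(), "my-app-00112-private", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Write the object back without modifying it: the update succeeds, but the
	// apiserver detects no change and keeps the old resourceVersion, so no
	// watch event is delivered to informers.
	updated, err := client.CoreV1().Endpoints("app").Update(context.TODO(), ep, metav1.UpdateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("resourceVersion unchanged:", ep.ResourceVersion == updated.ResourceVersion) // expected: true
}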

@aojea this can be reproduced by inserting dummy Service updates before "delete 501 Pods" in the above integration test.
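
A hypothetical sketch of what such dummy updates could look like, dropped into the test above just before the "delete 501 Pods" loop (the exact patch may differ; the annotation key here is made up):

	// Dummy, annotation-only Service updates: the desired endpoints do not
	// change, but with >1000 addresses the controller's subsets comparison
	// still reports a difference, so each update triggers an Endpoints write
	// that the apiserver treats as a no-op.
	for i := 0; i < 5; i++ {
		svcCopy, err := client.CoreV1().Services(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
		if err != nil {
			t.Fatalf("Failed to get service %s: %v", svc.Name, err)
		}
		if svcCopy.Annotations == nil {
			svcCopy.Annotations = map[string]string{}
		}
		svcCopy.Annotations["dummy-update"] = fmt.Sprintf("%d", i) // illustrative key
		if _, err := client.CoreV1().Services(ns.Name).Update(tCtx, svcCopy, metav1.UpdateOptions{}); err != nil {
			t.Fatalf("Failed to update service %s: %v", svc.Name, err)
		}
		time.Sleep(time.Second)
	}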

#127103 should fix it, and may need to be accelerated.

@aojea
Member

aojea commented Sep 16, 2024

@tnqn can you provide @M00nF1sh with the patch containing your modifications to the integration test, as a reproducer?

@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Sep 16, 2024
@mbrancato
Author

@aojea Unfortunately, I do not think GKE exposes the logs from the endpoint-controller - outside of the RBAC usage of its service account, no info or error logs from those files are being captured by Stackdriver / Cloud Logging (and I don't think we have any exclusion filter).

I was able to adjust your test and reproduce the issue, and the error below, locally by doing the following:

  • create 1000 pods synchronously (async seems to overload the api server / etcd in my local tests)
  • asynchronously add / remove a portion of the pods, delaying 500ms - 3s between attempts
  • scale the pods back down to a reasonable number (<1000)
  • attempt to wait for the endpoints to update correctly (they never do)

endpoints_controller.go:353: I0916 07:53:18.628099] Error syncing endpoints, retrying service="test-endpoints-large-rapid-changes/test-service" err="endpoints informer cache is out of date, resource version 2560 already processed for endpoints test-endpoints-large-rapid-changes/test-service"

The goal of the async changes is to mimic what ReplicaSets do, since that is ultimately the resource in play here.

mbrancato@56f20fd?diff=unified&w=0
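
For reference, the churn goroutine has roughly this shape (the actual change is in the commit above; the counts, delays, and index range here are illustrative, and the pod status update is omitted for brevity):

	// Mimic ReplicaSet churn while the Service is over the 1000-address
	// Endpoints capacity: repeatedly delete and recreate a slice of the pods
	// with 500ms-3s pauses. Requires "math/rand" in addition to the imports
	// already used by the test above.
	stopChurn := make(chan struct{})
	go func() {
		for {
			select {
			case <-stopChurn:
				return
			default:
			}
			for i := 900; i < 950; i++ {
				name := fmt.Sprintf("%s-%d", basePod.Name, i)
				_ = client.CoreV1().Pods(ns.Name).Delete(tCtx, name, metav1.DeleteOptions{})
				time.Sleep(time.Duration(500+rand.Intn(2500)) * time.Millisecond)

				pod := basePod.DeepCopy()
				pod.Name = name
				pod.Status.PodIP = fmt.Sprintf("10.0.10.%d", i-900) // illustrative IPs
				pod.Status.PodIPs = []v1.PodIP{{IP: pod.Status.PodIP}}
				if _, err := client.CoreV1().Pods(ns.Name).Create(tCtx, pod, metav1.CreateOptions{}); err == nil {
					time.Sleep(time.Duration(500+rand.Intn(2500)) * time.Millisecond)
				}
			}
		}
	}()
	// ...then delete pods back below 1000, close(stopChurn), and poll the
	// Endpoints object as in TestEndpointTruncate above; on affected versions
	// it never converges.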

@tnqn
Member

tnqn commented Sep 16, 2024

@tnqn can you provide @M00nF1sh with the patch containing your modifications to the integration test, as a reproducer?

Done: #127103 (comment)

@aojea
Member

aojea commented Sep 18, 2024

This is fixed by #127417

@aojea
Member

aojea commented Sep 18, 2024

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
