Endpoints do not reconcile with EndpointSlices for Services with selector #127370
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/sig network
This was cherry-picked into 1.28 and released in 1.28.12. I confirmed this was not in a prior working version, 1.27.11.
@mbrancato so, to see if I understand correctly, once the Endpoints object reaches its limit, and the number of endpoints present then goes beyond that limit, the system never reconciles?
/priority important-soon
Ok, it seems it is not easy to reproduce; the same scenario as an integration test does not fail in master:
diff --git a/test/integration/endpoints/endpoints_test.go b/test/integration/endpoints/endpoints_test.go
index 2dc71791d77..b22e7d74514 100644
--- a/test/integration/endpoints/endpoints_test.go
+++ b/test/integration/endpoints/endpoints_test.go
@@ -35,6 +35,7 @@ import (
"k8s.io/kubernetes/pkg/controller/endpoint"
"k8s.io/kubernetes/test/integration/framework"
"k8s.io/kubernetes/test/utils/ktesting"
+ netutils "k8s.io/utils/net"
)
func TestEndpointUpdates(t *testing.T) {
@@ -576,6 +577,177 @@ func TestEndpointWithTerminatingPod(t *testing.T) {
}
}
+func TestEndpointTruncate(t *testing.T) {
+ // Disable ServiceAccount admission plugin as we don't have serviceaccount controller running.
+ server := kubeapiservertesting.StartTestServerOrDie(t, nil, framework.DefaultTestServerFlags(), framework.SharedEtcd())
+ defer server.TearDownFn()
+
+ client, err := clientset.NewForConfig(server.ClientConfig)
+ if err != nil {
+ t.Fatalf("Error creating clientset: %v", err)
+ }
+
+ informers := informers.NewSharedInformerFactory(client, 0)
+
+ tCtx := ktesting.Init(t)
+ epController := endpoint.NewEndpointController(
+ tCtx,
+ informers.Core().V1().Pods(),
+ informers.Core().V1().Services(),
+ informers.Core().V1().Endpoints(),
+ client,
+ 0)
+
+ // Start informer and controllers
+ informers.Start(tCtx.Done())
+ go epController.Run(tCtx, 1)
+
+ // Create namespace
+ ns := framework.CreateNamespaceOrDie(client, "test-endpoints-truncate", t)
+ defer framework.DeleteNamespaceOrDie(client, ns, t)
+
+ // Create a pod with labels
+ basePod := &v1.Pod{
+ ObjectMeta: metav1.ObjectMeta{
+ Name: "test-pod",
+ Labels: labelMap(),
+ },
+ Spec: v1.PodSpec{
+ NodeName: "fake-node",
+ Containers: []v1.Container{
+ {
+ Name: "fakename",
+ Image: "fakeimage",
+ Ports: []v1.ContainerPort{
+ {
+ Name: "port-443",
+ ContainerPort: 443,
+ },
+ },
+ },
+ },
+ },
+ Status: v1.PodStatus{
+ Phase: v1.PodRunning,
+ Conditions: []v1.PodCondition{
+ {
+ Type: v1.PodReady,
+ Status: v1.ConditionTrue,
+ },
+ },
+ PodIP: "10.0.0.1",
+ PodIPs: []v1.PodIP{
+ {
+ IP: "10.0.0.1",
+ },
+ },
+ },
+ }
+
+ // create 1001 Pods to reach endpoint max capacity that is set to 1000
+ baseIP := netutils.BigForIP(netutils.ParseIPSloppy("10.0.0.1"))
+ for i := 0; i < 1001; i++ {
+ pod := basePod.DeepCopy()
+ pod.Name = fmt.Sprintf("%s-%d", basePod.Name, i)
+ podIP := netutils.AddIPOffset(baseIP, i).String()
+ pod.Status.PodIP = podIP
+ pod.Status.PodIPs[0] = v1.PodIP{IP: podIP}
+ createdPod, err := client.CoreV1().Pods(ns.Name).Create(tCtx, pod, metav1.CreateOptions{})
+ if err != nil {
+ t.Fatalf("Failed to create pod %s: %v", pod.Name, err)
+ }
+
+ createdPod.Status = pod.Status
+ _, err = client.CoreV1().Pods(ns.Name).UpdateStatus(tCtx, createdPod, metav1.UpdateOptions{})
+ if err != nil {
+ t.Fatalf("Failed to update status of pod %s: %v", pod.Name, err)
+ }
+ }
+
+ // Create a service associated to the pod
+ svc := &v1.Service{
+ ObjectMeta: metav1.ObjectMeta{
+ Name: "test-service",
+ Namespace: ns.Name,
+ Labels: map[string]string{
+ "foo": "bar",
+ },
+ },
+ Spec: v1.ServiceSpec{
+ Selector: map[string]string{
+ "foo": "bar",
+ },
+ Ports: []v1.ServicePort{
+ {Name: "port-443", Port: 443, Protocol: "TCP", TargetPort: intstr.FromInt32(443)},
+ },
+ },
+ }
+ _, err = client.CoreV1().Services(ns.Name).Create(tCtx, svc, metav1.CreateOptions{})
+ if err != nil {
+ t.Fatalf("Failed to create service %s: %v", svc.Name, err)
+ }
+
+ // poll until associated Endpoints to the previously created Service exists
+ if err := wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
+ endpoints, err := client.CoreV1().Endpoints(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
+ if err != nil {
+ return false, nil
+ }
+
+ numEndpoints := 0
+ for _, subset := range endpoints.Subsets {
+ numEndpoints += len(subset.Addresses)
+ }
+
+ if numEndpoints != 1000 {
+ return false, nil
+ }
+
+ truncated, ok := endpoints.Annotations[v1.EndpointsOverCapacity]
+ if !ok || truncated != "truncated" {
+ return false, nil
+ }
+ return true, nil
+ }); err != nil {
+ t.Fatalf("endpoints not found: %v", err)
+ }
+
+ // delete 501 Pods
+ for i := 500; i < 1001; i++ {
+ podName := fmt.Sprintf("%s-%d", basePod.Name, i)
+ err = client.CoreV1().Pods(ns.Name).Delete(tCtx, podName, metav1.DeleteOptions{})
+ if err != nil {
+ t.Fatalf("error deleting test pod: %v", err)
+ }
+ }
+
+ // poll until endpoints for deleted Pod are no longer in Endpoints.
+ if err := wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
+ endpoints, err := client.CoreV1().Endpoints(ns.Name).Get(tCtx, svc.Name, metav1.GetOptions{})
+ if err != nil {
+ return false, nil
+ }
+
+ numEndpoints := 0
+ for _, subset := range endpoints.Subsets {
+ numEndpoints += len(subset.Addresses)
+ }
+
+ if numEndpoints != 500 {
+ return false, nil
+ }
+
+ truncated, ok := endpoints.Annotations[v1.EndpointsOverCapacity]
+ if ok || truncated == "truncated" {
+ return false, nil
+ }
+
+ return true, nil
+ }); err != nil {
+ t.Fatalf("error checking for no endpoints with terminating pods: %v", err)
+ }
+}
We need to be able to reproduce this condition, but #125675 indeed seems like a good candidate. @mbrancato can you please ping me in Slack (@aojea) to see if we can get more details from the GKE side?
@aojea I actually think it might be due to the rapid updates that are causing it to stop updating. Following the RBAC audit log, after a rapid set of updates to the Endpoints, the controller then stops updating them altogether. However, the EndpointSlices stay up to date. My current workaround is a CronJob that just deletes the Endpoints resource every 5 minutes, but that is non-ideal.
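For reference, the operation that workaround performs is simply deleting the Endpoints object so the controller rebuilds it from scratch. Below is a minimal client-go sketch of that step; the namespace, the service name, and the in-cluster setup are assumptions made to keep the example self-contained, and the real workaround could equally be a kubectl delete run on a schedule.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as it would be when run from a CronJob-created pod
	// with RBAC that allows deleting endpoints.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}

	// Placeholder namespace and name; the endpoints controller recreates the
	// object from the Service's selector shortly after it is deleted.
	if err := client.CoreV1().Endpoints("my-namespace").Delete(context.TODO(),
		"my-app-00112-private", metav1.DeleteOptions{}); err != nil {
		log.Fatalf("deleting endpoints: %v", err)
	}
	log.Println("endpoints deleted; the controller will rebuild them")
}
```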
@mbrancato can you please look for these messages in the logs?
This appears to be another scenario where an endpoints update request is a no-op, resulting in no change to the resourceVersion upon successful execution. It's because, when the number of endpoints exceeds 1000, the following check always returns
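The quoted check itself is not reproduced in this copy of the thread. Purely to illustrate the resourceVersion point (this is not the controller's actual code): resubmitting an unchanged Endpoints object succeeds, but the apiserver stores it as a no-op, so the resourceVersion does not move and nothing keyed on it can tell that a reconcile attempt happened. A rough sketch with hypothetical package and function names:

```go
// Package noopdemo is an illustration only, not the endpoints controller code:
// an Update that carries no changes succeeds without bumping resourceVersion.
package noopdemo

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// NoOpUpdateKeepsResourceVersion reports whether resubmitting the current
// Endpoints object unchanged leaves its resourceVersion untouched.
func NoOpUpdateKeepsResourceVersion(ctx context.Context, c kubernetes.Interface, ns, name string) (bool, error) {
	current, err := c.CoreV1().Endpoints(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	updated, err := c.CoreV1().Endpoints(ns).Update(ctx, current.DeepCopy(), metav1.UpdateOptions{})
	if err != nil {
		return false, err
	}
	// For a no-op update the apiserver does not persist a new revision,
	// so the two resourceVersions compare equal.
	return current.ResourceVersion == updated.ResourceVersion, nil
}
```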
@aojea this can be reproduced by inserting dummy Service updates before "delete 501 Pods" in the above integration test. #127103 should fix it, and may need to be accelerated.
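For anyone else trying this, the "dummy Service updates" could be as simple as a few cosmetic changes to the Service, made in the test above while the Endpoints object is still over capacity and before any pods are deleted. The annotation key below is made up for the example, and the snippet reuses the test's existing client, ns, t, and tCtx variables:

```go
// Trigger several extra reconciliations of the over-capacity Endpoints object
// by making cosmetic updates to the Service before any pods are deleted.
for i := 0; i < 5; i++ {
	svc, err := client.CoreV1().Services(ns.Name).Get(tCtx, "test-service", metav1.GetOptions{})
	if err != nil {
		t.Fatalf("Failed to get service: %v", err)
	}
	if svc.Annotations == nil {
		svc.Annotations = map[string]string{}
	}
	svc.Annotations["example.com/dummy-update"] = fmt.Sprintf("%d", i) // hypothetical key
	if _, err := client.CoreV1().Services(ns.Name).Update(tCtx, svc, metav1.UpdateOptions{}); err != nil {
		t.Fatalf("Failed to update service: %v", err)
	}
}
```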
@aojea Unfortunately, I do not think GKE exposes those logs. I was able to adjust your test and reproduce the issue and an error locally by doing the following:
The goal of the async changes is to mimic what ReplicaSets do, since that is ultimately the resource in play here.
Done: #127103 (comment)
This is fixed by #127417
/close
@aojea: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What happened?
#125675 was merged into 1.31, then backported, and regressed the following releases:
After an upgrade to 1.28 (1.28.13), we have had significant problems with Services that use a selector to target Pods in a Deployment. This is likely happening only with very large services with 1000-2000 backing Pods. The Endpoints eventually appear to be "stuck" with LOTS of IPs as targets (these are large services), and they are not removing old Pod IPs from the Endpoints object, even long after the Pod is gone. These services are Knative Services, and are operating under frequent and wide scaling. It does seem like Knative is still reading from the Endpoints API.
Deleting the Endpoints object causes it to be recreated and instantly resolves problems for downstream consumers that rely on the Endpoints API.
I can show the behavior, but it is difficult to reproduce this without mimicking the scale and churn (scale out and scale in) of a real service.
Here, I have a service my-app - it is in the failed state where Endpoints are not being updated. Note, the source Service we are concerned about here is my-app-00112-private. And if we look at the scale of that Deployment:
This seems correct. But when we look at the Endpoints, it is a much different story:
Nearly 6000 endpoints - at about 6 ports per pod (5 are from Knative, basically), that is 1000 pods, the over-capacity limit for the Endpoints API. In fact, we DO see this annotation:
But that is supposed to come down when pods fall below 1k. Since this service is using a selector, and I think it is managed by the Endpoints / EndpointSlices controllers, the Endpoints object comes right back if it is deleted, effectively forcing reconciliation from EndpointSlices to Endpoints. And that is exactly what happens.
So it does seem like EndpointSlices -> Endpoints reconciliation is broken in some fashion, under these conditions.
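One way to see the mismatch described above is to compare the address count in the Endpoints object with the total across the EndpointSlices that belong to the same Service. A rough diagnostic sketch, assuming a default kubeconfig and using the namespace and service name as placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig, namespace, and service name.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ns, svc := "my-namespace", "my-app-00112-private"

	// Addresses recorded in the legacy Endpoints object.
	ep, err := client.CoreV1().Endpoints(ns).Get(context.TODO(), svc, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	epAddrs := 0
	for _, subset := range ep.Subsets {
		epAddrs += len(subset.Addresses)
	}

	// Endpoints recorded across the EndpointSlices owned by the same Service.
	slices, err := client.DiscoveryV1().EndpointSlices(ns).List(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=" + svc})
	if err != nil {
		log.Fatal(err)
	}
	sliceAddrs := 0
	for _, slice := range slices.Items {
		sliceAddrs += len(slice.Endpoints)
	}

	// If the Endpoints object has stopped reconciling, these counts drift apart.
	fmt.Printf("Endpoints addresses: %d, EndpointSlice endpoints: %d\n", epAddrs, sliceAddrs)
}
```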
What did you expect to happen?
Endpoints should be updated when EndpointSlices are changed, even on scale-in operations where Pods are removed.
How can we reproduce it (as minimally and precisely as possible)?
It is very difficult for me to provide clear details. But it happens with large Knative services, which are just Deployments that are autoscaled and feed a service.
Anything else we need to know?
I see audit logs where endpoint-controller/kube-system is making frequent use of the permission io.k8s.core.v1.endpoints.update on that endpoints resource. But then the audit logs abruptly stop, which seems to indicate that the controller is no longer even attempting to update the Endpoints resource. This appears to happen after a particularly rapid set of calls to update the endpoints - anywhere from 500ms to 5s apart, and about 15-20 times total in a minute.
Kubernetes version
Cloud provider
OS version
Google COS
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)