Scheduling Inconsistency Caused by kube-scheduler Restart #126499

Open
kangzhiqin opened this issue Aug 1, 2024 · 13 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one.
needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
sig/storage Categorizes an issue or PR as relevant to SIG Storage.
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@kangzhiqin

What would you like to be added?

I have a Deployment that needs to run two replicas, with required (hard) pod anti-affinity configured as follows:

      - ephemeral:
          volumeClaimTemplate:
            metadata:
              creationTimestamp: null
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 200Mi
              storageClassName: localcsi
              volumeMode: Filesystem

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - xxx
            topologyKey: kubernetes.io/hostname
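
For reference, a minimal Deployment manifest combining the two fragments above would look roughly like the sketch below; the name, label value, container image, and mount path are placeholders, not values from the original report:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: xxx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: xxx
  template:
    metadata:
      labels:
        app: xxx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - xxx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app                            # placeholder container
        image: example.registry/app:latest   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data                   # placeholder mount path
      volumes:
      - name: data
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 200Mi
              storageClassName: localcsi
              volumeMode: Filesystem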

Why is this needed?

kube-scheduler schedules the ephemeral volume of replica 1 to node 1 but has not yet scheduled the pod of replica 1. At this point, kube-scheduler restarts. After the restart, kube-scheduler schedules the ephemeral volume and pod of replica 2 and places replica 2 on node 1. As a result, replica 1 cannot be scheduled because of the anti-affinity rule.
After kube-scheduler restarts, it does not reload the ephemeral volume placement decisions made before the restart.
Does the community have any suggestions on this issue?

@kangzhiqin kangzhiqin added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 1, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neolit123
Member

/sig scheduling storage

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 1, 2024
@googs1025
Member

Some information is not clear. What is your k8s version? Can you provide a yaml file to reproduce this problem? In addition, what do you mean by restarting kube-scheduler? Can you print out the log?

@googs1025
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Aug 1, 2024
@googs1025
Member

/remove-kind feature

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Aug 1, 2024
@kangzhiqin
Author

Some information is not clear. What is your k8s version? Can you provide a yaml file to reproduce this problem? In addition, what do you mean by restarting kube-scheduler? Can you print out the log?

k8s version: 1.28.1

Problem Description:
kube-scheduler could not reach the kube-apiserver to renew its leader-election lease, which caused it to lose leadership and restart. The log is as follows:

E0730 17:11:13.098467      10 leaderelection.go:369] Failed to update lock: Put "https://172.48.1.30:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0730 17:11:18.315182      10 leaderelection.go:369] Failed to update lock: Put "https://172.48.1.30:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
I0730 17:11:18.315246      10 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0730 17:11:21.723746      10 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-scheduler": the object has been modified; please apply your changes to the latest version and try again
E0730 17:11:21.723810      10 server.go:246] "Leaderelection lost"
I0730 17:11:21.723824      10 scheduling_queue.go:826] "Scheduling queue is closed"
I0730 17:11:23.621602      11 flags.go:64] FLAG: --allow-metric-labels="[]"
I0730 17:11:23.621665      11 flags.go:64] FLAG: --authentication-kubeconfig="/opt/kubernetes/config/scheduler.config"
I0730 17:11:23.621671      11 flags.go:64] FLAG: --authentication-skip-lookup="false"
I0730 17:11:23.621676      11 flags.go:64] FLAG: --authentication-token-webhook-cache-ttl="10s"
I0730 17:11:23.621681      11 flags.go:64] FLAG: --authentication-tolerate-lookup-failure="true"
I0730 17:11:23.621684      11 flags.go:64] FLAG: --authorization-always-allow-paths="[/healthz,/readyz,/livez]"
I0730 17:11:23.621692      11 flags.go:64] FLAG: --authorization-kubeconfig="/opt/kubernetes/config/scheduler.config"
I0730 17:11:23.621695      11 flags.go:64] FLAG: --authorization-webhook-cache-authorized-ttl="10s"
I0730 17:11:23.621700      11 flags.go:64] FLAG: --authorization-webhook-cache-unauthorized-ttl="10s"
I0730 17:11:23.621703      11 flags.go:64] FLAG: --bind-address="127.0.0.1"
I0730 17:11:23.621711      11 flags.go:64] FLAG: --cert-dir=""
I0730 17:11:23.621714      11 flags.go:64] FLAG: --client-ca-file=""
I0730 17:11:23.621718      11 flags.go:64] FLAG: --config="/opt/kubernetes/config/scheduler.yaml"
I0730 17:11:23.621721      11 flags.go:64] FLAG: --contention-profiling="true"
I0730 17:11:23.621724      11 flags.go:64] FLAG: --disabled-metrics="[]"
I0730 17:11:23.621729      11 flags.go:64] FLAG: --feature-gates="ComponentSLIs=false,SchedulerQueueingHints=false"
I0730 17:11:23.621739      11 flags.go:64] FLAG: --help="false"
I0730 17:11:23.621741      11 flags.go:64] FLAG: --http2-max-streams-per-connection="0"
I0730 17:11:23.621750      11 flags.go:64] FLAG: --kube-api-burst="100"
I0730 17:11:23.621754      11 flags.go:64] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0730 17:11:23.621765      11 flags.go:64] FLAG: --kube-api-qps="50"
I0730 17:11:23.621774      11 flags.go:64] FLAG: --kubeconfig="/opt/kubernetes/config/scheduler.config"
I0730 17:11:23.621777      11 flags.go:64] FLAG: --leader-elect="true"
I0730 17:11:23.621780      11 flags.go:64] FLAG: --leader-elect-lease-duration="25s"

Setting aside why the apiserver could not be reached, the question is why the ephemeral volume placement information recorded before the restart is not read back after kube-scheduler restarts.
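
For context, the flags above map to leader-election settings in the KubeSchedulerConfiguration roughly like the sketch below; only leaderElect and leaseDuration correspond to values visible in the log, the remaining values are assumptions:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true          # --leader-elect=true in the log
  leaseDuration: 25s         # --leader-elect-lease-duration=25s in the log
  renewDeadline: 15s         # assumed, not shown in the log
  retryPeriod: 5s            # assumed, not shown in the log
  resourceNamespace: kube-system
  resourceName: kube-scheduler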

@kangzhiqin
Author

The procedure is as follows (a sketch of the resulting PVC state follows the list):

  1. kube-scheduler selects a node for an ephemeral volume
  2. kube-scheduler is restarted
  3. After kube-scheduler restarts, the decision made in step 1 is not reloaded
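
For illustration, after step 1 the PVC created from the ephemeral volume template already carries the scheduler's node choice in the volume.kubernetes.io/selected-node annotation, roughly as in the sketch below (the PVC name and node name are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xxx-7d9c8f6b5-abcde-data   # illustrative: generic ephemeral volume PVCs are named <pod name>-<volume name>
  annotations:
    # set by the scheduler's volume binding plugin when it picks a node
    volume.kubernetes.io/selected-node: node1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Mi
  storageClassName: localcsi
  volumeMode: Filesystem
status:
  phase: Pending

Since this annotation is persisted on the PVC in the apiserver, the question is essentially whether the restarted scheduler takes it into account when placing the other replica.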

@kangzhiqin
Author

kangzhiqin commented Aug 27, 2024

kube-scheduler restarts after the PVC node information has been updated. The relevant code is the following:

// Update claims objects to trigger volume provisioning. Let the PV controller take care of the rest
// PV controller is expected to signal back by removing related annotations if actual provisioning fails
for i, claim = range claimsToProvision { // i and claim are declared earlier in the enclosing function
	logger.V(5).Info("Updating claims objects to trigger volume provisioning", "pod", klog.KObj(pod), "PVC", klog.KObj(claim))
	newClaim, err := b.kubeClient.CoreV1().PersistentVolumeClaims(claim.Namespace).Update(ctx, claim, metav1.UpdateOptions{})
	if err != nil {
		logger.V(4).Info("Updating PersistentVolumeClaim: binding to volume failed", "PVC", klog.KObj(claim), "err", err)
		return err
	}
	// Save updated object from apiserver for later checking.
	claimsToProvision[i] = newClaim
	lastProcessedProvisioning++
}

The Deployment with required anti-affinity has two replicas. Before kube-scheduler restarted, the PVC of replica 1 had been scheduled to node2.
After the restart, the PVC and pod of replica 2 were scheduled to node2, so the anti-affinity rule now prevents replica 1 from being scheduled.

This seems to be a bug in k8s. Does the community have a plan to fix it?

@kangzhiqin
Author

The flow of this problem is roughly like this:
[flow diagram image]

@kangzhiqin kangzhiqin changed the title An error occurred when kube-scheduler scheduled the ephemeral volume. Scheduling Inconsistency Caused by kube-scheduler Restart Aug 27, 2024
@alculquicondor
Member

alculquicondor commented Sep 12, 2024

FYI on a related issue #125491

@dom4ha
Member

dom4ha commented Sep 17, 2024

I'm looking into #125491, which should help with fixing this one once we manage to restore the in-flight actions after a scheduler restart.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2024
@dom4ha
Member

dom4ha commented Dec 17, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2024