Scheduling Inconsistency Caused by kube-scheduler Restart #126499

Open
kangzhiqin opened this issue Aug 1, 2024 · 13 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one.
needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
sig/storage Categorizes an issue or PR as relevant to SIG Storage.
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@kangzhiqin

What would you like to be added?

I have a Deployment that needs to run two replicas, with required (hard) pod anti-affinity configured as follows:

      - ephemeral:
          volumeClaimTemplate:
            metadata:
              creationTimestamp: null
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 200Mi
              storageClassName: localcsi
              volumeMode: Filesystem

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - xxx
            topologyKey: kubernetes.io/hostname
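
For reference, a minimal Deployment manifest combining the two fragments above would look roughly like the sketch below; the name, label value, container image, and mount path are placeholders, not values from the original report:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: xxx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: xxx
  template:
    metadata:
      labels:
        app: xxx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - xxx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app                            # placeholder container
        image: example.registry/app:latest   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data                   # placeholder mount path
      volumes:
      - name: data
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 200Mi
              storageClassName: localcsi
              volumeMode: Filesystem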

Why is this needed?

kube-scheduler schedules the ephemeral volume of replica 1 to node 1 but has not yet scheduled the pod of replica 1. At this point, kube-scheduler restarts. After the restart, kube-scheduler schedules the ephemeral volume and pod of replica 2 and places replica 2 on node 1. As a result, replica 1 cannot be scheduled because of the anti-affinity rule.
After kube-scheduler restarts, it does not reload the ephemeral volume placement decisions made before the restart.
Does the community have any suggestions on this issue?

@kangzhiqin kangzhiqin added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 1, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neolit123
Member

/sig scheduling storage

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 1, 2024
@googs1025
Member

Some information is not clear. What is your k8s version? Can you provide a yaml file to reproduce this problem? In addition, what do you mean by restarting kube-scheduler? Can you print out the log?

@googs1025
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Aug 1, 2024
@googs1025
Member

/remove-kind feature

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Aug 1, 2024
@kangzhiqin
Author

Some information is not clear. What is your k8s version? Can you provide a yaml file to reproduce this problem? In addition, what do you mean by restarting kube-scheduler? Can you print out the log?

k8s version: 1.28.1

Problem Description:
kube-scheduler could not reach the kube-apiserver to renew its leader-election lease, which caused it to lose leadership and restart. The log is as follows:

E0730 17:11:13.098467      10 leaderelection.go:369] Failed to update lock: Put "https://172.48.1.30:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0730 17:11:18.315182      10 leaderelection.go:369] Failed to update lock: Put "https://172.48.1.30:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
I0730 17:11:18.315246      10 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0730 17:11:21.723746      10 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-scheduler": the object has been modified; please apply your changes to the latest version and try again
E0730 17:11:21.723810      10 server.go:246] "Leaderelection lost"
I0730 17:11:21.723824      10 scheduling_queue.go:826] "Scheduling queue is closed"
I0730 17:11:23.621602      11 flags.go:64] FLAG: --allow-metric-labels="[]"
I0730 17:11:23.621665      11 flags.go:64] FLAG: --authentication-kubeconfig="/opt/kubernetes/config/scheduler.config"
I0730 17:11:23.621671      11 flags.go:64] FLAG: --authentication-skip-lookup="false"
I0730 17:11:23.621676      11 flags.go:64] FLAG: --authentication-token-webhook-cache-ttl="10s"
I0730 17:11:23.621681      11 flags.go:64] FLAG: --authentication-tolerate-lookup-failure="true"
I0730 17:11:23.621684      11 flags.go:64] FLAG: --authorization-always-allow-paths="[/healthz,/readyz,/livez]"
I0730 17:11:23.621692      11 flags.go:64] FLAG: --authorization-kubeconfig="/opt/kubernetes/config/scheduler.config"
I0730 17:11:23.621695      11 flags.go:64] FLAG: --authorization-webhook-cache-authorized-ttl="10s"
I0730 17:11:23.621700      11 flags.go:64] FLAG: --authorization-webhook-cache-unauthorized-ttl="10s"
I0730 17:11:23.621703      11 flags.go:64] FLAG: --bind-address="127.0.0.1"
I0730 17:11:23.621711      11 flags.go:64] FLAG: --cert-dir=""
I0730 17:11:23.621714      11 flags.go:64] FLAG: --client-ca-file=""
I0730 17:11:23.621718      11 flags.go:64] FLAG: --config="/opt/kubernetes/config/scheduler.yaml"
I0730 17:11:23.621721      11 flags.go:64] FLAG: --contention-profiling="true"
I0730 17:11:23.621724      11 flags.go:64] FLAG: --disabled-metrics="[]"
I0730 17:11:23.621729      11 flags.go:64] FLAG: --feature-gates="ComponentSLIs=false,SchedulerQueueingHints=false"
I0730 17:11:23.621739      11 flags.go:64] FLAG: --help="false"
I0730 17:11:23.621741      11 flags.go:64] FLAG: --http2-max-streams-per-connection="0"
I0730 17:11:23.621750      11 flags.go:64] FLAG: --kube-api-burst="100"
I0730 17:11:23.621754      11 flags.go:64] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0730 17:11:23.621765      11 flags.go:64] FLAG: --kube-api-qps="50"
I0730 17:11:23.621774      11 flags.go:64] FLAG: --kubeconfig="/opt/kubernetes/config/scheduler.config"
I0730 17:11:23.621777      11 flags.go:64] FLAG: --leader-elect="true"
I0730 17:11:23.621780      11 flags.go:64] FLAG: --leader-elect-lease-duration="25s"

Setting aside why the apiserver could not be reached, the question is why the ephemeral volume placement information recorded before the restart is not read back after kube-scheduler restarts.
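
For context, the flags above map to leader-election settings in the KubeSchedulerConfiguration roughly like the sketch below; only leaderElect and leaseDuration correspond to values visible in the log, the remaining values are assumptions:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true          # --leader-elect=true in the log
  leaseDuration: 25s         # --leader-elect-lease-duration=25s in the log
  renewDeadline: 15s         # assumed, not shown in the log
  retryPeriod: 5s            # assumed, not shown in the log
  resourceNamespace: kube-system
  resourceName: kube-scheduler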

@kangzhiqin
Author

The procedure is as follows (a sketch of the resulting PVC state follows the list):

  1. kube-scheduler selects a node for an ephemeral volume
  2. kube-scheduler is restarted
  3. After kube-scheduler restarts, the decision made in step 1 is not reloaded
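
For illustration, after step 1 the PVC created from the ephemeral volume template already carries the scheduler's node choice in the volume.kubernetes.io/selected-node annotation, roughly as in the sketch below (the PVC name and node name are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xxx-7d9c8f6b5-abcde-data   # illustrative: generic ephemeral volume PVCs are named <pod name>-<volume name>
  annotations:
    # set by the scheduler's volume binding plugin when it picks a node
    volume.kubernetes.io/selected-node: node1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Mi
  storageClassName: localcsi
  volumeMode: Filesystem
status:
  phase: Pending

Since this annotation is persisted on the PVC in the apiserver, the question is essentially whether the restarted scheduler takes it into account when placing the other replica.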

@kangzhiqin
Author

kangzhiqin commented Aug 27, 2024

kube-scheduler restarts after the PVC node information has been updated. The relevant code is the following:

// Update claims objects to trigger volume provisioning. Let the PV controller take care of the rest
// PV controller is expected to signal back by removing related annotations if actual provisioning fails
for i, claim = range claimsToProvision { // i and claim are declared earlier in the enclosing function
	logger.V(5).Info("Updating claims objects to trigger volume provisioning", "pod", klog.KObj(pod), "PVC", klog.KObj(claim))
	newClaim, err := b.kubeClient.CoreV1().PersistentVolumeClaims(claim.Namespace).Update(ctx, claim, metav1.UpdateOptions{})
	if err != nil {
		logger.V(4).Info("Updating PersistentVolumeClaim: binding to volume failed", "PVC", klog.KObj(claim), "err", err)
		return err
	}
	// Save updated object from apiserver for later checking.
	claimsToProvision[i] = newClaim
	lastProcessedProvisioning++
}

The Deployment with required anti-affinity has two replicas. Before kube-scheduler restarted, the PVC of replica 1 had been scheduled to node2.
After the restart, the PVC and pod of replica 2 were scheduled to node2, so the anti-affinity rule now prevents replica 1 from being scheduled.

This seems to be a bug in k8s. Does the community have a plan to fix it?

@kangzhiqin
Author

The flow of this problem is roughly like this:
[flow diagram image]

@kangzhiqin kangzhiqin changed the title An error occurred when kube-scheduler scheduled the ephemeral volume. Scheduling Inconsistency Caused by kube-scheduler Restart Aug 27, 2024
@alculquicondor
Member

alculquicondor commented Sep 12, 2024

FYI on a related issue #125491

@dom4ha
Member

dom4ha commented Sep 17, 2024

I'm looking into #125491, which should help with fixing this one once we manage to restore the in-flight actions after a scheduler restart.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2024
@dom4ha
Member

dom4ha commented Dec 17, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2024