
Limitrange request causes indefinite amount of pods spawned #93750

Closed
dza89 opened this issue Aug 6, 2020 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments


dza89 commented Aug 6, 2020

What happened:
Pods keep spawning indefinitely with an OutOfcpu error

What you expected to happen:
A single error returned

How to reproduce it (as minimally and precisely as possible):

Add a LimitRange:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit-range
  namespace: "{{ meta.name }}"
spec:
  limits:
  - default:
      cpu: 1
      memory: 512Mi
    defaultRequest:
      cpu: 1
      memory: 256Mi
    type: Container

Make sure your nodes have less than 1 CPU available (the default request is set high for this example)

Create a dummy deployment without any resources set
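
For example (a sketch; the name and image are illustrative and not from the original report), a minimal deployment with no resources block could look like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dummy
  template:
    metadata:
      labels:
        app: dummy
    spec:
      containers:
      - name: dummy
        image: nginx
        # no resources block, so the LimitRange defaults above are applied at admission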

Anything else we need to know?:

My findings:
Since the limits are set at the container level, the kubelet receives the pod, the request/limit from the LimitRange is added, and the container can no longer be scheduled; the kubelet returns OutOfcpu and the ReplicaSet deploys a new pod. This starts an endless loop of pod spawning.
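
For illustration (a sketch assuming the LimitRange above and a container that sets no resources of its own), the effective container resources after admission would be:

resources:
  requests:
    cpu: 1          # from defaultRequest in the LimitRange
    memory: 256Mi
  limits:
    cpu: 1          # from default in the LimitRange
    memory: 512Mi

With less than 1 CPU available on the node, that request cannot be satisfied, which is where the OutOfcpu rejection comes from.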

Environment:

  • Kubernetes version: 1.16
  • Cloud provider or hardware configuration: EKS
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 4.14.186-146.268.amzn2.x86_64
  • Others:
    kubeletVersion: v1.16.13-eks-2ba888
@dza89 dza89 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 6, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 6, 2020

dza89 commented Aug 6, 2020

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 6, 2020

cablespaghetti commented Aug 6, 2020

Oh great, it's not just us! We're also on EKS v1.16. This started last week for us... I wonder if it's an EKS problem specifically.

edit: That said, we're not using LimitRange.


dza89 commented Aug 6, 2020

@cablespaghetti
Hmm, it was exactly the same for us after swapping to the latest EKS AMI. I suspect a different version of the kubelet.
Maybe it's not limited to LimitRange? Removing it did, however, fix our issue.


cablespaghetti commented Aug 6, 2020

I'm downgrading to ami-05ac566a7ec2378db, which is from May but was the previous AMI we were running. Will report back if that fixes it...

edit: To clarify, this is going from a 1.16.12 or 1.16.13 AMI to a 1.16.8 one.


dza89 commented Aug 6, 2020

You don't have anything else, like a policy that applies resources at the container level?


cablespaghetti commented Aug 6, 2020

We don't, no. Just heard back from AWS:

Based on the details you provided, I understand that many of your pods are failing to start due to "OutOfcpu" although worker instances have enough CPU available for pods.

As you suggested, I have looked into similar reports by different customers and indeed I detected that this behavior was reported and correlated to EKS AMI versions 1.16.10+, as it seems to be related to this Kubernetes issue [1], which was reported for K8s version 1.16.10.

Based on the previous analysis of this issue, our suggested approach is to update your cluster to version 1.17 if applicable; otherwise, if you prefer not to update, you can retain an earlier AMI such as 1.16.8, as you are currently considering.

On behalf of AWS, I apologize for any inconvenience caused by this issue. I hope this information assists you in addressing it; meanwhile, feel free to update me with any further queries or inputs.

References:
[1] #90455

I spent a while puzzling over this, and it seems this is a problem because the control plane on EKS 1.16 is 1.16.8, which pre-dates this change in later versions of the kubelet.
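
As a quick sanity check (assuming kubectl access to the affected cluster), the skew is visible by comparing the server and node versions:

kubectl version --short   # server version, e.g. v1.16.8 on EKS 1.16
kubectl get nodes         # VERSION column shows the kubelet version, e.g. v1.16.13-eks-2ba888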


dza89 commented Aug 6, 2020

Ah ok, I saw it happen numerous times with init containers.
Removing the LimitRange worked because then there were no more resources defined.
/close

@k8s-ci-robot
Contributor

@dza89: Closing this issue.

In response to this:

Ah ok, I saw it happen numerous times with init containers.
Removing the LimitRange worked because then there were no more resources defined.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
