Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Probe-level terminationGracePeriodSeconds #99375

Merged
merged 6 commits into from
Mar 12, 2021

Conversation

ehashman
Copy link
Member

@ehashman ehashman commented Feb 23, 2021

What type of PR is this?

/kind bug
/kind api-change

/sig node
/priority important-soon

What this PR does / why we need it:

Adds a Probe-level terminationGracePeriodSeconds.

  • Add field to APIs
  • Add feature flag for field (ProbeTerminationGracePeriod)
  • Wire up code changes
  • Add e2e tests (maybe in a follow-up?)

Which issue(s) this PR fixes:

Fixes #64715.

See kubernetes/enhancements#2238 and linked KEP below for details.

Special notes for your reviewer:

Manually tested code changes with ./hack/local-up-cluster.sh using no feature gates/FEATURE_GATES="ProbeTerminationGracePeriod=true".

Test pod spec:

---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  terminationGracePeriodSeconds: 500
  containers:
    - name: test
      image: nginx
      command: [bash, -c, "sleep 100000000"]
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 1
        periodSeconds: 60
        terminationGracePeriodSeconds: 10

With feature gate on, terminationGracePeriod is successfully overridden:

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  49s               default-scheduler  Successfully assigned default/test-pod to 127.0.0.1
  Normal   Pulled     48s               kubelet            Successfully pulled image "nginx" in 1.413504483s
  Warning  Unhealthy  19s               kubelet            Liveness probe failed: Get "http://172.17.0.5:8080/healthz": dial tcp 172.17.0.5:8080: connect: connection refused
  Normal   Killing    19s               kubelet            Container test failed liveness probe, will be restarted
  Normal   Pulling    9s (x2 over 49s)  kubelet            Pulling image "nginx"
  Normal   Created    7s (x2 over 48s)  kubelet            Created container test
  Normal   Started    7s (x2 over 48s)  kubelet            Started container test
  Normal   Pulled     7s                kubelet            Successfully pulled image "nginx" in 1.630529866s

Without it, behaviour is same as status quo:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  96s   default-scheduler  Successfully assigned default/test-pod to 127.0.0.1
  Normal   Pulling    95s   kubelet            Pulling image "nginx"
  Normal   Pulled     92s   kubelet            Successfully pulled image "nginx" in 3.370430829s
  Normal   Created    92s   kubelet            Created container test
  Normal   Started    92s   kubelet            Started container test
  Warning  Unhealthy  41s   kubelet            Liveness probe failed: Get "http://172.17.0.3:8080/healthz": dial tcp 172.17.0.3:8080: connect: connection refused
  Normal   Killing    41s   kubelet            Container test failed liveness probe, will be restarted

Does this PR introduce a user-facing change?

Add Probe-level terminationGracePeriodSeconds field

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2238-liveness-probe-grace-period

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Feb 23, 2021
@ehashman ehashman changed the title [WIP] Add Probe-level terminationGracePeriodSeconds Add Probe-level terminationGracePeriodSeconds Mar 1, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 1, 2021
@ehashman
Copy link
Member Author

ehashman commented Mar 1, 2021

Taking off WIP since there is probably enough here for an initial review/feedback. I'm working on the e2es but should have that shortly.

@ehashman
Copy link
Member Author

ehashman commented Mar 1, 2021

Since this is an alpha feature, it's unclear to me if the e2es should land in this PR -- I think for that to work I'd need to add the feature gate (which won't exist until this lands) to the test-infra jobs like kubernetes/test-infra#20369 ? Should I do that after this lands or simultaneously? @dims?

@ehashman
Copy link
Member Author

ehashman commented Mar 2, 2021

/cc @smarterclayton @matthyx

@dims
Copy link
Member

dims commented Mar 2, 2021

@ehashman we would land the e2e test and feature here. and follow up with a PR to test-infra to switch it on. (then iterate if something fails).

@ehashman
Copy link
Member Author

ehashman commented Mar 2, 2021

@dims (just thinking out loud) it won't be able to actually exercise any of the new logic with the alpha flag turned on until I add it to test-infra, and I'm not sure if I can add the flag to test-infra if it's not yet defined, am I missing something?

@liggitt
Copy link
Member

liggitt commented Mar 2, 2021

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Mar 2, 2021
@dims
Copy link
Member

dims commented Mar 2, 2021

@ehashman we would fake test it with an extra commit that turns it on ... run it a few times and remove the extra commit before getting reviews and merging the PR

@smarterclayton
Copy link
Contributor

smarterclayton commented Mar 10, 2021

/approve

for api changes. I'm ok with an exception request since we have had this mostly baked for a while and the API issues were minor. I assume sig-node is ok too. Risk seems minimal.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@palnabarun
Copy link
Member

/milestone v1.21

Restoring the milestone since the exception request was APPROVED. ref: https://groups.google.com/g/kubernetes-sig-release/c/eOphdiRmK7k/m/4cSfBcneAgAJ

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 11, 2021
@rphillips
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021
@ehashman
Copy link
Member Author

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Mar 11, 2021

@ehashman: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-node-e2e-alpha afc3fdbe91356e32271c75ee45f0694346f8830e link /test pull-kubernetes-node-e2e-alpha

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2021
@ehashman
Copy link
Member Author

I think the verify fail was a rebase/merge conflict issue, hopefully the update will pass. Passes locally.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, derekwaynecarr, ehashman, matthyx, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@matthyx
Copy link
Contributor

matthyx commented Mar 12, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot merged commit faa5c8c into kubernetes:master Mar 12, 2021
@thockin
Copy link
Member

thockin commented Oct 30, 2021

Hi. I know this is old, so I am not sure we want to do anything about this comment, but I wanted to ask:

I'm investigating places where we share API structs across use-cases but some fields don't apply. We have a few cases in Probe, and I stumbled on this one (which is new to me). I skimmed the KEP but I didn't see this aspect really discussed. Was it considered?

For example, maybe we should have un-shared the struct? Or maybe this should have been Pod.spec.abnormalTerminationGracePeriod (so it applies to all relevant probes and maybe other things without changing the probe struct)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: API review completed, 1.21
Archived in project
Development

Successfully merging this pull request may close these issues.

terminationGracePeriodSeconds is overloaded