Fixes the races around devicemanager Allocate() and endpoint deletion. #60856

Merged: 1 commit merged into kubernetes:master on Mar 12, 2018

Conversation

@jiayingz (Contributor) commented Mar 6, 2018

There is a race in predicateAdmitHandler Admit() where getNodeAnyWayFunc() can return a Node with non-zero device plugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts: with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this window, even though devicemanager may have removed the resource initially during the GetCapacity() call, kubelet may overwrite the device plugin resource capacity/allocatable with the old value when a node update from the API server comes in later. This could cause a pod to be started without the proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint; instead, set stopTime on its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. An endpoint is considered to be in stopGracePeriod if its stopTime is set. This lets us track which resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through the API server) if there is such a use case. Currently endpointStopGracePeriod is set to 5 minutes.
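A rough Go sketch of the mechanism described above (illustrative names only; the actual fields and helpers in the PR may differ):

package devicemanager

import "time"

// endpointStopGracePeriod mirrors the 5-minute grace period described above.
const endpointStopGracePeriod = 5 * time.Minute

// endpointSketch is an illustrative stand-in for the devicemanager endpoint.
type endpointSketch struct {
	resourceName string
	stopTime     time.Time // zero value means the endpoint is still running
}

// isStopped reports whether the endpoint has been marked stopped, i.e. it is
// in (or past) its stop grace period.
func (e *endpointSketch) isStopped() bool {
	return !e.stopTime.IsZero()
}

// stopGracePeriodExpired reports whether the endpoint has been stopped for
// longer than the grace period, at which point the manager removes the
// endpoint and its resource.
func (e *endpointSketch) stopGracePeriodExpired() bool {
	return e.isStopped() && time.Since(e.stopTime) > endpointStopGracePeriod
}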

Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we signal the resource allocatable change to the scheduler and avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail the admission handler.
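Continuing the sketch above, the manager side might mark devices unhealthy on stop and reject admission for the resource roughly as follows (again illustrative names, not the merged code):

package devicemanager

import (
	"fmt"
	"sync"
	"time"
)

// managerSketch is an illustrative stand-in for the device manager state.
type managerSketch struct {
	mu        sync.Mutex
	endpoints map[string]*endpointSketch // resource name -> endpoint
	healthy   map[string]int             // resource name -> healthy device count
}

// markEndpointStopped records the stop time and zeroes the healthy device
// count so the allocatable change is reported to the scheduler.
func (m *managerSketch) markEndpointStopped(resource string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.endpoints[resource]; ok {
		e.stopTime = time.Now()
		m.healthy[resource] = 0
	}
}

// admitResource fails pod admission for a resource whose endpoint is stopped,
// i.e. still within (or past) its stop grace period.
func (m *managerSketch) admitResource(resource string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	e, ok := m.endpoints[resource]
	if !ok || e.isStopped() {
		return fmt.Errorf("device plugin endpoint for %q is not available", resource)
	}
	return nil
}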

Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60176

Special notes for your reviewer:

Release note:

Fixes the races around devicemanager Allocate() and endpoint deletion.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 6, 2018
@jiayingz (Contributor, Author) commented Mar 6, 2018

cc @vishh @RenaudWasTaken

@jiayingz jiayingz force-pushed the race-fix branch 2 times, most recently from 922d2a9 to 25254d4 Compare March 6, 2018 23:20
@RenaudWasTaken (Contributor):

/area hw-accelerators

@RenaudWasTaken (Contributor) left a comment:

Wouldn't it be simpler to issue a callback on endpoint stop and add that timeout at the manager level?

That would remove this awkward intermediate "stopped but not removed" state in the endpoint and simplify the guarding code you added to all the RPC calls.

You also wouldn't need to check for devices being out of sync with the endpoint, since you'd always have the stopGracePeriod.

And as far as re-registration is concerned, you would just need to replace the stop callback with a no-op func before stopping the endpoint.
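For comparison, a loose Go sketch of the callback-based alternative described here (all names are hypothetical, purely for illustration):

package devicemanager

import (
	"sync"
	"time"
)

// callbackEndpoint notifies the manager when it stops instead of carrying a
// "stopped but not removed" state itself.
type callbackEndpoint struct {
	resourceName string
	onStop       func(resourceName string)
}

func (e *callbackEndpoint) stop() {
	e.onStop(e.resourceName)
}

type callbackManager struct {
	mu       sync.Mutex
	stopTime map[string]time.Time // resource name -> when its endpoint stopped
}

// newEndpoint wires the endpoint so that stopping it arms the manager-level
// grace period for its resource.
func (m *callbackManager) newEndpoint(resource string) *callbackEndpoint {
	return &callbackEndpoint{
		resourceName: resource,
		onStop: func(r string) {
			m.mu.Lock()
			defer m.mu.Unlock()
			m.stopTime[r] = time.Now()
		},
	}
}

// reRegister swaps in a no-op callback before stopping the old endpoint, so
// re-registration does not start a grace period for the resource.
func (m *callbackManager) reRegister(old *callbackEndpoint) *callbackEndpoint {
	old.onStop = func(string) {}
	old.stop()
	return m.newEndpoint(old.resourceName)
}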

// because its device plugin fails. DeviceManager keeps the stopped endpoint in its
// cache during this grace period to cover the time gap for the capacity change to
// take effect.
const endpointStopGracePeriod = time.Duration(5) * time.Minute
Inline review comment (Contributor):

maybe move this to types.go, this sounds like an important constant :)

@dims (Member) commented Mar 6, 2018

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 6, 2018
@jiayingz jiayingz force-pushed the race-fix branch 2 times, most recently from fa88f83 to 9f75b6f Compare March 6, 2018 23:52
@jiayingz (Contributor, Author) commented Mar 6, 2018 via email

@RenaudWasTaken (Contributor):

> Are you suggesting to add another map in the manager to track the endpoint stop grace period? I did consider that approach at the beginning, but felt we would need to manage the lifecycle of another data structure and make sure it stays consistent with endpoint updates.

We probably need to move the devices into their own structure anyway; as you mentioned, managing the lifecycle of these structures in the manager is complex and needs to be checked carefully.

What do you think of something like:

type ManagerStore interface {
     // This sets the timer when Update(rName, [], [], allTheDevices)
     Update(rName string, added, updated, deleted []pluginapi.Device)

     // This decides whether a resource may be removed based on the time since the last Update
     GetCapacity() (capacity, allocatable v1.ResourceList, deleted []string)
}

@jiayingz (Contributor, Author) commented Mar 7, 2018 via email

@jiayingz jiayingz force-pushed the race-fix branch 2 times, most recently from 68be9a9 to 1708a72 Compare March 7, 2018 01:19
@jiayingz (Contributor, Author) commented Mar 7, 2018

/test pull-kubernetes-kubemark-e2e-gce


@jiayingz (Contributor, Author) commented Mar 7, 2018

/assign @vishh

// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
Inline review comment (Contributor):

nit: why change this line? It is idiomatic go style.

// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
allocatedDevices, err := m.allocateContainerResources(pod, &container, devicesToReuse)
Inline review comment (Contributor):

Doesn't the new API support allocating multiple devices at once?

@@ -259,18 +263,39 @@ func (m *ManagerImpl) Devices() map[string][]pluginapi.Device {
func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
Inline review comment (Contributor):

Can this logic be restructured as follows to improve readability?

  1. Figure out max(# of devices requested by init containers).
  2. Figure out sum(# of devices requested by regular containers).
  3. Compute devices required = max(1 & 2).
  4. Allocate these devices in some order (either individually or in batch).
  5. Assign devices to init containers (can intersect) and regular containers (mutually independent).

As of now the logic is dense. Since your PR is touching this logic, I'd like it to be cleaned up.
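For illustration only, a rough Go sketch of the per-resource computation this suggestion describes (the container and pod types here are simplified, hypothetical stand-ins, not kubelet types):

package main

import "fmt"

type container struct {
	name     string
	requests map[string]int // device plugin resource name -> requested count
}

type pod struct {
	initContainers    []container
	regularContainers []container
}

// devicesRequired computes, per resource, max(largest init container request,
// sum of regular container requests), i.e. how many devices the pod needs
// allocated overall, since init containers run one at a time and can reuse
// devices later handed to regular containers.
func devicesRequired(p pod) map[string]int {
	required := map[string]int{}
	for _, c := range p.initContainers {
		for r, n := range c.requests {
			if n > required[r] {
				required[r] = n
			}
		}
	}
	regularSum := map[string]int{}
	for _, c := range p.regularContainers {
		for r, n := range c.requests {
			regularSum[r] += n
		}
	}
	for r, n := range regularSum {
		if n > required[r] {
			required[r] = n
		}
	}
	return required
}

func main() {
	p := pod{
		initContainers:    []container{{name: "init", requests: map[string]int{"example.com/gpu": 2}}},
		regularContainers: []container{{name: "app", requests: map[string]int{"example.com/gpu": 1}}},
	}
	fmt.Println(devicesRequired(p)) // map[example.com/gpu:2]
}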

@vishh (Contributor) commented Mar 9, 2018

Given that kubelet processes incoming pods serially during admission, this change could potentially block non-device-plugin pods from starting up, right?
Instead, should we consider a model where pods that use only first-class resources are not impacted by device plugins?
Imagine a cluster admin trying to run a container on a specific node to exec into and debug the host, and the admin's debug pod not starting on that node because of this change.

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2018
@jiayingz (Contributor, Author) commented Mar 9, 2018

Per the offline discussion with @vishh: given that pod admission is run serially by a single process, and there is a chance that the device plugin pod may even be queued behind a pod that requests the device plugin resource, it seems better for now to just fail pod admission if the pod requests a disconnected device plugin resource. In 1.11, we should probably move the Allocate gRPC call outside of pod admission so that we can allow a certain retry grace period. I modified the PR to only include the endpoint stopGracePeriod part so that devicemanager can properly fail pod admission during that time window. I also modified the device plugin e2e_node test to make sure we don't create pods too early after kubelet restart, so that they won't fail admission. PTAL.

@vishh (Contributor) commented Mar 9, 2018

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 9, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2018
@vishh vishh added this to the v1.10 milestone Mar 10, 2018
@vishh vishh added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. status/approved-for-milestone labels Mar 10, 2018
@dims (Member) commented Mar 10, 2018

/test pull-kubernetes-e2e-gce

@vikaschoudhary16 (Contributor):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, vikaschoudhary16, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@jiayingz @vikaschoudhary16 @vishh

Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.10 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.


@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit a3f40dd into kubernetes:master Mar 12, 2018
k8s-github-robot pushed a commit that referenced this pull request Mar 22, 2018
…56-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60856

Cherry pick of #60856 on release-1.9.

#60856: Fixes the races around devicemanager Allocate() and endpoint
Labels
approved, area/hw-accelerators, cncf-cla: yes, lgtm, milestone/incomplete-labels, priority/important-soon, release-note, sig/node, size/L
Development
Successfully merging this pull request may close this issue: Device Plugin failure handling in kubelet is racy (#60176)
8 participants