Fixes the races around devicemanager Allocate() and endpoint deletion. #60856

Merged: 1 commit merged into kubernetes:master on Mar 12, 2018

Conversation

@jiayingz (Contributor) commented Mar 6, 2018

There is a race in predicateAdmitHandler Admit() where getNodeAnyWayFunc() can return a Node with non-zero device plugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts: with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this window, even though devicemanager may have removed the resource initially during the GetCapacity() call, kubelet may overwrite the device plugin resource capacity/allocatable with the old value when a node update from the API server comes in later. This could cause a pod to be started without the proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint; instead, set stopTime on its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. An endpoint is considered to be in stopGracePeriod if its stopTime is set. This lets us track which resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through the API server) if there is such a use case. Currently endpointStopGracePeriod is set to 5 minutes.
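A rough Go sketch of the mechanism described above (illustrative names only; the actual fields and helpers in the PR may differ):

package devicemanager

import "time"

// endpointStopGracePeriod mirrors the 5-minute grace period described above.
const endpointStopGracePeriod = 5 * time.Minute

// endpointSketch is an illustrative stand-in for the devicemanager endpoint.
type endpointSketch struct {
	resourceName string
	stopTime     time.Time // zero value means the endpoint is still running
}

// isStopped reports whether the endpoint has been marked stopped, i.e. it is
// in (or past) its stop grace period.
func (e *endpointSketch) isStopped() bool {
	return !e.stopTime.IsZero()
}

// stopGracePeriodExpired reports whether the endpoint has been stopped for
// longer than the grace period, at which point the manager removes the
// endpoint and its resource.
func (e *endpointSketch) stopGracePeriodExpired() bool {
	return e.isStopped() && time.Since(e.stopTime) > endpointStopGracePeriod
}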

Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we signal the resource allocatable change to the scheduler and avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail the admission handler.
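Continuing the sketch above, the manager side might mark devices unhealthy on stop and reject admission for the resource roughly as follows (again illustrative names, not the merged code):

package devicemanager

import (
	"fmt"
	"sync"
	"time"
)

// managerSketch is an illustrative stand-in for the device manager state.
type managerSketch struct {
	mu        sync.Mutex
	endpoints map[string]*endpointSketch // resource name -> endpoint
	healthy   map[string]int             // resource name -> healthy device count
}

// markEndpointStopped records the stop time and zeroes the healthy device
// count so the allocatable change is reported to the scheduler.
func (m *managerSketch) markEndpointStopped(resource string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.endpoints[resource]; ok {
		e.stopTime = time.Now()
		m.healthy[resource] = 0
	}
}

// admitResource fails pod admission for a resource whose endpoint is stopped,
// i.e. still within (or past) its stop grace period.
func (m *managerSketch) admitResource(resource string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	e, ok := m.endpoints[resource]
	if !ok || e.isStopped() {
		return fmt.Errorf("device plugin endpoint for %q is not available", resource)
	}
	return nil
}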

Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60176

Special notes for your reviewer:

Release note:

Fixes the races around devicemanager Allocate() and endpoint deletion.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 6, 2018
@jiayingz (Contributor, Author) commented Mar 6, 2018

cc @vishh @RenaudWasTaken

@jiayingz jiayingz force-pushed the race-fix branch 2 times, most recently from 922d2a9 to 25254d4 Compare March 6, 2018 23:20
@RenaudWasTaken (Contributor):

/area hw-accelerators

@RenaudWasTaken (Contributor) left a comment:

Wouldn't it be simpler to issue a callback on endpoint stop and add that timeout at the manager level?

That would remove this awkward intermediate "stopped but not removed" state in the endpoint and simplify the guarding code you added to all the RPC calls.

You also wouldn't need to check for devices being out of sync with the endpoint, since you'd always have the stopGracePeriod.

And as far as re-registration is concerned, you would just need to replace the stop callback with a no-op func before stopping the endpoint.
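For comparison, a loose Go sketch of the callback-based alternative described here (all names are hypothetical, purely for illustration):

package devicemanager

import (
	"sync"
	"time"
)

// callbackEndpoint notifies the manager when it stops instead of carrying a
// "stopped but not removed" state itself.
type callbackEndpoint struct {
	resourceName string
	onStop       func(resourceName string)
}

func (e *callbackEndpoint) stop() {
	e.onStop(e.resourceName)
}

type callbackManager struct {
	mu       sync.Mutex
	stopTime map[string]time.Time // resource name -> when its endpoint stopped
}

// newEndpoint wires the endpoint so that stopping it arms the manager-level
// grace period for its resource.
func (m *callbackManager) newEndpoint(resource string) *callbackEndpoint {
	return &callbackEndpoint{
		resourceName: resource,
		onStop: func(r string) {
			m.mu.Lock()
			defer m.mu.Unlock()
			m.stopTime[r] = time.Now()
		},
	}
}

// reRegister swaps in a no-op callback before stopping the old endpoint, so
// re-registration does not start a grace period for the resource.
func (m *callbackManager) reRegister(old *callbackEndpoint) *callbackEndpoint {
	old.onStop = func(string) {}
	old.stop()
	return m.newEndpoint(old.resourceName)
}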

// because its device plugin fails. DeviceManager keeps the stopped endpoint in its
// cache during this grace period to cover the time gap for the capacity change to
// take effect.
const endpointStopGracePeriod = time.Duration(5) * time.Minute
Inline review comment (Contributor):

maybe move this to types.go, this sounds like an important constant :)

@dims (Member) commented Mar 6, 2018

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 6, 2018
@jiayingz jiayingz force-pushed the race-fix branch 2 times, most recently from fa88f83 to 9f75b6f Compare March 6, 2018 23:52
@jiayingz (Contributor, Author) commented Mar 6, 2018 via email

@RenaudWasTaken (Contributor):

> Are you suggesting to add another map in the manager to track the endpoint stop grace period? I did consider that approach at the beginning, but felt we would need to manage the lifecycle of another data structure and make sure it stays consistent with endpoint updates.

We probably need to move the devices into their own structure anyway; as you mentioned, managing the lifecycle of these structures in the manager is complex and needs to be checked carefully.

What do you think of something like:

type ManagerStore interface {
     // This sets the timer when Update(rName, [], [], allTheDevices)
     Update(rName string, added, updated, deleted []pluginapi.Device)

     // This decides whether a resource may be removed based on the time since the last Update
     GetCapacity() (capacity, allocatable v1.ResourceList, deleted []string)
}

@jiayingz (Contributor, Author) commented Mar 7, 2018 via email

@jiayingz jiayingz force-pushed the race-fix branch 2 times, most recently from 68be9a9 to 1708a72 Compare March 7, 2018 01:19
@jiayingz (Contributor, Author) commented Mar 7, 2018

/test pull-kubernetes-kubemark-e2e-gce


@jiayingz (Contributor, Author) commented Mar 7, 2018

/assign @vishh

// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
Inline review comment (Contributor):

nit: why change this line? It is idiomatic go style.

// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
allocatedDevices, err := m.allocateContainerResources(pod, &container, devicesToReuse)
Inline review comment (Contributor):

Doesn't the new API support allocating multiple devices at once?

@@ -259,18 +263,39 @@ func (m *ManagerImpl) Devices() map[string][]pluginapi.Device {
func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
Inline review comment (Contributor):

Can this logic be restructured as follows to improve readability?

  1. Figure out max(# of devices requested by init containers).
  2. Figure out sum(# of devices requested by regular containers).
  3. Compute devices required = max(1 & 2).
  4. Allocate these devices in some order (either individually or in batch).
  5. Assign devices to init containers (can intersect) and regular containers (mutually independent).

As of now the logic is dense. Since your PR is touching this logic, I'd like it to be cleaned up.
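For illustration only, a rough Go sketch of the per-resource computation this suggestion describes (the container and pod types here are simplified, hypothetical stand-ins, not kubelet types):

package main

import "fmt"

type container struct {
	name     string
	requests map[string]int // device plugin resource name -> requested count
}

type pod struct {
	initContainers    []container
	regularContainers []container
}

// devicesRequired computes, per resource, max(largest init container request,
// sum of regular container requests), i.e. how many devices the pod needs
// allocated overall, since init containers run one at a time and can reuse
// devices later handed to regular containers.
func devicesRequired(p pod) map[string]int {
	required := map[string]int{}
	for _, c := range p.initContainers {
		for r, n := range c.requests {
			if n > required[r] {
				required[r] = n
			}
		}
	}
	regularSum := map[string]int{}
	for _, c := range p.regularContainers {
		for r, n := range c.requests {
			regularSum[r] += n
		}
	}
	for r, n := range regularSum {
		if n > required[r] {
			required[r] = n
		}
	}
	return required
}

func main() {
	p := pod{
		initContainers:    []container{{name: "init", requests: map[string]int{"example.com/gpu": 2}}},
		regularContainers: []container{{name: "app", requests: map[string]int{"example.com/gpu": 1}}},
	}
	fmt.Println(devicesRequired(p)) // map[example.com/gpu:2]
}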

@vishh (Contributor) commented Mar 9, 2018

Given that kubelet processes incoming pods serially during admission, this change could potentially block non-device-plugin pods from starting up, right?
Instead, should we consider a model where pods that use only first-class resources are not impacted by device plugins?
Imagine a cluster admin trying to run a container on a specific node to exec into and debug the host, and the admin's debug pod not starting on that node because of this change.

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2018
@jiayingz (Contributor, Author) commented Mar 9, 2018

Per the offline discussion with @vishh: given that pod admission is run serially by a single process, and there is a chance that the device plugin pod may even be queued behind a pod that requests the device plugin resource, it seems better for now to just fail pod admission if the pod requests a disconnected device plugin resource. In 1.11, we should probably move the Allocate gRPC call outside of pod admission so that we can allow a certain retry grace period. I modified the PR to only include the endpoint stopGracePeriod part so that devicemanager can properly fail pod admission during that time window. I also modified the device plugin e2e_node test to make sure we don't create pods too early after kubelet restart, so that they won't fail admission. PTAL.

@vishh (Contributor) commented Mar 9, 2018

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 9, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2018
@vishh vishh added this to the v1.10 milestone Mar 10, 2018
@vishh vishh added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. status/approved-for-milestone labels Mar 10, 2018
@dims (Member) commented Mar 10, 2018

/test pull-kubernetes-e2e-gce

@vikaschoudhary16 (Contributor):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, vikaschoudhary16, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@jiayingz @vikaschoudhary16 @vishh

Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.10 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.


@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit a3f40dd into kubernetes:master Mar 12, 2018
k8s-github-robot pushed a commit that referenced this pull request Mar 22, 2018
…56-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60856

Cherry pick of #60856 on release-1.9.

#60856: Fixes the races around devicemanager Allocate() and endpoint
Labels
approved, area/hw-accelerators, cncf-cla: yes, lgtm, milestone/incomplete-labels, priority/important-soon, release-note, sig/node, size/L
Development
Successfully merging this pull request may close this issue: Device Plugin failure handling in kubelet is racy (#60176)
8 participants