Fixes the races around devicemanager Allocate() and endpoint deletion. #60856
Conversation
Force-pushed from 922d2a9 to 25254d4.
/area hw-accelerators
Wouldn't it be simpler to issue a callback on endpoint stop and add that timeout at the manager level?
That would remove this weird intermediary "dead but not removed" state in the endpoint and simplify the guarding code you added to all the RPC calls.
You also wouldn't need to check for devices being out of sync with the endpoint, since you'd always have the stopGracePeriod.
As far as re-registration is concerned, with the stop callback you would just need to replace the callback func with a no-op func before stopping the endpoint.
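For illustration, a rough sketch of what that callback-plus-manager-timer approach could look like; the endpoint, manager, handleEndpointStop, markResourceUnhealthy, and removeResource names here are hypothetical and not the PR's actual API:

package devicemanager

import "time"

const endpointStopGracePeriod = 5 * time.Minute

// The endpoint only notifies the manager when it stops; the manager owns the timer.
type endpoint struct {
	resourceName string
	onStop       func(resourceName string) // swapped for a no-op before re-registration
}

func (e *endpoint) stop() {
	if e.onStop != nil {
		e.onStop(e.resourceName)
	}
}

type manager struct{} // device bookkeeping elided

func (m *manager) markResourceUnhealthy(resourceName string) { /* elided */ }
func (m *manager) removeResource(resourceName string)        { /* elided */ }

// handleEndpointStop is what the endpoint's onStop callback would point at:
// mark the resource unhealthy right away, remove it after the grace period.
func (m *manager) handleEndpointStop(resourceName string) {
	m.markResourceUnhealthy(resourceName)
	time.AfterFunc(endpointStopGracePeriod, func() {
		m.removeResource(resourceName)
	})
}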
// endpointStopGracePeriod indicates the grace period after an endpoint is stopped
// because its device plugin fails. DeviceManager keeps the stopped endpoint in its
// cache during this grace period to cover the time gap for the capacity change to
// take effect.
const endpointStopGracePeriod = time.Duration(5) * time.Minute
Maybe move this to types.go; this sounds like an important constant :)
/sig node
Force-pushed from fa88f83 to 9f75b6f.
Are you suggesting to add another map in the manager to track the endpoint stop grace period? I did consider that approach at the beginning, but felt we would need to manage the lifecycle of another data structure and make sure it stays consistent with endpoint updates.
Re: moving endpointStopGracePeriod to types.go: done.
We probably need to move the devices into their own structure anyway; as you mentioned, managing the lifecycle of these structures in the manager is complex and needs to be carefully checked. What do you think of something like:
type ManagerStore interface {
	// This sets the timer when Update(rName, [], [], allTheDevices)
	Update(rName string, added, updated, deleted []pluginapi.Device)
	// This decides whether a resource may be removed based on the time since the last Update
	GetCapacity() (capacity, allocatable v1.ResourceList, deleted []string)
}
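To make the proposal concrete, here is a minimal sketch of how such a store could track the grace period internally; the managerStore type and the stoppedAt field are assumptions for illustration only, and the pluginapi import path is assumed to match the 1.10 tree:

package devicemanager

import (
	"sync"
	"time"

	"k8s.io/api/core/v1"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

const endpointStopGracePeriod = 5 * time.Minute

type managerStore struct {
	mu        sync.Mutex
	stoppedAt map[string]time.Time // resource name -> time the plugin went away
	// per-resource device sets elided
}

func (s *managerStore) Update(rName string, added, updated, deleted []pluginapi.Device) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// apply added/updated/deleted to the per-resource device set (elided)
	if len(added) == 0 && len(updated) == 0 && len(deleted) > 0 {
		// the plugin disappeared: start the grace-period clock once
		if _, ok := s.stoppedAt[rName]; !ok {
			s.stoppedAt[rName] = time.Now()
		}
		return
	}
	delete(s.stoppedAt, rName) // resource is alive again
}

func (s *managerStore) GetCapacity() (capacity, allocatable v1.ResourceList, deleted []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	capacity, allocatable = v1.ResourceList{}, v1.ResourceList{}
	for rName, t := range s.stoppedAt {
		if time.Since(t) > endpointStopGracePeriod {
			deleted = append(deleted, rName)
		}
	}
	// healthy resources would be summed into capacity/allocatable here (elided)
	return capacity, allocatable, deleted
}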
Having a separate data store will perhaps help simplify the data structure management. I still need to actually try it to see how it looks.
Jiaying
Force-pushed from 68be9a9 to 1708a72.
/test pull-kubernetes-kubemark-e2e-gce
1 similar comment:
/test pull-kubernetes-kubemark-e2e-gce
/assign @vishh
// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
	if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
nit: why change this line? It is idiomatic Go style.
// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
	if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
	allocatedDevices, err := m.allocateContainerResources(pod, &container, devicesToReuse)
Doesn't the new API support allocating multiple devices at once?
@@ -259,18 +263,39 @@ func (m *ManagerImpl) Devices() map[string][]pluginapi.Device {
func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
Can this logic be re-structured as follows to improve readability?
- Figure out max(# of devices requested by init containers)
- Figure out sum(# of devices requested by regular containers)
- Compute the devices required as the max of the previous two figures
- Allocate these devices in some order (either individually or in batch)
- Assign devices to init containers (can intersect) and regular containers (mutually independent)
As of now the logic is dense. Since your PR is touching this logic, I'd like for it to be cleaned up.
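For what it's worth, a sketch of the counting step being suggested: take the max over init containers and the sum over regular containers, then the larger of the two. The devicesRequired and devicesRequestedBy helpers are hypothetical names, not functions in this PR, and assume device requests appear as container resource limits:

package devicemanager

import "k8s.io/api/core/v1"

// devicesRequestedBy reads how many units of the extended resource a
// container requests via its limits (hypothetical helper).
func devicesRequestedBy(c *v1.Container, resource string) int {
	q, ok := c.Resources.Limits[v1.ResourceName(resource)]
	if !ok {
		return 0
	}
	return int(q.Value())
}

// devicesRequired computes how many devices the pod needs overall: init
// containers run one at a time and may reuse devices, so the pod needs the
// larger of (max over init containers) and (sum over regular containers).
func devicesRequired(pod *v1.Pod, resource string) int {
	maxInit, sumRegular := 0, 0
	for i := range pod.Spec.InitContainers {
		if n := devicesRequestedBy(&pod.Spec.InitContainers[i], resource); n > maxInit {
			maxInit = n
		}
	}
	for i := range pod.Spec.Containers {
		sumRegular += devicesRequestedBy(&pod.Spec.Containers[i], resource)
	}
	if maxInit > sumRegular {
		return maxInit
	}
	return sumRegular
}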
Given that kubelet processes incoming pods serially during admission, this change can potentially block non-device-plugin pods from starting up, right?
Per the offline discussion with @vishh: given that pod admission is run serially by a single process, and there is a chance that the device plugin pod may even be queued behind a pod that requests the device plugin resource, it seems better for now to just fail pod admission if the pod requests a disconnected device plugin resource. In 1.11, we should probably move the Allocate grpc call outside of pod admission so that we can have a retry grace period. I modified the PR to only include the endpoint stopGracePeriod part so that devicemanager can properly fail pod admission during that time window. I also modified the device plugin e2e_node test to make sure we don't create pods too early after kubelet restart, so that they won't fail admission. PTAL.
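A minimal sketch of that admission-time check, assuming a simplified endpoint/manager shape (endpointInfo, admitResource, and the map layout are illustrative; the PR's actual code paths differ in detail):

package devicemanager

import (
	"fmt"
	"time"
)

type endpointInfo struct {
	stopTime time.Time // zero while the device plugin is connected
}

func (e *endpointInfo) isStopped() bool { return !e.stopTime.IsZero() }

type manager struct {
	endpoints map[string]*endpointInfo // resource name -> endpoint
}

// admitResource fails pod admission for a resource whose endpoint is either
// unknown or stopped (i.e., within the stop grace period).
func (m *manager) admitResource(resource string) error {
	e, ok := m.endpoints[resource]
	if !ok {
		return fmt.Errorf("resource %q is not registered with any device plugin", resource)
	}
	if e.isStopped() {
		return fmt.Errorf("device plugin endpoint for %q is stopped; failing admission during the grace period", resource)
	}
	return nil
}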
/lgtm |
/retest
/test pull-kubernetes-e2e-gce
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jiayingz, vikaschoudhary16, vishh.
[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete. @jiayingz @vikaschoudhary16 @vishh Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.10 milestone.
/test all
[submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
There is a race in predicateAdmitHandler Admit() where getNodeAnyWayFunc()
could get a Node with non-zero device plugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts: with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during the GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when a node update from the API server comes in later. This
could cause a pod to be started without the proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
it. During kubelet restart, create endpoints with stopTime set for any
checkpointed registered resource. The endpoint is considered to be in
stopGracePeriod if its stopTime is set. This allows us to track which
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through the API server) if there is
such a use case. Currently endpointStopGracePeriod is set to 5 minutes.
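A minimal sketch of the stopTime bookkeeping described above; the endpointImpl field and method names here approximate the idea rather than quoting the PR's code:

package devicemanager

import "time"

const endpointStopGracePeriod = time.Duration(5) * time.Minute

type endpointImpl struct {
	stopTime time.Time // zero value means the plugin is still connected
}

// markStopped records when the plugin went away; the endpoint's devices
// would also be marked unhealthy at this point so the scheduler sees the
// allocatable change.
func (e *endpointImpl) markStopped() {
	e.stopTime = time.Now()
}

// isStopped reports whether the endpoint has been marked stopped.
func (e *endpointImpl) isStopped() bool { return !e.stopTime.IsZero() }

// stopGracePeriodExpired reports whether the endpoint and its resource
// should now be removed from the devicemanager cache.
func (e *endpointImpl) stopGracePeriodExpired() bool {
	return e.isStopped() && time.Since(e.stopTime) > endpointStopGracePeriod
}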
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler and avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail the admission handler.
Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.
Fixes #60176