
Invoke preStart RPC call before container start, if desired by plugin #58282

Merged

Conversation


@vikaschoudhary16 vikaschoudhary16 commented Jan 15, 2018

What this PR does / why we need it:

  1. Adds a new preStart RPC to the device plugin API
  2. Updates Register RPC handling to receive a flag from device plugins indicating whether kubelet should invoke the preStart RPC before starting a container (see the sketch at the end of this description)
  3. Changes the device manager to invoke preStart before container start
  4. Updates test cases

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56943 #56307

Special notes for your reviewer:

Release note:

None

/sig node

/area hw-accelerators
/cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm
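
A minimal plugin-side sketch of points 1 and 2 above, written against the v1beta1 device plugin package path in the kubernetes tree; the names PreStartContainer, DevicePluginOptions.PreStartRequired and PreStartContainerRequest.DevicesIDs are assumed here (naming was still being discussed in this PR), and resetDevice is a hypothetical helper:

package example

import (
	"context"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

type myDevicePlugin struct{}

// Advertises plugin options to kubelet; PreStartRequired asks kubelet to
// invoke the pre-start RPC before starting every container that uses this
// plugin's resource.
func (p *myDevicePlugin) GetDevicePluginOptions(ctx context.Context, _ *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{PreStartRequired: true}, nil
}

// Runs device-specific setup (reset, scrub, health check) for the devices
// already allocated to the container that is about to start.
func (p *myDevicePlugin) PreStartContainer(ctx context.Context, req *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	for _, id := range req.DevicesIDs {
		resetDevice(id) // hypothetical device-specific cleanup helper
	}
	return &pluginapi.PreStartContainerResponse{}, nil
}

func resetDevice(id string) { /* device-specific reset/scrub would go here */ }

Only the two methods relevant to this PR are shown; a real plugin also implements ListAndWatch and Allocate.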

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 15, 2018
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 15, 2018
@k8s-ci-robot k8s-ci-robot added area/hw-accelerators size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 15, 2018
@k8s-ci-robot k8s-ci-robot requested a review from vishh January 15, 2018 10:52
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 15, 2018
@k8s-ci-robot
Contributor

@vikaschoudhary16: GitHub didn't allow me to request PR reviews from the following users: tengqm, ScorpioCPH.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What this PR does / why we need it:
Adds a mechanism in Kubelet device plugin manager to ensure per-container-creation Allocate gRPC call

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56943

Special notes for your reviewer:

Release note:

None

/sig node

/area hw-accelerators
/cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}
if resp == nil {
Contributor

When is resp != nil ?

Contributor

Looking at the code it's probably on container restart.
If that's the case I would expect a container restart to issue an Allocate call (i.e., reset the device to the same state at every container start).

Contributor Author

Making an Allocate call on each container restart has the side effect that containers which were up may not restart if the device plugin has failed.
How about a configurable option, set by the device plugin at registration time, CacheUse? Enabled by default it would mean the current PR behavior; if the device plugin sets it to false, Allocate will be invoked every time. WDYT?

Contributor

Always issuing an Allocate call on restart would make the checkpoint mechanism useless.
Agreed on making it optional if we indeed have the use case.

Contributor

I guess the question you are really asking is: what is the device plugin system expected to do?
The purpose of the device plugin is to set up a device before making it available in a container.

Making an Allocate call on each container restart has the side effect that containers which were up may not restart if the device plugin has failed.

Not making an Allocate call also has the side effect of breaking container reproducibility.

How about a configurable option, set by the device plugin at registration time, CacheUse? Enabled by default it would mean the current PR behavior; if the device plugin sets it to false, Allocate will be invoked every time. WDYT?

I agree with @lichuqiang: do we have a use case for caching it?
Is there a device plugin that needs caching?

Contributor Author

@RenaudWasTaken hmm, so you are advocating for making an Allocate call for each container every time. I am fine with updating the changes if that's the majority opinion. I personally would prefer to have it as a configuration option, using the cache in the default configuration. This would shorten container start-up time.

Contributor

@RenaudWasTaken hmm, so you are advocating for making an Allocate call for each container every time. I am fine with updating the changes if that's the majority opinion.

If there is no use case I'm not sure it's worth it.

This would shorten container start-up time.

Let's do some actual metrics :)
It's also possible that checkpointing the data is more expensive.

invokeRPCBoolFlag = true

} else {
invokeRPCBoolFlag = m.podDevices.isAllocationOverlapping(podUID, contName, resource, devices)
Contributor
@RenaudWasTaken RenaudWasTaken Jan 15, 2018

What does it mean when resp != nil and a device is overlapping? i.e., what is this case for?

Contributor Author

The overlapping case is the restart of a container that has re-allocated devices which are being reused within the pod's containers.

Contributor

The overlapping case is the restart of a container that has re-allocated devices which are being reused within the pod's containers.

So if a container restarts and it has at least one re-allocated device (I'm guessing InitContainer + Container), then we issue an Allocate call?
What's the use case for that?

Contributor Author

Seems this makes more sense if reuse is between regular containers, which is not supported at the moment. In the (init + regular) case, it does not seem to make sense. Will update.

Contributor
@RenaudWasTaken RenaudWasTaken left a comment

Do you mind adding a few more comments explaining the different code paths?

@@ -524,7 +524,7 @@ func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, requi
needed := required
// Gets list of devices that have already been allocated.
// This can happen if a container restarts for example.
Contributor

This can happen if a container restarts for example.

Can this still happen given that it's called from the acceptation predicate ?

Contributor Author

The call from the acceptation predicate is not newly introduced in this PR; therefore, these comments can be corrected independently of this PR.

Contributor

Leaving a TODO if that's the case would be a good idea though.

Contributor Author

sure


m.mutex.Lock()
e, ok := m.endpoints[resource]
m.podDevices.insert(podUID, contName, resource, allocDevices, nil)
Contributor

Given that we don't need to cache the Allocate response, we should remove the last parameter.

Contributor Author

Do you mean not needed because we would make Allocate each time?

Contributor

Do you mean not needed because we would make Allocate each time?

I think so :), this insert will be overwritten by the following GetDeviceRunContainerOptions.

Contributor Author

@ScorpioCPH Please note that the last parameter, resp, is nil here; in GetDeviceRunContainerOptions, resp will be updated. So I would say it will be "updated".
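
For readers following this exchange, here is a simplified, illustrative sketch of the caching being discussed: the admission path stores the chosen devices with a nil response, and GetDeviceRunContainerOptions later overwrites the entry with the real Allocate response. The types below are stand-ins, not the exact kubelet code:

package example

import (
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// deviceAllocateInfo is a simplified stand-in for the kubelet's per-container
// cache entry: the devices picked for the container plus the (possibly nil)
// Allocate response.
type deviceAllocateInfo struct {
	deviceIDs map[string]bool             // devices assigned to the container
	allocResp *pluginapi.AllocateResponse // nil until Allocate has actually been issued
}

// podDevices maps podUID -> container name -> resource name -> allocation info.
type podDevices map[string]map[string]map[string]deviceAllocateInfo

// insert records an allocation. It is called with resp == nil from the
// admission path and called again later, from GetDeviceRunContainerOptions,
// with the real response, which overwrites ("updates") the cached entry.
func (pdev podDevices) insert(podUID, contName, resource string, devs map[string]bool, resp *pluginapi.AllocateResponse) {
	if pdev[podUID] == nil {
		pdev[podUID] = make(map[string]map[string]deviceAllocateInfo)
	}
	if pdev[podUID][contName] == nil {
		pdev[podUID][contName] = make(map[string]deviceAllocateInfo)
	}
	pdev[podUID][contName][resource] = deviceAllocateInfo{deviceIDs: devs, allocResp: resp}
}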

continue
}
m.mutex.Lock()
devices, resp := m.podDevices.containerDevices(podUID, contName, string(k))
Contributor

We probably don't need the resp here

Contributor Author

Do you mean not needed because we would make Allocate each time?
If yes, please share your opinion on making this a configurable option, as I explained in another comment.

Contributor
@lichuqiang lichuqiang left a comment

Seems the allocation process changed with this PR, that is (if I'm not mistaken):

  1. Calculate resource usage and insert it into podDevices during the predicate phase;
  2. Do the real allocation when generating the container runtime info (cached resp is nil, or devices got reused by other containers).

So I think it is necessary to update the comments accordingly; otherwise, people who study the code for the first time might get confused. A sketch of the two phases follows.
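
A rough, self-contained sketch of that two-phase flow, using deliberately simplified stand-in types (manager, containerAlloc) rather than the real ManagerImpl/podDevices; the reuse-across-containers case mentioned above is noted but omitted:

package example

type allocateResponse struct{ envs map[string]string }

type containerAlloc struct {
	devices []string
	resp    *allocateResponse // nil until the plugin's Allocate has been called
}

type manager struct {
	// podUID/containerName/resource -> cached allocation
	cache map[string]containerAlloc
	// allocate issues the real gRPC Allocate call to the plugin endpoint.
	allocate func(devices []string) (*allocateResponse, error)
}

// Phase 1 (admission/predicate path): reserve devices and checkpoint them,
// but leave resp nil; no gRPC call is made yet.
func (m *manager) admit(key string, devices []string) {
	m.cache[key] = containerAlloc{devices: devices, resp: nil}
}

// Phase 2 (building the container's runtime options): call Allocate only if
// there is no cached response yet. (The PR also re-issues Allocate when
// devices are reused across containers in the pod; omitted here.)
func (m *manager) runContainerOptions(key string) (*allocateResponse, error) {
	entry := m.cache[key]
	if entry.resp != nil {
		return entry.resp, nil // container restart: reuse cached response
	}
	resp, err := m.allocate(entry.devices)
	if err != nil {
		return nil, err
	}
	entry.resp = resp
	m.cache[key] = entry
	return resp, nil
}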


@jiayingz
Contributor

Sorry for joining the conversation late.

The original model I was thinking of when filing FR #56943 is that we make an Allocate grpc call only when we re-allocate a device to a different container. For the most common case, when a single container inside a pod requests DP resources, we will not re-issue the Allocate grpc call when the container restarts. From the discussions in this PR, it looks like the model @RenaudWasTaken proposes is to always issue an Allocate grpc call whenever a container restarts. I think this contradicts the failure recovery statement we made in the DP design doc: if any device plugin fails, we don't expect Kubelet to fail or restart any running pods or containers that are using these devices. I would like to better understand the use cases to decide how we should proceed.

From our discussions, I think there are two main reasons @RenaudWasTaken mentioned for why an Allocate call is desired per container start.

First is security isolation: we want to zero out device memory before starting a container. For this use case, first of all, I feel most security isolation guarantees Kubernetes provides are at the pod level. Even if we want to provide container-level security isolation, I don't see why an Allocate call is required for the most common use case, when a single container in a pod requests a DP resource.

The second reason is device state checking and recovery. For this use case, I would like to better understand how often a crashed process/container may leave a device in a bad state and how we handle this kind of failure in a non-container environment. For example, one case @RenaudWasTaken mentioned is that a crashed process could leave some GPU memory unallocatable. I think we have the same problem in a non-container environment, and I would expect the device driver to do proper garbage collection to handle this case.

I also want to discuss alternative ways of performing per-container-start operations. As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

@vikaschoudhary16
Contributor Author

As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

+1

Contributor
@ScorpioCPH ScorpioCPH left a comment

Hi, some quick comments; I will try to understand the logic in more depth later.

// Continues if this is neither an active device plugin resource nor
// a resource we have previously allocated.
if !registeredResource && !allocatedResource {
continue
Contributor

Should we throw an error/warning here?

Contributor Author

This is not actually an error condition. For other resources like cpu and memory, continue will be invoked.

continue
}
m.mutex.Lock()
devices, resp := m.podDevices.containerDevices(podUID, contName, string(k))
Contributor

nit: s/string(k)/resource

err := m.writeCheckpoint()
if err != nil {
glog.Errorf("Checkpointing failed for pod %s, container %s", podUID, contName)
}
return m.podDevices.deviceRunContainerOptions(string(pod.UID), container.Name)
Contributor

nit: s/string(pod.UID)/podUID


@flx42

flx42 commented Jan 18, 2018

As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

Because it's a prestart hook :), we want to do the health check early, and not pull the full image and set up the container filesystem only to fail just before the pivot_root.

@RenaudWasTaken
Contributor

Hello!

I'll do a general explanation of the problem that is being discussed here and then answer your questions @jiayingz.

The way we designed the device plugin was for running device-specific operations, mostly in order to reset devices. That guarantees that containers always see devices in the same state, thus keeping the container promise of reproducibility.

Examples of device operations for GPUs include:

  • memory scrubbing
  • healthCheck and reset in case of bad state
  • GPU Allocated memory checks
  • "Zombie processes" checks
  • ...

Intel mentioned something about buffer reservations at some point in yesterday's meeting (though I'm not sure I recall correctly).

Having a model where you don't set up the device before your container starts means that (aside from the added code complexity and the performance decrease from having to cache the response) you are breaking container reproducibility and you can't guarantee any security features.

E.g., a TensorFlow job requests all the GPU memory and then crashes (because of some software bug). Say in this particular case it didn't release the memory and the driver couldn't release it; on the next restart, because no Allocate call was made, the device plugin couldn't detect this and try to fix the behavior, so the container will just crash because not enough memory is available.

If we applied the model you are suggesting to CRI, what you would be asking is to re-use the same filesystem, cgroups, ... when a container restarts.

CRI didn't choose this model, so why are you asking it to be different for the device plugin? The device plugin has the clear mission of setting up the device, in the same way CRI has the mission of setting up the container. You can't just assume that devices are stateless.

I don't see why we can't handle the device plugin not being available in the same way CRI already handles its gRPC runtime not being available.
After all, if CRI can't set up the container after a restart because the runtime's gRPC server is not available, it doesn't try to re-use the previous container.

I think this contradicts the failure recovery statement we made in the DP design doc: if any device plugin fails, we don't expect Kubelet to fail or restart any running pods or containers that are using these devices.

I'm not sure how failing a container that is restarting because the device plugin is not available conflicts with that statement.
The reason we wrote this sentence is that we're basically saying that we aren't going to kill "running" containers if the device plugin is not available. But we aren't giving any guarantees about restarting the container.

First is security isolation: we want to zero out device memory before starting a container. For this use case, first of all, I feel most security isolation guarantees Kubernetes provides are at the pod level.

You are ignoring the container security model defined in CRI here. The isolation guarantees provided by Kubernetes are not just at the pod level; there is a security model in place at the container level too.

Even if we want to provide container-level security isolation, I don't see why an Allocate call is required for the most common use case, when a single container in a pod requests a DP resource.

Sorry, I don't understand what you are asking and/or suggesting?

The second reason is device state checking and recovery. For this use case, I would like to better understand how often a crashed process/container may leave a device in a bad state

Sorry, I don't understand what kind of metrics you are requesting here.
Maybe an answer to this is that it's possible and we've seen it happen in the field.

I also want to discuss alternative ways of performing per-container-start operations. As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

Because the device plugin is here to do infrastructure operations such as cleaning up the device, testing its health, and cleaning up memory, processes, ...
The runtime pre-start hook is here to set up the container so that it has access to GPUs.

But more importantly, by asking us to move the infrastructure operations into the runtime hook, what we would essentially be saying is that this doesn't fit the device plugin API and lifecycle.

That is essentially the opposite direction from the requirements for the beta roadmap and the whole purpose of the device plugin.

@jiayingz
Contributor

Thanks a lot for your inputs, @flx42 and @RenaudWasTaken

I understand that on your side, you want to perform certain operations per container start to detect some potential device failure scenarios, and prefer running those operations earlier in the container start process. Generally, I think we want to optimize for the most common failure cases and make some tradeoffs. Given that the device plugin is still relatively new and is more likely to contain bugs, I would like to minimize its failure impact if a device plugin is unavailable. Right now, I don't think the device plugin is in a comparable position to the container runtime. A system can have some critical failure points, but we always want to minimize the set of such components. It doesn't seem to be good reasoning that because we have some critical failure points now, we can keep adding more. Also, I think the CRI container security model mentioned above is actually defined at the pod level.

I would suggest you run the mentioned operations in your runc hook for now. We can revisit this after getting more experience with using the device plugin in production. If we do see strong use cases for guaranteeing per-container-run allocation, I agree with @vikaschoudhary16 that we perhaps want to add this as a configurable behavior.

@RenaudWasTaken
Contributor

RenaudWasTaken commented Jan 19, 2018

I understand that on your side, you want to perform certain operations per container start to detect some potential device failure scenarios, and prefer running those operations earlier in the container start process.

What we are saying is that the device plugin runs infrastructure operations such as device cleanup,
whereas the pre-start hook runs container setup operations (such as creating the devices, setting the cgroups, ...).

Generally, I think we want to optimize for the most common failure cases and make some tradeoffs.

Undefined behavior in the case of a container restart is not an acceptable tradeoff.
Furthermore, I expect device plugins to be a really stable piece of software compared to containers that might be running experimental code (e.g. Deep Learning training).
IMHO the most common failure case is probably a container restart, not a device plugin that's unavailable.

Given that the device plugin is still relatively new and is more likely to contain bugs

If that's the line of argument you are going to take, then we can legitimately ask whether the device plugin is stable enough to graduate to beta.

I would like to minimize its failure impact if a device plugin is unavailable.
Right now, I don't think the device plugin is in a comparable position to the container runtime. A system can have some critical failure points, but we always want to minimize the set of such components.

That's reasonable. However, I'd be interested in having metrics on how often these critical failure points happen in a production cluster compared to a container restart.

My intuition is that, as mentioned above, the device plugin is a robust, well-audited piece of software that should rarely fail. On the other hand, containers that restart, especially in DL, may not always be running production-ready code.
Thus introducing undefined behavior on container restart is probably injecting a bigger failure point than the one you are trying to reduce.

Also, I think the CRI container security model mentioned above is actually defined at the pod level.

Apologies, I pointed to the wrong structure; here is the ContainerSecurityContext.

I would suggest you run the mentioned operations in your runc hook for now. We can revisit this after getting more experience with using the device plugin in production.

A bit off topic: can you tell me how GKE intends to graduate GPUs to beta if memory scrubbing, device reset, ... is only done by the NVIDIA device plugin and not in the GKE device plugin?

If we do see strong use cases for guaranteeing per-container-run allocation, I agree with @vikaschoudhary16 that we perhaps want to add this as a configurable behavior.

As mentioned, I'm not against making this a configurable behavior. I'm just not sure for what device plugin we would be implementing it.

@jiayingz
Contributor

jiayingz commented Jan 19, 2018 via email

@flx42

flx42 commented Jan 20, 2018

I actually expect device plugin to communicate container runtime setup requirements to Kubelet and then to container runtime.

It's an interesting question. It's also a fundamental one, since we will probably continue disagreeing until we have settled this matter. :)

I see 4 different features we want to enable for each device:

  1. Runtime support (mounts, cgroups, LSMs)
  2. Health check
  3. Hardware setup and device cleanup
  4. Monitoring

There is potential overlap in the tools used for each component, but I feel that this distinction makes sense.

Now, where should we put each of those?

  • a) In the container runtime or an extension (e.g. CRI pre-start hook).
  • b) In the device plugin.
  • c) In a separate process.

1) -> a), obviously; the device plugin shouldn't (and can't) handle low-level details like the location of the container's ldcache, interacting with the namespaced network stack, or how to handle an SELinux context.

2) -> b), by design of the device plugin. Are you advocating that this should be done in the runtime?
I believe, and that's a rather strong claim, that the VM vs container scenario can't be handled in 1) only (the runtime). It will have to leak into most of the stack.
It also affects the way you do health checks and monitoring, at least. Not all devices might be able to provide these features when you take on the role of the hypervisor.
If I remember correctly, it also impacts the type of device reset you might be able to perform.

4) -> c), for instance cadvisor, or another metrics exporter. There is some overlap with health check though. For us, health check and monitoring can be achieved with the same tools (NVML or DCGM).
A device plugin could therefore also do monitoring and metrics export in a separate thread, but that shouldn't be a requirement.

3) is a bit more fuzzy for me, but I see multiple cases:

  • Hardware setup that doesn't depend on your container: for instance power-up the device or GPIO twiddling, those steps are generic so they could also be in the device plugin.
  • Hardware setup that depends on your container: for instance if you need to reprogram your FPGA, well, you don't have a choice, you might need to do it in runtime, or let the container do it (giving it high privileges). The cost is that this container might take more time to start than the other containers that it needs to interact with. It also increases the possibility of failure at the very last step before the container start, which is not ideal.
  • Device cleanup (reset/scrub): ideally, could be done by a background thread/process after the last container exited. That would map well with the device plugin system.
    So 3) -> b) when you can, but 3) -> a) when you must.

It would be nice to also get feedback from people that are much more knowledgeable with VFIO and hypervisors than me.

@jiayingz
Contributor

jiayingz commented Jan 20, 2018 via email

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2018
@vikaschoudhary16
Contributor Author

@vishh updated, PTAL.

@vishh
Contributor

vishh commented Feb 20, 2018

Right now, we need to enable device plugin feature gate by default (i.e., graduate device plugin to beta) to run the stub based e2e_node device plugin test in normal e2e_node suite

We can run it in a separate suite in the meantime.

@vikaschoudhary16 @RenaudWasTaken @jiayingz APIs are hard to change. I see "potential" use cases for this API, but no concrete ones yet. Can one of you list more concretely what would happen for one (ideally two) devices with this API? Do reach out to the wider community if necessary. Let's all be certain that this API is the right one based on use cases, since we are introducing this API directly to Beta.

// PreStart is called, if indicated by the device plugin during the registration phase,
// before each container start. The device plugin can run device-specific operations
// such as resetting the device before making devices available to the container.
rpc PreStart(PreStartRequest) returns (PreStartResponse) {}
Contributor

nit: Can you name it PreStartContainer, because we may have a PreStartPod in the future (potentially for Network Plugins)?
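
To illustrate the kubelet side of this RPC, here is a sketch of how the call could be gated on the flag the plugin supplied at registration and bounded by a timeout (a 30s upper bound is discussed later in this conversation); the endpoint type and function name are illustrative, not the exact device manager code:

package example

import (
	"context"
	"time"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// endpoint is a simplified stand-in for the device manager's per-plugin
// endpoint, holding the registration options and the gRPC client.
type endpoint struct {
	preStartRequired bool                         // flag received from the plugin at registration
	client           pluginapi.DevicePluginClient // gRPC client connected to the plugin
}

// callPreStartIfRequired invokes the pre-start RPC only for plugins that
// asked for it, and bounds the call with a timeout so a slow plugin cannot
// hold up container creation indefinitely.
func callPreStartIfRequired(ep *endpoint, deviceIDs []string) error {
	if !ep.preStartRequired {
		return nil // plugin did not request the call at registration time
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_, err := ep.client.PreStartContainer(ctx, &pluginapi.PreStartContainerRequest{DevicesIDs: deviceIDs})
	return err // an error here fails the container start
}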

@vishh
Contributor

vishh commented Feb 20, 2018

@RenaudWasTaken

Run inexpensive cleanup operations (i.e: not scrubbing the memory)

Can you specify what operations these are?

Run inexpensive and synchronous check operations (i.e., look whether all the memory is available) and switch the GPU to unhealthy if the operations can't be done within the required time

Can these operations be run out of band, not as a process on the device?

@RenaudWasTaken
Contributor

@RenaudWasTaken given that most of your suggestions are unit test improvements, I would also suggest that we address them in a follow-up PR, given that we do need this change to unblock other changes, like changing the gke gpu device plugin etc.

@jiayingz there is one outstanding bug here related to the device plugin system never removing a device plugin (or cleaning up the map). But in that case the fix is significant and is not completely related to this PR.

However, most of the other standing comments don't require a significant amount of work and are mostly about improving the code introduced in this PR, which is a normal process that we require for all other PRs.
Code freeze is in 6 days, which is enough time to merge this PR, and I'm not sure what compromising on code quality for merging the feature earlier will bring.

@RenaudWasTaken based on our Wed's discussions in rmwg meeting, could you comment on whether 30s timeout is still necessary or we can make it shorter?

We're still discussing internally what operations should be run and what operations shouldn't, as well as getting the metrics for all the architectures and driver versions.
Intuitively it shouldn't reach 30 seconds, but we have to take into account that a one-second operation on a 16-GPU board is 16s.

@RenaudWasTaken
Contributor

RenaudWasTaken commented Feb 20, 2018

Can you specify what operations these are?

We're still discussing internally what operations should be run and what operations shouldn't, as well as getting the metrics for all the architectures and driver versions.
But here are a few of them:

  • healthCheck and reset in case of bad state
  • GPU Allocated memory checks and cleanup
  • "Zombie processes" checks and cleanup
  • ...

@jiayingz
Contributor

@RenaudWasTaken thanks a lot for your input. To quickly summarize your comment: you think the PreStart grpc call is still needed for the nvidia gpu device plugin, and 30s seems OK as a starting point?

@vishh based on these inputs, are you OK to proceed with this PR?

@jiayingz
Contributor

@RenaudWasTaken could you file a separate issue for device manager never removing a device plugin if you believe it is a true bug?

Other than adding a new PreStart grpc call, this PR also changes the device manager to the v1beta1 API. I was thinking of making a separate PR for this but decided to wait for this one to go in first to avoid merge conflicts. I have a pending PR on the gke gpu device plugin to support the v1beta1 DP API that depends on this change. Then I can change the gpu device plugin e2e test to pick up the right image to pass. In parallel, I would like to send a PR to change the DevicePlugins feature gate to beta, which also depends on changing the device manager to the v1beta1 API. I think people's reviewing cycles are getting more limited as we approach the code freeze. I really don't want to accumulate so many changes until the last minute. I understand you have certain preferences on how to structure unit test code. To save some back-and-forth communication, could you make a separate PR for the unit test improvements? That would be a device manager local change and is usually reviewed much faster. Thanks!

@RenaudWasTaken
Contributor

I really don't want to accumulate so many changes till the last minute.

It's not like I am asking for significant changes; I'm asking for sane code changes and some more lines of tests that increase code quality. As @vishh mentioned, since we are introducing this API directly to Beta, let's make sure the code quality is high.

I understand you have certain preference on how to structure unit test code.

Half, if not more, of the requested changes are to make sure that we are correctly testing the feature or to prevent deadlocks in future tests; they are not about structure, just code quality:

  • Fatalf instead of using FailNow
  • Don't use hardcoded values for testing
  • Use NewManager instead of filling the manager structure by hand

I think people's reviewing cycles are getting more limited as we are approaching the code freeze
To save some back-forth communication, could you make a separate PR for unit test improvements? That will be a device manager local change and is usually reviewed much faster. Thanks!

Now I'm confused: do we have limited cycles or not?
If we do, then let's make sure this PR we all agree on has a quality of code we all feel OK merging into the Kubernetes code base, instead of merging something we don't feel comfortable with and losing cycles reviewing code improvements for code we weren't comfortable with in the first place.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 21, 2018
@vikaschoudhary16
Contributor Author

@vishh @RenaudWasTaken @jiayingz PTAL now.

@jiayingz
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2018
Contributor
@RenaudWasTaken RenaudWasTaken left a comment

/lgtm

@vishh
Contributor

vishh commented Feb 21, 2018

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, RenaudWasTaken, vikaschoudhary16, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2018
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hw-accelerators cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Provide a mechanism in Kubelet device plugin manager to ensure per-container-creation Allocate gRPC call