
Invoke preStart RPC call before container start, if desired by plugin #58282

Merged

Conversation


@vikaschoudhary16 vikaschoudhary16 commented Jan 15, 2018

What this PR does / why we need it:

  1. Adds a new preStart RPC to the device plugin API
  2. Updates Register RPC handling to receive a flag from device plugins indicating whether kubelet should invoke the preStart RPC before starting a container (see the sketch at the end of this description)
  3. Changes the device manager to invoke preStart before container start
  4. Updates test cases

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56943 #56307

Special notes for your reviewer:

Release note:

None

/sig node

/area hw-accelerators
/cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm
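
A minimal plugin-side sketch of points 1 and 2 above, written against the v1beta1 device plugin package path in the kubernetes tree; the names PreStartContainer, DevicePluginOptions.PreStartRequired and PreStartContainerRequest.DevicesIDs are assumed here (naming was still being discussed in this PR), and resetDevice is a hypothetical helper:

package example

import (
	"context"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

type myDevicePlugin struct{}

// Advertises plugin options to kubelet; PreStartRequired asks kubelet to
// invoke the pre-start RPC before starting every container that uses this
// plugin's resource.
func (p *myDevicePlugin) GetDevicePluginOptions(ctx context.Context, _ *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{PreStartRequired: true}, nil
}

// Runs device-specific setup (reset, scrub, health check) for the devices
// already allocated to the container that is about to start.
func (p *myDevicePlugin) PreStartContainer(ctx context.Context, req *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	for _, id := range req.DevicesIDs {
		resetDevice(id) // hypothetical device-specific cleanup helper
	}
	return &pluginapi.PreStartContainerResponse{}, nil
}

func resetDevice(id string) { /* device-specific reset/scrub would go here */ }

Only the two methods relevant to this PR are shown; a real plugin also implements ListAndWatch and Allocate.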

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 15, 2018
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 15, 2018
@k8s-ci-robot k8s-ci-robot added area/hw-accelerators size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 15, 2018
@k8s-ci-robot k8s-ci-robot requested a review from vishh January 15, 2018 10:52
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 15, 2018
@k8s-ci-robot
Contributor

@vikaschoudhary16: GitHub didn't allow me to request PR reviews from the following users: tengqm, ScorpioCPH.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What this PR does / why we need it:
Adds a mechanism in Kubelet device plugin manager to ensure per-container-creation Allocate gRPC call

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56943

Special notes for your reviewer:

Release note:

None

/sig node

/area hw-accelerators
/cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}
if resp == nil {
Contributor

When is resp != nil ?

Contributor

Looking at the code it's probably on container restart.
If that's the case I would expect a container restart to issue an Allocate call (i.e., reset the device to the same state at every container start).

Contributor Author

Making an Allocate call on each container restart has the side effect that containers which were up may not restart if the device plugin has failed.
How about a configurable option, set by the device plugin at registration time, CacheUse? Enabled by default it would mean the current PR behavior; if the device plugin sets it to false, Allocate will be invoked every time. WDYT?

Contributor

Always issuing an Allocate call on restart would make the checkpoint mechanism useless.
Agreed on making it optional if we indeed have the use case.

Contributor

I guess the question you are really asking is: what is the device plugin system expected to do?
The purpose of the device plugin is to set up a device before making it available in a container.

Making an Allocate call on each container restart has the side effect that containers which were up may not restart if the device plugin has failed.

Not making an Allocate call also has the side effect of breaking container reproducibility.

How about a configurable option, set by the device plugin at registration time, CacheUse? Enabled by default it would mean the current PR behavior; if the device plugin sets it to false, Allocate will be invoked every time. WDYT?

I agree with @lichuqiang: do we have a use case for caching it?
Is there a device plugin that needs caching?

Contributor Author

@RenaudWasTaken hmm, so you are advocating for making an Allocate call for each container every time. I am fine with updating the changes if that's the majority opinion. I personally would prefer to have it as a configuration option, using the cache in the default configuration. This would shorten container start-up time.

Contributor

@RenaudWasTaken hmm, so you are advocating for making an Allocate call for each container every time. I am fine with updating the changes if that's the majority opinion.

If there is no use case I'm not sure it's worth it.

This would shorten container start-up time.

Let's do some actual metrics :)
It's also possible that checkpointing the data is more expensive.

invokeRPCBoolFlag = true

} else {
invokeRPCBoolFlag = m.podDevices.isAllocationOverlapping(podUID, contName, resource, devices)
Contributor
@RenaudWasTaken RenaudWasTaken Jan 15, 2018

What does it mean when resp != nil and a device is overlapping? i.e., what is this case for?

Contributor Author

The overlapping case is the restart of a container that has re-allocated devices which are being reused within the pod's containers.

Contributor

The overlapping case is the restart of a container that has re-allocated devices which are being reused within the pod's containers.

So if a container restarts and it has at least one re-allocated device (I'm guessing InitContainer + Container), then we issue an Allocate call?
What's the use case for that?

Contributor Author

Seems this makes more sense if reuse is between regular containers, which is not supported at the moment. In the (init + regular) case, it does not seem to make sense. Will update.

Contributor
@RenaudWasTaken RenaudWasTaken left a comment

Do you mind adding a few more comments explaining the different code paths?

@@ -524,7 +524,7 @@ func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, requi
needed := required
// Gets list of devices that have already been allocated.
// This can happen if a container restarts for example.
Contributor

This can happen if a container restarts for example.

Can this still happen given that it's called from the acceptation predicate ?

Contributor Author

The call from the acceptation predicate is not newly introduced in this PR; therefore, these comments can be corrected independently of this PR.

Contributor

Leaving a TODO if that's the case would be a good idea though.

Contributor Author

sure


m.mutex.Lock()
e, ok := m.endpoints[resource]
m.podDevices.insert(podUID, contName, resource, allocDevices, nil)
Contributor

Given that we don't need to cache the Allocate response, we should remove the last parameter.

Contributor Author

Do you mean not needed because we would make Allocate each time?

Contributor

Do you mean not needed because we would make Allocate each time?

I think so :), this insert will be overwritten by the following GetDeviceRunContainerOptions.

Contributor Author

@ScorpioCPH Please note that the last parameter, resp, is nil here; in GetDeviceRunContainerOptions, resp will be updated. So I would say it will be "updated".
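
For readers following this exchange, here is a simplified, illustrative sketch of the caching being discussed: the admission path stores the chosen devices with a nil response, and GetDeviceRunContainerOptions later overwrites the entry with the real Allocate response. The types below are stand-ins, not the exact kubelet code:

package example

import (
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// deviceAllocateInfo is a simplified stand-in for the kubelet's per-container
// cache entry: the devices picked for the container plus the (possibly nil)
// Allocate response.
type deviceAllocateInfo struct {
	deviceIDs map[string]bool             // devices assigned to the container
	allocResp *pluginapi.AllocateResponse // nil until Allocate has actually been issued
}

// podDevices maps podUID -> container name -> resource name -> allocation info.
type podDevices map[string]map[string]map[string]deviceAllocateInfo

// insert records an allocation. It is called with resp == nil from the
// admission path and called again later, from GetDeviceRunContainerOptions,
// with the real response, which overwrites ("updates") the cached entry.
func (pdev podDevices) insert(podUID, contName, resource string, devs map[string]bool, resp *pluginapi.AllocateResponse) {
	if pdev[podUID] == nil {
		pdev[podUID] = make(map[string]map[string]deviceAllocateInfo)
	}
	if pdev[podUID][contName] == nil {
		pdev[podUID][contName] = make(map[string]deviceAllocateInfo)
	}
	pdev[podUID][contName][resource] = deviceAllocateInfo{deviceIDs: devs, allocResp: resp}
}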

continue
}
m.mutex.Lock()
devices, resp := m.podDevices.containerDevices(podUID, contName, string(k))
Contributor

We probably don't need the resp here

Contributor Author

Do you mean not needed because we would make Allocate each time?
If yes, please share your opinion on making this a configurable option, as I explained in another comment.

Contributor
@lichuqiang lichuqiang left a comment

Seems the allocation process changed with this PR, that is (if I'm not mistaken):

  1. Calculate resource usage and insert it into podDevices during the predicate phase;
  2. Do the real allocation when generating the container runtime info (cached resp is nil, or devices got reused by other containers).

So I think it is necessary to update the comments accordingly; otherwise, people who study the code for the first time might get confused. A sketch of the two phases follows.
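
A rough, self-contained sketch of that two-phase flow, using deliberately simplified stand-in types (manager, containerAlloc) rather than the real ManagerImpl/podDevices; the reuse-across-containers case mentioned above is noted but omitted:

package example

type allocateResponse struct{ envs map[string]string }

type containerAlloc struct {
	devices []string
	resp    *allocateResponse // nil until the plugin's Allocate has been called
}

type manager struct {
	// podUID/containerName/resource -> cached allocation
	cache map[string]containerAlloc
	// allocate issues the real gRPC Allocate call to the plugin endpoint.
	allocate func(devices []string) (*allocateResponse, error)
}

// Phase 1 (admission/predicate path): reserve devices and checkpoint them,
// but leave resp nil; no gRPC call is made yet.
func (m *manager) admit(key string, devices []string) {
	m.cache[key] = containerAlloc{devices: devices, resp: nil}
}

// Phase 2 (building the container's runtime options): call Allocate only if
// there is no cached response yet. (The PR also re-issues Allocate when
// devices are reused across containers in the pod; omitted here.)
func (m *manager) runContainerOptions(key string) (*allocateResponse, error) {
	entry := m.cache[key]
	if entry.resp != nil {
		return entry.resp, nil // container restart: reuse cached response
	}
	resp, err := m.allocate(entry.devices)
	if err != nil {
		return nil, err
	}
	entry.resp = resp
	m.cache[key] = entry
	return resp, nil
}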


@jiayingz
Contributor

Sorry for joining the conversation late.

The original model I was thinking of when filing FR #56943 is that we make an Allocate grpc call only when we re-allocate a device to a different container. For the most common case, when a single container inside a pod requests DP resources, we will not re-issue the Allocate grpc call when the container restarts. From the discussions in this PR, it looks like the model @RenaudWasTaken proposes is to always issue an Allocate grpc call whenever a container restarts. I think this contradicts the failure recovery statement we made in the DP design doc: if any device plugin fails, we don't expect Kubelet to fail or restart any running pods or containers that are using these devices. I would like to better understand the use cases to decide how we should proceed.

From our discussions, I think there are two main reasons @RenaudWasTaken mentioned for why an Allocate call is desired per container start.

First is security isolation: we want to zero out device memory before starting a container. For this use case, first of all, I feel most security isolation guarantees Kubernetes provides are at the pod level. Even if we want to provide container-level security isolation, I don't see why an Allocate call is required for the most common use case, when a single container in a pod requests a DP resource.

The second reason is device state checking and recovery. For this use case, I would like to better understand how often a crashed process/container may leave a device in a bad state and how we handle this kind of failure in a non-container environment. For example, one case @RenaudWasTaken mentioned is that a crashed process could leave some GPU memory unallocatable. I think we have the same problem in a non-container environment, and I would expect the device driver to do proper garbage collection to handle this case.

I also want to discuss alternative ways of performing per-container-start operations. As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

@vikaschoudhary16
Contributor Author

As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

+1

Contributor
@ScorpioCPH ScorpioCPH left a comment

Hi, some quick comments; I will try to understand the logic in more depth later.

// Continues if this is neither an active device plugin resource nor
// a resource we have previously allocated.
if !registeredResource && !allocatedResource {
continue
Contributor

Should we throw an error/warning here?

Contributor Author

This is not actually an error condition. For other resources like cpu and memory, continue will be invoked.

continue
}
m.mutex.Lock()
devices, resp := m.podDevices.containerDevices(podUID, contName, string(k))
Contributor

nit: s/string(k)/resource

err := m.writeCheckpoint()
if err != nil {
glog.Errorf("Checkpointing failed for pod %s, container %s", podUID, contName)
}
return m.podDevices.deviceRunContainerOptions(string(pod.UID), container.Name)
Contributor

nit: s/string(pod.UID)/podUID


@flx42

flx42 commented Jan 18, 2018

As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

Because it's a prestart hook :), we want to do the health check early, and not pull the full image and set up the container filesystem only to fail just before the pivot_root.

@RenaudWasTaken
Contributor

Hello!

I'll do a general explanation of the problem that is being discussed here and then answer your questions @jiayingz.

The way we designed the device plugin was for running device-specific operations, mostly in order to reset devices. That guarantees that containers always see devices in the same state, thus keeping the container promise of reproducibility.

Examples of device operations for GPUs include:

  • memory scrubbing
  • healthCheck and reset in case of bad state
  • GPU Allocated memory checks
  • "Zombie processes" checks
  • ...

Intel mentioned something about buffer reservations at some point in yesterday's meeting (though I'm not sure I recall correctly).

Having a model where you don't set up the device before your container starts means that (aside from the added code complexity and the performance decrease from having to cache the response) you are breaking container reproducibility and you can't guarantee any security features.

E.g., a TensorFlow job requests all the GPU memory and then crashes (because of some software bug). Say in this particular case it didn't release the memory and the driver couldn't release it; on the next restart, because no Allocate call was made, the device plugin couldn't detect this and try to fix the behavior, so the container will just crash because not enough memory is available.

If we applied the model you are suggesting to CRI, what you would be asking is to re-use the same filesystem, cgroups, ... when a container restarts.

CRI didn't choose this model, so why are you asking it to be different for the device plugin? The device plugin has the clear mission of setting up the device, in the same way CRI has the mission of setting up the container. You can't just assume that devices are stateless.

I don't see why we can't handle the device plugin not being available in the same way CRI already handles its gRPC runtime not being available.
After all, if CRI can't set up the container after a restart because the runtime's gRPC server is not available, it doesn't try to re-use the previous container.

I think this contradicts the failure recovery statement we made in the DP design doc: if any device plugin fails, we don't expect Kubelet to fail or restart any running pods or containers that are using these devices.

I'm not sure how failing a container that is restarting because the device plugin is not available conflicts with that statement.
The reason we wrote this sentence is that we're basically saying that we aren't going to kill "running" containers if the device plugin is not available. But we aren't giving any guarantees about restarting the container.

First is security isolation: we want to zero out device memory before starting a container. For this use case, first of all, I feel most security isolation guarantees Kubernetes provides are at the pod level.

You are ignoring the container security model defined in CRI here. The isolation guarantees provided by Kubernetes are not just at the pod level; there is a security model in place at the container level too.

Even if we want to provide container-level security isolation, I don't see why an Allocate call is required for the most common use case, when a single container in a pod requests a DP resource.

Sorry, I don't understand what you are asking and/or suggesting?

The second reason is device state checking and recovery. For this use case, I would like to better understand how often a crashed process/container may leave a device in a bad state

Sorry, I don't understand what kind of metrics you are requesting here.
Maybe an answer to this is that it's possible and we've seen it happen in the field.

I also want to discuss alternative ways of performing per-container-start operations. As I understand it, the nvidia device plugin is already taking the path of running specific runc prestart hooks. In that case, why can't the operations you mentioned be performed in the runc hook?

Because the device plugin is here to do infrastructure operations such as cleaning up the device, testing its health, and cleaning up memory, processes, ...
The runtime pre-start hook is here to set up the container so that it has access to GPUs.

But more importantly, by asking us to move the infrastructure operations into the runtime hook, what we would essentially be saying is that this doesn't fit the device plugin API and lifecycle.

That is essentially the opposite direction from the requirements for the beta roadmap and the whole purpose of the device plugin.

@jiayingz
Contributor

Thanks a lot for your inputs, @flx42 and @RenaudWasTaken

I understand that on your side, you want to perform certain operations per container start to detect some potential device failure scenarios, and prefer running those operations earlier in the container start process. Generally, I think we want to optimize for the most common failure cases and make some tradeoffs. Given that the device plugin is still relatively new and is more likely to contain bugs, I would like to minimize its failure impact if a device plugin is unavailable. Right now, I don't think the device plugin is in a comparable position to the container runtime. A system can have some critical failure points, but we always want to minimize the set of such components. It doesn't seem to be good reasoning that because we have some critical failure points now, we can keep adding more. Also, I think the CRI container security model mentioned above is actually defined at the pod level.

I would suggest you run the mentioned operations in your runc hook for now. We can revisit this after getting more experience with using the device plugin in production. If we do see strong use cases for guaranteeing per-container-run allocation, I agree with @vikaschoudhary16 that we perhaps want to add this as a configurable behavior.

@RenaudWasTaken
Contributor

RenaudWasTaken commented Jan 19, 2018

I understand that on your side, you want to perform certain operations per container start to detect some potential device failure scenarios, and prefer running those operations earlier in the container start process.

What we are saying is that the device plugin runs infrastructure operations such as device cleanup,
whereas the pre-start hook runs container setup operations (such as creating the devices, setting the cgroups, ...).

Generally, I think we want to optimize for the most common failure cases and make some tradeoffs.

Undefined behavior in the case of a container restart is not an acceptable tradeoff.
Furthermore, I expect device plugins to be a really stable piece of software compared to containers that might be running experimental code (e.g. Deep Learning training).
IMHO the most common failure case is probably a container restart, not a device plugin that's unavailable.

Given that the device plugin is still relatively new and is more likely to contain bugs

If that's the line of argument you are going to take, then we can legitimately ask whether the device plugin is stable enough to graduate to beta.

I would like to minimize its failure impact if a device plugin is unavailable.
Right now, I don't think the device plugin is in a comparable position to the container runtime. A system can have some critical failure points, but we always want to minimize the set of such components.

That's reasonable. However, I'd be interested in having metrics on how often these critical failure points happen in a production cluster compared to a container restart.

My intuition is that, as mentioned above, the device plugin is a robust, well-audited piece of software that should rarely fail. On the other hand, containers that restart, especially in DL, may not always be running production-ready code.
Thus introducing undefined behavior on container restart is probably injecting a bigger failure point than the one you are trying to reduce.

Also, I think the CRI container security model mentioned above is actually defined at the pod level.

Apologies, I pointed to the wrong structure; here is the ContainerSecurityContext.

I would suggest you run the mentioned operations in your runc hook for now. We can revisit this after getting more experience with using the device plugin in production.

A bit off topic: can you tell me how GKE intends to graduate GPUs to beta if memory scrubbing, device reset, ... is only done by the NVIDIA device plugin and not in the GKE device plugin?

If we do see strong use cases for guaranteeing per-container-run allocation, I agree with @vikaschoudhary16 that we perhaps want to add this as a configurable behavior.

As mentioned, I'm not against making this a configurable behavior. I'm just not sure for what device plugin we would be implementing it.

@jiayingz
Contributor

jiayingz commented Jan 19, 2018 via email

@flx42

flx42 commented Jan 20, 2018

I actually expect device plugin to communicate container runtime setup requirements to Kubelet and then to container runtime.

It's an interesting question. It's also a fundamental one, since we will probably continue disagreeing until we have settled this matter. :)

I see 4 different features we want to enable for each device:

  1. Runtime support (mounts, cgroups, LSMs)
  2. Health check
  3. Hardware setup and device cleanup
  4. Monitoring

There is potential overlap in the tools used for each component, but I feel that this distinction makes sense.

Now, where should we put each of those?

  • a) In the container runtime or an extension (e.g. CRI pre-start hook).
  • b) In the device plugin.
  • c) In a separate process.

1) -> a), obviously; the device plugin shouldn't (and can't) handle low-level details like the location of the container's ldcache, interacting with the namespaced network stack, or how to handle an SELinux context.

2) -> b), by design of the device plugin. Are you advocating that this should be done in the runtime?
I believe, and that's a rather strong claim, that the VM vs container scenario can't be handled in 1) only (the runtime). It will have to leak into most of the stack.
It also affects the way you do health checks and monitoring, at least. Not all devices might be able to provide these features when you take on the role of the hypervisor.
If I remember correctly, it also impacts the type of device reset you might be able to perform.

4) -> c), for instance cadvisor, or another metrics exporter. There is some overlap with health check though. For us, health check and monitoring can be achieved with the same tools (NVML or DCGM).
A device plugin could therefore also do monitoring and metrics export in a separate thread, but that shouldn't be a requirement.

3) is a bit more fuzzy for me, but I see multiple cases:

  • Hardware setup that doesn't depend on your container: for instance power-up the device or GPIO twiddling, those steps are generic so they could also be in the device plugin.
  • Hardware setup that depends on your container: for instance if you need to reprogram your FPGA, well, you don't have a choice, you might need to do it in runtime, or let the container do it (giving it high privileges). The cost is that this container might take more time to start than the other containers that it needs to interact with. It also increases the possibility of failure at the very last step before the container start, which is not ideal.
  • Device cleanup (reset/scrub): ideally, could be done by a background thread/process after the last container exited. That would map well with the device plugin system.
    So 3) -> b) when you can, but 3) -> a) when you must.

It would be nice to also get feedback from people that are much more knowledgeable with VFIO and hypervisors than me.

@jiayingz
Contributor

jiayingz commented Jan 20, 2018 via email

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2018
@vikaschoudhary16
Contributor Author

@vishh updated, PTAL.

@vishh
Contributor

vishh commented Feb 20, 2018

Right now, we need to enable device plugin feature gate by default (i.e., graduate device plugin to beta) to run the stub based e2e_node device plugin test in normal e2e_node suite

We can run it in a separate suite in the meantime.

@vikaschoudhary16 @RenaudWasTaken @jiayingz APIs are hard to change. I see "potential" use cases for this API, but no concrete ones yet. Can one of you list more concretely what would happen for one (ideally two) devices with this API? Do reach out to the wider community if necessary. Let's all be certain that this API is the right one based on use cases, since we are introducing this API directly to Beta.

// PreStart is called, if indicated by the device plugin during the registration phase,
// before each container start. The device plugin can run device-specific operations
// such as resetting the device before making devices available to the container.
rpc PreStart(PreStartRequest) returns (PreStartResponse) {}
Contributor

nit: Can you name it PreStartContainer, because we may have a PreStartPod in the future (potentially for Network Plugins)?
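
To illustrate the kubelet side of this RPC, here is a sketch of how the call could be gated on the flag the plugin supplied at registration and bounded by a timeout (a 30s upper bound is discussed later in this conversation); the endpoint type and function name are illustrative, not the exact device manager code:

package example

import (
	"context"
	"time"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// endpoint is a simplified stand-in for the device manager's per-plugin
// endpoint, holding the registration options and the gRPC client.
type endpoint struct {
	preStartRequired bool                         // flag received from the plugin at registration
	client           pluginapi.DevicePluginClient // gRPC client connected to the plugin
}

// callPreStartIfRequired invokes the pre-start RPC only for plugins that
// asked for it, and bounds the call with a timeout so a slow plugin cannot
// hold up container creation indefinitely.
func callPreStartIfRequired(ep *endpoint, deviceIDs []string) error {
	if !ep.preStartRequired {
		return nil // plugin did not request the call at registration time
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_, err := ep.client.PreStartContainer(ctx, &pluginapi.PreStartContainerRequest{DevicesIDs: deviceIDs})
	return err // an error here fails the container start
}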

@vishh
Contributor

vishh commented Feb 20, 2018

@RenaudWasTaken

Run inexpensive cleanup operations (i.e: not scrubbing the memory)

Can you specify what operations these are?

Run inexpensive and synchronous check operations (i.e., look whether all the memory is available) and switch the GPU to unhealthy if the operations can't be done within the required time

Can these operations be run out of band, not as a process on the device?

@RenaudWasTaken
Contributor

@RenaudWasTaken given that most of your suggestions are unit test improvements, I would also suggest that we address them in a follow-up PR, given that we do need this change to unblock other changes, like changing the gke gpu device plugin etc.

@jiayingz there is one outstanding bug here related to the device plugin system never removing a device plugin (or cleaning up the map). But in that case the fix is significant and is not completely related to this PR.

However, most of the other standing comments don't require a significant amount of work and are mostly about improving the code introduced in this PR, which is a normal process that we require for all other PRs.
Code freeze is in 6 days, which is enough time to merge this PR, and I'm not sure what compromising on code quality for merging the feature earlier will bring.

@RenaudWasTaken based on our Wed's discussions in rmwg meeting, could you comment on whether 30s timeout is still necessary or we can make it shorter?

We're still discussing internally what operations should be run and what operations shouldn't, as well as getting the metrics for all the architectures and driver versions.
Intuitively it shouldn't reach 30 seconds, but we have to take into account that a one-second operation on a 16-GPU board is 16s.

@RenaudWasTaken
Contributor

RenaudWasTaken commented Feb 20, 2018

Can you specify what operations these are?

We're still discussing internally what operations should be run and what operations shouldn't, as well as getting the metrics for all the architectures and driver versions.
But here are a few of them:

  • healthCheck and reset in case of bad state
  • GPU Allocated memory checks and cleanup
  • "Zombie processes" checks and cleanup
  • ...

@jiayingz
Contributor

@RenaudWasTaken thanks a lot for your input. To quickly summarize your comment: you think the PreStart grpc call is still needed for the nvidia gpu device plugin, and 30s seems OK as a starting point?

@vishh based on these inputs, are you OK to proceed with this PR?

@jiayingz
Contributor

@RenaudWasTaken could you file a separate issue for device manager never removing a device plugin if you believe it is a true bug?

Other than adding a new PreStart grpc call, this PR also changes the device manager to the v1beta1 API. I was thinking of making a separate PR for this but decided to wait for this one to go in first to avoid merge conflicts. I have a pending PR on the gke gpu device plugin to support the v1beta1 DP API that depends on this change. Then I can change the gpu device plugin e2e test to pick up the right image to pass. In parallel, I would like to send a PR to change the DevicePlugins feature gate to beta, which also depends on changing the device manager to the v1beta1 API. I think people's reviewing cycles are getting more limited as we approach the code freeze. I really don't want to accumulate so many changes until the last minute. I understand you have certain preferences on how to structure unit test code. To save some back-and-forth communication, could you make a separate PR for the unit test improvements? That would be a device manager local change and is usually reviewed much faster. Thanks!

@RenaudWasTaken
Contributor

I really don't want to accumulate so many changes till the last minute.

It's not like I am asking for significant changes; I'm asking for sane code changes and some more lines of tests that increase code quality. As @vishh mentioned, since we are introducing this API directly to Beta, let's make sure the code quality is high.

I understand you have certain preference on how to structure unit test code.

Half, if not more, of the requested changes are to make sure that we are correctly testing the feature or to prevent deadlocks in future tests; they are not about structure, just code quality:

  • Fatalf instead of using FailNow
  • Don't use hardcoded values for testing
  • Use NewManager instead of filling the manager structure by hand

I think people's reviewing cycles are getting more limited as we are approaching the code freeze
To save some back-forth communication, could you make a separate PR for unit test improvements? That will be a device manager local change and is usually reviewed much faster. Thanks!

Now I'm confused: do we have limited cycles or not?
If we do, then let's make sure this PR we all agree on has a quality of code we all feel OK merging into the Kubernetes code base, instead of merging something we don't feel comfortable with and losing cycles reviewing code improvements for code we weren't comfortable with in the first place.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 21, 2018
@vikaschoudhary16
Contributor Author

@vishh @RenaudWasTaken @jiayingz PTAL now.

@jiayingz
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2018
Contributor
@RenaudWasTaken RenaudWasTaken left a comment

/lgtm

@vishh
Contributor

vishh commented Feb 21, 2018

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, RenaudWasTaken, vikaschoudhary16, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2018
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hw-accelerators cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Provide a mechanism in Kubelet device plugin manager to ensure per-container-creation Allocate gRPC call