Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix GPU resource validation that incorrectly allows zero limits #50218

Conversation

dixudx
Copy link
Member

@dixudx dixudx commented Aug 7, 2017

What this PR does / why we need it:

The validation logic for GPUs is not run if limits is not set for GPUs.
We need to check limits equals requests even if just request is set for GPUs.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #50182

Special notes for your reviewer:
/assign @vishh
/cc @jiayingz

Release note:

fix GPU resource validation that incorrectly allows zero limits

@k8s-ci-robot
Copy link
Contributor

@dixudx: GitHub didn't allow me to request PR reviews from the following users: jiayingz.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

What this PR does / why we need it:

The validation logic for GPUs is not run if limits is not set for GPUs.
We need to check limits equals requests even if just request is set for GPUs.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #50182

Special notes for your reviewer:
/assign @vishh
/cc @jiayingz

Release note:

fix GPU resource validation that incorrectly allows zero limits

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 7, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @dixudx. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 7, 2017
@dixudx
Copy link
Member Author

dixudx commented Aug 7, 2017

/assign @lavalamp @thockin

@vishh
Copy link
Contributor

vishh commented Aug 7, 2017

@dixudx I appreciate posting this PR. Your contributions are much welcome. In the future, can I request you to mention explicitly that you are working on a fix for an issue before you post a patch?
This will help avoid multiple people spending time on the same problem. Co-ordination is a but difficult in an OSS project. Your cooperation is much appreciated :)

@@ -60,6 +60,15 @@ func ValidateResourceRequirements(requirements *v1.ResourceRequirements, fldPath
allErrs = append(allErrs, validateContainerResourceName(string(resourceName), fldPath)...)
// Validate resource quantity.
allErrs = append(allErrs, ValidateResourceQuantityValue(string(resourceName), quantity, fldPath)...)

// For GPUs, not only requests can't exceed limits, they also can't be lower, i.e. must be equal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not appropriate to the check down below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishh Updated. PTAL. Thanks.

Also many thanks for your kind reminder.

@dixudx dixudx force-pushed the fix_GPU_resource_zero_limit_validation branch from 3a39bbd to e150ea5 Compare August 7, 2017 02:34
@@ -60,6 +60,15 @@ func ValidateResourceRequirements(requirements *v1.ResourceRequirements, fldPath
allErrs = append(allErrs, validateContainerResourceName(string(resourceName), fldPath)...)
// Validate resource quantity.
allErrs = append(allErrs, ValidateResourceQuantityValue(string(resourceName), quantity, fldPath)...)

// For GPUs, limits has to be set equal to requests.
if resourceName == v1.ResourceNvidiaGPU {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since limits has to be set equal to requests for GPU, would it be better to set limits defaulting to requests if limits not configured? @vishh

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we don't really have any precedent for request/limit defaulting per-resource. I sort of wish we did. What do you think @vishh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, for resources like CPU/memory, if their requests are not configured, they will be defaulted to the limits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every line of code we introduce that's per resource makes the system harder to explain.
GPUs will be kicked out to an extension sometime soon. So I don't think its worth expanding the validation logic just for GPUs.

@thockin
Copy link
Member

thockin commented Aug 7, 2017 via email

@xiangpengzhao
Copy link
Contributor

AFAIK, the default for both CPU and RAM if limit is unspecified is "inf".

Correct. But what I meant is that if limit is specified but request is unspecified, request will default to limit. @vishh right?

@dixudx
Copy link
Member Author

dixudx commented Aug 7, 2017

@xiangpengzhao

if limit is specified but request is unspecified, request will default to limit.

Exactly.

@vishh
Copy link
Contributor

vishh commented Aug 7, 2017

Why can't GPU just ignore limit - it's kind of semantically silly?

@thockin specifying request for GPUs is silly as well since GPUs (or any non shared resource) cannot exceed their requests.
I think the validation logic could have been limits must be specified for GPUs and request must equal limits. If only limits is specified, validation would always pass due to defaulting.

@vishh
Copy link
Contributor

vishh commented Aug 7, 2017

@dixudx We need unit tests for this PR.

@@ -60,6 +60,15 @@ func ValidateResourceRequirements(requirements *v1.ResourceRequirements, fldPath
allErrs = append(allErrs, validateContainerResourceName(string(resourceName), fldPath)...)
// Validate resource quantity.
allErrs = append(allErrs, ValidateResourceQuantityValue(string(resourceName), quantity, fldPath)...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the check under limits validation can be merged into this logic.

  1. If limits are specified and requests aren't specified, requests will equal limits. The logic here will be hit.
  2. If limits aren't specified and request is, then this logic will be hit as well.
  3. If neither is specified, we do not care.

@dixudx dixudx force-pushed the fix_GPU_resource_zero_limit_validation branch from e150ea5 to a7deba8 Compare August 9, 2017 12:53
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 9, 2017
@dixudx
Copy link
Member Author

dixudx commented Aug 9, 2017

@vishh @thockin Updated. PTAL. Thanks.

@cmluciano
Copy link

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 9, 2017
@dixudx
Copy link
Member Author

dixudx commented Aug 10, 2017

/retest

@dixudx
Copy link
Member Author

dixudx commented Aug 10, 2017

@vishh @thockin Updated. PTAL. Thanks.

@vishh
Copy link
Contributor

vishh commented Aug 10, 2017

/lgtm
/approve

@thockin This PR is ready for your review (again)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2017
}
for resourceName, quantity := range requirements.Requests {
fldPath := reqPath.Key(string(resourceName))
// Validate resource name.
allErrs = append(allErrs, validateContainerResourceName(string(resourceName), fldPath)...)
// Validate resource quantity.
allErrs = append(allErrs, ValidateResourceQuantityValue(string(resourceName), quantity, fldPath)...)

// Check that request <= limit.
limitQuantity, exists := requirements.Limits[resourceName]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this function exist in 2 places? @caesarxuchao your fingerprints are on the commit that copied these functions into versioned. Why? This is a nightmare for updates like this - who would expect this code to be in two places?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, there are also other functions which exist in two places. We DO NOT want to maintain things this way.

See one example:
https://github.com/kubernetes/kubernetes/blob/master/pkg/api/v1/service/util.go
https://github.com/kubernetes/kubernetes/blob/master/pkg/api/service/util.go

They almost have the same functions. We'd better only keep one copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tim, I sent an issue to track this. #50653

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per #50653 (comment) by @liggitt , we should keep them as-is...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with liggitt's comment. I'd like to explore if we can make admission controllers operate on versioned objects, because we want to make the life easier for developers of thirdparty admission webhooks.

@thockin
Copy link
Member

thockin commented Aug 17, 2017

/approve

@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 17, 2017
@dixudx dixudx force-pushed the fix_GPU_resource_zero_limit_validation branch from 6a177a8 to 75258b2 Compare August 17, 2017 07:43
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 17, 2017
@dixudx
Copy link
Member Author

dixudx commented Aug 17, 2017

/test pull-kubernetes-e2e-gce-etcd3

@dixudx
Copy link
Member Author

dixudx commented Aug 18, 2017

ping @vishh @thockin @caesarxuchao

Rebased. PTAL. Need your /lgtm again. Thanks.

@vishh
Copy link
Contributor

vishh commented Aug 18, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2017
@vishh vishh added this to the v1.8 milestone Aug 18, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dixudx, thockin, vishh

Associated issue: 50182

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 075d209 into kubernetes:master Aug 18, 2017
@dixudx dixudx deleted the fix_GPU_resource_zero_limit_validation branch August 18, 2017 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GPU resource validation incorrectly allows zero limits
9 participants