Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Unhealthy devices #57266

Merged

Conversation

vikaschoudhary16
Copy link
Contributor

Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflect only healthy devices.

What this PR does / why we need it:
Currently node capacity only reflects healthy devices. Unhealthy devices are ignored totally while updating node status. This PR handles unhealthy devices while updating node status.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #57241

Special notes for your reviewer:

Release note:

Handle Unhealthy devices

/cc @tengqm @ConnorDoyle @jiayingz @vishh @jeremyeder @sjenning @resouer @ScorpioCPH @lichuqiang @RenaudWasTaken @balajismaniam

/sig node

@k8s-ci-robot k8s-ci-robot requested a review from vishh December 16, 2017 06:48
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 16, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 16, 2017
@k8s-ci-robot
Copy link
Contributor

@vikaschoudhary16: GitHub didn't allow me to request PR reviews from the following users: RenaudWasTaken, tengqm, ScorpioCPH.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflect only healthy devices.

What this PR does / why we need it:
Currently node capacity only reflects healthy devices. Unhealthy devices are ignored totally while updating node status. This PR handles unhealthy devices while updating node status.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #57241

Special notes for your reviewer:

Release note:

Handle Unhealthy devices

/cc @tengqm @ConnorDoyle @jiayingz @vishh @jeremyeder @sjenning @resouer @ScorpioCPH @lichuqiang @RenaudWasTaken @balajismaniam

/sig node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2017
Copy link
Contributor

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your work! Some quick comments.

for _, name := range deletedResources {
if name != resourceName {
continue
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we remove else statement? as we have called continue in the above if statement.

Copy link
Contributor Author

@vikaschoudhary16 vikaschoudhary16 Dec 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is changed now.

healthyDevices map[string]sets.String

// unhealthyDevices contains all of the unhealthy devices and their exported device IDs.
unhealthyDevices map[string]sets.String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering is there any benefits we keep unhealthyDevices in cache here. Maybe we can make more discussions about this :)

Copy link
Contributor Author

@vikaschoudhary16 vikaschoudhary16 Dec 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubelet needs to update node status taking into account unhealthy devices also(in the capacity). If we dont store unhealthy devices here in device manager, i am not sure how kubelet will sync this info. Would love to hear any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main benefit is to surface unhealthy device information more clearly through node status to facilitate monitoring and problem detection.

@@ -545,6 +545,7 @@ func (kl *Kubelet) setNodeAddress(node *v1.Node) error {
func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
// Note: avoid blindly overwriting the capacity in case opaque
// resources are being advertised.
//glog.Info("Error getting machine info: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this a error info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out. Left by mistake.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2017
@vikaschoudhary16
Copy link
Contributor Author

/test pull-kubernetes-unit

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 19, 2017
Copy link
Contributor

@jiayingz jiayingz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the change! Please see inline comments.

@@ -72,7 +72,7 @@ type ContainerManager interface {

// GetDevicePluginResourceCapacity returns the amount of device plugin resources available on the node
// and inactive device plugin resources previously registered on the node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the comment to cope with the change?

healthyDevices map[string]sets.String

// unhealthyDevices contains all of the unhealthy devices and their exported device IDs.
unhealthyDevices map[string]sets.String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main benefit is to surface unhealthy device information more clearly through node status to facilitate monitoring and problem detection.

}
}
for _, dev := range deleted {
m.allDevices[resourceName].Delete(dev.ID)
if dev.Health == pluginapi.Healthy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dev health state reported here may not be consistent with the cached state. Maybe simply do:
m.healthyDevices[resourceName].Delete(dev.ID)
m.unhealthyDevices[resourceName].Delete(dev.ID)

@@ -594,13 +596,15 @@ func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
}
}

devicePluginCapacity, removedDevicePlugins := kl.containerManager.GetDevicePluginResourceCapacity()
devicePluginCapacity, allocatable, removedDevicePlugins := kl.containerManager.GetDevicePluginResourceCapacity()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just assign to devicePluginAllocatable here directly?

@@ -453,9 +487,9 @@ func (m *ManagerImpl) readCheckpoint() error {
m.podDevices.fromCheckpointData(data.PodDeviceEntries)
m.allocatedDevices = m.podDevices.devices()
for resource, devices := range data.RegisteredDevices {
m.allDevices[resource] = sets.NewString()
m.healthyDevices[resource] = sets.NewString()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a TODO comment on considering to also checkpoint unhealthy device information?

} else {
capacityCount := capacity[v1.ResourceName(resourceName)]
unhealthyCount := *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
capacityCount.Add(unhealthyCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if resource doesn't exist in capacity yet? Would we get a segfault here? Can we add some unit test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting to know. I guess the map lookup just returns struct with zero values when the key doesn't exist.

@RenaudWasTaken
Copy link
Contributor

These documents will need to be updated:

It looks we need to start tracking the API changes we introduce so that we can document them.
Do you think the spreadsheet sounds like a good place to do that ?

@vikaschoudhary16 vikaschoudhary16 force-pushed the unhealthy_device branch 2 times, most recently from c3d1832 to dc541fa Compare January 9, 2018 16:20
@vikaschoudhary16
Copy link
Contributor Author

/retest

Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflect only healthy devices.
@vikaschoudhary16
Copy link
Contributor Author

/retest

1 similar comment
@vikaschoudhary16
Copy link
Contributor Author

/retest

@vikaschoudhary16
Copy link
Contributor Author

@jiayingz ping

@jiayingz
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2018
@vikaschoudhary16
Copy link
Contributor Author

ping @dchen1107 @derekwaynecarr
Looks like OWNER's approval is needed.

Copy link
Contributor

@RenaudWasTaken RenaudWasTaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@derekwaynecarr
Copy link
Member

Nice job.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, jiayingz, RenaudWasTaken, vikaschoudhary16

Associated issue: #57241

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2018
@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit f2e46a2 into kubernetes:master Jan 13, 2018
@vikaschoudhary16 vikaschoudhary16 deleted the unhealthy_device branch January 16, 2018 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve handling of Unhealthy devices
9 participants