Handle Unhealthy devices #57266
Conversation
@vikaschoudhary16: GitHub didn't allow me to request PR reviews from the following users: RenaudWasTaken, tengqm, ScorpioCPH. Note that only kubernetes members can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed ee8267d to 72d3ff2
Thanks a lot for your work! Some quick comments.
for _, name := range deletedResources {
	if name != resourceName {
		continue
	} else {
nit: can we remove the else statement? We have already called continue in the if statement above.
This part has changed now.
healthyDevices map[string]sets.String

// unhealthyDevices contains all of the unhealthy devices and their exported device IDs.
unhealthyDevices map[string]sets.String
I'm just wondering whether there is any benefit to keeping unhealthyDevices in the cache here. Maybe we can discuss this further :)
Kubelet needs to update node status taking unhealthy devices into account as well (in the capacity). If we don't store unhealthy devices here in the device manager, I am not sure how kubelet would sync this info. Would love to hear any suggestions.
I think the main benefit is to surface unhealthy device information more clearly through node status to facilitate monitoring and problem detection.
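To make that concrete, here is a minimal, self-contained sketch of the idea, assuming the healthyDevices/unhealthyDevices sets discussed above. The function name capacityAndAllocatable is made up for illustration; this is not the PR's actual code:

package main

import (
	"fmt"

	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/sets"
)

// capacityAndAllocatable derives node capacity (healthy + unhealthy devices)
// and node allocatable (healthy devices only) from the two cached sets.
func capacityAndAllocatable(healthy, unhealthy map[string]sets.String) (v1.ResourceList, v1.ResourceList) {
	capacity := v1.ResourceList{}
	allocatable := v1.ResourceList{}
	for name, devs := range healthy {
		q := *resource.NewQuantity(int64(devs.Len()), resource.DecimalSI)
		capacity[v1.ResourceName(name)] = q
		allocatable[v1.ResourceName(name)] = q
	}
	for name, devs := range unhealthy {
		// A missing key yields a zero Quantity, so no existence check is needed.
		q := capacity[v1.ResourceName(name)]
		q.Add(*resource.NewQuantity(int64(devs.Len()), resource.DecimalSI))
		capacity[v1.ResourceName(name)] = q
	}
	return capacity, allocatable
}

func main() {
	healthy := map[string]sets.String{"vendor.com/gpu": sets.NewString("dev0", "dev1")}
	unhealthy := map[string]sets.String{"vendor.com/gpu": sets.NewString("dev2")}
	capacity, allocatable := capacityAndAllocatable(healthy, unhealthy)
	fmt.Println(capacity, allocatable) // capacity: 3 gpus, allocatable: 2
}

This is exactly the monitoring benefit mentioned above: the difference between capacity and allocatable surfaces the number of unhealthy devices in node status.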
pkg/kubelet/kubelet_node_status.go
Outdated
@@ -545,6 +545,7 @@ func (kl *Kubelet) setNodeAddress(node *v1.Node) error {
 func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
 	// Note: avoid blindly overwriting the capacity in case opaque
 	// resources are being advertised.
+	//glog.Info("Error getting machine info: %v", err)
nit: is this an error log?
Thanks for pointing that out. Left in by mistake.
Force-pushed 72d3ff2 to 157272a
Force-pushed 157272a to 53a6203
/test pull-kubernetes-unit
Force-pushed 53a6203 to fc59f7c
Thanks a lot for the change! Please see inline comments.
pkg/kubelet/cm/container_manager.go
Outdated
@@ -72,7 +72,7 @@ type ContainerManager interface {

// GetDevicePluginResourceCapacity returns the amount of device plugin resources available on the node
// and inactive device plugin resources previously registered on the node.
Could you update the comment to reflect the change?
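For example, the updated comment could read something like the following (a suggested wording, not necessarily the final text):

// GetDevicePluginResourceCapacity returns the node capacity (amount of total
// device plugin resources), node allocatable (amount of total healthy
// resources), and inactive device plugin resources previously registered on
// the node.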
	}
}
for _, dev := range deleted {
	m.allDevices[resourceName].Delete(dev.ID)
	if dev.Health == pluginapi.Healthy {
The dev health state reported here may not be consistent with the cached state. Maybe simply do:
m.healthyDevices[resourceName].Delete(dev.ID)
m.unhealthyDevices[resourceName].Delete(dev.ID)
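For context, the deletion loop would then look roughly like this (a sketch of the suggested change; the surrounding function is omitted):

for _, dev := range deleted {
	m.allDevices[resourceName].Delete(dev.ID)
	// Delete from both sets unconditionally: the health reported in the
	// deletion event may not match the cached state, and deleting a
	// missing element from a sets.String is a no-op anyway.
	m.healthyDevices[resourceName].Delete(dev.ID)
	m.unhealthyDevices[resourceName].Delete(dev.ID)
}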
pkg/kubelet/kubelet_node_status.go
Outdated
@@ -594,13 +596,15 @@ func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
	}
}

-devicePluginCapacity, removedDevicePlugins := kl.containerManager.GetDevicePluginResourceCapacity()
+devicePluginCapacity, allocatable, removedDevicePlugins := kl.containerManager.GetDevicePluginResourceCapacity()
Can we just assign to devicePluginAllocatable here directly?
@@ -453,9 +487,9 @@ func (m *ManagerImpl) readCheckpoint() error {
	m.podDevices.fromCheckpointData(data.PodDeviceEntries)
	m.allocatedDevices = m.podDevices.devices()
	for resource, devices := range data.RegisteredDevices {
-		m.allDevices[resource] = sets.NewString()
+		m.healthyDevices[resource] = sets.NewString()
Could you add a TODO comment about also checkpointing unhealthy device information?
} else {
	capacityCount := capacity[v1.ResourceName(resourceName)]
	unhealthyCount := *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
	capacityCount.Add(unhealthyCount)
What if the resource doesn't exist in capacity yet? Would we get a segfault here? Can we add a unit test for this?
The test is already there: https://github.com/vikaschoudhary16/kubernetes/blob/dc541fa0365de94d1c201baa74bce84d20bd9553/pkg/kubelet/cm/deviceplugin/manager_test.go#L203-L211
Looks like it won't crash. Just take a look here:
https://play.golang.org/p/pi1bI3caJRL
Interesting to know. I guess the map lookup just returns a struct with zero values when the key doesn't exist.
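That matches Go's map semantics: indexing with a missing key yields the element type's zero value, not nil. A tiny standalone illustration, using a stand-in quantity type instead of the real resource.Quantity:

package main

import "fmt"

type quantity struct{ value int64 }

func (q *quantity) Add(other quantity) { q.value += other.value }

func main() {
	capacity := map[string]quantity{}

	// Indexing with a missing key returns the zero value of quantity,
	// so this does not panic.
	q := capacity["vendor.com/gpu"]
	q.Add(quantity{value: 3})
	capacity["vendor.com/gpu"] = q

	fmt.Println(capacity["vendor.com/gpu"].value) // 3
}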
These documents will need to be updated:
It looks like we need to start tracking the API changes we introduce so that we can document them.
Force-pushed c3d1832 to dc541fa
/retest
Update node capacity with sum of both healthy and unhealthy devices. Node allocatable reflects only healthy devices.
Force-pushed dc541fa to e9cf3f1
/retest
1 similar comment
/retest
@jiayingz ping
/lgtm
ping @dchen1107 @derekwaynecarr
/lgtm
Nice job. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: derekwaynecarr, jiayingz, RenaudWasTaken, vikaschoudhary16. Associated issue: #57241. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflects only healthy devices.
What this PR does / why we need it:
Currently, node capacity reflects only healthy devices; unhealthy devices are ignored entirely when updating node status. This PR accounts for unhealthy devices when updating node status.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #57241
Special notes for your reviewer:
Release note:
/cc @tengqm @ConnorDoyle @jiayingz @vishh @jeremyeder @sjenning @resouer @ScorpioCPH @lichuqiang @RenaudWasTaken @balajismaniam
/sig node