Node-Problem-Detector should Patch NodeStatus not Update #9

Closed
saad-ali opened this issue May 26, 2016 · 12 comments

@saad-ali

Problem:
kubelet, node-controller, and now the node-problem-detector all update the Node.Status field.

Normally this is not a problem: when writers race, the stale update fails on a resource version mismatch and everything is OK.

When a new field is added to Node Status, however:

  • Since kubelet and node-controller are in the same repository they are recompiled with the new field and Status updates continue to operate normally.
  • However, since node-problem-detector is in a separate repository and has not been recompiled with the new version, its Update calls end up squashing the new (unknown) field, resetting it to nil.
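
The failure mode, roughly (a minimal sketch using a recent client-go for illustration; the client in use in 2016 differed, and updateNodeCondition and its arguments are placeholders):

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// The whole Node is decoded into the detector's compiled-in Go types and then
// written back in full. Any status field added to the API after the detector
// was built is dropped at decode time and therefore cleared on the server.
func updateNodeCondition(ctx context.Context, cs kubernetes.Interface, nodeName string, cond v1.NodeCondition) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Mutate only the conditions we own...
	node.Status.Conditions = append(node.Status.Conditions, cond)
	// ...but UpdateStatus replaces the entire status object.
	_, err = cs.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}
```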

Suggested Solution:
node-problem-detector should do a patch instead of an update to prevent wiping out fields it is not aware of. Incidentally, kubelet and node-controller should do this as well, but that is not as critical since they live in the same repository and new fields are normally added there first.
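
A minimal sketch of the patch-based alternative (again with a recent client-go for illustration; patchNodeCondition and its arguments are placeholders). Because NodeStatus.Conditions is merged by the "type" key on the server, a strategic merge patch carrying only the detector's own condition adds or updates it without touching any other status field, known or unknown:

```go
import (
	"context"
	"encoding/json"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

func patchNodeCondition(ctx context.Context, cs kubernetes.Interface, nodeName string, cond v1.NodeCondition) error {
	// A strategic merge patch carrying only our condition.
	patch, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []v1.NodeCondition{cond},
		},
	})
	if err != nil {
		return err
	}
	// PatchStatus sends the patch to the nodes/<name>/status subresource.
	_, err = cs.CoreV1().Nodes().PatchStatus(ctx, nodeName, patch)
	return err
}
```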

CC @kubernetes/sig-node @kubernetes/sig-api-machinery @Random-Liu @bgrant0607

Random-Liu added the bug label May 26, 2016
Random-Liu self-assigned this May 26, 2016
@caesarxuchao

You can use the Patch() method in the dynamic client, but you'll need to modify that function to allow specifying a subresource. cc @krousey
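
For reference: in current client-go releases the dynamic client's Patch already accepts subresources, so this modification is no longer needed. A minimal sketch, assuming a recent client-go, with patchBytes being the same strategic merge patch as above:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

func patchNodeStatusDynamic(ctx context.Context, dyn dynamic.Interface, nodeName string, patchBytes []byte) error {
	nodesGVR := schema.GroupVersionResource{Version: "v1", Resource: "nodes"}
	// The trailing "status" argument targets the nodes/<name>/status subresource.
	_, err := dyn.Resource(nodesGVR).Patch(ctx, nodeName,
		types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
	return err
}
```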

@lavalamp

I didn't know you guys were building this in a different repository. If you aren't going to be version-locked like the other components are, then you need some features that aren't built into our client.

Patch and/or dynamic client seems to be your best way forward.

There are various ways we could have made this easier, but they won't happen in time for your first release...

@Random-Liu

@saad-ali Thanks a lot for reporting!

You can use the Patch() method in the dynamic client, but you'll need to modify that function to allow specifying a subresource. cc @krousey

I'll try this!

@lavalamp Thanks a lot! Patch is good enough for now~ Let me try :)

@derekwaynecarr

I think we should also have a node e2e that verifies the Kubelet itself does not regress and drop node conditions it does not manage. I don't think we have anything that verifies that today, and it would help avoid regressions.


@dchen1107

Yes, I talked to @Random-Liu about an e2e test verifying this.

@bgrant0607

@saad-ali Ah, so it was getting clobbered. Great catch! Where multiple components update a resource, they'll need a way to patch, either via PATCH or via a dedicated subresource.

This is already a problem with node controller and Kubelet. I'm surprised we hadn't been bitten by it yet.

We have a similar problem with kubectl round-tripping, which we're working to eliminate.

@lavalamp

@bgrant0607 Yeah. Very surprising this is the first time we've hit this. BTW making kubectl not round-trip wouldn't be sufficient, but @krousey was actually switching everything to use the dynamic client, which will be sufficient to prevent this.

Also, this is the canary in the coal mine: everyone who built automation using our current client will have this problem in 1.3.

@Random-Liu

I think we should also have a node e2e that verifies the Kubelet itself does not regress and drop node conditions it does not manage. I don't think we have anything that verifies that today, and it would help avoid regressions.

Yeah, we are going to add a test for this. :)

@Random-Liu

Random-Liu commented May 26, 2016

FYI, I've tried several ways to patch NodeCondition, but the apiserver keeps returning a 405 error. I'll work with @caesarxuchao to figure out whether and how we can patch status correctly (Thanks! :)).

And I had an offline discussion with @saad-ali; to unblock him we'll:

Because node-problem-detector only touches NodeStatus, and kubernetes/kubernetes#26351 seems to be the only PR that recently added a new field to NodeStatus, this could be a temporary workaround.

In the meantime, I'll change the node-problem-detector to use Patch asap.

@dchen1107 @bgrant0607 @lavalamp Is this OK?

@Random-Liu

Random-Liu commented May 26, 2016

After discussing offline with @dchen1107, we will:

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue May 28, 2016
Automatic merge from submit-queue

Expose GET and PATCH for status subresource

We can do this for other status subresources. I only updated node/status in this PR to unblock kubernetes/node-problem-detector#9.

cc @Random-Liu @lavalamp
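
With PATCH exposed on the status subresource, the detector's request looks roughly like this at the REST level (a sketch assuming a recent client-go; before this change, the same request is what returned 405):

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func patchNodeStatusRaw(ctx context.Context, cs kubernetes.Interface, nodeName string, patchBytes []byte) error {
	// PATCH /api/v1/nodes/<name>/status with a strategic merge patch body.
	return cs.CoreV1().RESTClient().
		Patch(types.StrategicMergePatchType).
		Resource("nodes").
		Name(nodeName).
		SubResource("status").
		Body(patchBytes).
		Do(ctx).
		Error()
}
```
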
@wojtek-t

wojtek-t commented Jun 2, 2016

but @krousey was actually switching everything to use the dynamic client, which will be sufficient to prevent this.

@lavalamp - what do you mean by everything? If you really mean "everything", then we need to support protobufs in it, which we currently don't. @smarterclayton FYI

@lavalamp

lavalamp commented Jun 2, 2016

Yeah, the dynamic client only talks JSON right now, AFAIK.


saad-ali added a commit to saad-ali/kubernetes that referenced this issue Jun 2, 2016
This PR contains Kubelet changes to enable attach/detach controller control.
* It introduces a new "enable-controller-attach-detach" kubelet flag to
  enable control by controller. Default enabled.
* It removes all references to the "SafeToDetach" annotation from the controller.
* It adds the new VolumesInUse field to the Node Status API object.
* It modifies the controller to use VolumesInUse instead of SafeToDetach
  annotation to gate detachment.
* There is a bug in node-problem-detector that causes VolumesInUse to
  get reset every 30 seconds. Issue kubernetes/node-problem-detector#9
  opened to fix that.
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Jun 3, 2016
…Changes

Automatic merge from submit-queue

Attach/Detach Controller Kubelet Changes

This PR contains changes to enable the attach/detach controller proposed in #20262.

Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by attach/detach controller. Default enabled.
* Removes all references to the `SafeToDetach` annotation from the controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
  * There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue kubernetes/node-problem-detector#9 (comment) opened to fix that.
  * There is a bug here in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Makes the controller process detaches before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fixes misc bugs in the controller.
* Modifies the GCE attacher to remove retries, remove the mutex, and not fail if a volume is already attached or already detached.

Fixes #14642, #19953

```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on-behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.

A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" annotation or not.
```
mtaufen pushed a commit to mtaufen/kubernetes that referenced this issue Jun 6, 2016
This PR contains Kubelet changes to enable attach/detach controller control.
* It introduces a new "enable-controller-attach-detach" kubelet flag to
  enable control by controller. Default enabled.
* It removes all references to the "SafeToDetach" annotation from the controller.
* It adds the new VolumesInUse field to the Node Status API object.
* It modifies the controller to use VolumesInUse instead of SafeToDetach
  annotation to gate detachment.
* There is a bug in node-problem-detector that causes VolumesInUse to
  get reset every 30 seconds. Issue kubernetes/node-problem-detector#9
  opened to fix that.