
When running on GCE, prevent Nodes from becoming Ready before GCE routes are created. #26267

Closed
gmarek wants to merge 2 commits

Conversation

@gmarek (Contributor) commented May 25, 2016:

This prevents a Node from becoming Ready before networking is set up when running on the GCE provider. It adds a new NetworkingReady condition instead of writing NodeReady directly, to avoid making the NodeController logic even more complicated.
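For context, a minimal sketch of the gating idea, simplified from the diff fragments quoted further down in this thread. The helper function name is illustrative and not part of the PR:

```go
package kubelet

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/cloudprovider/providers/gce"
)

// nodeReadyGatedOnNetworking is an illustrative helper, not the PR's exact
// code: on GCE the kubelet keeps NodeReady at False until the route
// controller has set the new NetworkingReady condition to True.
func nodeReadyGatedOnNetworking(node *api.Node, providerName string) api.ConditionStatus {
	_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
	if providerName == gce.ProviderName &&
		(networkingCondition == nil || networkingCondition.Status != api.ConditionTrue) {
		// Routes are not confirmed yet, so the node should not report Ready.
		return api.ConditionFalse
	}
	return api.ConditionTrue
}
```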

I'll add tests in another commit.

cc @bgrant0607 for new Condition, @wojtek-t @cjcullen @zmerlynn

@gmarek gmarek added the release-note and do-not-merge labels May 25, 2016
@gmarek gmarek added this to the v1.3 milestone May 25, 2016
@k8s-github-robot k8s-github-robot added the kind/api-change and size/L labels May 25, 2016
@bgrant0607 (Member):

cc @davidopp @mikedanese

@bgrant0607 (Member):

cc @dchen1107 @vishh @derekwaynecarr

@zmerlynn (Member):

I suspect that @dchen1107 or @bgrant0607 are better overall reviewers for this. I've been staring at the routecontroller code enough recently to represent that piece, though. And I know this is going to bounce hard against #26263, which is probably going to go in first - just as a first-blush comment, I think the goroutine there is getting a bit long in the tooth and we probably need to pull that out.

Who wants it? @bgrant0607 first? (I'll keep adding comments on the routecontroller side and in general.)

Message: "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
if (kl.cloud.ProviderName() == gce.ProviderName) && (networkingCondition == nil || networkingCondition.Status != api.ConditionTrue) {
Member:

Is this the best way to determine whether to expect NetworkReady to be set?

It also assumes that the controller manager is upgraded before Kubelet, which not everyone does.

Member:

Is there a way to distinguish the 1.2 route controller vs the 1.3 route controller via the API? Or would it be possible to confirm the route through the local metadata server (abstracted behind the cloudprovider interface)?

Contributor (author):

It's gated on GCE, and I'm not sure how many people do it by hand vs. doing it with kube-push (however it's called right now). I don't have any idea how to check if the RC is at v1.3. I guess the best I can do is to have the Kubelet write "NetworkingReadyCondition=false" on startup. It may lead to races though, which in turn will make us wait 5 minutes for the next RC pass...

@zmerlynn (Member) commented May 25, 2016:

I did consider yesterday whether kubelet could do this, but unfortunately I think it's an API call, which would take compute-ro scope, which we just don't want to require on the nodes.

Member:

@gmarek Where else do you see checks of ProviderName in the codebase?

Contributor (author):

I can replace this with the same check that's done in the controller-manager. The problem is that I don't know how to test it in other environments, and this is clearly a cloud-provider-specific thing.

@k8s-bot commented May 25, 2016:

GCE e2e build/test failed for commit 77fcfdb.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@bgrant0607 bgrant0607 assigned bgrant0607 and unassigned zmerlynn May 25, 2016
@bgrant0607 (Member):

@zmerlynn @gmarek Yes, I'll take a pass.

@@ -1993,6 +1993,8 @@ const (
NodeOutOfDisk NodeConditionType = "OutOfDisk"
// NodeMemoryPressure means the kubelet is under pressure due to insufficient available memory.
NodeMemoryPressure NodeConditionType = "MemoryPressure"
// NodeNetworkingReady means that network for the node is correctly configured.
NodeNetworkingReady NodeConditionType = "NetworkingReady"
Member:

It appears that we are settling on node conditions being expressed with opposite polarity.

see discussion:
#25773 (comment)

Basically, if a condition has State=True, it's considered a problem.

I am not sure if this PR demonstrates that this is a bad decision as it seemed natural to expose it in this fashion.

/cc @vishh

Contributor:

+1. Why not invert the condition to be NetworkingNotReady, set it to true by default when kubelet registers a node, and have the route controller set that condition to false later?
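To make the suggestion concrete, here is a rough sketch of that inverted flow. The condition name, reasons, and helper functions below are hypothetical rather than this PR's code (the eventual follow-up used a differently named condition):

```go
package example

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/unversioned"
)

// Hypothetical inverted-polarity condition: True means there IS a problem.
const NodeNetworkingNotReady api.NodeConditionType = "NetworkingNotReady"

// Kubelet side (sketch): mark networking as not ready when the node is registered.
func markNetworkNotReadyOnRegistration(node *api.Node) {
	node.Status.Conditions = append(node.Status.Conditions, api.NodeCondition{
		Type:               NodeNetworkingNotReady,
		Status:             api.ConditionTrue,
		Reason:             "NoRouteCreated",
		Message:            "Node registered before its cloud route was created",
		LastTransitionTime: unversioned.Now(),
	})
}

// RouteController side (sketch): clear the condition once the route exists.
func clearNetworkNotReadyAfterRouteCreation(node *api.Node) {
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == NodeNetworkingNotReady {
			node.Status.Conditions[i].Status = api.ConditionFalse
			node.Status.Conditions[i].Reason = "RouteCreated"
			node.Status.Conditions[i].LastTransitionTime = unversioned.Now()
		}
	}
}
```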

Contributor (author):

Except for NodeReady, where "True" means that everything is OK. I'm not sure about this convention. @bgrant0607

Member:

+1 for what @vishh wrote

@derekwaynecarr (Member):

If not on GCE, what will this node condition report?

@derekwaynecarr (Member):

Is it a correct understanding that this node condition would always report false when not on GCE?

/cc @smarterclayton

@derekwaynecarr (Member):

Actually, looking at this more, I think this PR is not setting this condition to any value when not on a cloud provider. @kubernetes/rh-networking - please weigh in here if you would find this NetworkReady node condition useful as well when running with our SDN.

@bgrant0607 (Member):

On a physical network, some entity would perform a connectivity check by actually pinging the nodes. I'd like the condition to be able to represent that, also.

Given that connectivity could drop at any moment, races are inevitable.

Also, we wouldn't want to kill pods or to make all nodes unschedulable just because the connectivity checker went down.

Therefore, I think it would be fine to treat the lack of ability to create a network route as a problem, with the opposite polarity, as @derekwaynecarr mentioned. PodsUnreachable or somesuch. That does seem like a useful convention for uncommon properties, for conciseness if nothing else.

Since the scheduler is generally updated at the same time as the controller manager, it would be easier to interpret the condition there instead of in Kubelet.

However, let's not treat all conditions other than Ready as implying a node should be unschedulable.

Conditions are informational, with reasons, messages, transition times, etc. Different conditions may be interpreted differently in the future -- some may block scheduling, others may kill running pods, etc.

As soon as we complete the taint/toleration/forgiveness implementation, I want all forms of unschedulability to be explicitly represented using taints.

Reason: "KubeletReady",
Message: "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
Contributor:

Can the new condition exist independent of NodeReady condition?
Is the end goal to not schedule any pods on a node with network issues?

Contributor (author):

That, and to make 'validate-cluster' wait for all Routes to be set.

@danwinship (Contributor):

@kubernetes/rh-networking - please weigh in here if you would find this NetworkReady node condition useful as well when running with our SDN.

Sure, having pods not get scheduled on a node until the node's network was fully set up would be useful to us. (Currently we work around this by making SetUpPod() check if the network is ready, and if not, it blocks until it is.)
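Roughly, that workaround looks like the following sketch; the plugin struct, readiness channel, and timeout are illustrative, not the actual SDN plugin code:

```go
package sdn

import (
	"fmt"
	"time"
)

// plugin is a stand-in for an SDN network plugin; networkReady is closed once
// node-level networking has been configured.
type plugin struct {
	networkReady chan struct{}
}

// SetUpPod blocks until node networking is ready (or a timeout expires)
// before doing the real per-pod setup, so pods scheduled too early simply
// wait instead of failing.
func (p *plugin) SetUpPod(namespace, podName string) error {
	select {
	case <-p.networkReady:
	case <-time.After(3 * time.Minute):
		return fmt.Errorf("timed out waiting for node network before setting up pod %s/%s", namespace, podName)
	}
	// ... real pod network setup would go here ...
	return nil
}
```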

@dcbw (Member) commented May 25, 2016:

Kubelet network plugins also just got a Status() call, which is used from Kubelet.syncNetworkStatus(). The plugin can return an error until it's ready to go. It's not the best, but perhaps it could be used here or extended by this PR?
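For illustration, a plugin could surface "not ready yet" through that Status() hook along these lines; the plugin type and readiness flag are made up for the example:

```go
package example

import (
	"errors"
	"sync/atomic"
)

// myPlugin is a made-up network plugin; ready is flipped to 1 once node
// networking has been configured.
type myPlugin struct {
	ready int32
}

// Status reports an error until the plugin is ready, so the kubelet's
// syncNetworkStatus sees the node's networking as unhealthy until then.
func (p *myPlugin) Status() error {
	if atomic.LoadInt32(&p.ready) == 0 {
		return errors.New("network plugin is not ready: node networking has not been configured yet")
	}
	return nil
}
```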

@gmarek (author) commented May 26, 2016:

@bgrant0607 OK, I see why we may not want this Condition to mean NotReady. We may have Nodes that we never need to connect to (e.g. some workers that download data from somewhere and upload results somewhere else) - the Scheduler might put such Pods on Nodes that don't have Pod networking working.

So it seems that the current consensus is that this Condition should prevent the Scheduler from putting any Pods on a given Node. In such a scenario, can we assume that both the Scheduler and the CM are updated at the same time and not care about the skew?

I also think that we should update the 'validate-cluster' function to look at this condition. My observation from running a cluster with this change is that in an 'ordinary' 3-Node e2e cluster it takes roughly 2x longer to get all Nodes Ready. This means that currently we're possibly starting tests before Routes are created, which may be a cause of some of the flakiness. It's just a feeling though, so maybe it's not worth it. @lavalamp

@k8s-github-robot k8s-github-robot added the needs-rebase label May 26, 2016
@k8s-github-robot:

@gmarek PR needs rebase

@bgrant0607 (Member):

@gmarek Skew would be no worse than current behavior.

@lavalamp (Member):

2x what amount of time? Waiting for routes to be set up before e.g. running the proxy tests sounds like a SUPER GOOD IDEA, and I'm willing to pay for it with some start up time.


@wojtek-t (Member):

2x what amount of time? Waiting for routes to be set up before e.g. running the proxy tests sounds like a SUPER GOOD IDEA, and I'm willing to pay for it with some start up time.

@lavalamp - currently, creating routes is not blocking anything, so it can potentially happen that we start running tests before routes are created. This is super visible in large clusters, where creating routes takes a significant amount of time (~30m-35m in a 1000-node cluster).

And yes - I definitely think that if we want to run some networking-related tests, we really NEED ROUTES. And we should pay this cost (also note that for small clusters (like 5 nodes) the overhead shouldn't be big - something like an additional 1m doesn't sound terrible and can potentially reduce flakiness).

To be honest, I think that some part of our networking-related flakes may really be related to that. @thockin @bprashanth ^^

@wojtek-t (Member):

@gmarek

My observation from running a cluster with this change is that in an 'ordinary' 3-Node e2e cluster it takes roughly 2x longer to get all Nodes Ready. This means that currently we're possibly starting tests before Routes are created, which may be a cause of some of the flakiness. It's just a feeling though, so maybe it's not worth it. @lavalamp

For small clusters I really think this is solvable. The RouteController is not very sophisticated now and we should just improve it (I already added a TODO in one of my PRs recently to fix that). I'm happy to do that soon too.

@wojtek-t (Member) commented May 27, 2016:

And I agree with @bgrant0607 that we should probably change the scheduler to look at this instead of changing the kubelet.
However, I think that we can be slightly more sophisticated in the scheduler. What do you think about disallowing scheduling a pod onto a "network-not-ready" node only if the pod exposes some port?
@davidopp - thoughts?

--update--
or maybe it doesn't make sense, because this pod may actually want to talk to other pods...

func tryUpdateNodeStatus(node *api.Node, kubeClient clientset.Interface) error {
for i := 0; i < updateNodeStatusMaxRetries; i++ {
if _, err := kubeClient.Core().Nodes().UpdateStatus(node); err == nil {
break
Member:

return nil instead
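In other words, something along these lines - a sketch of the suggested change that relies on the same imports and constants as the surrounding file; the final error message and handling here are assumptions:

```go
// Sketch: return as soon as the status update succeeds instead of breaking
// out of the loop and falling through.
func tryUpdateNodeStatus(node *api.Node, kubeClient clientset.Interface) error {
	var err error
	for i := 0; i < updateNodeStatusMaxRetries; i++ {
		if _, err = kubeClient.Core().Nodes().UpdateStatus(node); err == nil {
			return nil
		}
	}
	return fmt.Errorf("update node status exceeded retry count: %v", err)
}
```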

@wojtek-t (Member):

Since this is pretty urgent and @gmarek is OOO, I'm picking this up. I will apply the comments and send out a new PR for it.

@wojtek-t (Member):

OK - so I pushed the new version of the PR to:
#26415

All comments should be applied there.

@wojtek-t (Member):

#26415 is already lgtm-ed. So closing this one.

@wojtek-t wojtek-t closed this May 28, 2016
k8s-github-robot pushed a commit that referenced this pull request May 29, 2016
Automatic merge from submit-queue

Add a NodeCondition "NetworkUnavailable" to prevent scheduling onto a node until the routes have been created

This is a new version of #26267 (based on top of that one).

The new workflow is:
- we have a "NetworkNotReady" condition
- the Kubelet, when it creates a node, sets it to "true"
- the RouteController sets it to "false" when the route is created
- the Scheduler schedules only onto nodes that don't have the "NetworkNotReady == true" condition

@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
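A simplified sketch of the scheduler-side check described in the workflow above; the function name and wiring are illustrative and the merged code (and final condition name) may differ:

```go
package example

import "k8s.io/kubernetes/pkg/api"

// nodeNetworkReady is an illustrative scheduler-side check: a node is only
// considered schedulable if it does not carry a true "NetworkNotReady"
// condition, i.e. the route controller has already created its route.
func nodeNetworkReady(node *api.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == "NetworkNotReady" && cond.Status == api.ConditionTrue {
			return false
		}
	}
	return true
}
```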
@gmarek gmarek deleted the routecontroller branch August 30, 2016 09:47