When running on GCE, prevent Nodes from becoming Ready before GCE routes are created. #26267
Conversation
I suspect that @dchen1107 or @bgrant0607 are better overall reviewers for this. I've been staring at the routecontroller code enough recently to represent that piece, though. And I know this is going to bounce hard against #26263, which is probably going to go in first - just as a first blush comment I think the goroutine there is getting a bit long in the tooth and we probably need to pull that out. Who wants it? @bgrant0607 first? (I'll keep adding comments on the routecontroller side and in general.)
Message:           "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
if (kl.cloud.ProviderName() == gce.ProviderName) && (networkingCondition == nil || networkingCondition.Status != api.ConditionTrue) {
Is this the best way to determine whether to expect NetworkReady to be set?
if s.AllocateNodeCIDRs {
It also assumes that the controller manager is upgraded before Kubelet, which not everyone does.
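To make the two gates concrete, here is a minimal self-contained sketch of the decision under discussion - expect the networking condition only on GCE (this PR's check) and hold readiness until the route controller has set it. The types and names below are simplified stand-ins, not the actual kubelet or controller-manager code:

```go
package main

import "fmt"

// Simplified stand-ins for the API types under discussion (the real ones live
// in pkg/api); this is a sketch of the decision, not the PR's code.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type NodeCondition struct {
	Type   string
	Status ConditionStatus
}

// nodeReady mirrors the gate quoted above: on GCE (where the route controller
// will create routes), hold NodeReady at false until the NetworkingReady
// condition has been set to true by the route controller.
func nodeReady(providerName string, conditions []NodeCondition) bool {
	if providerName != "gce" {
		return true // other providers: no networking condition expected
	}
	for _, c := range conditions {
		if c.Type == "NetworkingReady" {
			return c.Status == ConditionTrue
		}
	}
	return false // condition not written yet, so stay not-ready
}

func main() {
	fmt.Println(nodeReady("gce", nil))                                                 // false
	fmt.Println(nodeReady("gce", []NodeCondition{{"NetworkingReady", ConditionTrue}})) // true
	fmt.Println(nodeReady("aws", nil))                                                 // true
}
```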
Is there a way to distinguish the 1.2 route controller vs the 1.3 route controller via the API? Or would it be possible to confirm the route through the local metadata server (abstracted behind the cloudprovider interface)?
It's gated on GCE and I'm not sure how many people do it by hand vs. doing it by kube-push (whatever it's called right now). I don't have any idea how to check if the RC is v1.3. I guess the best I can do is to have Kubelet write "NetworkingReadyCondition=false" on startup. It may lead to races though, which in turn will make us wait 5 minutes for the next RC pass...
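A rough sketch of that fallback - the kubelet seeding the condition as false at registration so a new-enough route controller can flip it later. The helper name and fields here are illustrative, not the actual kubelet code:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative, simplified condition type; the real one lives in pkg/api.
type NodeCondition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

// seedNetworkingCondition is a hypothetical helper: on registration the kubelet
// writes NetworkingReady=False, and a 1.3 route controller later flips it to
// True. The race mentioned above is that a 1.2 route controller never would,
// leaving the node not-ready.
func seedNetworkingCondition(conditions []NodeCondition) []NodeCondition {
	for _, c := range conditions {
		if c.Type == "NetworkingReady" {
			return conditions // already written, e.g. by the route controller
		}
	}
	return append(conditions, NodeCondition{
		Type:               "NetworkingReady",
		Status:             "False",
		Reason:             "NoRouteCreated",
		LastTransitionTime: time.Now(),
	})
}

func main() {
	conds := seedNetworkingCondition(nil)
	fmt.Printf("%+v\n", conds)
}
```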
I did consider yesterday whether kubelet could do this, but unfortunately I think it's an API call, which would take compute-ro scope, which we just don't want to require on the nodes.
@gmarek Where else do you see checks of ProviderName in the codebase?
I can replace this with the same check that is done in the controller-manager. The problem is that I don't know how to test it in other environments, and this is clearly a cloud-provider-specific thing.
GCE e2e build/test failed for commit 77fcfdb. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.
@@ -1993,6 +1993,8 @@ const (
	NodeOutOfDisk NodeConditionType = "OutOfDisk"
	// NodeMemoryPressure means the kubelet is under pressure due to insufficient available memory.
	NodeMemoryPressure NodeConditionType = "MemoryPressure"
	// NodeNetworkingReady means that network for the node is correctly configured.
	NodeNetworkingReady NodeConditionType = "NetworkingReady"
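For context on how the new constant would be consumed, a hedged sketch of the producer side - the route controller setting NetworkingReady to true once a node's route exists (simplified types; not the actual routecontroller code):

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for the API condition type; the constant added in this
// diff is api.NodeNetworkingReady ("NetworkingReady").
type NodeCondition struct {
	Type              string
	Status            string
	Reason            string
	LastHeartbeatTime time.Time
}

// markRouteCreated sketches the route-controller side: once the cloud route for
// a node exists, set (or update) the NetworkingReady condition to True. This is
// an illustration, not the routecontroller's actual code.
func markRouteCreated(conditions []NodeCondition) []NodeCondition {
	now := time.Now()
	for i := range conditions {
		if conditions[i].Type == "NetworkingReady" {
			conditions[i].Status = "True"
			conditions[i].Reason = "RouteCreated"
			conditions[i].LastHeartbeatTime = now
			return conditions
		}
	}
	return append(conditions, NodeCondition{
		Type:              "NetworkingReady",
		Status:            "True",
		Reason:            "RouteCreated",
		LastHeartbeatTime: now,
	})
}

func main() {
	conds := markRouteCreated([]NodeCondition{{Type: "NetworkingReady", Status: "False"}})
	fmt.Println(conds[0].Status) // "True"
}
```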
It appears that we are settling on node conditions being expressed with opposite polarity.
see discussion:
#25773 (comment)
Basically, if a condition has State=True, it's considered a problem.
I am not sure if this PR demonstrates that this is a bad decision as it seemed natural to expose it in this fashion.
/cc @vishh
+1. Why not invert the condition to be NetworkingNotReady, set it to true by default when kubelet registers a node, and have the route controller set that condition to false later?
Except for NodeReady, where "True" means that everything is OK. I'm not sure about this convention. @bgrant0607
+1 for what @vishh wrote
If not on GCE, what will this node condition report?
Is it a correct understanding that this node condition would always report false when not on GCE? /cc @smarterclayton
Actually, looking at this more, I think this PR is not setting this condition to any value when not on a cloud provider. @kubernetes/rh-networking - please weigh in here if you would find this NetworkReady node condition useful as well when running with our SDN.
On a physical network, some entity would perform a connectivity check by actually pinging the nodes. I'd like the condition to be able to represent that, also. Given that connectivity could drop at any moment, races are inevitable. Also, we wouldn't want to kill pods or to make all nodes unschedulable just because the connectivity checker went down.

Therefore, I think it would be fine to treat the lack of ability to create a network route as a problem, with the opposite polarity, as @derekwaynecarr mentioned. PodsUnreachable or somesuch. That does seem like a useful convention for uncommon properties, for conciseness if nothing else.

Since the scheduler is generally updated at the same time as the controller manager, it would be easier to interpret the condition there instead of in Kubelet. However, let's not treat all conditions other than Ready as implying a node should be unschedulable. Conditions are informational, with reasons, messages, transition times, etc. Different conditions may be interpreted differently in the future -- some may block scheduling, others may kill running pods, etc.

As soon as we complete the taint/toleration/forgiveness implementation, I want all forms of unschedulability to be explicitly represented using taints.
Reason:            "KubeletReady",
Message:           "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
Can the new condition exist independent of the NodeReady condition? Is the end goal to not schedule any pods on a node with network issues?
That and make the 'verify-cluster' wait for all Routes to be set.
Sure, having pods not get scheduled on a node until the node's network was fully set up would be useful to us. (Currently we work around this by making SetUpPod() check if the network is ready, and if not, it blocks until it is.)
Kubelet network plugins also just got a Status() call, which is used from Kubelet.syncNetworkStatus(). The plugin can return an error until it's ready to go. It's not the best, but perhaps it could be used here or extended by this PR?
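To illustrate how that hook could be leveraged, a small self-contained sketch of a plugin whose Status() stays in error until the network is ready. Only the Status() method is taken from the comment above; the rest is a stand-in, not the kubelet's NetworkPlugin implementation:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Illustrative plugin that reports not-ready until its pod network is set up.
type sketchPlugin struct {
	mu    sync.Mutex
	ready bool
}

// Status returns an error while the pod network is still being configured,
// which the kubelet could surface as a not-ready node.
func (p *sketchPlugin) Status() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.ready {
		return errors.New("pod network not configured yet")
	}
	return nil
}

// markReady would be called by the plugin once routes/CIDRs are in place.
func (p *sketchPlugin) markReady() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.ready = true
}

func main() {
	p := &sketchPlugin{}
	fmt.Println(p.Status()) // error: still configuring
	p.markReady()
	fmt.Println(p.Status()) // <nil>
}
```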
@bgrant0607 OK, I see why we may not want to make this Condition mean NotReady. We may have nodes that we never need to connect to (e.g. some workers that download data from somewhere and upload results somewhere else) - the Scheduler might put such Pods on Nodes that don't have Pod networking working.

So it seems that the current consensus is that this Condition should prevent the Scheduler from putting any Pods on a given Node. In that scenario, can we assume that both the Scheduler and CM would be updated at the same time and not care about the skew? I also think that we should update the 'validate-cluster' function to look at such a condition.

My observation from running a cluster with this change is that in an 'ordinary' 3-Node e2e cluster it takes roughly 2x more time to get all Nodes Ready. This means that currently we're possibly starting tests before Routes are created, which may be a cause of some of the flakiness. It's just a feeling though, so maybe it's not worth it. @lavalamp
@gmarek PR needs rebase
@gmarek Skew would be no worse than current behavior.
2x what amount of time? Waiting for routes to be set up before e.g. running …
@lavalamp - currently, creating routes is not blocking anything, so it can potentially happen that we start running tests before routes are created. This is super visible in large clusters, where creating routes takes a significant amount of time (~30m-35m in a 1000-node cluster). And yes - I definitely think that if we want to run some networking-related tests, we really NEED ROUTES. And we should pay this cost (also note that for small clusters (like 5 nodes) the overhead shouldn't be big - something like an additional 1m doesn't sound terrible and can potentially reduce flakiness). To be honest, I think that some part of our networking-related flakes may really be related to that. @thockin @bprashanth ^^
For small clusters I really think this is solvable. The RouteController is not very sophisticated now and we should just improve it (I already added a TODO in one of my PRs recently to fix that). I'm happy to do that soon too.
And I agree with @bgrant0607 that we should probably change the scheduler to look into it instead of changing the kubelet. --update--
func tryUpdateNodeStatus(node *api.Node, kubeClient clientset.Interface) error {
	for i := 0; i < updateNodeStatusMaxRetries; i++ {
		if _, err := kubeClient.Core().Nodes().UpdateStatus(node); err == nil {
			break
return nil instead
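Applying that suggestion, a self-contained sketch of the helper's shape - return nil as soon as one attempt succeeds rather than breaking out of the loop. The API client is abstracted into a plain function argument here; the real code calls kubeClient.Core().Nodes().UpdateStatus(node):

```go
package main

import (
	"errors"
	"fmt"
)

const updateNodeStatusMaxRetries = 3

// tryUpdateNodeStatus retries the status update and, per the review comment,
// returns nil as soon as one attempt succeeds.
func tryUpdateNodeStatus(update func() error) error {
	var lastErr error
	for i := 0; i < updateNodeStatusMaxRetries; i++ {
		if lastErr = update(); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("update node status exceeded retry count: %v", lastErr)
}

func main() {
	attempts := 0
	err := tryUpdateNodeStatus(func() error {
		attempts++
		if attempts < 2 {
			return errors.New("conflict") // first attempt fails
		}
		return nil
	})
	fmt.Println(attempts, err) // 2 <nil>
}
```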
Since this is pretty urgent and @gmarek is OOO, I'm picking this up. I will apply the comments and send out a new PR for it.
OK - so I pushed the new version of the PR; all comments should be applied there.
#26415 is already lgtm-ed. So closing this one.
Automatic merge from submit-queue

Add a NodeCondition "NetworkUnavailable" to prevent scheduling onto a node until the routes have been created

This is a new version of #26267 (based on top of that one). The new workflow is:
- we have a "NetworkNotReady" condition
- Kubelet, when it creates a node, sets it to "true"
- RouteController will set it to "false" when the route is created
- Scheduler schedules only on nodes that don't have the "NetworkNotReady == true" condition

@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
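As a rough sketch of the scheduler side of that workflow (simplified types; not the scheduler's actual predicate code), a node is only considered schedulable once the route controller has cleared the condition:

```go
package main

import "fmt"

// Simplified stand-ins for the API types involved in the merged workflow.
type NodeCondition struct {
	Type   string
	Status string
}

type Node struct {
	Name       string
	Conditions []NodeCondition
}

// networkAvailable mirrors the workflow described above: kubelet seeds
// NetworkUnavailable=true, the route controller flips it to false once the
// route exists, and the scheduler skips nodes where it is still true.
func networkAvailable(n Node) bool {
	for _, c := range n.Conditions {
		if c.Type == "NetworkUnavailable" && c.Status == "True" {
			return false
		}
	}
	return true
}

func main() {
	fresh := Node{Name: "n1", Conditions: []NodeCondition{{"NetworkUnavailable", "True"}}}
	routed := Node{Name: "n2", Conditions: []NodeCondition{{"NetworkUnavailable", "False"}}}
	fmt.Println(networkAvailable(fresh))  // false: no route yet, skip when scheduling
	fmt.Println(networkAvailable(routed)) // true: route created, schedulable
}
```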
This prevents a Node from becoming Ready before networking is set up when on the GCE provider. It adds a new NetworkingReady condition instead of writing NodeReady directly, to avoid making the NodeController logic even more complicated. I'll add tests in another commit.

cc @bgrant0607 for the new Condition, @wojtek-t @cjcullen @zmerlynn