
When running on GCE, prevent Nodes from becoming Ready before GCE routes are created. #26267

Closed
gmarek wants to merge 2 commits

Conversation

@gmarek (Contributor) commented May 25, 2016:

This prevents a Node from becoming Ready before networking is set up when running on the GCE provider. It adds a new NetworkingReady condition instead of writing NodeReady directly, to avoid making the NodeController logic even more complicated.
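For context, a minimal sketch of the gating idea, simplified from the diff fragments quoted further down in this thread. The helper function name is illustrative and not part of the PR:

```go
package kubelet

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/cloudprovider/providers/gce"
)

// nodeReadyGatedOnNetworking is an illustrative helper, not the PR's exact
// code: on GCE the kubelet keeps NodeReady at False until the route
// controller has set the new NetworkingReady condition to True.
func nodeReadyGatedOnNetworking(node *api.Node, providerName string) api.ConditionStatus {
	_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
	if providerName == gce.ProviderName &&
		(networkingCondition == nil || networkingCondition.Status != api.ConditionTrue) {
		// Routes are not confirmed yet, so the node should not report Ready.
		return api.ConditionFalse
	}
	return api.ConditionTrue
}
```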

I'll add tests in another commit.

cc @bgrant0607 for new Condition, @wojtek-t @cjcullen @zmerlynn

@gmarek gmarek added the release-note and do-not-merge labels May 25, 2016
@gmarek gmarek added this to the v1.3 milestone May 25, 2016
@k8s-github-robot k8s-github-robot added the kind/api-change and size/L labels May 25, 2016
@bgrant0607 (Member):

cc @davidopp @mikedanese

@bgrant0607 (Member):

cc @dchen1107 @vishh @derekwaynecarr

@zmerlynn (Member):

I suspect that @dchen1107 or @bgrant0607 are better overall reviewers for this. I've been staring at the routecontroller code enough recently to represent that piece, though. And I know this is going to bounce hard against #26263, which is probably going to go in first - just as a first-blush comment, I think the goroutine there is getting a bit long in the tooth and we probably need to pull that out.

Who wants it? @bgrant0607 first? (I'll keep adding comments on the routecontroller side and in general.)

Message: "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
if (kl.cloud.ProviderName() == gce.ProviderName) && (networkingCondition == nil || networkingCondition.Status != api.ConditionTrue) {
Member:

Is this the best way to determine whether to expect NetworkReady to be set?

It also assumes that the controller manager is upgraded before Kubelet, which not everyone does.

Member:

Is there a way to distinguish the 1.2 route controller vs the 1.3 route controller via the API? Or would it be possible to confirm the route through the local metadata server (abstracted behind the cloudprovider interface)?

Contributor (author):

It's gated on GCE, and I'm not sure how many people do it by hand vs. doing it with kube-push (however it's called right now). I don't have any idea how to check if the RC is at v1.3. I guess the best I can do is to have the Kubelet write "NetworkingReadyCondition=false" on startup. It may lead to races though, which in turn will make us wait 5 minutes for the next RC pass...

@zmerlynn (Member) commented May 25, 2016:

I did consider yesterday whether kubelet could do this, but unfortunately I think it's an API call, which would take compute-ro scope, which we just don't want to require on the nodes.

Member:

@gmarek Where else do you see checks of ProviderName in the codebase?

Contributor (author):

I can replace this with the same check that's done in the controller-manager. The problem is that I don't know how to test it in other environments, and this is clearly a cloud-provider-specific thing.

@k8s-bot commented May 25, 2016:

GCE e2e build/test failed for commit 77fcfdb.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@bgrant0607 bgrant0607 assigned bgrant0607 and unassigned zmerlynn May 25, 2016
@bgrant0607 (Member):

@zmerlynn @gmarek Yes, I'll take a pass.

@@ -1993,6 +1993,8 @@ const (
NodeOutOfDisk NodeConditionType = "OutOfDisk"
// NodeMemoryPressure means the kubelet is under pressure due to insufficient available memory.
NodeMemoryPressure NodeConditionType = "MemoryPressure"
// NodeNetworkingReady means that network for the node is correctly configured.
NodeNetworkingReady NodeConditionType = "NetworkingReady"
Member:

It appears that we are settling on node conditions being expressed with opposite polarity.

see discussion:
#25773 (comment)

Basically, if a condition has State=True, it's considered a problem.

I am not sure if this PR demonstrates that this is a bad decision as it seemed natural to expose it in this fashion.

/cc @vishh

Contributor:

+1. Why not invert the condition to be NetworkingNotReady, set it to true by default when kubelet registers a node, and have the route controller set that condition to false later?
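To make the suggestion concrete, here is a rough sketch of that inverted flow. The condition name, reasons, and helper functions below are hypothetical rather than this PR's code (the eventual follow-up used a differently named condition):

```go
package example

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/unversioned"
)

// Hypothetical inverted-polarity condition: True means there IS a problem.
const NodeNetworkingNotReady api.NodeConditionType = "NetworkingNotReady"

// Kubelet side (sketch): mark networking as not ready when the node is registered.
func markNetworkNotReadyOnRegistration(node *api.Node) {
	node.Status.Conditions = append(node.Status.Conditions, api.NodeCondition{
		Type:               NodeNetworkingNotReady,
		Status:             api.ConditionTrue,
		Reason:             "NoRouteCreated",
		Message:            "Node registered before its cloud route was created",
		LastTransitionTime: unversioned.Now(),
	})
}

// RouteController side (sketch): clear the condition once the route exists.
func clearNetworkNotReadyAfterRouteCreation(node *api.Node) {
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == NodeNetworkingNotReady {
			node.Status.Conditions[i].Status = api.ConditionFalse
			node.Status.Conditions[i].Reason = "RouteCreated"
			node.Status.Conditions[i].LastTransitionTime = unversioned.Now()
		}
	}
}
```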

Contributor (author):

Except for NodeReady, where "True" means that everything is OK. I'm not sure about this convention. @bgrant0607

Member:

+1 for what @vishh wrote

@derekwaynecarr (Member):

If not on GCE, what will this node condition report?

@derekwaynecarr (Member):

Is it a correct understanding that this node condition would always report false when not on GCE?

/cc @smarterclayton

@derekwaynecarr (Member):

Actually, looking at this more, I think this PR is not setting this condition to any value when not on a cloud provider. @kubernetes/rh-networking - please weigh in here if you would find this NetworkReady node condition useful as well when running with our SDN.

@bgrant0607 (Member):

On a physical network, some entity would perform a connectivity check by actually pinging the nodes. I'd like the condition to be able to represent that, also.

Given that connectivity could drop at any moment, races are inevitable.

Also, we wouldn't want to kill pods or to make all nodes unschedulable just because the connectivity checker went down.

Therefore, I think it would be fine to treat the lack of ability to create a network route as a problem, with the opposite polarity, as @derekwaynecarr mentioned. PodsUnreachable or somesuch. That does seem like a useful convention for uncommon properties, for conciseness if nothing else.

Since the scheduler is generally updated at the same time as the controller manager, it would be easier to interpret the condition there instead of in Kubelet.

However, let's not treat all conditions other than Ready as implying a node should be unschedulable.

Conditions are informational, with reasons, messages, transition times, etc. Different conditions may be interpreted differently in the future -- some may block scheduling, others may kill running pods, etc.

As soon as we complete the taint/toleration/forgiveness implementation, I want all forms of unschedulability to be explicitly represented using taints.

Reason: "KubeletReady",
Message: "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
_, networkingCondition := api.GetNodeCondition(&node.Status, api.NodeNetworkingReady)
Contributor:

Can the new condition exist independent of NodeReady condition?
Is the end goal to not schedule any pods on a node with network issues?

Contributor (author):

That, and to make 'validate-cluster' wait for all Routes to be set.

@danwinship (Contributor):

@kubernetes/rh-networking - please weigh in here if you would find this NetworkReady node condition useful as well when running with our SDN.

Sure, having pods not get scheduled on a node until the node's network was fully set up would be useful to us. (Currently we work around this by making SetUpPod() check if the network is ready, and if not, it blocks until it is.)
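Roughly, that workaround looks like the following sketch; the plugin struct, readiness channel, and timeout are illustrative, not the actual SDN plugin code:

```go
package sdn

import (
	"fmt"
	"time"
)

// plugin is a stand-in for an SDN network plugin; networkReady is closed once
// node-level networking has been configured.
type plugin struct {
	networkReady chan struct{}
}

// SetUpPod blocks until node networking is ready (or a timeout expires)
// before doing the real per-pod setup, so pods scheduled too early simply
// wait instead of failing.
func (p *plugin) SetUpPod(namespace, podName string) error {
	select {
	case <-p.networkReady:
	case <-time.After(3 * time.Minute):
		return fmt.Errorf("timed out waiting for node network before setting up pod %s/%s", namespace, podName)
	}
	// ... real pod network setup would go here ...
	return nil
}
```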

@dcbw (Member) commented May 25, 2016:

Kubelet network plugins also just got a Status() call, which is used from Kubelet.syncNetworkStatus(). The plugin can return an error until it's ready to go. It's not the best, but perhaps it could be used here or extended by this PR?
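For illustration, a plugin could surface "not ready yet" through that Status() hook along these lines; the plugin type and readiness flag are made up for the example:

```go
package example

import (
	"errors"
	"sync/atomic"
)

// myPlugin is a made-up network plugin; ready is flipped to 1 once node
// networking has been configured.
type myPlugin struct {
	ready int32
}

// Status reports an error until the plugin is ready, so the kubelet's
// syncNetworkStatus sees the node's networking as unhealthy until then.
func (p *myPlugin) Status() error {
	if atomic.LoadInt32(&p.ready) == 0 {
		return errors.New("network plugin is not ready: node networking has not been configured yet")
	}
	return nil
}
```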

@gmarek (author) commented May 26, 2016:

@bgrant0607 OK, I see why we may not want this Condition to mean NotReady. We may have Nodes that we never need to connect to (e.g. some workers that download data from somewhere and upload results somewhere else) - the Scheduler might put such Pods on Nodes that don't have Pod networking working.

So it seems that the current consensus is that this Condition should prevent the Scheduler from putting any Pods on a given Node. In such a scenario, can we assume that both the Scheduler and the CM are updated at the same time and not care about the skew?

I also think that we should update the 'validate-cluster' function to look at this condition. My observation from running a cluster with this change is that in an 'ordinary' 3-Node e2e cluster it takes roughly 2x longer to get all Nodes Ready. This means that currently we're possibly starting tests before Routes are created, which may be a cause of some of the flakiness. It's just a feeling though, so maybe it's not worth it. @lavalamp

@k8s-github-robot k8s-github-robot added the needs-rebase label May 26, 2016
@k8s-github-robot:

@gmarek PR needs rebase

@bgrant0607 (Member):

@gmarek Skew would be no worse than current behavior.

@lavalamp (Member):

2x what amount of time? Waiting for routes to be set up before e.g. running the proxy tests sounds like a SUPER GOOD IDEA, and I'm willing to pay for it with some start up time.


@wojtek-t (Member):

2x what amount of time? Waiting for routes to be set up before e.g. running the proxy tests sounds like a SUPER GOOD IDEA, and I'm willing to pay for it with some start up time.

@lavalamp - currently, creating routes is not blocking anything, so it can potentially happen that we start running tests before routes are created. This is super visible in large clusters, where creating routes takes a significant amount of time (~30m-35m in a 1000-node cluster).

And yes - I definitely think that if we want to run some networking-related tests, we really NEED ROUTES. And we should pay this cost (also note that for small clusters (like 5 nodes) the overhead shouldn't be big - something like an additional 1m doesn't sound terrible and can potentially reduce flakiness).

To be honest, I think that some part of our networking-related flakes may really be related to that. @thockin @bprashanth ^^

@wojtek-t (Member):

@gmarek

My observation from running a cluster with this change is that in an 'ordinary' 3-Node e2e cluster it takes roughly 2x longer to get all Nodes Ready. This means that currently we're possibly starting tests before Routes are created, which may be a cause of some of the flakiness. It's just a feeling though, so maybe it's not worth it. @lavalamp

For small clusters I really think this is solvable. The RouteController is not very sophisticated now and we should just improve it (I already added a TODO in one of my PRs recently to fix that). I'm happy to do that soon too.

@wojtek-t (Member) commented May 27, 2016:

And I agree with @bgrant0607 that we should probably change the scheduler to look at this instead of changing the kubelet.
However, I think that we can be slightly more sophisticated in the scheduler. What do you think about disallowing scheduling a pod onto a "network-not-ready" node only if the pod exposes some port?
@davidopp - thoughts?

--update--
or maybe it doesn't make sense, because this pod may actually want to talk to other pods...

func tryUpdateNodeStatus(node *api.Node, kubeClient clientset.Interface) error {
for i := 0; i < updateNodeStatusMaxRetries; i++ {
if _, err := kubeClient.Core().Nodes().UpdateStatus(node); err == nil {
break
Member:

return nil instead
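In other words, something along these lines - a sketch of the suggested change that relies on the same imports and constants as the surrounding file; the final error message and handling here are assumptions:

```go
// Sketch: return as soon as the status update succeeds instead of breaking
// out of the loop and falling through.
func tryUpdateNodeStatus(node *api.Node, kubeClient clientset.Interface) error {
	var err error
	for i := 0; i < updateNodeStatusMaxRetries; i++ {
		if _, err = kubeClient.Core().Nodes().UpdateStatus(node); err == nil {
			return nil
		}
	}
	return fmt.Errorf("update node status exceeded retry count: %v", err)
}
```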

@wojtek-t (Member):

Since this is pretty urgent and @gmarek is OOO, I'm picking this up. I will apply the comments and send out a new PR for it.

@wojtek-t (Member):

OK - so I pushed the new version of the PR to:
#26415

All comments should be applied there.

@wojtek-t (Member):

#26415 is already lgtm-ed. So closing this one.

@wojtek-t wojtek-t closed this May 28, 2016
k8s-github-robot pushed a commit that referenced this pull request May 29, 2016
Automatic merge from submit-queue

Add a NodeCondition "NetworkUnavailable" to prevent scheduling onto a node until the routes have been created

This is a new version of #26267 (based on top of that one).

The new workflow is:
- we have a "NetworkNotReady" condition
- the Kubelet, when it creates a node, sets it to "true"
- the RouteController sets it to "false" when the route is created
- the Scheduler schedules only onto nodes that don't have the "NetworkNotReady == true" condition

@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
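A simplified sketch of the scheduler-side check described in the workflow above; the function name and wiring are illustrative and the merged code (and final condition name) may differ:

```go
package example

import "k8s.io/kubernetes/pkg/api"

// nodeNetworkReady is an illustrative scheduler-side check: a node is only
// considered schedulable if it does not carry a true "NetworkNotReady"
// condition, i.e. the route controller has already created its route.
func nodeNetworkReady(node *api.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == "NetworkNotReady" && cond.Status == api.ConditionTrue {
			return false
		}
	}
	return true
}
```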
@gmarek gmarek deleted the routecontroller branch August 30, 2016 09:47