Duplicate pod CIDR assignments in 1.3.0 #29058
We spun up an 800 node cluster today, across 2 zones, and found 27 duplicate pod CIDRs (and duplicate GCE routes) assigned. This obviously caused havoc.
We don't have hard evidence, but we think it is NOT related to multi-zone (we found some dups in the same zone).
@kubernetes/sig-scalability

Comments
Similar 1.2 cluster does NOT show dups.
I'm going to guess something here isn't locking when it should: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/cidr_set.go
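For context, cidr_set.go carves per-node pod ranges out of the cluster CIDR and tracks which ranges are taken. A minimal toy sketch of that idea, with the locking one would expect (the names and the /16-into-/24 split are illustrative, not the real implementation):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
)

// cidrSet hands out fixed-size sub-CIDRs (e.g. /24 pod ranges) from a
// cluster CIDR (e.g. /16). "used" tracks which sub-ranges are taken.
type cidrSet struct {
	mu          sync.Mutex
	clusterCIDR *net.IPNet
	subNetBits  int // prefix length of each allocated range, e.g. 24
	used        map[int]bool
	max         int
}

func newCIDRSet(clusterCIDR string, subNetBits int) (*cidrSet, error) {
	_, ipnet, err := net.ParseCIDR(clusterCIDR)
	if err != nil {
		return nil, err
	}
	ones, _ := ipnet.Mask.Size()
	return &cidrSet{
		clusterCIDR: ipnet,
		subNetBits:  subNetBits,
		used:        map[int]bool{},
		max:         1 << uint(subNetBits-ones), // e.g. 256 /24s in a /16
	}, nil
}

// allocateNext returns the first free sub-CIDR. The mutex matters: two
// goroutines allocating concurrently must not observe the same free slot.
func (s *cidrSet) allocateNext() (*net.IPNet, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := 0; i < s.max; i++ {
		if !s.used[i] {
			s.used[i] = true
			return s.nth(i), nil
		}
	}
	return nil, errors.New("CIDR space exhausted")
}

// nth computes the i-th sub-range, e.g. 10.244.i.0/24 inside 10.244.0.0/16.
func (s *cidrSet) nth(i int) *net.IPNet {
	ip := s.clusterCIDR.IP.To4()
	base := uint32(ip[0])<<24 | uint32(ip[1])<<16 | uint32(ip[2])<<8 | uint32(ip[3])
	base += uint32(i) << uint(32-s.subNetBits)
	return &net.IPNet{
		IP:   net.IPv4(byte(base>>24), byte(base>>16), byte(base>>8), byte(base)),
		Mask: net.CIDRMask(s.subNetBits, 32),
	}
}

func main() {
	s, _ := newCIDRSet("10.244.0.0/16", 24)
	a, _ := s.allocateNext()
	b, _ := s.allocateNext()
	fmt.Println(a, b) // 10.244.0.0/24 10.244.1.0/24
}
```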
@gmarek - this may be related to the changes in node controller done in the 1.3 timeframe that you did/reviewed. Could you please take a look?
It looks like #28604 went in after 1.3.0 was cut.
AFAIK multizone only affects the cloud provider & the volume provider, and thus is not in the code path for PodCIDR allocation. (The language here is a little ambiguous, so I assume you are talking about the assignment by KCM to Node.Spec.PodCIDR for a newly-created Node.) But I'm here if anyone has any questions about multizone! Was it an HA master?
No HA master. Re-titling to remove the (likely) zone-related misinformation.
#26531 was merged into 1.3
Noticed one sketchy thing: updateCIDRAllocation didn't re-confirm that PodCIDR was still empty when updating a node.
It also continues to retry in the case of a conflict, though that should only cause performance problems.
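For illustration, a hedged sketch of the kind of re-check being suggested, written against the current client-go API (the function name matches the thread, but the body is purely illustrative, not the actual code):

```go
package allocator

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Illustrative only: re-confirm the node still has no PodCIDR before
// writing one, and bound the retries on conflict instead of looping forever.
func updateCIDRAllocation(c kubernetes.Interface, nodeName, cidr string) error {
	for attempt := 0; attempt < 3; attempt++ {
		node, err := c.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if node.Spec.PodCIDR != "" {
			// Someone (another KCM instance, or an earlier attempt whose
			// write actually landed) already assigned a CIDR; don't overwrite.
			return nil
		}
		node.Spec.PodCIDR = cidr
		if _, err = c.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err == nil {
			return nil
		} else if !apierrors.IsConflict(err) {
			return err
		}
		// On conflict, loop: re-fetch and re-check rather than blindly retrying.
	}
	return fmt.Errorf("too many conflicts updating node %q", nodeName)
}
```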
Q: If a node is deleted, we release the NodeCIDR. What happens if the node is actually still alive? Will the kubelet recreate the same Node record with the existing PodCIDR? This might happen if the cloudprovider stopped reporting an instance as being present and the nodecontroller deleted it, but the node was still alive (e.g. eventual-consistency problems on AWS), so the kubelet would presumably try to re-register. (Edit: which could make it multizone related, because that is the cloud provider.)
Regardless of my tendency to leap to overcomplicated explanations... Two big problems in the code leap out at me:
1. On a KCM restart the in-memory CIDR map starts empty and is never repopulated from existing nodes, so already-assigned CIDRs can be handed out again.
2. Errors from the CIDR allocator don't cause a panic (or any retry); they appear to just be logged and ignored.
I bet it is the first; I don't think the second would be root cause. Do we know if KCM restarted during cluster bringup? (Edit: I guess it should be a panic only in some cases: double allocation => panic; running out of allocations we can tolerate, though I don't see retry logic.)
KCM was restarted multiple times during cluster turnup for master resizing.
The smoking gun would be if nodes with duplicate CIDRs started appearing soon after a KCM restart. We can probably guess that within (say) 2 minutes of KCM starting, the map would be fully populated, even with 800 nodes.
Suggested repro steps also if so:
1. Create a small cluster (e.g. 3 nodes).
2. Start a resize to add more nodes.
3. Restart KCM while the resize is in progress.
4. Check whether any nodes ended up with duplicate PodCIDRs.
(Edit: I haven't actually tried this!)
Yeah, that makes sense. You're basically saying the list of nodes isn't ordered, so the new nodes without CIDRs might show up before the ones with them, and the CIDR map is empty at the time. If you have a minute, testing that theory in a unit test would help.
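A sketch of such a unit test, reusing the toy cidrSet from the sketch above (the real allocator differs, but the shape of the race is the same): simulate a restart by building a fresh, empty set and allocating before any existing assignments have been observed.

```go
package main

import "testing"

// Theory under test: after a KCM restart the in-memory CIDR set starts
// empty, so ranges already written to nodes aren't marked used and a new
// node can be handed a duplicate.
func TestRestartCanReissueCIDRs(t *testing.T) {
	// Before the "restart": a node already received a CIDR.
	before, _ := newCIDRSet("10.244.0.0/16", 24)
	assigned, _ := before.allocateNext() // written to node A's Spec.PodCIDR

	// The "restart": a fresh set whose used-map is empty, as in 1.3.0.
	after, _ := newCIDRSet("10.244.0.0/16", 24)

	// A new node asks for a CIDR before node A has been observed/occupied.
	dup, _ := after.allocateNext()
	if dup.String() != assigned.String() {
		t.Fatalf("expected duplicate %s, got %s", assigned, dup)
	}
	// The test "passing" is the bug: 10.244.0.0/24 was issued twice.
}
```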
I can probably have a look this evening. I'll try to verify "real-world" first, just in case I'm missing a trick. BTW, why does GCE allow duplicate routes? Our route controller isn't going to like that, I don't think...
Nice call @justinsb. Your repro worked. I created a 3-node cluster, resized it to 6, and killed KCM while the resize was in progress. Got duplicate CIDR allocations for 2 different nodes.
Cool! So maybe we just need to list all nodes on restart and populate the map. Might still be racy; let's see.
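Continuing the toy sketch from above, the proposed fix amounts to an "occupy" pass run over all existing nodes before any allocation is served (the helper names here are hypothetical; v1.Node is the real API type):

```go
package main

import (
	"fmt"
	"net"

	v1 "k8s.io/api/core/v1"
)

// occupy marks an already-assigned sub-CIDR as used, so allocateNext will
// never re-issue it.
func (s *cidrSet) occupy(ipnet *net.IPNet) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	ip := ipnet.IP.To4()
	base := s.clusterCIDR.IP.To4()
	got := uint32(ip[0])<<24 | uint32(ip[1])<<16 | uint32(ip[2])<<8 | uint32(ip[3])
	start := uint32(base[0])<<24 | uint32(base[1])<<16 | uint32(base[2])<<8 | uint32(base[3])
	i := int((got - start) >> uint(32-s.subNetBits))
	if i < 0 || i >= s.max {
		return fmt.Errorf("%v is outside the cluster CIDR", ipnet)
	}
	s.used[i] = true
	return nil
}

// primeFromExistingNodes is the startup pass: walk every node and occupy
// any PodCIDR it already holds, *before* serving allocations.
func primeFromExistingNodes(s *cidrSet, nodes []v1.Node) error {
	for _, n := range nodes {
		if n.Spec.PodCIDR == "" {
			continue
		}
		_, ipnet, err := net.ParseCIDR(n.Spec.PodCIDR)
		if err != nil {
			return err
		}
		if err := s.occupy(ipnet); err != nil {
			return err
		}
	}
	return nil
}
```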
#26531 also added a call to nc.cidrAllocator.Release in the case of an API call failure. That's unsafe: the CIDR could actually have been written to the node.
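A small, self-contained simulation of why that release is unsafe (the types here are entirely hypothetical; the point is only that a write can land server-side while the client sees an error):

```go
package main

import "fmt"

// Toy stand-in for the apiserver: the write persists even though the
// caller gets back an error (e.g. a timeout after the write committed).
type fakeAPIServer struct {
	podCIDR map[string]string
}

func (f *fakeAPIServer) update(node, cidr string) error {
	f.podCIDR[node] = cidr      // the write lands...
	return fmt.Errorf("timeout") // ...but the client never learns that
}

func main() {
	api := &fakeAPIServer{podCIDR: map[string]string{}}
	used := map[string]bool{}

	allocate := func() string {
		for _, c := range []string{"10.244.0.0/24", "10.244.1.0/24"} {
			if !used[c] {
				used[c] = true
				return c
			}
		}
		return ""
	}

	cidr := allocate()
	if err := api.update("node-a", cidr); err != nil {
		delete(used, cidr) // unsafe release: node-a may already hold it
	}
	api.update("node-b", allocate()) // node-b is handed the same range

	fmt.Println(api.podCIDR) // map[node-a:10.244.0.0/24 node-b:10.244.0.0/24]
}
```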
Unless we cherrypick or roll back #28604, we'll need to fix this directly in the release branch, as well as in head.
And #19242, which IIUC @bprashanth mentioned in #29058 (comment), is the real culprit.
@bprashanth are you working on implementing "list all nodes" on NC restart?
I plan to have a hack PR up so Jeff can test shortly.
Hooray, I get my evening back - thanks @jlowdermilk for confirming!
@jlowdermilk fyi #29062, plan to iterate
Nice sleuthing, everyone!
I'm going to try this on a larger GCE cluster, restarting things: #29062. A similar GKE test would be great, and eyes on review.
@jlowdermilk found a reasonable workaround for GKE, which should work for now: detect the duplicate route, pick one of the two VM instances, and tell the IGM to delete it. A new one gets created and all is OK.
And of course you don't want to restart the controller manager during turnup if you can avoid it.
Overlapping routes should always work because longest prefix wins. Apparently duplicate routes are "undefined behavior", as our debugging shows.
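A toy illustration of the longest-prefix-match rule (not GCE's actual route-selection code): overlapping routes are well-defined because the most specific prefix wins, whereas identical prefixes leave no tiebreaker.

```go
package main

import (
	"fmt"
	"net"
)

// pickRoute returns the most specific route containing dst. With two
// identical prefixes, neither is "more specific" - hence undefined behavior.
func pickRoute(dst net.IP, routes []*net.IPNet) *net.IPNet {
	var best *net.IPNet
	bestLen := -1
	for _, r := range routes {
		if ones, _ := r.Mask.Size(); r.Contains(dst) && ones > bestLen {
			best, bestLen = r, ones
		}
	}
	return best
}

func main() {
	_, wide, _ := net.ParseCIDR("10.244.0.0/16")
	_, narrow, _ := net.ParseCIDR("10.244.3.0/24")
	fmt.Println(pickRoute(net.ParseIP("10.244.3.7"), []*net.IPNet{wide, narrow})) // 10.244.3.0/24
}
```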
@fabioy I've been meaning to write something like that for AWS for a while now. My use case is that sometimes AWS nodes don't show up (particularly with 1.1, where we used master/slave salt), but you can always just get a bad instance. We could kill:
- instances that never register as nodes
- nodes we detect with duplicate routes / PodCIDRs (as above)
And I suspect we'll find more of these if we create it. I was thinking of a totally external controller that just watches nodes and talks to the GCE/AWS APIs. Thoughts? Want to collaborate?
Filed #29064 for the bugs in updateCIDRAllocation
@justinsb Please file a separate issue for your node auto-healing ideas. Note that there's a big difference between killing functioning and broken nodes, however. Functioning nodes need to be drained gracefully and in a fashion that takes application disruption budgets into account.
Automatic merge from submit-queue: List all nodes and occupy cidr map before starting allocations. Manually tested by starting a 200-node cluster with frequent controller-manager restarts. Fixes #29058
Cherrypick PR is #29063