
Duplicate pod CIDR assignments in 1.3.0 #29058

Closed
thockin opened this issue Jul 16, 2016 · 35 comments
Labels
area/nodecontroller, priority/critical-urgent

Comments

@thockin
Member

thockin commented Jul 16, 2016

We spun up an 800-node cluster today across 2 zones, and we found 27 duplicate pod CIDR assignments (and duplicate GCE routes). This obviously caused havoc.

We don't have hard evidence, but we think it is NOT related to multi-zone (we found some dups in the same zone).

@kubernetes/sig-scalability

@thockin added the priority/critical-urgent and area/nodecontroller labels Jul 16, 2016
@thockin
Member Author

thockin commented Jul 16, 2016

Similar 1.2 cluster does NOT show dups.

@bprashanth
Contributor

I'm going to guess something here isn't locking when it should: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/cidr_set.go
@mqliang @gmarek
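For reference, a minimal sketch in Go of the first-fit pattern under suspicion, with illustrative names only (this is not the actual cidr_set.go code): the mutex is what keeps two concurrent allocations from grabbing the same free slot, and occupy is what a restart-recovery path would use to mark CIDRs that are already assigned.

// A toy first-fit CIDR allocator guarded by a mutex. Without the lock,
// two concurrent callers could both observe the same free slot and hand
// out the same CIDR, which is the failure mode being guessed at here.
package main

import (
    "errors"
    "fmt"
    "sync"
)

type cidrSet struct {
    mu   sync.Mutex
    used []bool // one entry per possible per-node subnet
}

// allocateNext returns the first free subnet index (first-fit).
func (s *cidrSet) allocateNext() (int, error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    for i, inUse := range s.used {
        if !inUse {
            s.used[i] = true
            return i, nil
        }
    }
    return 0, errors.New("CIDR space exhausted")
}

// occupy marks an already-assigned subnet as in use, e.g. when rebuilding
// state from existing Node.Spec.PodCIDR values after a controller restart.
func (s *cidrSet) occupy(i int) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.used[i] = true
}

func main() {
    s := &cidrSet{used: make([]bool, 256)}
    a, _ := s.allocateNext()
    b, _ := s.allocateNext()
    fmt.Println(a, b) // prints 0 1: distinct as long as every allocation takes the lock
}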

@wojtek-t
Member

@gmarek - this may be related to the node controller changes made or reviewed by you in the 1.3 timeframe. Could you please take a look?

@bgrant0607
Member

It looks like #28604 went in after 1.3.0 was cut.

@justinsb
Member

AFAIK multizone only affects the cloud provider & the volume provider, and thus is not in the code path for PodCIDR allocation. (The language here is maybe a little ambiguous, so I assume you are talking about the assignment by the KCM to Node.Spec.PodCIDR for a newly created Node.) But I'm here if anyone has any questions about multizone!

Was it HA master?

@thockin
Member Author

thockin commented Jul 16, 2016

No HA master. Re-titling to remove the (likely) misinformation about zone relation.

@thockin thockin changed the title Duplicate pod CIDR assignments with 2 zones Duplicate pod CIDR assignments in 1.3.0 Jul 16, 2016
@bgrant0607
Member

#26531 was merged into 1.3

@bgrant0607
Member

Noticed one sketchy thing: updateCIDRAllocation didn't re-confirm that PodCIDR was still empty when updating a node.
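A minimal sketch of that guard, written against today's client-go API for illustration (the 1.3-era client differed); updateCIDRAllocation here is a stand-in, not the actual node controller function:

// Re-read the Node and confirm PodCIDR is still empty immediately before
// writing it, rather than trusting the copy of the Node seen at allocation time.
package sketch

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

func updateCIDRAllocation(ctx context.Context, c kubernetes.Interface, nodeName, podCIDR string) error {
    node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    if node.Spec.PodCIDR != "" {
        // Someone else (or an earlier retry) already wrote a CIDR: don't
        // overwrite it, and don't release the candidate CIDR back into the
        // pool, since we can no longer be sure it is unused.
        return fmt.Errorf("node %s already has PodCIDR %s", nodeName, node.Spec.PodCIDR)
    }
    node.Spec.PodCIDR = podCIDR
    _, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    return err
}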

@bgrant0607
Member

It also continues to retry in the case of conflict, though that should just cause performance problems.

@justinsb
Member

justinsb commented Jul 16, 2016

Q: If a node is deleted, we release the NodeCIDR. What happens if the node is actually still alive? Will the kubelet recreate the same Node record with the existing PodCIDR? This might happen if the cloud provider stopped reporting an instance as present and the node controller deleted it, but the node was still alive (e.g. eventual-consistency problems on AWS), so the kubelet would presumably try to re-register.

(Edit: which could make it multizone related, because that is the cloud provider)

@bprashanth
Contributor

Yeah, #28604 should be just a refactor; the original PR is #19242.

@justinsb
Member

justinsb commented Jul 16, 2016

Regardless of my tendency to leap to overcomplicated explanations...

Two big problems in the code that leap out at me:

  1. I don't see any code that waits for the Node list to complete before handing out PodCIDRs. In other words, I don't think we wait for the allocation map to be fully populated before handing out more pod CIDRs. I could obviously be missing it!
  2. glog.Errorf should be a panic (or glog.Fatalf here) in nodecontroller.go:
if nc.allocateNodeCIDRs {
    nodeEventHandlerFuncs = framework.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            node := obj.(*api.Node)
            err := nc.cidrAllocator.AllocateOrOccupyCIDR(node)
            if err != nil {
                glog.Errorf("Error allocating CIDR: %v", err)
            }
        },
    }
}

I bet it is the first. I don't think the second would be root-cause.

Do we know if KCM restarted during cluster bringup?

(Edit: I guess it should be a panic only in some cases: a double allocation should panic; running out of allocations is something we can tolerate, though I don't see retry logic.)
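A self-contained sketch of the triage suggested in point 2: crash on a duplicate allocation (corrupted allocator state), tolerate exhaustion, and log anything else. The error values here are hypothetical stand-ins, not the real nodecontroller errors.

package main

import (
    "errors"
    "log"
)

var (
    errCIDRAlreadyAllocated = errors.New("cidr already allocated") // hypothetical sentinel
    errCIDRSpaceExhausted   = errors.New("cidr space exhausted")   // hypothetical sentinel
)

func handleAllocationError(nodeName string, err error) {
    switch {
    case err == nil:
        return
    case errors.Is(err, errCIDRAlreadyAllocated):
        // Allocator state is inconsistent; better to crash and rebuild from
        // the API than to keep handing out overlapping pod CIDRs.
        log.Fatalf("duplicate CIDR allocation for node %s: %v", nodeName, err)
    case errors.Is(err, errCIDRSpaceExhausted):
        // Survivable: nothing left to assign, but existing assignments stay valid.
        log.Printf("no CIDRs left for node %s: %v", nodeName, err)
    default:
        log.Printf("error allocating CIDR for node %s: %v", nodeName, err)
    }
}

func main() {
    handleAllocationError("node-1", errCIDRSpaceExhausted)
}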

@j3ffml
Contributor

j3ffml commented Jul 16, 2016

KCM was restarted multiple times during cluster turnup for master resizing.

@justinsb
Member

The smoking gun would be if nodes with duplicate CIDRs started soon after a KCM restart. We can probably guess that within (say) 2 minutes of KCM start, the map would be fully populated, even with 800 nodes.
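One read-only way to look for that smoking gun, assuming a modern client-go setup and a local kubeconfig (purely illustrative): group nodes by Spec.PodCIDR and print the creation timestamps of any nodes sharing a CIDR, then compare those timestamps against the KCM restart times.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }

    // Group node names (with creation times) by assigned pod CIDR.
    byCIDR := map[string][]string{}
    for _, n := range nodes.Items {
        if n.Spec.PodCIDR == "" {
            continue
        }
        byCIDR[n.Spec.PodCIDR] = append(byCIDR[n.Spec.PodCIDR],
            fmt.Sprintf("%s (created %s)", n.Name, n.CreationTimestamp))
    }
    for cidr, owners := range byCIDR {
        if len(owners) > 1 {
            fmt.Printf("duplicate %s: %v\n", cidr, owners)
        }
    }
}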

@justinsb
Member

justinsb commented Jul 16, 2016

If so, suggested repro steps:

  • Bring up 3 nodes (say)
  • Stop KCM (but leave API running)
  • Bring up 3 more nodes - they will be created in the API but KCM won't allocate them CIDRs
  • Restart KCM
  • Unless KCM is lucky and sees the first 3 nodes and then the next 3 nodes, you will get a collision because we're using first-fit

(Edit: I haven't actually tried this!)

@bprashanth
Contributor

Yeah, that makes sense. You're basically saying the list of nodes isn't ordered, so the new nodes without CIDRs might show up before the ones with them, while the CIDR map is still empty. If you have a minute, testing that theory in a unit test would help.

@justinsb
Member

I can probably have a look this evening. I'll try to verify "real-world" first, just in case I'm missing a trick.

BTW why does GCE allow duplicate routes? Our route controller isn't going to like that, I don't think...

@j3ffml
Contributor

j3ffml commented Jul 16, 2016

Nice call @justinsb. Your repro worked. I created a 3-node cluster, resized it to 6, and killed the KCM while the resize was in progress. Got duplicate CIDR allocations for 2 different nodes.

@bprashanth
Contributor

Cool! So maybe we just need to list all nodes on restart and populate the map. Might still be racy; let's see.
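A minimal sketch of that approach, with a plain string-keyed map standing in for the real cidr_set bookkeeping and today's client-go API assumed: list every Node on startup, record any PodCIDR that is already assigned, and only then let first-fit allocation start serving Add events.

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// occupyExistingCIDRs returns the set of pod CIDRs already assigned to nodes.
// A first-fit allocator would treat every entry here as taken before handing
// out anything new, so a restarted controller manager cannot re-issue a CIDR
// that is already in use.
func occupyExistingCIDRs(ctx context.Context, c kubernetes.Interface) (map[string]bool, error) {
    occupied := map[string]bool{}
    nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, err
    }
    for _, n := range nodes.Items {
        if n.Spec.PodCIDR != "" {
            occupied[n.Spec.PodCIDR] = true
        }
    }
    return occupied, nil
}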

@bgrant0607
Member

#26531 also added a call to nc.cidrAllocator.Release in the case of an API call failure. That's unsafe: the CIDR could actually have been written to the node.

@bgrant0607
Member

Unless we cherrypick or rollback #28604, we'll need to fix this directly in the release branch, as well as in head.

@davidopp
Member

davidopp commented Jul 16, 2016

Unless we cherrypick or rollback #28604, we'll need to fix this directly in the release branch, as well as in head.

And #19242, which IIUC @bprashanth mentioned in #29058 (comment), is the real culprit.

@davidopp
Member

@bprashanth are you working on implementing the "list all nodes on NC restart" fix?

@bprashanth
Contributor

I plan to have a hack PR up shortly so Jeff can test.

@justinsb
Member

Hooray I get my evening back - thanks @jlowdermilk for confirming!

@bprashanth
Contributor

@jlowdermilk FYI #29062; I plan to iterate.

@thockin
Member Author

thockin commented Jul 16, 2016

Nice sleuthing, everyone!


@bprashanth
Contributor

I'm going to try this on a larger GCE cluster, restarting things: #29062. A similar GKE test would be great, and eyes on the review.

@fabioy
Contributor

fabioy commented Jul 16, 2016

@jlowdermilk found a reasonable workaround for GKE, which should work for now: detect the duplicate route, pick one of the two VM instances, and tell the IGM to delete it. A new one gets created and all is OK.
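A hedged sketch of the detection half of that workaround using google.golang.org/api/compute/v1; the project ID is a placeholder, auth comes from Application Default Credentials, and pagination is ignored. The deletion step through the managed instance group is only described in a comment.

package main

import (
    "context"
    "fmt"
    "log"

    compute "google.golang.org/api/compute/v1"
)

func main() {
    ctx := context.Background()
    svc, err := compute.NewService(ctx) // uses Application Default Credentials
    if err != nil {
        log.Fatal(err)
    }

    const project = "my-gke-project" // placeholder: replace with the real project ID

    routes, err := svc.Routes.List(project).Do() // first page only; enough for a spot check
    if err != nil {
        log.Fatal(err)
    }

    // Flag any destination range that more than one route points at.
    byRange := map[string][]string{}
    for _, r := range routes.Items {
        byRange[r.DestRange] = append(byRange[r.DestRange], r.Name)
    }
    for destRange, names := range byRange {
        if len(names) > 1 {
            fmt.Printf("duplicate route for %s: %v\n", destRange, names)
            // Workaround from this thread: pick one of the two backing VM
            // instances and delete it through its managed instance group
            // (InstanceGroupManagers.DeleteInstances); a replacement node
            // comes up and gets a fresh CIDR.
        }
    }
}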

@bprashanth
Contributor

And of course you don't want to restart the controller manager during turnup if you can avoid it.

@bprashanth
Contributor

BTW why does GCE allow duplicate routes? Our route controller isn't going to like that, I don't think...

Overlapping routes should always work because longest prefix wins. Apparently duplicate routes are "undefined behavior", as our debugging shows.

@justinsb
Member

justinsb commented Jul 16, 2016

@fabioy I've been meaning to write something like that for AWS for a while now. My use case is that sometimes AWS nodes don't show up (particularly with 1.1, where we used master/slave Salt), but you can always just get a bad instance. We could kill:

  • Instances that haven't registered as nodes within 1 hour
  • Nodes that remain in NotReady for 2 hours
  • Nodes after some number of days (generally people report good results from doing this; particularly with regards to weird docker issues)
  • Nodes that need security updates (rather than live-reloading?)

And I suspect we'll find more of these if we create it. I was thinking of a totally external controller that just watches nodes and talks to the GCE/AWS APIs.

Thoughts? Want to collaborate?

@bgrant0607
Member

Filed #29064 for the bugs in updateCIDRAllocation.

@bgrant0607
Member

@justinsb Please file a separate issue for your node auto-healing ideas. Note that there's a big difference between killing functioning and broken nodes, however. Functioning nodes need to be drained gracefully and in a fashion that takes application disruption budgets into account.

k8s-github-robot pushed a commit that referenced this issue Jul 16, 2016
Automatic merge from submit-queue

List all nodes and occupy cidr map before starting allocations

Manually tested by starting a 200 node cluster with frequent controller manager restarts.
Fixes #29058
@bgrant0607 added the cherrypick-candidate and cherry-pick-approved labels Jul 16, 2016
@bgrant0607
Member

Cherrypick PR is #29063

@roberthbailey removed the cherry-pick-approved and cherrypick-candidate labels Jul 17, 2016