
Duplicate pod CIDR assignments in 1.3.0 #29058

Closed
thockin opened this issue Jul 16, 2016 · 35 comments
Labels
area/nodecontroller, priority/critical-urgent

Comments

@thockin
Member

thockin commented Jul 16, 2016

We spun up an 800-node cluster today across 2 zones, and we found 27 duplicate pod CIDR assignments (and duplicate GCE routes). This obviously caused havoc.

We don't have hard evidence, but we think it is NOT related to multi-zone (we found some dups in the same zone).

@kubernetes/sig-scalability

@thockin added the priority/critical-urgent and area/nodecontroller labels Jul 16, 2016
@thockin
Member Author

thockin commented Jul 16, 2016

Similar 1.2 cluster does NOT show dups.

@bprashanth
Contributor

I'm going to guess something here isn't locking when it should: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/cidr_set.go
@mqliang @gmarek
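For reference, a minimal sketch in Go of the first-fit pattern under suspicion, with illustrative names only (this is not the actual cidr_set.go code): the mutex is what keeps two concurrent allocations from grabbing the same free slot, and occupy is what a restart-recovery path would use to mark CIDRs that are already assigned.

// A toy first-fit CIDR allocator guarded by a mutex. Without the lock,
// two concurrent callers could both observe the same free slot and hand
// out the same CIDR, which is the failure mode being guessed at here.
package main

import (
    "errors"
    "fmt"
    "sync"
)

type cidrSet struct {
    mu   sync.Mutex
    used []bool // one entry per possible per-node subnet
}

// allocateNext returns the first free subnet index (first-fit).
func (s *cidrSet) allocateNext() (int, error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    for i, inUse := range s.used {
        if !inUse {
            s.used[i] = true
            return i, nil
        }
    }
    return 0, errors.New("CIDR space exhausted")
}

// occupy marks an already-assigned subnet as in use, e.g. when rebuilding
// state from existing Node.Spec.PodCIDR values after a controller restart.
func (s *cidrSet) occupy(i int) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.used[i] = true
}

func main() {
    s := &cidrSet{used: make([]bool, 256)}
    a, _ := s.allocateNext()
    b, _ := s.allocateNext()
    fmt.Println(a, b) // prints 0 1: distinct as long as every allocation takes the lock
}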

@wojtek-t
Member

@gmarek - this may be related to the node controller changes made or reviewed by you in the 1.3 timeframe. Could you please take a look?

@bgrant0607
Member

It looks like #28604 went in after 1.3.0 was cut.

@justinsb
Member

AFAIK multizone only affects the cloud provider & the volume provider, and thus is not in the code path for PodCIDR allocation. (The language here is maybe a little ambiguous, so I assume you are talking about the assignment by the KCM to Node.Spec.PodCIDR for a newly created Node.) But I'm here if anyone has any questions about multizone!

Was it HA master?

@thockin
Member Author

thockin commented Jul 16, 2016

No HA master. Re-titling to remove the (likely) misinformation about zone relation.

@thockin thockin changed the title Duplicate pod CIDR assignments with 2 zones Duplicate pod CIDR assignments in 1.3.0 Jul 16, 2016
@bgrant0607
Member

#26531 was merged into 1.3

@bgrant0607
Member

Noticed one sketchy thing: updateCIDRAllocation didn't re-confirm that PodCIDR was still empty when updating a node.
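A minimal sketch of that guard, written against today's client-go API for illustration (the 1.3-era client differed); updateCIDRAllocation here is a stand-in, not the actual node controller function:

// Re-read the Node and confirm PodCIDR is still empty immediately before
// writing it, rather than trusting the copy of the Node seen at allocation time.
package sketch

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

func updateCIDRAllocation(ctx context.Context, c kubernetes.Interface, nodeName, podCIDR string) error {
    node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    if node.Spec.PodCIDR != "" {
        // Someone else (or an earlier retry) already wrote a CIDR: don't
        // overwrite it, and don't release the candidate CIDR back into the
        // pool, since we can no longer be sure it is unused.
        return fmt.Errorf("node %s already has PodCIDR %s", nodeName, node.Spec.PodCIDR)
    }
    node.Spec.PodCIDR = podCIDR
    _, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    return err
}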

@bgrant0607
Member

It also continues to retry in the case of conflict, though that should just cause performance problems.

@justinsb
Member

justinsb commented Jul 16, 2016

Q: If a node is deleted, we release the NodeCIDR. What happens if the node is actually still alive? Will the kubelet recreate the same Node record with the existing PodCIDR? This might happen if the cloud provider stopped reporting an instance as present and the node controller deleted it, but the node was still alive (e.g. eventual-consistency problems on AWS), so the kubelet would presumably try to re-register.

(Edit: which could make it multizone related, because that is the cloud provider)

@bprashanth
Contributor

Yeah, #28604 should be just a refactor; the original PR is #19242.

@justinsb
Member

justinsb commented Jul 16, 2016

Regardless of my tendency to leap to overcomplicated explanations...

Two big problems in the code that leap out at me:

  1. I don't see any code that waits for the Node list to complete before handing out PodCIDRs. In other words, I don't think we wait for the allocation map to be fully populated before handing out more pod CIDRs. I could obviously be missing it!
  2. glog.Errorf should be a panic (or glog.Fatalf here) in nodecontroller.go:
if nc.allocateNodeCIDRs {
    nodeEventHandlerFuncs = framework.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            node := obj.(*api.Node)
            err := nc.cidrAllocator.AllocateOrOccupyCIDR(node)
            if err != nil {
                glog.Errorf("Error allocating CIDR: %v", err)
            }
        },
    }
}

I bet it is the first. I don't think the second would be root-cause.

Do we know if KCM restarted during cluster bringup?

(Edit: I guess it should be a panic only in some cases: a double allocation should panic; running out of allocations is something we can tolerate, though I don't see retry logic.)
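A self-contained sketch of the triage suggested in point 2: crash on a duplicate allocation (corrupted allocator state), tolerate exhaustion, and log anything else. The error values here are hypothetical stand-ins, not the real nodecontroller errors.

package main

import (
    "errors"
    "log"
)

var (
    errCIDRAlreadyAllocated = errors.New("cidr already allocated") // hypothetical sentinel
    errCIDRSpaceExhausted   = errors.New("cidr space exhausted")   // hypothetical sentinel
)

func handleAllocationError(nodeName string, err error) {
    switch {
    case err == nil:
        return
    case errors.Is(err, errCIDRAlreadyAllocated):
        // Allocator state is inconsistent; better to crash and rebuild from
        // the API than to keep handing out overlapping pod CIDRs.
        log.Fatalf("duplicate CIDR allocation for node %s: %v", nodeName, err)
    case errors.Is(err, errCIDRSpaceExhausted):
        // Survivable: nothing left to assign, but existing assignments stay valid.
        log.Printf("no CIDRs left for node %s: %v", nodeName, err)
    default:
        log.Printf("error allocating CIDR for node %s: %v", nodeName, err)
    }
}

func main() {
    handleAllocationError("node-1", errCIDRSpaceExhausted)
}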

@j3ffml
Contributor

j3ffml commented Jul 16, 2016

KCM was restarted multiple times during cluster turnup for master resizing.

@justinsb
Member

The smoking gun would be if nodes with duplicate CIDRs started soon after a KCM restart. We can probably guess that within (say) 2 minutes of KCM start, the map would be fully populated, even with 800 nodes.
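One read-only way to look for that smoking gun, assuming a modern client-go setup and a local kubeconfig (purely illustrative): group nodes by Spec.PodCIDR and print the creation timestamps of any nodes sharing a CIDR, then compare those timestamps against the KCM restart times.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }

    // Group node names (with creation times) by assigned pod CIDR.
    byCIDR := map[string][]string{}
    for _, n := range nodes.Items {
        if n.Spec.PodCIDR == "" {
            continue
        }
        byCIDR[n.Spec.PodCIDR] = append(byCIDR[n.Spec.PodCIDR],
            fmt.Sprintf("%s (created %s)", n.Name, n.CreationTimestamp))
    }
    for cidr, owners := range byCIDR {
        if len(owners) > 1 {
            fmt.Printf("duplicate %s: %v\n", cidr, owners)
        }
    }
}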

@justinsb
Member

justinsb commented Jul 16, 2016

If so, suggested repro steps:

  • Bring up 3 nodes (say)
  • Stop KCM (but leave API running)
  • Bring up 3 more nodes - they will be created in the API but KCM won't allocate them CIDRs
  • Restart KCM
  • Unless KCM is lucky and sees the first 3 nodes and then the next 3 nodes, you will get a collision because we're using first-fit

(Edit: I haven't actually tried this!)

@bprashanth
Contributor

Yeah, that makes sense. You're basically saying the list of nodes isn't ordered, so the new nodes without CIDRs might show up before the ones with them, while the CIDR map is still empty. If you have a minute, testing that theory in a unit test would help.

@justinsb
Member

I can probably have a look this evening. I'll try to verify "real-world" first, just in case I'm missing a trick.

BTW why does GCE allow duplicate routes? Our route controller isn't going to like that, I don't think...

@j3ffml
Contributor

j3ffml commented Jul 16, 2016

Nice call @justinsb. Your repro worked. I created a 3-node cluster, resized it to 6, and killed the KCM while the resize was in progress. Got duplicate CIDR allocations for 2 different nodes.

@bprashanth
Contributor

Cool! So maybe we just need to list all nodes on restart and populate the map. Might still be racy; let's see.
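A minimal sketch of that approach, with a plain string-keyed map standing in for the real cidr_set bookkeeping and today's client-go API assumed: list every Node on startup, record any PodCIDR that is already assigned, and only then let first-fit allocation start serving Add events.

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// occupyExistingCIDRs returns the set of pod CIDRs already assigned to nodes.
// A first-fit allocator would treat every entry here as taken before handing
// out anything new, so a restarted controller manager cannot re-issue a CIDR
// that is already in use.
func occupyExistingCIDRs(ctx context.Context, c kubernetes.Interface) (map[string]bool, error) {
    occupied := map[string]bool{}
    nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, err
    }
    for _, n := range nodes.Items {
        if n.Spec.PodCIDR != "" {
            occupied[n.Spec.PodCIDR] = true
        }
    }
    return occupied, nil
}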

@bgrant0607
Member

#26531 also added a call to nc.cidrAllocator.Release in the case of an API call failure. That's unsafe: the CIDR could actually have been written to the node.

@bgrant0607
Member

Unless we cherrypick or rollback #28604, we'll need to fix this directly in the release branch, as well as in head.

@davidopp
Member

davidopp commented Jul 16, 2016

Unless we cherrypick or rollback #28604, we'll need to fix this directly in the release branch, as well as in head.

And #19242, which IIUC @bprashanth mentioned in #29058 (comment), is the real culprit.

@davidopp
Member

@bprashanth are you working on implementing the "list all nodes on NC restart" fix?

@bprashanth
Contributor

I plan to have a hack PR up shortly so Jeff can test.

@justinsb
Member

Hooray I get my evening back - thanks @jlowdermilk for confirming!

@bprashanth
Contributor

@jlowdermilk FYI #29062; I plan to iterate.

@thockin
Member Author

thockin commented Jul 16, 2016

Nice sleuthing, everyone!


@bprashanth
Contributor

I'm going to try this on a larger GCE cluster, restarting things: #29062. A similar GKE test would be great, and eyes on the review.

@fabioy
Contributor

fabioy commented Jul 16, 2016

@jlowdermilk found a reasonable workaround for GKE, which should work for now: detect the duplicate route, pick one of the two VM instances, and tell the IGM to delete it. A new one gets created and all is OK.
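A hedged sketch of the detection half of that workaround using google.golang.org/api/compute/v1; the project ID is a placeholder, auth comes from Application Default Credentials, and pagination is ignored. The deletion step through the managed instance group is only described in a comment.

package main

import (
    "context"
    "fmt"
    "log"

    compute "google.golang.org/api/compute/v1"
)

func main() {
    ctx := context.Background()
    svc, err := compute.NewService(ctx) // uses Application Default Credentials
    if err != nil {
        log.Fatal(err)
    }

    const project = "my-gke-project" // placeholder: replace with the real project ID

    routes, err := svc.Routes.List(project).Do() // first page only; enough for a spot check
    if err != nil {
        log.Fatal(err)
    }

    // Flag any destination range that more than one route points at.
    byRange := map[string][]string{}
    for _, r := range routes.Items {
        byRange[r.DestRange] = append(byRange[r.DestRange], r.Name)
    }
    for destRange, names := range byRange {
        if len(names) > 1 {
            fmt.Printf("duplicate route for %s: %v\n", destRange, names)
            // Workaround from this thread: pick one of the two backing VM
            // instances and delete it through its managed instance group
            // (InstanceGroupManagers.DeleteInstances); a replacement node
            // comes up and gets a fresh CIDR.
        }
    }
}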

@bprashanth
Contributor

And of course you don't want to restart the controller manager during turnup if you can avoid it.

@bprashanth
Contributor

BTW why does GCE allow duplicate routes? Our route controller isn't going to like that, I don't think...

Overlapping routes should always work because longest prefix wins. Apparently duplicate routes are "undefined behavior", as our debugging shows.

@justinsb
Member

justinsb commented Jul 16, 2016

@fabioy I've been meaning to write something like that for AWS for a while now. My use case is that sometimes AWS nodes don't show up (particularly with 1.1, where we used master/slave Salt), but you can always just get a bad instance. We could kill:

  • Instances that haven't registered as nodes within 1 hour
  • Nodes that remain in NotReady for 2 hours
  • Nodes after some number of days (generally people report good results from doing this; particularly with regards to weird docker issues)
  • Nodes that need security updates (rather than live-reloading?)

And I suspect we'll find more of these if we create it. I was thinking of a totally external controller that just watches nodes and talks to the GCE/AWS APIs.

Thoughts? Want to collaborate?

@bgrant0607
Member

Filed #29064 for the bugs in updateCIDRAllocation.

@bgrant0607
Member

@justinsb Please file a separate issue for your node auto-healing ideas. Note that there's a big difference between killing functioning and broken nodes, however. Functioning nodes need to be drained gracefully and in a fashion that takes application disruption budgets into account.

k8s-github-robot pushed a commit that referenced this issue Jul 16, 2016
Automatic merge from submit-queue

List all nodes and occupy cidr map before starting allocations

Manually tested by starting a 200 node cluster with frequent controller manager restarts.
Fixes #29058
@bgrant0607 added the cherrypick-candidate and cherry-pick-approved labels Jul 16, 2016
@bgrant0607
Member

Cherrypick PR is #29063

@roberthbailey removed the cherry-pick-approved and cherrypick-candidate labels Jul 17, 2016