Proposal: Cluster Bootstrap with Gossip #30361
GCE e2e build/test passed for commit 7e9fe3d.
What problem does gossip solve? I really struggle with this question when a Kubernetes cluster has a strong central API control plane.
Cross-posting from my reply on #30360: The point of gossip is to provide a user-friendly option for generating and distributing the CA cert and a list of API server URLs: precisely the inputs required for TLS bootstrapping. Note that gossip as described here would be optional; the "out of band" option that @alex-mohr suggested in the SIG yesterday remains available for advanced users who want more control over what's happening and want to specify the CA cert and API server URLs themselves. We're assuming that's not the 90% of new users who just want to kick the tires, though; they'll want something that works with short strings they can copy and paste on the command line.
@lukemarsden Right, but I still don't get how gossip helps. Let me run through my understanding:
So, what is gossip being used for?! Doesn't the shared secret (token) act as the bootstrap of trust? Why can't the initial control node generate the CA? Why is that happening over gossip?
### Gossip implementation
As soon as has Discover called on it, it attempts to form a secure mesh network using the token and the peers, using [a simple gossip protocol library](https://github.com/weaveworks/mesh). For more information on the library, see [this talk](http://infoq.com/presentations/weave-mesh) by the authors.
I think there is a word missing at the start of this sentence.
The linked docs say that the gossip implementation only scales to 100 peers. Since k8s currently supports 2k node clusters (with the goal to grow quite a bit by the end of the year), how do we expect to get this out of alpha and into production?
I think the key point is that the node kubelets want the ca cert so that …
(I admit that I never questioned why the TLS bootstrap requires the ca cert …)
Ya, we need the ca.crt to trust the apiserver at the initial post of the CSR.
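To make this concrete, here is a minimal Go sketch (mine, not the actual kubelet code) of why the node needs the ca.crt at this step: without it, the TLS client has no basis for trusting the server it posts its CSR to. The file paths and endpoint are illustrative.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// The CA cert obtained out of band (via gossip, in this proposal).
	caPEM, err := os.ReadFile("ca.crt") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("invalid CA cert")
	}

	// An HTTPS client that trusts *only* this CA: this is what lets the
	// node know it is posting its CSR to the real API server.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}

	csrPEM, err := os.ReadFile("node.csr") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	// Illustrative endpoint; the real flow goes through the
	// certificates API as a typed object, not a bare PEM POST.
	resp, err := client.Post(
		"https://10.0.0.1:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests",
		"application/x-pem-file", bytes.NewReader(csrPEM))
	if err != nil {
		log.Fatal(err) // fails if the server's cert isn't signed by our CA
	}
	defer resp.Body.Close()
	log.Println("CSR posted, status:", resp.Status)
}
```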
Ok so we have two types of nodes:
Nodes of type 2 have to undergo a discovery process to become nodes of type 1. Why is it easier to create nodes of type 2 and do this process than to just create nodes of type 1 directly?
When that happens, kubelet uses [CRDTs](https://github.com/weaveworks/kubelet-mesh/blob/master/state.go) to support gossiping CA certs and lists of URLs of API servers. As soon as a kubelet learns of both of these pieces of information, it returns from the Discover method, and the kubelet proceeds to attempt TLS bootstrap against the API server running on the master.
If I compromise a node that is used as the seed for other nodes, can I compromise all those other nodes?
Also dead link.
AIUI, if you compromise a node on the mesh and read the mesh key out of memory/disk, then you'd be able to read the CA cert public key and API server URLs, but no more than that.
Fixed the dead link, sorry about that. The link goes to our trivial CRDT implementation for adding API server URLs to a set and deciding on CA certs deterministically based on their creation dates (falling back to comparing signatures if the dates are equal). This took us half a day to code, based on the sample code here and watching this explanation.
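For readers who don't want to chase the link, here is a rough, self-contained sketch of the CRDT shape described above (all names are mine, not the actual kubelet-mesh code): a grow-only set of API server URLs plus a deterministic CA-cert choice by earliest creation date, with a signature comparison as the tie-break.

```go
package bootstrap

import (
	"bytes"
	"time"
)

// CACert is the gossiped CA certificate plus the metadata used to
// pick a winner deterministically.
type CACert struct {
	PEM       []byte
	Created   time.Time
	Signature []byte
}

// State is the CRDT: merging two replicas in any order, any number of
// times, converges to the same result on every peer.
type State struct {
	APIServers map[string]bool // grow-only set of API server URLs
	CA         *CACert
}

// better reports whether a should win over b, deterministically:
// the earliest creation date wins; equal dates fall back to comparing
// signatures so that every replica agrees on the same cert.
func better(a, b *CACert) bool {
	if b == nil {
		return true
	}
	if !a.Created.Equal(b.Created) {
		return a.Created.Before(b.Created)
	}
	return bytes.Compare(a.Signature, b.Signature) < 0
}

// Merge folds another replica's state into ours. It is commutative,
// associative, and idempotent -- the CRDT properties gossip needs.
func (s *State) Merge(other *State) {
	for url := range other.APIServers {
		s.APIServers[url] = true
	}
	if other.CA != nil && better(other.CA, s.CA) {
		s.CA = other.CA
	}
}
```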
This doesn't block #30360 right?
I think it boils down to not wanting to copy around cert files as part of the gating UX. As soon as you ask a busy person to manually scp a certificate, you lose.
I think it does – at least, finding a discovery mechanism we can use does (it could be a discovery service, but I got the impression that building and operating one would be tricky, plus we heard from @aronchick that users don't want to leave their firewall or operate auxiliary internal services). Put another way, if we don't have a discovery mechanism, the best UX we can offer is the one described in "out-of-band", and that is unacceptably unfriendly as the default (copying certs to machines) and thereby not competitive.
Playing devil's advocate here -- do we need a full gossip implementation? Could we do something simple like have the type 2 node create a JWT (hmac signed by the shared secret) and ask the type 1 node for the info it needs? That response could have the certificate (along with other API servers in an HA world) encoded in another JWT (also hmac signed).

More fully, I'd think about having a "cluster parameters" bag of data that a client needs to talk to the cluster -- this includes a bunch of ways to reach the cluster (DNS, internal IPs, external IPs) along with a set of root certs to trust for that cluster. In this case a "client" includes other server components, node components (kubelet/kube-proxy) and kubectl/kubeadm. Ideally the client code would periodically ask the API server for an updated bag (version numbers?) and cache those results. The only way things would really get screwed up is if (a) the set of API servers turns over completely or (b) the root certs rotate out in between client pings.

Obviously gossip is one way to distribute this bag, but it may not be really necessary as we do already have a strongly consistent control plane.
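A minimal sketch of what this JWT exchange could look like, assuming HS256 compact JWS; the ClusterParams fields and function names are illustrative, not an agreed API.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// ClusterParams is the "bag of data" a client needs to talk to the
// cluster: several ways to reach it, plus the root certs to trust.
type ClusterParams struct {
	Endpoints []string `json:"endpoints"` // DNS names, internal/external IPs
	RootCAs   []string `json:"rootCAs"`   // PEM-encoded root certificates
}

// sign builds a compact HS256 JWS over the claims, keyed by the short
// shared secret. Both the node's request and the master's response
// would be protected this way.
func sign(claims any, secret []byte) (string, error) {
	enc := base64.RawURLEncoding
	body, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	payload := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`)) +
		"." + enc.EncodeToString(body)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return payload + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

// verify recomputes the HMAC over header.payload and compares it in
// constant time; only holders of the shared secret pass.
func verify(token string, secret []byte) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(token[:i]))
	want, err := base64.RawURLEncoding.DecodeString(token[i+1:])
	if err != nil {
		return false
	}
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	secret := []byte("0f3a2b") // the short bootstrap token
	params := ClusterParams{
		Endpoints: []string{"https://10.0.0.1:6443"},
		RootCAs:   []string{"-----BEGIN CERTIFICATE-----..."},
	}
	token, _ := sign(params, secret)
	fmt.Println(verify(token, secret)) // true
}
```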
I think the point here is that we don't yet have a control plane set up, and we're trying to bootstrap that control plane in a simple and user-friendly way. Writing a tiny amount of code which uses an existing well-tested gossip library to securely distribute the cacert and api server URLs to the peers seems sensible. Having the gossip network available will also dramatically simplify bootstrapping multi-master setups in such a way that the IP used to join the nodes to the masters isn't "special" (although that's not in scope for Phase I), I think.
This sounds like what the gossip library is already good at: tracking a changing set of nodes without requiring quorum or careful operational management. We can get into this more in Phase II when we get to changing/adding/removing masters in multi-master, but IMO it's the "right tool for the job".
#### Masters
Initializing a new master shouldn't require TLS bootstrap, because the master already has privileged access to the API server. In the mesh case, new masters have to add their own address to the API server URL list.
A new master needs TLS serving certs. Is this assuming the master will generate self-signed ones?
Yes, in this scheme the first master would generate the CA cert and put its public key out on the gossip network so that joining nodes can trust it and know they're talking to the real API server.
Advanced users may want to specify this cert, and we shouldn't stop them. In fact, the "out of band" option will allow them to do exactly this.
We may also want to allow the user to provide a DNS name for the API server, and to have that added to the initial cert as a subjectAltName, etc. The default "I know nothing" assumption would be to put the IP address of the first master into gossip as the initial API server address.
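For illustration, here is a sketch of that first-master cert generation using Go's crypto/x509. It collapses the CA and serving cert into one self-signed certificate for brevity (a real setup would issue a separate serving cert from the CA), and the IP and DNS values are placeholders.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"net"
	"os"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}

	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kubernetes-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		// The default "I know nothing" assumption: the first master's
		// IP goes in as a SAN. A user-supplied DNS name would be
		// appended to DNSNames instead.
		IPAddresses: []net.IP{net.ParseIP("10.0.0.1")},
		DNSNames:    []string{}, // e.g. "api.example.com" if provided
	}

	// Self-signed: the template acts as both subject and issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
}
```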
The part I am missing (though some of the other questions here resonate for me) is how you end up with a "type 2" node at all? How do these gossip peers find each other?
Sorry that this wasn't clearer. From @mikedanese's definitions:
In my mind a "type 2 node" is one in the "gossiping" state in the proposal – it helps to look at the rendered markdown view where you can see the diagrams. A type 1 node is one that's progressed to the "bootstrapping" state.
A user ends up with a type 2 node by typing the commands proposed in #30360 (simplified slightly for legibility). Assuming the master is on 10.0.0.1, the user runs `kubeadm init` on the master, which prints a short shared token, and then runs `kubeadm join --token <token> 10.0.0.1` on each node (the exact command names and flags are the ones being settled in #30360).

(Under the hood, there is auto-signing of CSRs going on here: the node gets the CA cert and API server address from gossip, then performs TLS bootstrap. Let's assume that in future users would be able to turn off auto-signing, in which case accepting the node in would involve the more secure ….)

To expand on this: when the gossip protocol is running on all the kubelets, you can join any server into the cluster just by specifying the (short) shared secret and the address of one other machine in the cluster. It doesn't have to be the IP of some master that was originally set up months ago and has since been destroyed: the gossip library handles keeping track of peers (and I'd propose that it continues to do that for the group of masters when we go multi-master, so that nodes can always discover a working set of API servers on startup).

Immediately, in the single-master kick-the-tires case, gossip saves the user from copying certs to machines. But what's critical in the next few days is that we push ahead with a discovery mechanism which gets us to the UX described in #30360.
Correct me if I'm wrong here -- gossip really excels when you have a set of nodes with dynamic membership where there is no explicit leader. That isn't the case here. From the very start (after the first …) we need (1) a way to track the current set of API servers and (2) a way for clients to discover that set.

Since we have etcd behind the scenes, (1) is actually pretty easy. The API servers register with it. As for (2), we can have an explicit API call where the clients can ask for and cache the current set of API servers. That, along with some sort of ringdown (if one API server doesn't answer, try another), means the clients can find the cluster as long as one of the API servers that they knew about last time is still around.

While we could have the clients of the API server (pretty much all other actors in the k8s world, including …) …

While it looks like gossip may provide some value here, I suspect it may be overkill. And it introduces new failure modes and dependencies. Am I missing something big here?
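A sketch of the ringdown part, under the assumptions above (function names are hypothetical, and the TLS configuration that would trust the cluster CA is omitted):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// connect implements the "ringdown": walk the cached list of API
// servers and use the first one that answers. As long as at least one
// server from the last refresh is still alive, the client finds the
// cluster.
func connect(cached []string) (string, error) {
	// (TLS config trusting the cluster CA omitted for brevity.)
	client := &http.Client{Timeout: 3 * time.Second}
	for _, url := range cached {
		resp, err := client.Get(url + "/healthz")
		if err != nil {
			continue // dead or unreachable: ring down to the next one
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return url, nil
		}
	}
	return "", errors.New("no cached API server is reachable")
}

func main() {
	// In the scheme described above, the client would then ask the live
	// server for the *current* set of API servers and re-cache it (the
	// explicit API call); that step is elided here.
	url, err := connect([]string{"https://10.0.0.1:6443", "https://10.0.0.2:6443"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using API server:", url)
}
```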
Think of our proposal for adding gossip, at least for the single-master case we're trying to land in the next 8 days, just as a replacement for the discovery service concept in #28422 – critically, one which doesn't rely on an external discovery service. We've heard that users wouldn't like that: they don't want to leave the firewall, they don't want their cluster to rely on a third-party service which might go down, and they don't want to turn up an auxiliary service just to start another service (Kubernetes).

All that gossip is really doing in this case is turning a long CA cert in a file (and the API server URL) into a short gossip token on the command line, but this makes the UX much, much better. That in turn enables easy, secure distribution of that CA cert to the nodes so that they can kick off TLS bootstrapping.

The ringdown stuff makes sense for multi-master, sure. Given that the output of the gossip conversation would be a list of API servers, we'd need to do something like that even if we use gossip for discovery.
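To illustrate the "short token on the command line" point: the general technique (not necessarily what weaveworks/mesh does internally) is to stretch the short shared token into a strong symmetric key that both sides derive independently, for example with PBKDF2. The salt and iteration count here are arbitrary.

```go
package main

import (
	"crypto/sha256"
	"fmt"

	"golang.org/x/crypto/pbkdf2"
)

func main() {
	// The short token the user copies and pastes on the command line.
	token := []byte("0f3a2b")

	// Both sides must use the same salt; a fixed, cluster-agreed value
	// is assumed here purely for illustration.
	salt := []byte("kube-bootstrap-v1")

	// Stretch the short token into a 32-byte symmetric key. Anyone
	// holding the token derives the same key, so the CA cert and API
	// server URLs can travel over a channel encrypted with it.
	key := pbkdf2.Key(token, salt, 4096, 32, sha256.New)
	fmt.Printf("derived key: %x\n", key)
}
```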
We're not suggesting this :) Once gossip is used to distribute the cacert for TLS bootstrapping, the rest will be normal kube APIs.
What benefits does gossip provide if we already need to give a master IP to the other nodes? Why can't the master node distribute the information, by either pushing it to the contacting nodes or having the nodes pull directly from the given master? Why do we need gossip to distribute information?
@xiang90 …
@lukemarsden So we use a gossip library but don't use its gossip functionality? What we actually want is the secure transport between the announcing node and its receiver? Can we make this clear in the proposal?
@xiang90 True, the desired behavior could be achieved with secure point-to-point (request/response) functionality for the single-master case. The gossip functionality will become more useful in the multi-master case, I think, as I was discussing with @mikedanese earlier.
I wrote up an alternate/additional approach based on JWS -- please take a look and let me know what you think. If it resonates I'm happy to markdownify and submit a PR. https://docs.google.com/document/d/1GVMLTBrEH5kXGxo0fWTqiRxG4dSxQQzyUGLA7PxhzRY/edit#heading=h.i97w20td4jrk
Alternatively, what's the risk with gossip? Yes, it does way more than we need, but it's super well understood, there are lots of libraries out there, and it gives us optionality in the future (e.g. if the master goes away, we could use it to elect a new master). Is it just that it's more than we need now?
Had a discussion offline - I withdraw my previous comment.
Can you elucidate? I don't feel like I am well-equipped to derive it from …
@lukemarsden if we assert that the short token is secure enough to talk to …
I read over @jbeda's doc and while I am the last person you want doing …
@thockin @aronchick @jbeda @mikedanese @philips Thanks everyone for the input on this! @jbeda, we really like your JWT solution. It seems simpler and more consistent with the rest of Kubernetes, and it satisfies the requirements. I still think gossip + CRDTs can have value for multi-master and federated setups, but multi-master isn't what we're trying to get to in this first iteration of the new UX, which we all seem to agree on in #30360. So, @errordeveloper and I are busy coding. What we're doing is:
In order to get this all working by the end of next week we will need some help! Hope this is helpful. Happy weekend folks!
Sounds like we have a plan. I'll convert my google doc to a markdown proposal some time over the weekend.
It is very unlikely that we will get an API change of this magnitude merged into v1.4. And I know that prototyping quickly is an important goal of this effort. But I think that this workflow can be accomplished without any changes to k8s:
Self-hosted:
Static Manifest:
This next step requires #30090 cc @mtaufen
Thanks to @aaronlevy for whiteboarding this with me.
Thanks @philips and @aaronlevy! This looks great. @jbeda what do you think? Our WIP efforts to implement … Can we collaborate early next week to converge these ideas and make this happen?
## Motivation
As part of the dramatically simplified cluster creation UX described in the proposal linked above, it is desirable to have a discovery mechanism that enables the desired UX without depending on an external network service, such as a discovery service, so that we don't have to operate one and users don't have to leave their firewall to provision a cluster.
Maybe I missed the discussion around this, but I'm not convinced that it's valuable to avoid having a discovery meet-up point (by default) to simplify installation.
The discovery proposal was a way to simplify the UX for folks that didn't want to scp around a cert and type in multiple arguments. But the intent was not to preclude that case either. In a scenario where someone doesn't want to rely on an external service or to run their own bootstrap endpoint, it seems like they might be ok typing a few extra arguments on each node to set up a private cluster. For larger deployments, you'd just run your own bootstrap server (since it's just a docker container and it should be easy to run).
@philips @aaronlevy are you sure that #30090 is required for the crashloop plan to work? We think we can make this work already by having …
@roberthbailey It seems like we are starting to gel on the JWS based approach. I need to put that in a PR as a markdown file. I'll try to find time today to get that going. I'd like to make it compatible with what you've been thinking around a bootstrap API but also allow for the API server itself to satisfy this for a lighter weight experience.
Looks like this proposal is going to get overtaken by the new plan at https://github.com/lukemarsden/kubernetes/blob/proposal-dramatically-simplify-cluster-creation/docs/proposals/dramatically-simplify-cluster-creation.md#kubeadm-implementation-plan, so closing. Thanks!
Promoted from draft here: https://docs.google.com/document/d/1IG9zK88C4Omn9XEURRvw98utZfIw7UshrRNg9U5ip40/edit