
Decision: Standardize on Rudder/Flannel for k8s networking? #1307

Closed
jbeda opened this issue Sep 13, 2014 · 39 comments
Labels
kind/design Categorizes issue or PR as related to design. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@jbeda
Contributor

jbeda commented Sep 13, 2014

Rudder could be built into a separable layer that accommodates various environments and uses various techniques to make IP-per-pod a reality.

This is forked from the discussion in #1059.

cc: @thockin, @eyakubovich


Details on plumbing GCE advanced routing into Rudder:

From @eyakubovich:

There's a PR in Rudder that adds an allocation mode (should get merged soon). It adds the ability to allocate a subnet, but it doesn't do anything to the data packets. It just writes the allocation to /run/rudder/subnet.env and signals systemd that it has started. Another dependency could wait on that and issue the GCE command to set up its advanced routing. I would also be happy to create a GCE mode that would do that inside Rudder.
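
As an illustration of that "another dependency" step, here is a minimal sketch that reads the allocation out of /run/rudder/subnet.env. The SUBNET key name and the KEY=VALUE file format are assumptions for illustration, not something Rudder specifies in this thread:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// readKey parses a KEY=VALUE style file (e.g. /run/rudder/subnet.env)
// and returns the value for the given key.
func readKey(path, key string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, key+"=") {
			return strings.TrimPrefix(line, key+"="), nil
		}
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("%s not found in %s", key, path)
}

func main() {
	// "SUBNET" is an assumed key name; check what Rudder actually writes.
	cidr, err := readKey("/run/rudder/subnet.env", "SUBNET")
	if err != nil {
		log.Fatal(err)
	}
	// Hand the allocated CIDR to whatever issues the GCE routing command.
	fmt.Println("allocated subnet:", cidr)
}
```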

I actually think that for GCE it would be a little more complicated. Configuring routes in the GCE API would require permission to call out to the API. I think that we need a "reconciler" that takes the IP ranges assigned to a node and mirrors that into the routes in the GCE API.

A quick sketch of what this might look like:

  1. A new node is started in GCE and rudder is installed. A CIDR (/24) is assigned to that node via the "try and allocate in etcd" method.
  2. Rudder knows that we are doing "simple" networking and no encapsulation is involved. Instead, we ensure that all traffic to the wider IP space (e.g. 10.0.0.0/8) is put on the wire directly. (Stuff outside that space would be NATed through the DHCP-assigned host IP.)
  3. A centralized reconciler (running someplace with permission to talk to the GCE API) would pick up the fact that this /24 was assigned to a specific node. It would in turn call the GCE API to ensure that all traffic to that /24 was forwarded to that node.
  4. Furthermore, if a node goes away, this centralized reconciler will know to remove that route.

I think that this model might be necessary for other places where network gear needs to be programmed based on dynamic allocations made by rudder.
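
A rough sketch of that reconciler, purely illustrative: the RouteClient interface below is hypothetical and stands in for whatever programs the routes (the real GCE API client is not shown here). It converges the programmed routes toward the node-to-CIDR allocations made by rudder.

```go
package reconciler

import "log"

// RouteClient abstracts whatever actually programs routes (e.g. the GCE API).
// This interface is hypothetical, purely for illustration.
type RouteClient interface {
	ListRoutes() (map[string]string, error) // node -> CIDR currently programmed
	CreateRoute(node, cidr string) error
	DeleteRoute(node string) error
}

// Reconcile mirrors the desired node->CIDR allocations (as assigned by
// rudder) into the routing layer: it adds missing routes and removes
// routes for nodes that have gone away.
func Reconcile(client RouteClient, desired map[string]string) error {
	actual, err := client.ListRoutes()
	if err != nil {
		return err
	}
	// Ensure every allocation has a matching route.
	for node, cidr := range desired {
		if actual[node] != cidr {
			log.Printf("ensuring route %s -> %s", cidr, node)
			if err := client.CreateRoute(node, cidr); err != nil {
				return err
			}
		}
	}
	// Remove routes for nodes that no longer have an allocation.
	for node := range actual {
		if _, ok := desired[node]; !ok {
			log.Printf("removing stale route for %s", node)
			if err := client.DeleteRoute(node); err != nil {
				return err
			}
		}
	}
	return nil
}
```

Run periodically (or on watch events from etcd), this is the "centralized reconciler" from step 3 above.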

@jbeda
Contributor Author

jbeda commented Sep 13, 2014

Forgot to add -- does this sound like a good idea? I think this is compatible with where Rudder might go, but I'm not 100% sure. We want to make sure that, as we do things like IP-per-service and figure out access to external IPs, we can make it work with or be built into rudder. I also like that rudder has independent utility outside of Kubernetes.

cc: @brendandburns

@smarterclayton
Contributor

@rajatchopra can you speak to how OVS setups and things like OpenDaylight would fit here? In cases where you have an external network controller or IaaS - does Rudder add anything?

@rajatchopra
Contributor

In cases where you have an external network controller or IaaS - does Rudder add anything?
Rudder's allocation mode will provide the book-keeping of layer 3 leases. That is useful no matter what we use to actually do the networking. But at that point, given @jbeda's suggestion on GCE+Rudder as well, we should probably stop calling Rudder a networking solution.

How does OVS fit here?
OVS would be one of the ways to do switching (and encapsulation). tuntap devices are doing the same thing (in userspace, so relatively poorly). Ideally, the switch itself should support various appropriate pluggable encapsulations and suitable hardware acceleration (custom/vxlan/nvgre/intel-dptp..). Plus, it should be mass programmable (e.g. use opendaylight to control ovs).

I believe it is important that we keep an eye on flow-based switching. Multicast will not survive at scale, and future needs of kubernetes may include strict tenant isolation.

Will Rudder evolve to manage mac-addresses, ip-addresses, vlan tags, vnid per tenant, and control the switching based on all above?

@bgrant0607 bgrant0607 added kind/design Categorizes issue or PR as related to design. question sig/network Categorizes an issue or PR as relevant to SIG Network. labels Sep 16, 2014
@mrunalp
Contributor

mrunalp commented Sep 17, 2014

Echoing @rajatchopra -- OVS/vxlan already solves the problems that rudder intends to solve eventually. A big advantage that OVS has is using flows to program the network to one's needs. Providing the entire IP space to each tenant is already possible using flows.

@jbeda
Contributor Author

jbeda commented Sep 17, 2014

@mrunalp We need to support various environments. There are some environments (such as GCE) where you don't need to do any encap at all and it is about assigning subnets and configuring cloud control systems.

Things I like about rudder:

  1. The current implementation can work anywhere UDP can work.
  2. It can adapt to have different strategies for implementing a consistent networking model.
  3. It doesn't wrap docker but instead sets things up and gets out of the way. Anything that has to reach in and muck with the network namespace is going to be harder to integrate.

@jbeda
Contributor Author

jbeda commented Sep 18, 2014

Note that rudder is now called flannel.

@jbeda
Contributor Author

jbeda commented Sep 18, 2014

From various IRC conversations, I wanted to snapshot the different strategies that folks have taken to get networking for k8s up and running:

It'd be great if this were layered well instead of being all ad hoc.

@mrunalp
Contributor

mrunalp commented Sep 19, 2014

That is a pretty good list. It shows that different use cases will require different networking setups/topologies. So, if one of them is chosen as the default, it would be good if the APIs still allowed for using other advanced networking solutions.

Use cases that VxLan with OVS/OpenFlow helps solve are:

  • IP per pod with no isolation
  • IP per pod but isolation among groups of pods (logical application)
  • Full IP space available to all logical applications

@smarterclayton
Contributor

Well layered is definitely important - I think default isn't an issue; it's just that the abstraction from Kube -> registration -> rudder / other -> node agent should be very clear.

@jbeda
Contributor Author

jbeda commented Sep 19, 2014

Here is how I think about it. For a network solution to work, we need the following:

  1. Each pod gets an IP address that is unique across the cluster.
  2. Each pod IP is reachable from any other pod and from host machines.
  3. Host machines are reachable from each pod.

As long as this is met, kubernetes will work just fine. If Flannel can encapsulate making that happen it is a good fit.

/cc @kelseyhightower

@jbeda jbeda changed the title Decision: Standardize on Rudder for k8s networking? Decision: Standardize on Rudder/Flannel for k8s networking? Sep 19, 2014
@smarterclayton
Contributor

I don't have any objections at the current level of detail; I would like to see an in-depth proposal to have an opportunity to understand what it means.

@jbeda
Contributor Author

jbeda commented Sep 22, 2014

@smarterclayton What types of networking do you think are a "must have" before we can switch over? Would you be happy, for example, if the vagrant setup switched over to using the UDP encap instead of GRE/OVS?

@smarterclayton
Contributor

I'm actually less concerned about the types of networking - instead, it's mostly that real networks and configurations are some of the most heavily opinionated and predefined aspects of deployments. I can imagine an almost infinite amount of flexibility that people will want in real deployments, but I don't think that Kubernetes has to solve that out of the box, nor that Flannel has to solve that out of the box.

Instead, the list of requirements you gave above is what I consider a good first step, and an excellent place for Flannel to be deployed and integrated with Kubernetes. My concern is solely that an integrator should be able to tie concepts and their own network topologies into Kubernetes (because they have vendor X network solution that can provide this). So I would want clean abstractions between "schedule this pod on this host" and "I have to schedule, then make a call to Flannel, etc". I don't think you're proposing the latter, but I did want to understand a bit more about what the integration with Flannel would look like.

@jbeda
Contributor Author

jbeda commented Sep 23, 2014

How about this -- we make Flannel the default mechanism for configuring the network for a Kubernetes cluster, but we don't make it a hard requirement. If there is some feature that requires us to reach out and configure the networking layer, we'd make it pluggable, similar to the cloud provider stuff.

But by working with Flannel hand-in-hand we'd guarantee we have something that works (perhaps not optimally) out of the box with a minimum of futzing.

@smarterclayton
Contributor

That works for me

EDIT: That being said, we'd like to be able to use the vagrant environment as testbeds for alternative networking even if the default ootb is a "just working" Flannel config

@thockin
Member

thockin commented Sep 23, 2014

I'm OK with that once Flannel demonstrates the ability to run in native GCE mode?


@jbeda
Contributor Author

jbeda commented Sep 23, 2014

@kelseyhightower is way ahead of you Tim: https://github.com/kelseyhightower/flannel-route-manager

My gut is that we should merge the route manager stuff into core flannel. I think it does a great job of showing how flannel is a general network config framework instead of a specific implementation (encap over UDP).

@jbeda
Contributor Author

jbeda commented Sep 23, 2014

@smarterclayton One thing I'd love to do is to simplify/minimize the salt config. Right now it is a rat's nest where we conflate cloud, host OS, and network strategy. Are you cool if the salt configs assume flannel (with the ability to parameterize the flannel config)?

@thockin
Member

thockin commented Sep 23, 2014

I think I am OK with this. Will this cause OVS people to get in a huff, or will they be able to make OVS run under rudder cleanly?


@jbeda
Contributor Author

jbeda commented Sep 23, 2014

I'd see flannel as being a simple driver for OVS. We need to close the loop there though.

@mrunalp
Contributor

mrunalp commented Sep 23, 2014

There are two separate problems that flannel solves (at least from what I've seen/read in their readme).

  1. IP assignment/leases.
  2. Network overlay using UDP encapsulation.

If Kubernetes/flannel were to allow plugins for both these features, then the use cases that we outlined above could be solved. There could be a custom IP assignment plugin that leases out the same IPs to different tenants if necessary, and an OVS plugin with either GRE or VxLan could be used to create the overlay network.
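
Purely as an illustration of where those two plugin seams might sit (these interfaces are hypothetical sketches, not anything flannel or Kubernetes defines), something like the following could separate leasing from the data path:

```go
package network

import "net"

// Lease records a subnet handed out to a particular host.
type Lease struct {
	HostIP net.IP
	Subnet *net.IPNet
}

// IPAllocator is the IP assignment/lease seam: an etcd-backed allocator,
// a cloud IPAM service, or a per-tenant allocator could implement it.
type IPAllocator interface {
	AcquireLease(hostIP net.IP) (*Lease, error)
	RenewLease(l *Lease) error
	ReleaseLease(l *Lease) error
}

// Backend is the data-path seam: UDP encapsulation, OVS with GRE or VXLAN,
// or plain cloud routes could each be a Backend.
type Backend interface {
	// Init programs the local host so traffic to other leases is delivered
	// (set up a tun device, OVS flows, routes, etc.).
	Init(local *Lease) error
	// AddPeer/RemovePeer react to other hosts' leases appearing or going away.
	AddPeer(peer *Lease) error
	RemovePeer(peer *Lease) error
}
```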

@smarterclayton
Contributor

I appreciate that the salt config is complex, but the primary purpose of the vagrant env, at least, was to be able to test and develop Kubernetes in a range of configs. If there's no non-rudder path, that complicates how folks integrate and test against other overlay or network configs.

I really would like to preserve the ability for people to hack on the harder bits of a kube setup in a controlled way - if it's an attention / time thing we can work harder to ensure that path stays working, or potentially we can limit it to the Ansible code path (assuming that was in tree and runnable against vagrant).

I like opinionated choices that make the experience better for people trying out kube, and that support general use on a wide range of topologies (which flannel does well and the vagrant gre/ovs setup does not). I guess I'm not convinced yet that some important deployment modes for kube won't have to deal with other SDN solutions, and that we won't want test beds for them in tree (like the iaas providers).


@jbeda
Contributor Author

jbeda commented Sep 23, 2014

@mrunalp I'll leave it to @kelseyhightower and @eyakubovich to speak to pluggability on the IP assignment side.

@smarterclayton I hear you -- but right now the Vagrant setup is pretty hard-coded to the OVS/GRE strategy. The salt stuff is very brittle in general. I'm cool with writing the salt configs so that we can squeeze in other non-flannel strategies, but I'd like to keep it as isolated as possible. How about we argue about this some more as the PR comes up for review and we have specifics?

@smarterclayton
Contributor

Sure, that works for me.

@eyakubovich

@mrunalp I wasn't planning on making those strategies pluggable in the sense of having 3rd-party plugins. In the case of flannel doing UDP encap, it would require the plugin to communicate IP mappings in real time.

We're working on VXLAN (without OVS) and I don't see much advantage of using GRE (except slightly smaller space overhead).

@jbeda I would like to merge flannel-route-manager into flannel as well.

@jbeda
Contributor Author

jbeda commented Sep 24, 2014

@eyakubovich The only reason that I think users might prefer GRE over UDP is that any networking gear that is monitoring what is happening will be able to classify and monitor the GRE traffic separately from UDP. I think generally UDP is seen as "application" traffic and GRE is seen as "network infrastructure" traffic. In addition, network capture tools and the like know how to crack GRE and reconstruct what is going on inside the encap'd stream.

Personally I like UDP encap as it is supported pretty much everywhere but I can understand why others would like GRE.

@smarterclayton
Contributor

As a note for people coming to this thread, there are other discussions in progress around how OVS/ODL could integrate into Kubernetes or Flannel. I'll make sure that there is an issue linked here to discuss that separately.

@eparis
Contributor

eparis commented Oct 1, 2014

It seems to me that using flannel (as it is today) is going to mean that etcd has to be exposed to the minion, while I thought a goal was to remove the etcd access requirement from the kubelet/minion. Was I mistaken that that was a desired architecture redesign -- to make the apiserver the only thing that communicates with etcd, rather than each and every node?

@jbeda
Contributor Author

jbeda commented Oct 1, 2014

@eparis That is a good point. There is no reason that k8s and flannel have to hit the same etcd. It also doesn't break the conceptual model if flannel requires etcd but k8s hides it. But as a practical matter it is crazy to run 2 etcd instances.

Perhaps this is an impetus for the CoreOS guys to either (a) support some level of ACLing in etcd or (b) abstract out the etcd API such that it can be implemented by a domain-specific proxy (and perhaps other stores).

@smarterclayton
Contributor

> @eparis That is a good point. There is no reason that k8s and flannel have to hit the same etcd. It also doesn't break the conceptual model if flannel requires etcd but k8s hides it. But as a practical matter it is crazy to run 2 etcd instances.

And at small scales, I don't think it's a huge issue to talk directly to etcd from the minions. The bigger your cluster gets, and the more diverse your workloads become, the more important it becomes to introduce a separating abstraction between minions and the central data store.

> Perhaps this is an impetus for the CoreOS guys to either (a) support some level of ACLing in etcd or (b) abstract out the etcd API such that it can be implemented by a domain-specific proxy (and perhaps other stores).

I think both of those are valuable. The second could just be for the read API (the flannel component could still write/read directly). Another option (c) could be to invest in making the etcd client work with transparent proxies (so the apiserver could proxy /api/v1beta1/some/arbitrary/path directly to etcd as /some/prefix/some/arbitrary/path via path rewriting or something). We've actually been interested in that in general, as a way of offering transparent etcd-as-a-service as a resource to consumers of Kubernetes.
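
As a sketch of option (c) only: the external path, key prefix, and ports below are invented for illustration, and this is plain net/http rather than anything the apiserver actually does. It shows path rewriting in front of an etcd endpoint:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical etcd client endpoint; adjust to your deployment.
	etcd, err := url.Parse("http://127.0.0.1:4001")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(etcd)

	// Rewrite /api/v1beta1/etcd/<anything> to /v2/keys/some/prefix/<anything>.
	// Both the external path and the key prefix are made up for this sketch.
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1beta1/etcd/", func(w http.ResponseWriter, r *http.Request) {
		rest := strings.TrimPrefix(r.URL.Path, "/api/v1beta1/etcd/")
		r.URL.Path = "/v2/keys/some/prefix/" + rest
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

Something shaped like this, run next to the apiserver, is one way minions could reach a namespaced slice of etcd without being handed the real endpoint.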

@eyakubovich

cc: @philips

@bgrant0607 bgrant0607 added this to the v1.0 milestone Oct 4, 2014
@bgrant0607
Member

/cc @pietern re. OVS

@philips
Contributor

philips commented Oct 30, 2014

@jbeda @mrunalp GUE is another option here for encap (flannel-io/flannel#64). And I think it is an interesting one because it keeps everything L3 and in-Kernel.

@smarterclayton @jbeda On the ACL topic, it is something we want to do, and it would be great if someone could help define an API and implement it as a proxy. Taking that work and bringing it into etcd itself should be straightforward after that. In etcd 0.5.0 (currently in alpha) we have a proxy package for implementing simple etcd proxies now. Putting an ACL thing in front should be straightforward. Here is a simple proxy example built on the package that filters out certain key prefixes and HTTP verbs.
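
The example linked above doesn't survive in this extract. As a rough stand-in (plain net/http rather than the etcd proxy package, with the allowed key prefix and verb set invented for illustration), an ACL-style filter in front of etcd might look like:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// allowed permits only read-style verbs on a whitelisted key prefix.
// The prefix and verbs here are assumptions for the sketch.
func allowed(r *http.Request) bool {
	if r.Method != http.MethodGet && r.Method != http.MethodHead {
		return false
	}
	return strings.HasPrefix(r.URL.Path, "/v2/keys/coreos.com/network/")
}

func main() {
	etcd, err := url.Parse("http://127.0.0.1:4001") // hypothetical etcd endpoint
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(etcd)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !allowed(r) {
			http.Error(w, "forbidden by proxy ACL", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```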

@mrunalp
Contributor

mrunalp commented Oct 30, 2014

@philips That looks interesting. I will check it out. Thanks.

@bgrant0607 bgrant0607 removed this from the v1.0 milestone Dec 3, 2014
@bgrant0607 bgrant0607 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Dec 4, 2014
@bgrant0607
Member

For reference, OpenStack's IP management API: http://docs.openstack.org/api/openstack-network/2.0/content/Overview-d1e71.html

@bgrant0607
Member

@thockin thockin added kind/support Categorizes issue or PR as a support question. and removed kind/support-question labels Jan 7, 2015
@goltermann goltermann added priority/support and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jan 7, 2015
@roberthbailey roberthbailey removed the kind/support Categorizes issue or PR as a support question. label Feb 12, 2015
@errordeveloper
Member

I believe that the original discussion here is mostly overtaken by the plugins introduced with #5069. The only question that still remains is whether there needs to be a default plugin or not, but that's a separate discussion, I believe, as the context has changed with the introduction of plugins.

@kamalmarhubi
Contributor

Given @errordeveloper's comment above, should this issue be closed? I came across it and the current state is slightly ambiguous.

@bgrant0607
Member

Yes, thanks.

sjenning pushed a commit to sjenning/kubernetes that referenced this issue Aug 30, 2022
UPSTREAM: 110039: Add readinessProbe to aggregated api service test