
Decision: Standardize on Rudder/Flannel for k8s networking? #1307

Closed
jbeda opened this issue Sep 13, 2014 · 39 comments
Labels
kind/design Categorizes issue or PR as related to design. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@jbeda
Contributor

jbeda commented Sep 13, 2014

Rudder could be built into a separable layer that accommodates various environments and uses various techniques to make IP-per-pod a reality.

This is forked from the discussion in #1059.

cc: @thockin, @eyakubovich


Details on plumbing GCE advanced routing into Rudder:

From @eyakubovich:

There's a PR in Rudder that adds an allocation mode (should get merged soon). It adds the ability to allocate a subnet, but it doesn't do anything to the data packets. It just writes the allocation to /run/rudder/subnet.env and signals systemd that it has started. Another dependency could wait on that and issue the GCE command to set up its advanced routing. I would also be happy to create a GCE mode that would do that inside Rudder.
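
As an illustration of that "another dependency" step, here is a minimal sketch that reads the allocation out of /run/rudder/subnet.env. The SUBNET key name and the KEY=VALUE file format are assumptions for illustration, not something Rudder specifies in this thread:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// readKey parses a KEY=VALUE style file (e.g. /run/rudder/subnet.env)
// and returns the value for the given key.
func readKey(path, key string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, key+"=") {
			return strings.TrimPrefix(line, key+"="), nil
		}
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("%s not found in %s", key, path)
}

func main() {
	// "SUBNET" is an assumed key name; check what Rudder actually writes.
	cidr, err := readKey("/run/rudder/subnet.env", "SUBNET")
	if err != nil {
		log.Fatal(err)
	}
	// Hand the allocated CIDR to whatever issues the GCE routing command.
	fmt.Println("allocated subnet:", cidr)
}
```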

I actually think that for GCE it would be a little more complicated. Configuring routes in the GCE API would require permission to call out to the API. I think that we need a "reconciler" that takes the IP ranges assigned to a node and mirrors that into the routes in the GCE API.

A quick sketch of what this might look like:

  1. A new node is started in GCE and rudder is installed. A CIDR (/24) is assigned to that node via the "try and allocate in etcd" method.
  2. Rudder knows that we are doing "simple" networking and no encapsulation is involved. Instead, we ensure that all traffic to the wider IP space (e.g. 10.0.0.0/8) is put on the wire directly. (Stuff outside that space would be NATed through the DHCP-assigned host IP.)
  3. A centralized reconciler (running someplace with permission to talk to the GCE API) would pick up the fact that this /24 was assigned to a specific node. It would in turn call the GCE API to ensure that all traffic to that /24 was forwarded to that node.
  4. Furthermore, if a node goes away, this centralized reconciler will know to remove that route.

I think that this model might be necessary for other places where network gear needs to be programmed based on dynamic allocations made by rudder.
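
A rough sketch of that reconciler, purely illustrative: the RouteClient interface below is hypothetical and stands in for whatever programs the routes (the real GCE API client is not shown here). It converges the programmed routes toward the node-to-CIDR allocations made by rudder.

```go
package reconciler

import "log"

// RouteClient abstracts whatever actually programs routes (e.g. the GCE API).
// This interface is hypothetical, purely for illustration.
type RouteClient interface {
	ListRoutes() (map[string]string, error) // node -> CIDR currently programmed
	CreateRoute(node, cidr string) error
	DeleteRoute(node string) error
}

// Reconcile mirrors the desired node->CIDR allocations (as assigned by
// rudder) into the routing layer: it adds missing routes and removes
// routes for nodes that have gone away.
func Reconcile(client RouteClient, desired map[string]string) error {
	actual, err := client.ListRoutes()
	if err != nil {
		return err
	}
	// Ensure every allocation has a matching route.
	for node, cidr := range desired {
		if actual[node] != cidr {
			log.Printf("ensuring route %s -> %s", cidr, node)
			if err := client.CreateRoute(node, cidr); err != nil {
				return err
			}
		}
	}
	// Remove routes for nodes that no longer have an allocation.
	for node := range actual {
		if _, ok := desired[node]; !ok {
			log.Printf("removing stale route for %s", node)
			if err := client.DeleteRoute(node); err != nil {
				return err
			}
		}
	}
	return nil
}
```

Run periodically (or on watch events from etcd), this is the "centralized reconciler" from step 3 above.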

@jbeda
Contributor Author

jbeda commented Sep 13, 2014

Forgot to add -- does this sound like a good idea? I think this is compatible with where Rudder might go, but I'm not 100% sure. We want to make sure that, as we do things like IP-per-service and figure out access to external IPs, we can make it work with or be built into rudder. I also like that rudder has independent utility outside of Kubernetes.

cc: @brendandburns

@smarterclayton
Contributor

@rajatchopra can you speak to how OVS setups and things like OpenDaylight would fit here? In cases where you have an external network controller or IaaS - does Rudder add anything?

@rajatchopra
Contributor

In cases where you have an external network controller or IaaS - does Rudder add anything?
Rudder's allocation mode will provide the book-keeping of layer 3 leases. That is useful no matter what we use to actually do the networking. But at that point, given @jbeda's suggestion on GCE+Rudder as well, we should probably stop calling Rudder a networking solution.

How does OVS fit here?
OVS would be one of the ways to do switching (and encapsulation). tuntap devices are doing the same thing (in userspace, so relatively poorly). Ideally, the switch itself should support various appropriate pluggable encapsulations and suitable hardware acceleration (custom/vxlan/nvgre/intel-dptp..). Plus, it should be mass programmable (e.g. use opendaylight to control ovs).

I believe it is important that we keep an eye on flow-based switching. Multicast will not survive at scale, and future needs of kubernetes may include strict tenant isolation.

Will Rudder evolve to manage mac-addresses, ip-addresses, vlan tags, vnid per tenant, and control the switching based on all above?

@bgrant0607 bgrant0607 added kind/design Categorizes issue or PR as related to design. question sig/network Categorizes an issue or PR as relevant to SIG Network. labels Sep 16, 2014
@mrunalp
Contributor

mrunalp commented Sep 17, 2014

Echoing @rajatchopra -- OVS/vxlan already solves the problems that rudder intends to solve eventually. A big advantage that OVS has is using flows to program the network to one's needs. Providing the entire IP space to each tenant is already possible using flows.

@jbeda
Contributor Author

jbeda commented Sep 17, 2014

@mrunalp We need to support various environments. There are some environments (such as GCE) where you don't need to do any encap at all and it is about assigning subnets and configuring cloud control systems.

Things I like about rudder:

  1. The current implementation can work anywhere UDP can work.
  2. It can adapt to have different strategies for implementing a consistent networking model.
  3. It doesn't wrap docker but instead sets things up and gets out of the way. Anything that has to reach in and muck with the network namespace is going to be harder to integrate.

@jbeda
Contributor Author

jbeda commented Sep 18, 2014

Note that rudder is now called flannel.

@jbeda
Contributor Author

jbeda commented Sep 18, 2014

From various IRC conversations, I wanted to snapshot the different strategies that folks have taken to get networking for k8s up and running:

It'd be great if this were layered well instead of being all ad hoc.

@mrunalp
Contributor

mrunalp commented Sep 19, 2014

That is a pretty good list. It shows that different use cases will require different networking setups/topologies. So, if one of them is chosen as the default, it would be good if the APIs still allowed for using other advanced networking solutions.

Use cases that VxLan with OVS/OpenFlow helps solve are:

  • IP per pod with no isolation
  • IP per pod but isolation among groups of pods (logical application)
  • Full IP space available to all logical applications

@smarterclayton
Contributor

Well layered is definitely important - I think default isn't an issue; it's just that the abstraction from Kube -> registration -> rudder / other -> node agent should be very clear.

@jbeda
Contributor Author

jbeda commented Sep 19, 2014

Here is how I think about it. For a network solution to work, we need the following:

  1. Each pod gets an IP address that is unique across the cluster.
  2. Each pod IP is reachable from any other pod and from host machines.
  3. Host machines are reachable from each pod.

As long as this is met, kubernetes will work just fine. If Flannel can encapsulate making that happen it is a good fit.

/cc @kelseyhightower

@jbeda jbeda changed the title Decision: Standardize on Rudder for k8s networking? Decision: Standardize on Rudder/Flannel for k8s networking? Sep 19, 2014
@smarterclayton
Contributor

I don't have any objections at the current level of detail; I would like to see an in-depth proposal to have an opportunity to understand what it means.

@jbeda
Contributor Author

jbeda commented Sep 22, 2014

@smarterclayton What types of networking do you think are a "must have" before we can switch over? Would you be happy, for example, if the vagrant setup switched over to using the UDP encap instead of GRE/OVS?

@smarterclayton
Contributor

I'm actually less concerned about the types of networking - instead, it's mostly that real networks and configurations are some of the most heavily opinionated and predefined aspects of deployments. I can imagine an almost infinite amount of flexibility that people will want in real deployments, but I don't think that Kubernetes has to solve that out of the box, nor that Flannel has to solve that out of the box.

Instead, the list of requirements you gave above is what I consider a good first step, and an excellent place for Flannel to be deployed and integrated with Kubernetes. My concern is solely that an integrator should be able to tie concepts and their own network topologies into Kubernetes (because they have vendor X network solution that can provide this). So I would want clean abstractions between "schedule this pod on this host" and "I have to schedule, then make a call to Flannel, etc". I don't think you're proposing the latter, but I did want to understand a bit more about what the integration with Flannel would look like.

@jbeda
Contributor Author

jbeda commented Sep 23, 2014

How about this -- we make Flannel the default mechanism for configuring the network for a Kubernetes cluster, but we don't make it a hard requirement. If there is some feature that requires us to reach out and configure the networking layer, we'd make it pluggable, similar to the cloud provider stuff.

But by working with Flannel hand-in-hand we'd guarantee we have something that works (perhaps not optimally) out of the box with a minimum of futzing.

@smarterclayton
Contributor

That works for me

EDIT: That being said, we'd like to be able to use the vagrant environment as testbeds for alternative networking even if the default ootb is a "just working" Flannel config

@thockin
Member

thockin commented Sep 23, 2014

I'm OK with that once Flannel demonstrates the ability to run in native GCE mode?


@jbeda
Contributor Author

jbeda commented Sep 23, 2014

@kelseyhightower is way ahead of you Tim: https://github.com/kelseyhightower/flannel-route-manager

My gut is that we should merge the route manager stuff into core flannel. I think it does a great job of showing how flannel is a general network config framework instead of a specific implementation (encap over UDP).

@jbeda
Contributor Author

jbeda commented Sep 23, 2014

@smarterclayton One thing I'd love to do is to simplify/minimize the salt config. Right now it is a rat's nest where we conflate cloud, host OS, and network strategy. Are you cool if the salt configs assume flannel (with the ability to parameterize the flannel config)?

@thockin
Member

thockin commented Sep 23, 2014

I think I am OK with this. Will this cause OVS people to get in a huff, or will they be able to make OVS run under rudder cleanly?


@jbeda
Contributor Author

jbeda commented Sep 23, 2014

I'd see flannel as being a simple driver for OVS. We need to close the loop there though.

@mrunalp
Contributor

mrunalp commented Sep 23, 2014

There are two separate problems that flannel solves (at least from what I've seen/read in their readme).

  1. IP assignment/leases.
  2. Network overlay using UDP encapsulation.

If Kubernetes/flannel were to allow plugins for both these features, then the use cases that we outlined above could be solved. There could be a custom IP assignment plugin that leases out the same IPs to different tenants if necessary, and an OVS plugin with either GRE or VxLan could be used to create the overlay network.
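
Purely as an illustration of where those two plugin seams might sit (these interfaces are hypothetical sketches, not anything flannel or Kubernetes defines), something like the following could separate leasing from the data path:

```go
package network

import "net"

// Lease records a subnet handed out to a particular host.
type Lease struct {
	HostIP net.IP
	Subnet *net.IPNet
}

// IPAllocator is the IP assignment/lease seam: an etcd-backed allocator,
// a cloud IPAM service, or a per-tenant allocator could implement it.
type IPAllocator interface {
	AcquireLease(hostIP net.IP) (*Lease, error)
	RenewLease(l *Lease) error
	ReleaseLease(l *Lease) error
}

// Backend is the data-path seam: UDP encapsulation, OVS with GRE or VXLAN,
// or plain cloud routes could each be a Backend.
type Backend interface {
	// Init programs the local host so traffic to other leases is delivered
	// (set up a tun device, OVS flows, routes, etc.).
	Init(local *Lease) error
	// AddPeer/RemovePeer react to other hosts' leases appearing or going away.
	AddPeer(peer *Lease) error
	RemovePeer(peer *Lease) error
}
```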

@smarterclayton
Contributor

I appreciate that the salt config is complex, but the primary purpose of the vagrant env, at least, was to be able to test and develop Kubernetes in a range of configs. If there's no non-rudder path, that complicates how folks integrate and test against other overlay or network configs.

I really would like to preserve the ability for people to hack on the harder bits of a kube setup in a controlled way - if it's an attention / time thing we can work harder to ensure that path stays working, or potentially we can limit it to the Ansible code path (assuming that was in tree and runnable against vagrant).

I like opinionated choices that make the experience better for people trying out kube, and that support general use on a wide range of topologies (which flannel does well and the vagrant gre/ovs setup does not). I guess I'm not convinced yet that some important deployment modes for kube won't have to deal with other SDN solutions, and that we won't want test beds for them in tree (like the iaas providers).


@jbeda
Contributor Author

jbeda commented Sep 23, 2014

@mrunalp I'll leave it to @kelseyhightower and @eyakubovich to speak to pluggability on the IP assignment side.

@smarterclayton I hear you -- but right now the Vagrant setup is pretty hard-coded to the OVS/GRE strategy. The salt stuff is very brittle in general. I'm cool with writing the salt configs so that we can squeeze in other non-flannel strategies, but I'd like to keep it as isolated as possible. How about we argue about this some more as the PR comes up for review and we have specifics?

@smarterclayton
Contributor

Sure, that works for me.

@eyakubovich

@mrunalp I wasn't planning on making those strategies pluggable in the sense of having 3rd-party plugins. In the case of flannel doing UDP encap, it would require the plugin to communicate IP mappings in real time.

We're working on VXLAN (without OVS) and I don't see much advantage of using GRE (except slightly smaller space overhead).

@jbeda I would like to merge flannel-route-manager into flannel as well.

@jbeda
Contributor Author

jbeda commented Sep 24, 2014

@eyakubovich The only reason that I think users might prefer GRE over UDP is that any networking gear that is monitoring what is happening will be able to classify and monitor the GRE traffic separately from UDP. I think generally UDP is seen as "application" traffic and GRE is seen as "network infrastructure" traffic. In addition, network capture tools and the like know how to crack GRE and reconstruct what is going on inside the encap'd stream.

Personally I like UDP encap as it is supported pretty much everywhere but I can understand why others would like GRE.

@smarterclayton
Contributor

As a note for people coming to this thread, there are other discussions in progress around how OVS/ODL could integrate into Kubernetes or Flannel. I'll make sure that there is an issue linked here to discuss that separately.

@eparis
Contributor

eparis commented Oct 1, 2014

It seems to me that using flannel (as it is today) is going to mean that etcd has to be exposed to the minion, while I thought a goal was to remove the etcd access requirement from the kubelet/minion. Was I mistaken that that was a desired architecture redesign -- to make the apiserver the only thing that communicates with etcd, rather than each and every node?

@jbeda
Contributor Author

jbeda commented Oct 1, 2014

@eparis That is a good point. There is no reason that k8s and flannel have to hit the same etcd. It also doesn't break the conceptual model if flannel requires etcd but k8s hides it. But as a practical matter it is crazy to run 2 etcd instances.

Perhaps this is an impetus for the CoreOS guys to either (a) support some level of ACLing in etcd or (b) abstract out the etcd API such that it can be implemented by a domain-specific proxy (and perhaps other stores).

@smarterclayton
Contributor

> @eparis That is a good point. There is no reason that k8s and flannel have to hit the same etcd. It also doesn't break the conceptual model if flannel requires etcd but k8s hides it. But as a practical matter it is crazy to run 2 etcd instances.

And at small scales, I don't think it's a huge issue to talk directly to etcd from the minions. The bigger your cluster gets, and the more diverse your workloads become, the more important it becomes to introduce a separating abstraction between minions and the central data store.

> Perhaps this is an impetus for the CoreOS guys to either (a) support some level of ACLing in etcd or (b) abstract out the etcd API such that it can be implemented by a domain-specific proxy (and perhaps other stores).

I think both of those are valuable. The second could just be for the read API (the flannel component could still write/read directly). Another option (c) could be to invest in making the etcd client work with transparent proxies (so the apiserver could proxy /api/v1beta1/some/arbitrary/path directly to etcd as /some/prefix/some/arbitrary/path via path rewriting or something). We've actually been interested in that in general, as a way of offering transparent etcd-as-a-service as a resource to consumers of Kubernetes.
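
As a sketch of option (c) only: the external path, key prefix, and ports below are invented for illustration, and this is plain net/http rather than anything the apiserver actually does. It shows path rewriting in front of an etcd endpoint:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical etcd client endpoint; adjust to your deployment.
	etcd, err := url.Parse("http://127.0.0.1:4001")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(etcd)

	// Rewrite /api/v1beta1/etcd/<anything> to /v2/keys/some/prefix/<anything>.
	// Both the external path and the key prefix are made up for this sketch.
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1beta1/etcd/", func(w http.ResponseWriter, r *http.Request) {
		rest := strings.TrimPrefix(r.URL.Path, "/api/v1beta1/etcd/")
		r.URL.Path = "/v2/keys/some/prefix/" + rest
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

Something shaped like this, run next to the apiserver, is one way minions could reach a namespaced slice of etcd without being handed the real endpoint.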

@eyakubovich

cc: @philips

@bgrant0607 bgrant0607 added this to the v1.0 milestone Oct 4, 2014
@bgrant0607
Member

/cc @pietern re. OVS

@philips
Contributor

philips commented Oct 30, 2014

@jbeda @mrunalp GUE is another option here for encap (flannel-io/flannel#64). And I think it is an interesting one because it keeps everything L3 and in-Kernel.

@smarterclayton @jbeda On the ACL topic, it is something we want to do, and it would be great if someone could help define an API and implement it as a proxy. Taking that work and bringing it into etcd itself should be straightforward after that. In etcd 0.5.0 (currently in alpha) we have a proxy package for implementing simple etcd proxies now. Putting an ACL thing in front should be straightforward. Here is a simple proxy example built on the package that filters out certain key prefixes and HTTP verbs.
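
The example linked above doesn't survive in this extract. As a rough stand-in (plain net/http rather than the etcd proxy package, with the allowed key prefix and verb set invented for illustration), an ACL-style filter in front of etcd might look like:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// allowed permits only read-style verbs on a whitelisted key prefix.
// The prefix and verbs here are assumptions for the sketch.
func allowed(r *http.Request) bool {
	if r.Method != http.MethodGet && r.Method != http.MethodHead {
		return false
	}
	return strings.HasPrefix(r.URL.Path, "/v2/keys/coreos.com/network/")
}

func main() {
	etcd, err := url.Parse("http://127.0.0.1:4001") // hypothetical etcd endpoint
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(etcd)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !allowed(r) {
			http.Error(w, "forbidden by proxy ACL", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```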

@mrunalp
Contributor

mrunalp commented Oct 30, 2014

@philips That looks interesting. I will check it out. Thanks.

@bgrant0607 bgrant0607 removed this from the v1.0 milestone Dec 3, 2014
@bgrant0607 bgrant0607 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Dec 4, 2014
@bgrant0607
Member

For reference, OpenStack's IP management API: http://docs.openstack.org/api/openstack-network/2.0/content/Overview-d1e71.html

@bgrant0607
Member

@thockin thockin added kind/support Categorizes issue or PR as a support question. and removed kind/support-question labels Jan 7, 2015
@goltermann goltermann added priority/support and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jan 7, 2015
@roberthbailey roberthbailey removed the kind/support Categorizes issue or PR as a support question. label Feb 12, 2015
@errordeveloper
Member

I believe that the original discussion here is mostly overtaken by the plugins introduced with #5069. The only question that still remains is whether there needs to be a default plugin or not, but that's a separate discussion, I believe, as the context has changed with the introduction of plugins.

@kamalmarhubi
Contributor

Given @errordeveloper's comment above, should this issue be closed? I came across it and the current state is slightly ambiguous.

@bgrant0607
Member

Yes, thanks.

sjenning pushed a commit to sjenning/kubernetes that referenced this issue Aug 30, 2022
UPSTREAM: 110039: Add readinessProbe to aggregated api service test