Decide whether/how to extend the networking model #188
The VM-per-pod model would negate the container efficiency gains - I'm not sure how many people would be interested in deploying like that vs. simply using a VM.

Practically speaking, if a routable IP per pod is beyond the technical capabilities of existing clouds at reasonable densities (is the 60-IP limit on an Amazon m1.xlarge too low for practical use cases?) or beyond the administrative/organizational capabilities of non-cloud shops (which deserves further investigation), and if IPv6 is still 2-3 years out from reasonable deployment, then the Kubernetes model is only deployable on GCE in practice. It would be good to list out the practical limits in other clouds (OpenStack Neutron, AWS, SoftLayer) as well as a recommended IP-per-pod configuration that would work without a ton of admin headaches on metal.

It's possible that dynamic port allocation could be an alternate mode, supported with a subset of features and known limitations. What would that abstraction have to look like for Kubernetes to continue to work on the ideal path? A few things I can think of:

- The scheduler has to be aware of port exhaustion and record allocated ports, OR the exposed ports have to be reported back to the master via some backchannel. If there is a global record of allocated ports, a mechanism is required to efficiently distribute that port information to the appropriate proxies.
- You must implement at least one level of abstraction for container-to-container communication (either a local or shared proxy, or an iptables NAT translation a la geard).
- You also must implement a more complex migration path for things like CRIU - with more steps before and afterwards to ensure that the network abstraction is ready to accept the moved container.
I've started a doc on this topic, but will be out of the office for about half of this week.
Still gathering more feedback from customers and ops folks, but there's a lot of concern about being able to deploy the IP-per-container model outside of the big cloud providers. Recording comments I'm hearing:
Some feedback here on the IP-per-pod/container model.
I would like to get an IPv6 address per pod.
Thoughts on networking and naming, including more background, partly from Tim's aforementioned doc.

Kubernetes's current networking model works as follows. We start Docker on each node with a custom bridge, cbr0, which we set up with SaltStack, and we make the bridge's addresses routable in GCE. The minion IP ranges are /24s in the 10-dot space. GCE itself does not know anything about these IPs, and they are not externally routable, so containers that need to communicate with the outside world need to use host networking. An external IP set up to forward to the VM will only forward to the VM's primary IP (which is assigned to no pod), so we use Docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in #390.)

We create a container to use for the pod network namespace -- it has a single loopback device and a single veth device. All the user's containers join that pod networking container's network namespace via Docker's "container" networking mode; Docker allocates the pod's IP address from the bridge we create on each node.
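A minimal sketch of that node setup, assuming a 10.244.1.0/24 range for this node; the exact flags, names, and Salt states in the real configuration may differ:

```bash
# Create the per-node bridge and start Docker on it, with Docker's own
# iptables management disabled so pod IPs are used as-is (illustrative only):
brctl addbr cbr0
ip addr add 10.244.1.1/24 dev cbr0
ip link set cbr0 up
docker -d --bridge=cbr0 --iptables=false &

# Make the node's /24 routable within GCE so pods are reachable cluster-wide:
gcloud compute routes add minion-1-route \
    --destination-range 10.244.1.0/24 --next-hop-instance minion-1
```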
The net result is that all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other's ports on localhost. Ports published to the host interface are published in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.

In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.

The approach does reduce isolation between containers within a pod -- ports could conflict, and there can be no private ports across containers within a pod -- but applications requiring their own port spaces can just run as separate pods, and processes requiring private communication can run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, CPU, RAM, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control which containers belong to the same pod, whereas, in general, they don't control which pods land together on a host.

When any container calls SIOCGIFADDR, it sees the same IP that any peer container would see them coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This enables all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication; within a pod, containers are more likely to communicate through volumes (e.g., tmpfs) or IPC.

This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and only sees that 172-dot address from SIOCGIFADDR. If such a container connects to another container, the peer sees the connection coming from a different IP than the container itself knows about. In short, you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: a pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but it would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of addressing, and it would break self-registration and IP distribution mechanisms.

We want to be able to assign IP addresses externally from Docker (moby/moby#6743) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts (moby/moby#2801), and to facilitate pod migration.
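To make the self-registration point concrete, here is a minimal sketch, assuming an etcd endpoint at http://etcd.example:4001 and a key layout invented for this example; because the pod's IP is the same inside and outside the container, the address it registers is one its peers can actually reach:

```bash
# Register this pod's reachable address under a per-service key with a TTL
# (refreshed periodically by the application).
POD_IP=$(hostname -i)
curl -L -X PUT "http://etcd.example:4001/v2/keys/services/myapp/${POD_IP}" \
     -d value="${POD_IP}:8080" -d ttl=60
```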
Right now, if the network container dies, all the user containers must be stopped and restarted, because the netns of the network container changes on restart, and any subsequently restarted user container joins that new netns and can no longer see its peers. Additionally, a change in IP address would run into DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). And we could potentially eliminate the bridge and use network interface aliases instead.

IPv6 would be a nice option as well, but we can't depend on it yet. Docker support is in progress: moby/moby#2974, moby/moby#6923, moby/moby#6975. Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)

We'd also like to set up DNS automatically (#146). hostname, $HOSTNAME, etc. should return a name for the pod (#298), and gethostbyname should be able to resolve names of other pods. We probably need to set up a DNS resolver to do the latter (moby/moby#2267), so that we don't need to keep /etc/hosts files up to date dynamically.

If we want Docker links and/or docker inspect to work, we may have work to do there. Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow. I haven't looked to see whether link variables would be set correctly, but I think there's a possibility they aren't.

We need to think more about what to do with the service proxy. Using a flat service namespace doesn't scale, and environment variables don't permit dynamic updates. We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services (#260), and other types of groups (worker pools, etc.). Providing the ability to watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks (#140) for join/leave events would probably make this even easier.

We'd even like to make pods directly routable from the external internet, though we can't do that yet. One approach could be to create a new host interface for each pod, if we had a way to route an external IP to it.

We're also working on making it possible to specify a different bridge for each container. We may or may not still need this, but it could be useful for certain scenarios.
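To illustrate the docker inspect gap mentioned above, here is a rough sketch using a hypothetical pod infrastructure container (the image names are placeholders, not the real kubelet code path):

```bash
docker run -d --name pod-infra kubernetes/pause
docker run -d --name app --net=container:pod-infra nginx

# The user container reports no networking of its own, because it borrowed
# the infra container's namespace:
docker inspect -f '{{ .NetworkSettings.IPAddress }}' app        # empty string
docker inspect -f '{{ .NetworkSettings.IPAddress }}' pod-infra  # the pod's IP
```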
@ironcladlou is also looking at OpenVSwitch and OpenDaylight integrations - making it easier to deploy these sorts of topologies on non-cloud infrastructure (or clouds with limited networking).
We did some work on Docker with Open vSwitch as a proof of concept and would be interested in those integrations -- I would even venture to say that they will be of interest to cloud providers who are looking to provide higher-abstraction services, e.g., container/pod deployment, in lieu of or in addition to plain IaaS abstractions.
@bgrant0607 there is one thing that seems to be in conflict in your document: the use of a range of IP addresses per Docker host, which seems to conflict with facilitating pod migration and stable IP addresses: "We want to be able to assign IP addresses externally from Docker so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts, and to facilitate pod migration." Sounds to me like you have a choice between:
@Lennie Yes, our current implementation doesn't support migration. Another possibility I'm considering is to allocate an IP per service and use services as the stable addressing mechanism. That would also work in the case where a pod was replaced by a replicationController.
I've been racking my brain trying to find a solution to this problem for a while now. Have you seen how consulate and ambassadord use service discovery to connect containers?

- https://github.com/progrium/consulate
- https://github.com/progrium/ambassadord
- https://github.com/progrium/registrator (previously named docksul)

It uses 4 ideas:
What it does is: you first deploy an ambassadord container with a proxy server that has access to the Docker socket. When you deploy a new container, you add metadata about which backend you want the container to link to and on which port, and you link it to the ambassadord container. When the process in the new container tries to connect to its ambassador, it connects to the IP address of the proxy and the port of the backend it wants to reach. An iptables redirect then sends that connection to the proxy. The proxy can see the source IP address and the original destination port. It can then use that information to look up the metadata of the source container, or use service discovery data stored in etcd or Consul, and connect it to an available backend, which could be on another host. If the service has to be found through service discovery, something has to register it there. That is what the other project, registrator, is for: it is the one watching containers being started and stopped.
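A hedged sketch of the redirect step being described, with 172.17.0.2 standing in for the ambassador's address and 10000 for its proxy port (both invented for this example):

```bash
# In the ambassador's network namespace: whatever port the app connects to on
# the ambassador's address gets redirected to the single local proxy port.
iptables -t nat -A PREROUTING -d 172.17.0.2 -p tcp -j REDIRECT --to-ports 10000
```

The proxy can then recover the original destination port via the SO_ORIGINAL_DST socket option and, together with the source IP, look up the desired backend in etcd or Consul before forwarding the connection.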
Hi Lennie - the pattern you describe is familiar to me (see #494 for a similar discussion of interconnection). I do tend to think that the ambassador should be something the infrastructure provides, rather than something modeled directly as a Docker container, but you always need a pattern that works with only Docker.
I think most of the pieces are in place in Kubernetes, but I'd like to see something like the consulate pattern.
I believe the pattern I described is actually meant to be a Docker container that is part of the infrastructure. For one, it gets access to the Docker socket; you probably don't want to do that if it is not a trusted part of the infrastructure. And the step for setting up the iptables rules that are applied to the container needs --privileged too. Have a look at the interview with the author. The reason it is a Docker container is so it can be linked to, and if everything is a container it makes it easier to deploy. That ambassador is a local proxy which handles the load balancing and watching for service changes. On 1 and 4, I totally agree on using real networking. The question is: do you want the infrastructure to provide the service discovery, reconfiguration, and load balancing, or do you want the developer of the container that is being deployed to handle it? Probably the first, and in that case you'll need a local proxy.
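As I recall the ambassadord README, deployment looks roughly like the following - treat the image name and flags as assumptions rather than an exact invocation:

```bash
# The ambassador gets the Docker socket so it can watch container metadata:
docker run -d --name backends \
    -v /var/run/docker.sock:/var/run/docker.sock progrium/ambassadord --omnimode

# A privileged one-shot container installs the iptables rules inside the
# ambassador's network namespace:
docker run --rm --privileged --net container:backends \
    progrium/ambassadord --setup-iptables
```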
Sorry, I meant a single ambassador container per remote link (should have clarified). Running the infrastructure proxy as a container certainly makes sense. And links v2 in Docker is moving towards the idea of a discovery hub on the host that can be externally configured, as another potential container listening on libchan. The service proxy on each minion implements much of this pattern today, although I'd like to additionally offer the ability for the IPs of pods that correspond to a label query to be late bound as environment variables at container start time (to allow direct connection), as well as to create local virtual network links per container for singletons (for ease of development, 127.0.0.1:3306 in your container points to either a local service proxy or a pod IP) that can be late bound dynamically.
That is why I mentioned consul, because it has at least 3 of these properties:
So far the only thing it doesn't do - and I haven't seen anyone implement or even mention it - is multi-host inter-container communication firewalling with iptables (--icc=false). Almost the complete opposite of direct connections. That doesn't mean I haven't thought about it, and my idea right now is: maybe it can be done. If all containers use the proxy to talk to other containers, then the containers on the source host shouldn't be able to talk to anything else but the proxy. This is almost what --icc=false does right now when you use linking. If you have that, then all you need to do is set up a firewall rule on the destination host with a set of IP addresses of source hosts. And you can actually use a pretty efficient ipset for that - maybe even just one ipset. Obviously, in the current model where every pod has an IP address, that list might grow pretty large.

Docker has one way to publish a port on its public IP address: publish (expose only has an effect locally). But you could have a third option, with one iptables rule using that ipset. You could even have something like consul in a Docker container with --net=host automatically manage that ipset based on something like service discovery - only it would just have a list of all the hosts of this deployment. One of the reasons why I would like to see something like that is that I want some kind of multitenancy. Not of different customers, but of different deployments of different or the same applications from the same customer/developer/user. Yeah, I know, maybe that is just crazy talk. :-)
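A rough sketch of the ipset idea, assuming the proxies live at the addresses shown and the destination host's containers sit in 10.244.3.0/24 (all invented for illustration):

```bash
# Build a set of the proxy addresses on the source hosts.
ipset create allowed-proxies hash:ip
ipset add allowed-proxies 10.244.1.1
ipset add allowed-proxies 10.244.2.1

# On the destination host: accept container-bound traffic only from proxies
# in the set, and drop direct container-to-container connections.
iptables -A FORWARD -d 10.244.3.0/24 -m set --match-set allowed-proxies src -j ACCEPT
iptables -A FORWARD -d 10.244.3.0/24 -j DROP
```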
That matches with at least some of the plans we have to enable multi-tenancy in Kubernetes - it's probably just a matter of time before someone takes a stab at it.
I am currently working on a Docker IPv6 implementation. The first tests look good. Every Docker host has a subnet from which it delegates one IPv6 address to each container.
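For illustration, per-host delegation along those lines might look like the following, using the 2001:db8::/32 documentation prefix and a hand-created namespace rather than whatever the actual implementation does:

```bash
# Host A owns 2001:db8:1::/64 and hands one address to each container.
ip -6 addr add 2001:db8:1::1/64 dev cbr0
ip netns exec container1 ip -6 addr add 2001:db8:1::100/64 dev eth0
ip netns exec container1 ip -6 route add default via 2001:db8:1::1
```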
Yesterday on IRC, @smarterclayton raised an issue with @jbeda and @thockin about whether other cloud providers/platforms (e.g., OpenStack, AWS) could handle the IP-per-pod networking model. I'm opening an issue to capture discussion about how to address this.
DESIGN.md doesn't really explain the motivation for the model in depth. We should capture our decision there.
IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.
OTOH, dynamic port allocation:

- requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports,
- requires partitioning centrally allocated and locally acquired dynamic ports,
- complicates scheduling,
- is inconvenient for users,
- complicates application configuration,
- is plagued by port conflicts, reuse, and exhaustion,
- requires non-standard approaches to naming (e.g., etcd rather than DNS),
- requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers),
- requires watching and cache invalidation for address/port changes for instances, in addition to watching group membership changes, and
- obstructs container/pod migration (e.g., using CRIU).
One possibility is that we could use libswarm to spin up a new VM for each pod on systems that don't have good enough routing support for allocation of an IP per pod.