Description
I am working on building Docker images for Apache Cassandra. I have a working PoC application wrapper that can automatically configure Cassandra to run inside Docker. The first thing I found missing was a way to introspect the container's allowed amount of memory from inside the container so that the wrapper can size the JVM heap correctly.
When I went searching for available solutions on the web, I found a few mailing list threads and related issues, but nothing stood out as a clear winner, so here we are.
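To make the gap concrete, here is a minimal sketch (Go, not the actual wrapper code) of what I want to be able to do from inside the container: read the memory limit and derive a JVM heap size from it. The cgroup v1 path and the 50% heap ratio are illustrative assumptions only; the whole problem is that nothing like this path is reliably visible inside a container today.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memoryLimitBytes reads the cgroup v1 memory limit. This file is only
// visible if the cgroup filesystem happens to be mounted inside the
// container, which is exactly the gap this proposal is about.
func memoryLimitBytes() (int64, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	limit, err := memoryLimitBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, "no memory limit visible:", err)
		os.Exit(1)
	}
	// Arbitrary example ratio: give the JVM half of the container limit.
	fmt.Printf("-Xmx%dM\n", limit/2/(1024*1024))
}
```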
Related links:
- https://groups.google.com/d/topic/docker-dev/n0CrW_v-E8s/discussion
- Add host and container info to DockerInfo API #1091 #2607
- http://jpetazzo.github.io/2013/10/08/docker-containers-metrics/
- http://blog.dotcloud.com/dashboard-memory-metrics-reloaded
- http://fabiokung.com/2014/03/13/memory-inside-linux-containers/
- Container downward/upward API umbrella issue kubernetes/kubernetes#386
Related issues: #7472 #7255 #1270 #3778
Requirements
- read-only from the container (security, simplicity)
- values that change on the host are reflected in the container's view (dynamic)
- cannot break existing containers
- should be universal; public images can rely on it
Approaches
Each of these approaches has a different set of tradeoffs. I'm willing to write the code for whichever the core team decides is best. My preference would be for something that can be provided as a basic service to every container managed by Docker. My goal is to be able to create containers that rely on this information and have them work without change in as many Docker environments as possible. For example, I hope my Cassandra containers can work equally well in a full-on CoreOS setup as they do in a simple boot2docker instance.
Environment Variables
This is easily the least complex solution. Before starting a container, Docker would add some DOCKER_ environment variables to the container's first process. These could be propagated or blocked by the init process at the user's choice.
The big downside is that there is no way to change a process's environment variables after it starts, so the values could never be dynamic. I also don't know of a way to ensure they don't get munged by processes inside the container, so they aren't really read-only either. That said, they don't provide a vector into the host OS, so maybe that's OK.
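For illustration only, consuming such a variable would look something like this; DOCKER_MEMORY_LIMIT is a made-up name, not anything Docker sets today:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// Hypothetical variable injected by Docker at container start.
	raw := os.Getenv("DOCKER_MEMORY_LIMIT")
	limit, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "DOCKER_MEMORY_LIMIT not set or not a number:", err)
		os.Exit(1)
	}
	fmt.Printf("memory limit at start: %d bytes\n", limit)
	// The value is frozen at exec time; a later change on the host side
	// would not show up here, which is the dynamic-values problem above.
}
```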
IPC key/value
IPC namespaces are already in place, so it should be fairly easy to provide a SysV shared memory segment with key/value pairs in it. It could be updated from the host at any time and would be fast.
Since the interface into the container would be raw shared memory, tools to read it would have to be available inside every container, and everything gets more complicated from there. Moving on ...
REST over HTTP
The inspiration for this comes from EC2's metadata API. Every EC2 VM can route to a link-local address, 169.254.169.254, that runs a REST API. This is what tools like ohai and facter use to learn about EC2 VMs. The link-local address can be hard-coded in scripts, and HTTP has good client support in every programming language.
- https://www.digitalocean.com/community/tutorials/an-introduction-to-droplet-metadata
- http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
- http://en.wikipedia.org/wiki/Link-local_address
I have a working PoC for the REST approach that simply exposes memory.limit_in_bytes as GET /memory/memory.limit_in_bytes. A more complete implementation would follow REST best practices and might choose to expose fewer Linux-specific semantics.
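To sketch the shape of it (this is not my actual PoC code; the container ID, path layout, and address binding are placeholders):

```go
package main

import (
	"net/http"
	"os"
	"path/filepath"
)

// Placeholder for the host-side cgroup directory of one container.
const cgroupDir = "/sys/fs/cgroup/memory/docker/CONTAINER_ID"

func main() {
	// The daemon side reads the container's cgroup value on the host and
	// serves it over HTTP; /<subsystem>/<file> is just one possible layout.
	http.HandleFunc("/memory/memory.limit_in_bytes", func(w http.ResponseWriter, r *http.Request) {
		data, err := os.ReadFile(filepath.Join(cgroupDir, "memory.limit_in_bytes"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write(data)
	})
	// Binding to the link-local address is the part the rest of this issue
	// is about; plain TCP here is just for illustration.
	http.ListenAndServe("169.254.169.254:80", nil)
}
```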
dummy / link-local
My PoC currently works with a dummy interface plus a link-local IP per container (with a PR against libcontainer to enable a dummy network strategy). Docker could inject this interface into every container except those that use --net host or share a network namespace with another container. If those features matter, this approach can't be universal.
link-local alias on docker0
The service could also be bound to a link-local address on the docker0 bridge. This works with very little configuration in my setup, but the service would have to be very careful about validating packet origin to avoid leaking one container's data to another. It also falls apart when people use non-standard bridge setups, which is a deal-breaker.
AF_UNIX
I avoided AF_UNIX at first because of how many times I've had to fix daemons whose socket was unlinked by accident. That said, AF_UNIX is probably the best option since it's easy to verify exactly which container made a request. Perhaps setting chattr +i on the socket file would be enough to prevent the common problems. Some HTTP clients don't support AF_UNIX sockets, but support isn't uncommon either. As long as curl works, I think most users will be happy with this.
related: https://github.com/cpuguy83/lestrade
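On the client side, an HTTP library only needs a custom dialer to speak over a unix socket. The socket path and URL below are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Ignore the address from the URL and always dial the unix socket.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/run/docker-metadata.sock")
			},
		},
	}

	// The hostname here is arbitrary since the dialer overrides it.
	resp, err := client.Get("http://metadata/memory/memory.limit_in_bytes")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```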
Filesystem
Since Linux applications typically use /proc and /sys for system introspection, this is the most natural choice, but it is also the most complex to implement.
The two big options for filesystems are FUSE and bind mounts. libvirt-lxc provides a FUSE interface for /proc/meminfo that seems to work out OK, but many people are not comfortable with the size and complexity of the FUSE API. FUSE can do the job; the question is whether it's OK to make it a requirement for every container. Since Docker already relies on FUSE, maybe this isn't an issue?
A read-only bind mount into the container can provide the same information. Assuming RO bind mounts are safe enough security-wise, Docker could write out and maintain all the relevant metadata on the host side, then bind mount each container's subtree into that container read-only. Care would have to be taken to do transactional (write + link) updates to the metadata files, and Docker would end up maintaining the filesystem tree on disk, which would be tedious. It is fairly easy to test though.
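A sketch of the kind of transactional update meant here, using write + rename rather than write + link, with invented paths:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// updateMetadata atomically replaces dir/name with value so a reader in the
// container never sees a partially written file.
func updateMetadata(dir, name, value string) error {
	tmp, err := os.CreateTemp(dir, name+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleanup if we fail before the rename
	if _, err := tmp.WriteString(value); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) is atomic within a filesystem: readers see the old file or
	// the new one, never a torn write.
	return os.Rename(tmp.Name(), filepath.Join(dir, name))
}

func main() {
	// Hypothetical host-side tree that would be bind mounted read-only into
	// one container.
	err := updateMetadata("/var/lib/docker/metadata/CONTAINER_ID", "memory_limit_in_bytes", "2147483648\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```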
My worry is that providing a POSIX-like filesystem will mean that full POSIX semantics need to be preserved. Making the fs read-only does remove a lot of the nastier problems, though.
Not emulating /proc would avoid a lot of compatibility issues, at the cost of giving up the ability to run existing scripts unmodified.
IMNSHO
Personally, I'm leaning towards AF_UNIX + REST. The main reason is that it has the least overall complexity while providing well-known semantics to users, and it can be accessed using readily available tools and libraries. It doesn't have to be HTTP+JSON; I like those because of tool availability. A memcached-like protocol would also be fine since it can still be accessed with tools like busybox nc, but for now I'd like to leave protocol design until after the big decisions are made.
Edits:
- add link to Fabio Kung's blog post
- s/tmpfs/bind mounts/ since it could be any fs
- added some links to related issues from comments
- add link to Digital Ocean's new metadata API