geard is an opinionated tool for installing Docker images as containers onto a systemd-enabled Linux operating system (systemd 207 or newer). It may be run as a command:
$ sudo gear install pmorie/sti-html-app my-sample-service
to install the public image pmorie/sti-html-app to systemd on the local box with the service name "ctr-my-sample-service". The command can also start as a daemon and serve API requests over HTTP (port 43273 is the default):
$ sudo gear daemon
2014/02/21 02:59:42 ports: searching block 41, 4000-4099
2014/02/21 02:59:42 Starting HTTP on :43273 ...
You can also use the gear command against a remote daemon:
$ gear stop localhost/my-sample-service
$ gear install pmorie/sti-html-app localhost/my-sample-service.1 localhost/my-sample-service.2
$ gear start localhost/my-sample-service.1 localhost/my-sample-service.2
The gear daemon and local commands must run as root to interface with the Docker daemon over its Unix socket and systemd over DBus.
geard exposes primitives for dealing with containers across hosts and is intended to work closely with a Docker installation - as the plugin system in Docker evolves, many of these primitives may move into plugins of Docker itself.
A gear is an isolated Linux container and is an evolution of the SELinux jails used in OpenShift. For those familiar with Docker, it's a started container with some bound ports, some shared environment, some linking, some resource isolation and allocation, and some opinionated defaults about configuration that ease use. Here are some of those defaults:
- Gears are isolated from each other and the host, except where they're explicitly connected
By default, a container doesn't have access to the host system processes or files, except where an administrator explicitly chooses, just like Docker.
- Gears are portable across hosts
A gear, like a Docker image, should be usable on many different hosts. This means that the underlying Docker abstractions (links, port mappings, environment files) should be used to ensure the gear does not become dependent on the host system. The system should make it easy to share environment and context between gears and move them among host systems.
- Systemd is in charge of starting and stopping gears and journald is in charge of log aggregation
A Linux container (Docker or not) is just a process. No other process manager is as powerful or flexible as systemd, so it's only natural to depend on systemd to run processes and Docker to isolate them. All of the flexibility of systemd should be available to customize gears, with reasonable defaults to make it easy to get started (see the example after this list).
- By default, every gear is quota bound and security constrained
An isolated gear needs to minimize its impact on other gears in predictable ways. Leveraging a host user id (uid) per gear allows the operating system to impose limits to file writes, and using SELinux MCS category labels ensures that processes and files in different gears are strongly separated. An administrator might choose to share some of these limits, but by default enforcing them is good.
A consequence of per gear uids is that each container can be placed in its own user namespace - the users within the container might be defined by the image creator, but the system sees a consistent user.
- The default network configuration of a container is simple
By default a container will have 0..N ports exposed and the system will automatically allocate those ports. An admin may choose to override or change those mappings at runtime, or apply rules to the system that are applied each time a new gear is added. Much of the linking between containers is done over the network or the upcoming Beam constructs in Docker.
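Because gears end up as ordinary systemd units whose output flows through journald (the third point above), the standard systemd tooling applies to them directly. For example, for the service installed earlier (assuming the unit name "ctr-my-sample-service" from the install example):
$ sudo systemctl status ctr-my-sample-service.service
$ sudo journalctl -u ctr-my-sample-service.service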
Here is the initial set of supported container actions - these should map cleanly to Docker, systemd, or a very simple combination of the two. Geard unifies the services, but does not reinterpret them.
- Create a new system unit file that runs a single docker image (install and start a container)
$ gear install pmorie/sti-html-app localhost/my-sample-service --start
$ curl -X PUT "http://localhost:43273/container/my-sample-service" -H "Content-Type: application/json" -d '{"Image": "pmorie/sti-html-app", "Started":true}'
- Stop, start, and restart a container
$ gear stop localhost/my-sample-service
$ curl -X PUT "http://localhost:43273/container/my-sample-service/stopped"
$ gear start localhost/my-sample-service
$ curl -X PUT "http://localhost:43273/container/my-sample-service/started"
$ gear restart localhost/my-sample-service
$ curl -X POST "http://localhost:43273/container/my-sample-service/restart"
- Deploy a set of containers on one or more systems, with links between them:
# create a simple two container web app
$ gear deploy deployment/fixtures/simple_deploy.json localhost
The links between containers are iptables-based rules - try curling 127.0.0.1:8081 to see the second web container.
# create a mongo db replica set (some assembly required)
$ gear deploy deployment/fixtures/mongo_deploy.json localhost
$ sudo switchns db-1 /bin/bash
> mongo 192.168.1.1
MongoDB shell version: 2.4.9
> rs.initiate({_id: "replica0", version: 1, members:[{_id: 0, host:"192.168.1.1:27017"}]})
> rs.add("192.168.1.2")
> rs.add("192.168.1.3")
> rs.status()
# wait....
> rs.status()
The argument to initiate() sets the correct hostname for the first member, otherwise the other members cannot connect.
- View the systemd status of a container
$ gear status localhost/my-sample-service
$ curl "http://localhost:43273/container/my-sample-service/status"
- Tail the logs for a container (will end after 30 seconds)
$ curl "http://localhost:43273/container/my-sample-service/log"
- List all installed containers (for one or more servers)
$ gear list-units localhost
$ curl "http://localhost:43273/containers"
- Create a new empty Git repository
$ curl -X PUT "http://localhost:43273/repository/my-sample-repo"
- Link containers with local loopback ports (e.g. 127.0.0.2:8081 -> 9.8.23.14:8080). If the local IP isn't specified, it defaults to 127.0.0.1
$ gear link -n=127.0.0.2:8081:9.8.23.14:8080 localhost/my-sample-service
- Set a public key as enabling SSH or Git SSH access to a container or repository (respectively)
$ gear keys --key-file=[FILE] my-sample-service
$ curl -X POST "http://localhost:43273/keys" -H "Content-Type: application/json" -d '{"Keys": [{"Type":"authorized_keys","Value":"ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA6NF8iallvQVp22WDkTkyrtvp9eWW6A8YVr+kz4TjGYe7gHzIw+niNltGEFHzD8+v1I2YJ6oXevct1YeS0o9HZyN1Q9qgCgzUFtdOKLv6IedplqoPkcmF0aYet2PkEDo3MlTBckFXPITAMzF8dJSIFo9D8HfdOV0IAdx4O7PtixWKn5y2hMNG0zQPyUecp4pzC6kivAIhyfHilFR61RGL+GPXQ2MWZWFYbAGjyiYJnAmCP3NOTd0jMZEnDkbUvxhMmBYSdETk1rRgm+R4LOzFUGaHqHDLKLX+FIPKcF96hrucXzcWyLbIbEgE98OHlnVYCzRdK8jlqm8tehUc9c9WhQ=="}], "Containers": [{"Id": "my-sample-service"}]}'
- Enable SSH access to join a container for a set of authorized keys (requires 'gear install --isolate')
TODO: add fixture public and private key for example
- Build a new image from a source URL and base image
$ curl -X POST "http://localhost:43273/build-image" -H "Content-Type: application/json" -d '{"BaseImage":"pmorie/fedora-mock","Source":"git://github.com/pmorie/simple-html","Tag":"mybuild-1"}'
- Fetch a Git archive zip for a repository
$ curl "http://localhost:43273/repository/my-sample-repo/archive/master"
- Set and retrieve environment files for sharing between containers (patch and pull operations)
$ gear set-env localhost/my-sample-service A=B B=C
$ gear env localhost/my-sample-service
$ curl "http://localhost:43273/environment/my-sample-service"
$ gear set-env localhost/my-sample-service --reset
You can also set the environment during installation:
$ gear install ccoleman/envtest localhost/env-test1 --env-file=deployment/fixtures/simple.env
Loading environment into a running container depends on the "docker run --env-file" option, available in Docker master (0.9.x after April 1st). You must start the daemon with "gear daemon --has-env-file" in order to use the option - it will be made the default after 0.9.1 lands and the minimum requirements are updated.
- More to come....
geard allows an administrator to easily ensure a given Docker container will always run on the system by creating a systemd unit describing a docker run command. It will execute the Docker container processes as children of the systemd unit, allowing auto restart of the container, customization of additional namespace options, the capture of stdout and stderr to journald, and audit/seccomp integration for those child processes.
Note: foreground execution is currently not in Docker master - see https://github.com/alexlarsson/docker/tree/forking-run for some prototype work demonstrating the concept.
Each created systemd unit can be assigned a unique Unix user for quota and security purposes. An SELinux MCS category label will automatically be assigned to the container to separate it from the other containers on the system, and containers can be set into systemd slices with resource constraints.
A container may also be optionally enabled for public key SSH access for a set of known keys under the user identifier associated with the container. On SSH to the host, they'll join the running namespace for that container.
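Putting those pieces together, the generated unit might look roughly like the following. This is only an illustrative sketch - the actual template geard writes contains more directives and different docker run flags - but it shows the unit name, slice assignment, and install target described elsewhere in this document:
    # Illustrative only; not the exact unit geard generates
    [Unit]
    Description=Gear container my-sample-service

    [Service]
    Slice=container-small.slice
    ExecStart=/usr/bin/docker run --rm --name my-sample-service pmorie/sti-html-app
    Restart=on-failure

    [Install]
    WantedBy=container-active.target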
geard is intended to be useful in different scales of container management:
- as a simple command line tool that can quickly generate new unit files and complement the systemctl command line
- as a component in a large distributed infrastructure under the control of a central orchestrator
- as an extensible component for other forms of orchestration
As this is a wide range of scales to satisfy, the core operations are designed to be usable over most common transports - including HTTP, message queues, and gossip protocols. The default transport is HTTP, and a few operations like log streaming, transferring large binary files, or waiting for operations to complete are best modeled by direct HTTP calls to a given server. The remaining calls expect to receive a limited set of input and then effect changes to the state of the system - operations like install, delete, stop, and start. In many cases these are simple passthrough calls to the systemd DBus API and persist additional data to disk (described below). However, other orchestration styles like pull-from-config-server could implement a transport that would watch the config server for changes and then invoke those fundamental primitives.
From the gear CLI, you can perform operations directly as root (use the embedded gear API library code) or connect to one or more geard instances over HTTP (or another transport). This works well for managing a few servers or interacting with a subset of hosts in a larger system.
At larger scales, an orchestrator component is required to implement features like automatic rebalancing of hosts, failure detection, and autoscaling. The different types of orchestrators and some of their limitations are shown in the diagrams below:
As noted, the different topologies have different security and isolation characteristics - generally you trade ease of setup and ease of distributing changes for increasing host isolation. At the extreme, a large multi-tenant provider may want to minimize the risks of host compromise by preventing nodes from being able to talk to each other, except when the orchestrator delegates. The encrypted/ package demonstrates one way of doing host delegation - a signed, encrypted token which only the orchestrator can generate, but hosts can validate. The orchestrator can then give node 1 a token which allows it to call an API on node 2.
A second part of securing large clusters is ensuring the data flowing back to the orchestrator can be properly attributed - if a host is compromised it should not be able to write data onto a shared message bus that masquerades as other hosts, or to execute commands on those other hosts. This usually means a request-reply pattern (such as implemented by MCollective over STOMP) where requests are read off one queue and written to another, and the caller is responsible for checking that responses match valid requests.
On the other end of the spectrum, in small clusters ease of setup is the gating factor and there tend to be less extreme multi-tenant security concerns. A gossip network or distributed config server like etcd can integrate with geard to serve as both data store and transport layer.
The geard code depends on:
- systemd 207 (Fedora 20 or newer)
- Docker 0.7 or newer (0.9.x from Apr 1 to use --env-file, various other experimental features not in tree)
If you don't have those, you can use the following to run in a development vm:
- Vagrant
- VirtualBox
If you have Go installed locally (have a valid GOPATH env variable set), run:
go get github.com/openshift/geard
cd $GOPATH/src/github.com/openshift/geard
vagrant up
If you don't have Go installed locally, run the following steps:
git clone git@github.com:openshift/geard && cd geard
vagrant up
The geard project is set up such that vagrant up will download and install geard's dependencies, as well as install and enable the systemd unit that runs geard in a Docker container under systemd.
Once vagrant up is running, you can ssh into the vm:
vagrant ssh
The contrib/build script allows you to build and run the project in two different ways:
- Build binaries locally and run the daemon interactively with gear daemon
- Build a docker image and run the containerized daemon as a systemd unit
To build and run locally, run the following commands in an ssh session to your development vm:
contrib/build -s
sudo ./gear daemon
The gear daemon's logs will go to the console in this case.
To build a Docker image and run the containerized daemon as a systemd unit:
contrib/build -d
This will build the Docker image and start the geard.service systemd unit (as well as restart the unit if it is already running).
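To check on the containerized daemon, the usual systemd tooling applies:
sudo systemctl status geard.service
sudo journalctl -fu geard.service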
See contrib/example.sh and contrib/stress.sh for more examples of API calls.
Outline of how some of the core operations work:
- Linking - use iptables rules and environment variables to simplify container interconnect
- SSH - generate authorized_keys for a user on demand
- Isolated container - start an arbitrary image and force it to run as a given user on the host by chowning the image prior to execution
- Idling - use iptables rules to wake containers on SYN packets
- Git - host Git repositories inside a running Docker container
- Logs - stream journald log entries to clients
- Builds - use transient systemd units to execute a build inside a container
- Jobs - run one-off jobs as systemd transient units and extract their logs and output after completion
Not yet implemented:
- Integrated health check - mark containers as available once a pluggable/configurable health check passes
- Joining - reconnect to an already running operation
- Direct server to server image pulls - allow hosts to act as a distributed registry
- Job callbacks - invoke a remote endpoint after an operation completes
- Local routing - automatically distribute config for inbound and outbound proxying via HAProxy
- Repair - cleanup and perform consistency checks on stored data (most operations assume some cleanup)
- Capacity reporting - report capacity via API calls, allow precondition PUTs based on remaining capacity ("If-Match: capacity>=5"), allow capacity to be defined via config
The API is structured around fast and slow idempotent operations - all API responses should finish their primary objective in <10ms with success or failure, and either return immediately with failure, or on success additional data may be streamed to the client in structured (JSON events) or unstructured (logs from journald) form. In general, all operations should be reentrant - invoking the same operation multiple times with different request ids should yield exactly the same result. Some operations cannot be repeated because they depend on the state of external resources at a point in time (build of the "master" branch of a git repository) and subsequent operations may not have the same outcome. These operations should be gated by request identifier where possible, and it is the client's responsibility to ensure that condition holds.
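As a concrete illustration (reusing the install call shown earlier; how an explicit request id is transmitted is not covered here), a client that is unsure whether its first attempt succeeded can simply send the same request again:
$ curl -X PUT "http://localhost:43273/container/my-sample-service" -H "Content-Type: application/json" -d '{"Image": "pmorie/sti-html-app", "Started":true}'
$ # a retry of the identical request converges on the same unit rather than creating a duplicate
$ curl -X PUT "http://localhost:43273/container/my-sample-service" -H "Content-Type: application/json" -d '{"Image": "pmorie/sti-html-app", "Started":true}'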
The API takes into account the concept of "joining" - if two requests are made with the same request id, where possible the second request should attach to the first job's result and streams in order to provide an identical return value and logs. This allows clients to design around retries or at-least-once delivery mechanisms safely. The second job may check the invariants of the first as long as data races can be avoided.
All non-content-streaming jobs (which should already be idempotent and repeatable) will eventually be structured in two phases - execute and report. The execute phase attempts to assert that the state on the host is accurate (systemd unit created, symlinks on disk, data input correct) and to return a 2xx response on success or an error body and 4xx or 5xx response on error as fast as possible. API operations should not wait for asynchronous events like the stable start status of a process, the external ports being bound, or image specific data to be written to disk. Instead, those are modelled with separate API calls. The report phase is optional for all jobs, and is where additional data may be streamed to the consumer over HTTP or a message bus.
In general, the philosophy of create/fail fast operations is based around the recognition that distributed systems may fail at any time, but those failures are rare. If a failure does occur, the recovery path is for a client to retry the operation as originally submitted, to delete the affected resources, or for a resynchronization to occur. A service may take several minutes to start only to fail - since failure cannot be predicted, clients should be given tools to recognize and correct failures.
At the current time there are no resynchronization operations implemented, but the additional metadata (vector clocks or consistent versions) for that should be supportable via the existing interfaces. An orchestrator would prepare a list of the expected resource state and a reasonably synchronized clock identifier, and the agent would be able to compare that to the persisted resources on disk older than a window. The "repair" functionality on the agent would perform a similar function - ensuring that the set of persisted resources (units, links, port mappings, keys) are internally consistent, and that outside of a minimum window (minutes) any unreferenced content is removed.
Starting a Docker image on a system for the first time may involve several slow steps:
- Downloading the initial image
- Starting the process
Those steps may fail in unpredictable ways - for instance, the service may start but fail due to a configuration error and never begin listening. A client cannot know for certain the cause of the failure (unless they've solved the Halting Problem), and so a wait is nondeterministic. A download may stall for minutes or hours due to network unpredictability, or the local disk may run out of storage during the download and fail (due to other users of the system).
The API forces the client to provide the following info up front:
- A unique locator for the image (which may include the destination from which the image can be fetched)
- The identifier the process will be referenced by in future transactions (so the client can immediately begin dispatching subsequent requests)
- Any initial mapping of network ports or access control configuration for ssh
The API records the effect of this call as a unit file on disk for systemd that can, with no extra input from a client, result in a started process. The API then returns success and streams the logs to the client. A client may disconnect at this point, without interrupting the operation. A client may then begin wiring together this process with other processes in the system immediately with the explicit understanding that the endpoints being wired may not yet be available.
In general, systems wired together this way already need to deal with uncertainty of network connectivity and potential startup races. The API design formalizes that behavior - it is expected that the components "heal" by waiting for their dependencies to become available. Where possible, the host system will attempt to offer blocking behavior on a per unit basis that allows the logic of the system to be distributed. In some cases, like TCP and HTTP proxy load balancing, those systems already have mechanisms to tolerate components that may not be started.
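A hypothetical sequence (the container name front-1 and the backend address 10.0.0.2:8080 are made up for illustration) shows how a client can install a gear and wire it to a dependency immediately, without waiting for that dependency to be reachable:
$ gear install pmorie/sti-html-app localhost/front-1 --start
$ gear link -n=127.0.0.1:8081:10.0.0.2:8080 localhost/front-1
$ gear status localhost/front-1
The link is recorded even if nothing is listening on 10.0.0.2:8080 yet; the components heal once the backend becomes available.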
Assumptions:
- Gear identifiers are hexadecimal 32 character strings (may be changed) and are specified by the caller. Random distribution of identifiers is important to prevent unbalanced search trees (one way to generate such an identifier is shown after this list)
- Ports are passed by the caller and are assumed to match to the image. A caller is allowed to specify an external port, which may fail if the port is taken.
- Directories which hold ids are partitioned by integer blocks (ports) or the first two characters of the id (ids) to prevent directory sizes from growing excessively.
- The structure of persistent state on disk should facilitate administrators recovering the state of their systems using filesystem backups, and also be friendly to standard Linux toolchain introspection of their contents.
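Generating a well-distributed identifier of that form is straightforward; for example (not a geard command, just one way a caller could produce one):
$ openssl rand -hex 16    # prints 32 hexadecimal characters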
The on disk structure of geard is exploratory at the moment. The major components are described below:
/etc/systemd/system/container-active.target.wants/
ctr-abcdef.service -> <symlink>
This directory is read by systemd on boot (container-active.target is wanted by multi-user.target) to
start containers automatically. Containers stopped via the stop API call will not be started on
reboot.
/var/lib/containers/
All content is located under this root
units/
ab/
ctr-abcdef.service # hardlink to the current unit file version
ctr-abcdef.idle # flag indicating this unit is currently idle
abcdef/
<requestid> # a particular version of the unit file.
A container is considered present on this system if a service file exists inside the namespaced container
directory.
The unit file is "enabled" in systemd (symlinked to systemd's unit directory) upon creation, and "disabled"
(unsymlinked) on the remove operation. The definition can be updated atomically (write new definition,
update hardlink) when a new version of the container is deployed to the system.
If a container is idled, a flag is written to the appropriate units directory. Only containers with an
idle flag are considered valid targets for unidling.
targets/
container.target # default target
container-active.target # active target
All containers are assigned to one of these two targets - on create or start, they have
"WantedBy=container-active.target". If a container is stopped via the API it is altered to be
"WantedBy=container.target". In this fashion the disk structure for each unit reflects whether the container
should be started on reboot vs. being explicitly idled. Also, assuming the /var/lib/containers directory
is an attached disk, on node recovery each *.service file is enabled with systemd and then the
"container-active.target" can be started.
slices/
container.slice # default slice
container-small.slice # more limited slice
All slice units are created in this directory. At the moment, the two slices are defaults and are created
on first startup of the process, enabled, then started. More advanced cgroup settings must be configured
after creation, which is outside the scope of this prototype.
All containers are created in the "container-small" slice at the moment.
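Resource limits attach to these slices through normal systemd resource-control directives; a more constrained slice could look roughly like this (illustrative values only, not the defaults geard ships):
    # container-small.slice - illustrative limits, not geard's actual defaults
    [Slice]
    CPUShares=512
    MemoryLimit=512M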
env/
contents/
a3/
a3408aabfed
Files storing environment variables and values in KEY="VALUE" (one per line) form.
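For example, using the values from the set-env example earlier and the KEY="VALUE" form above, a contents file might read:
    A="B"
    B="C"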
data/
TBD (reserved for container unique volumes)
ports/
links/
3f/
3fabc98341ac3fe...24 # text file describing internal->external links to other networks
Each container has one file with one line per network link, internal port first, a tab, then
external port, then external host IP / DNS.
On startup, gear init --post attempts to convert this file to a set of iptables rules in
the container for outbound traffic.
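Following that format, a single link from local port 8081 to port 8080 on host 9.8.23.14 (borrowing the addresses from the gear link example earlier) would be stored as one tab-separated line, and might be converted into a NAT rule along these lines - an approximation, not the exact rule geard writes:
    8081	8080	9.8.23.14
    iptables -t nat -A OUTPUT -p tcp -d 127.0.0.1 --dport 8081 -j DNAT --to-destination 9.8.23.14:8080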
interfaces/
1/
49/
4900 # softlink to the container's unit file
To allocate a port, the daemon scans a block (49) of 100 ports for a set of free ports. If no ports
are found, it continues to the next block. Currently the daemon starts at the low end of the port
range and walks the disk until it finds the first free port. Worst case is that the daemon would do
many directory reads (30-50) until it finds a gap.
To remove a container, the unit file is deleted, and then any broken softlinks can be deleted.
The first subdirectory represents an interface, to allow future expansion of the external IP space
onto multiple devices, or to allow multiple external ports to be bound to the same IP (for VPC)
Example script:
sudo find /var/lib/containers/ports/interfaces -type l -printf "%l %f " -exec cut -f 1-2 {} \;
prints the port description path (the name of which is the container id), the public port,
and the value of the description file (which might have multiple lines). This would show which ports
are mismatched.
keys/
ab/
ab0a8oeunthxjqkgjfrJQKNHa7384 # text file in authorized_keys format representing a single public key
Each file represents a single public key, with the identifier being a base64-encoded SHA256 sum of
the binary value of the key. The file is stored in authorized_keys format for SSHD, but with only the
type and value sections present and no newlines.
Any key that has zero incoming links can be deleted.
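geard computes that identifier internally, but a rough shell equivalent (assuming an OpenSSH public key file id_rsa.pub; the padding and alphabet may differ from what geard actually stores) is:
    awk '{print $2}' id_rsa.pub | base64 -d | openssl dgst -sha256 -binary | base64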
access/
containers/
3f/
3fabc98341ac3fe...24/ # container id
key1 # softlink to a public key authorized to access this container
The name of each softlink should map to a container id or container label (future) - each container id should match
a user on the system to allow sshd to log in via the container id. In the future, improvements in sshd
may allow us to use virtual users.
git/
read/
ab/
ab934xrcgqkou08/ # repository id
key1 # softlink to a public key authorized for read access to this repo
write/
ab/
ab934xrcgqkou08/ # repository id
key2 # softlink to a public key authorized for write access to this repo
geard uses Docker Source to Images (STI) to build deployable images from a base image and application source. STI supports a number of use cases for building deployable images, including:
- Use a git repository as a source
- Incremental builds: downloaded dependencies and generated artifacts are re-used across builds
- Extended prepare: build and deploy on different images (compatible with incremental builds)
A number of public STI base images exist:
- pmorie/centos-ruby2 - ruby2 on centos
- pmorie/ubuntu-buildpack - foreman running on ubuntu
- pmorie/fedora-mock - a simple Webrick server for static html, on fedora
See the STI docs for information on creating your own base images to use with STI.
Apache Software License (ASL) 2.0.