Durable data #1515

Closed

Conversation

erictune
Member

I'm not looking to actually merge this PR, just get some feedback on two possible approaches.

If there is agreement on one approach, then I will code it up, and convert this design document into user documentation.

I'm inclined to add a new type of REST resource to model Durable local data. I know new resources are always controversial, so I've written this doc to compare the alternatives. Comments welcome from anyone. I'd particularly like comments from @dchen1107 @bgrant0607 @thockin @brendanburns @jbeda @lavalamp @smarterclayton.

Addresses #598.

@smarterclayton
Contributor

One thought that appeals to me is that a new Resource Kind can handle a central lock independent of scheduling (i.e., the request to attach / mount the data from the new Resource Kind source can block or fail in a way that makes the container fail, which offers central coordination per data element). Perhaps the new Resource Kind can be broken into two parts - allocation (which ultimately is a lot like allocating a GCE volume) and mount. The mount semantics could also be managed by the server controlling replication (allows replication, must have recent updates within X).

@erictune erictune added the area/api and kind/design labels Oct 1, 2014
@erictune erictune added this to the v0.7 milestone Oct 1, 2014
@erictune
Member Author

erictune commented Oct 1, 2014

Thanks for the feedback, smarterclayton. Hoping for feedback from others on this soon so I can begin implementation in time for milestone 0.7.

@erictune erictune mentioned this pull request Oct 1, 2014
@erictune erictune modified the milestones: v1.0, v0.7 Oct 1, 2014
@erictune
Member Author

erictune commented Oct 1, 2014

Apparently this is not due until 1.0.

@erictune erictune modified the milestones: v0.8, v1.0 Oct 1, 2014
A Pod has a (current and desired) PodState, which has a ContainerManifest that lists the Volumes and Containers of a Pod.
A Container always has a writeable filesystem, and it can attach Volumes to that filesystem, which are also writeable. Writes that
do not go to a volume are not visible to other containers and are not preserved past container failures. Writes to an
EmptyDirectory volume are visible to other containers in the pod. An EmptyDirectory is only shared by containers which
EmptyDir writes are also visible across container restarts
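
To make that shared-scratch behavior concrete, here is a minimal sketch of a pod whose two containers share one such volume. It uses present-day field names (emptyDir, volumeMounts) rather than the 2014 EmptyDirectory/ContainerManifest forms, and the pod and container names are made up for illustration:

```yaml
# Sketch only: two containers share a pod-scoped scratch volume.
# Data written to /shared survives individual container restarts,
# because the volume's lifetime is tied to the pod, not the container.
apiVersion: v1
kind: Pod
metadata:
  name: shared-scratch-example      # hypothetical name
spec:
  volumes:
  - name: scratch
    emptyDir: {}
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "while true; do date >> /shared/log; sleep 5; done"]
    volumeMounts:
    - name: scratch
      mountPath: /shared
  - name: reader
    image: busybox
    command: ["sh", "-c", "tail -f /shared/log"]
    volumeMounts:
    - name: scratch
      mountPath: /shared
```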

@eparis
Contributor

eparis commented Oct 27, 2014

@erictune I think a lot of Mark's comments are influenced heavily by my comments in #2003. He got to see it on Friday...

@stp-ip
Member

stp-ip commented Nov 21, 2014

I have a few concerns with some of the ideas. Tying a pod to a specific minion just to make data "durable" seems like the wrong way to go. It gives a false sense of durability and security.
I suggest using a more pluggable and modular interface, where the choice the user makes determines the durability - so that the user chooses to be exposed to disk failure instead of merely using "durable" data methods.
In the proposal the container/pod uses a specific directory structure. For data it would use /con/data, so it does not interfere with anything else and can be used with a lot of different providers. I agree that the simplest solution would be to bind a pod to a specific host and just use Host Volumes. These can easily be mounted and provide a "does not lose data on restart" kind of durability. On the other hand, as @smarterclayton already said, there are a lot of different projects that already provide durable data, such as Ceph, Gluster, and the actual services in AWS, GCE, etc. The best way to use these is to have a common way to switch from Host Volumes to the (in my opinion) more favoured Data Volume Containers, and on to Side Containers. Side Containers basically provide additional logic to bind the exposed volume to, for example, Ceph.

So I am not only looking at pinning pods to hosts, which does not make me feel secure as it exposes one to hardware failure: I agree that pinning to specific hardware characteristics is needed, but not pinning to specific hardware. This was discussed in #2342. It therefore already seems possible to use specific attributes such as ssd, gpu, etc. in scheduling.

With the defined common interface, containers could then easily provide data. Basically the above plugins are the equivalent of Side Containers (Data Volume Containers with specific logic). These methods would enable a mariadb image, for example, to be used with Host Volumes in development, with Data Volume Containers in staging, and, depending on the durability need, with Side Containers exposing distributed storage systems or a completely integrated approach called Volume as a Service on GKE.

For Kubernetes we could use this standard interface to mount volumes and then make it unnecessary to pin a pod to a minion: use Data Volume Containers instead and move them with the pod. Easiest solution: stop - pipe to a new Data Volume Container - start the new container.

The solution I favour is a more integrated approach. Side Containers could be the custom method for users, but providing infrastructure-specific and integrated Side Containers and exposing them as Volumes as a Service would be ideal. K8s could expose a distributed data volume, which basically maps the data to a distributed persistent disk. In a self-hosted scenario, VaaS could be an admin-provided Side Container binding the volume not to a Google persistent disk, but to Ceph, Gluster, or similar.

The actual proposal can be found here: moby/moby#9277
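
As a rough illustration of that /con/data convention (not an implemented Kubernetes feature; field names are from the present-day API and the image and volume choices are placeholders): the image always reads and writes under /con/data, and the deployment decides what backs that path.

```yaml
# Sketch: the container is written against /con/data; the volume source
# behind it (emptyDir here) is what would be swapped per environment --
# a host volume, a data volume container, a Side Container, or "VaaS".
apiVersion: v1
kind: Pod
metadata:
  name: con-data-example            # hypothetical name
spec:
  volumes:
  - name: data
    emptyDir: {}                    # placeholder backing store
  containers:
  - name: app
    image: mariadb                  # example image from the discussion
    volumeMounts:
    - name: data
      mountPath: /con/data
```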

@hjwp

hjwp commented Nov 21, 2014

My two cents -- permanent storage is such a basic requirement that I think it should be a service in the same way that docker or etcd are. It's all entirely in my head at this stage, but my plan for PaaS world domination will involve running something like gluster on a bunch of instances inside my cluster and mounting a single massive shared filesystem on all machines. Individual containers can then just mount parts of that filesystem as volumes. So, at the container level, I never have to worry about which machine I come up on or whether the permanent storage will be attached, because it's always available, everywhere.

Then the problem becomes the platform-specific one of managing the actual storage that the gluster instances use -- EBS volumes or PDs or whatever -- but that should be a rare task. Once gluster is up and running on multiple machines, I shouldn't have to worry about it any more, and gluster should be redundant enough that losing any individual gluster machine won't hurt...

@stp-ip
Member

stp-ip commented Nov 21, 2014

@hjwp I agree with most parts.
So basically you set up your Gluster pod, for example. It needs the logic to run Gluster and to know where the data lives.
Then you use this Gluster service in special Side Containers, which expose simple volumes.
These volumes can then easily be mounted in each pod where you want the data to be available.
This way you have decoupled the base container, the binding of volume to distributed storage, and the distributed storage deployment.
This is mostly outlined in moby/moby#9277. I agree that storage is a primitive, but that doesn't mean we should force durability to be one thing; we should enable durability via modular solutions.

@erictune
Member Author

@stp-ip

Durable is the wrong word. Let's forget I used it. When I get back to working on this after the holidays, I plan to choose a different name for the concept. I'm thinking of calling the new resource Kind a "nodeVol", because its two essential properties are that it is tied to a node and that it can be referenced in a VolumeSource.

There are two main types of storage we will have:

  1. local storage, which exposes the user to hardware failures and is tied to a single node, but which may have improved performance characteristics. Resource allocation is done by kubernetes.
  2. remote storage, which is not tied to a single node and which typically provides durable storage through replication. Resource allocation is done by an external system.

We definitely need both. "Local" is used for:

  1. a building block to implement services like Ceph and Gluster on top of kubernetes.
  2. applications that need the higher performance of a local source and are willing to deal with failures.

@hjwp made a good distinction between using a cluster filesystem and administering a cluster filesystem. I think kubernetes should support both, and that you need "local" as a building block to provide a service that enables "remote".

@stp-ip
You make a distinction between pods depending on a hardware type versus depending on specific hardware that has specific data on it (for the fast-restart case). I think we need to support both cases. I believe that the "new resource Kind" proposal in this PR allows users to implement either behavior.

I like your example of a mariadb that uses different volume sources at different stages of development. I had considered this. It is closely related to the need for config to be portable across cloud providers and different on-premise setups. To address this, I expect that configuration would be written as templates, where the pod's volumeSource is left unspecified, to be filled in by different instantiations of the template (prod vs. dev, etc.).
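
As a rough sketch of that templating idea (hypothetical - there was no built-in template mechanism here, and the field names below are the present-day ones), the pod definition stays fixed while each instantiation substitutes its own volume source:

```yaml
# Pod "template": only the volume source changes between environments.
apiVersion: v1
kind: Pod
metadata:
  name: mariadb-example             # hypothetical name
spec:
  containers:
  - name: mariadb
    image: mariadb
    volumeMounts:
    - name: data
      mountPath: /var/lib/mysql
  volumes:
  - name: data
    # The instantiation fills this in, e.g.:
    #   dev:     emptyDir: {}
    #   staging: hostPath: {path: /mnt/mariadb-data}
    #   prod:    gcePersistentDisk: {pdName: mariadb-disk, fsType: ext4}
    emptyDir: {}                    # dev variant shown
```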

I've taken a look at moby/moby#9277, and I've subscribed to the discussion.
The key parts of that proposal, IIUC, are directory conventions and new flavors of docker containers and volumes. The most important aspect of what is being proposed here is a way to allocate, account for, and control access to node storage resources, independent of how they are mapped into containers/volumes. Therefore, I think that the two proposals are largely complementary.

One thing to note is that some forms of node-local storage don't manifest themselves as a filesystem, so filesystem standards may not apply directly. For example:

  • a mysql database in a pod that uses innodb with raw block device access
  • an application that uses an NVMe SSD via ioctls on the block device.

@kubernetes-bot

Can one of the admins verify this patch?

@stp-ip
Member

stp-ip commented Nov 27, 2014

@erictune
The directory structure in moby/moby#9277 could be used with raw devices too. The easiest method was just mounting volumes and using directories, but in my opinion one could just mount a raw device at that mountpoint, and most likely do something similar for ioctls.
Some of the suggestions in the proposal are of a more technical nature and talk about how projects can give users easier tools for volumes. I agree that there are distinctions to be made.

These are the issues/solutions:

  • Standardized directories/mountpoints for modular images and easier deployment for users
  • A "local" way to provide volumes (host volumes)
  • A "remote" way to provide volumes (Data Volume Containers)
  • Additional logic for mounted volumes (Side Containers), providing config generation or similar
  • Platform integration of volumes (VaaS)
    • Remote volumes, such as a git-based volume or a connector Side Container for GCE persistent disk, etc.
    • Local volumes enabling passthrough abilities for raw devices and scheduling of the underlying disk, with the utility of running ceph on top of k8s

So I agree with most of your points and would love to see both better support for passing through disks and raw access to hardware for local usage (still favouring /con/data as the default).
Additionally, the integration of remote volumes as outlined in the volume plugin PR (which I can't find right now).

@hjwp

hjwp commented Nov 29, 2014

I never really understood the point of volumes-from and data-volume containers. No doubt that's me being a noob, but consider the specific "remote storage" use case: I have a container app that needs access to some permanent storage, and that doesn't want to be tied to a particular node - it just needs to know that wherever it comes up, it can access said storage. Let's assume said storage is available as a mounted filesystem on the underlying machines (which in the background is implemented using whatever distributed filesystem voodoo we like). Why would I want the extra layer of indirection of a data volume container, rather than just mounting the remote storage paths directly into my app container?
Forgive me if that's a stupid question.

@stp-ip
Member

stp-ip commented Nov 29, 2014

@hjwp Because then you either have the logic for a specific distributed mount inside your container, which runs counter to decoupling, or you mount a node-specific directory. Even if every node has this mountpoint and passes it through to some distributed storage, you are defining special cases. With a data volume container, for example, you don't have to mount the distributed storage for project A on every node, even if the project only runs on 1/10th of the nodes. Additionally, the binding software for such a mount is then not easily upgradable via containers; upgrading it involves node updates.

So there are a few negative aspects to running node-specific stuff, not decoupling, etc. Especially the decoupling into one container per concern is something one has to wrap one's head around. Sure, you could just run apache+storage-endpoint+mysql+whatever in one container, but then one could just use a VM. Even though data-volume containers or volumes-from add a bit more complexity, they give you decoupling and a more service-oriented infrastructure.

Data containers will always have a place, but in production systems such as k8s they will most likely be replaced with Volumes as a Service, i.e. k8s mounts k8s-provided volumes, which then map to git, a secret store, a distributed data store (GCE persistent disk), etc.
If you want some ideas and some insight into why one would use data volume containers or Side Containers (data volume containers with additional logic), you can take a look at moby/moby#9277.

@bgrant0607 bgrant0607 added the priority/important-soon label Dec 3, 2014
@brendandburns
Contributor

I'm marking this P2 as it is speculative and a design doc (and we have also limited the scope of data for v1). Also, this is somewhat redundant with #2598.

@brendandburns brendandburns added the priority/backlog label and removed the priority/important-soon label Dec 16, 2014
@brendandburns
Contributor

Copying a quote from @erictune from #598

Based on what I learned in the discussion, I think a good first step before implementing durable data would be to develop a working prototype of a database (e.g. mysql master and slave instance) used by a replicated frontend layer.

Attributes of prototype:

  • Individually label two nodes: database-master and database-slave.
  • Constrain the mysql master pod to database-master with a node selector, and likewise for the slave.
  • Constrain frontend pods to the remaining machines.
  • Give the mysql pods "hostDir" access so that they have direct disk access and so that the lifetime of the tables is the same as that of the VM.
  • Make it possible to narrow the scope of hostDir; make hostDir a capability that can be granted to individual pods (e.g. to mysql, but not frontends).
  • Demonstrate deleting a mysql pod and then starting a new version, with the tables still working.
  • Demonstrate rolling upgrades on the frontends.
  • Demonstrate a service with a fixed IP address with just the master as an endpoint. Then demonstrate updating the service to fail over to the slave, in conjunction with whatever mysql commands are needed to promote the slave.

I think we will learn quite a bit from such an exercise that will improve an eventual durable data implementation. And it will give users a template for what to do until we have one.
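
For reference, a sketch of the first few steps of that prototype, using present-day names (hostPath is the modern spelling of the hostDir mentioned above; the node label and paths are invented for illustration):

```yaml
# Step 1: label the node, e.g. `kubectl label node <node-name> role=database-master`
# Steps 2 and 4: pin the mysql master pod to that node and give it host disk access.
apiVersion: v1
kind: Pod
metadata:
  name: mysql-master-example        # hypothetical name
spec:
  nodeSelector:
    role: database-master           # constrains scheduling to the labeled node
  containers:
  - name: mysql
    image: mysql
    volumeMounts:
    - name: tables
      mountPath: /var/lib/mysql
  volumes:
  - name: tables
    hostPath:
      path: /var/lib/mysql-data     # the table files outlive the pod, but not the node
```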

@markturansky
Contributor

@brendanburns I will make the example as part of my work for #2609. This is very close to the example I was already planning on building, but with a persistent disk instead of hostDir -- though in the near term I was planning on using my local host as the "persistent" disk through hostDir. Very copacetic.

@erictune
Member Author

I no longer believe that we should implement the durable data concept described in this PR. Therefore I am going to close this PR.

I now see that there are two use cases that should be handled separately:

  1. make it easy to run things that want to attach to networked, truly persistent (replicated or tape-backed-up) storage.
  2. make it possible to run things which must have local storage device access.

Why not handle both cases with one concept? Because the thing you end up with:

  • has muddled concepts because it is trying to model too many cases
  • adds extra API complexity (e.g. new object which is similar to but not the same as a pod, with own lifetime)
  • adds scheduler complexity (dealing with pairing durable data).
  • is an attractive nuisance that reduces pod mobility, which will block lots of things (upgrades, autoscaling, rescheduling, etc.).

For the "easy to run things that want attach to networked storage" case, we should do:

  • each cluster has one or more types of networked storage available, that, to a first degree, are accessible to every minion.
  • use something like Make kubelet volumes be cleanly separated plugins #2598 volumes framework to extension for various networked storage solutions.
  • maybe provide a way to allow admins to export and access control subsets of those network storage soultions to users, along the general lines of WIP: Persistent Storage #3318
  • pods are still completely mobile (not bound to specific machines with specific local data.
  • Do some kind of hack to deal with the GCE 1 Writer limitation.
  • nfs and ceph clients are better examples of this category.
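
A minimal sketch of that networked-storage case (present-day field names; the server, export path, and image are placeholders): because the volume is reachable from any minion, the pod stays fully mobile.

```yaml
# Sketch: a frontend pod mounting a cluster-wide NFS export.
apiVersion: v1
kind: Pod
metadata:
  name: frontend-example            # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    volumeMounts:
    - name: shared-content
      mountPath: /usr/share/nginx/html
  volumes:
  - name: shared-content
    nfs:
      server: nfs.example.internal  # hypothetical NFS server
      path: /exports/web
      readOnly: true
```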

For the "possible to run things which must have local storage device access" case, we should do:

  • make a hostDir capability, and perhaps narrower capabilities to use specific file systems or devices.
  • use policy to limit which pods can use those capabilities. In mature installations, only "infrastructure" pods would typically get that (ones which implement e.g. ceph, cassandra, hdfs, etc. servers).

@erictune erictune closed this Jan 12, 2015
@erictune
Member Author

@bgrant0607 we talked just now about how I don't think we should do durable data in any of the forms previously discussed. See my post above.

@smarterclayton
Contributor

One addition to the networked-storage list above:

  • Down the road, create a PersistentVolume type that pretends to offer a network volume, but is really handled by a ride-along pod that periodically snapshots the volumes to some durable storage, and offer a volume type that inits to the latest snapshot or uses what's on disk.
@erictune
Member Author

@smarterclayton
With the periodic snapshots idea: if the application has other state that is not snapshottable (state on remote services), then on a recovery it will have skewed local/remote state. Not sure what fraction of apps would be able to use this?

@smarterclayton
Contributor

At least for the types of networked applications I'm aware of, the utility of this snapshot is more for simple singleton processes that could tolerate some loss of data (the last 10 minutes) in preference to having no data. An admin team could modify their restart / node replacement processes in order to maintain a reasonably good SLA, without the rest of the system being aware - for instance, by adding a tool that tries to snapshot all the instances of this volume type on a node prior to beginning evacuation.

I'm thinking of things like Git repositories, Jenkins servers, simple Redis cache services, test databases, QA workloads, simple collaboration-style apps, etc. Losing even 15 minutes of state is unlikely to impact those sorts of tools, and most of them are inherently single-pod stateful. There's probably a cost/effort sweet spot for loose consistency of data here.
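
To make the ride-along idea concrete, here is a rough sketch (not an existing feature; every name and path is hypothetical): a sidecar container periodically archives the app's node-local volume to a networked volume, so a replacement pod can start from the latest snapshot.

```yaml
# Sketch: redis keeps its working set on a node-local volume; a sidecar
# copies a snapshot of it to a networked volume every 10 minutes.
apiVersion: v1
kind: Pod
metadata:
  name: snapshotted-redis-example   # hypothetical name
spec:
  volumes:
  - name: data
    emptyDir: {}                    # stand-in for node-local storage
  - name: backup
    nfs:
      server: backup.example.internal   # hypothetical durable store
      path: /exports/snapshots
  containers:
  - name: redis
    image: redis
    volumeMounts:
    - name: data
      mountPath: /data
  - name: snapshotter
    image: busybox
    command: ["sh", "-c", "while true; do sleep 600; tar czf /backup/snap-$(date +%s).tgz -C /data .; done"]
    volumeMounts:
    - name: data
      mountPath: /data
    - name: backup
      mountPath: /backup
```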

