Durable data #1515
Conversation
One thought that appeals to me is that a new Resource Kind can handle a central lock independently of scheduling (i.e., the request to attach / mount the data from the new Resource Kind source can block or fail in a way that makes the container fail, which offers central coordination per data element). Perhaps the new Resource Kind can be broken into two parts: allocation (which ultimately is a lot like allocating a GCE volume) and mount. The mount semantics can be managed by the server controlling replication as well (allows replication, must have recent updates within X).
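As a rough, purely illustrative sketch of the two-part split described above (allocation vs. mount, with the mount acting as a central lock per data element), using hypothetical type names that are not part of any Kubernetes API:

```go
// Hypothetical sketch only: "allocate, then mount" as two separate steps,
// where the mount step enforces a central lock independent of scheduling.
package main

import "fmt"

// DataAllocation represents the first phase: reserving storage,
// much like allocating a GCE persistent disk.
type DataAllocation struct {
	ID       string // identity of the allocated data element
	Node     string // node (minion) the data currently lives on
	SizeGB   int
	Attached bool // whether some pod currently holds the mount lock
}

// MountRequest represents the second phase: a pod asking to attach
// the allocated data. The server can reject it, causing the container
// to fail, which provides per-data-element coordination.
type MountRequest struct {
	AllocationID string
	PodID        string
}

// Attach grants the mount only if no other pod holds it.
func (a *DataAllocation) Attach(req MountRequest) error {
	if req.AllocationID != a.ID {
		return fmt.Errorf("allocation %s not found", req.AllocationID)
	}
	if a.Attached {
		return fmt.Errorf("data %s already attached; pod %s must fail or retry", a.ID, req.PodID)
	}
	a.Attached = true
	return nil
}

func main() {
	alloc := &DataAllocation{ID: "data-1", Node: "minion-3", SizeGB: 10}
	fmt.Println(alloc.Attach(MountRequest{AllocationID: "data-1", PodID: "pod-a"})) // <nil>
	fmt.Println(alloc.Attach(MountRequest{AllocationID: "data-1", PodID: "pod-b"})) // already attached
}
```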
Thanks for the feedback, @smarterclayton. Hoping for feedback from others on this soon so I can begin implementation in time for milestone 0.7.
Apparently this is not due until 1.0.
A Pod has a (current and desired) PodState, which has a ContainerManifest that lists the Volumes and Containers of a Pod. A Container always has a writeable filesystem, and it can attach Volumes to that filesystem, which are also writeable. Writes not to a volume are not visible to other containers and are not preserved past container failures. Writes to an EmptyDirectory volume are visible to other containers in the pod. An EmptyDirectory is only shared by containers which
EmptyDir writes are also visible across container restarts
I have a few concerns with some of these ideas. Chaining a pod to a specific minion just to make data "durable" seems like the wrong way to go: it gives a false sense of durability and security, and it exposes one to hardware-failure issues. I agree that pinning to specific hardware characteristics is a need, but not pinning to specific hardware. This was discussed in #2342, so it already seems possible to let specific attributes such as ssd, gpu, etc. be used in scheduling.

With a defined common interface, containers could then easily provide data. Basically, the above plugins are the equivalent of Side Containers (Data Volume Containers with specific logic). These methods would enable a mariadb image, for example, to be used with Host Volumes in development, with Data Volume Containers in staging, and, depending on the durability needs, with Side Containers exposing distributed storage systems or with a completely integrated approach, Volumes as a Service, on GKE.

For Kubernetes we could use this standard interface to mount volumes and make it unnecessary to pin a pod to a minion: use Data Volume Containers instead and move them with the pod. Easiest solution: stop, pipe to a new Data Volume Container, start the new container.

The solution I favour is a more integrated approach. Side Containers could be the custom method for users, but providing infrastructure-specific, integrated Side Containers and exposing them as Volumes as a Service would be ideal. K8s could expose a distributed data volume, which basically maps the data to a distributed persistent disk. In a self-hosted scenario, VaaS could be an admin-provided Side Container binding the volume not to a Google persistent disk but to Ceph, Gluster, or similar. The actual proposal can be found here: moby/moby#9277
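Not from the comment itself, but a minimal sketch of the kind of "defined common interface" being suggested: one volume-provider abstraction behind which a host directory, a Data Volume Container, or a distributed store could be swapped per environment. All names here (VolumeProvider, HostDirProvider, DistributedProvider) are invented for illustration and assume nothing about Kubernetes' actual volume plumbing.

```go
// Illustrative only: the same pod could bind to different volume backends
// (dev vs. staging vs. prod) without the container knowing which one.
package main

import "fmt"

// VolumeProvider abstracts where a volume's data actually comes from,
// so the consuming container never embeds backend-specific logic.
type VolumeProvider interface {
	// Mount returns the path on the node where the volume is exposed.
	Mount(podID string) (string, error)
}

// HostDirProvider: development -- data lives on the node itself.
type HostDirProvider struct{ Path string }

func (h HostDirProvider) Mount(podID string) (string, error) { return h.Path, nil }

// DistributedProvider: production -- data comes from Ceph/Gluster/PD,
// exposed through some node-level mount managed outside the pod.
type DistributedProvider struct{ Cluster, Volume string }

func (d DistributedProvider) Mount(podID string) (string, error) {
	return fmt.Sprintf("/mnt/%s/%s/%s", d.Cluster, d.Volume, podID), nil
}

func main() {
	// The same mariadb pod definition could be bound to either provider,
	// depending on the environment.
	for _, p := range []VolumeProvider{
		HostDirProvider{Path: "/con/data"},
		DistributedProvider{Cluster: "ceph-east", Volume: "mariadb"},
	} {
		path, _ := p.Mount("pod-123")
		fmt.Println("mount at:", path)
	}
}
```

The only point of the sketch is that the application container programs against Mount, never against a specific backend.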
My two cents -- permanent storage is such a basic requirement that I think it should be a service in the same way that docker or etcd are. It's all entirely in my head at this stage, but my plan for PaaS world domination will involve running something like gluster on a bunch of instances inside my cluster and mounting a single massive shared filesystem on all machines. Individual containers can then just mount parts of that filesystem as volumes. So now, at the container level, I never have to worry about which machine I come up on or whether the permanent storage will be attached, because it's always available, everywhere. The problem then becomes the platform-specific one of managing the actual storage that the gluster instances use -- EBS volumes or PDs or whatever -- but that should be a rare task. Once gluster is up and running on multiple machines, I shouldn't have to worry about it any more, and gluster should be redundant enough that losing any individual gluster machine won't hurt...
@hjwp I agree with most of it.
Durable is the wrong word. Let's forget I used it. When I get back to working on this after the holidays, I plan to choose a different name for the concept. I'm thinking of calling the new resource Kind a "nodeVol", because its two essential properties are that it is tied to a node and that it can be referenced in a VolumeSource. There are two main types of storage we will have:
@hjwp made a good distinction between using a cluster filesystem and administering a cluster filesystem. I think Kubernetes should support both, and you need "local" as a building block to provide a service that enables "remote".

@stp-ip I like your example of a mariadb that uses different volume sources at different stages of development. I had considered this. It is closely related to the need for config to be portable across cloud providers and different on-premise setups. To address this, I expect that configuration would be written as templates, where the pod's volumeSource is left unspecified, to be filled in by different instantiations of the template (prod vs. dev, etc.).

I've taken a look at moby/moby#9277 and subscribed to the discussion. One thing to note is that some forms of node-local storage don't manifest themselves as a filesystem, so filesystem standards may not apply directly. For example:
Can one of the admins verify this patch? |
@erictune These are the issues/solutions:
So I agree with most of your points and would love to see both better support for passing through disks and raw access to hardware for local usage (still favoring /con/data as the default).
I never really understood the point of volumes-from and data-volume containers. No doubt that's me being a noob, but consider our specific "remote storage" use case: I have a container app that needs access to some permanent storage and doesn't want to be tied to a particular node -- it just needs to know that, wherever it comes up, it can access said storage. Let's assume said storage is available as a mounted filesystem on the underlying machines (which in the background is implemented using whatever distributed-filesystem voodoo we like). Why would I want the extra layer of indirection of a data volume container, rather than just mounting the remote storage paths directly into my app container?
@hjwp Because then you either have the logic of a specific distributed mount inside your container, which runs counter to decoupling, or you mount a node-specific directory. Even if every node has this mountpoint and passes it through to some distributed storage, you are defining special cases. With a data volume container, for example, you don't have to mount distributed storage for project A on every node, even if it only runs on 1/10th of the nodes. Additionally, the binding software for such a mount is no longer easily upgradable via containers; it involves node updates. So there are a few negative aspects to running node-specific stuff, not decoupling, etc.

Decoupling into one container per concern is something one has to wrap their head around. Sure, you could just run apache + storage-endpoint + mysql + whatever in one container, but then one could just use a VM. Even though data-volume containers or volumes-from add a bit more complexity, they give you decoupling and a more service-oriented infrastructure. Data containers will always have a place, but in production systems such as k8s they will most likely be replaced with Volumes as a Service, i.e., k8s mounts k8s-provided volumes, which then map to git, a secret store, a distributed data store (GCE persistent disk), etc.
I'm marking this P2 as it is speculative and a design doc (and also we have limited the scope of data for v1). Also, this is kind of redundant with #2598.
Copying a quote from @erictune from #598: Based on what I learned in the discussion, I think a good first step before implementing durable data would be to develop a working prototype of a database (e.g. mysql master and slave instances) used by a replicated frontend layer. Attributes of the prototype: individually label two nodes: database-master and database-slave.
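Not part of the quoted plan, but a toy sketch of what "individually label two nodes and steer the database pods onto them" amounts to, with simplified, hypothetical label-matching semantics:

```go
// Illustration only: two labeled nodes and pods placed by node selector.
package main

import "fmt"

type Node struct {
	Name   string
	Labels map[string]string
}

type Pod struct {
	Name         string
	NodeSelector map[string]string // pod only runs on nodes matching all of these
}

// matches reports whether every selector entry is present on the node.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	nodes := []Node{
		{Name: "node-1", Labels: map[string]string{"role": "database-master"}},
		{Name: "node-2", Labels: map[string]string{"role": "database-slave"}},
	}
	pods := []Pod{
		{Name: "mysql-master", NodeSelector: map[string]string{"role": "database-master"}},
		{Name: "mysql-slave", NodeSelector: map[string]string{"role": "database-slave"}},
	}
	for _, p := range pods {
		for _, n := range nodes {
			if matches(p.NodeSelector, n.Labels) {
				fmt.Printf("%s -> %s\n", p.Name, n.Name)
			}
		}
	}
}
```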
@brendanburns I will make the example as part of my work for #2609. This is very close to the example I was already planning on building, but with a persistent disk instead of hostDir -- though in the near term I was planning on using my local host as the "persistent" disk through hostDir. Very copacetic. |
I no longer believe that we should implement the durable data concept described in this PR. Therefore I am going to close this PR. I now see that there are two use cases that should be handled separately:
Why not handle both cases with one concept? Because the thing you end up with:
For the "easy to run things that want attach to networked storage" case, we should do:
For the "possible to run things which must have local storage device access" case, we should do:
@bgrant0607 we talked just now about how I don't think we should do durable data in any of the forms previously discussed. See my post above. |
@smarterclayton |
At least for the types of networked applications I'm aware of, the utility of this snapshot is more for simple singleton processes that could potentially tolerate some loss of data (the last 10 minutes) in preference to having no data. An admin team could modify their restart / node replacement processes in order to effectively maintain a possibly reasonable SLA, without the rest of the system being aware. For instance, by adding a tool that tries to snapshot all the instances of this volume type on a node prior to beginning evacuation. I'm thinking of things like Git repositories, Jenkins servers, simple Redis cache services, test databases, QA workloads, simple collaboration-style apps, etc. Losing even 15 minutes of state is unlikely to impact those sorts of tools, and most of them are inherently single-pod stateful. There's probably a cost / effort sweet spot for loose consistency of data here.
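A purely speculative sketch of the "snapshot all instances of this volume type on a node before evacuation" tool mentioned above; the functions, paths, and flow are stand-ins, not existing Kubernetes or cloud APIs:

```go
// Hypothetical admin tool: snapshot node-local volumes, then allow drain.
package main

import (
	"fmt"
	"log"
)

type localVolume struct {
	PodName string
	Path    string // node-local path backing the volume
}

// snapshot copies the volume contents to durable storage; losing the
// last few minutes of writes is accepted by this class of workloads.
func snapshot(v localVolume) error {
	fmt.Printf("snapshotting %s (pod %s) to archive storage\n", v.Path, v.PodName)
	return nil
}

// evacuateNode snapshots every instance of the node-local volume type
// before the node is drained or replaced.
func evacuateNode(node string, volumes []localVolume) {
	for _, v := range volumes {
		if err := snapshot(v); err != nil {
			log.Printf("snapshot failed for %s: %v (continuing)", v.Path, err)
		}
	}
	fmt.Println("node", node, "ready for drain/replacement")
}

func main() {
	evacuateNode("minion-7", []localVolume{
		{PodName: "jenkins", Path: "/var/lib/k8s/vol/jenkins"},
		{PodName: "git-repos", Path: "/var/lib/k8s/vol/git"},
	})
}
```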
I'm not looking to actually merge this PR, just get some feedback on two possible approaches.
If there is agreement on one approach, then I will code it up, and convert this design document into user documentation.
I'm inclined to add a new type of REST resource to model Durable local data. I know new resources are always controversial, so I've written this doc to compare the alternatives. Comments welcome from anyone. I'd particularly like comments from @dchen1107 @bgrant0607 @thockin @brendanburns @jbeda @lavalamp @smarterclayton.
Addresses #598.
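For concreteness only, here is one hypothetical shape such a new REST resource could take, with invented type names and a simplified VolumeSource; this is a sketch of the idea under discussion, not the proposal's actual schema:

```go
// Hypothetical only: a node-tied durable-data Kind referenced from a
// pod-level volume source. None of these types exist in the Kubernetes API.
package main

import "fmt"

// DurableData is the sketched new resource Kind: storage that lives on
// a particular node and outlives any single pod or container.
type DurableData struct {
	Name   string
	Node   string // the node the data is tied to
	SizeGB int
}

// VolumeSource is a simplified stand-in for the pod-level volume source;
// exactly one field would normally be set.
type VolumeSource struct {
	EmptyDir       bool
	HostDir        string
	DurableDataRef string // reference to a DurableData object by name
}

type Volume struct {
	Name   string
	Source VolumeSource
}

func main() {
	dd := DurableData{Name: "mysql-data", Node: "node-1", SizeGB: 50}
	vol := Volume{Name: "data", Source: VolumeSource{DurableDataRef: dd.Name}}
	// A pod using this volume would have to be scheduled onto dd.Node,
	// which is exactly the coupling the later comments argue against.
	fmt.Printf("volume %q -> durable data %q on %s\n", vol.Name, vol.Source.DurableDataRef, dd.Node)
}
```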