[GarbageCollector] Adding a proposal for server-side cascading deletion #23656
Conversation
ref #1535
```
type DeleteOptions struct {
	…
	Orphaning bool
}
```
Delete is a verb, so it's ok to use a verb here: OrphanChildren
Do we have a clear parent-child relationship defined somewhere?
Do we need something like controllerRef to materialize the same basic thing we have today for ObjectMeta.Namespace, so the relationship is explicit?
I'm in favor of finalizers. I find tombstones scary for a number of reasons (e.g., privacy, extensibility, scalability, failure modes due to lack of atomicity, change in semantics vs. what kubectl delete --cascade currently provides). They would also likely conflict with my plan to eventually use all resources as their own tombstones, and I want finalizers for a number of other use cases.
```
type ObjectMeta struct {
	…
	DeletionInProgress bool
}
```
I would like to find a way to use one of the existing deletion fields (e.g., DeletionTimestamp)
In the draft I wrote, "we can reuse ObjectMeta.DeletionTimestamp if the DeletionTimestamp does not require the resource to be deleted after the timestamp is reached".
On second thought, this shouldn't prevent us from reusing the DeletionTimestamp: whether or not we reuse it, finalizers will break the on-time deletion promise made by the DeletionTimestamp.
cc @smarterclayton.
DeletionTimestamp changes when you call delete multiple times for a resource that is undergoing graceful termination... so I think you want a different field/concept.
Technically, DeletionTimestamp is not a promise. Because we don't assume global time in the cluster, DeletionTimestamp is a best-effort record of the anticipated deletion of the resource. No component in the cluster should be using DeletionTimestamp as a clock comparison; each should use its own clock and increment it by GracefulDeletionPeriodSeconds in its own timeline.
Once DeletionTimestamp is set, it can never be unset (by API contract).
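For illustration, here is a minimal Go sketch of that pattern; `onDeletionObserved` and its signature are hypothetical, not actual Kubernetes code:

```go
package main

import (
	"fmt"
	"time"
)

// onDeletionObserved is a hypothetical handler invoked when a component
// first observes that graceful deletion has begun for an object. It
// computes a deadline from the component's own clock plus the grace
// period, rather than comparing against the object's DeletionTimestamp.
func onDeletionObserved(gracePeriodSeconds int64) time.Time {
	return time.Now().Add(time.Duration(gracePeriodSeconds) * time.Second)
}

func main() {
	fmt.Println("local deadline:", onDeletionObserved(30))
}
```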
We only need to check whether DeletionTimestamp is non-nil; its value doesn't matter.
Its value does matter for pods, right?
@smarterclayton thanks for the clarification. The comment in types.go (kubernetes/pkg/api/v1/types.go, line 144 at b60ef6f) says: `// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This …`
@derekwaynecarr, I mean its value doesn't matter to the finalizers, because finalizers only check if DeletionTimestamp is nil. And finalizers should only update objects, which won't modify the DeletionGracePeriod or DeletionTimestamp (see https://github.com/kubernetes/kubernetes/blob/master/pkg/api/validation/validation.go#L353), so introducing finalizers won't affect the graceful termination. Did I miss something?
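To make that nil-check concrete, a small sketch with simplified stand-in types (not the real API definitions):

```go
package main

import (
	"fmt"
	"time"
)

// ObjectMeta is a simplified stand-in for the real type.
type ObjectMeta struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// needsFinalization reports whether a finalizer controller should act:
// deletion has begun (timestamp set) and finalizers remain. The
// timestamp's value is never consulted.
func needsFinalization(meta ObjectMeta) bool {
	return meta.DeletionTimestamp != nil && len(meta.Finalizers) > 0
}

func main() {
	now := time.Now()
	fmt.Println(needsFinalization(ObjectMeta{DeletionTimestamp: &now, Finalizers: []string{"orphan"}}))
}
```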
@caesarxuchao - it matters for who makes the final delete call to the API server. In the current namespace pattern, the finalizer controller will send a delete when it observes all finalizers have emptied. In the case of pods, the kubelet will send a delete with ?gracePeriod=0, expecting the pod to be removed. I guess I need to think a little more about what it would mean to attach a finalizer to a pod and the associated kubelet interaction.
The following is how I imagine it will work; feel free to point out the loopholes:
In the API server's deletion handler, if gracePeriod=0 and the finalizers field is empty, the object is deleted immediately; otherwise the handler just does an update, which sets the DeletionGracePeriod.
When the last finalizer has done its job and sent an Update request to the API server, the API server notices that the finalizers are empty and checks whether gracePeriod=0. If so, the API server deletes the object; otherwise it just updates the finalizers field to empty, and some other party will send a delete with gracePeriod=0 later.
A pod may exist for a while after the kubelet sends a deletion request with gracePeriod=0, and the kubelet will continue to receive update events about the pod as its finalizers get removed, so it will send more deletion requests with gracePeriod=0, but I think that's harmless.
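A rough Go sketch of that flow, using hypothetical types and names rather than the real apiserver code:

```go
package sketch

// objectMeta is a simplified stand-in for the real ObjectMeta.
type objectMeta struct {
	Finalizers                 []string
	DeletionGracePeriodSeconds *int64
}

// handleDelete models the deletion handler: delete immediately only if
// the grace period is 0 and no finalizers remain; otherwise record the
// grace period via an update and wait for finalizers to run.
func handleDelete(obj *objectMeta, gracePeriodSeconds int64) (deleted bool) {
	if gracePeriodSeconds == 0 && len(obj.Finalizers) == 0 {
		return true // remove from the registry now
	}
	obj.DeletionGracePeriodSeconds = &gracePeriodSeconds
	return false
}

// handleUpdate models the update handler: once the last finalizer is
// removed, delete only if the recorded grace period is 0; otherwise
// leave the object for another party to delete with gracePeriod=0.
func handleUpdate(obj *objectMeta) (deleted bool) {
	return len(obj.Finalizers) == 0 &&
		obj.DeletionGracePeriodSeconds != nil &&
		*obj.DeletionGracePeriodSeconds == 0
}
```

Under this reading, the actual registry delete happens in exactly one of the two handlers, whichever observes both conditions first.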
I have a feeling this would have a significant impact on the speed of our e2e runs, and the number of flakes we may or may not encounter, unless we restructure them with the idea that orphans are OK. The namespace deletion flake is a flake that never dies. /cc @kubernetes/rh-cluster-infra - this potentially extends the namespace finalizer concept to all of the other resource types.
I need more time to reason about this, and would like to understand whether the gc or finalizer controller is bundled with the controller-manager. If so, it feels like it would give a strong argument for having shared informers, i.e. #23575
If we extend the finalizer pattern, I think it's imperative that we have a …
* If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionInProgress field to true.
* If the `ObjectMeta.Finalizers` is empty, then deletes the object.
* Update handler:
  * If the update removes the last finalizer, and the DeletionInProgress is true, then deletes the object from the registry.
If we reuse the DeletionTimestamp field, the API server also needs to check here whether DeletionGracePeriod is 0; if so, it immediately deletes the object, otherwise it just carries out the update.
```
type ObjectMeta struct {
	…
	ParentReferences []ObjectReference
}
```

**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation.
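As a concrete illustration of that example, with simplified stand-in types and hypothetical values (not the real API definitions):

```go
package main

import "fmt"

// Simplified stand-ins for the types discussed above.
type ObjectReference struct {
	Kind string
	Name string
}

type ObjectMeta struct {
	Name             string
	ParentReferences []ObjectReference
}

func main() {
	// A replica set R created by deployment D links back to D, so the
	// garbage collector can find R's parent when D is deleted.
	rs := ObjectMeta{
		Name:             "R",
		ParentReferences: []ObjectReference{{Kind: "Deployment", Name: "D"}},
	}
	fmt.Printf("%+v\n", rs)
}
```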
"Parent" doesn't have any obvious, standard meaning. OwnerReferences may be more familiar, similar to the usual concept of object/memory ownership.
We'd need to think of another term for "Children", however.
Also, we can't change the default behavior for existing APIs/versions, so cascading deletion needs to not happen by default through the API, or, equivalently, orphaning needs to occur by default.
> We'd need to think of another term for "Children", however.
How about "Dependent"? It sounds like a good match to "owner".
I'll update the terminology after we decide the winner of the two designs. The new terminology is already used in the PR that adds the API #23928.
The current Namespace finalizer is similar to the dependent object, but in this case the dependency is a cluster provider of some kind (Kubernetes, OpenShift). If we go this route for dependencies as an object reference, we need to let non-namespaced resources depend on something akin to an API provider that must be registered with the cluster. I could also see the same value in Node(s).
I guess the general question is whether cluster-scoped resources can have dependents, and on what things.
I'm not sure I follow; can you give an example?
Non-goals include:
* Releasing the name of an object immediately, so it can be reused ASAP.
* Propagating the grace period in cascading deletion.
Explicitly listed "propagating the grace period" as a non-goal of this proposal.
2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with a grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled by the ReplicaSet.
Propagating the grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector propagate the grace period when deleting dependent objects. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later, the GC will remove it once all dependents are in the graceful deletion state.
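A sketch of that tentative solution, with hypothetical names and simplified types (not the real garbage collector code):

```go
package sketch

import "time"

// object is a simplified stand-in for an API object's metadata.
type object struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

const populatingGracefulDeletion = "populating-graceful-deletion"

// maybeUnblockOwner removes the "populating graceful deletion" finalizer
// from the owner once every dependent has entered graceful deletion,
// which allows the API server to finally delete the owner from the
// registry without losing the user-specified grace period.
func maybeUnblockOwner(owner *object, dependents []*object) {
	for _, d := range dependents {
		if d.DeletionTimestamp == nil {
			return // a dependent has not begun graceful deletion yet
		}
	}
	var kept []string
	for _, f := range owner.Finalizers {
		if f != populatingGracefulDeletion {
			kept = append(kept, f)
		}
	}
	owner.Finalizers = kept
}
```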
Moved "propagating grace period" to Open Questions and noted down the tentative solution.
LGTM! @smarterclayton @liggitt any last comments before I apply the label?
GCE e2e build/test passed for commit 999677a76c85b75642683b26ea6eab490034af3a.
GCE e2e build/test passed for commit 4044e334dbc38e897092feef132f763a840fb4ef.
None from me.
GCE e2e build/test passed for commit 91cb08f9dc76724d1256b5ba9160658a9f911f13.
GCE e2e build/test failed for commit b3d7297. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.
Automatic merge from submit-queue

API changes for Cascading deletion

This PR includes the necessary API changes to implement cascading deletion with finalizers as proposed in #23656. Comments are welcome. @lavalamp @derekwaynecarr @bgrant0607 @rata @hongchaodeng
Is there a PR for this anywhere I can follow?
@yuvipanda here's a list of all the PRs so far: https://github.com/kubernetes/kubernetes/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+garbage+collector+author%3Acaesarxuchao
@caesarxuchao awesome! Do you think all of this will land for 1.3?
All the listed PRs are merged. GC will be alpha (disabled by default) for 1.3. We'll push for beta in 1.4.
Awesome! \o/ (I'll be happy to turn it on in our install once 1.3 releases)
Would you provide an update on the status of the documentation for this feature, as well as add any PRs as they are created? Not Started / In Progress / In Review / Done. Thanks
@pwittrock I sent a PR to add a user doc.
@lavalamp @bgrant0607
cc @kubernetes/sig-api-machinery @derekwaynecarr
ref #12143 #19054 #17305 (initializer proposal)