[RFC] Add a proposal for synchronous garbage collection #32628

caesarxuchao · 2016-09-14T00:33:36Z

Tracking issue: #29891

@lavalamp @kubernetes/sig-api-machinery @kubernetes/api-review-team

This change is

smarterclayton · 2016-09-14T00:48:11Z

docs/proposals/synchronous-garbage-collection.md

+* When deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest.
+
+**Note**
+We cannot have circular dependencies anymore. Otherwise SynchronousGC will enter a deadlock.


How will we handle this?

Why would this be the case. It should not be the case.

caesarxuchao · 2016-09-14T04:13:12Z

docs/proposals/synchronous-garbage-collection.md

+
+**Modifications to processItem()**
+
+`processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to:


Added more detail as requested.

k8s-bot · 2016-09-14T04:51:59Z

GCE e2e build/test passed for commit d67eb53.

caesarxuchao · 2016-09-21T17:42:48Z

docs/proposals/synchronous-garbage-collection.md

+
+**kubectl**: synchronous GC can replace the **kubectl delete** reapers. Currently `kubectl delete` blocks until all dependents and the owner are deleted. To maintain this behavior, after switched to using synchronous GC, *kubectl delete* needs to poll on the removal of the owner object.
+
+## Security implications


@liggitt @deads2k might be interested in this.

caesarxuchao · 2016-09-21T17:43:04Z

docs/proposals/synchronous-garbage-collection.md

+
+**Handling circular dependencies**
+
+SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle.


@lavalamp how about handling dependency circle this way? This should break a circle timely.

I think it should track which object a user requested the sync deletion for, and which objects have the finalizer merely so that the GC can track their deletion (you could do this with two different finalizer strings, for example). The only thing that's important is that the thing the user sent gets deleted after everything else.

The only time there's a problematic circle is if users simultaneously delete all of the items in the circle. In that case the GC probably needs to downgrade someone.

I added an idea at the top that expands on this concept of two finalizer classes.

I talked with Brian about this, too-- since we don't expect users to do this on purpose, let's do the easiest possible thing and break the cycle as soon as we notice it (i.e., as soon as we add the sync finalizer to an object which depends on an object that already has the sync label, we remove that link). We can add guarantees later if needed.

Thanks. I didn't mention the "two finalizers" idea in the proposal. I think we need only one finalizer for now, and we can switch to use two finalizers when we want to support circular dependencies.

caesarxuchao · 2016-09-21T17:43:06Z

docs/proposals/synchronous-garbage-collection.md

+* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120).
+
+
+## Implications to existing clients


Added analysis of several controllers that might be affected by synchronous GC.

lavalamp · 2016-09-22T22:27:07Z

docs/proposals/synchronous-garbage-collection.md

+  // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a GarbageCollectionInProgress finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted.
+  // SynchronousGarbageCollection and OrphanDependents are exclusive.
+  // SynchronousGarbageCollection default to false.
+  // SynchronousGarbageCollection is cascading, i.e., the object’s dependents will be deleted with the same SynchronousGarbageCollection.


I don't think this is necessary, actually. It's one possible implementation.

And it may be very undesirable as it makes it harder to handle cycles.

lavalamp · 2016-09-22T22:28:11Z

docs/proposals/synchronous-garbage-collection.md

+  …
+  // If SynchronousGarbageCollection is set, the object will not be deleted immediately. Instead, a GarbageCollectionInProgress finalizer will be placed on the object. The garbage collector will remove the finalizer from the object when all depdendents are deleted.
+  // SynchronousGarbageCollection and OrphanDependents are exclusive.
+  // SynchronousGarbageCollection default to false.


We should consider defaulting it in the way that causes the smallest change in old client behavior.

(might be different for different objects)

Defaulting to false is the least disruptive way. Old clients don't expect GC.

lavalamp · 2016-09-22T22:33:44Z

docs/proposals/synchronous-garbage-collection.md

+}
+```
+
+We will introduce a new standard finalizer: const GCFinalizer string = “GarbageCollectionInProgress”


How about "CollectingGarbage".

The string for the OrphanFinalizer is "orphan" (here), so probably "collectingGarbage". I hope we had named the OrphanFinalizer "orphaning".

lavalamp · 2016-09-22T22:34:26Z

docs/proposals/synchronous-garbage-collection.md

+
+Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection.
+
+* The option is ignored if DeleteOptions.OrphanDependents is true or nil.


IMO, reject the request if the user sets both.

lavalamp · 2016-09-22T22:37:24Z

docs/proposals/synchronous-garbage-collection.md

+
+**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed.
+
+**Podgc** deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted.


If we have a problem with finalizers not getting taken off quickly enough we should solve it elsewhere. Podgc shouldn't need to do anything specific about finalizers.

lavalamp · 2016-09-22T22:39:42Z

docs/proposals/synchronous-garbage-collection.md

+
+**Podgc** deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted.
+
+**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment shouldn't adopt it, because the RS controller will not scale up/down a RS that's being deleted. Hence, `deployment controller` needs to check if a RS is being deleted before adopting it. If the RS is being deleted, then the `deployment controller` should wait for the status of the RS to show 0 replicas to avoid creating extra pods, then create a new RS.


If it's being deleted, its pods still count. The only difference is that the deployment controller shouldn't try to mutate it at that point.

lavalamp · 2016-09-22T22:40:31Z

docs/proposals/synchronous-garbage-collection.md

+
+**Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers.
+
+**PetSet controller** will be blocked by a pod with pending finalizers, so Synchronous GC might slow down its progress.


Pet set controller may wish to use synchronous deletion itself. @erictune

lavalamp · 2016-09-22T22:54:50Z

docs/proposals/synchronous-garbage-collection.md

+* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `synchronousGC queue`. The object will be marked as “GC in progress” in `uidToNode`.
+* Upon receiving the deletion event of an object, put its owner into the `synchronousGC queue`. This is to force the `GCFinalizer` (described next) to re-check if all dependents of the owner is deleted.
+
+**Addition of GCFinalizer() routine**


Again I think we can just do this in the current function that pops off of the current queue. if object.waiting { check_if_done() } else { ... }

lavalamp · 2016-09-22T22:56:30Z

docs/proposals/synchronous-garbage-collection.md

+`processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to:
+
+* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store.
+* when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest.


This is one way to do it. I guess it's probably easiest.

lavalamp · 2016-09-22T22:56:46Z

docs/proposals/synchronous-garbage-collection.md

+
+* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise Synchronous GC will not progress because the owner keeps existing in the key-value store.
+* when deleting dependents, it should use the same `DeleteOptions.SynchronousGC` as the owner’s finalizers suggest.
+* if an object has multiple owners, some owners still exit while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of those owners, `processItem()` has to remove the ownerReferences pointing to them.


s/exit/exist/

lavalamp · 2016-09-22T22:59:47Z

docs/proposals/synchronous-garbage-collection.md

+
+# Overview
+
+Some users of the server-side garbage collection need to tell if the garbage collection is done ([example](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077)). Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector.


Why do they need to know? Please add some text motivating this change.

Also clarify what exactly it means: Does it wait until all dependents are themselves in the terminating state, or until they're actually deleted and names released?

An idea might be to offer two flavors of sync GC:

Class one merely waits for all dependent objects to be in terminating state. There's no circular dependency problems with this class.

Class two waits for all dependent objects to additionally have their names released for reuse. There's a potential circular dependency problem here; one resolution is to say that the garbage collector may downgrade a single item in the circle to class one. A better fix is to allow releasing the name as a thing that happens before the object is physically deleted from the storage layer, but we currently can't do that. If we could, this wouldn't have circular dependency problems, either.

lavalamp · 2016-09-22T23:05:05Z

docs/proposals/synchronous-garbage-collection.md

+
+**Handling circular dependencies**
+
+SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle.


I think it should track which object a user requested the sync deletion for, and which objects have the finalizer merely so that the GC can track their deletion (you could do this with two different finalizer strings, for example). The only thing that's important is that the thing the user sent gets deleted after everything else.

The only time there's a problematic circle is if users simultaneously delete all of the items in the circle. In that case the GC probably needs to downgrade someone.

lavalamp

Good start, but I have some suggestions.

lavalamp · 2016-09-22T23:13:41Z

docs/proposals/synchronous-garbage-collection.md

+
+**Handling circular dependencies**
+
+SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle.


I added an idea at the top that expands on this concept of two finalizer classes.

lavalamp · 2016-09-22T23:14:33Z

docs/proposals/synchronous-garbage-collection.md

+
+## Security implications
+
+A user who is authorized to update one object can affect the synchronous GC behavior of another object. Specifically, by setting an object as a pod's owner, and setting a very long grace termination period for the pod, a user can make the synchronous GC of the owner to take long time.


This is potentially a big deal. Do we have any strategies to mitigate this? @deads2k

This is potentially a big deal. Do we have any strategies to mitigate this? @deads2k

I think this is a general problem with finalizing behavior which this proposal makes more obvious, but which always existed. Reading through, this looks like a way to provide nicer client behavior, but the mechanism is roughly the same. I don't know that you can do much better than allowing a timeout and a timeout behavior (fail or orphan), along with a status that links to offenders.

It looks very similar to the case where someone adds a finalizer ref on a namespace, but fails to start a controller that collects that finalizer. The namespace sticks around forever. @smarterclayton and I have talked about it, but we've never come up with a general solution to the stuck finalizer or stuck initializer problem.

I talked a bit with Brian and came up with the idea of the GC impersonating the user at deletion time, and if the object is not visible to that user, then ignoring or breaking the link. It seems like the GC controller can do something in this area.

The GC impersonating idea doesn't solve the scenario I described:
User A only has access to object Oa of kind Ka, but he sets Oa's ownerRef to point to Ob of kind Kb. User A also makes Oa's deletion take a very long time.
When user B, who has access to both Ka and Kb, initiates the synchronous deletion of Ob, he will find Ob's deletion takes a very long time, because of Oa.
The problem in this scenario is that user A gets the ability to affect the behavior of Ob.

along with a status that links to offenders

I think the garbage collector needs to expose the dependencies via the controller manager for debugging purposes. I'm not sure about the security implications. I think it's safe if the exposed information only contains the resource kind and the UID.

I don't know that you can do much better than allowing a timeout and a timeout behavior (fail or orphan), along with a status that links to offenders.

I'm not in favor of a timeout. #33799 (comment) suggested let users define a timeout. It's hard for user to estimate how long GC will take, because it's hard for a user to figure out how many dependents are out there. On the other hand, if GC defines a default timeout, user will never know if all dependents are actually deleted.

I think we should disallow a user from setting ownerReference to point to a resource that he doesn't have update/delete permission. I think that's possible in our current system?

I think there's a number of ways to reduce the severity of this problem. These are just some ideas.

apiserver could control the list of finalizers you're allowed to add to an object. You'd have to register a finalizer and only sysadmins would have permission to do that. This reduces the chance of an attack where you keep an object around by adding a bogus finalizer.

If we go with the finalizer registry idea, you could supply a max time along with the finalizer (probably bad idea for a number of reasons).

We could make adding a finalizer require a particular permission, although this would be a field level auth. We might need a dedicated subresource for this.

We could consider adding a restriction, like "cross-user links are {not permitted by apiserver | not honored by sync deletion by the GC}"

We could add a restriction like, you can't add a link to an object that can't see you. Although this is pretty weird because it means less permissions is better for the attackee.

Maybe better than that, just require mutual visibility.

Actually, even better-- make the sync GC mode do a permission check. If Ob->Oa, and Oa is sync deleted, but Oa cannot see Ob, then still go ahead and delete Ob, but don't consider it as something that needs to be waited for. This still lets people in some namespace get their thing turned down when a dependency is turned down, but they can't block the dependency's turndown process.

We could surface a metric, time spent in sync GC. sysadmins could get paged if it starts going up. (we should surface this metric no matter what, arguably.)

We should come up with a way to express what's blocking the deletion, maybe in an annotation. After N seconds, we add an annotation like "deletion waiting for: eviluser/pods/sleepforever to be deleted". Then a user could either manually remove that link or force delete it to make the original operation proceed.

1, 3 might be useful even out of the scope of synchronous GC.
4 is not possible today, we don't record which user owns an object.
5 and 7 do not solve #32628 (comment).
6 not sure what does "visibility" mean. If it means having create/delete/update permissions, then it's similar to what I suggested in #32628 (comment). But that might block some useful use cases ( e.g., a use wants his pod to share the fate of a database service, but he doesn't have delete permission of the database service, so he can't setup the reference).

8,9 look reasonable.

erictune · 2016-09-23T18:47:01Z

Didn't get a chance to read this but discussed in hall with @bgrant0607 and @lavalamp .

I think the following can happen:

User creates object Oa of kind Ka and gets 200 / verifies existence of Oa.
User creates object Ob of kind Kb that has parent object Oa.
GC controller is watching Ka's and Kb's.
GC controller sees Ob before it sees Oa. GC Controller concludes Ob needs to be deleted.

I don't know any way to fix this with our consistency model.

caesarxuchao · 2016-09-23T19:09:26Z

GC controller sees Ob before it sees Oa. GC Controller concludes Ob needs to be deleted.

@erictune when handling Ob, GC issues a GET to check if Oa exists.

liggitt · 2016-09-23T20:56:39Z

The Get can return a NotFound error in a HA etcd cluster.

caesarxuchao · 2016-09-23T21:05:30Z

@liggitt I don't understand HA etcd cluster well. In the case you described, is it because one etcd has processed the creation of the object, but the consensus of the etcd cluster is that the object does not exist?

lavalamp · 2016-09-23T21:13:06Z

We should do consistent reads, so that shouldn't be true.

On Fri, Sep 23, 2016 at 2:05 PM, Chao Xu notifications@github.com wrote:

@liggitt https://github.com/liggitt I don't understand HA etcd cluster
well. In the case you described, is it because one etcd has processed the
creation of the object, but the consensus of the etcd cluster is that the
object does not exist?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#32628 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAnglsu59U0PMI0ZP9iz6kqdZBUNXCpfks5qtD8kgaJpZM4J8Saa
.

deads2k · 2016-10-11T19:53:25Z

Regarding admission controllers, I think they are a bunch of plugins. I don't think we should write an admission controller whose sole responsibility is checking ownerReferences, but I don't see which existing admission controller can take this responsibility. Do you have suggestions?

I would use a separate admission plugin. We have a proposal to ease admission, particularly the ordering aspects which are extremely difficult to reason about today (https://github.com/kubernetes/kubernetes/pull/34131/files#diff-2be0f77c0f5171454aa4db95bd66d342R110). Given a structure like that, its pretty easy to add yourself to the chain in an appropriate location.

If you like, I can create one for owner references that requires permission to delete an object before allowing someone to set any ownerrefs to be used as a starting point.

caesarxuchao · 2016-10-11T20:53:39Z

If you like, I can create one for owner references that requires permission to delete an object before allowing someone to set any ownerrefs to be used as a starting point.

Thanks, that would be great, though we need to get both proposals in first.

caesarxuchao · 2016-10-11T21:06:35Z

docs/proposals/synchronous-garbage-collection.md

+)
+```
+
+`DeleteOptions.OrphanDependents *bool` will be removed. We decided not to add a `DeleteOptions.SynchronousGC *bool`, because together with `DeleteOptions.OrphanDependents` it results in 9 possible combinations and is thus confusing. This is a breaking change, so the new DeleteOptions will live in `api.k8s.io` group with version `v1`, as proposed in [#33900](https://github.com/kubernetes/kubernetes/pull/33900).


@smarterclayton @lavalamp can we make breaking changes to DeleteOptions and move it to api.k8s.io/v1 in 1.5?

Don't remove the old field, just add the new field and declare that only one of them may be set.

Mark the old field deprecated and say we'll remove it in 1.7.

It's easy to look at both fields and figure out which one the client set, and do the right thing.

lavalamp · 2016-10-11T22:01:15Z

docs/proposals/synchronous-garbage-collection.md

+  …
+  // Whether and how garbage collection will be performed.
+  // Defaults to GarbageCollectionDefault
+  GarbageCollectionPolicy GarbageCollectionPolicy


Make it a pointer--it's optional.

lavalamp · 2016-10-11T22:01:29Z

docs/proposals/synchronous-garbage-collection.md

+  GarbageCollectionPolicy GarbageCollectionPolicy
+}
+
+type GarabgeCollectionPolicy string


s/Garabge/Garbage/

smarterclayton · 2016-10-11T22:12:10Z

I really don't like calling this Garbage Collection for end users. It's
really not garbage, it's dependent object deletion. I feel pretty strongly
we should be aligning what this does, not that it is

On Oct 11, 2016, at 3:01 PM, Daniel Smith notifications@github.com wrote:

@lavalamp commented on this pull request.

In docs/proposals/synchronous-garbage-collection.md
#32628 (review)
:

+}
++ +The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly setBlockOwnerDeletion allows the master to properly authorize the request. + +## DeleteOptions + +go
+DeleteOptions {

…

// Whether and how garbage collection will be performed.

// Defaults to GarbageCollectionDefault

GarbageCollectionPolicy GarbageCollectionPolicy
+}

+type GarabgeCollectionPolicy string

s/Garabge/Garbage/

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#32628 (review),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABG_p6frJo0FtY4zORmN4slpu8zsyGt3ks5qzAdHgaJpZM4J8Saa
.

lavalamp · 2016-10-11T22:18:42Z

How about

PropagationPolicy DeletionPropagationPolicy

as a name?

On Tue, Oct 11, 2016 at 3:12 PM, Clayton Coleman notifications@github.com
wrote:

I really don't like calling this Garbage Collection for end users. It's
really not garbage, it's dependent object deletion. I feel pretty strongly
we should be aligning what this does, not that it is

On Oct 11, 2016, at 3:01 PM, Daniel Smith notifications@github.com
wrote:

@lavalamp commented on this pull request.

In docs/proposals/synchronous-garbage-collection.md
<#32628 (review)
3774163>
:

+}
++ +The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly setBlockOwnerDeletion allows the master to properly authorize the request. + +## DeleteOptions + +go
+DeleteOptions {

…

// Whether and how garbage collection will be performed.

// Defaults to GarbageCollectionDefault

GarbageCollectionPolicy GarbageCollectionPolicy
+}
+
+type GarabgeCollectionPolicy string

s/Garabge/Garbage/

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#32628 (review)
3774163>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABG_
p6frJo0FtY4zORmN4slpu8zsyGt3ks5qzAdHgaJpZM4J8Saa>
.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#32628 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAnglvXEKdZaNuQ9KVet-6ChhCxdJ6z6ks5qzAnEgaJpZM4J8Saa
.

caesarxuchao · 2016-10-11T22:25:16Z

How about this?

DeleteOptions {
  …
  DeletionPropagationPolicy DeletionPropagationPolicy
}

type DeletionPropagationPolicy string

const (
    DefaultDeletionPropagationPolicy DeletionPropagationPolicy = "Default"
    OrphanDependents DeletionPropagationPolicy = "OrphanDependents"
    DeleteDependentsInBackground DeletionPropagationPolicy = "DeleteDependentsInBackground"
    DeleteAfterDependentsAreDeleted DeletionPropagationPolicy = "DeleteAfterDependentsAreDeleted"
)

caesarxuchao · 2016-10-12T02:57:05Z

docs/proposals/synchronous-garbage-collection.md

+  …
+  // Whether and how garbage collection will be performed.
+  // Defaults to DefaultPropagationPolicy
+  DeletionPropagationPolicy *DeletionPropagationPolicy


@smarterclayton how about these new names?

Other than tense (not sure if it should be DeletePropagation or DeletionPropagation) I like this much more (they relate to propagation of deletes)

Do you mean "Delete" is present tense and "Deletion" is sort of perfect tense? I'll leave this to native speakers.

The field name can probably just be PropagationPolicy, it's redundant to say "delete" since it's in the DeleteOptions already. (Type name should still include "Delete")

The field name can probably just be PropagationPolicy

Done.

smarterclayton · 2016-10-12T13:35:43Z

Probably "deletePropagationPolicy" is most correct.

On Tue, Oct 11, 2016 at 9:26 PM, Chao Xu notifications@github.com wrote:

@caesarxuchao commented on this pull request.

In docs/proposals/synchronous-garbage-collection.md
#32628:
// Defaults to false.
// To set this field, a user needs "update" and "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned.
BlockOwnerDeletion *bool
+}
++ +The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly setBlockOwnerDeletion allows the master to properly authorize the request. + +## DeleteOptions + +go
+DeleteOptions {
…

// Whether and how garbage collection will be performed.

// Defaults to DefaultPropagationPolicy

DeletionPropagationPolicy *DeletionPropagationPolicy
Do you mean "Delete" is present tense and "Deletion" is sort of perfect
tense? I'll leave this to native speakers.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#32628, or mute the thread
https://github.com/notifications/unsubscribe-auth/ABG_p6H_dt-Y6btx9Tf_VEb4OYo48hItks5qzGF2gaJpZM4J8Saa
.

caesarxuchao · 2016-10-12T16:33:18Z

@krousey @lavalamp @deads2k are there more comments? This a 1.5 feature. Thanks.

deads2k · 2016-10-12T17:13:32Z

@krousey @lavalamp @deads2k are there more comments? This a 1.5 feature. Thanks.

I agree with @smarterclayton's nit on the name ("delete propagation" instead of "deletion propagation").

lgtm

caesarxuchao · 2016-10-12T20:21:39Z

docs/proposals/synchronous-garbage-collection.md

+
+**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store.
+
+Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `DeletePropagationPolicy=DeleteDependentsInBackground` and return immediately.


@deads2k I added --wait to kubectl delete

lavalamp · 2016-10-12T20:34:09Z

I'd argue that it should be "deletionOptions" and "deletionPropagationPolicy" but it's hard to change the former, so we might as well be consistent and say "deletePropagationPolicy".

lavalamp

comments mostly on the api.

lavalamp · 2016-10-12T21:42:02Z

docs/proposals/synchronous-garbage-collection.md

+DeleteOptions {
+  …
+  // Whether and how garbage collection will be performed.
+  // Defaults to DefaultPropagationPolicy


// Either this field or OrphanDependents may be set, but not both.

lavalamp · 2016-10-12T21:43:38Z

docs/proposals/synchronous-garbage-collection.md

+
+const (
+    // Respects the existing garbage collection related finalizers on the object and the default garbage collection policy of the resource.
+    DefaultPropagationPolicy DeletePropagationPolicy = "Default"


DeletePropagationPolicyDefault
DeletePropagationPolicyOrphan
DeletePropagationPolicyBackground
DeletePropagationPolicyForeground

...or make the names consistent some other way. Maybe the word "policy" isn't required.

In the comment, note explicitly that the default depends on the object type and the controller for the object. (If the controller pre-adds the sync finalizer label, then you'll get foreground behavior as the default unless you specifically request something else).

lavalamp · 2016-10-12T21:44:54Z

docs/proposals/synchronous-garbage-collection.md

+    // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store.
+    // API sever will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp.
+    // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous.
+    DeleteAfterBlockingDependentsAreDeleted DeletePropagationPolicy = "DeleteAfterBlockingDependentsAreDeleted"


I think this name is not going to work. I prefer something short and not immediately obvious (like Foreground) so that users will read the longer, hopefully very clear, description in the documentation.

lavalamp · 2016-10-12T21:47:17Z

docs/proposals/synchronous-garbage-collection.md

+
+## API Server
+
+`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeleteAfterBlockingDependentsAreDeleted`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`.


What if the finalizer is already present?

(i.e., if it's present and they ask for orphan or background gc, we need to remove it)

caesarxuchao · 2016-10-12T23:22:39Z

docs/proposals/synchronous-garbage-collection.md

+type DeletePropagationPolicy string
+
+const (
+    // The default depends on the existing finalizers on the object and the type of the object.


In the comment, note explicitly that the default depends on the object type and the controller for the object. (If the controller pre-adds the sync finalizer label, then you'll get foreground behavior as the default unless you specifically request something else).

@lavalamp Is the current comment clear? I think saying the default depends on the "existing finalizers" is more general than saying it depends on "the controller for the object".

Also, do we need a DeletePropagationDefault? Can we use nil to mean default?

Ok, I'll keep the default for now. We can move the comment to DeleteOptions.PropagationPolicy if we later decide DeletePropagationDefault is redundent.

lavalamp · 2016-10-13T00:09:20Z

Comment looks good. Leaving a default might be good just to have a place to
put documentation, but I don't feel super strongly.

On Wed, Oct 12, 2016 at 4:23 PM, Chao Xu notifications@github.com wrote:

@caesarxuchao commented on this pull request.

In docs/proposals/synchronous-garbage-collection.md
#32628 (review)
:

+## DeleteOptions
+
+```go
+DeleteOptions {

…

// Whether and how garbage collection will be performed.

// Defaults to DeletePropagationDefault

// Either this field or OrphanDependents may be set, but not both.

PropagationPolicy *DeletePropagationPolicy
+}

+type DeletePropagationPolicy string
+
+const (

// The default depends on the existing finalizers on the object and the type of the object.

In the comment, note explicitly that the default depends on the object
type and the controller for the object. (If the controller pre-adds the
sync finalizer label, then you'll get foreground behavior as the default
unless you specifically request something else).

@lavalamp https://github.com/lavalamp Is the current comment clear? I
think saying the default depends on the "existing finalizers" is more

general than saying it depends on "the controller for the object".

In docs/proposals/synchronous-garbage-collection.md
#32628 (review)
:

+## DeleteOptions
+
+```go
+DeleteOptions {

…

// Whether and how garbage collection will be performed.

// Defaults to DeletePropagationDefault

// Either this field or OrphanDependents may be set, but not both.

PropagationPolicy *DeletePropagationPolicy
+}

+type DeletePropagationPolicy string
+
+const (

// The default depends on the existing finalizers on the object and the type of the object.

Also, do we need a DeletePropagationDefault? Can we use nil to mean
default?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#32628 (review),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAngltsu4xlMUfmNf9xbTSbZzaf0eZShks5qzWvtgaJpZM4J8Saa
.

caesarxuchao · 2016-10-14T18:13:52Z

@lavalamp can we merge the proposal? @deads2k had given his lgtm (#32628 (comment))

caesarxuchao · 2016-10-17T21:41:31Z

Is there any objection merging the proposal? If not I'll apply the lgtm tomorrow.

caesarxuchao · 2016-10-18T19:13:44Z

Applying the lgtm label.

There is not much time left for 1.5. I'll focus on converting controllers to use client-go. Probably I'll do the implementation of this in 1.6.

k8s-github-robot · 2016-10-18T19:37:42Z

Automatic merge from submit-queue

@lavalamp

Automatic merge from submit-queue [RFC] Add a proposal for synchronous garbage collection Tracking issue: kubernetes#29891 @lavalamp @kubernetes/sig-api-machinery @kubernetes/api-review-team

add a proposal for synchronous garbage collection

5d9e361

caesarxuchao added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Sep 14, 2016

caesarxuchao assigned lavalamp Sep 14, 2016

googlebot added the cla: yes label Sep 14, 2016

k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Sep 14, 2016

smarterclayton reviewed Sep 14, 2016
View reviewed changes

addressing comments

d67eb53

caesarxuchao reviewed Sep 14, 2016
View reviewed changes

more

b58746f

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 21, 2016

addressing comments

0298cd3

caesarxuchao force-pushed the sync-gc branch from cba45c3 to 0298cd3 Compare September 21, 2016 17:11

caesarxuchao commented Sep 21, 2016

View reviewed changes

lavalamp reviewed Sep 22, 2016

View reviewed changes

lavalamp suggested changes Sep 22, 2016

View reviewed changes

caesarxuchao commented Oct 11, 2016

View reviewed changes

lavalamp reviewed Oct 11, 2016

View reviewed changes

caesarxuchao commented Oct 12, 2016

View reviewed changes

addressing comment; rename user facing "Garbage Collection"

214bfec

caesarxuchao force-pushed the sync-gc branch from 545481e to 214bfec Compare October 12, 2016 04:22

fixing nits; adding --wait to kubetl delete

5880a6d

caesarxuchao commented Oct 12, 2016

View reviewed changes

lavalamp reviewed Oct 12, 2016

View reviewed changes

change the field name to PropagationPolicy

fb81c1e

caesarxuchao force-pushed the sync-gc branch from 154db54 to fb81c1e Compare October 12, 2016 23:20

caesarxuchao commented Oct 12, 2016

View reviewed changes

caesarxuchao added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 18, 2016

k8s-github-robot merged commit 97525c0 into kubernetes:master Oct 18, 2016

caesarxuchao mentioned this pull request Oct 26, 2016

[RFC][GarbageCollector] expose the progress of garbage collection #29891

Closed


		Modifications to processItem()

		`processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to:


		kubectl: synchronous GC can replace the kubectl delete reapers. Currently `kubectl delete` blocks until all dependents and the owner are deleted. To maintain this behavior, after switched to using synchronous GC, kubectl delete needs to poll on the removal of the owner object.

		## Security implications


		Handling circular dependencies

		SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily detecting circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it searches the internal owner-dependency relationship graph (`uidToNode`) to check if the object and any of its owner are in a circle where all objects have the `GCFinalizer`. If so, it removes the `GCFinzlier` from the object to break the circle.

		* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, “Synchronous GC” is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120).


		## Implications to existing clients


		Delete() function needs to check the DeleteOptions.SynchronousGarbageCollection.

		* The option is ignored if DeleteOptions.OrphanDependents is true or nil.


		Node controller forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed.

		Podgc deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted.


		Podgc deletes terminated pods if there are too many of them in the cluster. `Podgc` should remove any pending finalizers to make sure the pods are deleted.

		Deployment controller adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment shouldn't adopt it, because the RS controller will not scale up/down a RS that's being deleted. Hence, `deployment controller` needs to check if a RS is being deleted before adopting it. If the RS is being deleted, then the `deployment controller` should wait for the status of the RS to show 0 replicas to avoid creating extra pods, then create a new RS.


		Replication controller manager, Job controller, and ReplicaSet controller ignore pods in terminated phase, so pods with pending finalizers will not block these controllers.

		PetSet controller will be blocked by a pod with pending finalizers, so Synchronous GC might slow down its progress.


		# Overview

		Some users of the server-side garbage collection need to tell if the garbage collection is done ([example](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077)). Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism to enable such use cases: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector.


		## Security implications

		A user who is authorized to update one object can affect the synchronous GC behavior of another object. Specifically, by setting an object as a pod's owner, and setting a very long grace termination period for the pod, a user can make the synchronous GC of the owner to take long time.


		kubectl: synchronous GC can simplify the kubectl delete reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store.

		Note that this changes the behavior of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `DeletePropagationPolicy=DeleteDependentsInBackground` and return immediately.


		## API Server

		`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeleteAfterBlockingDependentsAreDeleted`, the API server will update the object instead of deleting it, add the finalizer, and set the `ObjectMeta.DeletionTimestamp`.

[RFC] Add a proposal for synchronous garbage collection #32628

[RFC] Add a proposal for synchronous garbage collection #32628

Conversation

caesarxuchao commented Sep 14, 2016 • edited by spxtr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-bot commented Sep 14, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lavalamp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

erictune commented Sep 23, 2016

caesarxuchao commented Sep 23, 2016

liggitt commented Sep 23, 2016

caesarxuchao commented Sep 23, 2016

lavalamp commented Sep 23, 2016

deads2k commented Oct 11, 2016

caesarxuchao commented Oct 11, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smarterclayton commented Oct 11, 2016

@lavalamp commented on this pull request.

lavalamp commented Oct 11, 2016

@lavalamp commented on this pull request.

caesarxuchao commented Oct 11, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smarterclayton commented Oct 12, 2016

@caesarxuchao commented on this pull request.

caesarxuchao commented Oct 12, 2016

deads2k commented Oct 12, 2016

Choose a reason for hiding this comment

lavalamp commented Oct 12, 2016

lavalamp left a comment

Choose a reason for hiding this comment

caesarxuchao commented Sep 14, 2016 •

edited by spxtr

Loading

caesarxuchao commented Oct 11, 2016 •

edited

Loading