[GarbageCollector] Adding a proposal for server-side cascading deletion #23656
Conversation
ref #1535
```
type DeleteOptions struct {
	…
	Orphaning bool
}
```
Delete is a verb, so it's ok to use a verb here: OrphanChildren
Do we have a clear parent-child relationship defined somewhere?
Do we need something like controllerRef to materialize the same basic thing we have today for ObjectMeta.Namespace, so the relationship is explicit?
I'm in favor of finalizers. I find tombstones scary for a number of reasons (e.g., privacy, extensibility, scalability, failure modes due to lack of atomicity, change in semantics vs. what kubectl delete --cascade currently provides). They would also likely conflict with my plan to eventually use all resources as their own tombstones, and I want finalizers for a number of other use cases.
```
type ObjectMeta struct {
	…
	DeletionInProgress bool
}
```
I would like to find a way to use one of the existing deletion fields (e.g., DeletionTimestamp)
In the draft I wrote, "we can reuse ObjectMeta.DeletionTimestamp if the DeletionTimestamp does not require the resource to be deleted after the timestamp is reached".
On second thought, this shouldn't prevent us from reusing the DeletionTimestamp: whether or not we reuse it, finalizers will break the on-time deletion promise made by the DeletionTimestamp.
cc @smarterclayton.
DeletionTimestamp changes when you call delete multiple times for a resource that is undergoing graceful termination... so I think you want a different field/concept.
Technically, DeletionTimestamp is not a promise. Because we don't assume global time in the cluster, DeletionTimestamp is a best-effort record of the anticipated deletion of the resource. No component in the cluster should be using DeletionTimestamp as a clock comparison; each should use its own clock and increment it by GracefulDeletionPeriodSeconds in its own timeline.
Once DeletionTimestamp is set, it can never be unset (by API contract).
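For illustration, here is a minimal Go sketch of that pattern; `onDeletionObserved` and its signature are hypothetical, not actual Kubernetes code:

```go
package main

import (
	"fmt"
	"time"
)

// onDeletionObserved is a hypothetical handler invoked when a component
// first observes that graceful deletion has begun for an object. It
// computes a deadline from the component's own clock plus the grace
// period, rather than comparing against the object's DeletionTimestamp.
func onDeletionObserved(gracePeriodSeconds int64) time.Time {
	return time.Now().Add(time.Duration(gracePeriodSeconds) * time.Second)
}

func main() {
	fmt.Println("local deadline:", onDeletionObserved(30))
}
```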
We only need to check whether DeletionTimestamp is non-nil; its value doesn't matter.
Its value does matter for pods, right?
@smarterclayton thanks for the clarification. The comment in types.go (kubernetes/pkg/api/v1/types.go, line 144 at b60ef6f) says: `// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This …`
@derekwaynecarr, I mean its value doesn't matter to the finalizers, because finalizers only check if DeletionTimestamp is nil. And finalizers should only update objects, which won't modify the DeletionGracePeriod or DeletionTimestamp (see https://github.com/kubernetes/kubernetes/blob/master/pkg/api/validation/validation.go#L353), so introducing finalizers won't affect the graceful termination. Did I miss something?
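To make that nil-check concrete, a small sketch with simplified stand-in types (not the real API definitions):

```go
package main

import (
	"fmt"
	"time"
)

// ObjectMeta is a simplified stand-in for the real type.
type ObjectMeta struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// needsFinalization reports whether a finalizer controller should act:
// deletion has begun (timestamp set) and finalizers remain. The
// timestamp's value is never consulted.
func needsFinalization(meta ObjectMeta) bool {
	return meta.DeletionTimestamp != nil && len(meta.Finalizers) > 0
}

func main() {
	now := time.Now()
	fmt.Println(needsFinalization(ObjectMeta{DeletionTimestamp: &now, Finalizers: []string{"orphan"}}))
}
```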
@caesarxuchao - it matters for who makes the final delete call to the API server. In the current namespace pattern, the finalizer controller will send a delete when it observes all finalizers have emptied. In the case of pods, the kubelet will send a delete with ?gracePeriod=0, expecting the pod to be removed. I guess I need to think a little more about what it would mean to attach a finalizer to a pod and the associated kubelet interaction.
The following is how I imagine it will work; feel free to point out the loopholes:
In the API server's deletion handler, if gracePeriod=0 and the finalizers field is empty, the object is deleted immediately; otherwise the handler just does an update, which sets the DeletionGracePeriod.
When the last finalizer has done its job and sent an Update request to the API server, the API server notices that the finalizers are empty and checks whether gracePeriod=0. If so, the API server deletes the object; otherwise it just updates the finalizers field to empty, and some other party will send a delete with gracePeriod=0 later.
A pod may exist for a while after the kubelet sends a deletion request with gracePeriod=0, and the kubelet will continue to receive update events about the pod as its finalizers get removed, so it will send more deletion requests with gracePeriod=0, but I think that's harmless.
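A rough Go sketch of that flow, using hypothetical types and names rather than the real apiserver code:

```go
package sketch

// objectMeta is a simplified stand-in for the real ObjectMeta.
type objectMeta struct {
	Finalizers                 []string
	DeletionGracePeriodSeconds *int64
}

// handleDelete models the deletion handler: delete immediately only if
// the grace period is 0 and no finalizers remain; otherwise record the
// grace period via an update and wait for finalizers to run.
func handleDelete(obj *objectMeta, gracePeriodSeconds int64) (deleted bool) {
	if gracePeriodSeconds == 0 && len(obj.Finalizers) == 0 {
		return true // remove from the registry now
	}
	obj.DeletionGracePeriodSeconds = &gracePeriodSeconds
	return false
}

// handleUpdate models the update handler: once the last finalizer is
// removed, delete only if the recorded grace period is 0; otherwise
// leave the object for another party to delete with gracePeriod=0.
func handleUpdate(obj *objectMeta) (deleted bool) {
	return len(obj.Finalizers) == 0 &&
		obj.DeletionGracePeriodSeconds != nil &&
		*obj.DeletionGracePeriodSeconds == 0
}
```

Under this reading, the actual registry delete happens in exactly one of the two handlers, whichever observes both conditions first.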
I have a feeling this would have a significant impact on the speed of our e2e runs, and the number of flakes we may or may not encounter, unless we restructure them with the idea that orphans are OK. The namespace deletion flake is a flake that never dies. /cc @kubernetes/rh-cluster-infra - this potentially extends the namespace finalizer concept to all of the other resource types.
I need more time to reason about this, and would like to understand whether the gc or finalizer controller is bundled with the controller-manager. If so, it feels like it would give a strong argument for having shared informers, i.e. #23575
If we extend the finalizer pattern, I think it's imperative that we have a …
* If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionInProgress field to true.
* If the `ObjectMeta.Finalizers` is empty, then deletes the object.
* Update handler:
  * If the update removes the last finalizer, and the DeletionInProgress is true, then deletes the object from the registry.
If we reuse the DeletionTimestamp field, the API server also needs to check here whether DeletionGracePeriod is 0; if so, it immediately deletes the object, otherwise it just carries out the update.
```
type ObjectMeta struct {
	…
	ParentReferences []ObjectReference
}
```

**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation.
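As a concrete illustration of that example, with simplified stand-in types and hypothetical values (not the real API definitions):

```go
package main

import "fmt"

// Simplified stand-ins for the types discussed above.
type ObjectReference struct {
	Kind string
	Name string
}

type ObjectMeta struct {
	Name             string
	ParentReferences []ObjectReference
}

func main() {
	// A replica set R created by deployment D links back to D, so the
	// garbage collector can find R's parent when D is deleted.
	rs := ObjectMeta{
		Name:             "R",
		ParentReferences: []ObjectReference{{Kind: "Deployment", Name: "D"}},
	}
	fmt.Printf("%+v\n", rs)
}
```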
"Parent" doesn't have any obvious, standard meaning. OwnerReferences may be more familiar, similar to the usual concept of object/memory ownership.
We'd need to think of another term for "Children", however.
Also, we can't change the default behavior for existing APIs/versions, so cascading deletion needs to not happen by default through the API, or, equivalently, orphaning needs to occur by default.
> We'd need to think of another term for "Children", however.
How about "Dependent"? It sounds like a good match to "owner".
I'll update the terminology after we decide the winner of the two designs. The new terminology is already used in the PR that adds the API #23928.
The current Namespace finalizer is similar to the dependent object, but in this case the dependency is a cluster provider of some kind (Kubernetes, OpenShift). If we go this route for dependencies as an object reference, we need to let non-namespaced resources depend on something akin to an API provider that must be registered with the cluster. I could also see the same value in Node(s).
I guess the general question is whether cluster-scoped resources can have dependents, and on what things.
I'm not sure I follow; can you give an example?
Non-goals include:
* Releasing the name of an object immediately, so it can be reused ASAP.
* Propagating the grace period in cascading deletion.
Explicitly listed "propagating the grace period" as a non-goal of this proposal.
2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with a grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled by the ReplicaSet.
Propagating the grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector propagate the grace period when deleting dependent objects. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later, the GC will remove it once all dependents are in the graceful deletion state.
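A sketch of that tentative solution, with hypothetical names and simplified types (not the real garbage collector code):

```go
package sketch

import "time"

// object is a simplified stand-in for an API object's metadata.
type object struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

const populatingGracefulDeletion = "populating-graceful-deletion"

// maybeUnblockOwner removes the "populating graceful deletion" finalizer
// from the owner once every dependent has entered graceful deletion,
// which allows the API server to finally delete the owner from the
// registry without losing the user-specified grace period.
func maybeUnblockOwner(owner *object, dependents []*object) {
	for _, d := range dependents {
		if d.DeletionTimestamp == nil {
			return // a dependent has not begun graceful deletion yet
		}
	}
	var kept []string
	for _, f := range owner.Finalizers {
		if f != populatingGracefulDeletion {
			kept = append(kept, f)
		}
	}
	owner.Finalizers = kept
}
```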
Moved "propagating grace period" to Open Questions and noted down the tentative solution.
LGTM! @smarterclayton @liggitt any last comments before I apply the label?
GCE e2e build/test passed for commit 999677a76c85b75642683b26ea6eab490034af3a.
GCE e2e build/test passed for commit 4044e334dbc38e897092feef132f763a840fb4ef.
None from me.
GCE e2e build/test passed for commit 91cb08f9dc76724d1256b5ba9160658a9f911f13.
GCE e2e build/test failed for commit b3d7297. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.
Automatic merge from submit-queue

API changes for Cascading deletion

This PR includes the necessary API changes to implement cascading deletion with finalizers as proposed in #23656. Comments are welcome. @lavalamp @derekwaynecarr @bgrant0607 @rata @hongchaodeng
Is there a PR for this anywhere I can follow?
@yuvipanda here's a list of all the PRs so far: https://github.com/kubernetes/kubernetes/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+garbage+collector+author%3Acaesarxuchao
@caesarxuchao awesome! Do you think all of this will land for 1.3?
All the listed PRs are merged. GC will be alpha (disabled by default) for 1.3. We'll push for beta in 1.4.
Awesome! \o/ (I'll be happy to turn it on in our install once 1.3 releases)
Would you provide an update on the status of the documentation for this feature, as well as add any PRs as they are created? Not Started / In Progress / In Review / Done. Thanks
@pwittrock I sent a PR to add a user doc.
@lavalamp @bgrant0607
cc @kubernetes/sig-api-machinery @derekwaynecarr
ref #12143 #19054 #17305 (initializer proposal)