
[GarbageCollector] Adding a proposal for server-side cascading deletion #23656

Merged

Conversation

caesarxuchao
Member

@caesarxuchao caesarxuchao commented Mar 30, 2016

@lavalamp @bgrant0607
cc @kubernetes/sig-api-machinery @derekwaynecarr

ref #12143 #19054 #17305 (initializer proposal)



@caesarxuchao caesarxuchao added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 30, 2016
@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 30, 2016
@bgrant0607
Member

ref #1535

```
type DeleteOptions struct {
    Orphaning bool
}
```

Member

Delete is a verb, so it's ok to use a verb here: OrphanChildren

Member

Do we have a clear parent-child relationship defined somewhere?

Do we need something like controllerRef to materialize the same basic thing we have today for ObjectMeta.Namespace, so the relationship is explicit?

@bgrant0607
Member

I'm in favor of finalizers.

I find tombstones scary for a number of reasons (e.g., privacy, extensibility, scalability, failure modes due to lack of atomicity, change in semantics vs. what kubectl delete --cascade currently provides), they likely would conflict with my plan in the future to use all resources as their own tombstones, and I want finalizers for a number of other use cases.

```
type ObjectMeta struct {
    DeletionInProgress bool
}
```

Member

I would like to find a way to use one of the existing deletion fields (e.g., DeletionTimestamp)

Member Author

In the draft I wrote "we can reuse ObjectMeta.DeletionTimestamp if the DeletionTimestamp does not require the resource to be deleted after the timestamp is reached".
On second thought, this shouldn't prevent us from reusing the DeletionTimestamp: whether or not we reuse it, finalizers will break the on-time deletion promise made by the DeletionTimestamp.
cc @smarterclayton.

Member

DeletionTimestamp changes when you call delete multiple times for a resource that is undergoing graceful termination... so I think you want a different field/concept.

Contributor

Technically DeletionTimestamp is not a promise. Because we don't assume global time in the cluster, DeletionTimestamp is a best effort record of the anticipated deletion of the resource. No component in the cluster should be using DeletionTimestamp as a clock comparison (they should be using their own clock and incrementing it by GracefulDeletionPeriodSeconds in their own timeline).

Once DeletionTimestamp is set, it can never be unset (by API contract).
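
To make the clock point concrete, here is a minimal Go sketch (not from the proposal; the variable names and the 30s value are made up for illustration) of a component deriving its own deadline from its local clock and the grace period it observed, rather than comparing against DeletionTimestamp:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Instead of treating DeletionTimestamp as a promise (which would assume
	// synchronized clocks across the cluster), a component observes the delete
	// and computes its own deadline from its own clock plus the grace period.
	gracePeriodSeconds := int64(30) // example value observed on the delete request
	observedAt := time.Now()        // the component's own clock
	deadline := observedAt.Add(time.Duration(gracePeriodSeconds) * time.Second)
	fmt.Println("act no later than:", deadline)
}
```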

Member Author

We only need to check whether DeletionTimestamp is non-nil; its value doesn't matter.

Member

Its value does matter for pods, right?

Member Author

@smarterclayton thanks for the clarification. The comment in types.go (`// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. ...`) is misleading.

@derekwaynecarr, I mean its value doesn't matter to the finalizers, because finalizers only check if DeletionTimestamp is nil. And finalizers should only update objects, which won't modify the DeletionGracePeriod or DeletionTimestamp (see https://github.com/kubernetes/kubernetes/blob/master/pkg/api/validation/validation.go#L353), so introducing finalizers won't affect the graceful termination. Did I miss something?

Member

@caesarxuchao - it matters for who makes the final delete call to the API server. In the current namespace pattern, the finalizer controller will send a delete when it observes all finalizers have emptied. In the case of pods, the kubelet will send a delete with ?gracePeriod=0, expecting the pod to be removed. I guess I need to think a little more about what it would mean to attach a finalizer to a pod and the kubelet interaction associated with it.

Member Author

The following is how I imagine it will work; feel free to point out the loopholes:
In the API server's deletion handler, if gracePeriod=0 and the finalizers field is empty, the object is deleted immediately; otherwise the handler just does an update, which sets the DeletionGracePeriod.
When the last finalizer has done its job and sent an Update request to the API server, the API server notices the finalizers are empty and checks whether gracePeriod=0; if so, it deletes the object, otherwise it just updates the finalizers field to empty and some other party will send a delete with gracePeriod=0 later.

A pod may exist for a while after the kubelet sends a deletion request with gracePeriod=0, and the kubelet will continue to receive update events about the pod as its finalizers get removed, so it will send more deletion requests with gracePeriod=0, but I think that's harmless.
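
For illustration, here is a minimal toy model in Go of the flow sketched above. It is not the real apiserver registry code; the types, the in-memory `store`, and the handler names are invented for this sketch.

```go
package main

import "fmt"

// Toy model: the deletion handler deletes outright only when gracePeriod=0 and
// no finalizers remain; otherwise it records the pending deletion. The update
// handler performs the actual removal once the last finalizer is gone.
type object struct {
	name               string
	finalizers         []string
	deletionInProgress bool
	gracePeriodZero    bool
}

var store = map[string]*object{}

func handleDelete(o *object, gracePeriodZero bool) {
	o.gracePeriodZero = gracePeriodZero
	if gracePeriodZero && len(o.finalizers) == 0 {
		delete(store, o.name) // nothing to wait for
		return
	}
	o.deletionInProgress = true // persisted as an update; finalizers now run
}

func handleUpdate(o *object, newFinalizers []string) {
	o.finalizers = newFinalizers
	if o.deletionInProgress && len(o.finalizers) == 0 && o.gracePeriodZero {
		delete(store, o.name) // last finalizer done and grace period already zero
	}
	// Otherwise just persist the update; a later delete with gracePeriod=0
	// (e.g. from the kubelet, for pods) removes the object.
}

func main() {
	o := &object{name: "pod-a", finalizers: []string{"example.com/finalizer"}}
	store[o.name] = o
	handleDelete(o, true)
	fmt.Println("still stored after delete request:", len(store) == 1) // true
	handleUpdate(o, nil)
	fmt.Println("still stored after last finalizer removed:", len(store) == 1) // false
}
```

The point of the sketch is only the ordering: the object survives the delete request while finalizers remain, and the final removal happens either on the update that clears the last finalizer or on a later delete with gracePeriod=0.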

@derekwaynecarr
Member

I have a feeling this would have a significant impact on the speed of our e2e runs, and the number of flakes we may or may not encounter unless we restructure them with the idea that orphans are OK. The namespace deletion flake is a flake that never dies.

/cc @kubernetes/rh-cluster-infra - this potentially extends the namespace finalizer concept to all of the other resource types.

@derekwaynecarr
Member

I need more time to reason about this, and would like to understand whether the gc or finalizer controller is bundled with the controller-manager. If so, it feels like a strong argument for having shared informers, i.e. #23575.

@derekwaynecarr
Member

If we extend the finalizer pattern, I think it's imperative that we have a kubectl command that can be run with the proper authority to release a finalizer that is failing to respond in a timely fashion. Otherwise, you end up with permanently stuck objects.

* If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionInProgress field to true.
* If the `ObjectMeta.Finalizers` is empty, then deletes the object.
* Update handler:
  * If the update removes the last finalizer, and the DeletionInProgress is true, then deletes the object from the registry.
Member Author

If we reuse the DeletionTimestamp field, here the API server also needs to check whether DeletionGracePeriod is 0; if so, it immediately deletes the object, otherwise it just carries out the update.


**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation.
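
As a rough illustration of the field shape described here, the following Go sketch uses simplified stand-in types (not the real pkg/api definitions) and made-up names and UID; note that the discussion below and the merged API (#23928) settle on the name OwnerReferences rather than ParentReferences.

```go
package main

import "fmt"

// Simplified stand-in types sketching the proposed field shape.
type ObjectReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

type ObjectMeta struct {
	Name             string
	ParentReferences []ObjectReference
}

func main() {
	// A ReplicaSet created by a Deployment records a link back to its parent
	// at creation time (names and UID here are hypothetical).
	rs := ObjectMeta{
		Name: "frontend-12345",
		ParentReferences: []ObjectReference{{
			APIVersion: "extensions/v1beta1",
			Kind:       "Deployment",
			Name:       "frontend",
			UID:        "6c1f2a3e-0000-0000-0000-000000000000",
		}},
	}
	fmt.Printf("%+v\n", rs)
}
```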
Member

"Parent" doesn't have any obvious, standard meaning. OwnerReferences may be more familiar, similar to the usual concept of object/memory ownership.

We'd need to think of another term for "Children", however.

Also, we can't change the default behavior for existing APIs/versions, so cascading deletion needs to not happen by default through the API, or, equivalently, orphaning needs to occur by default.

Member Author

We'd need to think of another term for "Children", however.

How about "Dependent"? It sounds like a good match to "owner".

Member Author

I'll update the terminology after we decide the winner of the two designs. The new terminology is already used in the PR that adds the API #23928.

Member

The current Namespace finalizer is similar to the dependent object, but in this case the dependency is a cluster provider of some kind (Kubernetes, OpenShift). If we go this route for dependencies as an object reference, we need to let non-namespaced resources depend on something akin to an API provider that must be registered with the cluster. I could also see the same value in Node(s).

Member

I guess the general question is whether cluster-scoped resources can have dependents, and what things they can depend on.

Member

I'm not sure I follow, can you give an example?



Non-goals include:
* Releasing the name of an object immediately, so it can be reused ASAP.
* Propagating the grace period in cascading deletion.
Member Author

Explicitly listed "propagating the grace period" as a non-goal of this proposal.

@caesarxuchao caesarxuchao force-pushed the cascading-deletion-proposal branch 2 times, most recently from 5064696 to 4044e33 on May 2, 2016 23:04

2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with a grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled by the ReplicaSet.

Propagating the grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector propagate the grace period when deleting dependent objects. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later the GC will remove it when all dependents are in the graceful deletion state.
Member Author

@caesarxuchao caesarxuchao May 2, 2016

Moved "propagating grace period" to Open Qeustions and noted down the tentative solution.

@lavalamp
Member

lavalamp commented May 3, 2016

LGTM! @smarterclayton @liggitt any last comments before I apply the label?

@k8s-bot

k8s-bot commented May 3, 2016

GCE e2e build/test passed for commit 999677a76c85b75642683b26ea6eab490034af3a.

@k8s-bot

k8s-bot commented May 3, 2016

GCE e2e build/test passed for commit 4044e334dbc38e897092feef132f763a840fb4ef.

@smarterclayton
Contributor

smarterclayton commented May 3, 2016 via email

@caesarxuchao caesarxuchao force-pushed the cascading-deletion-proposal branch from 4044e33 to 91cb08f on May 3, 2016 20:16
@k8s-bot

k8s-bot commented May 3, 2016

GCE e2e build/test passed for commit 91cb08f9dc76724d1256b5ba9160658a9f911f13.

@lavalamp lavalamp added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2016
@caesarxuchao caesarxuchao force-pushed the cascading-deletion-proposal branch from 91cb08f to b3d7297 on May 4, 2016 18:17
@caesarxuchao caesarxuchao added e2e-not-required lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed e2e-not-required lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 4, 2016
@k8s-bot

k8s-bot commented May 4, 2016

GCE e2e build/test failed for commit b3d7297.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 6ecb80c into kubernetes:master May 4, 2016
k8s-github-robot pushed a commit that referenced this pull request May 6, 2016
…anges

Automatic merge from submit-queue

API changes for Cascading deletion 

This PR includes the necessary API changes to implement cascading deletion with finalizers as proposed in #23656. Comments are welcome.

@lavalamp @derekwaynecarr @bgrant0607 @rata @hongchaodeng
@yuvipanda
Contributor

Is there a PR for this anywhere I can follow?

@caesarxuchao
Member Author

@yuvipanda
Contributor

@caesarxuchao awesome! Do you think all of this will land for 1.3?

@caesarxuchao
Member Author

All the listed PRs are merged. GC will be alpha (disabled by default) for 1.3. We'll push for beta in 1.4.

@yuvipanda
Contributor

Awesome! \o/ (I'll be happy to turn it on in our install once 1.3 releases)


@pwittrock
Member

pwittrock commented Jun 17, 2016

@caesarxuchao
@lavalamp

Would you provide an update on the status of the documentation for this feature, as well as add any PRs as they are created?

Not Started / In Progress / In Review / Done

Docs Instructions for v1.3

Thanks
@pwittrock

@caesarxuchao
Member Author

@pwittrock I sent a PR to add a user doc.

@caesarxuchao caesarxuchao changed the title Adding a proposal for server-side cascading deletion [GarbageCollector] Adding a proposal for server-side cascading deletion Aug 15, 2016
Labels
kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.