
Handle deleting external load balancers across kube-controller-manager restarts #15203

Closed
a-robinson opened this issue Oct 7, 2015 · 34 comments
Labels
area/teardown lifecycle/rotten priority/important-soon sig/network

Comments

@a-robinson
Contributor

Right now, the service controller has no sense of responsibility for the load balancer resources that exist in a cloud user's project. It only deletes things that it knows it owns, meaning that if it gets restarted while deleting a load balancer, it won't finish the deletion after the restart (and if it's simply not running when a load balancer gets deleted, it won't even start deleting it).

Fixing this should reduce resource leaks, especially in any e2e tests that involve restarting (including upgrading) the master.
Cross-referencing various leak-related issues: #13515, #14597, #15083, #15084

@davidopp @quinton-hoole

@a-robinson a-robinson added the priority/important-soon and team/control-plane labels on Oct 7, 2015
@bprashanth
Contributor

I would recommend getting the service controller out of kube-controller-manager and into a pod. We know how to handle shutdown for pods (terminationGracePeriod). Of course, there are things we need to do on startup to prevent leaking (or getting wedged) that the pod won't help with.

@thockin
Member

thockin commented Oct 7, 2015

We need a bit of a pow-wow on this. There's value in busting controllers out into pods, but there are also costs. There's cost to putting them all in one monolith, but there's also value.

We should work up a plan for this across existing L3 balancers, upcoming L7 (provisioning, binding), and storage (provisioning, recycling, binding) - all of which have slightly different answers right now.

@markturansky @smarterclayton @saad-ali @davidopp

Options:

  1. put all control loops for supported clouds into the monolithic controller manager
  2. break each individual control loop out into a pod
  3. make a per-cloud-provider pod that monoliths all the control loops for that provider

Nothing stops users/admins from running arbitrarily many pods with control loops, but we need to decide how many we want to have in a default installation on (for example, but not limited to) GCE. We already take flak for "too many components".

@aronchick FYI


@markturansky
Contributor

@a-robinson add #14443 to your list of potential resource leaks.

@ghost

ghost commented Oct 7, 2015

@a-robinson the not-yet-deleted load balancers are still implicitly listed in etcd, associated with the relevant services, right? So on startup, the service controller could find them there (via the api-server) and repopulate its state? Or is it more complicated than that?

@a-robinson
Contributor Author

@quinton-hoole - no, the issue is for services that have been deleted via the API (or have been updated to remove their load balancer via the API). The service controller is making the changes asynchronously from the API updates, so unless it can access the old state of the services, it won't know that the service ever existed or had a load balancer, respectively.

Services that still exist with their load balancer when the service controller restarts are handled fine if they're later deleted.

@ghost

ghost commented Oct 7, 2015

@a-robinson My apologies - I did not state my point clearly. Let me try again.

Should a deleted service not remain in "deleting" state (in etcd) until it has been completely and successfully deleted (including all GCE resource deletion)? If so, it would remain in etcd, and the process I described above could work?

@a-robinson
Contributor Author

Ah, sorry for the misunderstanding. If we were to add a "deleting" state for service objects, and only the service controller could move them out of that state, then this would indeed be fixed (and that pattern would also work for other objects that need external resources to be cleaned up asynchronously).
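
To make that concrete, here's a minimal sketch of a finalizer-style "deleting" state (the type names, the finalizer string, and the cloud interface are made up for illustration; this is not the actual service controller code):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical finalizer name; anything unique to the service controller would do.
const lbFinalizer = "service.kubernetes.io/load-balancer-cleanup"

// Simplified stand-in for a Service under graceful deletion: the API server
// sets DeletionTimestamp but keeps the object until all finalizers are removed.
type Service struct {
	Name              string
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// Stand-in for the cloud provider's load balancer API.
type Cloud interface {
	EnsureLoadBalancerDeleted(serviceName string) error
}

// syncService is the core of the idea: after a restart, the controller re-lists
// services, finds any that are still "deleting" (DeletionTimestamp set and the
// finalizer present), and finishes tearing down the external load balancer
// before letting the object disappear.
func syncService(c Cloud, svc *Service) error {
	if svc.DeletionTimestamp == nil || !has(svc.Finalizers, lbFinalizer) {
		return nil // not being deleted, or cloud resources already released
	}
	if err := c.EnsureLoadBalancerDeleted(svc.Name); err != nil {
		return err // retried later; the object stays in the "deleting" state
	}
	svc.Finalizers = remove(svc.Finalizers, lbFinalizer)
	// A real controller would now write the updated object back through the API server.
	return nil
}

func has(list []string, s string) bool {
	for _, x := range list {
		if x == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	out := list[:0]
	for _, x := range list {
		if x != s {
			out = append(out, x)
		}
	}
	return out
}

type fakeCloud struct{}

func (fakeCloud) EnsureLoadBalancerDeleted(name string) error {
	fmt.Println("deleting cloud load balancer for", name)
	return nil
}

func main() {
	now := time.Now()
	svc := &Service{Name: "frontend", DeletionTimestamp: &now, Finalizers: []string{lbFinalizer}}
	if err := syncService(fakeCloud{}, svc); err != nil {
		panic(err)
	}
	fmt.Println("remaining finalizers:", svc.Finalizers)
}
```

The key property is that the object only leaves etcd once the external resource is confirmed gone, so a restart in the middle just means the next sync picks the work back up.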

@smarterclayton
Contributor

The deleting state would be graceful deletion. You can implement that today.


@smarterclayton
Contributor

That said, another way to handle this is via resynchronization - I thought we already have controller implementations that could provide the framework to do this synchronization for us (retrieve all load balancers from GCE and compare to the list of services?). This won't be the only case where we have to sync external resources - the deleting state doesn't make the experience better, as long as we can correctly identify which load balancers should no longer exist.
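
A rough sketch of that compare-and-collect resync, with a made-up interface standing in for the cloud provider (not the real controller framework):

```go
package main

import "fmt"

// Made-up stand-in for the provider's load balancer API.
type CloudLBs interface {
	ListLoadBalancers(clusterID string) ([]string, error)
	DeleteLoadBalancer(name string) error
}

// resyncLoadBalancers deletes any load balancer the cloud reports for this
// cluster that no current Service of type LoadBalancer wants, regardless of
// what the controller remembered before a restart.
func resyncLoadBalancers(c CloudLBs, clusterID string, wanted map[string]bool) error {
	existing, err := c.ListLoadBalancers(clusterID)
	if err != nil {
		return err
	}
	for _, name := range existing {
		if wanted[name] {
			continue
		}
		fmt.Printf("load balancer %q has no matching service, deleting\n", name)
		if err := c.DeleteLoadBalancer(name); err != nil {
			return err
		}
	}
	return nil
}

// Tiny in-memory fake so the sketch runs end to end.
type fakeLBs map[string]bool

func (f fakeLBs) ListLoadBalancers(string) ([]string, error) {
	var names []string
	for n := range f {
		names = append(names, n)
	}
	return names, nil
}

func (f fakeLBs) DeleteLoadBalancer(name string) error {
	delete(f, name)
	return nil
}

func main() {
	lbs := fakeLBs{"lb-frontend": true, "lb-orphan": true}
	_ = resyncLoadBalancers(lbs, "cluster-1", map[string]bool{"lb-frontend": true})
	fmt.Println("remaining:", lbs)
}
```

The hard part, as noted, is deciding which of the provider's load balancers this cluster is allowed to consider its own.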


@markturansky
Contributor

> This won't be the only case where we have to sync external resources

Volumes need to be reconciled with the provider: #14443

@thockin
Member

thockin commented Oct 8, 2015

Maybe we should, like Docker containers, claim a prefix as "assumed for kubernetes". The problem is that we're sharing an unmanaged space with non-kubernetes users.


@davidopp
Member

davidopp commented Oct 8, 2015

This is not a 1.1 blocker, right?

@markturansky
Contributor

@thockin I reversed the order in which I create things in our Provisioner. I now create the PV first (with a faux placeholder ID to pass validation). If that transaction succeeds, then we provision the volume in the provider. We can feed the PV's real name to the EBS/GCEPD/Cinder volume for desc/tags.

Recon for volumes will involve querying the provider for all volumes and using each PV's name to make sure it's still in the cluster.

Not sure how it should work for other resources, but this approach is working well for volumes, I think.
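
A compile-and-run sketch of that ordering, with illustrative interfaces rather than the real provisioner or cloud provider APIs:

```go
package main

import "fmt"

// Illustrative interfaces; not the real provisioner or cloud provider APIs.
type PVStore interface {
	CreatePV(name, placeholderVolumeID string) error
	BindVolume(pvName, volumeID string) error
}

type VolumeProvider interface {
	// CreateVolume records the PV name on the cloud volume (description/tags),
	// so an orphaned volume can always be traced back to its PV later.
	CreateVolume(pvName string) (volumeID string, err error)
}

// provision follows the ordering described above: write the PV record first
// (with a placeholder ID to pass validation), and only then create the cloud
// volume. If the API write fails, nothing has been created in the cloud, so
// nothing can leak; if the cloud call fails, the placeholder PV marks the work
// left to retry or clean up.
func provision(store PVStore, prov VolumeProvider, pvName string) error {
	if err := store.CreatePV(pvName, "placeholder"); err != nil {
		return err
	}
	volID, err := prov.CreateVolume(pvName)
	if err != nil {
		return err
	}
	return store.BindVolume(pvName, volID)
}

// In-memory fakes so the sketch runs.
type memStore map[string]string

func (m memStore) CreatePV(name, id string) error      { m[name] = id; return nil }
func (m memStore) BindVolume(name, volID string) error { m[name] = volID; return nil }

type fakeEBS struct{ n int }

func (f *fakeEBS) CreateVolume(pvName string) (string, error) {
	f.n++
	return fmt.Sprintf("vol-%d (tagged for %s)", f.n, pvName), nil
}

func main() {
	store := memStore{}
	if err := provision(store, &fakeEBS{}, "pv-web-data"); err != nil {
		panic(err)
	}
	fmt.Println(store)
}
```

The recon loop then only needs to list the provider's volumes and check the PV name recorded on each one against the cluster.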

@ghost

ghost commented Oct 8, 2015

@davidopp Yup, not a blocker, IMO.

@a-robinson
Contributor Author

Yup, not a blocker. This issue is just following up on our conversation earlier this week.

@a-robinson a-robinson added this to the v1.2 milestone Dec 8, 2015
@ghost

ghost commented Dec 15, 2015

Resolving this for the purpose of deflaking tests alone makes it worthy of P1 IMO. @a-robinson to what extent are leaking network resources still causing e2e test failures? If they're not, I'd suggest dropping priority to P2?

@a-robinson
Contributor Author

I honestly have no idea at this point. I haven't been watching them other than when I'm on build cop duty, since I've been doing so little OSS work lately.

@davidopp davidopp assigned ghost Dec 15, 2015
@ghost

ghost commented Dec 16, 2015

OK, I'll:

  1. try to consolidate all of the related "leaking cloud resources" issues that have been reported in various places (typically in the context of e2e tests failing) into a smaller number of apparent root causes. I anticipate that these will fall into three buckets:
    1. test infrastructure that does not clean up properly after tests (Framework.AfterEach())
    2. cluster teardown ("kube-down.sh") that does not reliably turn down clusters and/or their backend cloud resources.
    3. k8s controllers that do not reliably garbage-collect cloud resources.
  2. Find owners for each of the underlying root causes

cc @ixdy FYI, as you've looked into this a fair bit in the past.

@kubernetes/goog-testing

@a-robinson
Contributor Author

cc @brendandburns

@davidopp
Member

davidopp commented Feb 4, 2016

This is marked as 1.2 milestone but it doesn't seem like there will be anything resembling closure on it for 1.2. Is there any piece of it that's critical for 1.2? If not, we can punt it to the next release, and/or work on it after the feature freeze. Suggestions?

@a-robinson
Contributor Author

It's not critical for 1.2, but it could certainly be considered a test flake issue, given that it leads to static IP leaks in our e2e tests.

@bprashanth
Contributor

If we're restarting the controller manager during a test run (outside of a daemon restart) while it has a live load balancer, that would be a worse bug IMO.

@davidopp
Member

This issue is full of good ideas but nobody had time to implement any of them, so I'm going to move this to the next-candidate milestone...

@davidopp davidopp added this to the next-candidate milestone Feb 12, 2016
@guoshimin
Contributor

Can we at least tag the ELBs created by Kubernetes on AWS with the cluster name and service name, so that manual cleanup is possible?

@guoshimin
Contributor

> On Fri, Feb 26, 2016 at 12:27 PM, Jeff Grafton wrote:
>
> @guoshimin I think we've looked into this before. The problem is that the allowed name is very limited. From http://docs.aws.amazon.com/ElasticLoadBalancing/latest/APIReference/API_CreateLoadBalancer.html:
>
> > This name must be unique within your set of load balancers for the region, must have a maximum of 32 characters, must contain only alphanumeric characters or hyphens, and cannot begin or end with a hyphen.
>
> Requiring uniqueness + a maximum of 32 characters really constrains the ability for a general solution that also includes the cluster/service name.

I don't mean the names of the ELBs, but tags. Kubernetes already tags the security groups: https://github.com/kubernetes/kubernetes/blob/v1.2.0-alpha.8/pkg/cloudprovider/providers/aws/aws.go#L1753. We should tag the ELBs the same way.
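
For reference, a minimal sketch of what that could look like, assuming aws-sdk-go's classic ELB AddTags call (the kubernetes.io/service-name key is just an example here, not necessarily the tag the provider would end up using):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

// tagELB attaches ownership tags to a classic ELB so orphans can be found and
// cleaned up by hand (or by a sweeper). "KubernetesCluster" is the tag the AWS
// provider already uses elsewhere; the service tag key is illustrative only.
func tagELB(client *elb.ELB, lbName, clusterID, serviceName string) error {
	_, err := client.AddTags(&elb.AddTagsInput{
		LoadBalancerNames: []*string{aws.String(lbName)},
		Tags: []*elb.Tag{
			{Key: aws.String("KubernetesCluster"), Value: aws.String(clusterID)},
			{Key: aws.String("kubernetes.io/service-name"), Value: aws.String(serviceName)},
		},
	})
	return err
}

func main() {
	// Requires AWS credentials and an existing ELB with this name; shown for illustration.
	sess := session.Must(session.NewSession())
	client := elb.New(sess)
	if err := tagELB(client, "a1b2c3d4e5", "my-cluster", "default/frontend"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("tagged")
}
```

Tags avoid the 32-character name limit entirely, and a sweeper (or a human) can filter on them to find orphans.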

@ixdy
Member

ixdy commented Feb 26, 2016

Yeah, I realized what you meant right after making that comment.

@ixdy ixdy closed this as completed Feb 26, 2016
@ixdy ixdy reopened this Feb 26, 2016
@davidopp
Member

@justinsb what do you think about #15203 (comment)?

It seems like a good backstop in case #19879 doesn't get merged in time for 1.2.

@a-robinson
Contributor Author

#19879 won't be going in for 1.2, we went with #22069 instead. And neither one actually addresses this specific issue. They fix other bad behavior, but explicitly don't handle the sort of ownership tracking across controller-manager restarts or graceful deletion state needed for this.

@justinsb
Member

@davidopp @guoshimin tagging of ELBs is a good request. We recently added some tagging; it should be easy to add any missing tagging. Tracking as #22118

@justinsb
Member

Oh ... and related: on AWS we already reserved some tags. I screwed up originally and called the primary tag "KubernetesCluster", but otherwise we're now claiming the whole "*.kubernetes.io/..." namespace for our own purposes. Nobody's complained yet, but then these are only tags (not names) and I haven't exactly been broadcasting the fact that you shouldn't allow your tags to collide. But on the other hand, I don't see people getting a collision accidentally. The biggest problem on AWS is the opposite - people that install without using kube-up invariably forget these tags and find things don't work as they expect.

On GCE though you have project isolation (that's what I am simulating with KubernetesCluster on AWS), so it doesn't seem unreasonable to grab a name prefix like k8sio- or kubernetes-io-; if I install k8s into a project, it's not a huge ask to not choose colliding names in that project. At least that's my understanding of GCE naming rules.

On the other hand, having a "Terminating" state would be more consistent with the pod lifecycle...
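
A tiny sketch of an ownership check along those lines, using the KubernetesCluster tag on AWS and, on GCE, a reserved name prefix (the prefix below is just the example from this comment):

```go
package main

import (
	"fmt"
	"strings"
)

// ownedByCluster illustrates the two conventions discussed above: on AWS, a
// reserved tag ("KubernetesCluster" = cluster name); on GCE, where the project
// already scopes the cluster, a reserved name prefix.
func ownedByCluster(name string, tags map[string]string, clusterID string) bool {
	if v, ok := tags["KubernetesCluster"]; ok {
		return v == clusterID
	}
	return strings.HasPrefix(name, "k8sio-")
}

func main() {
	fmt.Println(ownedByCluster("k8sio-lb-frontend", nil, "mycluster"))                                     // true: reserved prefix
	fmt.Println(ownedByCluster("my-own-lb", map[string]string{"KubernetesCluster": "other"}, "mycluster")) // false: tagged for a different cluster
}
```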

@bgrant0607
Member

Ref #4630

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on Dec 25, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 24, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/close
