
Feature Request: Perform a graceful shutdown upon SIGTERM #128

Closed
purplexa opened this issue Oct 17, 2017 · 45 comments
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@purplexa

The method documented in Connecting from Google Container Engine for using Cloud SQL Proxy from GKE is to run the proxy as a sidecar container in your pod alongside your application. For normal use, this works well, but as this StackOverflow question describes, there are problems when trying to do the same in a Kubernetes Job.

The main issue is that for a Job to finish, all the containers in its Pod need to finish and return 0, so when using Cloud SQL Proxy as a sidecar container, even after the application container finishes, the proxy container stays around and Kubernetes believes the Job is not finished.

What would be helpful is for there to be some way to tell Cloud SQL Proxy to shut itself down cleanly after the application has finished. Perhaps something like a configurable timeout where the proxy shuts down after having no active connections for a period of time, or even some way for the application to explicitly tell the proxy it has finished before the application itself shuts down.
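For context, the problematic shape is a Job like the following (a minimal sketch; image names and the instance string are placeholders). The app container exits 0, but the proxy container keeps running, so the Job never completes:

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: <MY_APP_IMAGE>   # runs the work, then exits 0
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        # Never exits on its own, so the Job is never considered finished.
        command: ["/cloud_sql_proxy",
                  "-dir=/cloudsql",
                  "-instances=<PROJECT>:<REGION>:<INSTANCE>"]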

@AthenaShi
Contributor

This can be done by simply sending SIGTERM to the proxy on shutdown.

For more information (not directly related):
#86

@purplexa
Author

purplexa commented Nov 7, 2017

Sending a SIGTERM won't work because it causes Cloud SQL Proxy to exit 2, not exit 0, which Kubernetes will treat as a Job failure, not a success.
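One stopgap until the proxy handles this itself (a sketch of my own, not from the proxy docs; the instance string and paths are illustrative) is to wrap the proxy in a small shell script that remaps the SIGTERM-induced exit status 2 to 0, so the Job still counts the container as succeeded:

#!/bin/sh
# Run the proxy in the background so the shell can forward SIGTERM to it.
/cloud_sql_proxy -dir=/cloudsql -instances=<PROJECT>:<REGION>:<INSTANCE> &
pid=$!
trap 'kill -TERM "$pid"' TERM
wait "$pid"; status=$?
# If wait was interrupted by the trapped signal, wait again for the proxy itself.
if [ "$status" -gt 128 ]; then wait "$pid"; status=$?; fi
# Treat the proxy's SIGTERM exit code (2) as success; pass anything else through.
[ "$status" -eq 2 ] && exit 0
exit "$status"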

@EamonKeane

@thrnio I'm having the same issue; did you find a workaround for this?

@EamonKeane

possible solution here? https://stackoverflow.com/questions/41679364/kubernetes-stop-cloudsql-proxy-sidecar-container-in-multi-container-pod-job

One possible solution would be a separate cloudsql-proxy deployment with a matching service. You would then only need your migration container inside the job that connects to your proxy service.

This comes with some downsides:

  • higher network latency, no pod-local MySQL communication
  • a possible security issue if you expose the SQL port to your whole Kubernetes cluster

If you want to open cloudsql-proxy to the whole cluster, you have to replace tcp:3306 with tcp:0.0.0.0:3306 in the -instances parameter of the cloudsql-proxy.
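For reference, a minimal sketch of that separate-deployment approach (names, image tag, and instance string are placeholders, not taken from the linked answer):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudsql-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudsql-proxy
  template:
    metadata:
      labels:
        app: cloudsql-proxy
    spec:
      containers:
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        # 0.0.0.0 so other pods (via the Service below) can connect.
        command: ["/cloud_sql_proxy",
                  "-instances=<PROJECT>:<REGION>:<INSTANCE>=tcp:0.0.0.0:3306"]
---
apiVersion: v1
kind: Service
metadata:
  name: cloudsql-proxy
spec:
  selector:
    app: cloudsql-proxy
  ports:
  - port: 3306
    targetPort: 3306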

@purplexa
Author

@EamonKeane Not really. Right now I'm using a script that polls the individual container statuses, checks for when the other container succeeds or fails, and then deletes the Job. That works well enough for most of my use cases, but not all, and it's far from ideal.

@EamonKeane

EamonKeane commented Nov 29, 2017

Yeah, not great. What is Google up to here? Because egress IP is not controllable (as I understand it), sql-proxy seems to be the blessed way to connect, so it might be expected to be robust to all GKE use cases. The other SIGTERM issue mentioned above (#86) has been open for months. Databases tend to be an important part of an application...

@AthenaShi
Contributor

Thanks for the explanations of the use case here, I've created an internal enhancement ticket for this and it will be prioritized.

The new feature would be:

  1. Upon receiving SIGTERM (or some exit signal), the Cloud SQL Proxy would stop accepting new connections.
  2. The Cloud SQL Proxy would exit with status 0 if there are no active connections.
  3. After 30s (by default), the Cloud SQL Proxy would close all active connections, exit, and return 0.

==============================================================================
Meanwhile, here's a workaround before we have this feature:

You can easily trap the SIGTERM signal the following way in your deployment:

command: ["/bin/bash", "-c", "trap 'sleep 30; exit 0' SIGTERM; /cloud_sql_proxy -dir=/cloudsql -instances=..."]

You can skip the "sleep 30" part if an immediate exit is needed. This was proposed in #86, which is why I linked it here.

@AthenaShi
Contributor

And the reason #86 is still open is that we don't want the Cloud SQL Proxy to continue accepting new connections after SIGTERM.

@AthenaShi AthenaShi changed the title Using Cloud SQL Proxy as a sidecar container in a GKE Job Feature: performs a graceful shutdown upon SIGTERM Dec 22, 2017
@lunemec

lunemec commented Jan 19, 2018

@AthenaShi the command you posted:

command: ["/bin/bash", "-c", "trap 'sleep 30; exit 0' SIGTERM; /cloud_sql_proxy -dir=/cloudsql -instances=..."]

is incorrect. There is no /bin/bash inside the container. You need to use /bin/sh:

command: ["/bin/sh", "-c", "/cloud_sql_proxy [options...]"]

@eCeleritas

It is mentioned here that we can simply send SIGTERM to the proxy and then have the proxy handle the SIGTERM gracefully. How can we actually trigger the SIGTERM on the proxy from the container that is doing the actual work for the Job?

@eCeleritas

I was able to handle this case by implementing the following:

  1. Create a new docker image based on gcr.io/cloudsql-docker/gce-proxy
  2. Add a script in the image, called watch_job.sh, which has the ability to interrogate the k8s api and determine if the pod in which the custom container is running has at least one completed container (in my scenario, I only have two containers - the proxy and the one doing the actual work). This script continues to run until the other container is complete.
  3. In the k8s job configuration, I pass the command:
command: ["/bin/sh"]
            args: [ "-c",
                    "/cloud_sql_proxy -instances=$(SQL_INSTANCE) -credential_file=/secrets/cloudsql/credentials.json & /watch_job.sh"]

So, this way, the k8s job is actually waiting on the watch_job.sh script to exit rather than waiting on cloud_sql_proxy. No need to trap the SIGTERM or anything else.
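The script itself wasn't posted, but a rough sketch of the idea (my own, and simplified: it assumes the pod's service account may read its own Pod, and a real version should match container names, e.g. with jq, since init containers also report a terminated state):

#!/bin/sh
# watch_job.sh (sketch): poll the Kubernetes API for this pod's status and
# exit once any container in the pod reports a terminated state.
APISERVER=https://kubernetes.default.svc
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat "$SA/token")
NAMESPACE=$(cat "$SA/namespace")
while true; do
  if curl -s --cacert "$SA/ca.crt" -H "Authorization: Bearer $TOKEN" \
      "$APISERVER/api/v1/namespaces/$NAMESPACE/pods/$HOSTNAME" \
    | grep -q '"terminated"'; then
    exit 0
  fi
  sleep 5
done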


@bpiselli

@eCeleritas Thanks for the ideas, seems great. Could you open-source your code?

@kurtisvg kurtisvg added Status: In Progress type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Jul 16, 2018
@kurtisvg kurtisvg changed the title Feature: performs a graceful shutdown upon SIGTERM Feature Request: Perform a graceful shutdown upon SIGTERM Jul 16, 2018
@kurtisvg
Contributor

@AthenaShi Can you update this issue with the current status of the work?

@hfwang
Contributor

hfwang commented Jul 24, 2018

#178 adds a configurable wait after receiving SIGTERM; however, it's not incredibly clever: it will wait the full configured period, since it doesn't know if there are any open connections, and it doesn't start rejecting new connections either.

@kilianc

kilianc commented Aug 15, 2018

since it doesn't know if there are any open connections, and it doesn't start rejecting new connections either

@hfwang is anyone working on this?

@kilianc

kilianc commented Aug 15, 2018

The fact that this is marked as a feature request and not a bug is alarming. This is de facto a show-stopper for production use. The documentation should actually mention that this is not production ready. I haven't seen any workaround that supports a clean shutdown and guarantees consistent and reliable behavior.

@Carrotman42
Contributor

In general I don't think this feature is necessary for generic use of the Cloud SQL Proxy - I believe this issue is mostly geared toward integration with Kubernetes.

The Proxy process is stateless; it can be killed (and restarted) if necessary, and well-behaved applications should handle that (e.g., retry when connections to the backend database break). It seems reasonable to me that the Proxy exits quickly with a non-zero status when it is sent a SIGTERM; any automation around the process (e.g. Kubernetes, systemd, upstart) could be programmed to handle this behavior, and in this issue there are numerous workarounds for Kubernetes, each potentially tunable to an individual setup's requirements.

But maybe there should be some more common handling logic in the Proxy. Unfortunately, I can't tell from this thread whether there is even a consensus about how this feature should work exactly, or how it would be better than the current world in most cases.

@kilianc: you seem very interested in having this issue solved. Could you summarize exactly the interface and functionality you would like to see? Anyone else who has a concrete example and suggestion is welcome to chime in too. Once we reach a threshold of usefulness without over-fitting the known use cases, we can move forward with a PR.

@kilianc

kilianc commented Aug 15, 2018

@Carrotman42 I do understand your point of view, but a database connection layer that doesn't drain its connection pool is unheard of. I may be missing something; that's why I am so polarized on this.

Let me regroup my thoughts.

Current behavior

  • Currently deploying my app with k8s requires cloudsql-proxy as a sidecar in the same pod as my application container.
  • On rolling updates, k8s sends SIGTERM to both the app container and the sidecar.
  • The app container handles SIGTERM gracefully: it stops accepting new connections and waits up to a fixed grace period for in-flight requests to complete; after that it either exits cleanly or is SIGKILLed.
  • The sidecar closes all connections on SIGTERM, causing the main app to fail all in-flight requests without a chance to retry anything, since it is itself shutting down.

I don't really see how this can be the recommended way to deploy your app on k8s, considering that some teams deploy multiple times a day and the goal is always zero downtime.

I am open to any workaround that is SRE-friendly and doesn't impact the end-user UX.

Expected behavior

  • SIGTERM is sent.
  • Both the app and the proxy stop accepting new connections.
  • The app drains all sockets and closes its pg connections cleanly once it's done processing.
  • The proxy waits until all connections are closed, or until the max grace time elapses.
  • Both exit 0 (or > 0 on a forced shutdown).

Even always waiting for the max grace time is bad, because it's going to make all pods in all my Postgres apps wait the full grace period for no reason; it's just not optimal.

The experience provided by the proxy should be comparable to what you would have connecting straight to Postgres.

@kilianc

kilianc commented Aug 16, 2018

I found a solid workaround 🚀💯🔥🦄

App Container:

command:
- /bin/sh
- -c
- node server.js & pid="$!"; trap "kill -s SIGTERM $pid; wait $pid; touch /opt/exit-signals/SIGTERM;" SIGTERM; wait $pid;
volumeMounts:
- mountPath: /opt/exit-signals
  name: exit-signals

Sidecar Container:

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - until [ -f /opt/exit-signals/SIGTERM ]; do sleep 1; done;
volumeMounts:
- mountPath: /opt/exit-signals
  name: exit-signals

Deployment:

volumes:
- name: exit-signals
  emptyDir: {}

Works perfectly.


N.B. This is not a final fix but just a workaround. If you expose the sidecar in a service and something else connects to it, those transactions will be dropped at every re-deployment. A classic use case is ETL of your DB to somewhere else.

N.B.B. This approach also works with CronJobs with a slight modification.

@josephtyler

@kilianc Is this workaround still working well for you?

@laszlocph

We implemented it a week ago. The basic tests showed that the containers shut down in the right order. We have it running in production on a single app. No issues so far, but of course it's early to tell for sure.

@hfwang
Contributor

hfwang commented Jan 10, 2019

As an alternative: a flag was added in #206 so the proxy will wait up to some configurable period after receiving a SIGINT or SIGTERM for all connections to close before it terminates (and if they do, it can exit cleanly).
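Usage looks roughly like this (this is the v1 -term_timeout flag referenced later in the thread; the value is illustrative):

/cloud_sql_proxy -instances=<PROJECT>:<REGION>:<INSTANCE>=tcp:3306 -term_timeout=30s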

@hfwang hfwang closed this as completed Jan 10, 2019
@kilianc

kilianc commented Jan 10, 2019

@carlosfrancia it's probably a shell version mismatch. I am running on alpine-node.

@kilianc

kilianc commented Jan 10, 2019

@carlosfrancia make sure you're sending the correct signal: kill -s SIGTERM $pid

@giladsh1

Check out this blog post: How to terminate a side-car container in Kubernetes Job

@taylorchu

We use an in-pod supervisor (https://github.com/taylorchu/wait-for) to solve the sidecar container issue. It is simple and can be used even if you don't use k8s.

@huwcbjones

FYI for anyone coming here from Google: --term_timeout has been replaced with --max-sigterm-delay in v2.
#1742 (comment)
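The v2 equivalent looks roughly like this (values illustrative; --exit-zero-on-sigterm, which appears in a config further down, additionally makes the proxy return 0 after a SIGTERM):

./cloud-sql-proxy --max-sigterm-delay=30s --exit-zero-on-sigterm <PROJECT>:<REGION>:<INSTANCE>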

@enocom
Member

enocom commented Sep 1, 2023

For a full list of flag changes see the migration guide.

@same-id

same-id commented Nov 15, 2023

--max-sigterm-delay does not solve this with current behavior.

  1. The container exits when the last connection is closed; however, a new connection might still be needed by the pod while it shuts down, so it should just sleep.
  2. quitquitquit could be used to signal the process to actually shut down while it sleeps (since, if point 1 is changed, you need to know when to stop sleeping).

Instead, we will be adding a preStop command that sleeps, and have the actual workload call the quitquitquit endpoint via httpGet.
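Concretely, that might look like the following (a sketch, not an official recipe: it assumes the proxy's --quitquitquit flag exposes POST /quitquitquit on the admin port, 9091 by default, and uses an -alpine image variant so a sleep binary is present, per a later comment):

containers:
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0-alpine
  args:
  - --quitquitquit
  - <PROJECT>:<REGION>:<INSTANCE>
  lifecycle:
    preStop:
      exec:
        # Keep the proxy serving while the app drains; the app's shutdown
        # path can stop it early via: curl -X POST localhost:9091/quitquitquit
        command: ["sleep", "30"]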

@enocom
Member

enocom commented Nov 15, 2023

@same-id if there's a feature request in there, feel free to open a new issue. We're still working on smoothing out the k8s use cases.

@same-id

same-id commented Nov 19, 2023

@enocom, to be honest, I'm not sure - just saying that this feature does not work 100% with k8s.

What we did is simply switch to the alpine flavor of the image, so we have a sleep binary available and can sleep in the preStop hook while the actual service is gracefully shutting down.

Then we make our service call quitquitquit.

This also is not too great.

I believe that "smoothing out the k8s use cases" will soon be fixed in k8s instead (KEP-753):

https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/753-sidecar-containers/README.md

Which should already be available in most major cloud providers (Kubernetes 1.28) behind a feature flag.

Fun fact: a sleep lifecycle hook will also soon be introduced, so goodbye, distro-full images (kubernetes/enhancements#3960).

@enocom
Member

enocom commented Nov 20, 2023

Really hoping that proper sidecar container support in GKE makes all this go away.

This also is not too great.

Is it because it's more manual setup toil? What makes this not great? Asking to see if there's something we can do better here. Maybe a stop command to match the start command in #1117?

@same-id

same-id commented Nov 20, 2023

Since "postStop" hooks do not exist in k8s:

  1. All our services need to be coded to call quitquitquit on shutdown; if you forget, rollouts will be slower, since they will wait for the graceful shutdown period to expire for cloud-sql-proxy.
  2. It does not make sense if the "main" service restarts for any reason, since cloud-sql-proxy will then shut down as well.
  3. Our code also waits for cloud-sql-proxy readiness before it starts, but that's another thing.

I just don't think there's any reason to pursue a hacky solution as long as KEP-753 hits the road in December (even if it's feature-gated).

This was mainly me saying that --max-sigterm-delay is less useful than a plain old sleep for all purposes, and there's no need to add that to cloud-sql-proxy since sleep can just be used from the container image itself.

@enocom
Member

enocom commented Nov 20, 2023

Thanks, @same-id. A lot of this stuff is hacking around the lack of KEP-753.

@same-id

same-id commented May 26, 2024

With Kubernetes v1.29.4, sidecar containers are now supported (the 1.29.3 patch release is required for the Jobs bugfix).

All hail

@devauxbr

Thanks @same-id! I had the same problem and just discovered this GitHub issue 👀 For those who arrive here in the future: I can confirm that transitioning the cloudsql proxy to a native sidecar container (i.e. putting it in the initContainers block and adding restartPolicy: Always) did the trick! ✅

The documentation does state that kube will wait for the "main" container to stop before sending SIGTERM to the sidecar containers 🎉

@devauxbr

I don't know if the GCP documentation is versioned in this repository, but I think this subtlety deserves to be documented there :)

@enocom
Member

enocom commented Jul 30, 2024

Good point @devauxbr -- the Cloud SQL docs for Kubernetes could probably use a fresh revision.

cc @jackwotherspoon as FYI

@enocom
Member

enocom commented Jul 30, 2024

Meanwhile, we could update our example here to make the new sidecar feature more obvious for folks. @devauxbr would you be interested in writing up an issue for what you ran into and how we might make life easier?

@devauxbr

devauxbr commented Aug 2, 2024

Thanks for your feedback @enocom! About the problems I ran into, I'm not sure they deserve a new issue, as I had exactly the same use cases as both @purplexa and @kilianc:

  • My cronjobs running cloudsql-proxy as a sidecar would never complete
  • My deployments would get their Postgres connections killed in the middle of running queries for in-flight requests when K8s deployed a new version of my app

The "fix" was to transition both my cronjobs and deployments from :

[...]
spec:
  containers:
    - name: main
      image: [my_app_image]
      [...]
    - name: cloud-sql-proxy
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]

To:

spec:
  containers:
    - name: main
      image: [my_app_image]
  initContainers:
    - name: cloud-sql-proxy
      restartPolicy: Always
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]
        - name: CSQL_PROXY_HEALTH_CHECK
          value: "true"
        - name: CSQL_PROXY_HTTP_PORT
          value: "9801"
        - name: CSQL_PROXY_HTTP_ADDRESS
          value: 0.0.0.0
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /startup
          port: 9801
        periodSeconds: 1
        timeoutSeconds: 10

EDIT: the important change is moving the cloud-sql-proxy container into an initContainers block and adding restartPolicy: Always to make it a "native K8s sidecar" (cf. the K8s announcement blog post).

Note that I also needed to add the startupProbe config block (and its related env variables) to make sure K8s waits for the cloudsql-proxy to be ready to receive connections before starting my main app container.

Do you want me to create an issue for that? I think updating the docs with these pointers would be sufficient 🙏

@enocom
Member

enocom commented Aug 5, 2024

Thanks @devauxbr for the super clear write-up. We'll get this added to the docs and I'll just open an issue with these details to track it (and make it easy for other folks to find).
