Feature Request: Perform a graceful shutdown upon SIGTERM #128
Comments
This can be done by just sending SIGTERM to the proxy on shutdown. For more information (well, not directly related):
Sending a SIGTERM won't work because it causes Cloud SQL Proxy to exit 2, not exit 0, which Kubernetes will treat as a Job failure, not a success.
@thrnio I'm having the same issue, did you find a workaround for this?
Possible solution here? https://stackoverflow.com/questions/41679364/kubernetes-stop-cloudsql-proxy-sidecar-container-in-multi-container-pod-job One possible solution would be a separate cloudsql-proxy Deployment with a matching Service. You would then only need your migration container inside the Job, which connects to your proxy Service. This comes with some downsides: higher network latency and no pod-local MySQL communication.
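For illustration, a rough sketch of that alternative, assuming the v1 proxy image; the project/region/instance string, Secret name, and image tag are placeholders:

```yaml
# Hypothetical standalone proxy Deployment; the proxy must listen on 0.0.0.0
# so other pods can reach it through the Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudsql-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudsql-proxy
  template:
    metadata:
      labels:
        app: cloudsql-proxy
    spec:
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
          command:
            - /cloud_sql_proxy
            - -instances=my-project:my-region:my-instance=tcp:0.0.0.0:5432
            - -credential_file=/secrets/cloudsql/credentials.json
          volumeMounts:
            - name: cloudsql-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
      volumes:
        - name: cloudsql-credentials
          secret:
            secretName: cloudsql-credentials
---
# Service the Job's migration container connects to instead of localhost.
apiVersion: v1
kind: Service
metadata:
  name: cloudsql-proxy
spec:
  selector:
    app: cloudsql-proxy
  ports:
    - port: 5432
      targetPort: 5432
```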
@EamonKeane not really; right now I'm using a script that polls the individual container statuses and checks for when the other container succeeds or fails, then just deletes the Job. That sort of works well enough for most of my use cases, but not totally, and it's far from ideal.
Yeah, not great. What is Google up to here? Because egress IP is not controllable (I understand), sql-proxy seems to be the blessed way to connect, so it might be expected to be robust to all GKE use cases. The other SIGTERM issue mentioned above (#86) has been open for months. Databases tend to be an important part of an application....
Thanks for the explanations of the use case here. I've created an internal enhancement ticket for this and it will be prioritized. The new feature would be:
You can easily trap the SIGTERM signal the following way in your deployment:
You can skip the "sleep 30" part if an immediate exit is needed. This was proposed in #86, which is why I linked it here.
And the reason why #86 is still open is because we don't want the Cloud SQL Proxy to continue accepting new connections after SIGTERM.
@AthenaShi the command you posted is incorrect. There is no /bin/bash inside the container. You need to use /bin/sh instead.
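The originally posted command isn't shown above, but a minimal sketch of what such a trap could look like in the sidecar spec, assuming /bin/sh and a SQL_INSTANCE env var defined on the container (both assumptions, not the author's exact command):

```yaml
# Sketch only: trap SIGTERM, give in-flight connections a grace period
# (the "sleep 30" mentioned above), then exit 0 so Kubernetes records success.
command: ["/bin/sh", "-c"]
args:
  - |
    # $(SQL_INSTANCE) is expanded by Kubernetes from the container's env.
    /cloud_sql_proxy -instances=$(SQL_INSTANCE) \
      -credential_file=/secrets/cloudsql/credentials.json &
    PID=$!
    trap "sleep 30; kill $PID; wait $PID; exit 0" TERM
    wait $PID
```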
It is mentioned here that we can simply SIGTERM the proxy and then have the proxy handle the SIGTERM gracefully. How can we actually trigger the SIGTERM on the proxy from the container that is doing the actual work for the Job?
I was able to handle this case by implementing the following:

command: ["/bin/sh"]
args: [ "-c",
  "/cloud_sql_proxy -instances=$(SQL_INSTANCE) -credential_file=/secrets/cloudsql/credentials.json & /watch_job.sh"]

So, this way, the k8s job is actually waiting on the /watch_job.sh script.

Notes:
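The /watch_job.sh script itself isn't included in the thread. A purely hypothetical sketch (file name and marker path invented here) could wait for a completion marker that the application container writes to a shared emptyDir volume; another variant could poll the Kubernetes API for the other container's status, as mentioned earlier in the thread:

```sh
#!/bin/sh
# Hypothetical /watch_job.sh: block until the app container drops a completion
# marker on a shared emptyDir volume, then return 0 so the wrapping shell
# (and with it the proxy sidecar container) exits successfully.
until [ -f /shared/job-done ]; do
  sleep 2
done
exit 0
```

Because this script is the shell's foreground process, the sidecar container exits with the script's status as soon as the marker appears.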
@eCeleritas Thanks for the ideas, seems great. Could you open-source your code?
@AthenaShi Can you update this issue with the current status of the work?
#178 adds a configurable wait after receiving SIGTERM; however, it's not incredibly clever: it will wait the full configured time since it doesn't know whether there are any open connections, and it doesn't start rejecting new connections either.
@hfwang is anyone working on this?
The fact that this is marked as a feature request and not a bug is alarming. This is de facto a show-stopper for production use. The documentation should actually mention that this is not production-ready. I haven't seen any workaround that supports a clean shutdown and guarantees consistent and reliable behavior.
In general I don't think this feature is necessary for generic use of the Cloud SQL Proxy - I believe this issue is mostly geared toward integration with Kubernetes. The Proxy process is stateless; it can be killed (and restarted) if necessary, and well-behaved applications should handle it (e.g., retry when connections to the backend database break). It seems reasonable to me that the Proxy exits quickly with a non-zero status when it is sent a SIGTERM; any automation around the process (e.g., Kubernetes, systemd, upstart) could be programmed to handle this functionality. In this issue there are numerous workarounds for Kubernetes, each potentially tunable to every individual setup's requirements. But maybe there should be some more common handling logic in the Proxy. Unfortunately, I can't tell from this thread that there is even a consensus about how this feature should work exactly, and how it is better than the current world in most cases.

@kilianc: you seem very interested in having this issue solved. Could you summarize exactly the interface and functionality you would like to see? Anyone else who has a concrete example and suggestion is welcome to chime in too. Once we reach a threshold of usefulness without over-fitting the known use cases, we can move forward with a PR.
@Carrotman42 I do understand your point of view, but a database connection layer that doesn't drain its connection pool is unheard of. I may be missing something; that's why I am so polarized on this. Let me regroup my thoughts.

Current behavior:
I guess I don't really see how this is the recommended way to deploy your app on k8s, considering that some teams deploy multiple times a day and the goal is always zero downtime. I am open to any workaround that is SRE-friendly and doesn't impact the end user UX.

Expected behavior:
Even always waiting for the max grace time is bad, because it's going to make all pods in all my Postgres apps wait for the max grace time for no reason; it's just not optimal. The experience provided by the proxy should be comparable to what you would have connecting straight to Postgres.
I found a solid workaround 🚀💯🔥🦄

App container:

command:
  - /bin/sh
  - -c
  - node server.js & pid="$!"; trap "kill -s SIGTERM $pid; wait $pid; touch /opt/exit-signals/SIGTERM;" SIGTERM; wait $pid;
volumeMounts:
  - mountPath: /opt/exit-signals
    name: exit-signals
Sidecar container:

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - until [ -f /opt/exit-signals/SIGTERM ]; do sleep 1; done;
volumeMounts:
  - mountPath: /opt/exit-signals
    name: exit-signals
Deployment:

volumes:
  - name: exit-signals
    emptyDir: {}

Works perfectly.

Resources:

N.B. This is not a final fix, just a workaround. If you expose the sidecar in a Service and something else connects to it, those transactions will be dropped at every re-deployment. A classic use case is ETL of your db to somewhere else. N.B.B. This approach also works with
@kilianc Is this workaround still working well for you?
We implemented it a week ago. The basic tests showed that the containers shut down in the right order. Have it running in production on a single app. No issues so far, but of course it's early to tell for sure.
As an alternative: a flag was added in #206 so the proxy will wait up to some configurable period after receiving a SIGINT or SIGTERM for all connections to close before it terminates (and if they do close in time, it can exit cleanly).
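If memory serves, the v1 flag in question is -term_timeout, but that name is an assumption here and worth confirming against `cloud_sql_proxy -help` for your version. In a sidecar spec it would look roughly like this (image tag and instance string are placeholders):

```yaml
# Assumed flag name and duration; verify with `cloud_sql_proxy -help`.
- name: cloudsql-proxy
  image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
  command:
    - /cloud_sql_proxy
    - -instances=my-project:my-region:my-instance=tcp:5432
    - -term_timeout=30s
```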
@carlosfrancia it's probably a shell version mismatch. I am running on
@carlosfrancia make sure you're sending the correct signal
Check out this blog - How to terminate a side-car container in Kubernetes Job
We use an in-pod supervisor (https://github.com/taylorchu/wait-for) to solve the sidecar container issue. It is simple and can be used even if you don't use k8s.
FYI for anyone coming here from Google,
For a full list of flag changes see the migration guide.
Instead, we will be adding a
@same-id if there's a feature request in there, feel free to open a new issue. We're still working on smoothing out the k8s use cases.
@enocom, to be honest, I'm not sure; just saying that this feature does not work 100% with k8s. What we did is simply switch to the alpine flavor of the image so we can have Then we make our service call This also is not too great. I believe that "smoothing out the k8s use cases" will soon be fixed in k8s instead (KEP-753), which should already be available in most major cloud providers (Kubernetes 1.28) behind a feature flag. Fun fact, also
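The "service call" above isn't spelled out in the thread. One pattern it could refer to (an assumption on my part) is the v2 proxy's optional admin API: start the sidecar with --quitquitquit and have the application ask the proxy to exit when it finishes. Flag names, port, and image tag below should be checked against `cloud-sql-proxy --help`:

```yaml
# Sketch of a v2 sidecar with the shutdown endpoint enabled (values illustrative).
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0-alpine
  args:
    - --quitquitquit        # expose the /quitquitquit shutdown endpoint
    - --admin-port=9091
    - my-project:my-region:my-instance
```

The application's last step (or a wrapper script) would then run something like `curl -X POST http://localhost:9091/quitquitquit`, assuming a curl-capable image.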
Really hoping that proper sidecar container support in GKE makes all this go away.
Is it because it's more manual setup toil? What makes this not great? Asking to see if there's something we can do better here. Maybe a stop command to match the start command here #1117?
Since "postStop" hooks do not exist in k8s:
I just don't think there's any reason to pursue any hacky solution as long as KEP-753 will hit the road in December (even if it's feature-gated). This was mainly me saying that
Thanks, @same-id. A lot of this stuff is hacking around the lack of KEP-753.
With Kubernetes v1.29.4, sidecar containers are now supported (the 1.29.3 minor release is required for the Jobs bugfix). All hail
Thanks @same-id! I had the same problem and just discovered this GitHub issue 👀 For those who arrive here in the future: I can confirm that transitioning the Cloud SQL Proxy to a native sidecar container (i.e., putting it in initContainers with restartPolicy: Always, as shown in the manifests below) works. The documentation does state that kube will wait for the "main" container to stop before sending the SIGTERM to the sidecar containers 🎉
I don't know if the GCP documentation is versioned in this repository, but I think this subtlety deserves to be documented there :)
Good point @devauxbr -- the Cloud SQL docs for Kubernetes could probably use a fresh revision. cc @jackwotherspoon as FYI
Meanwhile, we could update our example here to make the new sidecar feature more obvious for folks. @devauxbr would you be interested in writing up an issue for what you ran into and how we might make life easier?
Thanks for your feedback @enocom! About the problems I ran into, I am not sure if it deserves a new issue, as I had exactly the same use cases as both @purplexa and @kilianc:
The "fix" was to transition both my cronjobs and deployments from:

[...]
spec:
  containers:
    - name: main
      image: [my_app_image]
      [...]
    - name: cloud-sql-proxy
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]

to:

spec:
  containers:
    - name: main
      image: [my_app_image]
  initContainers:
    - name: cloud-sql-proxy
      restartPolicy: Always
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]
        - name: CSQL_PROXY_HEALTH_CHECK
          value: "true"
        - name: CSQL_PROXY_HTTP_PORT
          value: "9801"
        - name: CSQL_PROXY_HTTP_ADDRESS
          value: 0.0.0.0
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /startup
          port: 9801
        periodSeconds: 1
        timeoutSeconds: 10

EDIT: the important change is moving the cloud-sql-proxy container into the initContainers list (with restartPolicy: Always). Note that I also needed to add the startupProbe config block (and its related env variables) to make sure K8s would wait for the cloud-sql-proxy to be ready to receive connections before starting my main app container. Do you want me to create an issue for that? I think updating the docs with these pointers would be sufficient 🙏
Thanks @devauxbr for the super clear write-up. We'll get this added to the docs and I'll just open an issue with these details to track it (and make it easy for other folks to find).
Original issue description:
The method documented in Connecting from Google Container Engine for using the Cloud SQL Proxy from GKE is to run the proxy as a sidecar container in your pod alongside your application. For normal use this works well, but as this StackOverflow question describes, there are problems when trying to do the same in a Kubernetes Job.
The main issue is that for a Job to finish, all the containers in its Pod need to finish and return 0, so when using Cloud SQL Proxy as a sidecar container, even after the application container finishes, the proxy container stays around and Kubernetes believes the Job is not finished.
What would be helpful is for there to be some way to tell Cloud SQL Proxy to shut itself down cleanly after the application has finished. Perhaps something like a configurable timeout where the proxy shuts down after having no active connections for a period of time, or even some way for the application to explicitly tell the proxy it has finished before the application itself shuts down.
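For concreteness, the problematic shape looks roughly like this (image names and instance string are placeholders): the migrate container runs to completion and exits 0, but the proxy sidecar keeps running, so the Job never completes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate                # finishes and exits 0
          image: my-app-migrations:latest
        - name: cloudsql-proxy         # long-running sidecar, never exits
          image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
          command:
            - /cloud_sql_proxy
            - -instances=my-project:my-region:my-instance=tcp:5432
```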