Feature Request: Perform a graceful shutdown upon SIGTERM #128
Comments
This can be done by just sending SIGTERM to the proxy on shutdown. For more information (well, not directly related):
Sending a SIGTERM won't work because it causes Cloud SQL Proxy to exit 2, not exit 0, which Kubernetes will treat as a Job failure, not a success.
@thrnio I'm having the same issue, did you find a workaround for this?
Possible solution here? https://stackoverflow.com/questions/41679364/kubernetes-stop-cloudsql-proxy-sidecar-container-in-multi-container-pod-job One possible solution would be a separate cloudsql-proxy Deployment with a matching Service. You would then only need your migration container inside the Job, which connects to your proxy Service. This comes with some downsides: higher network latency and no pod-local MySQL communication.
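For illustration, a rough sketch of that alternative, assuming the v1 proxy image; the project/region/instance string, Secret name, and image tag are placeholders:

```yaml
# Hypothetical standalone proxy Deployment; the proxy must listen on 0.0.0.0
# so other pods can reach it through the Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudsql-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudsql-proxy
  template:
    metadata:
      labels:
        app: cloudsql-proxy
    spec:
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
          command:
            - /cloud_sql_proxy
            - -instances=my-project:my-region:my-instance=tcp:0.0.0.0:5432
            - -credential_file=/secrets/cloudsql/credentials.json
          volumeMounts:
            - name: cloudsql-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
      volumes:
        - name: cloudsql-credentials
          secret:
            secretName: cloudsql-credentials
---
# Service the Job's migration container connects to instead of localhost.
apiVersion: v1
kind: Service
metadata:
  name: cloudsql-proxy
spec:
  selector:
    app: cloudsql-proxy
  ports:
    - port: 5432
      targetPort: 5432
```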
@EamonKeane not really; right now I'm using a script that polls the individual container statuses and checks for when the other container succeeds or fails, then just deletes the Job. That sort of works well enough for most of my use cases, but not totally, and it's far from ideal.
Yeah, not great. What is Google up to here? Because egress IP is not controllable (I understand), sql-proxy seems to be the blessed way to connect, so it might be expected to be robust to all GKE use cases. The other SIGTERM issue mentioned above (#86) has been open for months. Databases tend to be an important part of an application....
Thanks for the explanations of the use case here. I've created an internal enhancement ticket for this and it will be prioritized. The new feature would be:
You can easily trap the SIGTERM signal the following way in your deployment:
You can skip the "sleep 30" part if an immediate exit is needed. This was proposed in #86, which is why I linked it here.
And the reason why #86 is still open is because we don't want the Cloud SQL Proxy to continue accepting new connections after SIGTERM.
@AthenaShi the command you posted is incorrect. There is no /bin/bash inside the container. You need to use /bin/sh instead.
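The originally posted command isn't shown above, but a minimal sketch of what such a trap could look like in the sidecar spec, assuming /bin/sh and a SQL_INSTANCE env var defined on the container (both assumptions, not the author's exact command):

```yaml
# Sketch only: trap SIGTERM, give in-flight connections a grace period
# (the "sleep 30" mentioned above), then exit 0 so Kubernetes records success.
command: ["/bin/sh", "-c"]
args:
  - |
    # $(SQL_INSTANCE) is expanded by Kubernetes from the container's env.
    /cloud_sql_proxy -instances=$(SQL_INSTANCE) \
      -credential_file=/secrets/cloudsql/credentials.json &
    PID=$!
    trap "sleep 30; kill $PID; wait $PID; exit 0" TERM
    wait $PID
```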
It is mentioned here that we can simply SIGTERM the proxy and then have the proxy handle the SIGTERM gracefully. How can we actually trigger the SIGTERM on the proxy from the container that is doing the actual work for the Job?
I was able to handle this case by implementing the following:

command: ["/bin/sh"]
args: [ "-c",
  "/cloud_sql_proxy -instances=$(SQL_INSTANCE) -credential_file=/secrets/cloudsql/credentials.json & /watch_job.sh"]

So, this way, the k8s job is actually waiting on the /watch_job.sh script.

Notes:
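The /watch_job.sh script itself isn't included in the thread. A purely hypothetical sketch (file name and marker path invented here) could wait for a completion marker that the application container writes to a shared emptyDir volume; another variant could poll the Kubernetes API for the other container's status, as mentioned earlier in the thread:

```sh
#!/bin/sh
# Hypothetical /watch_job.sh: block until the app container drops a completion
# marker on a shared emptyDir volume, then return 0 so the wrapping shell
# (and with it the proxy sidecar container) exits successfully.
until [ -f /shared/job-done ]; do
  sleep 2
done
exit 0
```

Because this script is the shell's foreground process, the sidecar container exits with the script's status as soon as the marker appears.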
@eCeleritas Thanks for the ideas, seems great. Could you open-source your code?
@AthenaShi Can you update this issue with the current status of the work?
#178 adds a configurable wait after receiving SIGTERM; however, it's not incredibly clever: it will wait the full configured time since it doesn't know whether there are any open connections, and it doesn't start rejecting new connections either.
@hfwang is anyone working on this?
The fact that this is marked as a feature request and not a bug is alarming. This is de facto a show-stopper for production use. The documentation should actually mention that this is not production-ready. I haven't seen any workaround that supports a clean shutdown and guarantees consistent and reliable behavior.
In general I don't think this feature is necessary for generic use of the Cloud SQL Proxy - I believe this issue is mostly geared toward integration with Kubernetes. The Proxy process is stateless; it can be killed (and restarted) if necessary, and well-behaved applications should handle it (e.g., retry when connections to the backend database break). It seems reasonable to me that the Proxy exits quickly with a non-zero status when it is sent a SIGTERM; any automation around the process (e.g., Kubernetes, systemd, upstart) could be programmed to handle this functionality. In this issue there are numerous workarounds for Kubernetes, each potentially tunable to every individual setup's requirements. But maybe there should be some more common handling logic in the Proxy. Unfortunately, I can't tell from this thread that there is even a consensus about how this feature should work exactly, and how it is better than the current world in most cases.

@kilianc: you seem very interested in having this issue solved. Could you summarize exactly the interface and functionality you would like to see? Anyone else who has a concrete example and suggestion is welcome to chime in too. Once we reach a threshold of usefulness without over-fitting the known use cases, we can move forward with a PR.
@Carrotman42 I do understand your point of view, but a database connection layer that doesn't drain its connection pool is unheard of. I may be missing something; that's why I am so polarized on this. Let me regroup my thoughts.

Current behavior:
I guess I don't really see how this is the recommended way to deploy your app on k8s, considering that some teams deploy multiple times a day and the goal is always zero downtime. I am open to any workaround that is SRE-friendly and doesn't impact the end user UX.

Expected behavior:
Even always waiting for the max grace time is bad, because it's going to make all pods in all my Postgres apps wait for the max grace time for no reason; it's just not optimal. The experience provided by the proxy should be comparable to what you would have connecting straight to Postgres.
I found a solid workaround 🚀💯🔥🦄

App container:

command:
  - /bin/sh
  - -c
  - node server.js & pid="$!"; trap "kill -s SIGTERM $pid; wait $pid; touch /opt/exit-signals/SIGTERM;" SIGTERM; wait $pid;
volumeMounts:
  - mountPath: /opt/exit-signals
    name: exit-signals
Sidecar container:

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - until [ -f /opt/exit-signals/SIGTERM ]; do sleep 1; done;
volumeMounts:
  - mountPath: /opt/exit-signals
    name: exit-signals
Deployment:

volumes:
  - name: exit-signals
    emptyDir: {}

Works perfectly.

Resources:

N.B. This is not a final fix, just a workaround. If you expose the sidecar in a Service and something else connects to it, those transactions will be dropped at every re-deployment. A classic use case is ETL of your db to somewhere else. N.B.B. This approach also works with
@kilianc Is this workaround still working well for you?
We implemented it a week ago. The basic tests showed that the containers shut down in the right order. Have it running in production on a single app. No issues so far, but of course it's early to tell for sure.
As an alternative: a flag was added in #206 so the proxy will wait up to some configurable period after receiving a SIGINT or SIGTERM for all connections to close before it terminates (and if they do close in time, it can exit cleanly).
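If memory serves, the v1 flag in question is -term_timeout, but that name is an assumption here and worth confirming against `cloud_sql_proxy -help` for your version. In a sidecar spec it would look roughly like this (image tag and instance string are placeholders):

```yaml
# Assumed flag name and duration; verify with `cloud_sql_proxy -help`.
- name: cloudsql-proxy
  image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
  command:
    - /cloud_sql_proxy
    - -instances=my-project:my-region:my-instance=tcp:5432
    - -term_timeout=30s
```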
@carlosfrancia it's probably a shell version mismatch. I am running on
@carlosfrancia make sure you're sending the correct signal
Check out this blog - How to terminate a side-car container in Kubernetes Job
We use an in-pod supervisor (https://github.com/taylorchu/wait-for) to solve the sidecar container issue. It is simple and can be used even if you don't use k8s.
FYI for anyone coming here from Google,
For a full list of flag changes see the migration guide.
Instead, we will be adding a
@same-id if there's a feature request in there, feel free to open a new issue. We're still working on smoothing out the k8s use cases.
@enocom, to be honest, I'm not sure; just saying that this feature does not work 100% with k8s. What we did is simply switch to the alpine flavor of the image so we can have Then we make our service call This also is not too great. I believe that "smoothing out the k8s use cases" will soon be fixed in k8s instead (KEP-753), which should already be available in most major cloud providers (Kubernetes 1.28) behind a feature flag. Fun fact, also
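The "service call" above isn't spelled out in the thread. One pattern it could refer to (an assumption on my part) is the v2 proxy's optional admin API: start the sidecar with --quitquitquit and have the application ask the proxy to exit when it finishes. Flag names, port, and image tag below should be checked against `cloud-sql-proxy --help`:

```yaml
# Sketch of a v2 sidecar with the shutdown endpoint enabled (values illustrative).
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0-alpine
  args:
    - --quitquitquit        # expose the /quitquitquit shutdown endpoint
    - --admin-port=9091
    - my-project:my-region:my-instance
```

The application's last step (or a wrapper script) would then run something like `curl -X POST http://localhost:9091/quitquitquit`, assuming a curl-capable image.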
Really hoping that proper sidecar container support in GKE makes all this go away.
Is it because it's more manual setup toil? What makes this not great? Asking to see if there's something we can do better here. Maybe a stop command to match the start command here #1117?
Since "postStop" hooks do not exist in k8s:
I just don't think there's any reason to pursue any hacky solution as long as KEP-753 will hit the road in December (even if it's feature-gated). This was mainly me saying that
Thanks, @same-id. A lot of this stuff is hacking around the lack of KEP-753.
With Kubernetes v1.29.4, sidecar containers are now supported (the 1.29.3 minor release is required for the Jobs bugfix). All hail
Thanks @same-id! I had the same problem and just discovered this GitHub issue 👀 For those who arrive here in the future: I can confirm that transitioning the Cloud SQL Proxy to a native sidecar container (i.e., putting it in initContainers with restartPolicy: Always, as shown in the manifests below) works. The documentation does state that kube will wait for the "main" container to stop before sending the SIGTERM to the sidecar containers 🎉
I don't know if the GCP documentation is versioned in this repository, but I think this subtlety deserves to be documented there :)
Good point @devauxbr -- the Cloud SQL docs for Kubernetes could probably use a fresh revision. cc @jackwotherspoon as FYI
Meanwhile, we could update our example here to make the new sidecar feature more obvious for folks. @devauxbr would you be interested in writing up an issue for what you ran into and how we might make life easier?
Thanks for your feedback @enocom! About the problems I ran into, I am not sure if it deserves a new issue, as I had exactly the same use cases as both @purplexa and @kilianc:
The "fix" was to transition both my cronjobs and deployments from:

[...]
spec:
  containers:
    - name: main
      image: [my_app_image]
      [...]
    - name: cloud-sql-proxy
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]

to:

spec:
  containers:
    - name: main
      image: [my_app_image]
  initContainers:
    - name: cloud-sql-proxy
      restartPolicy: Always
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]
        - name: CSQL_PROXY_HEALTH_CHECK
          value: "true"
        - name: CSQL_PROXY_HTTP_PORT
          value: "9801"
        - name: CSQL_PROXY_HTTP_ADDRESS
          value: 0.0.0.0
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /startup
          port: 9801
        periodSeconds: 1
        timeoutSeconds: 10

EDIT: the important change is moving the cloud-sql-proxy container into the initContainers list (with restartPolicy: Always). Note that I also needed to add the startupProbe config block (and its related env variables) to make sure K8s would wait for the cloud-sql-proxy to be ready to receive connections before starting my main app container. Do you want me to create an issue for that? I think updating the docs with these pointers would be sufficient 🙏
Thanks @devauxbr for the super clear write-up. We'll get this added to the docs and I'll just open an issue with these details to track it (and make it easy for other folks to find).
Original issue description:
The method documented in Connecting from Google Container Engine for using the Cloud SQL Proxy from GKE is to run the proxy as a sidecar container in your pod alongside your application. For normal use this works well, but as this StackOverflow question describes, there are problems when trying to do the same in a Kubernetes Job.
The main issue is that for a Job to finish, all the containers in its Pod need to finish and return 0, so when using Cloud SQL Proxy as a sidecar container, even after the application container finishes, the proxy container stays around and Kubernetes believes the Job is not finished.
What would be helpful is for there to be some way to tell Cloud SQL Proxy to shut itself down cleanly after the application has finished. Perhaps something like a configurable timeout where the proxy shuts down after having no active connections for a period of time, or even some way for the application to explicitly tell the proxy it has finished before the application itself shuts down.
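For concreteness, the problematic shape looks roughly like this (image names and instance string are placeholders): the migrate container runs to completion and exits 0, but the proxy sidecar keeps running, so the Job never completes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate                # finishes and exits 0
          image: my-app-migrations:latest
        - name: cloudsql-proxy         # long-running sidecar, never exits
          image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
          command:
            - /cloud_sql_proxy
            - -instances=my-project:my-region:my-instance=tcp:5432
```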