Cloud SQL Proxy side-car container issuing periodic SIGTERM requests #1789

Closed
dheerajd3v opened this issue May 11, 2023 · 7 comments
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification.

Comments

@dheerajd3v

Bug Description

We moved from Cloud SQL Proxy image v1 to v2 a couple of weeks ago, and our k8s deployment was running fine until yesterday, when we started seeing SIGTERM signals telling the Cloud SQL Proxy to shut down.

Earlier today, one of our k8s services went down for 10 minutes. I did some investigating and I believe the reason was a SIGTERM signal was sent to cloudsql-proxy service causing it to shut down in a "not so graceful way".

Here is the sequence of events that I believe occurred:

  1. The ping to the database is successful (top of the logs in the screenshot).
  2. cloudsql-proxy receives a TERM signal (no idea why, but this happens periodically and I think this is the root cause).
    Because cloudsql-proxy receives a SIGTERM while it still has 8 active connections to the database, it exits with error code 2 rather than 0 (for the exit codes on SIGTERM, see the two linked comments on "Feature Request: Perform a graceful shutdown upon SIGTERM" #128).
  3. I believe that because exit code 2 is returned (unhealthy, not a graceful shutdown), the liveness probe restarts our microservice. Our microservice (deployed with the proxy as a sidecar) then has to wait until cloudsql-proxy is up before it can connect to the database. If the microservice starts first while cloudsql-proxy is still starting, there is a race condition (which I believe happened here): the microservice tries to connect to the DB and can't (hence the node ping error).

The main issue is that our microservice shouldn't lose its connection to cloudsql-proxy. Cloudsql-proxy shouldn't receive a SIGTERM, because that leads to this odd state where one side is ready (the microservice) and the other is not (the database).

Note: We are using a k8s Deployment with the sidecar pattern.


Environment

  1. OS type and version: GKE 1.24
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): v2.0.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --enable_iam_login --dir /path/to/dir INSTANCE_CONNECTION_NAME):
     - name: cloud-sql-proxy
       image: "gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.0.0"
       args:
       - "--private-ip"
       - "--port=5432"
       - ""
       - "--credentials-file="
  4. We are using this image from GCR: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.0.0

Additional Details

No response

@dheerajd3v dheerajd3v added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label May 11, 2023
@enocom enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels May 11, 2023
@enocom
Member

enocom commented May 11, 2023

The Proxy won't ever send SIGTERM signals. In v2 we've fixed the exit code handling such that the Proxy will exit with a non-zero code if there are still active connections. It sounds like Kubernetes is shutting down your pod (for reasons unknown to me and beyond what the Proxy knows), and the non-zero exit is making things worse.

Here are a few things to try:

  1. Update to latest v2 (currently v2.2.0)
  2. Set the max-sigterm-delay to some reasonable value. This will give the app more time to shut down its active connections and possibly result in the Proxy exiting cleanly.
  3. Probably less useful, but worth knowing about: consider whether the quitquitquit flag would help.

@dheerajd3v
Author

dheerajd3v commented May 11, 2023


Thanks @enocom for the quick response, I appreciate it very much. We run the Cloud SQL Proxy as a sidecar container, so my question is: where do I set max-sigterm-delay? Is it a CLI flag?

@jackwotherspoon
Collaborator

jackwotherspoon commented May 11, 2023

@dheerajd3v You just set it as one of your args in your YAML.

name: cloud-sql-proxy
image: "gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.2.0"
args:
- "--private-ip"
- "--port=5432"
- "--max-sigterm-delay=30s"

The above will wait 30 seconds for connections to close after receiving a TERM signal.
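
On the application side, the delay only helps if the app actually closes its database connections when the pod is asked to stop. A minimal Python sketch of that pattern follows. The names `install_sigterm_handler` and `serve`, and the `pool` object with a `close()` method, are hypothetical and chosen for illustration; they are not part of the Proxy or of any particular database driver.

```python
import signal
import threading


def install_sigterm_handler(stop: threading.Event) -> None:
    """Ask the main loop to stop when Kubernetes sends SIGTERM.

    Must be called from the main thread (a CPython requirement).
    """
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())


def serve(pool, stop: threading.Event, do_work) -> None:
    """Run do_work(pool) until stop is set, then always close the pool."""
    try:
        while not stop.is_set():
            do_work(pool)
    finally:
        # Closing connections here gives the Proxy a chance to see zero
        # active connections before its SIGTERM delay runs out.
        pool.close()
```

In a real service, `do_work` would handle one request or one batch, and `install_sigterm_handler(stop)` would be called once at startup before `serve`.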

@dheerajd3v
Author

dheerajd3v commented May 11, 2023


Thanks @jackwotherspoon, I will make the changes and test. On a different note, I see there is a Cloud SQL Proxy Operator in Preview. Any thoughts on migrating to the operator instead of the sidecar pattern? This is the link I am referring to: https://cloud.google.com/sql/docs/postgres/connect-proxy-operator. I am also not sure why the official Google docs recommend a sidecar over an initContainer. I am thinking about the race condition: what happens if my app container comes up while the DB container is still not up?

@dheerajd3v
Author

I am not sure how to do this. This is from the official GCP docs:

Screenshot 2023-05-11 at 11 34 28 AM

@hessjcg
Collaborator

hessjcg commented May 11, 2023

Hi @dheerajd3v

The Cloud SQL Auth Proxy Operator will make it easier to set up proxy sidecar containers for your apps. We expect the operator to reach GA within the next 2 months. Give it a try and let us know if it works for you.

We recommend running the proxy as a sidecar container for two reasons:

  • The proxy container and the application container can run at the same time.
  • The connection is more secure when the application container connects to the proxy container over the localhost network (127.0.0.1). It is even more secure if the proxy exposes the database as a Unix socket. To do this in Kubernetes, the proxy needs to run in the same pod as the application.
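
To make the two connection styles concrete, here is a small Python sketch that builds a libpq-style connection string for either case. The helper name `postgres_dsn`, the `/cloudsql` socket directory, and the instance connection name in the test are illustrative assumptions; the socket form assumes the proxy was started with a Unix socket directory so that it creates a per-instance directory underneath it.

```python
def postgres_dsn(user, dbname, *, instance=None, socket_dir="/cloudsql", port=5432):
    """Build a libpq keyword/value connection string.

    With instance=None the app connects over TCP to 127.0.0.1:port.
    With an instance connection name it connects through the Unix socket
    directory that the proxy creates under socket_dir (assumed layout).
    """
    if instance is None:
        return f"host=127.0.0.1 port={port} user={user} dbname={dbname}"
    return f"host={socket_dir}/{instance} user={user} dbname={dbname}"
```

Either string can be passed to a libpq-based driver; credentials are left out here and would normally come from a secret.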

Init containers serve a different purpose than sidecar containers. Init containers must exit before the pod's other containers start. See Init Containers.

Unfortunately, there is no pure Kubernetes way to avoid the race condition when a pod's containers start up. Kubernetes does not allow you to specify a startup order for a pod's containers.

We recommend that you write your app to be resilient to failed database connections. Your app should retry a failed database connection attempt for a reasonable period of time (maybe 30 seconds) before exiting with a failure.
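
As a rough illustration of that recommendation, the Python sketch below retries a connection attempt until it succeeds or a deadline passes. `connect_with_retry` and its parameters are hypothetical names; `connect` stands for whatever zero-argument function opens your database connection, and `retry_on` should be set to the exception types your driver raises on a failed connection (it defaults to `OSError` as an assumption).

```python
import time


def connect_with_retry(connect, timeout_s=30.0, interval_s=1.0,
                       retry_on=(OSError,), sleep=time.sleep,
                       clock=time.monotonic):
    """Call connect() until it succeeds or timeout_s has elapsed.

    Re-raises the last connection error once the deadline has passed,
    so the process can exit with a failure.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return connect()
        except retry_on:
            if clock() >= deadline:
                raise
            # The proxy may still be starting; wait and try again.
            sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting. A fixed interval keeps the sketch short; exponential backoff would be a reasonable alternative.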

@jackwotherspoon
Collaborator

Going to go ahead and close this as I believe the initial question has been answered 😄

If there are any follow-up questions feel free to re-open the issue or create a new issue for the question.

Have a great day and thanks for using the Cloud SQL Proxy.
