Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certificate refresh cycle does not work with laptop sleep #1199

Closed
honDhan opened this issue May 11, 2022 · 17 comments · Fixed by #1223
Closed

Certificate refresh cycle does not work with laptop sleep #1199

honDhan opened this issue May 11, 2022 · 17 comments · Fixed by #1223
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@honDhan
Copy link

honDhan commented May 11, 2022

Question

What is the best way to set up Cloud SQL Proxy on MacOS for local access to databases? This is so that our engineering team can securely connected to our databases.

I currently have things set up using launchd and launchctl.

Additional Context

As mentioned, I have things set up with launchd on MacOS.

The issue is that sometimes, the proxy does not refresh the certificate. This happens when laptops go to sleep and wake up after the refresh time:

2022/05/11 09:14:08 Scheduling refresh of ephemeral certificate in 59m3.548513709s
2022/05/11 09:14:15 ephemeral certificate for instance OUR_DATABASE will expire soon, refreshing now.
2022/05/11 09:14:15 refreshing ephemeral certificate for instance OUR_DATABASE
2022/05/11 09:14:15 Scheduling refresh of ephemeral certificate in 59m3.678916417s
2022/05/11 10:49:50 New connection for "OUR_DATABASE"
2022/05/11 10:49:51 Reading data from OUR_DATABASE had error: remote error: tls: bad certificate
2022/05/11 10:49:51 New connection for "OUR_DATABASE"
2022/05/11 10:49:51 Reading data from OUR_DATABASE had error: remote error: tls: bad certificate
2022/05/11 10:49:51 New connection for "OUR_DATABASE"
2022/05/11 10:49:51 Reading data from OUR_DATABASE had error: remote error: tls: bad certificate
2022/05/11 10:49:55 New connection for "OUR_DATABASE"
2022/05/11 10:49:56 Reading data from OUR_DATABASE had error: remote error: tls: bad certificate
2022/05/11 10:50:01 New connection for "OUR_DATABASE"
2022/05/11 10:50:01 Reading data from OUR_DATABASE had error: remote error: tls: bad certificate
2022/05/11 10:54:16 ephemeral certificate for instance athelas-staging:us-central1:athelas-staging will expire soon, refreshing now.
2022/05/11 10:54:16 refreshing ephemeral certificate for instance athelas-staging:us-central1:athelas-staging
2022/05/11 10:54:17 Scheduling refresh of ephemeral certificate in 59m3.524621084s
2022/05/11 10:54:23 ephemeral certificate for instance OUR_DATABASE will expire soon, refreshing now.
2022/05/11 10:54:23 refreshing ephemeral certificate for instance OUR_DATABASE
2022/05/11 10:54:23 Scheduling refresh of ephemeral certificate in 59m3.670461042s
2022/05/11 10:56:27 New connection for "OUR_DATABASE"

Notice how there was a refresh scheduled for 10:14, but it did not go through.

A possible solution is to have the launchd service restart on sleep/wake, but that isn't a feature of launchd and I'd prefer not to do that. Is this an issue with the proxy, Go, or MacOS?

@honDhan honDhan added the type: question Request for information or clarification. label May 11, 2022
@enocom
Copy link
Member

enocom commented May 12, 2022

I'm not sure what I'd expect from Go. In effect, the CPU is asleep when the refresh goroutine should run. So when the CPU awakes, does the refresh goroutine run immediately, or is it dropped?

Here's the code in question:

https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/main/proxy/proxy/client.go#L238-L242

I'll need to test this out myself.

@enocom
Copy link
Member

enocom commented May 12, 2022

After playing around with this, I see the problem is that time.After will only run once the goroutine has observed so much time passing. When the computer goes to sleep, the clock also stops in effect. So this is probably a bug in our implementation.

@enocom enocom added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed type: question Request for information or clarification. labels May 12, 2022
@enocom enocom changed the title Best way to setup Cloud SQL Proxy as a service on MacOS Certificate refresh cycle does not work with laptop sleep May 12, 2022
@kurtisvg
Copy link
Contributor

After playing around with this, I see the problem is that time.After will only run once the goroutine has observed so much time passing. When the computer goes to sleep, the clock also stops in effect. So this is probably a bug in our implementation.

Is this is a bug with our implementation or in Go?

@enocom
Copy link
Member

enocom commented May 12, 2022

When we say time.After(duration), that duration only elapses if the computer is awake. So I assume we could alter our implementation such that we don't just wait a duration, but check if now is after the time we want to do a refresh. Based on my testing, though, Go is behaving in unusual ways. I'll update the thread here with more details.

@enocom
Copy link
Member

enocom commented May 12, 2022

But perhaps it is a bug in Go with time.After -- the question is does Go make any claims about an elapsed duration when the CPU is asleep for part of it?

@enocom
Copy link
Member

enocom commented May 12, 2022

From the docs on time:

On some systems the monotonic clock will stop if the computer goes to sleep. On such a system, t.Sub(u) may not accurately reflect the actual time that passed between t and u.

This might just be a limitation of the hardware Go is running on. If the monotonic clock stops when the computer goes to sleep, I'm not sure what we could do here.

@honDhan
Copy link
Author

honDhan commented May 13, 2022

check if now is after the time we want to do a refresh

I think this would be a valid workaround.

Another solution I was thinking of was having a flag to lower the refresh time for certs. e.g. if a refresh is scheduled for every 20-30 minutes, the time between (laptop waking up)<->(needing to connect to db) is lower

This might just be a limitation of the hardware Go is running on. If the monotonic clock stops when the computer goes to sleep, I'm not sure what we could do here.

Probably the issue. Our eng team uses M1 macs, so I'm not surprised it's causing an issue.

@honDhan
Copy link
Author

honDhan commented May 13, 2022

Perhaps giving the refresh a context with a wall clock based deadline?
something like:

func (c *Client) refreshCertAfter(instance string, timeToRefresh time.Duration, ctx Context) {

        select {
        case <-ctx.Done(): // potentially wall clock based?
	case <-time.After(timeToRefresh): // retain monotonic clock functionality
        }

@honDhan
Copy link
Author

honDhan commented May 13, 2022

did some digging, seems like this is an accepted proposal for Go: golang/go#36141

@enocom
Copy link
Member

enocom commented May 23, 2022

Nice find! I'm curious to explore what implementation options we have in the meantime.

@honDhan
Copy link
Author

honDhan commented Jun 10, 2022

Hi! Any update on a possible fix? This is causing inconvenience in our workflows so a fix would be greatly appreciated. Thanks!

@kurtisvg
Copy link
Contributor

@honDhan Do you know what version of the proxy you are using?

Looking at head, I suspect this might already be fixed since we check the cert expiration here: https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/49d3003c018afdc0cde54340d5be808f9dcd5c84/proxy/proxy/client.go#L380

and here:
https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/49d3003c018afdc0cde54340d5be808f9dcd5c84/proxy/proxy/client.go#L423-L426

@enocom
Copy link
Member

enocom commented Jun 10, 2022

I tested the latest version of the proxy, and see that this issue persists. In short, after waking from sleep, the proxy doesn't know to immediately refresh the certificate.

I have a possible fix and will send a PR for discussion.

@enocom
Copy link
Member

enocom commented Jun 14, 2022

@honDhan I did some testing with my patch and believe my PR fixes the issue. We'll be releasing this on the first Tuesday of next month as part of the usual release cadence, but if you'd like to try a dev build off of main, I'd be curious to hear if you see an improvement.

@enocom
Copy link
Member

enocom commented Aug 26, 2022

@honDhan Circling back to this -- I wonder if you've seen any improvement with this fix in v1.31.1?

@honDhan
Copy link
Author

honDhan commented Aug 27, 2022

Hey! Sorry for the lack of response. I am using .32.1 and it's been going well so far. No complaints from team members either. Thank you so much!

@enocom
Copy link
Member

enocom commented Aug 29, 2022

Fantastic. Thanks for the update!

enocom added a commit to GoogleCloudPlatform/cloud-sql-go-connector that referenced this issue Mar 10, 2024
In Go, when two time.Time objects have a monotonic clock value, calls to
time.After will use the monotonic readings to compare the two values.
However, when the CPU goes to sleep and later awakes, the monotonic
clock will stop on some systems. As a result, the time comparison will
produce inaccurate results. See the Go documentation on time for a
discussion of the details: https://pkg.go.dev/time#hdr-Monotonic_Clocks.

In an earlier commit [1], we added code to ensure the ephemeral
certificate was not expired. If it was, we blocked the connection
attempt on a certificate refresh. This solved one bug where a laptop
went to sleep, later awoke, and was unable to connect given the existing
certificate had expired. However, the change in #686 fixed this bug, but
only for clients connecting with built-in authentication.

[1] #686

When the Go Connector is configured to use built-in authentication, it
compares the ephemeral certificate's expiration (a time.Time object
produced from JSON deserialization and therefore lacking a monotonic clock
reading) with time.Now(). Because only one of the two time values has a
monotonic reading (i.e. time.Now()), the wall clock time is used to
compare the two vales and calls to time.After produce valid results.

When the Go Connector is configured to use auto IAM authentication, it
updates the certificate's expiration to be the earlier of the
certificate's expiration or the OAuth2 token's expiration. In the
majority of cases, the OAuth2 token will expire earlier and so its
expiration will be set to the certificate's expiration. Further, the
OAuth2 token's expiration is a time.Time value *with* a monotonic clock
reading. As a result, calls to time.After will be using time values that
both have a monotonic clock reading, and so are susceptible to
producing the wrong result when a machine has been asleep for some
time.

For example, with debug logging enabled in the Proxy, we observed the
following logs after suspending the VM running the Proxy for more
than an hour, resuming the VM, and then attempting to connect:

Accepted connection from 127.0.0.1:56782
Now = 2024-03-10T04:34:37Z, Current cert expiration = 2024-03-10T00:38:48Z
Cert is valid = true
Dialing <INSTANCE_IP_ADDR>
connection aborted - error reading from instance: remote error: tls: bad certificate

Even though the cert expiration is clearly before "now," the Go
Connector incorrectly concludes the certificate is still valid. This is
on account of the erroneous monotonic clock value in time.Now that is
then used in calls to time.After.

This commit resolves this last bug by stripping the monotonic clock
value from both times before doing the comparison. In addition to
time.Round(0), time.UTC() will also strip the monotonic reading. This
commit uses time.UTC().

Past, related issues:

GoogleCloudPlatform/cloud-sql-proxy#1788
GoogleCloudPlatform/cloud-sql-proxy#1199
#402
enocom added a commit to GoogleCloudPlatform/cloud-sql-go-connector that referenced this issue Mar 10, 2024
In Go, when two time.Time objects have a monotonic clock value, calls to
time.After will use the monotonic readings to compare the two values.
However, when the CPU goes to sleep and later awakes, the monotonic
clock will stop on some systems. As a result, the time comparison will
produce inaccurate results. See the Go documentation on time for a
discussion of the details: https://pkg.go.dev/time#hdr-Monotonic_Clocks.

In an earlier commit [1], we added code to ensure the ephemeral
certificate was not expired. If it was, we blocked the connection
attempt on a certificate refresh. This solved one bug where a laptop
went to sleep, later awoke, and was unable to connect given the existing
certificate had expired. However, the change in #686 fixed this bug, but
only for clients connecting with built-in authentication.

[1] #686

When the Go Connector is configured to use built-in authentication, it
compares the ephemeral certificate's expiration (a time.Time object
produced from JSON deserialization and therefore lacking a monotonic clock
reading) with time.Now(). Because only one of the two time values has a
monotonic reading (i.e. time.Now()), the wall clock time is used to
compare the two vales and calls to time.After produce valid results.

When the Go Connector is configured to use auto IAM authentication, it
updates the certificate's expiration to be the earlier of the
certificate's expiration or the OAuth2 token's expiration. In the
majority of cases, the OAuth2 token will expire earlier and so its
expiration will be set to the certificate's expiration. Further, the
OAuth2 token's expiration is a time.Time value *with* a monotonic clock
reading. As a result, calls to time.After will be using time values that
both have a monotonic clock reading, and so are susceptible to
producing the wrong result when a machine has been asleep for some
time.

For example, with debug logging enabled in the Proxy, we observed the
following logs after suspending the VM running the Proxy for more
than an hour, resuming the VM, and then attempting to connect:

Accepted connection from 127.0.0.1:56782
Now = 2024-03-10T04:34:37Z, Current cert expiration = 2024-03-10T00:38:48Z
Cert is valid = true
Dialing <INSTANCE_IP_ADDR>
connection aborted - error reading from instance: remote error: tls: bad certificate

Even though the cert expiration is clearly before "now," the Go
Connector incorrectly concludes the certificate is still valid. This is
on account of the erroneous monotonic clock value in time.Now that is
then used in calls to time.After.

This commit resolves this last bug by stripping the monotonic clock
value from both times before doing the comparison. In addition to
time.Round(0), time.UTC() will also strip the monotonic reading. This
commit uses time.UTC().

Past, related issues:

GoogleCloudPlatform/cloud-sql-proxy#1788
GoogleCloudPlatform/cloud-sql-proxy#1199
#402
enocom added a commit to GoogleCloudPlatform/cloud-sql-go-connector that referenced this issue Mar 10, 2024
In Go, when two time.Time objects have a monotonic clock value, calls to
time.After will use the monotonic readings to compare the two values.
However, when the CPU goes to sleep and later awakes, the monotonic
clock will stop on some systems. As a result, the time comparison will
produce inaccurate results. See the Go documentation on time for a
discussion of the details: https://pkg.go.dev/time#hdr-Monotonic_Clocks.

In an earlier commit [1], we added code to ensure the ephemeral
certificate was not expired. If it was, we blocked the connection
attempt on a certificate refresh. This solved one bug where a laptop
went to sleep, later awoke, and was unable to connect given the existing
certificate had expired. However, the change in #686 fixed this bug, but
only for clients connecting with built-in authentication.

[1] #686

When the Go Connector is configured to use built-in authentication, it
compares the ephemeral certificate's expiration (a time.Time object
produced from JSON deserialization and therefore lacking a monotonic clock
reading) with time.Now(). Because only one of the two time values has a
monotonic reading (i.e. time.Now()), the wall clock time is used to
compare the two vales and calls to time.After produce valid results.

When the Go Connector is configured to use auto IAM authentication, it
updates the certificate's expiration to be the earlier of the
certificate's expiration or the OAuth2 token's expiration. In the
majority of cases, the OAuth2 token will expire earlier and so its
expiration will be set to the certificate's expiration. Further, the
OAuth2 token's expiration is a time.Time value *with* a monotonic clock
reading. As a result, calls to time.After will be using time values that
both have a monotonic clock reading, and so are susceptible to
producing the wrong result when a machine has been asleep for some
time.

For example, with debug logging enabled in the Proxy, we observed the
following logs after suspending the VM running the Proxy for more
than an hour, resuming the VM, and then attempting to connect:

Accepted connection from 127.0.0.1:56782
Now = 2024-03-10T04:34:37Z, Current cert expiration = 2024-03-10T00:38:48Z
Cert is valid = true
Dialing <INSTANCE_IP_ADDR>
connection aborted - error reading from instance: remote error: tls: bad certificate

Even though the cert expiration is clearly before "now," the Go
Connector incorrectly concludes the certificate is still valid. This is
on account of the erroneous monotonic clock value in time.Now that is
then used in calls to time.After.

This commit resolves this last bug by stripping the monotonic clock
value from both times before doing the comparison. In addition to
time.Round(0), time.UTC() will also strip the monotonic reading. This
commit uses time.UTC().

Past, related issues:

GoogleCloudPlatform/cloud-sql-proxy#1788
GoogleCloudPlatform/cloud-sql-proxy#1199
#402
enocom added a commit to GoogleCloudPlatform/cloud-sql-go-connector that referenced this issue Mar 11, 2024
* chore: improve debug logging

This commit adds a few extra debug logs to make it clear when a
certificate is invalid and being refreshed.

In addition, to help with the readability of the debug logs, this commit
rounds all refresh duration to the minute (e.g., 56m0s vs 55m59s).

* fix: strip monotonic clock reading in cert check

In Go, when two time.Time objects have a monotonic clock value, calls to
time.After will use the monotonic readings to compare the two values.
However, when the CPU goes to sleep and later awakes, the monotonic
clock will stop on some systems. As a result, the time comparison will
produce inaccurate results. See the Go documentation on time for a
discussion of the details: https://pkg.go.dev/time#hdr-Monotonic_Clocks.

In an earlier commit [1], we added code to ensure the ephemeral
certificate was not expired. If it was, we blocked the connection
attempt on a certificate refresh. This solved one bug where a laptop
went to sleep, later awoke, and was unable to connect given the existing
certificate had expired. However, the change in #686 fixed this bug, but
only for clients connecting with built-in authentication.

[1] #686

When the Go Connector is configured to use built-in authentication, it
compares the ephemeral certificate's expiration (a time.Time object
produced from JSON deserialization and therefore lacking a monotonic clock
reading) with time.Now(). Because only one of the two time values has a
monotonic reading (i.e. time.Now()), the wall clock time is used to
compare the two vales and calls to time.After produce valid results.

When the Go Connector is configured to use auto IAM authentication, it
updates the certificate's expiration to be the earlier of the
certificate's expiration or the OAuth2 token's expiration. In the
majority of cases, the OAuth2 token will expire earlier and so its
expiration will be set to the certificate's expiration. Further, the
OAuth2 token's expiration is a time.Time value *with* a monotonic clock
reading. As a result, calls to time.After will be using time values that
both have a monotonic clock reading, and so are susceptible to
producing the wrong result when a machine has been asleep for some
time.

For example, with debug logging enabled in the Proxy, we observed the
following logs after suspending the VM running the Proxy for more
than an hour, resuming the VM, and then attempting to connect:

Accepted connection from 127.0.0.1:56782
Now = 2024-03-10T04:34:37Z, Current cert expiration = 2024-03-10T00:38:48Z
Cert is valid = true
Dialing <INSTANCE_IP_ADDR>
connection aborted - error reading from instance: remote error: tls: bad certificate

Even though the cert expiration is clearly before "now," the Go
Connector incorrectly concludes the certificate is still valid. This is
on account of the erroneous monotonic clock value in time.Now that is
then used in calls to time.After.

This commit resolves this last bug by stripping the monotonic clock
value from both times before doing the comparison. In addition to
time.Round(0), time.UTC() will also strip the monotonic reading. This
commit uses time.UTC().

Past, related issues:

GoogleCloudPlatform/cloud-sql-proxy#1788
GoogleCloudPlatform/cloud-sql-proxy#1199
#402

Fixes #749
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants