-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM client: survive empty response and error #94078
ARM client: survive empty response and error #94078
Conversation
Hi @bpineau. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix.
/lgtm
/approve
/retest
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bpineau, feiskyer The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -119,6 +119,10 @@ func (c *Client) Send(ctx context.Context, request *http.Request) (*http.Respons | |||
return response, rerr | |||
} | |||
|
|||
if response == nil && rerr == nil { | |||
return response, rerr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a bug that should be fixed in sendRequest… clients expect the invariant that either response or err will be non-nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea. let's add the workaround here and check whether could we fix the underlying SDK bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhendrixMSFT do you know something about this? I think it's probably something wrong from go-autorest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant we should fix the bug in Client#sendRequest in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a bug I recently fixed that was causing a nil response and error; this was a corner-case in authentication, see Azure/go-autorest#547 for the details. I can't tell from the stack if auth is relevant here so it might be unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any additional data/frames/etc you have would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edited the PR description, now with the full stack trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found where the nil response and nil err originates:
When cluster-autoscaler's Azure provider is provided a config file, documented defaults settings values (and env variables) are not applied. Setting only cloudProviderBackoff=true
will result in a config with cloudProviderBackoffRetries: 0
(vs. 6
by default when enabled with ENABLE_BACKOFF=true
env). Cluster-autoscaler will then instanciate an armclient
with backoff.Step=0
. But azure_retry.go's doBackoffRetry()
(in this repos) only runs queries when Step > 0
, otherwise it returns the nil, nil
we're seeing here.
I'll PR cluster-autoscaler to address the source, but I wonder if we should also make that more safe in this repos.
For instance: ensuring armclient.New() sets retry.Backoff to at least 1 (means one request exec, no retry), and doBackoffRetry
logs something when called with Step=0. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bpineau good catch. Agreed with above, we should make it safer in this repo as well (set backoff to 1 if it is 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed a PR to fix this issue in armclient here: #94180.
We're seeing legacy-cloud-providersazure/clients (in our case, the synchronized copy used by cluster-autoscaler) segfaulting under heavy pressure and ARM throttling, like so: ``` panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1b595b5] cluster-autoscaler-all-79b9478bf5-cgkg8 cluster-autoscaler goroutine 82 [running]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient.(*Client).Send(0xc00052f520, 0x3b6dc40, 0xc000937440, 0xc000933f00, 0x0, 0x0) /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go:122 +0xb5 ``` Reason is the ARM client expects `sendRequest()` to return either a non nil *retry.Error, or a non nil *http.Response, while both can be nil.
399273f
to
d16eee0
Compare
/retest |
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
/lgtm |
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0, otherwise the cluster-autoscaler will instanciate a vmssclient with 0 Steps retries, which will cause `doBackoffRetry()` to return a nil response and nil error on requests. ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078 The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which is also causing all environment variables to be ignored when a configuration file is provided.
What type of PR is this?
/kind bug
What this PR does / why we need it:
We're seeing legacy-cloud-providers/azure/clients (in our case, the synchronised copy used by cluster-autoscaler 1.19) segfaulting under heavy pressure and ARM throttling:
In
k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go
,we assume ARM requests would either return either a non nil *retry.Error, or a non nil *http.Response:
But really, it does not offer such guarantee:
Which issue(s) this PR fixes:
Fixes #94077
Does this PR introduce a user-facing change?:
/assign @andyzhangx @feiskyer
/sig cloud-provider
/area provider/azure