ARM client: survive empty response and error #94078

bpineau · 2020-08-18T11:29:45Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

We're seeing legacy-cloud-providers/azure/clients (in our case, the synchronised copy used by cluster-autoscaler 1.19) segfaulting under heavy pressure and ARM throttling:

I0809 17:11:56.963285      49 azure_cache.go:83] Invalidating unowned instance cache
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1b595b5]
cluster-autoscaler-all-79b9478bf5-cgkg8 cluster-autoscaler
goroutine 82 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient.(*Client).Send(0xc00052f520, 0x3b6dc40, 0xc000937440, 0xc000933f00, 0x0, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go:122 +0xb5
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient.(*Client).GetResource(0xc00052f520, 0x3b6dc40, 0xc000937440, 0xc000998630, 0x82, 0x0, 0x0, 0xc000d0f8b0, 0xb)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go:312 +0x3a0
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssclient.(*Client).listVMSS(0xc000758700, 0x3b6dc40, 0xc000937440, 0xc000d0f8b0, 0xb, 0x0, 0x0, 0x0, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssclient/azure_vmssclient.go:181 +0x316
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssclient.(*Client).List(0xc000758700, 0x3b6dc40, 0xc000937440, 0xc000d0f8b0, 0xb, 0x57517d7, 0x57, 0x13d056b, 0x1c5b8c1)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssclient/azure_vmssclient.go:158 +0x331
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.(*AzureManager).listScaleSets(0xc000a1f200, 0xc00091afe0, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure/azure_manager.go:646 +0xe0
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.(*AzureManager).getFilteredAutoscalingGroups(0xc000a1f200, 0xc00091afe0, 0x1, 0x1, 0x8, 0x8199ee, 0x5834f80, 0x8, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure/azure_manager.go:626 +0x194
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.(*AzureManager).fetchAutoAsgs(0xc000a1f200, 0x2, 0x2)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure/azure_manager.go:549 +0x67
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.(*AzureManager).forceRefresh(0xc000a1f200, 0x0, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure/azure_manager.go:534 +0x40
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.CreateAzureManager(0x3b15060, 0xc00091af30, 0x0, 0x0, 0x0, 0xc00044bed0, 0x1, 0x1, 0x1, 0xc00018aa80, ...)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure/azure_manager.go:472 +0x5ff
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.BuildAzure(0xa, 0x3f847ae147ae147b, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0x0, 0x1e84800, 0x0, 0xf4240000000000, 0x0, ...)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go:165 +0x1d0
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.buildCloudProvider(0xa, 0x3f847ae147ae147b, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0x0, 0x1e84800, 0x0, 0xf4240000000000, 0x0, ...)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/builder_all.go:57 +0x262
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.NewCloudProvider(0xa, 0x3f847ae147ae147b, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0x0, 0x1e84800, 0x0, 0xf4240000000000, 0x0, ...)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/cloud_provider_builder.go:45 +0x1e7
k8s.io/autoscaler/cluster-autoscaler/core.initializeDefaultOptions(0xc00090b790, 0x203000, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler.go:101 +0x31a
k8s.io/autoscaler/cluster-autoscaler/core.NewAutoscaler(0xa, 0x3f847ae147ae147b, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0x0, 0x1e84800, 0x0, 0xf4240000000000, 0x0, ...)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler.go:65 +0x43
main.buildAutoscaler(0x0, 0x0, 0x0, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:320 +0x33a
main.run(0xc0000a0c30)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:326 +0x39
main.main.func2(0x3b6dc40, 0xc000936200)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:430 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:208 +0x113

In k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go,
we assume ARM requests return either a non-nil *retry.Error or a non-nil *http.Response:

func (c *Client) Send(ctx context.Context, request *http.Request) (*http.Response, *retry.Error) {
        response, rerr := c.sendRequest(ctx, request)
        if rerr != nil {
                return response, rerr
        }

        if response.StatusCode != http.StatusNotFound || c.clientRegion == "" {
                return response, rerr
        }
        ...

But sendRequest offers no such guarantee, because GetError returns nil when both the response and the error are nil:

# vendor/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go
func (c *Client) sendRequest(ctx context.Context, request *http.Request) (*http.Response, *retry.Error) {
        ...
        return response, retry.GetError(response, err)
}

# vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go
func GetError(resp *http.Response, err error) *Error {
        if err == nil && resp == nil {
                return nil
        }
        ...

Which issue(s) this PR fixes:

Fixes #94077

Does this PR introduce a user-facing change?:

Azure ARM client: don't segfault on empty response and http error

/assign @andyzhangx @feiskyer
/sig cloud-provider
/area provider/azure

k8s-ci-robot · 2020-08-18T11:29:53Z

Hi @bpineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andyzhangx · 2020-08-18T11:38:34Z

/ok-to-test
/kind bug
/priority important-soon

feiskyer

Thanks for the fix.

/lgtm
/approve
/retest

k8s-ci-robot · 2020-08-18T12:06:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/legacy-cloud-providers/azure/OWNERS~~ [feiskyer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

liggitt · 2020-08-18T12:32:09Z

staging/src/k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go

@@ -119,6 +119,10 @@ func (c *Client) Send(ctx context.Context, request *http.Request) (*http.Respons
 		return response, rerr
 	}

+	if response == nil && rerr == nil {
+		return response, rerr


This seems like a bug that should be fixed in sendRequest… clients expect the invariant that either response or err will be non-nil.

Yeah, let's add the workaround here and check whether we can fix the underlying SDK bug.

@jhendrixMSFT do you know anything about this? I think something is probably wrong in go-autorest.

I meant we should fix the bug in Client#sendRequest in this file

There was a bug I recently fixed that was causing a nil response and error; this was a corner-case in authentication, see Azure/go-autorest#547 for the details. I can't tell from the stack if auth is relevant here so it might be unrelated.

Any additional data/frames/etc you have would be great.

Edited the PR description, now with the full stack trace.

Found where the nil response and nil err originates:

When cluster-autoscaler's Azure provider is given a config file, the documented default settings (and environment variables) are not applied. Setting only cloudProviderBackoff=true results in a config with cloudProviderBackoffRetries: 0 (vs. 6 by default when backoff is enabled with the ENABLE_BACKOFF=true env variable). Cluster-autoscaler then instantiates an armclient with backoff.Steps=0. But azure_retry.go's doBackoffRetry() (in this repo) only runs queries when Steps > 0; otherwise it returns the nil, nil we're seeing here.

I'll open a PR against cluster-autoscaler to address the source, but I wonder if we should also make this safer in this repo.
For instance: ensuring armclient.New() sets retry.Backoff to at least 1 (meaning one request execution, no retry), and having doBackoffRetry log something when called with Steps=0. What do you think?

@bpineau good catch. Agreed with above, we should make it safer in this repo as well (set backoff to 1 if it is 0).

Filed a PR to fix this issue in armclient here: #94180.

We're seeing legacy-cloud-providers/azure/clients (in our case, the synchronized copy used by cluster-autoscaler) segfaulting under heavy pressure and ARM throttling, like so:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1b595b5]
cluster-autoscaler-all-79b9478bf5-cgkg8 cluster-autoscaler
goroutine 82 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient.(*Client).Send(0xc00052f520, 0x3b6dc40, 0xc000937440, 0xc000933f00, 0x0, 0x0)
        /home/jb/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/armclient/azure_armclient.go:122 +0xb5
```

The reason is that the ARM client expects `sendRequest()` to return either a non-nil *retry.Error or a non-nil *http.Response, while both can be nil.

andyzhangx · 2020-08-20T01:05:12Z

/retest

When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries` must also be set to a value > 0. Otherwise the cluster-autoscaler will instantiate a vmssclient with a retry `Steps` of 0, which causes `doBackoffRetry()` to return a nil response and a nil error on requests. The ARM client can't cope with those and will then segfault. See kubernetes/kubernetes#94078.

The README.md needed a small update, because the documented defaults are a bit misleading: they don't apply when the cluster-autoscaler is provided a config file, due to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308 ... which also causes all environment variables to be ignored when a configuration file is provided.
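
As an illustration of the constraint above, one possible way to fill in the retry count when a config file enables backoff without setting it is sketched below. This is not taken from the cluster-autoscaler change itself: `config`, `applyBackoffDefaults` and `defaultBackoffRetries` are hypothetical names, the field names simply mirror the config keys quoted above, and the value 6 is the documented default mentioned in this thread.

```go
package main

import "fmt"

// config is a hypothetical subset of the Azure provider configuration.
type config struct {
	CloudProviderBackoff        bool
	CloudProviderBackoffRetries int
}

// defaultBackoffRetries is the documented default quoted in the discussion.
const defaultBackoffRetries = 6

// applyBackoffDefaults makes sure that enabling backoff never leaves the
// retry count at zero, which would mean "never send the request".
func applyBackoffDefaults(c config) config {
	if c.CloudProviderBackoff && c.CloudProviderBackoffRetries <= 0 {
		c.CloudProviderBackoffRetries = defaultBackoffRetries
	}
	return c
}

func main() {
	// A config file that only sets cloudProviderBackoff=true.
	c := applyBackoffDefaults(config{CloudProviderBackoff: true})
	fmt.Println(c.CloudProviderBackoffRetries) // 6
}
```

An explicitly configured positive value is left untouched, and nothing changes when backoff is disabled.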

feiskyer · 2020-09-07T02:44:41Z

/lgtm
/milestone v1.20


k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 18, 2020

k8s-ci-robot assigned andyzhangx and feiskyer Aug 18, 2020

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 18, 2020

k8s-ci-robot requested review from brendandburns and justaugustus August 18, 2020 11:30

k8s-ci-robot added the area/cloudprovider label Aug 18, 2020

feiskyer reviewed Aug 18, 2020


k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 18, 2020

liggitt reviewed Aug 18, 2020


bpineau force-pushed the armclient-errors-handling branch from 399273f to d16eee0 Compare August 19, 2020 14:54

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 19, 2020

bpineau mentioned this pull request Aug 22, 2020

Azure cloud provider: backoff needs retries kubernetes/autoscaler#3449

Merged

feiskyer mentioned this pull request Aug 23, 2020

Ensure backoff step is set to 1 for Azure armclient #94180

Merged

k8s-ci-robot added this to the v1.20 milestone Sep 7, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2020

k8s-ci-robot merged commit e7420a4 into kubernetes:master Sep 7, 2020

github-actions bot mentioned this pull request Sep 15, 2020

Week Ending September 13, 2020 dev-obs/actus#223

Open


liggitt Aug 18, 2020

feiskyer Aug 19, 2020

feiskyer Aug 19, 2020

liggitt Aug 19, 2020

jhendrixMSFT Aug 19, 2020

jhendrixMSFT Aug 19, 2020

bpineau Aug 19, 2020

bpineau Aug 21, 2020

feiskyer Aug 23, 2020

feiskyer Aug 23, 2020
