
linkerd-proxy 2.11 crashes when the application issues requests resulting in Connection refused #7103

Closed
kwencel opened this issue Oct 15, 2021 · 7 comments · Fixed by linkerd/linkerd2-proxy#1318

kwencel commented Oct 15, 2021

Bug Report

What is the issue?

linkerd-proxy 2.11 crashes when the application it encapsulates issues requests that result in Connection refused.

It also crashes on the latest edge-21.10.1.
It does not crash on stable-2.10, so I strongly suspect it is a regression introduced after that version.

How can it be reproduced?

We have prepared a YAML manifest which spawns a small Locust-based stress test (not really a stress test, as it takes very few requests per second to crash linkerd-proxy). It tries to access an existing Kubernetes service (your own linkerd-identity in this example, but it can be anything) on a port which the service does not expose. As a result, the request ends with Connection refused and linkerd-proxy dies.

It crashes on our production infrastructure (an on-prem Rancher-provisioned cluster), but for simplicity we have also prepared a minikube-based scenario.

1. Spawn a minikube cluster

minikube start --cpus 2 --kubernetes-version v1.20.11 -p linkerdcrash

2. Install Linkerd 2.11

linkerd install | kubectl apply -f -

3. Run our YAML

curl -s https://gist.githubusercontent.com/kwencel/f141159ae710f85eb9ab7c54fee76a5e/raw/d37dd20518f015185b05e60d87d17dd40fdfd7c1/locust.yaml | kubectl apply -f -

Logs, error output, etc

time="2021-10-15T18:58:09Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2021-10-15T18:58:09Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[     0.000523s] ERROR ThreadId(01) linkerd_app::env: No inbound ports specified via LINKERD2_PROXY_INBOUND_PORTS
[     0.000745s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.001120s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.001142s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.001144s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.001146s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.001147s]  INFO ThreadId(01) linkerd2_proxy: Local identity is default.default.serviceaccount.identity.linkerd.cluster.local
[     0.001151s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.001153s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.014938s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.default.serviceaccount.identity.linkerd.cluster.local
[     0.829546s]  INFO ThreadId(01) outbound:server{orig_dst=10.98.134.224:12345}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=Connection refused (os error 111)
[     0.831752s]  WARN ThreadId(01) outbound:server{orig_dst=10.98.134.224:12345}: linkerd_reconnect: Service failed error=channel closed

And shortly after, linkerd-proxy dies.

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.20.11
  • Cluster Environment: on-prem RKE / minikube v1.23.2
  • Host OS: Debian 10 / minikube v1.23.2
  • Linkerd version: stable-2.11.0

Additional context

At first sight this might look like a not-so-important bug, but we have encountered it in our production usage. We had an HAProxy-based load balancer outside of the Kubernetes cluster which managed connections to several backends (also outside of the cluster).

We observed that when a request from the k8s cluster to the load balancer was made while one of those backends was restarting, HAProxy would often return an error, since it hadn't had enough time to notice the backend was down. To our surprise, that error caused the caller's linkerd-proxy to crash. We have reverted to 2.10 and it is working fine now.

olix0r added this to the stable-2.11.1 milestone Oct 19, 2021

olix0r commented Oct 19, 2021

I don't think that the linkerd proxy is crashing, exactly. Rather, it looks like the proxy is failing its kubelet readiness probes and kubelet is forcibly killing it.

kubectl describe shows

  Warning  Unhealthy  5s (x6 over 36s)  kubelet            Readiness probe failed: Get "http://10.42.0.17:4191/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

When it's in this state I'm also unable to contact the metrics endpoint on the admin server. We'll need to dig in a little more to see whether the admin server is being starved or whether it's a lower-level problem like the OS being unable to create new connections.
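
For anyone poking at a wedged proxy, the probe that kubelet is timing out on is just an HTTP GET against the proxy's admin port (4191, per the startup logs above). Below is a small, hypothetical standalone check in Rust that mimics it with one-second deadlines; the pod IP is the one from the event above, so substitute your own. This is only an illustration, not anything Linkerd ships.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Pod IP and admin port are illustrative (taken from the output above).
    let addr: SocketAddr = "10.42.0.17:4191".parse().expect("valid socket address");
    let mut stream = TcpStream::connect_timeout(&addr, Duration::from_secs(1))?;
    stream.set_read_timeout(Some(Duration::from_secs(1)))?;
    stream.write_all(b"GET /ready HTTP/1.1\r\nHost: admin\r\nConnection: close\r\n\r\n")?;

    // A healthy proxy answers "HTTP/1.1 200 OK" almost immediately; when the
    // admin server is starved, this read times out, matching kubelet's error.
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("{}", response.lines().next().unwrap_or(""));
    Ok(())
}
```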


olix0r commented Oct 19, 2021

I'm able to reproduce this with a pretty low-load configuration:

  - command:
    - locust
    - -f
    - /config/locustfile.py
    - --headless
    - -u
    - "8"
    - -r
    - "4"
    - -H
    - http://linkerd-identity.linkerd:12345

At lower request rates, the problem is not observed.

With all logging enabled, we see requests flow for about 2.5s. At the 5s mark we notice the identity client issue a DNS query. And then the proxy becomes completely unresponsive. No further logs are emitted and the admin server stops accepting connections.

The admin server runs in a dedicated thread, separate from the proxy thread pool. We could somehow be deadlocking on the policy check--this should be easy to confirm.
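
For context, here is a rough, self-contained sketch of the layout described above, assuming nothing about the proxy's real internals: an admin listener running on its own OS thread while the main thread does the proxy work. Under a tight container CPU limit, a main thread spinning at 100% can still leave that dedicated thread with little CPU time to answer probes, which would be consistent with the starvation hypothesis above.

```rust
// Rough sketch only: names, ports, and structure are illustrative, not the
// proxy's actual internals.
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() {
    // "Admin server": its own OS thread, its own listener.
    thread::spawn(|| {
        let listener = TcpListener::bind("0.0.0.0:4191").expect("bind admin port");
        for mut stream in listener.incoming().flatten() {
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf); // read (and ignore) the request
            let _ = stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 6\r\n\r\nready\n");
        }
    });

    // "Proxy" work on the main thread. If this loop spins at 100% CPU inside a
    // container with a tight CPU limit, the admin thread above gets little or
    // no CPU time and probes start timing out.
    loop {
        std::hint::spin_loop();
    }
}
```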

This behavior can be reproduced with logging set to off--so that likely eliminates any issues with tracing.

We'll continue to debug this. Thanks for the helpful report @kwencel!


kwencel commented Oct 19, 2021

No problem, I hope you'll be able to find and fix it quickly :) I also thought it might be related to probes, but in my testing the proxy dies pretty quickly, in less than 10s IIRC. That is why I assumed the proxy had just terminated itself.

But it seems the probes are configured to restart a non-responding proxy quite quickly, and that is what confused me. Thinking about it, though, I suppose it's good to have aggressive probes for components as crucial as proxies, which should be available at all times.

Please keep us posted with the progress!


olix0r commented Oct 19, 2021

I'm pretty sure I've found and fixed the issue. I'm still sorting out how to write a test to catch it, but we should have a fix ready for this week's edge release (and, subsequently, the 2.10.1 patch release).

I was able to get an lldb session running while in this state and we see:

(lldb) bt
* thread #1, name = 'linkerd2-proxy', stop reason = signal SIGSTOP
  * frame #0: 0x0000559dc3679c87 linkerd2-proxy`linkerd_proxy_http::orig_proto::downgrade_h2_error::h1b7a4818516e427c + 103
...

It turns out that this function can (incorrectly) loop infinitely, consuming the process's CPU allocation and, presumably, starving the admin thread so that it can't serve connections.
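
The referenced function searches an error's source() chain for an HTTP/2 error (see the commit message below). As an illustration only, and not the proxy's actual code, here is the general shape of such a walk in Rust, with a note on the kind of cursor mistake that would make it loop forever:

```rust
use std::error::Error;

// Illustrative sketch of walking an error's source() chain. In the real proxy
// the check is a downcast to an h2 error type; a string match stands in here.
fn find_h2_error(err: &(dyn Error + 'static)) -> Option<String> {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        if e.to_string().contains("h2") {
            return Some(e.to_string());
        }
        // The crucial step: advance to the *inner* error. If this were instead
        // re-assigned from the original outer error (e.g. `err.source()`), the
        // loop would revisit the same link forever, spinning the CPU exactly
        // as described above.
        current = e.source();
    }
    None
}

fn main() {
    let io = std::io::Error::new(std::io::ErrorKind::ConnectionRefused, "connection refused");
    assert!(find_h2_error(&io).is_none());
}
```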


kwencel commented Oct 19, 2021

Yes, that would make sense, as we observed 100% CPU usage for a few seconds just before the proxy was restarted.

I assume you meant the 2.11.1 patch release? The bug is not present in the 2.10 family. When do you think we can expect the 2.11.1 release with this bug fixed?

Thank you very much for the prompt reaction to the submitted issue!


olix0r commented Oct 19, 2021

@kwencel oops, correct: 2.11.1. I'm hoping to have this out near the end of the month. In the meantime, though, you should be able to run edge-channel proxies with the 2.11.0 control plane without issue.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Oct 19, 2021
578d979 introduced a bug: when the proxy handles errors on
HTTP/2-upgraded connections, it can get stuck in an infinite loop when
searching the error chain for an HTTP/2 error.

This change fixes this infinite loop and adds a test that exercises this
behavior to ensure that `downgrade_h2_error` behaves as expected.

Fixes linkerd/linkerd2#7103

olix0r commented Oct 19, 2021

@kwencel Thanks so much for providing a manifest to reproduce this issue :) That made it much easier to track down.

If you want to test the fix for yourself, you can set the pod annotations:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: main.ec441e581

Let us know if you see anything else unexpected!

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Oct 22, 2021
578d979 introduced a bug: when the proxy handles errors on
HTTP/2-upgraded connections, it can get stuck in an infinite loop when
searching the error chain for an HTTP/2 error.

This change fixes this infinite loop and adds a test that exercises this
behavior to ensure that `downgrade_h2_error` behaves as expected.

Fixes linkerd/linkerd2#7103
github-actions bot locked this as resolved and limited conversation to collaborators Nov 19, 2021