Connection error from keda-operator-metrics-apiserver logs #4190

Closed
prashathsenthil opened this issue Feb 1, 2023 · 22 comments · Fixed by #4327
Labels: bug

@prashathsenthil

prashathsenthil commented Feb 1, 2023

Report

After updating KEDA from version 2.8.1 to 2.9.2 (via chart version 2.9.3), the metrics apiserver logs show a connection error when reaching the KEDA operator service on port 9666.

Expected Behavior

The KEDA metrics apiserver should be able to use the gRPC connection to get metrics from the operator.

Actual Behavior

After deploying the latest KEDA version (2.9.2), I started seeing the error below in the metrics apiserver logs:

I0201 19:41:19.967541 1 main.go:162] keda_metrics_adapter "msg"="Connecting Metrics Service gRPC client to the server" "address"="keda-operator.mynamespace.svc.cluster.local:9666"

Err: connection error: desc = "transport: Error while dialing failed to do connect handshake, response: "HTTP/1.1 502 Bad Gateway\r\n

Steps to Reproduce the Problem

Update KEDA version from 2.8.1 to 2.9.2

Logs from KEDA operator

I0201 19:41:19.967541       1 main.go:162] keda_metrics_adapter "msg"="Connecting Metrics Service gRPC client to the server" "address"="keda-operator.<MYNAMESPACE>.svc.cluster.local:9666"

Err: connection error: desc = "transport: Error while dialing failed to do connect handshake, response: \"HTTP/1.1 502 Bad Gateway\\r\\n

KEDA Version

2.9.2

Kubernetes Version

1.23

Platform

Amazon Web Services

Scaler Details

No response

Anything else?

The KEDA operator service (13-keda-service.yaml) was introduced in the latest version, with reference to issue #3920; it did not exist in the older version.

I could try KEDA_USE_METRICS_SERVICE_GRPC=false, but since that option will be deprecated, I wanted to find out how to resolve the issue instead.

KEDA runs behind an enterprise proxy, and the proxy values are set via HTTP_PROXY, HTTPS_PROXY, and NO_PROXY.

@JorTurFer
Member

Hello,
I can see a typo in the URL: the message says keda-operator..svc.cluster.local:9666, but it should be keda-operator.KEDA_NAMESPACE.svc.cluster.local:9666.
The value is set here: https://github.com/kedacore/charts/blob/v2.9.3/keda/templates/22-metrics-deployment.yaml#L109
Could you check whether that value is present? I think the message you see is caused by configuration that went missing after the update. If that line isn't provided, KEDA tries to generate the address as a fallback, but that requires the KEDA namespace, and that variable is also missing.
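
For illustration, the fallback described above can be sketched as follows. This is a minimal Python sketch (KEDA itself is written in Go), and the helper names `metrics_service_address` and `has_empty_label`, as well as the `keda` namespace, are hypothetical and not taken from KEDA's code. It only shows how an empty namespace would produce the `keda-operator..svc.cluster.local:9666` address mentioned above.

```python
def metrics_service_address(service: str, namespace: str,
                            port: int = 9666,
                            cluster_domain: str = "cluster.local") -> str:
    """Build an address of the form <service>.<namespace>.svc.<domain>:<port>."""
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


def has_empty_label(address: str) -> bool:
    """Return True if the host part contains an empty DNS label (e.g. 'a..b')."""
    host = address.rsplit(":", 1)[0]
    return any(label == "" for label in host.split("."))


good = metrics_service_address("keda-operator", "keda")  # assumed namespace
bad = metrics_service_address("keda-operator", "")       # namespace missing

print(good, has_empty_label(good))  # keda-operator.keda.svc.cluster.local:9666 False
print(bad, has_empty_label(bad))    # keda-operator..svc.cluster.local:9666 True
```

A check like `has_empty_label` makes a missing namespace visible before any connection attempt is made.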

@prashathsenthil
Author

> Hello, I can see a typo in the url, the message says keda-operator..svc.cluster.local:9666 but it should be keda-operator.KEDA_NAMESPACE.svc.cluster.local:9666. The value is set here https://github.com/kedacore/charts/blob/v2.9.3/keda/templates/22-metrics-deployment.yaml#L109 Could you check if that value is present? I think that the message you see is because there are missing configs after the update. If that line isn't provided, KEDA tries to generate it in fallback but KEDA namespace is required, and that variables is also missing

@JorTurFer it's just a typo I made while writing the issue. The log actually contains the namespace name: "address"="keda-operator.mynamespace.svc.cluster.local:9666"

@JorTurFer
Member

JorTurFer commented Feb 1, 2023

@zroubalik, maybe we should exclude this communication from the proxied traffic, as it's internal traffic. I guess we could explicitly configure the connection without a proxy, so that the proxy env vars are ignored.
WDYT?

@prashathsenthil
Author

prashathsenthil commented Feb 3, 2023

After adding cluster.local to the NO_PROXY list, the 502 Bad Gateway error is resolved. However, I am still seeing the error below in the metrics server logs:

grpc: addrConn.createTransport failed to connect to "Addr": "keda-operator.mynamespace.svc.cluster.local:9666"
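
As a rough illustration of why adding cluster.local to NO_PROXY removes the 502, the sketch below uses Python's standard `urllib.request.proxy_bypass_environment` to test whether a host would bypass the proxy. KEDA is a Go program and applies Go's own NO_PROXY matching, so this only approximates the suffix-matching idea. The "before" NO_PROXY value is an assumed example, not the reporter's actual setting.

```python
from urllib.request import proxy_bypass_environment

ADDRESS = "keda-operator.mynamespace.svc.cluster.local:9666"

# 'no' is the key urllib uses for the NO_PROXY list.
before = {"no": "localhost,127.0.0.1"}               # assumed original value
after = {"no": "localhost,127.0.0.1,cluster.local"}  # with cluster.local added

print(bool(proxy_bypass_environment(ADDRESS, before)))  # False: sent to the proxy
print(bool(proxy_bypass_environment(ADDRESS, after)))   # True: proxy is bypassed
```

With the proxy bypassed, the request is no longer answered by the enterprise proxy, which is consistent with the 502 disappearing.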

@JorTurFer
Member

> After adding cluster.local to NO_PROXY list, the 502 bad gateway error is resolved, however still seeing the below error in the metric server logs,
>
> grpc: addrConn.createTransport failed to connect to "Addr": "keda-operator.mynamespace.svc.cluster.local:9666"

Sorry, I missed your reply 😢

Is the error persistent? I ask because we print the failure message but not a message for the established connection. If you see the error only once or twice, that's normal and not a problem.

@jstefankowski

Has this been resolved?

After upgrading KEDA from 2.6.1 to 2.10.1, autoscaling is not working and I am seeing this error in the apiserver log:

Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup keda-sa.keda.svc.cluster.local on 10.106.0.2:53: no such host"
1 client.go:93] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
1 provider.go:110] keda_metrics_adapter/provider "msg"="timeout" "error"="timeout while waiting to establish gRPC connection to KEDA Metrics Service server" "server"="keda-sa.keda.svc.cluster.local:9666"
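
The two failure modes seen in this thread (a name that does not resolve, versus a name that resolves but cannot be connected to) can be told apart with a small probe. The following is a hedged sketch: the function name `check_endpoint` and its return strings are made up for illustration, and it would need to run from a pod inside the cluster to say anything about in-cluster DNS names.

```python
import socket


def check_endpoint(address: str, timeout: float = 3.0) -> str:
    """Classify a host:port address as 'ok', 'dns-failure' or 'connect-failure'."""
    host, _, port = address.rpartition(":")
    try:
        infos = socket.getaddrinfo(host, int(port), type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Name resolution failed, comparable to "no such host" in the log.
        return "dns-failure"
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Try the next resolved address, if any.
            continue
    return "connect-failure"
```

For example, `check_endpoint("keda-sa.keda.svc.cluster.local:9666")` returning `"dns-failure"` would correspond to the `no such host` error above and would suggest that no Service with that name exists in that namespace.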

@JorTurFer
Member

> Has this been resolved ?

It is, yep.

Seeing that error during the startup could be normal if the operator is down, but it should be transitory.
Is your operator working?

@jstefankowski

I'm new to KEDA. How can I confirm the operator is working correctly?

@zroubalik
Member

Also, did you change anything in the configuration? keda-sa in the part with the url doesn't seem right to me.

@jstefankowski

No configuration changes whatsoever.

I think the url configured in metrics apiserver deployment should be the operator internal endpoint: keda-sa.keda:9666 instead of keda-sa.keda.svc.cluster.local:9666.

keda-sa is the operator service name.

@jstefankowski

After changing the metrics-service-address argument in the apiserver deployment, I am still getting the same error:
lookup keda-sa.keda.svc on 10.108.0.2:53: no such host

Wondering if the --metrics-service-address argument format is wrong. It is: --metrics-service-address=keda-sa.keda.svc:9666'
which translates to: ..svc:9666

@jstefankowski

Judging from the operator logs and from testing KEDA by sending SQS messages, KEDA is scaling out successfully.

2023-06-15T17:10:16Z INFO scaleexecutor Successfully updated ScaleTarget {"scaledobject.Name": "aws-sqs-queue-scaledobject-ds", "scaledObject.Namespace": "test-namespace", "scaleTarget.Name": "ds", "Original Replicas Count": 0, "New Replicas Count": 1}

Just trying to understand the root cause of the timeout errors and its potential impact on production deployments.

@zroubalik
Member

zroubalik commented Jun 15, 2023

IMHO it should be keda-operator instead of keda-sa, i.e. the name of the K8s Service that fronts the operator Deployment.

@zroubalik
Member

> keda-sa is the operator service name

I missed this note.

So you made some changes, because the original name is keda-operator.

@zroubalik
Member

> From operator logs and from testing keda by sending SQS messages keda is scaling out successfully.
>
> 2023-06-15T17:10:16Z INFO scaleexecutor Successfully updated ScaleTarget {"scaledobject.Name": "aws-sqs-queue-scaledobject-ds", "scaledObject.Namespace": "test-namespace", "scaleTarget.Name": "ds", "Original Replicas Count": 0, "New Replicas Count": 1}
>
> Just trying to understand the root cause of the timeout errors and its potential impact on production deployments.

It won't scale to more than 1 replica

@jstefankowski

@zroubalik
Spot on. I just realized that's exactly the behavior: KEDA won't scale to more than 1 replica.
Why is that?
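
A likely explanation, given here as background rather than as something confirmed in this thread: the KEDA operator itself handles activation (scaling between 0 and 1 replica), while scaling from 1 to N is done by the Kubernetes HPA using metrics served by the KEDA metrics apiserver. If the metrics apiserver cannot reach the operator over gRPC, the HPA receives no metric values and the workload stays at 1 replica. The sketch below shows the basic HPA formula from the Kubernetes documentation. The queue numbers are invented, the function name is hypothetical, and the real HPA also applies tolerances, min/max bounds and stabilization.

```python
import math
from typing import Optional


def desired_replicas(current_replicas: int,
                     current_metric: Optional[float],
                     target_metric: float) -> int:
    """Basic HPA rule: ceil(currentReplicas * currentMetric / targetMetric).

    current_metric is None when the metric could not be fetched; this sketch
    then simply keeps the current replica count.
    """
    if current_metric is None:
        return current_replicas
    return math.ceil(current_replicas * current_metric / target_metric)


print(desired_replicas(1, 30, 5))    # 6 (metric available, e.g. 30 messages, target 5)
print(desired_replicas(1, None, 5))  # 1 (metric unavailable, no scale-out)
```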

@jstefankowski

> > keda-sa is the operator service name
>
> I missed this note.
>
> So you did some changes because the original name is keda-operator

I tried patching the metrics-apiserver deployment by changing metrics-service-address to the operator service endpoint, which is keda-sa.keda:9666, but to no avail.

@zroubalik
Member

Well, in the first place we should ask why the operator service name was changed.

@jstefankowski

That I'm not sure about (inherited cluster). The operator name is set equal to the service account name, hence keda-sa:
--set operator.name=${service_account_name} \

Do you think the non-default operator service name is the culprit here?

@jstefankowski

@zroubalik Also, can you expand on why KEDA "won't scale to more than 1 replica"?

@Gershon-A

Does someone have a solution?
I'm facing the same issue.

@JorTurFer
Member

@Gershon-A, this issue was closed 5 months ago, so I suggest opening another issue to track your case. There are multiple reasons that can produce this issue.
