Keda-Operator OOM problem after upgrade to Keda v2.11.* #4789

Closed
kedacore/keda-docs
#1250
@andreb89

Description

Report

Hi,
we have an OOM problem with the keda-operator on Kubernetes (AKS 1.26.3) that was introduced with version 2.11.*. We use Postgres and Prometheus triggers for ScaledJobs. For now, we have downgraded to 2.10.1 again, where we do not have this issue.

Grafana metrics for the keda-operator pod with 2.11.1:
image

After the downgrade to 2.10.1:
image

I added some keda-operator pod logs below, but nothing useful shows up around the time the OOM happens.

We are using the default resource requests/limits, e.g. for the keda-operator:

    Limits:
      cpu:     1
      memory:  1000Mi
    Requests:
      cpu:      100m
      memory:   100Mi

We have about 500 ScaledJob instances and 1 ScaledObject instance. Most of the jobs have a Prometheus trigger and follow this template:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
    meta.helm.sh/release-name: worker
    meta.helm.sh/release-namespace: ks-ns
  creationTimestamp: "2022-01-26T17
  finalizers:
  - finalizer.keda.sh
  generation: 21
  labels:
    ...
  name: worker
  namespace: ks-ns
  resourceVersion: "647584627"
  uid: 057f6b4b-8cc0-43aa-a16c-fe9ab7611d79
spec:
  failedJobsHistoryLimit: 10
  jobTargetRef:
    activeDeadlineSeconds: 1800
    backoffLimit: 6
    template:
      metadata:
        creationTimestamp: null
        labels:
         ...
      spec:
        containers:
        ...
    ttlSecondsAfterFinished: 3600
  maxReplicaCount: 20
  pollingInterval: 5
  rolloutStrategy: default
  scalingStrategy: {}
  successfulJobsHistoryLimit: 1
  triggers:
  - metadata:
      metricName: serverless_pendingjobs
      query: max(serverless_pendingjobs{queue="queue", namespace="ks-ns"})
      serverAddress: http://[cluster]:9090
      threshold: "1"
    type: prometheus
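
The operator logs further down print two deprecation notices for manifests of this shape: one for `metricName` on the Prometheus trigger and one for the top-level `rolloutStrategy` field. A minimal sketch of the same spec without those two fields could look like the fragment below. It assumes that the replacement the log calls `Rollout.Strategy` is written as `rollout.strategy` in the manifest. It only silences the warnings and is not known to affect the memory growth.

```yaml
# Fragment of the ScaledJob spec above; fields not shown stay unchanged.
spec:
  pollingInterval: 5
  maxReplicaCount: 20
  # Assumed replacement for the deprecated top-level `rolloutStrategy: default`.
  rollout:
    strategy: default
  triggers:
  - type: prometheus
    metadata:
      # `metricName` is left out because the operator log marks it as deprecated.
      query: max(serverless_pendingjobs{queue="queue", namespace="ks-ns"})
      serverAddress: http://[cluster]:9090
      threshold: "1"
```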

Expected Behavior

Memory consumption of the keda-operator should stay the same after the KEDA version update.

Actual Behavior

Memory consumption of the keda-operator pod jumps sharply after the upgrade, and the pod runs into OOM.

Steps to Reproduce the Problem

Run a larger cluster with many different ScaledJobs and upgrade KEDA from 2.10.* to 2.11.*.

This may happen for you, too. Honestly, it is unclear whether it reproduces elsewhere.

Logs from KEDA operator

  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appg-lucmerge-appf", "scaledJob.Namespace": "staging-appg-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appf-bundle-finalizer-m", "scaledJob.Namespace": "staging-appf-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Effective number of max jobs": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Number of jobs": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Number of jobs": 0}
  | Jul 11, 2023 @ 01:10:55.749 | 2023-07-10T23:10:55Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Starting manager
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Git Commit: b8dbd298cf9001b1597a2756fd0be4fa4df2059f
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	KEDA Version: 2.11.1
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Running on Kubernetes 1.26	{"version": "v1.26.3"}
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Go Version: go1.20.5
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Go OS/Arch: linux/amd64
  | Jul 11, 2023 @ 01:10:55.874 | 2023-07-10T23:10:55Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
  | Jul 11, 2023 @ 01:10:55.874 | 2023-07-10T23:10:55Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
  | Jul 11, 2023 @ 01:10:55.874 | I0710 23:10:55.874494       1 leaderelection.go:245] attempting to acquire leader lease staging-keda-serverless/operator.keda.sh...
  | Jul 11, 2023 @ 01:10:56.668 | I0710 23:10:56.668438       1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="158.111µs" userAgent="kube-probe/1.26" audit-ID="b69846c6-e714-4b2e-8109-460408fc4fa0" srcIP="10.4.8.122:49694" resp=200
  | Jul 11, 2023 @ 01:10:56.668 | I0710 23:10:56.668040       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="213.915µs" userAgent="kube-probe/1.26" audit-ID="280df468-0e0a-4222-84ee-0aed41f7c566" srcIP="10.4.8.122:49710" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274250       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="11.654072ms" userAgent="Go-http-client/2.0" audit-ID="7c0694eb-ae9d-4f70-b035-452bbd726728" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274315       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="11.800979ms" userAgent="Go-http-client/2.0" audit-ID="5d45c41a-6417-4d00-9e08-41b10b1477c1" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274552       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="12.225696ms" userAgent="Go-http-client/2.0" audit-ID="f1594dda-145d-4acb-b768-b74e1608460a" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.284 | I0710 23:11:01.283973       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="21.652078ms" userAgent="Go-http-client/2.0" audit-ID="11e6d398-2328-4f5a-abdc-8301c366b3b4" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.284 | I0710 23:11:01.283992       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="21.631177ms" userAgent="Go-http-client/2.0" audit-ID="3c64c045-a66f-4f51-9c09-a43a411cf3fc" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:02.144 | I0710 23:11:02.144046       1 httplog.go:132] "HTTP" verb="GET" URI="/openapi/v2" latency="18.651856ms" userAgent="" audit-ID="3d4f8945-2587-42c2-9703-3a10a6862d03" srcIP="10.4.1.72:52592" resp=304
  | Jul 11, 2023 @ 01:11:02.144 | I0710 23:11:02.144171       1 httplog.go:132] "HTTP" verb="GET" URI="/openapi/v3" latency="17.086093ms" userAgent="" audit-ID="d2813a34-fd4a-4bf0-a79e-d434a98a8cba" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:05.252 | I0710 23:11:05.252401       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="13.751949ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/system:serviceaccount:kube-system:resourcequota-controller" audit-ID="e25b658c-cdc2-4eb9-9b42-8e3c5dde004f" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:06.670 | I0710 23:11:06.670464       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="221.919µs" userAgent="kube-probe/1.26" audit-ID="710f9c70-3312-4855-a44d-d88e5d548618" srcIP="10.4.8.122:51914" resp=200
  | Jul 11, 2023 @ 01:11:06.673 | I0710 23:11:06.673761       1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="164.413µs" userAgent="kube-probe/1.26" audit-ID="77c18635-cf67-4ae2-a901-6e9ed6568e06" srcIP="10.4.8.122:51928" resp=200
  | Jul 11, 2023 @ 01:11:06.848 | I0710 23:11:06.848846       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="14.929647ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/system:serviceaccount:kube-system:generic-garbage-collector" audit-ID="af777025-5f23-413c-8672-1ed69c616df0" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:08.502 | I0710 23:11:08.502363       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="14.735956ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/controller-discovery" audit-ID="c8b38271-99a6-4c7b-9fb7-dab401da7004" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:12.084 | I0710 23:11:12.084112       1 leaderelection.go:255] successfully acquired lease staging-keda-serverless/operator.keda.sh
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
  | Jul 11, 2023 @ 01:11:12.095 | 2023-07-10T23:11:12Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	RolloutStrategy is deprecated, please us Rollout.Strategy in order to define the desired strategy for job rollouts	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.489 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.490 | 2023-07-10T23:11:12Z	INFO	"metricName" is deprecated and will be removed in v2.12, please do not set it anymore	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7", "trigger.type": "prometheus"}
  | Jul 11, 2023 @ 01:11:12.490 | 2023-07-10T23:11:12Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7"}
  | Jul 11, 2023 @ 01:11:13.296 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:13.309 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledObject Specification	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7"}
  | Jul 11, 2023 @ 01:11:13.310 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-leak-detection-m", "scaledJob.Namespace": "staging-appd-application", "Number of running Jobs": 0}
  | Jul 11, 2023 @ 01:11:13.310 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-leak-detection-m", "scaledJob.Namespace": "staging-appd-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	RolloutStrategy is deprecated, please us Rollout.Strategy in order to define the desired strategy for job rollouts	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.320 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-toll-qa-routing-l", "scaledJob.Namespace": "staging-appd-application", "Number of running Jobs": 0}
  | Jul 11, 2023 @ 01:11:13.320 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-toll-qa-routing-l", "scaledJob.Namespace": "staging-appd-application", "Number of pending Jobs ": 0}

KEDA Version

2.11.1

Kubernetes Version

1.26

Platform

Microsoft Azure

Scaler Details

Prometheus & Postgres

Anything else?

No response

Metadata

Labels

bug (Something isn't working)
