Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts #1057
Description
Describe the bug
We are still battling the 'cancellation' bug in our organization, frequently running into pod reconciles at times where it doesn't make sense. Typically the scale-up works (though since 0.20.1 it has been scaling up tremendously slowly, if it even reaches the number of runners needed - more in the details), but scale-down happens prematurely and cancels active workflow jobs. I have been able to test the 0.20.4 version, but cancellations occur even more often with that release, and it also gets capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities, which I am not yet able to map to any specific part of the code...
Example Log from Controller:
2022-01-18T19:15:05.548Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 8, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "runners", "horizontal_runner_autoscaler": "runners-autoscaler"}
Example from CLI written w/ same code to grab workflow runs/jobs:
./airbin/checkctl github list_jobs
2022/01/18 13:15:08 [DEBUG] Total Count: 1 for Perform release (queued)
-----workflow stats for 1713612691-----
Jobs Completed: 65
Jobs Queued: 18
Jobs In Progress: 46
Jobs Unknown: 0
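For context on the mismatch above: TotalNumberOfQueuedAndInProgressWorkflowRuns counts workflow runs, while a single matrix run can carry dozens of jobs. The following is a minimal sketch (not ARC's or checkctl's actual code; the go-github version, the owner/repo names, and the token variable are assumptions) of comparing run counts against job counts via the GitHub API:

```go
// Sketch: compare queued/in-progress workflow *runs* with the queued/in-progress
// *jobs* inside those runs. Pagination over runs/jobs is omitted for brevity.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v47/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	owner, repo := "myOrganization", "my_repository" // placeholders

	runs, _, err := client.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo,
		&github.ListWorkflowRunsOptions{Status: "in_progress", ListOptions: github.ListOptions{PerPage: 100}})
	if err != nil {
		panic(err)
	}

	var queuedJobs, inProgressJobs int
	for _, run := range runs.WorkflowRuns {
		// Each run can contain many matrix jobs; the metric counts runs, not jobs.
		jobs, _, err := client.Actions.ListWorkflowJobs(ctx, owner, repo, run.GetID(),
			&github.ListWorkflowJobsOptions{ListOptions: github.ListOptions{PerPage: 100}})
		if err != nil {
			panic(err)
		}
		for _, j := range jobs.Jobs {
			switch j.GetStatus() {
			case "queued":
				queuedJobs++
			case "in_progress":
				inProgressJobs++
			}
		}
	}
	fmt.Printf("runs in progress: %d, jobs queued: %d, jobs in progress: %d\n",
		len(runs.WorkflowRuns), queuedJobs, inProgressJobs)
}
```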
To Reproduce
Steps to reproduce the behavior:
- Use TotalNumberOfQueuedAndInProgressWorkflowRuns
- Use deployment from screenshots below
- Launch workflow with 100+ jobs (in matrix if possible)
Expected behavior
The suggested replica count should reflect the number of queued and in-progress jobs reported by the GitHub API, and runners should not be scaled down while they are still running jobs.
Screenshots
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: myOrganization
      nodeSelector:
        com.example/capacity-type: SPOT
        com.example/purpose: github-runners
        com.example/environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: runners
  scaleDownDelaySecondsAfterScaleOut: 7200
  minReplicas: 1
  maxReplicas: 150
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - my_repository
Environment (please complete the following information):
- Controller Version [0.20.1->0.20.4]
- Deployment Method: Helm
Additional context
This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it's gotten worse. I am not sure whether it's the GitHub API or something in the calculation of the suggested replicas. Either way, is there anything I am missing here? I am not sure why my own GitHub API calls from the command line give such different results from the controller's autoscaler suggested replicas.
scaleDownDelaySecondsAfterScaleOut: 7200
was put in to battle some of the unnecessary kills. It isn't working that well though: after a scale-up the runners stay at scale for a while, but eventually runners still get interrupted (killed by the controller) in the middle of tests.
Activity
jbkc85 commented on Jan 18, 2022
I might be caught on the whole "this is fixed but not released" - as I see the pagination was added recently but isn't in release 0.20.4 yet. I built it locally and will test it out tonight!
mumoshu commented on Jan 18, 2022
@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your runners RunnerDeployment?
jbkc85 commented on Jan 19, 2022
Only one. However, this one generates a matrix that can have anywhere between 75 and 200+ tests associated with it. Furthermore, during a release it runs a matrix that feeds a second matrix, meaning you generally have 125+ tests in parallel.
Very expensive testing suite and lots of dynamics :-)
toast-gear commented on Jan 19, 2022
@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know if real-world use also shows no issues.
jbkc85 commented on Jan 19, 2022
So far so good! The only issues we have run into with tests today have been memory-related - which I wish were easier to detect (the runner controller will delete the pod before I can investigate and find out about the memory eviction). Will update after a full day of runs.
jbkc85 commented on Jan 19, 2022
So during a scale down event, we had the following scenario: I am a little confused because this job was in the middle of running a test, and it also looked like a registration token had just been updated... yet it deleted the runner anyway. Any thoughts or explanation on the logic used here?
mumoshu commented on Jan 19, 2022
This means that it was unable to find the runner in the results of the "list self-hosted runners" API calls.
I thought you were now testing it with an unreleased version of ARC that does have pagination support for that API call, right?
Perhaps it's not working as intended, or you're using a wrong/outdated ARC image?
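For readers unfamiliar with the pagination being discussed: listing self-hosted runners returns at most one page of results per call, so all pages have to be walked to see every runner. A sketch of that loop, assuming the google/go-github client (this is illustrative only, not ARC's implementation):

```go
// Sketch: walk every page of the "list self-hosted runners" API for an organization,
// following go-github's Response.NextPage until it reports 0 (no more pages).
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v47/github"
)

func listAllOrgRunners(ctx context.Context, client *github.Client, org string) ([]*github.Runner, error) {
	var all []*github.Runner
	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListOrganizationRunners(ctx, org, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, runners.Runners...)
		if resp.NextPage == 0 {
			break // last page reached
		}
		opts.Page = resp.NextPage
	}
	return all, nil
}

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // use an authenticated http.Client in real use
	runners, err := listAllOrgRunners(ctx, client, "myOrganization")
	if err != nil {
		panic(err)
	}
	fmt.Println("total registered runners:", len(runners))
}
```

Without this loop, any runner beyond the first page is invisible to the caller, which matches the "unable to find the runner" symptom described above.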
jbkc85 commented on Jan 20, 2022
I created my own image based off of what is currently in HEAD of master, so it does have the pagination support for the runners now. Honestly, I am not sure where the foul play is coming from just yet - though the following lines (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/github/github.go#L147-L148) have me somewhat concerned:
Basically, in our environment a brand new node is spun up to take care of a pod. This is generally pretty fast, but it can still take 2-3 minutes from the runner 'scheduling' to the actual runner creation. This means the registration token could already time out, right? If that's so, I would like to extend this timeout to at least 5 minutes to give runners a bit more time to spin up.
That being said, I am also looking into creating an EvictionTimer struct around those items that are marked for deletion or marked as 'safe to delete'. The reason being: if we look at a project like the cluster autoscaler, nodes marked as 'able to delete' have to stay marked as such for at least 10 minutes. I think this project would benefit from taking a similar approach, giving pods a timer that ultimately gives them a chance to avoid eviction/deletion if a second or third call to the GitHub API reveals that they are in fact still active. Let me know your thoughts, and I apologize if it's a lot of babbling on!
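A rough sketch of the grace-period idea proposed above. This is purely illustrative; EvictionTimer, its fields, and the 10-minute window are hypothetical and not part of ARC:

```go
// Hypothetical EvictionTimer: a runner is only evicted after it has stayed
// "safe to delete" for a whole grace period, and any sighting of activity resets the clock.
package main

import (
	"fmt"
	"sync"
	"time"
)

type EvictionTimer struct {
	mu          sync.Mutex
	gracePeriod time.Duration
	markedAt    map[string]time.Time // runner name -> first time it was seen as deletable
}

func NewEvictionTimer(grace time.Duration) *EvictionTimer {
	return &EvictionTimer{gracePeriod: grace, markedAt: map[string]time.Time{}}
}

// MarkDeletable records the first time a runner was considered safe to delete
// and reports whether the grace period has fully elapsed since then.
func (t *EvictionTimer) MarkDeletable(runner string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	first, ok := t.markedAt[runner]
	if !ok {
		t.markedAt[runner] = now
		return false
	}
	return now.Sub(first) >= t.gracePeriod
}

// MarkActive clears the timer when a later GitHub API call shows the runner is busy again.
func (t *EvictionTimer) MarkActive(runner string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.markedAt, runner)
}

func main() {
	et := NewEvictionTimer(10 * time.Minute)
	now := time.Now()
	fmt.Println(et.MarkDeletable("runner-abc", now))                     // false: just marked
	fmt.Println(et.MarkDeletable("runner-abc", now.Add(11*time.Minute))) // true: grace period elapsed
}
```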
mumoshu commented on Jan 20, 2022
That's correct, but the 3-minute timeout there takes place only when the previously created runner token (not the PAT or installation token) is about to expire, so that the runner token is refreshed 3 minutes before it actually expires.
The thing that is closest to a "runner startup timeout" for the general case is the 10 minutes at https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L296
mumoshu commented on Jan 20, 2022
I think this is already handled by ARC (as it never tries to delete busy runners) and by actions/runner (as it delays the runner agent's termination until the in-progress workflow job completes, if any).
Back to your original issue:
If you think this specific number 8 is too low and doesn't reflect the actual number of queued and in-progress workflow runs, it might be due to pagination not working correctly, or to the recalculation not happening frequently enough.
Regarding the latter, ARC does the recalculation only on each sync interval, which is configurable via the --sync-period flag of ARC's controller-manager. If it's set to 10 seconds or so, it can keep up with demand quickly, but you'll hit the GitHub API rate limit sooner depending on your scale. If you've set it to 10 minutes or so, scale in/out can be delayed by approximately 10 minutes, but the GitHub API rate limit is less of an issue.
For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
Unfortunately that timeout is hard-coded, so you'd need to update the code and give it a longer timeout, like registrationTimeout := 20 * time.Minute, to give it a try.
jbkc85 commented on Jan 20, 2022
This was resolved because the code was in master but not in the latest 'release', so when I built an image off of master it worked as expected.
The weird thing about this - and I can certainly jump over to that issue and comment, given that the original title of this issue is no longer a true 'issue' - is that there was no further log about registrationDidTimeout. So theoretically the pod whose busy status can't be checked should never have been thrown into the deletion candidates pile, right?
mumoshu commented on Jan 20, 2022
I think it's the opposite? Currently, ARC deletes a runner once it has failed to report its status for registrationTimeout, which is equivalent to registrationDidTimeout = true.
registrationTimeout is there for a good reason, so I don't think disabling or removing it altogether is the way to go. It's there to prevent runners that failed in the middle of the initialization process from hanging forever in the K8s cluster. We did see such failure scenarios before (like a year ago or so), so we added it as a rescue.
That said, I guess it could be useful if registrationTimeout were configurable, maybe per ARC installation or per RunnerDeployment. WDYT?
jbkc85 commented on Jan 21, 2022
I would like that personally. I do think having it is a good idea, but now that I am spending more time dissecting this code, I am also wondering: is it smart to base the registrationTimeout on the creation time of the pod? There are places in ARC's controllers/runner_controller.go that force a call to update the registration token (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L562), which implies that a restart/recreation of the pod is not actually triggered... yet the reconciliation of pods uses runner.CreationTimestamp to determine whether or not registration has timed out (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runnerreplicaset_controller.go#L197). Doesn't this imply that a pod that has been up for 15+ minutes, through whatever means, is subject to immediate reconciliation if it recently re-registered and has not yet updated its status on the GitHub API?
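To make the check being discussed concrete: the registration-timeout condition boils down to comparing the pod's age against a fixed duration. A simplified sketch (not ARC's actual code; the function and variable names here are illustrative only):

```go
// Simplified illustration of the registration-timeout check discussed above.
package main

import (
	"fmt"
	"time"
)

const registrationTimeout = 10 * time.Minute // hard-coded in ARC per the discussion above

// registrationDidTimeout reports whether a runner whose status is still unknown on the
// GitHub API side has exceeded the registration timeout, measured from pod creation.
// Because the clock starts at CreationTimestamp, a long-lived pod that merely
// re-registered (without being recreated) can trip this check, which is the concern
// raised in the comment above.
func registrationDidTimeout(podCreation, now time.Time) bool {
	return now.Sub(podCreation) > registrationTimeout
}

func main() {
	created := time.Now().Add(-15 * time.Minute)              // pod has been up for 15 minutes
	fmt.Println(registrationDidTimeout(created, time.Now())) // true: deletion candidate
}
```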
stale commented on Feb 20, 2022
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.