
Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts #1057

Closed
@jbkc85

Description

Describe the bug
We are still battling the 'cancellation' bug in our organization, frequently running into pod reconciles at times where it doesn't make sense. Typically the scale-up works (though since 0.20.1 it has been scaling up tremendously slowly, if it even reaches the proper number of runners needed; more in the details below), but scale-down happens prematurely and cancels active workflow jobs. I was able to test the 0.20.4 version, but cancellations occur even more often with that release, and it also gets capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities, which I am not yet able to map to any specific part of the code.

Checks

Example Log from Controller:

2022-01-18T19:15:05.548Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 8, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "runners", "horizontal_runner_autoscaler": "runners-autoscaler"}

Example from CLI written w/ same code to grab workflow runs/jobs:

./airbin/checkctl github list_jobs
2022/01/18 13:15:08 [DEBUG] Total Count: 1 for Perform release (queued)
-----workflow stats for 1713612691-----
Jobs Completed: 65
Jobs Queued: 18
Jobs In Progress: 46
Jobs Unknown: 0
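
For reference, the CLI does roughly the following - a simplified sketch using go-github (the import version, token handling, and names here are illustrative, not the exact checkctl code):

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v39/github" // version illustrative; use whatever your module pins
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	owner, repo := "myOrganization", "my_repository"
	jobCounts := map[string]int{}

	// Walk queued and in-progress workflow runs (all pages), then every page of
	// jobs for each run, tallying job statuses the same way the autoscaler
	// metric is supposed to.
	for _, status := range []string{"queued", "in_progress"} {
		opts := &github.ListWorkflowRunsOptions{Status: status, ListOptions: github.ListOptions{PerPage: 100}}
		for {
			runs, resp, err := client.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo, opts)
			if err != nil {
				panic(err)
			}
			for _, run := range runs.WorkflowRuns {
				jobOpts := &github.ListWorkflowJobsOptions{ListOptions: github.ListOptions{PerPage: 100}}
				for {
					jobs, jresp, err := client.Actions.ListWorkflowJobs(ctx, owner, repo, run.GetID(), jobOpts)
					if err != nil {
						panic(err)
					}
					for _, j := range jobs.Jobs {
						jobCounts[j.GetStatus()]++ // "queued", "in_progress", or "completed"
					}
					if jresp.NextPage == 0 {
						break
					}
					jobOpts.Page = jresp.NextPage
				}
			}
			if resp.NextPage == 0 {
				break
			}
			opts.Page = resp.NextPage
		}
	}

	fmt.Printf("Jobs Completed: %d, Queued: %d, In Progress: %d\n",
		jobCounts["completed"], jobCounts["queued"], jobCounts["in_progress"])
}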

To Reproduce
Steps to reproduce the behavior:

  1. Use TotalNumberOfQueuedAndInProgressWorkflowRuns
  2. Use deployment from screenshots below
  3. Launch workflow with 100+ jobs (in matrix if possible)

Expected behavior
The suggested replica count from TotalNumberOfQueuedAndInProgressWorkflowRuns should line up with the queued and in-progress job counts returned by the GitHub API, and runners should not be scaled down (cancelling their jobs) while workflow jobs are still running.

Screenshots

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: myOrganization
      nodeSelector:
        com.example/capacity-type: SPOT
        com.example/purpose: github-runners
        com.example/environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: runners
  scaleDownDelaySecondsAfterScaleOut: 7200
  minReplicas: 1
  maxReplicas: 150
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - my_repository

Environment (please complete the following information):

  • Controller Version [0.20.1->0.20.4]
  • Deployment Method: Helm

Additional context

This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it has gotten worse. I am not sure whether it's the GitHub API or something in the calculation that determines the suggested replicas. Either way, is there anything I am missing here? I am not sure why my own GitHub API calls from the command line give such different results from the controller autoscaler's suggested replicas.

scaleDownDelaySecondsAfterScaleOut: 7200 was put in to battle some of the unnecessary kills. It isn't working that well though: after a scale-up, once the runners have been at scale for a while, runners still end up getting interrupted (by controller kills) in the middle of tests.

Activity

jbkc85 (Author) commented on Jan 18, 2022

I might be caught in the whole "this is fixed but not released" situation - I see the pagination support was added recently but isn't in release 0.20.4 yet. I built it locally and will test it out tonight!

mumoshu (Collaborator) commented on Jan 18, 2022

@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your runners RunnerDeployment?

jbkc85 (Author) commented on Jan 19, 2022

@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your runners RunnerDeployment?

Only one. However, this one has a generated matrix that can have anywhere between 75 and 200+ tests associated with it. Furthermore, during a release it actually runs a matrix on top of a second matrix... meaning you generally have 125+ tests in parallel.

Very expensive testing suite and lots of dynamics :-)

toast-gear (Collaborator) commented on Jan 19, 2022

@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know whether real-world use also shows no issues.

jbkc85 (Author) commented on Jan 19, 2022

@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know whether real-world use also shows no issues.

so far so good! The only issues we have run into with tests today have been memory related - which I wish were easier to detect (as the runner controller will delete the pod before I can investigate and find out about the memory eviction). Will update after a full day of runs.

jbkc85 (Author) commented on Jan 19, 2022

So during a scale-down event, we had the following scenario:

1.6426248285364957e+09	DEBUG	actions-runner-controller.runner	Rechecking the runner registration in 1m10.927349149s	{"runner": "default/runners-lsf87-qbmtw"}
1.6426248285458472e+09	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2022-01-19 20:41:28 +0000 UTC. Retrying in 58.454161511s at latest	{"runner": "default/runners-lsf87-qbmtw", "lastRegistrationCheckTime": "2022-01-19 20:40:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
1.6426248294468365e+09	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2022-01-19 20:41:28 +0000 UTC. Retrying in 57.55317141s at latest	{"runner": "default/runners-lsf87-qbmtw", "lastRegistrationCheckTime": "2022-01-19 20:40:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
1.6426248452326217e+09	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2022-01-19 20:41:28 +0000 UTC. Retrying in 41.767390462s at latest	{"runner": "default/runners-lsf87-qbmtw", "lastRegistrationCheckTime": "2022-01-19 20:40:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
1.6426249621067107e+09	INFO	actions-runner-controller.runner	Updated registration token	{"runner": "runners-lsf87-qbmtw", "repository": ""}
1.6426249621073768e+09	DEBUG	events	Normal	{"object": {"kind":"Runner","namespace":"default","name":"runners-lsf87-qbmtw","uid":"8b20e268-d9e2-44db-b9ae-9f6c79174a67","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"64949884"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
1.6426250256327832e+09	DEBUG	events	Normal	{"object": {"kind":"RunnerReplicaSet","namespace":"default","name":"runners-lsf87","uid":"16cba2f6-1005-43cf-9949-e7a81816e153","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"64952993"}, "reason": "RunnerDeleted", "message": "Deleted runner 'runners-lsf87-qbmtw'"}
1.6426250268427994e+09	INFO	runner-resource	validate resource to be updated	{"name": "runners-lsf87-qbmtw"}
1.6426250268519974e+09	INFO	runner-resource	validate resource to be updated	{"name": "runners-lsf87-qbmtw"}
1.6426250268677664e+09	INFO	actions-runner-controller.runner	Removed runner from GitHub	{"runner": "default/runners-lsf87-qbmtw", "repository": "", "organization": "my_org"}
1.6426250274022748e+09	DEBUG	actions-runner-controller.runnerreplicaset	Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.	{"runnerreplicaset": "default/runners-lsf87", "runnerName": "runners-lsf87-qbmtw"}
1.6426250298669991e+09	DEBUG	events	Normal	{"object": {"kind":"RunnerReplicaSet","namespace":"default","name":"runners-lsf87","uid":"16cba2f6-1005-43cf-9949-e7a81816e153","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"64952993"}, "reason": "RunnerDeleted", "message": "Deleted runner 'runners-lsf87-qbmtw'"}

I am a little confused because this runner was in the middle of running a test, and it also looks like its registration token had just been updated... yet the controller deleted the runner anyway. Any thoughts or explanation on the logic used here?

mumoshu (Collaborator) commented on Jan 19, 2022

Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.

This means that it was unable to find the runner among the results of the "list self-hosted runners" API calls.
I thought you were now testing with an unreleased version of ARC that does have pagination support for that API call, right?
Perhaps it's not working as intended, or you're using a wrong/outdated ARC image?

jbkc85 (Author) commented on Jan 20, 2022

Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.

This means that it was unable to find the runner among the results of the "list self-hosted runners" API calls. I thought you were now testing with an unreleased version of ARC that does have pagination support for that API call, right? Perhaps it's not working as intended, or you're using a wrong/outdated ARC image?

I created my own image based on what is currently at HEAD of master, so it does have the pagination support for the runners now.

Honestly I am not sure where the foul play is coming from just yet - though the following lines (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/github/github.go#L147-L148) have me somewhat concerned:

	// we like to give runners a chance that are just starting up and may miss the expiration date by a bit
	runnerStartupTimeout := 3 * time.Minute

Basically, in our environment a brand new node is spun up to take care of a pod... this is generally pretty fast, but it can still take 2-3 minutes from the runner being 'scheduled' to the actual runner creation. This means that the registration token could already time out, right? If that's so, I think I would like to extend this timeout to at least 5 minutes, hoping to give runners a bit more time to spin up.

That being said, I am also looking into creating an EvictionTimer struct around the items that are marked for deletion or marked as 'safe to delete'. The reasoning is that if we look at a project like the cluster autoscaler, nodes marked as 'able to delete' have to stay marked as such for at least 10 minutes before they are removed. I think this project would benefit from taking a similar approach and giving pods a timer which ultimately gives them a chance to avoid eviction/deletion if a second/third call to the GitHub API reveals that the pods are in fact still active.
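
Something like this is what I have in mind - a very rough sketch with made-up names, not anything that exists in ARC today:

// Hypothetical sketch of the idea: a runner only becomes a real deletion
// candidate after it has been continuously marked "safe to delete" for a
// grace period (similar to the cluster autoscaler's 10-minute rule).
package eviction

import (
	"sync"
	"time"
)

type EvictionTimer struct {
	mu          sync.Mutex
	gracePeriod time.Duration
	firstMarked map[string]time.Time // runner name -> when it was first marked deletable
}

func NewEvictionTimer(grace time.Duration) *EvictionTimer {
	return &EvictionTimer{gracePeriod: grace, firstMarked: map[string]time.Time{}}
}

// MarkDeletable records the runner as a candidate and reports whether it has
// stayed a candidate long enough to actually be deleted.
func (e *EvictionTimer) MarkDeletable(runner string, now time.Time) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	first, ok := e.firstMarked[runner]
	if !ok {
		e.firstMarked[runner] = now
		return false
	}
	return now.Sub(first) >= e.gracePeriod
}

// MarkBusy clears the candidate state, e.g. when a later GitHub API call
// shows the runner is in fact still running a job.
func (e *EvictionTimer) MarkBusy(runner string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.firstMarked, runner)
}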

Let me know your thoughts, and I apologize if it's a lot of babbling!

mumoshu (Collaborator) commented on Jan 20, 2022

Basically, in our environment a brand new node is spun up to take care of a pod... this is generally pretty fast, but it can still take 2-3 minutes from the runner being 'scheduled' to the actual runner creation. This means that the registration token could already time out, right?

That's correct, but the 3-minute timeout there only takes effect when the previously created runner token (not the PAT or installation token) is about to expire, so that the runner token is refreshed 3 minutes before it actually expires.
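
In other words, roughly this - a paraphrase of the rule, sketched, not an exact excerpt of github.go:

package controllers

import "time"

// Sketch of the refresh rule as described above: the runner registration
// token is refreshed once it is within runnerStartupTimeout (3 minutes) of
// expiring, so a runner that is still starting up doesn't end up holding an
// already-expired token.
func shouldRefreshRegistrationToken(expiresAt, now time.Time) bool {
	runnerStartupTimeout := 3 * time.Minute
	return expiresAt.Sub(now) <= runnerStartupTimeout
}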

The thing that is closest to a general-purpose "runner startup timeout" is the 10-minute value here: https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L296

mumoshu (Collaborator) commented on Jan 20, 2022

giving pods a timer which ultimately gives them a chance to avoid eviction/deletion if a second/third call to the GitHub API reveals that the pods are in fact still active.

I think this is already handled by ARC (as it never tries to delete busy runners) and by actions/runner (as it delays runner agent termination until the in-progress workflow job completes, if any).

Back to your original issue:

Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns

If you think this specific number 8 is too low and doesn't reflect the actual number of queued and in-progress workflow runs, it might be due to pagination not working correctly, or to the recalculation not happening frequently enough.

Regarding the latter, ARC does the recalculation only on each sync interval, which is configurable via the --sync-period flag of ARC's controller-manager. If it's set to 10 seconds or so, it will be able to keep up with demand pretty quickly, but you'll hit the GitHub API rate limit sooner, depending on your scale.

If you've set it to 10 minutes or so, scale in/out can be delayed by approximately 10 minutes, but the GitHub API rate limit is less of an issue.

For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
Unfortunately that timeout is hard-coded, so you'd need to update the code and give it a longer timeout, like registrationTimeout := 20 * time.Minute, to give it a try.
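
Concretely, something along these lines - paraphrased, the real check in the controllers may differ in detail:

package controllers

import "time"

// Paraphrased sketch only: registration is treated as timed out once the
// runner pod has existed for longer than the hard-coded registrationTimeout.
// Bumping the duration is the only change being suggested here.
func registrationDidTimeout(podCreatedAt time.Time) bool {
	registrationTimeout := 20 * time.Minute // currently 10 * time.Minute in the code
	return time.Since(podCreatedAt) > registrationTimeout
}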

jbkc85 (Author) commented on Jan 20, 2022

If you think this specific number 8 is too low and doesn't reflect the actual number of queued and in-progress workflow runs, it might be due to pagination not working correctly, or to the recalculation not happening frequently enough.

This was resolved because the code was in master but not in the latest 'release', so when I built an image off of master it worked as expected.

For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
Unfortunately that timeout is hard-coded, so you'd need to update the code and give it a longer timeout, like registrationTimeout := 20 * time.Minute, to give it a try.

The weird thing about this (and I can certainly jump over to that issue and comment, given that the original title of this issue is no longer a true 'issue') is that there was no further log about registrationDidTimeout - so theoretically the pod whose busy status couldn't be checked should never have been thrown into the deletion-candidates pile, right?

mumoshu (Collaborator) commented on Jan 20, 2022

there was no further log about registrationDidTimeout - so theoretically the pod whose busy status couldn't be checked should never have been thrown into the deletion-candidates pile, right?

I think it's the opposite? Currently, ARC deletes a runner once it has failed to report its status for registrationTimeout, which is equivalent to registrationDidTimeout = true.

registrationTimeout is there for a good reason, so I don't think disabling or removing it altogether is the way to go. It's there to prevent runners that fail in the middle of the initialization process from hanging forever in the K8s cluster. We did see such failure scenarios before (a year ago or so), so we added it as a safeguard.

That said, I guess it could be useful if registrationTimeout were configurable, maybe per ARC installation or per RunnerDeployment. WDYT?

jbkc85 (Author) commented on Jan 21, 2022

That said, I guess it could be useful if registrationTimeout were configurable, maybe per ARC installation or per RunnerDeployment. WDYT?

I would like that personally. I do think having it is a good idea, but now that I am spending more time dissecting this code, I am also wondering: is it smart to base the registrationTimeout on the creation time of the pod? There are places in controllers/runner_controller.go that force a call to update the registration token (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L562), which implies that a restart/recreation of the pod is not actually triggered... yet the pod reconciliation uses runner.CreationTimestamp to determine whether registration has timed out (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runnerreplicaset_controller.go#L197). Doesn't this imply that a pod that has been up for 15+ minutes, through whatever means, is subject to immediate reconciliation if it recently re-registered and has not yet updated its status in the GitHub API?
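
To make that concrete, a hypothetical check like this is what I am picturing instead (not the actual ARC code):

package controllers

import "time"

// Hypothetical alternative: measure the registration timeout from the most
// recent registration-token update rather than from the pod's
// CreationTimestamp, so a long-lived pod that has just re-registered gets a
// fresh window before it can be reconciled away.
func registrationTimedOut(lastRegistrationUpdate time.Time, registrationTimeout time.Duration) bool {
	return time.Since(lastRegistrationUpdate) > registrationTimeout
}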

stale commented on Feb 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

24 remaining items

