Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts #1057
Description
Describe the bug
We are still battling the 'cancellation' bug in our organization, frequently running into pod reconciles at times where it doesn't make sense. Typically the scale-up works (though since 0.20.1 it has been scaling up tremendously slowly, if it even reaches the number of runners needed - more in the details), but scale-down happens prematurely and cancels active workflow jobs. I have been able to test the 0.20.4 version, but cancellations occur even more often with that release, and it also gets capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities, which I am not yet able to map to any specific part of the code...
Example Log from Controller:
2022-01-18T19:15:05.548Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 8, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "runners", "horizontal_runner_autoscaler": "runners-autoscaler"}
Example from CLI written w/ same code to grab workflow runs/jobs:
./airbin/checkctl github list_jobs
2022/01/18 13:15:08 [DEBUG] Total Count: 1 for Perform release (queued)
-----workflow stats for 1713612691-----
Jobs Completed: 65
Jobs Queued: 18
Jobs In Progress: 46
Jobs Unknown: 0
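For context on the mismatch above: TotalNumberOfQueuedAndInProgressWorkflowRuns counts workflow runs, while a single matrix run can carry dozens of jobs. The following is a minimal sketch (not ARC's or checkctl's actual code; the go-github version, the owner/repo names, and the token variable are assumptions) of comparing run counts against job counts via the GitHub API:

```go
// Sketch: compare queued/in-progress workflow *runs* with the queued/in-progress
// *jobs* inside those runs. Pagination over runs/jobs is omitted for brevity.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v47/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	owner, repo := "myOrganization", "my_repository" // placeholders

	runs, _, err := client.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo,
		&github.ListWorkflowRunsOptions{Status: "in_progress", ListOptions: github.ListOptions{PerPage: 100}})
	if err != nil {
		panic(err)
	}

	var queuedJobs, inProgressJobs int
	for _, run := range runs.WorkflowRuns {
		// Each run can contain many matrix jobs; the metric counts runs, not jobs.
		jobs, _, err := client.Actions.ListWorkflowJobs(ctx, owner, repo, run.GetID(),
			&github.ListWorkflowJobsOptions{ListOptions: github.ListOptions{PerPage: 100}})
		if err != nil {
			panic(err)
		}
		for _, j := range jobs.Jobs {
			switch j.GetStatus() {
			case "queued":
				queuedJobs++
			case "in_progress":
				inProgressJobs++
			}
		}
	}
	fmt.Printf("runs in progress: %d, jobs queued: %d, jobs in progress: %d\n",
		len(runs.WorkflowRuns), queuedJobs, inProgressJobs)
}
```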
To Reproduce
Steps to reproduce the behavior:
- Use TotalNumberOfQueuedAndInProgressWorkflowRuns
- Use deployment from screenshots below
- Launch workflow with 100+ jobs (in matrix if possible)
Expected behavior
The suggested replica count should reflect the number of queued and in-progress jobs reported by the GitHub API, and runners should not be scaled down while they are still running jobs.
Screenshots
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: myOrganization
      nodeSelector:
        com.example/capacity-type: SPOT
        com.example/purpose: github-runners
        com.example/environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: runners
  scaleDownDelaySecondsAfterScaleOut: 7200
  minReplicas: 1
  maxReplicas: 150
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - my_repository
Environment (please complete the following information):
- Controller Version [0.20.1->0.20.4]
- Deployment Method: Helm
Additional context
This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it's gotten worse. I am not sure whether it's the GitHub API or something in the calculation of the suggested replicas. Either way, is there anything I am missing here? I am not sure why my own GitHub API calls from the command line give such different results from the controller's autoscaler suggested replicas.
scaleDownDelaySecondsAfterScaleOut: 7200
was put in to battle some of the unnecessary kills. It isn't working that well though: after a scale-up the runners stay at scale for a while, but eventually runners still get interrupted (killed by the controller) in the middle of tests.
Activity
jbkc85 commented on Jan 18, 2022
I might be caught on the whole "this is fixed but not released" - as I see the pagination was added recently but isn't in release 0.20.4 yet. I built it locally and will test it out tonight!
mumoshu commented on Jan 18, 2022
@jbkc85 Just curious, but how many repositories are intended to use the same set of organizational runners managed by your runners RunnerDeployment?
jbkc85 commented on Jan 19, 2022
Only one. However, this one generates a matrix that can have anywhere between 75 and 200+ tests associated with it. Furthermore, during a release it runs a matrix that feeds a second matrix, meaning you generally have 125+ tests in parallel.
Very expensive testing suite and lots of dynamics :-)
toast-gear commented on Jan 19, 2022
@jbkc85 please do let us know how your testing goes; we're currently testing HEAD for a release, so it would be great to know if real-world use also shows no issues.
jbkc85 commented on Jan 19, 2022
So far so good! The only issues we have run into with tests today have been memory-related - which I wish were easier to detect (the runner controller will delete the pod before I can investigate and find out about the memory eviction). Will update after a full day of runs.
jbkc85 commented on Jan 19, 2022
So during a scale down event, we had the following scenario: I am a little confused because this job was in the middle of running a test, and it also looked like a registration token had just been updated... yet it deleted the runner anyway. Any thoughts or explanation on the logic used here?
mumoshu commented on Jan 19, 2022
This means that it was unable to find the runner in the results of the "list self-hosted runners" API calls.
I thought you were now testing it with an unreleased version of ARC that does have pagination support for that API call, right?
Perhaps it's not working as intended, or you're using a wrong/outdated ARC image?
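For readers unfamiliar with the pagination being discussed: listing self-hosted runners returns at most one page of results per call, so all pages have to be walked to see every runner. A sketch of that loop, assuming the google/go-github client (this is illustrative only, not ARC's implementation):

```go
// Sketch: walk every page of the "list self-hosted runners" API for an organization,
// following go-github's Response.NextPage until it reports 0 (no more pages).
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/v47/github"
)

func listAllOrgRunners(ctx context.Context, client *github.Client, org string) ([]*github.Runner, error) {
	var all []*github.Runner
	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListOrganizationRunners(ctx, org, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, runners.Runners...)
		if resp.NextPage == 0 {
			break // last page reached
		}
		opts.Page = resp.NextPage
	}
	return all, nil
}

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // use an authenticated http.Client in real use
	runners, err := listAllOrgRunners(ctx, client, "myOrganization")
	if err != nil {
		panic(err)
	}
	fmt.Println("total registered runners:", len(runners))
}
```

Without this loop, any runner beyond the first page is invisible to the caller, which matches the "unable to find the runner" symptom described above.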
jbkc85 commented on Jan 20, 2022
I created my own image based off of what is currently in HEAD of master, so it does have the pagination support for the runners now. Honestly, I am not sure where the foul play is coming from just yet - though the following lines (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/github/github.go#L147-L148) have me somewhat concerned:
Basically, in our environment a brand new node is spun up to take care of a pod. This is generally pretty fast, but it can still take 2-3 minutes from the runner 'scheduling' to the actual runner creation. This means the registration token could already time out, right? If that's so, I would like to extend this timeout to at least 5 minutes to give runners a bit more time to spin up.
That being said, I am also looking into creating an EvictionTimer struct around those items that are marked for deletion or marked as 'safe to delete'. The reason being: if we look at a project like the cluster autoscaler, nodes marked as 'able to delete' have to stay marked as such for at least 10 minutes. I think this project would benefit from taking a similar approach, giving pods a timer that ultimately gives them a chance to avoid eviction/deletion if a second or third call to the GitHub API reveals that they are in fact still active. Let me know your thoughts, and I apologize if it's a lot of babbling on!
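A rough sketch of the grace-period idea proposed above. This is purely illustrative; EvictionTimer, its fields, and the 10-minute window are hypothetical and not part of ARC:

```go
// Hypothetical EvictionTimer: a runner is only evicted after it has stayed
// "safe to delete" for a whole grace period, and any sighting of activity resets the clock.
package main

import (
	"fmt"
	"sync"
	"time"
)

type EvictionTimer struct {
	mu          sync.Mutex
	gracePeriod time.Duration
	markedAt    map[string]time.Time // runner name -> first time it was seen as deletable
}

func NewEvictionTimer(grace time.Duration) *EvictionTimer {
	return &EvictionTimer{gracePeriod: grace, markedAt: map[string]time.Time{}}
}

// MarkDeletable records the first time a runner was considered safe to delete
// and reports whether the grace period has fully elapsed since then.
func (t *EvictionTimer) MarkDeletable(runner string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	first, ok := t.markedAt[runner]
	if !ok {
		t.markedAt[runner] = now
		return false
	}
	return now.Sub(first) >= t.gracePeriod
}

// MarkActive clears the timer when a later GitHub API call shows the runner is busy again.
func (t *EvictionTimer) MarkActive(runner string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.markedAt, runner)
}

func main() {
	et := NewEvictionTimer(10 * time.Minute)
	now := time.Now()
	fmt.Println(et.MarkDeletable("runner-abc", now))                     // false: just marked
	fmt.Println(et.MarkDeletable("runner-abc", now.Add(11*time.Minute))) // true: grace period elapsed
}
```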
mumoshu commented on Jan 20, 2022
That's correct, but the 3-minute timeout there takes place only when the previously created runner token (not the PAT or installation token) is about to expire, so that the runner token is refreshed 3 minutes before it actually expires.
The thing that is closest to a "runner startup timeout" for the general case is the 10 minutes at https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L296
mumoshu commented on Jan 20, 2022
I think this is already handled by ARC (as it never tries to delete busy runners) and by actions/runner (as it delays the runner agent's termination until the in-progress workflow job completes, if any).
Back to your original issue:
If you think this specific number 8 is too low and doesn't reflect the actual number of queued and in-progress workflow runs, it might be due to pagination not working correctly, or to the recalculation not happening frequently enough.
Regarding the latter, ARC does the recalculation only on each sync interval, which is configurable via the --sync-period flag of ARC's controller-manager. If it's set to 10 seconds or so, it can keep up with demand quickly, but you'll hit the GitHub API rate limit sooner depending on your scale. If you've set it to 10 minutes or so, scale in/out can be delayed by approximately 10 minutes, but the GitHub API rate limit is less of an issue.
For #1057 (comment), there might be a chance that the 10-minute registration timeout is not long enough for you?
Unfortunately that timeout is hard-coded, so you'd need to update the code and give it a longer timeout, like registrationTimeout := 20 * time.Minute, to give it a try.
jbkc85 commented on Jan 20, 2022
This was resolved because the code was in master but not in the latest 'release', so when I built an image off of master it worked as expected.
The weird thing about this - and I can certainly jump over to that issue and comment, given that the original title of this issue is no longer a true 'issue' - is that there was no further log about registrationDidTimeout. So theoretically the pod whose busy status can't be checked should never have been thrown into the deletion candidates pile, right?
mumoshu commented on Jan 20, 2022
I think it's the opposite? Currently, ARC deletes a runner once it has failed to report its status for registrationTimeout, which is equivalent to registrationDidTimeout = true.
registrationTimeout is there for a good reason, so I don't think disabling or removing it altogether is the way to go. It's there to prevent runners that failed in the middle of the initialization process from hanging forever in the K8s cluster. We did see such failure scenarios before (like a year ago or so), so we added it as a rescue.
That said, I guess it could be useful if registrationTimeout were configurable, maybe per ARC installation or per RunnerDeployment. WDYT?
jbkc85 commented on Jan 21, 2022
I would like that personally. I do think having it is a good idea, but now that I am spending more time dissecting this code, I am also wondering: is it smart to base the registrationTimeout on the creation time of the pod? There are places in ARC's controllers/runner_controller.go that force a call to update the registration token (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runner_controller.go#L562), which implies that a restart/recreation of the pod is not actually triggered... yet the reconciliation of pods uses runner.CreationTimestamp to determine whether or not registration has timed out (https://github.com/actions-runner-controller/actions-runner-controller/blob/9ae83dfff54c074af7cbc44df812078183d559b5/controllers/runnerreplicaset_controller.go#L197). Doesn't this imply that a pod that has been up for 15+ minutes, through whatever means, is subject to immediate reconciliation if it recently re-registered and has not yet updated its status on the GitHub API?
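To make the check being discussed concrete: the registration-timeout condition boils down to comparing the pod's age against a fixed duration. A simplified sketch (not ARC's actual code; the function and variable names here are illustrative only):

```go
// Simplified illustration of the registration-timeout check discussed above.
package main

import (
	"fmt"
	"time"
)

const registrationTimeout = 10 * time.Minute // hard-coded in ARC per the discussion above

// registrationDidTimeout reports whether a runner whose status is still unknown on the
// GitHub API side has exceeded the registration timeout, measured from pod creation.
// Because the clock starts at CreationTimestamp, a long-lived pod that merely
// re-registered (without being recreated) can trip this check, which is the concern
// raised in the comment above.
func registrationDidTimeout(podCreation, now time.Time) bool {
	return now.Sub(podCreation) > registrationTimeout
}

func main() {
	created := time.Now().Add(-15 * time.Minute)              // pod has been up for 15 minutes
	fmt.Println(registrationDidTimeout(created, time.Now())) // true: deletion candidate
}
```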
stale commented on Feb 20, 2022
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.