fix: wrong terminal counts calculated during migration check (#5400)
The migration check that runs every 30 seconds calculated a wrong count. We fetch the count of terminal statuses in the status table, but at startup we "clean up" old jobs by appending another aborted status to each job irrespective of its state, so that query can count more than one terminal status per job. As a result, when the jobs are actually migrated, more jobs are moved than the check expected.

This issue surfaced now because archival tables have a default retention of 24 hours, so on successive restarts more and more statuses were appended for the same job.

We expect the following number of jobs to be migrated:

numExpectedNumberOfMigratedJobs (e) = number of jobs (a) - number of terminal statuses in status table (b)

Because of the cleanup at startup, b keeps increasing (depending on the retention duration) even when a stays the same, which effectively decreases e. The server panics when it actually migrates more jobs than e.

With this fix, b becomes the number of job IDs with a terminal status in the status table, which is bound to stay the same even if we append more statuses for the same job.
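The difference between the two ways of computing b can be illustrated with a minimal in-memory sketch. It is not the code from this change: the `jobStatus` type, the set of terminal states, and both function names are hypothetical, and the real check presumably runs as a query against the status table.

```go
package main

import "fmt"

// jobStatus is a hypothetical, simplified row of the status table.
type jobStatus struct {
	JobID int64
	State string
}

// terminalStates is an assumed set of terminal states, for illustration only.
var terminalStates = map[string]bool{
	"succeeded": true,
	"aborted":   true,
}

// countTerminalStatuses mirrors the old definition of b:
// every row with a terminal state is counted, even repeats for one job.
func countTerminalStatuses(statuses []jobStatus) int {
	n := 0
	for _, s := range statuses {
		if terminalStates[s.State] {
			n++
		}
	}
	return n
}

// countJobsWithTerminalStatus mirrors the new definition of b:
// each job ID is counted once, however many terminal rows it has.
func countJobsWithTerminalStatus(statuses []jobStatus) int {
	seen := make(map[int64]struct{})
	for _, s := range statuses {
		if terminalStates[s.State] {
			seen[s.JobID] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	totalJobs := 3 // a: jobs 1, 2 and 3

	// Job 1 finished once, then two restarts each appended an aborted status.
	// Jobs 2 and 3 have no terminal status and should be migrated.
	statuses := []jobStatus{
		{JobID: 1, State: "succeeded"},
		{JobID: 1, State: "aborted"},
		{JobID: 1, State: "aborted"},
		{JobID: 2, State: "executing"},
	}

	oldExpected := totalJobs - countTerminalStatuses(statuses)       // 3 - 3 = 0
	newExpected := totalJobs - countJobsWithTerminalStatus(statuses) // 3 - 1 = 2

	fmt.Println("expected migrated jobs (old):", oldExpected)
	fmt.Println("expected migrated jobs (new):", newExpected)
}
```

In this sketch two jobs are actually eligible for migration. The old count expects zero, which is the kind of mismatch that led to the panic, while the distinct job ID count expects two and does not change if further aborted statuses are appended to job 1.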