Skip to content

Commit

Permalink
fix: wrong terminal counts calculated during migration check (#5400)
Browse files Browse the repository at this point in the history
Wrong count calculated as part of migration check that happens every 30 seconds.
We fetch counts of terminal statuses in the status table, but since at startup we "cleanup" old jobs just by appending another aborted status irrespective of it's state, that above query can count more than one terminal status per job.

Understandably while actually migrating the jobs, we see that more jobs than we expect have been moved because we were expecting a lesser number.
This issue happened now because archival tables have a default retention of 24 hours. so on successive restarts, more and more statuses were being appended for the same job. And we expect the following expression number of jobs to be migrated:

numExpectedNumberOfMigratedJobs(e) = number of jobs(a) - number of terminal statuses in status table(b)
Due to the cleanup at startup even when a remains same, b increases based on the retention duration effectively decreasing e. And server panics when it actually migrated more jobs than e.
Now with this fix: we change b to number of jobIDs with terminal status in the status table and it's bound to remain the same even if we append more statuses for the same job.
Sidddddarth authored Jan 20, 2025
1 parent 3258e9d commit 2541b1c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion jobsdb/migration.go
Original file line number Diff line number Diff line change
@@ -628,7 +628,7 @@ func (jd *Handle) checkIfMigrateDS(ds dataSetT) (
`with combinedResult as (
select
(select count(*) from %[1]q) as totalJobCount,
(select count(*) from %[2]q where job_state = ANY($1)) as terminalJobCount,
(select count(*) from "v_last_%[2]s" where job_state = ANY($1)) as terminalJobCount,
(select created_at from %[1]q order by job_id desc limit 1) as maxCreatedAt,
COALESCE((select exec_time < $2 from %[2]q where job_state = ANY($1) order by id asc limit 1), false) as retentionExpired
)

0 comments on commit 2541b1c

Please sign in to comment.