Race condition in backfill.sql when worker scheduler has not started #25

coiax · 2022-05-12T15:19:27Z

During a test suite run, we noticed that the backfill procedure would occasionally fail with confusing results: chunks losing all entries, the uniqueness constraint of a chunk/hypertable being broken, or a compression job being rescheduled for -infinity.

What seems to happen is that a brand new database, created by our test runner, has not yet registered the Compression Policy worker.

The worker registration looks like this in the logs. There appears to be a generic Timescale worker for each database, which is then responsible for starting jobs.

2022-05-12 14:33:46.027 UTC [1] DEBUG:  registering background worker "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.027 UTC [1] DEBUG:  starting background worker process "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.048 UTC [627] DEBUG:  database scheduler starting for database 18419
2022-05-12 14:33:46.049 UTC [627] DEBUG:  launching job 1000 "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG:  registering background worker "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG:  starting background worker process "Compression Policy [1000]"

backfill.sql reschedules the compression job for the given chunks before doing any operations.
However, when the compression policy job has not been created yet, the rescheduling takes no action.

timescaledb-extras/backfill.sql

Lines 98 to 108 in 2358d75

    
    IF compression_job_id IS NULL THEN
        old_time = NULL::timestamptz;
    ELSE
        SELECT next_start INTO old_time FROM _timescaledb_internal.bgw_job_stat WHERE job_id = compression_job_id FOR UPDATE;
        IF version = 1 THEN
            PERFORM alter_job_schedule(compression_job_id, next_start=> new_time);
        ELSE
            PERFORM alter_job(compression_job_id, next_start=> new_time);
        END IF;
    END IF;

The race occurs when the background worker is registered after the attempt to reschedule. The worker can then run at the same time as the main part of decompress_backfill(), causing data corruption.
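To make the ordering problem concrete, here is a toy Python model of the interleaving described above. It is only a sketch: ToyScheduler, overlaps_backfill, the job id and the numeric times are all hypothetical stand-ins, and nothing here is TimescaleDB's real catalog or API. It assumes the reschedule step silently does nothing when no job is found, as in the quoted branch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyScheduler:
    """Toy stand-in for the job catalog; not TimescaleDB's real data model."""
    next_start: Dict[int, float] = field(default_factory=dict)  # job_id -> next run time

    def register(self, job_id: int, start_at: float) -> None:
        # The scheduler registers the job with its own start time.
        self.next_start[job_id] = start_at

    def find_job(self) -> Optional[int]:
        # Returns None while no job is registered (compression_job_id IS NULL).
        return next(iter(self.next_start), None)

    def reschedule(self, new_time: float) -> Optional[float]:
        # Mirrors the quoted branch: a missing job means nothing is moved.
        job_id = self.find_job()
        if job_id is None:
            return None
        old_time = self.next_start[job_id]
        self.next_start[job_id] = new_time
        return old_time


def overlaps_backfill(register_first: bool) -> bool:
    """Return True if the job is due to run inside the backfill window."""
    sched = ToyScheduler()
    backfill_start, backfill_end = 10.0, 20.0  # arbitrary illustrative times
    far_future = 1_000.0

    if register_first:
        sched.register(job_id=1000, start_at=backfill_start)
        sched.reschedule(far_future)  # job exists, so it is pushed out of the way
    else:
        sched.reschedule(far_future)  # no job yet: silently does nothing
        sched.register(job_id=1000, start_at=backfill_start)  # registered afterwards

    due = sched.next_start[1000]
    return backfill_start <= due < backfill_end
```

In this model the job only collides with the backfill window when registration happens after the reschedule attempt, which matches the failure we see on freshly created databases.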

Annoyingly, we've been unable to reproduce this outside of our test suite, and I'm not sure whether this is something that can arise outside of freshly created databases that don't have background workers yet.


coiax linked a pull request May 19, 2022 that will close this issue

Fix backfill.sql compression job rescheduling #26

Open
