Race condition in backfill.sql when worker scheduler has not started #25

Open
coiax opened this issue May 12, 2022 · 0 comments · May be fixed by #26

coiax commented May 12, 2022

During a test suite run, we noticed that the backfill procedure would occasionally fail with confusing results, including chunks losing all entries, the uniqueness constraint of a chunk/hypertable being broken, or a compression job being rescheduled for -infinity.

What seems to happen is that a brand new database, created by our test runner, has not yet registered the Compression Policy worker.

Worker registration looks like this in the logs. There seems to be a generic Timescale worker for each database, which is then responsible for starting jobs.

2022-05-12 14:33:46.027 UTC [1] DEBUG:  registering background worker "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.027 UTC [1] DEBUG:  starting background worker process "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.048 UTC [627] DEBUG:  database scheduler starting for database 18419
2022-05-12 14:33:46.049 UTC [627] DEBUG:  launching job 1000 "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG:  registering background worker "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG:  starting background worker process "Compression Policy [1000]"

backfill.sql reschedules the compression job for the given chunks before doing any other operations.
However, when the compression policy job has not been created yet, the rescheduling takes no action:

IF compression_job_id IS NULL THEN
    old_time = NULL::timestamptz;
ELSE
    SELECT next_start INTO old_time FROM _timescaledb_internal.bgw_job_stat WHERE job_id = compression_job_id FOR UPDATE;
    IF version = 1 THEN
        PERFORM alter_job_schedule(compression_job_id, next_start=> new_time);
    ELSE
        PERFORM alter_job(compression_job_id, next_start=> new_time);
    END IF;
END IF;
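
To make the silent no-op easier to see, here is a minimal Python model of the branch above. It is only a sketch: the `jobs` dictionary is a hypothetical stand-in for the job statistics table, and `reschedule` is an illustrative name, not anything from backfill.sql or TimescaleDB.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical stand-in for the job statistics table: job_id -> next_start.
JobTable = Dict[int, datetime]


def reschedule(jobs: JobTable, job_id: Optional[int], new_time: datetime) -> Optional[datetime]:
    """Mirror the branch above: postpone the job if it is known, else do nothing."""
    if job_id is None:
        # No job found: nothing is postponed and no error is raised.
        return None
    old_time = jobs[job_id]
    jobs[job_id] = new_time
    return old_time


far_future = datetime(9999, 1, 1, tzinfo=timezone.utc)

# Job already known: its next start is pushed out of the way.
jobs: JobTable = {1000: datetime(2022, 5, 12, 14, 40, tzinfo=timezone.utc)}
old = reschedule(jobs, 1000, far_future)

# Job not yet known: the call returns quietly and the table is unchanged.
empty: JobTable = {}
missing = reschedule(empty, None, far_future)
```

In the second call the caller gets `None` back and has no indication that nothing was actually postponed, which is the situation described in this issue.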

The race occurs when the background worker is registered after the attempt to reschedule. The worker can then run at the same time as the main part of decompress_backfill(), causing data corruption.
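
The ordering problem can be sketched as a small deterministic model in Python. It assumes, as a simplification, that the compression job may run at any moment once it is registered unless it was postponed first. The event names and the function `worker_overlaps_backfill` are hypothetical and only illustrate the interleavings; they do not reflect TimescaleDB internals.

```python
from typing import List


def worker_overlaps_backfill(events: List[str]) -> bool:
    """Return True if the compression job could run while the backfill is in progress.

    Events (all names hypothetical):
      "register"       - the scheduler registers the compression job
      "reschedule"     - the backfill tries to postpone the job
      "backfill_start" - the backfill begins rewriting chunks
      "backfill_end"   - the backfill finishes
    """
    registered = False
    postponed = False
    in_backfill = False
    for event in events:
        if event == "register":
            registered = True
        elif event == "reschedule":
            # Rescheduling only has an effect on a job that already exists.
            postponed = registered
        elif event == "backfill_start":
            in_backfill = True
        elif event == "backfill_end":
            in_backfill = False
        else:
            raise ValueError(f"unknown event: {event}")
        # A registered job that was never postponed is free to run now.
        if in_backfill and registered and not postponed:
            return True
    return False


# Worker registered first: the reschedule takes effect and there is no overlap.
safe = worker_overlaps_backfill(
    ["register", "reschedule", "backfill_start", "backfill_end"]
)

# Worker registered after the reschedule attempt, before or during the backfill.
racy_before = worker_overlaps_backfill(
    ["reschedule", "register", "backfill_start", "backfill_end"]
)
racy_during = worker_overlaps_backfill(
    ["reschedule", "backfill_start", "register", "backfill_end"]
)
```

Only the first ordering is safe in this model. Both orderings in which registration follows the reschedule attempt leave the job free to run alongside the backfill.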

Annoyingly, we've been unable to reproduce this outside of our test suite, and I'm not sure whether this is something that can arise outside of freshly created databases that don't have background workers yet.

coiax linked a pull request on May 19, 2022 that will close this issue