Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download_file: use tempfile::Builder to avoid name collisions #5296

Merged
merged 3 commits into from
Oct 23, 2024
Merged

Conversation

xzfc
Copy link
Contributor

@xzfc xzfc commented Oct 23, 2024

Fixup for #5227.


The PR #5227 introduced a new flaky CI failure (example log).

Failed to remove downloaded shards snapshot after recovery: No such file or directory (os error 2) at path "./snapshots/tmp/downloaded.snapshot"
curl: (22) The requested URL returned error: 500
2024-10-22T12:20:24.818833Z ERROR qdrant::actix::helpers: Error processing request: Service internal error: Can't restore snapshot in local mode with missing data at shard: /home/runner/work/qdrant/qdrant/./snapshots/tmp/test-collection-shard-0-downloaded.snapshotdjoHHc
2024-10-22T12:20:24.819113Z  INFO actix_web::middleware::logger: 127.0.0.1 "PUT /collections/test-collection/shards/0/snapshots/recover HTTP/1.1" 500 228 "-" "curl/7.81.0" 0.445313
Exit code: 22
./tests/shard-snapshot-api.sh:595@curl-ok:  curl -sS --fail-with-body "$@"
./tests/shard-snapshot-api.sh:203@do-recover: curl-ok
./tests/shard-snapshot-api.sh:623@concurrent: do-recover
./tests/shard-snapshot-api.sh:275@recover-remote-concurrent: concurrent
./tests/shard-snapshot-api.sh:153@test-all: recover-remote-concurrent
./tests/shard-snapshot-api.sh:43@main: test-all
./tests/shard-snapshot-api.sh:659@main: main
Exit code: 22

The cause is this change after which the filename of a temporary download file became deterministic.
Before 5227: ./snapshots/tmp/download-{random}/{filename}.
After: 5227: ./snapshots/tmp/{filename}.
The CI test recover-remote-concurrent triggers a race condition because it tries to upload and restore multiple snapshots with the same name simultaneously.


There are two possible ways to fix that:

  1. Revert recover_shard_snapshot to use snapshots_download_tempdir (a random directory inside ./snapshots/tmp).
  2. Update the download_file() function to use random filename (via tempfile::Builder()) instead of deterministic user-provided filename (via tempfile::TempPath::from_path()).

In this PR I took the second approach as I find it more clean.
The new temp file would be named like this: ./snapshots/tmp/{filename}-{random}.download.

Additionally, this PR updates the test to spawn 10 concurrent operations instead of 2 to increase the chance of catching this condition again.

@xzfc xzfc merged commit 9c17ca2 into dev Oct 23, 2024
17 checks passed
@xzfc xzfc deleted the fixup-5227 branch October 23, 2024 15:08
timvisee pushed a commit that referenced this pull request Nov 8, 2024
* download_file: use `tempfile::Builder` to avoid name collisions

* _do_recover_from_snapshot: follow the same pattern as in recover_shard_snapshot

* tests/shard-snapshot-api.sh: PARALLEL=2 -> PARALLEL=10

Increase number of concurrent operations to catch race conditions with
greater probability.
@timvisee timvisee mentioned this pull request Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants