Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hotbackup Deadlock and Consistency fixes. #18928

Merged
merged 57 commits into from
Jul 14, 2023

Conversation

maierlars
Copy link
Contributor

@maierlars maierlars commented May 10, 2023

Scope & Purpose

Before this PR, when the user requested a HotBackup, the ArangoDB cluster waited for all in-progress transactions to be committed. This lead to deadlocks.
Also, the HotBackup snapshot of the RocksDB database did only include some portion of the WAL and was not guaranteed to include enough of the WAL to recover MerkleTrees. This would lead to MerkleTrees being inconsistent with the database and potential data loss from replication.

After this PR, requesting a HotBackup merely prevents transactions from being committed.

The HotBackup process has to take two snapshots: One of the RocksDB store, and one of the ArangoSearch store.

Consistency is achieved by taking a snapshot of RocksDB with a large enough portion of the WAL so that the recovery process after the DBServer restart can replay all necessary operations on ArangoSearch and MerkleTrees.

Unfortunately, to achieve taking a snapshot of RocksDB including the WAL files, a small patch is applied to our RocksDB fork. Future work can include upstreaming that patch.

Enterprise: https://github.com/arangodb/enterprise/pull/1267

@maierlars maierlars self-assigned this May 10, 2023
@cla-bot cla-bot bot added the cla-signed label May 10, 2023
@markuspf markuspf force-pushed the feature/block-commit-hot-backup branch 2 times, most recently from 7f51fbc to b16f412 Compare June 8, 2023 10:52
@markuspf markuspf force-pushed the feature/block-commit-hot-backup branch from b16f412 to 7ff75b8 Compare June 20, 2023 14:21
@jsteemann jsteemann added this to the devel milestone Jul 11, 2023
Copy link
Member

@neunhoef neunhoef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review not yet complete, but already some comments, publishing for earlier visibility.

arangod/Cluster/ClusterMethods.cpp Outdated Show resolved Hide resolved
arangod/Cluster/ClusterMethods.cpp Show resolved Hide resolved
if (!result.ok()) {
unlockDBServerTransactions(pool, backupId, lockedServers);
unlockServersTrxCommit(pool, backupId, lockedServers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is not a new change of this PR, but what if a lock request is sent and runs into a timeout? Then this particular coordinator might still be locked and we do not even try to roll back the lock here, since we are only contacting the lockedServers. Should we not - as in the cases further down - try to unlock all coordinators again?

Copy link
Contributor Author

@maierlars maierlars Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against changing this in this PR:

  1. This PR is already big enough and it is not a new issue.
  2. The code in question is a huge mess. It is more likely for me to break it than fix anything useful. It has to be reimplemented from ground up.
  3. The existing logic is good enough for dbservers, I would argue it is good enough for coordinators. There was probably a reason for why to do it like that.

However, I'm not saying that this should never be change. The hotbackup code (in particular the coordinator code) needs a rewrite, in particular since the assumptions have changed over time and chances are high we could come up with simpler code that works more reliably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we adress this in a BTS follow-up?

arangod/Cluster/ClusterMethods.cpp Show resolved Hide resolved
arangod/RocksDBEngine/Methods/RocksDBTrxMethods.cpp Outdated Show resolved Hide resolved
arangod/Transaction/Manager.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@Dronplane Dronplane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I've missed that, but do we test here that ArangoSearch view documents are all present (and nothing duplicated) after restoring hotbackup?

Copy link
Member

@neunhoef neunhoef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the things I have so far reported: LGTM. I will now move on to the enterprise part.

Copy link
Contributor

@joerg84 joerg84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the open point(s) can imo be addressed later

@maierlars maierlars merged commit 172abe4 into devel Jul 14, 2023
@maierlars maierlars deleted the feature/block-commit-hot-backup branch July 14, 2023 08:58
@joerg84
Copy link
Contributor

joerg84 commented Jul 17, 2023

Maybe I've missed that, but do we test here that ArangoSearch view documents are all present (and nothing duplicated) after restoring hotbackup?

Afaik it was

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants