Skip to content

Commit

Permalink
fix issues with Pregel garbage-collection (arangodb#19725)
Browse files Browse the repository at this point in the history
* fix issues with Pregel garbage-collection

when dropping a database, the Pregel conductor will still hold a ref
count for the already-dropped database, until the conductor is
garbage-collected.
garbage-collection for conductors only happens every ~20s, and by
default only after a timeout that is too long for being useful in
integration tests.
this PR changes the conductor garbage collection so that conductors of
dropped databases can be cleaned up even before their TTL has expired.

the PR also adds an extra round of garbage collection in the stop phase
of the PregelFeature, because when performing the garbage collection in
the unprepare phase only, assertion failures can occur.
the problem is that garbage-collecting conductors can post to the
scheduler, which is already forbidden during the unprepare step. the
hope is that when garbage-collecting in the stop phase, we can eliminate
all conductors before we enter the unprepare phase, so we can work
around the assertion failure.

garbage collection also turned up an issue, because the garbage
collection first tries to cancel all GC-able conductors.
Conductor::cancel() can throw however, which previously led to no
conductors being garbage collected. the PR changes the logic so that
when cancel() throws for a single conductor, the garbage collection
still goes on and cancels all other GC-able conductors and also removes
them from the internal inventory.

* updated CHANGELOG
  • Loading branch information
jsteemann authored Sep 5, 2023
1 parent b12311f commit 9b92c8e
Show file tree
Hide file tree
Showing 9 changed files with 47 additions and 9 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@ devel
Internally, `https://` will be converted to `ssl://`, and `http://` will
be converted to `tcp://`.

* Improve Pregel garbage collection for dropped databases.

* Refuse server startup with outdated little-endian on-disk format.
The little-endian on-disk format was used for deployments that were created
with either ArangoDB 3.2 or 3.3 when using the RocksDB storage engine.
Expand Down
5 changes: 5 additions & 0 deletions arangod/Pregel/Conductor/Conductor.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -558,6 +558,11 @@ bool Conductor::canBeGarbageCollected() const {
if (guard.owns_lock()) {
if (_state == ExecutionState::CANCELED || _state == ExecutionState::DONE ||
_state == ExecutionState::FATAL_ERROR) {
if (_vocbaseGuard.database().isDropped()) {
// if database is dropped already, we allow premature deletion before
// ttl.
return true;
}
return (_expires != std::chrono::system_clock::time_point{} &&
_expires <= std::chrono::system_clock::now());
}
Expand Down
33 changes: 31 additions & 2 deletions arangod/Pregel/PregelFeature.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -541,14 +541,35 @@ void PregelFeature::beginShutdown() {

// cancel all conductors and workers
for (auto& it : cs) {
it.second.conductor->cancel();
try {
it.second.conductor->cancel();
} catch (std::exception const& ex) {
// if an exception happens here, log it, but continue with
// the garbage collection. this is important, so that we
// can eventually get rid of some leftover conductors.
LOG_TOPIC("aaa06", INFO, Logger::PREGEL)
<< "unable to cancel conductor during shutdown: " << ex.what();
}
}
for (auto it : ws) {
it.second.second->cancelGlobalStep(VPackSlice());
}
}

void PregelFeature::stop() {
// garbage collect conductors here, because it may be too
// late for garbage collection during unprepare().
// during unprepare() we are not allowed to post further items
// onto the scheduler anymore, but the garbage collection
// can post onto the scheduler.
garbageCollectConductors();
}

void PregelFeature::unprepare() {
// TODO: this may trigger an assertion failure in maintainer
// mode, because it is not allowed to post to the scheduler
// during unprepare() anymore. we are working around this by
// trying to garbage-collection in the stop() phase already.
garbageCollectConductors();

std::unique_lock guard{_mutex};
Expand Down Expand Up @@ -659,7 +680,15 @@ void PregelFeature::garbageCollectConductors() try {
// cancel and kill conductors without holding the mutex
// permanently
for (auto& c : conductors) {
c->cancel();
try {
c->cancel();
} catch (std::exception const& ex) {
// if an exception happens here, log it, but continue with
// the garbage collection. this is important, so that we
// can eventually get rid of some leftover conductors.
LOG_TOPIC("517bb", INFO, Logger::PREGEL)
<< "Unable to cancel conductor for garbage-collection: " << ex.what();
}
}

std::lock_guard guard{_mutex};
Expand Down
1 change: 1 addition & 0 deletions arangod/Pregel/PregelFeature.h
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,7 @@ class PregelFeature final : public ArangodFeature {
options) override final;
void start() override final;
void beginShutdown() override final;
void stop() override final;
void unprepare() override final;

bool isStopping() const noexcept;
Expand Down
4 changes: 2 additions & 2 deletions js/common/modules/@arangodb/graph/pregel-test-helpers.js
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@ const waitUntilRunFinishedSuccessfully = function (pid, maxWaitSeconds = 120, sl
return status;
};

const waitForResultsBeeingGarbageCollected = function (pid, ttl) {
const waitForResultsBeingGarbageCollected = function (pid, ttl) {
// garbage collection runs every 20s, therefore we should wait at least that long plus the ttl given
const maxWaitSeconds = ttl + 100;
const sleepIntervalSeconds = 0.2;
Expand Down Expand Up @@ -1830,5 +1830,5 @@ exports.runFinished = runFinished;
exports.runCanceled = runCanceled;
exports.runFinishedSuccessfully = runFinishedSuccessfully;
exports.waitUntilRunFinishedSuccessfully = waitUntilRunFinishedSuccessfully;
exports.waitForResultsBeeingGarbageCollected = waitForResultsBeeingGarbageCollected;
exports.waitForResultsBeingGarbageCollected = waitForResultsBeingGarbageCollected;
exports.uniquePregelResults = uniquePregelResults;
4 changes: 2 additions & 2 deletions tests/js/common/shell/shell-pregel-basics-example.js
Original file line number Diff line number Diff line change
Expand Up @@ -202,7 +202,7 @@ function basicTestSuite() {
// after cancelling the pregel run, PREGEL_RESULT is empty

pregel.cancel(pid);
pregelTestHelpers.waitForResultsBeeingGarbageCollected(pid, 0);
pregelTestHelpers.waitForResultsBeingGarbageCollected(pid, 0);
},

test_AQL_pregel_result_is_empty_after_ttl_expires: function () {
Expand All @@ -212,7 +212,7 @@ function basicTestSuite() {
assertEqual(stats.vertexCount, 11, stats);
assertEqual(stats.edgeCount, 17, stats);

pregelTestHelpers.waitForResultsBeeingGarbageCollected(pid, ttl);
pregelTestHelpers.waitForResultsBeingGarbageCollected(pid, ttl);
},

test_AQL_pregel_result_is_empty_when_pregel_results_are_stored_in_database: function () {
Expand Down
2 changes: 1 addition & 1 deletion tests/js/common/shell/shell-pregel-random.js
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,7 @@ function randomTestSuite() {
stats = pregelTestHelpers.waitUntilRunFinishedSuccessfully(pid);
assertTrue(stats.hasOwnProperty('expires'));
assertTrue(stats.expires > stats.created);
pregelTestHelpers.waitForResultsBeeingGarbageCollected(pid, 2 * ttl);
pregelTestHelpers.waitForResultsBeingGarbageCollected(pid, 2 * ttl);
}
};
}
Expand Down
2 changes: 1 addition & 1 deletion tests/js/common/shell/shell-pregel-start-cluster.js
Original file line number Diff line number Diff line change
Expand Up @@ -269,7 +269,7 @@ function basicTestSuite() {
});

pregel.cancel(pid); // delete contents
pregelTestHelpers.waitForResultsBeeingGarbageCollected(pid, 0);
pregelTestHelpers.waitForResultsBeingGarbageCollected(pid, 0);
}
};
};
Expand Down
3 changes: 2 additions & 1 deletion tests/js/common/shell/shell-pregel-status-writer.js
Original file line number Diff line number Diff line change
Expand Up @@ -142,7 +142,7 @@ function pregelStatusWriterSuiteModules() {
tearDown: function () {
pregelTestHelpers.dropExampleGraph();
},

testSystemCollectionExists: function () {
assertTrue(db[pregelSystemCollectionName]);
const properties = db[pregelSystemCollectionName].properties();
Expand Down Expand Up @@ -239,6 +239,7 @@ function pregelStatusWriterSuiteModules() {
// delete entry again, should reply that deletion is not possible as history entry not available anymore
try {
pregel.removeHistory(pid);
fail();
} catch (error) {
assertEqual(errors.ERROR_ARANGO_DOCUMENT_NOT_FOUND.code, error.errorNum);
assertEqual(errors.ERROR_ARANGO_DOCUMENT_NOT_FOUND.message, error.errorMessage);
Expand Down

0 comments on commit 9b92c8e

Please sign in to comment.