Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Retry failed PVF execution (AmbiguousWorkerDeath) #6235

Merged
merged 6 commits into from
Nov 8, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Retry failed PVF execution
PVF execution that fails due to AmbiguousWorkerDeath should be retried once.
This should reduce the occurrence of failures due to transient conditions.

Closes #6195
  • Loading branch information
mrcnski committed Nov 3, 2022
commit 2e91ff3b1b17f375a00dcddf83b2dfedf75b3c13
58 changes: 38 additions & 20 deletions node/core/candidate-validation/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ use polkadot_node_subsystem::{
overseer, FromOrchestra, OverseerSignal, SpawnedSubsystem, SubsystemError, SubsystemResult,
SubsystemSender,
};
use polkadot_node_subsystem_util::TimeoutExt;
use polkadot_parachain::primitives::{ValidationParams, ValidationResult as WasmValidationResult};
use polkadot_primitives::v2::{
CandidateCommitments, CandidateDescriptor, CandidateReceipt, Hash, OccupiedCoreAssumption,
Expand All @@ -60,6 +61,9 @@ mod tests;

const LOG_TARGET: &'static str = "parachain::candidate-validation";

// The amount of time to wait before retrying after an AmbiguousWorkerDeath validation error.
const PVF_EXECUTION_RETRY_DELAY: Duration = Duration::from_secs(1);
slumber marked this conversation as resolved.
Show resolved Hide resolved
eskimor marked this conversation as resolved.
Show resolved Hide resolved

/// Configuration for the candidate validation subsystem
#[derive(Clone)]
pub struct Config {
Expand Down Expand Up @@ -621,28 +625,21 @@ impl ValidationBackend for ValidationHost {
timeout: Duration,
params: ValidationParams,
) -> Result<WasmValidationResult, ValidationError> {
let (tx, rx) = oneshot::channel();
if let Err(err) = self
.execute_pvf(
Pvf::from_code(raw_validation_code),
timeout,
params.encode(),
polkadot_node_core_pvf::Priority::Normal,
tx,
)
.await
{
return Err(ValidationError::InternalError(format!(
"cannot send pvf to the validation host: {:?}",
err
)))
}
let pvf = Pvf::from_code(raw_validation_code);

let validation_result = rx
.await
.map_err(|_| ValidationError::InternalError("validation was cancelled".into()))?;
let validation_result = execute_pvf_once(self, pvf.clone(), timeout, params.encode()).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: params in their encoded form may be used twice so probably it would be better to encode them once into a variable and use it twice, rather than encoding them twice? I mean cloning the already encoded form is cheaper than encoding the source two times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given there's a 1 second delay, IMO this won't make a difference

Copy link
Contributor Author

@mrcnski mrcnski Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I thought about this, but I expect the retry case to be pretty rare. I would prefer to call encode() a second time in the retry case, rather than unconditionally clone() and make an extra heap allocation in every case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.


validation_result
// If we get an AmbiguousWorkerDeath error, retry once after a brief delay, on the
// assumption that the conditions that caused this error may have been transient.
if let Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::AmbiguousWorkerDeath)) =
validation_result
{
// Wait a brief delay before retrying.
let _: Option<()> = future::pending().timeout(PVF_EXECUTION_RETRY_DELAY).await;
slumber marked this conversation as resolved.
Show resolved Hide resolved
execute_pvf_once(self, pvf, timeout, params.encode()).await
} else {
validation_result
}
}

async fn precheck_pvf(&mut self, pvf: Pvf) -> Result<(), PrepareError> {
Expand All @@ -657,6 +654,27 @@ impl ValidationBackend for ValidationHost {
}
}

// Tries executing a PVF a single time (no retries).
async fn execute_pvf_once(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this fn look like a good candidate for inclusion into ValidationHost implementation (not into ValidationBackend trait though)?

Copy link
Contributor Author

@mrcnski mrcnski Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ValidationHost? I think I agree with Chris that it should be in ValidationBackend as ValidationHost already has execute_pvf. However, we need to do some extra stuff around calling execute_pvf, so we still need validate_candidate in the ValidationBackend trait.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to keep this code in validate_candidate (don't touch it basically) and introduce a new method to ValidationBackend trait

async fn validate_candidate_with_retry(
    ..,
    can_retry: impl Fn(ValidationError) -> bool
)

And give it a default implementation (the one you already have) which calls validate_candidate

This would make a code easily testable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, except without the can_retry parameter as I didn't see a need for it. It would make the code more general but isn't it unnecessary abstraction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever you prefer, it was an insignificant suggestion :)

host: &mut ValidationHost,
pvf: Pvf,
timeout: Duration,
params: Vec<u8>,
) -> Result<WasmValidationResult, ValidationError> {
let priority = polkadot_node_core_pvf::Priority::Normal;

let (tx, rx) = oneshot::channel();
if let Err(err) = host.execute_pvf(pvf, timeout, params, priority, tx).await {
return Err(ValidationError::InternalError(format!(
"cannot send pvf to the validation host: {:?}",
err
)))
}

rx.await
.map_err(|_| ValidationError::InternalError("validation was cancelled".into()))?
}

/// Does basic checks of a candidate. Provide the encoded PoV-block. Returns `Ok` if basic checks
/// are passed, `Err` otherwise.
fn perform_basic_checks(
Expand Down