Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Separate preparation timeouts for PVF prechecking and execution #6139

Merged
merged 7 commits into from
Oct 13, 2022

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Oct 11, 2022

PULL REQUEST

Overview

Per the linked issue, we make the required changes so that preparation for
execution is more lenient (by a factor of 3) than preparation for prechecking.

We add a compilation_timeout parameter for PVF preparation job
and also split the COMPILATION_TIMEOUT constant into two new consts.

Todo

  • What kind of tests should be added?

Issues Closed

Closes #4132

@mrcnski mrcnski added the A0-please_review Pull request needs code review. label Oct 11, 2022
@mrcnski mrcnski requested a review from pepyakin October 11, 2022 17:39
@mrcnski mrcnski requested review from a team and chevdor as code owners October 11, 2022 17:52
@paritytech-ci paritytech-ci requested a review from a team October 11, 2022 17:53
@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 11, 2022

Not sure how to label this for release, i.e. what is worthy of release notes?

@rphmeier
Copy link
Contributor

@m-cat - the default should be 'silent' but major features, API changes, DB locations, and other things that impact the process of running a validator should be present in release notes.

@eskimor eskimor added C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. B0-silent Changes should not be mentioned in any release notes labels Oct 12, 2022
Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Really nice work! One thing, with regards to leniency, for execution we have a factor of 6, having the same factor here seems to make sense and would make it a bit "safer".

@eskimor
Copy link
Member

eskimor commented Oct 12, 2022

With regards to tests, for this particular change there is hardly something we could test.

Wiith regards to leniency, for execution we have a factor of 6, having the same factor here seems to make sense and would make it a bit "safer".

Ok, actually the situation is a bit different. For backing/approval timeouts we only have like 2-3 validators with the short timeout, here we have a super majority - so a smaller lenience can be justified - still node load can vary over time, also the validator set may change over time ... better safe than sorry.

Copy link
Contributor

@pepyakin pepyakin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. Don't get discouraged by the number of comments. They are all nits, which (at least for my reviews) means they are not blocking the merge and are not necessary to address.

Organizational note: I personally prefer landing code as soon as it's ready. If you see a part that is not ready and you wanna address before the merge, I am up to split it into another PR. That will give us a platform for discussion the changes with more focus. It's also possible to land and as-is and send the fixes with a follow-up PR (as long as everything is linked)

@@ -76,6 +79,8 @@ struct JobData {
/// The priority of this job. Can be bumped.
priority: Priority,
pvf: Pvf,
/// The timeout for the preparation job.
compilation_timeout: Duration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think here (and elsewhere) it's more correct to say preparation. The worker will perform preparation, which is the combination of prevalidation and compilation (production of the compiled artifact).

@@ -38,6 +38,16 @@ use std::{
time::{Duration, SystemTime},
};

/// The time period after which the precheck preparation worker is considered unresponsive and will
/// be killed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: those docs distinguish between pre-check preparation worker and execute preparation worker. I am not sure if that's the right way of thinking about that. After all it's the very same worker that does the same the job. It does not even have any parametrization.

/// The time period after which the execute preparation worker is considered unresponsive and will
/// be killed.
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It would be great if there were a doc line explaining the relationship between the two. Perhaps, moving them into a module (inline or separate file) and in the module doc explaining the stuff we discussed in DMs?

@@ -166,6 +167,7 @@ impl metrics::Metrics for Metrics {
20.0,
30.0,
60.0,
180.0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do you think that's a good resolution for this metric? IOW, ask the question, looking at a metrics dashboard do you think it's possible that you would think for yourself "I wish there were more buckets available that are higher/lesser than 180"? If you are inclined to say yes just plop more bands. It's very cold code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by cold code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally, cold code is the code that is not frequently called. Here, I meant that there are no performance reasons to save on the bands. Assuming more bands, less performance. I did not even think about this too much since it's how many preparations per second can we reasonably do in the worst case? 100? So with the performance of the node argument being irrelevant, and by extension memory as well. I don't see other arguments against it.

- **Prevalidation:** Right now this just tries to deserialize the binary with
parity-wasm. It is a part of *preparation*.
- **Compilation:** This is the process of compiling a PVF from wasm code to
machine code. It is a part of *preparation*.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This book already has a glossary. Do you think it would be better to move those there? Here, we can leave a note saying that this is a loaded document with terms refer to the glossary.

Alternatively (or better, additionally) we could embed those into the text as explainations. I think as a bonus this would allow us to structure the explaination hierarchically, IMO better. Something like the following abstract:

  1. In order to make the PVF usable for candidate validation it has to be registered on-chain
  2. As part of the registration process, it has to go through pre-checking.
  3. Pre-checking is a game of attempting preparation and reporting the results back on-chain.
  4. We define preparation as a process that: validates the consistency of the wasm binary (aka prevalidation) and the compilation of the wasm module into machine code (refered to as artifact).
  5. Besides pre-checking, preparation can also be triggered by execution, since compiled artifact is needed for the execution

/// The time period after which the execute preparation worker is considered unresponsive and will
/// be killed.
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I wonder if this would be better named as lazy or lenient. After all we use it for the heads up signal which also requires a more permissive timeout.

@@ -418,6 +429,9 @@ async fn handle_to_host(
Ok(())
}

/// Handles PVF prechecking.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ... prechecking requests

@@ -485,9 +511,17 @@ async fn handle_execute_pvf(
}
} else {
// Artifact is unknown: register it and enqueue a job with the corresponding priority and
//
// PVF.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, finally this is fixed 🎉

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 12, 2022

Thanks for the reviews! 👍 Just stuck on a couple of CI checks:

  • Check reviews: Looks like I need a couple reviews from the ci or release-engineering teams.
  • zombienet: I'm trying to restart this one on GitLab, but getting This job could not start because it could not retrieve the needed artifacts: publish-polkadot-debug-image.

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 12, 2022

still node load can vary over time, also the validator set may change over time ... better safe than sorry.

@eskimor Sounds good, I'll address it in the followup PR.

Good job. Don't get discouraged by the number of comments. They are all nits, which (at least for my reviews) means they are not blocking the merge and are not necessary to address.

All good. I was expecting more comments than that for my first PR!

I agree about addressing the nits in a followup PR.

@ordian
Copy link
Member

ordian commented Oct 13, 2022

jFYI: continuous-integration/gitlab-zombienet-tests-parachains-disputes is expected to fail for now on master, it's marked as {"build_allow_failure":true}
should be fixed in #6142

@eskimor eskimor enabled auto-merge (squash) October 13, 2022 09:59
@eskimor eskimor requested review from pepyakin and slumber October 13, 2022 10:00
@paritytech-ci paritytech-ci requested a review from a team October 13, 2022 10:34
@eskimor eskimor merged commit 851a108 into master Oct 13, 2022
@eskimor eskimor deleted the m-cat/pvf-timeouts branch October 13, 2022 11:00
Copy link
Contributor

@slumber slumber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @pepyakin comments, other than that well done

Note: once pvf is queued for preparation with some timeout, any subsequent request would discard compilation_timeout parameter and simply enqueue response_receiver (the way it's implemented now)

This is OK because we can't receive execute request for unprepared code until it's enacted once prechecking process concludes, and the code that was already prechecked should pass this process with a greater timeout.

However, this is an external guarantee so wanted to make sure you keep it in mind.

UPD: nvm didn't make it in time 😪

@mrcnski mrcnski mentioned this pull request Oct 13, 2022
ordian added a commit that referenced this pull request Oct 21, 2022
* master: (21 commits)
  try and fix build (#6170)
  Companion for EPM duplicate submissions (#6115)
  Bump docker/setup-buildx-action from 2.0.0 to 2.1.0 (#6141)
  companion for #12212 (#6162)
  Bump substrate (#6164)
  BlockId removal: refactor: StorageProvider (#6160)
  availability-recovery: use `IfDisconnected::TryConnect` for chunks (#6081)
  Update clap to version 4 (#6128)
  Add `force_open_hrmp_channel` Call (#6155)
  Fix fuzzing builds xcm-fuzz and erasure-coding fuzzer (#6153)
  BlockId removal refactor: Backend::state_at (#6149)
  First round of implementers guide fixes (#6146)
  bump zombienet version (#6142)
  lingua.dic is not managed by CI team (#6148)
  pallet-mmr: RPC and Runtime APIs work with block numbers (#6072)
  Separate preparation timeouts for PVF prechecking and execution (#6139)
  Malus: add disputed block percentage (#6100)
  refactor grid topology to expose more info to subsystems (#6140)
  Manual Para Lock (#5451)
  Expose node subcommands in Malus CLI (#6135)
  ...
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Relax PVF compilation deadline during preparation
8 participants