
[move] Implement and enable on-chain package hooks #19613

Open

kklas wants to merge 14 commits into main from implement-on-chain-hooks

Conversation

@kklas (Contributor) commented Sep 30, 2024

Description

This PR implements and enables on-chain package hooks, which make it possible to build against on-chain dependencies specified in the manifest.
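
Concretely, the kind of manifest entry this enables looks roughly like the following (the Foo name and 0x123 id are placeholders, matching the CLI release-note example below):

    [dependencies]
    Foo = { id = "0x123" }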

The hooks are implemented as follows:

  • resolve_on_chain_dependency:
    • fetches the dependency from the chain based on the provided package ID
    • creates a bytecode package in ~/.move/<id> by generating a manifest and saving the modules
  • custom_resolve_pkg_id:
    • checks for the lockfile and tries to resolve the original-published-id for the current env
    • if the lockfile doesn't exist or doesn't have the relevant env entry, we try to resolve by looking up the published-at field in the manifest -- the package is fetched and the id is resolved as the self id of any module in that package (this is also how the on-chain packages generated with resolve_on_chain_dependency are resolved, with caching to avoid duplicate fetching)
    • if there's no published-at field either, we assume the package hasn't been published and resolve the id as its manifest name
  • custom_resolve_version:
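
In pseudocode, the custom_resolve_pkg_id fallback order described above is roughly the following (a sketch only; all types and helpers here are illustrative stand-ins, not the real move-package APIs):

    struct Pkg { name: String }
    type PkgId = String;

    // Illustrative stubs for the three sources described above.
    fn lockfile_original_published_id(_pkg: &Pkg, _env: &str) -> Option<PkgId> { None }
    fn manifest_published_at(_pkg: &Pkg) -> Option<String> { None }
    fn self_id_of_fetched_package(_published_at: &str) -> PkgId {
        todo!("fetch the package and read a module's self id")
    }

    fn resolve_pkg_id(pkg: &Pkg, env: &str) -> PkgId {
        // 1. Prefer the original-published-id recorded in the lockfile for this env.
        if let Some(id) = lockfile_original_published_id(pkg, env) {
            return id;
        }
        // 2. Otherwise fall back to published-at from the manifest: fetch the package
        //    and take the self id of any of its modules (with caching to avoid
        //    duplicate fetching).
        if let Some(addr) = manifest_published_at(pkg) {
            return self_id_of_fetched_package(&addr);
        }
        // 3. No published id at all: assume unpublished and use the manifest name.
        pkg.name.clone()
    }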

Since packages are now resolved by their (original-id, version) pair instead of manifest name, this effectively makes it possible to consolidate both source and on-chain dependencies under a single dependency graph. The reasoning and principles behind this design are outlined here https://gist.github.com/kklas/d9edad7875548fae57b0c86bc8871fea

cc @rvantonder @amnn #14178

Notes and known limitations

The async stuff in SuiPackageHooks is a bit hazardous. The gist of it is that the hook methods are required to be non-async, but we need to use the SuiClient within them, both for fetching the on-chain packages and for resolving package versions and ids based on published-at. The way I got around it is that I run the future in tokio::task::block_in_place, which will inform the existing executor (in case we're already running within tokio) about the blocking task so it hands off any other tasks to another thread. Then within it we dispatch the future on a separate runtime that we control.

This works and it will not starve other tasks, but it can lead to weird behaviour, slowness, or even deadlocks in some cases when multiple pieces of code are running concurrently within the same task (which is facilitated by macros like join! and select!). I took a quick look at the codebase and it seems to me it won't be an issue because the sui CLI doesn't use that kind of concurrency for commands, the sui-move-lsp doesn't use tokio / async, and sui-source-validation-service runs all build tasks using tokio::spawn, which will be OK. Nonetheless I think the hooks should be refactored to be async ASAP (and ideally also not be global statics)!
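
A minimal sketch of that pattern (simplified; run_future and the handle field are illustrative names here, not the actual SuiPackageHooks code):

    fn run_future<F: std::future::Future>(&self, future: F) -> F::Output {
        if tokio::runtime::Handle::try_current().is_ok() {
            // Already inside a tokio runtime: mark this thread as blocking so the
            // scheduler can move its other tasks elsewhere, then drive the future
            // on the hooks' own runtime.
            tokio::task::block_in_place(|| self.handle.block_on(future))
        } else {
            // Not inside tokio: block on the hooks' runtime directly.
            self.handle.block_on(future)
        }
    }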

Since the packages are now identified by their (original-id, version), the dependency graph section in the lockfiles is now env-specific. This means that if a dependency's lockfile was generated for the testnet env, building against that package on mainnet will lead to inconsistencies, as testnet ids will be loaded into the graph. Perhaps there should be an optional chain_id field in the lockfile so that it can be ignored when there's a mismatch?

When using on-chain dependencies it's not possible to specify which env they correspond to, so these manifests will only work for a single env.

move-analyzer (to my surprise) works with bytecode deps (based on a smaller project I tested it on, it could have issues elsewhere), but it's not possible to change the env after it starts. Ideally it would pick up changes from WalletContext or have a setting for selecting the env in the editor menu.

Test plan

Added an integration test, tested manually on large-ish projects.


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI: It's now possible to build against on-chain dependencies using an on-chain dependency entry in Move.toml (e.g. Foo = { id = "0x123" }).
  • Rust SDK:
  • REST API:


@kklas (Contributor, Author) commented Oct 10, 2024

@amnn I have rebased and fixed clippy, test-extra, and cargo-deny (advisory, fixed by rebase). I can't reproduce the TS e2e tests and split cluster check failures (all pass on my machine), nor the cargo test failure (I consistently have other failures locally, even on trunk).

@amnn (Member) commented Oct 11, 2024

Thanks @kklas, some of these tests may be flaky -- I've triggered another run, and we can try retrying them to see.

@amnn (Member) left a comment

Got a chance to take a high-level look at the PR @kklas, thanks for this.

The main thing I'm thinking/worrying about is this whole async story (I'm sure it's the same for you). I have a mental model for how I think things "should" work, and I'm curious to understand from you where it falls apart:

I was hoping that for the most part, the fact that a package was being resolved using the package hooks would be hidden from the rest of the system, because we would fetch the packages during dependency resolution, and then from that point on, we refer to them by file location. Based on that, outside of dependency resolution, we don't have to worry about this async API complication.

Maybe this is true and the issue is that we need to do dependency resolution in more places. If so, there are some tricks we can use to make the async API a bit more palatable. For example, we can implement the hook using channels:

  • When you register the hook, you spin up a separate task that handles requests from the hook.
  • It talks over channels, so it waits for a request on one channel, does the work (fetching and writing packages), and then returns on another channel.
  • The hook implementations themselves can be non-async, communicating over the channels using their blocking APIs.

Code organisation-wise, we will want to move all this hook stuff out into its own crate, because the sui crate is quite large, and we don't want to introduce it as a dependency to a bunch of places if we don't have to. The package hook's dependencies seem quite modest, so it would be fine to include just that.

@kklas (Contributor, Author) commented Oct 14, 2024

Based on that, outside of dependency resolution, we don't have to worry about this async API complication.

@amnn my understanding is this: I agree we shouldn't have to worry about it, and we don't if the other code that depends on the hooks is not async / concurrent. But if it is, then it's an issue, because what you're trying to do is run blocking code (since all package hook functions are non-async) that then encapsulates some async code that's IO-bound and basically doing http requests.
The canonical way to do that (run async code from a non-async function) is to run self.handle.block_on (where the handle is another tokio runtime we created specifically for package hooks), but the problem is that tokio doesn't allow running runtimes within runtimes (in case the main program is already using tokio), so you will get this error:

thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.
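
For concreteness, the naive bridging that triggers this panic would look roughly like the following (the method, field, and type names are placeholders, not the real code):

    // Naive approach: handle is a tokio Runtime created specifically for the
    // package hooks. This panics with the error above whenever the caller is
    // itself already running inside a tokio runtime.
    fn resolve_blocking(&self, req: Request) -> Response {
        self.handle.block_on(self.resolve_async(req))
    }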

So this is the conundrum and I don't think using channels does anything fundamentally to alleviate that but would rather create more issues -- e.g. the "build dependency graph" call is sync, which will then call into a package hook which will send something to a channel and block waiting for the response. This can cause a deadlock because the code calling into the hook might be running on the same thread as the task handling the hook requests (and if you want to use tokio channels instead of std then we're back at square one because you can't await in the hook). Seems like an anti-pattern to use channels for this, and more complex in general.

After thinking more about this, I can come up with another solution that works:

    // Run the future on the hooks' dedicated runtime (self.handle) from a fresh
    // OS thread, so we never call block_on from inside an existing tokio runtime.
    pub fn run_async<F>(&self, future: F) -> F::Output
    where
        F: std::future::Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let h = self.handle.clone();
        std::thread::spawn(move || h.block_on(future))
            .join()
            .expect("Thread panicked")
    }

So instead of trying to use the existing runtime with block_in_place, which will spawn the task in a separate "blocking" thread (and which has the caveats I outlined in the OP and here), we spawn an isolated thread and do the async stuff in there with a separate runtime. So the advantages of this are:

  • doesn't rely on Handle::try_current() to detect whether we're running within tokio, which is an anti-pattern
  • doesn't use block_in_place which can cause unexpected behaviour, especially in join! or select! scenarios
  • spawning a separate thread completely isolates the other runtime

The downside, though, is the cost of spawning new threads, but I guess we can consider this similar to the overhead of running an external resolver command, which is a separate process.

Ultimately the issue IMO is the design of the dependency graph. If we want to be able to resolve packages asynchronously, which we basically do because all the sui client APIs we need are async, the dependency graph needs to support that natively. So with all of this separate thread creation etc. we're just dancing around this issue. I'm not sure what's behind the design decision to not have async in the dependency graph, but maybe it's time to revisit that. Another issue is that PackageHooks are global once registered, which also creates issues in concurrent scenarios if you want to run them with different settings concurrently.

Code organisation-wise, we will want to move all this hook stuff out into its own crate

Makes sense! Will do that!

@amnn (Member) commented Oct 16, 2024

So this is the conundrum and I don't think using channels does anything fundamentally to alleviate that but would rather create more issues -- e.g. the "build dependency graph" call is sync, which will then call into a package hook which will send something to a channel and block waiting for the response.

When I originally mentioned using channels, what I had in mind was the following, which I think gets around your concern about the task being potentially scheduled on the same thread as the caller, and your other concern about creating lots of short-lived threads:

  • When you register the package hook, it also spawns its own dedicated tokio runtime, and a task in that runtime to handle package hook requests.
  • Inside that runtime, it uses non-blocking tokio channels to receive requests.
  • Outside the runtime (i.e. in the actual package hooks), blocking calls are made on the other sides of those channels, to make whatever sync code relies on the package hook wait for the hook to do its work.

You can simulate a function call with channels using a pattern like this:

    async fn foo(x: Param) -> Ret {
        /* ... */
        bar
    }

Becomes (roughly -- I haven't run this):

    use once_cell::sync::OnceCell;
    use tokio::runtime::{Builder, Runtime};
    use tokio::sync::{mpsc, oneshot};

    static ACTOR: OnceCell<Actor> = OnceCell::new();

    struct Actor {
        call: mpsc::Sender<(Param, oneshot::Sender<Ret>)>,
        rt: Runtime,
    }

    fn foo(x: Param) -> Result<Ret> {
        let call = ACTOR.get().expect("Hook not registered").call.clone();
        let (tx, rx) = oneshot::channel();
        call.blocking_send((x, tx))?;
        Ok(rx.blocking_recv()?)
    }

    fn register_hook() {
        let rt = Builder::new_multi_thread()
            .enable_all()
            .build()
            .expect("Couldn't set-up runtime for actor");

        // Support some moderate level of concurrency (buffer for four
        // requests).
        let (tx, mut rx) = mpsc::channel(4);

        // The actor task owns the async side; it must be spawned before `rt` is
        // moved into ACTOR below.
        rt.spawn(async move {
            loop {
                let (x, tx) = rx.recv().await.expect("tx can't close");
                /* ... */

                // Ignore the error from this -- it means the rx side was dropped,
                // but that is not this actor's concern.
                let _ = tx.send(bar);
            }
        });

        // NB. This will complain if you register the hook twice, so test
        // scenarios may be complicated.
        ACTOR
            .set(Actor { call: tx, rt })
            .unwrap_or_else(|_| panic!("Failed to set-up actor"));
    }

Tokio has some resources on this problem: https://tokio.rs/tokio/topics/bridging -- this approach is the one suggested here:

Sending messages

The third technique is to spawn a runtime and use message passing to communicate with it. This involves a bit more boilerplate than the other two approaches, but it is the most flexible approach.
[...]
This example could be configured in many ways. For instance, you could use a Semaphore to limit the number of active tasks, or you could use a channel in the opposite direction to send a response to the spawner. When you spawn a runtime in this way, it is a type of actor.

But to your point here:

Ultimately the issue IMO is the design of the dependency graph. If we want to be able to resolve packages asynchronously, which we basically do because all the sui client APIs we need are async, the dependency graph needs to support that natively. So with all of this separate thread creation etc. we're just dancing around this issue. I'm not sure what's behind the design decision to not have async in the dependency graph, but maybe it's time to revisit that. Another issue is that PackageHooks are global once registered, which also creates issues in concurrent scenarios if you want to run them with different settings concurrently.

Yes, we do need to redesign dependency resolution with all its new constraints in mind. Package Hooks predate Sui, which uses quite a different philosophy to packages compared to what came before it. Once MVR and this work have landed, I'm hoping we can pick up that redesign work internally.

@kklas force-pushed the implement-on-chain-hooks branch from f75a20f to a4036b4 on November 1, 2024, 21:27
@kklas (Contributor, Author) commented Nov 1, 2024

@amnn and I continued the async discussion over another channel and agreed that the solution I implemented here is a reasonable one considering the trade-offs.

github-actions bot commented Jan 1, 2025

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale label on Jan 1, 2025