
[move] Implement and enable on-chain package hooks #19613

Open

kklas wants to merge 14 commits into main from implement-on-chain-hooks

Conversation

@kklas (Contributor) commented Sep 30, 2024

Description

This PR implements and enables on-chain package hooks, which make it possible to build against on-chain dependencies specified in the manifest.
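
Concretely, the kind of manifest entry this enables looks roughly like the following (the Foo name and 0x123 id are placeholders, matching the CLI release-note example below):

    [dependencies]
    Foo = { id = "0x123" }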

The hooks are implemented as follows:

  • resolve_on_chain_dependency:
    • fetches the dependency from the chain based on the provided package ID
    • creates a bytecode package in ~/.move/<id> by generating a manifest and saving the modules
  • custom_resolve_pkg_id:
    • checks for the lockfile and tries to resolve the original-published-id for the current env
    • if the lockfile doesn't exist or doesn't have the relevant env entry, we try to resolve by looking up the published-at field in the manifest -- the package is fetched and the id is resolved as the self id of any module in that package (this is also how the on-chain packages generated with resolve_on_chain_dependency are resolved, with caching to avoid duplicate fetching)
    • if there's no published-at field either, we assume the package hasn't been published and resolve the id as its manifest name
  • custom_resolve_version:
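
In pseudocode, the custom_resolve_pkg_id fallback order described above is roughly the following (a sketch only; all types and helpers here are illustrative stand-ins, not the real move-package APIs):

    struct Pkg { name: String }
    type PkgId = String;

    // Illustrative stubs for the three sources described above.
    fn lockfile_original_published_id(_pkg: &Pkg, _env: &str) -> Option<PkgId> { None }
    fn manifest_published_at(_pkg: &Pkg) -> Option<String> { None }
    fn self_id_of_fetched_package(_published_at: &str) -> PkgId {
        todo!("fetch the package and read a module's self id")
    }

    fn resolve_pkg_id(pkg: &Pkg, env: &str) -> PkgId {
        // 1. Prefer the original-published-id recorded in the lockfile for this env.
        if let Some(id) = lockfile_original_published_id(pkg, env) {
            return id;
        }
        // 2. Otherwise fall back to published-at from the manifest: fetch the package
        //    and take the self id of any of its modules (with caching to avoid
        //    duplicate fetching).
        if let Some(addr) = manifest_published_at(pkg) {
            return self_id_of_fetched_package(&addr);
        }
        // 3. No published id at all: assume unpublished and use the manifest name.
        pkg.name.clone()
    }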

Since packages are now resolved by their (original-id, version) pair instead of manifest name, this effectively makes it possible to consolidate both source and on-chain dependencies under a single dependency graph. The reasoning and principles behind this design are outlined here https://gist.github.com/kklas/d9edad7875548fae57b0c86bc8871fea

cc @rvantonder @amnn #14178

Notes and known limitations

The async stuff in SuiPackageHooks is a bit hazardous. The gist of it is that the hook methods are required to be non-async, but we need to use the SuiClient within them, both for fetching the on-chain packages and for resolving package versions and ids based on published-at. The way I got around it is that I run the future in tokio::task::block_in_place, which will inform the existing executor (in case we're already running within tokio) about the blocking task so it hands off any other tasks to another thread. Then within it we dispatch the future on a separate runtime that we control.

This works and it will not starve other tasks, but it can lead to weird behaviour, slowness, or even deadlocks in some cases when multiple pieces of code are running concurrently within the same task (which is facilitated by macros like join! and select!). I took a quick look at the codebase and it seems to me it won't be an issue because the sui CLI doesn't use that kind of concurrency for commands, the sui-move-lsp doesn't use tokio / async, and sui-source-validation-service runs all build tasks using tokio::spawn, which will be OK. Nonetheless I think the hooks should be refactored to be async ASAP (and ideally also not be global statics)!
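
A minimal sketch of that pattern (simplified; run_future and the handle field are illustrative names here, not the actual SuiPackageHooks code):

    fn run_future<F: std::future::Future>(&self, future: F) -> F::Output {
        if tokio::runtime::Handle::try_current().is_ok() {
            // Already inside a tokio runtime: mark this thread as blocking so the
            // scheduler can move its other tasks elsewhere, then drive the future
            // on the hooks' own runtime.
            tokio::task::block_in_place(|| self.handle.block_on(future))
        } else {
            // Not inside tokio: block on the hooks' runtime directly.
            self.handle.block_on(future)
        }
    }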

Since the packages are now identified by their (original-id, version), the dependency graph section in the lockfiles is now env-specific. This means that if a dependency's lockfile was generated for the testnet env, building against that package on mainnet will lead to inconsistencies, as testnet ids will be loaded into the graph. Perhaps there should be an optional chain_id field in the lockfile so that it can be ignored when there's a mismatch?

When using on-chain dependencies it's not possible to specify which env they correspond to, so these manifests will only work for a single env.

move-analyzer (to my surprise) works with bytecode deps (based on a smaller project I tested it on, it could have issues elsewhere), but it's not possible to change the env after it starts. Ideally it would pick up changes from WalletContext or have a setting for selecting the env in the editor menu.

Test plan

Added an integration test, tested manually on large-ish projects.


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI: It's now possible to build against on-chain dependencies using an on-chain dependency entry in Move.toml (e.g. Foo = { id = "0x123" }).
  • Rust SDK:
  • REST API:


@kklas (Contributor, Author) commented Oct 10, 2024

@amnn I have rebased and fixed clippy, test-extra, and cargo-deny (advisory, fixed by rebase). I can't reproduce the TS e2e tests and split cluster check failures (all pass on my machine), nor the cargo test failure (I consistently have other failures locally, even on trunk).

@amnn (Member) commented Oct 11, 2024

Thanks @kklas, some of these tests may be flaky -- I've triggered another run, and we can try retrying them to see.

@amnn (Member) left a comment

Got a chance to take a high-level look at the PR @kklas, thanks for this.

The main thing I'm thinking/worrying about is this whole async story (I'm sure it's the same for you). I have a mental model for how I think things "should" work, and I'm curious to understand from you where it falls apart:

I was hoping that for the most part, the fact that a package was being resolved using the package hooks would be hidden from the rest of the system, because we would fetch the packages during dependency resolution, and then from that point on, we refer to them by file location. Based on that, outside of dependency resolution, we don't have to worry about this async API complication.

Maybe this is true and the issue is that we need to do dependency resolution in more places. If so, there are some tricks we can use to make the async API a bit more palatable. For example, we can implement the hook using channels:

  • When you register the hook, you spin up a separate task that handles requests from the hook.
  • It talks over channels, so it waits for a request on one channel, does the work (fetching and writing packages), and then returns on another channel.
  • The hook implementations themselves can be non-async, communicating over the channels using their blocking APIs.

Code organisation-wise, we will want to move all this hook stuff out into its own crate, because the sui crate is quite large, and we don't want to introduce it as a dependency to a bunch of places if we don't have to. The package hook's dependencies seem quite modest, so it would be fine to include just that.

@kklas (Contributor, Author) commented Oct 14, 2024

Based on that, outside of dependency resolution, we don't have to worry about this async API complication.

@amnn my understanding is this: I agree we shouldn't have to worry about it, and we don't if the other code that depends on the hooks is not async / concurrent. But if it is, then it's an issue, because what you're trying to do is run blocking code (since all package hook functions are non-async) that then encapsulates some async code that's IO-bound and basically doing http requests.
The canonical way to do that (run async code from a non-async function) is to run self.handle.block_on (where the handle is another tokio runtime we created specifically for package hooks), but the problem is that tokio doesn't allow running runtimes within runtimes (in case the main program is already using tokio), so you will get this error:

thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.
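
For concreteness, the naive bridging that triggers this panic would look roughly like the following (the method, field, and type names are placeholders, not the real code):

    // Naive approach: handle is a tokio Runtime created specifically for the
    // package hooks. This panics with the error above whenever the caller is
    // itself already running inside a tokio runtime.
    fn resolve_blocking(&self, req: Request) -> Response {
        self.handle.block_on(self.resolve_async(req))
    }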

So this is the conundrum and I don't think using channels does anything fundamentally to alleviate that but would rather create more issues -- e.g. the "build dependency graph" call is sync, which will then call into a package hook which will send something to a channel and block waiting for the response. This can cause a deadlock because the code calling into the hook might be running on the same thread as the task handling the hook requests (and if you want to use tokio channels instead of std then we're back at square one because you can't await in the hook). Seems like an anti-pattern to use channels for this, and more complex in general.

After thinking more about this, I can come up with another solution that works:

    // Run the future on the hooks' dedicated runtime (self.handle) from a fresh
    // OS thread, so we never call block_on from inside an existing tokio runtime.
    pub fn run_async<F>(&self, future: F) -> F::Output
    where
        F: std::future::Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let h = self.handle.clone();
        std::thread::spawn(move || h.block_on(future))
            .join()
            .expect("Thread panicked")
    }

So instead of trying to use the existing runtime with block_in_place, which will spawn the task in a separate "blocking" thread (and which has the caveats I outlined in the OP and here), we spawn an isolated thread and do the async stuff in there with a separate runtime. So the advantages of this are:

  • doesn't rely on Handle::try_current() to detect whether we're running within tokio, which is an anti-pattern
  • doesn't use block_in_place which can cause unexpected behaviour, especially in join! or select! scenarios
  • spawning a separate thread completely isolates the other runtime

The downside, though, is the cost of spawning new threads, but I guess we can consider this similar to the overhead of running an external resolver command, which is a separate process.

Ultimately the issue IMO is the design of the dependency graph. If we want to be able to resolve packages asynchronously, which we basically do because all the sui client APIs we need are async, the dependency graph needs to support that natively. So with all of this separate thread creation etc. we're just dancing around this issue. I'm not sure what's behind the design decision to not have async in the dependency graph, but maybe it's time to revisit that. Another issue is that PackageHooks are global once registered, which also creates issues in concurrent scenarios if you want to run them with different settings concurrently.

Code organisation-wise, we will want to move all this hook stuff out into its own crate

Makes sense! Will do that!

@amnn (Member) commented Oct 16, 2024

So this is the conundrum and I don't think using channels does anything fundamentally to alleviate that but would rather create more issues -- e.g. the "build dependency graph" call is sync, which will then call into a package hook which will send something to a channel and block waiting for the response.

When I originally mentioned using channels, what I had in mind was the following, which I think gets around your concern about the task being potentially scheduled on the same thread as the caller, and your other concern about creating lots of short-lived threads:

  • When you register the package hook, it also spawns its own dedicated tokio runtime, and a task in that runtime to handle package hook requests.
  • Inside that runtime, it uses non-blocking tokio channels to receive requests.
  • Outside the runtime (i.e. in the actual package hooks), blocking calls are made on the other sides of those channels, to make whatever sync code relies on the package hook wait for the hook to do its work.

You can simulate a function call with channels using a pattern like this:

    async fn foo(x: Param) -> Ret {
        /* ... */
        bar
    }

Becomes (roughly -- I haven't run this):

    use once_cell::sync::OnceCell;
    use tokio::runtime::{Builder, Runtime};
    use tokio::sync::{mpsc, oneshot};

    static ACTOR: OnceCell<Actor> = OnceCell::new();

    struct Actor {
        call: mpsc::Sender<(Param, oneshot::Sender<Ret>)>,
        rt: Runtime,
    }

    fn foo(x: Param) -> Result<Ret> {
        let call = ACTOR.get().expect("Hook not registered").call.clone();
        let (tx, rx) = oneshot::channel();
        call.blocking_send((x, tx))?;
        Ok(rx.blocking_recv()?)
    }

    fn register_hook() {
        let rt = Builder::new_multi_thread()
            .enable_all()
            .build()
            .expect("Couldn't set-up runtime for actor");

        // Support some moderate level of concurrency (buffer for four
        // requests).
        let (tx, mut rx) = mpsc::channel(4);

        // The actor task owns the async side; it must be spawned before `rt` is
        // moved into ACTOR below.
        rt.spawn(async move {
            loop {
                let (x, tx) = rx.recv().await.expect("tx can't close");
                /* ... */

                // Ignore the error from this -- it means the rx side was dropped,
                // but that is not this actor's concern.
                let _ = tx.send(bar);
            }
        });

        // NB. This will complain if you register the hook twice, so test
        // scenarios may be complicated.
        ACTOR
            .set(Actor { call: tx, rt })
            .unwrap_or_else(|_| panic!("Failed to set-up actor"));
    }

Tokio has some resources on this problem: https://tokio.rs/tokio/topics/bridging -- this approach is the one suggested here:

Sending messages

The third technique is to spawn a runtime and use message passing to communicate with it. This involves a bit more boilerplate than the other two approaches, but it is the most flexible approach.
[...]
This example could be configured in many ways. For instance, you could use a Semaphore to limit the number of active tasks, or you could use a channel in the opposite direction to send a response to the spawner. When you spawn a runtime in this way, it is a type of actor.

But to your point here:

Ultimately the issue IMO is the design of the dependency graph. If we want to be able to resolve packages asynchronously, which we basically do because all the sui client APIs we need are async, the dependency graph needs to support that natively. So with all of this separate thread creation etc. we're just dancing around this issue. I'm not sure what's behind the design decision to not have async in the dependency graph, but maybe it's time to revisit that. Another issue is that PackageHooks are global once registered, which also creates issues in concurrent scenarios if you want to run them with different settings concurrently.

Yes, we do need to redesign dependency resolution with all its new constraints in mind. Package Hooks predate Sui, which uses quite a different philosophy to packages compared to what came before it. Once MVR and this work have landed, I'm hoping we can pick up that redesign work internally.

@kklas force-pushed the implement-on-chain-hooks branch from f75a20f to a4036b4 on November 1, 2024, 21:27
@kklas (Contributor, Author) commented Nov 1, 2024

@amnn and I continued the async discussion over another channel and agreed that the solution I implemented here is a reasonable one considering the trade-offs.

github-actions bot commented Jan 1, 2025

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale label on Jan 1, 2025