`cranelift-frontend`: Support moving GCs with user stack maps #8978
Conversation
This refactors the way that user stack maps work so that they are compatible with moving GCs. We now keep every GC value in a stack slot, spill to the stack slot on the value's definition, and reload from the stack slot on every use. This can generate a bunch of redundant loads, but we let the mid-end's alias analysis clean those up. This has the added benefit of reducing the length of live ranges for GC values, and it also means we don't need to proactively spill every single live GC value at each safepoint, because we know they are already spilled. That latter bit actually helps us avoid an accidentally quadratic issue with many long, overlapping live ranges, where we would do `O(n^2)` spills for a series of safepoints that keep creating more GC refs.

We also implement two further optimizations:

1. We lazily allocate slots for live GC values, which means that if a GC value is never live across a safepoint, then we never allocate a stack slot for it and never spill it to the stack. If we didn't do this, then frames would be larger and we'd have dead stores to stack slots, which would otherwise require the mid-end to grow a dead-store-elimination pass.

2. We keep a free list of available slots that will never be used again, and we reuse slots from this list when possible. This means that a chain of non-overlapping live ranges, each of which still needs to appear in some safepoint's stack map, will all reuse the same stack slot, keeping frames from being bloated.

Finally, this commit also introduces some logging for the liveness analysis to ease future debugging.

Co-Authored-By: Trevor Elliott <telliott@fastly.com>
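The two optimizations above can be sketched in isolation. This is a minimal, hypothetical model (plain integers stand in for `ir::StackSlot` and the real size/align keys), not the actual cranelift-frontend code: slots are allocated only when a GC value is first observed live across a safepoint, and a size-keyed free list recycles slots whose live ranges have ended.

```rust
use std::collections::HashMap;

type StackSlot = u32; // stand-in for ir::StackSlot

#[derive(Default)]
struct SlotAllocator {
    next_slot: StackSlot,
    total_bytes: u32,
    // Free slots that will never be used by their old value again, keyed by size.
    free_by_size: HashMap<u32, Vec<StackSlot>>,
}

impl SlotAllocator {
    /// Called lazily, only when a GC value is seen live across a safepoint;
    /// values that never cross a safepoint never get a slot at all.
    fn alloc(&mut self, size: u32) -> StackSlot {
        if let Some(slot) = self.free_by_size.entry(size).or_default().pop() {
            return slot; // Reuse a dead slot: the frame does not grow.
        }
        let slot = self.next_slot;
        self.next_slot += 1;
        self.total_bytes += size;
        slot
    }

    /// Called once a value's live range has ended for good.
    fn free(&mut self, slot: StackSlot, size: u32) {
        self.free_by_size.entry(size).or_default().push(slot);
    }
}

fn main() {
    let mut a = SlotAllocator::default();
    // A chain of non-overlapping live ranges all shares one slot.
    let s0 = a.alloc(8);
    a.free(s0, 8);
    let s1 = a.alloc(8);
    assert_eq!(s0, s1);
    // A different size gets a distinct slot, growing the frame.
    let s2 = a.alloc(16);
    assert_ne!(s1, s2);
    assert_eq!(a.total_bytes, 24);
}
```

This is why a series of back-to-back safepoints stays linear: each one reads the already-spilled slots instead of emitting fresh spills for every live GC ref.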
I think this is correct, just some nits below -- I'm glad our idea to fall back to stackslots is working out!
I think we can land as-is if you prefer, but I do have some thoughts spawned from the handling of blockparams. My original thought when I suggested this was that (perhaps naively, but...) it would operate closer to the level of the frontend's "variables" that we use to represent Wasm locals, so that somehow we wouldn't need to handle blockparams -- since by construction, we insert blockparams only to carry the same variable from source block to dest block. I worry that inserting reloads and respills (if I've understood the above correctly) could lead to unexpected worst-case behavior still. (In the `vars_block_params_and_needs_stack_map` test example, `x` would have one stackslot, we would overwrite it in both `block1` and `block2`, and we wouldn't need loads at those blocks' tails and a re-store at `block3`.)
I think that if we have a closer correspondence between frontend variables (i.e. Wasm locals) and stackslots, we have a more understandable cost-model from the Wasm producer's point of view as well: at least in baseline compilers, locals generally become actual stack slots, so producers try to minimize local count and reuse them where possible. If we can carry over that consolidation to our IR, we have an understandable model of "one store per write to a Wasm local, one load per use of a Wasm local" which is great.
As things are now, I think we can still see higher-than-linear resulting code size from a linear number of reads/writes to refs: imagine a `br_table` that fans out to N blocks, the first (innermost) of which overwrites a reftyped Wasm local with an existing def. Then when we process `end` opcodes and generate N merge points in sequence, each merge point will have to reload and re-store the slot. Ideally we'd instead directly translate the update to the variable to a `stack_store` and the uses to `stack_load`s.
I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.
Thoughts?
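The alternate path being suggested can be sketched as follows. This is an invented, simplified model (the names `Builder`, `def_var`, `use_var`, and the `Inst` enum are illustrative, not the cranelift-frontend API): reftyped variables are marked as living in a dedicated stack slot, so every write becomes exactly one `stack_store` and every read exactly one `stack_load`, with no blockparams and no reload/re-store at merge points.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Inst {
    StackStore { slot: u32 },
    StackLoad { slot: u32 },
    SsaUse { var: u32 },
}

#[derive(Default)]
struct Builder {
    // Variables on the "value just lives on stack" path, mapped to their slot.
    on_stack: HashMap<u32, u32>,
    insts: Vec<Inst>,
}

impl Builder {
    fn def_var(&mut self, var: u32) {
        if let Some(&slot) = self.on_stack.get(&var) {
            // Alternate path: one store per write to the Wasm local.
            self.insts.push(Inst::StackStore { slot });
        } else {
            // Normal SSA path (blockparams inserted as needed) elided here.
        }
    }

    fn use_var(&mut self, var: u32) {
        if let Some(&slot) = self.on_stack.get(&var) {
            // One load per use; merge points need no extra loads/stores.
            self.insts.push(Inst::StackLoad { slot });
        } else {
            self.insts.push(Inst::SsaUse { var });
        }
    }
}

fn main() {
    let mut b = Builder::default();
    b.on_stack.insert(7, 0); // variable 7 is reftyped: lives in slot 0
    b.def_var(7);
    b.use_var(7);
    b.use_var(3); // a non-ref variable takes the normal SSA path
    assert_eq!(
        b.insts,
        vec![
            Inst::StackStore { slot: 0 },
            Inst::StackLoad { slot: 0 },
            Inst::SsaUse { var: 3 },
        ]
    );
}
```

Under this model the cost is exactly "one store per write to a Wasm local, one load per use," independent of how many merge points the control flow produces.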
// A map from size/align to free stack slots that are not being used
// anymore. This allows us to reuse stack slots across multiple values and
// helps cut down on the ultimate size of our stack frames.
let mut free_stack_slots: crate::HashMap<u32, SmallVec<[ir::StackSlot; 4]>> =
If it's not too bad to do, could we have an enum for stack-slot size rather than a bare `u32`? Something like `enum StackSlotSize { Slot4, Slot8 }` and perhaps `.bytes()` and `.align()` methods as needed; and a `TryFrom<ir::Type>` impl maybe too...

(To be clear, I'm suggesting this as a refactor wherever slot sizes appear; the hashmap key here just prompted the thought.)
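A sketch of the suggested shape, under the simplifying assumption that only 4- and 8-byte sizes exist (the thread below notes the real set would be larger) and with a plain byte count standing in for `ir::Type`:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum StackSlotSize {
    Slot4,
    Slot8,
}

impl StackSlotSize {
    fn bytes(self) -> u32 {
        match self {
            StackSlotSize::Slot4 => 4,
            StackSlotSize::Slot8 => 8,
        }
    }

    fn align(self) -> u32 {
        // Slots are naturally aligned in this sketch.
        self.bytes()
    }
}

// `TryFrom` rather than `From`: not every type needs a slot size, only the
// types that actually end up in user stack maps.
impl TryFrom<u32> for StackSlotSize {
    type Error = ();

    fn try_from(bytes: u32) -> Result<Self, ()> {
        match bytes {
            4 => Ok(StackSlotSize::Slot4),
            8 => Ok(StackSlotSize::Slot8),
            _ => Err(()),
        }
    }
}

fn main() {
    assert_eq!(StackSlotSize::try_from(8), Ok(StackSlotSize::Slot8));
    assert_eq!(StackSlotSize::Slot4.bytes(), 4);
    assert_eq!(StackSlotSize::Slot8.align(), 8);
    // A size with no slot variant (e.g. a 64-byte f64x8) fails to convert.
    assert!(StackSlotSize::try_from(64).is_err());
}
```

The typed key makes it impossible to insert a slot under a size that was never allocated, which is the error-proneness concern raised below about bare integers.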
We would actually need quite a few slot sizes, since we have types like `f64x8` in CLIF. I'm not fully convinced that we would actually benefit from such an `enum`, and even if we do want to go down that route, I'd want to do it in a separate PR.

I did waffle back and forth between a full hash map here vs. an array indexed by `log2(size)` like the old approach did all over the place. I opted to just go for the simplest, most obvious implementation and figured we can come back and optimize if it is ever a perf bottleneck, but I suspect it really isn't.

So given all that: are you okay with keeping this as-is for now at least?
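For reference, the `log2(size)`-indexed-array alternative mentioned above looks like this (a hypothetical sketch, assuming slot sizes are powers of two from 1 up to 128 bytes, enough to cover `f64x8`):

```rust
// Bucket index for a power-of-two slot size: 1 -> 0, 2 -> 1, 4 -> 2, ...
fn bucket(size_bytes: u32) -> usize {
    debug_assert!(size_bytes.is_power_of_two());
    size_bytes.trailing_zeros() as usize
}

fn main() {
    // 8 buckets cover sizes 1 through 128 bytes; no hashing involved.
    let mut free_lists: [Vec<u32>; 8] = Default::default();
    free_lists[bucket(8)].push(42); // a freed 8-byte slot
    assert_eq!(free_lists[bucket(8)].pop(), Some(42));
    assert_eq!(bucket(64), 6);
}
```

The array avoids hashing at the cost of fixing the set of sizes up front; the hash map keeps the code oblivious to which sizes actually occur.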
Sure, it doesn't need to happen as part of this PR. (The reason for `TryFrom<ir::Type>` rather than `From` was exactly to address this -- I don't think we need to define a slot size for every type, only for types that the `cranelift-frontend` user puts into stack maps -- and in general I find bare integers pretty error-prone; but happy to discuss further!)
Did this in #8991
I also thought about creating a new entity for these things, but I thought through how it would be used in [...]. For this reason, we decided it was better, at least for now, to play nice with the existing system and its existing entities, rather than overhaul everything.
Unfortunately this won't work because the stack slots and liveness and all that are not created until we are finishing the function builder, long after calls to [...].

Trevor and I talked about another solution for this specific problem, though: if we maintain a backwards map from [...]
I'm going to go ahead and enqueue this for merging, but I am still interested in your responses above and in the inline thread!
Rather than an open-coded `u32`. Addresses bytecodealliance#8978 (comment)
* Move safepoints and stack maps code to its own module

  This is purely mechanical code motion.

* Shorten log messages' prefix

  These are all inside a "safepoint" module, and `env_logger` prints the module before the log message, so the old "stack map" part of the log messages' prefix is a bit redundant now.

* Use a typed enum for stack slot sizes

  Rather than an open-coded `u32`. Addresses #8978 (comment)

* Update some comments

  These comments were referencing things that we did in the old implementation of stack maps and safepoints, and have been updated for the new approach.