`cranelift-frontend`: Support moving GCs with user stack maps #8978
Conversation
This refactors the way that user stack maps work so that they are compatible with moving GCs. We now keep every GC value in a stack slot, spill to the stack slot on the value's definition, and reload from the stack slot on every use. This can generate a bunch of redundant loads, but we let the mid-end's alias analysis clean those up. This has the added benefit of reducing the length of live ranges for GC values, and it also means we don't need to proactively spill every single live GC value at each safepoint, because we know they are already spilled. That latter bit actually helps us avoid an accidentally quadratic issue with many long, overlapping live ranges, where we would do `O(n^2)` spills for a series of safepoints that keep creating more GC refs.

We also implement two further optimizations:

1. We lazily allocate slots for live GC values, which means that if a GC value is never live across a safepoint, then we never allocate a stack slot for it and never spill it to the stack. If we didn't do this, then frames would be larger and we'd have dead stores to stack slots, which would otherwise require the mid-end to grow a dead-store-elimination pass.

2. We keep a free list of available slots that will never be used again, and we reuse slots from this list when possible. This means that a chain of non-overlapping live ranges, each of which still needs to appear in some safepoint's stack map, will all reuse the same stack slot, keeping frames from being bloated.

Finally, this commit also introduces some logging for the liveness analysis to ease future debugging.

Co-Authored-By: Trevor Elliott <telliott@fastly.com>
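The two optimizations above can be sketched in isolation. This is a minimal, hypothetical model (plain integers stand in for `ir::StackSlot` and the real size/align keys), not the actual cranelift-frontend code: slots are allocated only when a GC value is first observed live across a safepoint, and a size-keyed free list recycles slots whose live ranges have ended.

```rust
use std::collections::HashMap;

type StackSlot = u32; // stand-in for ir::StackSlot

#[derive(Default)]
struct SlotAllocator {
    next_slot: StackSlot,
    total_bytes: u32,
    // Free slots that will never be used by their old value again, keyed by size.
    free_by_size: HashMap<u32, Vec<StackSlot>>,
}

impl SlotAllocator {
    /// Called lazily, only when a GC value is seen live across a safepoint;
    /// values that never cross a safepoint never get a slot at all.
    fn alloc(&mut self, size: u32) -> StackSlot {
        if let Some(slot) = self.free_by_size.entry(size).or_default().pop() {
            return slot; // Reuse a dead slot: the frame does not grow.
        }
        let slot = self.next_slot;
        self.next_slot += 1;
        self.total_bytes += size;
        slot
    }

    /// Called once a value's live range has ended for good.
    fn free(&mut self, slot: StackSlot, size: u32) {
        self.free_by_size.entry(size).or_default().push(slot);
    }
}

fn main() {
    let mut a = SlotAllocator::default();
    // A chain of non-overlapping live ranges all shares one slot.
    let s0 = a.alloc(8);
    a.free(s0, 8);
    let s1 = a.alloc(8);
    assert_eq!(s0, s1);
    // A different size gets a distinct slot, growing the frame.
    let s2 = a.alloc(16);
    assert_ne!(s1, s2);
    assert_eq!(a.total_bytes, 24);
}
```

This is why a series of back-to-back safepoints stays linear: each one reads the already-spilled slots instead of emitting fresh spills for every live GC ref.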
I think this is correct, just some nits below -- I'm glad our idea to fall back to stackslots is working out!
I think we can land as-is if you prefer, but I do have some thoughts spawned from the handling of blockparams. My original thought when I suggested this was that (perhaps naively, but...) it would operate closer to the level of the frontend's "variables" that we use to represent Wasm locals, so that somehow we wouldn't need to handle blockparams -- since by construction, we insert blockparams only to carry the same variable from source block to dest block. I worry that inserting reloads and respills (if I've understood the above correctly) could lead to unexpected worst-case behavior still. (In the `vars_block_params_and_needs_stack_map` test example, `x` would have one stackslot, we would overwrite it in both `block1` and `block2`, and we wouldn't need loads at those blocks' tails and a re-store at `block3`.)
I think that if we have a closer correspondence between frontend variables (i.e. Wasm locals) and stackslots, we have a more understandable cost-model from the Wasm producer's point of view as well: at least in baseline compilers, locals generally become actual stack slots, so producers try to minimize local count and reuse them where possible. If we can carry over that consolidation to our IR, we have an understandable model of "one store per write to a Wasm local, one load per use of a Wasm local" which is great.
As things are now, I think we can still see higher-than-linear resulting code size from a linear number of reads/writes to refs: imagine a `br_table` that fans out to N blocks, the first (innermost) of which overwrites a reftyped Wasm local with an existing def. Then when we process `end` opcodes and generate N merge points in sequence, each merge point will have to reload and re-store the slot. Ideally we'd instead directly translate the update to the variable to a `stack_store` and the uses to `stack_load`s.
I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.
Thoughts?
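The alternate path being suggested can be sketched as follows. This is an invented, simplified model (the names `Builder`, `def_var`, `use_var`, and the `Inst` enum are illustrative, not the cranelift-frontend API): reftyped variables are marked as living in a dedicated stack slot, so every write becomes exactly one `stack_store` and every read exactly one `stack_load`, with no blockparams and no reload/re-store at merge points.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Inst {
    StackStore { slot: u32 },
    StackLoad { slot: u32 },
    SsaUse { var: u32 },
}

#[derive(Default)]
struct Builder {
    // Variables on the "value just lives on stack" path, mapped to their slot.
    on_stack: HashMap<u32, u32>,
    insts: Vec<Inst>,
}

impl Builder {
    fn def_var(&mut self, var: u32) {
        if let Some(&slot) = self.on_stack.get(&var) {
            // Alternate path: one store per write to the Wasm local.
            self.insts.push(Inst::StackStore { slot });
        } else {
            // Normal SSA path (blockparams inserted as needed) elided here.
        }
    }

    fn use_var(&mut self, var: u32) {
        if let Some(&slot) = self.on_stack.get(&var) {
            // One load per use; merge points need no extra loads/stores.
            self.insts.push(Inst::StackLoad { slot });
        } else {
            self.insts.push(Inst::SsaUse { var });
        }
    }
}

fn main() {
    let mut b = Builder::default();
    b.on_stack.insert(7, 0); // variable 7 is reftyped: lives in slot 0
    b.def_var(7);
    b.use_var(7);
    b.use_var(3); // a non-ref variable takes the normal SSA path
    assert_eq!(
        b.insts,
        vec![
            Inst::StackStore { slot: 0 },
            Inst::StackLoad { slot: 0 },
            Inst::SsaUse { var: 3 },
        ]
    );
}
```

Under this model the cost is exactly "one store per write to a Wasm local, one load per use," independent of how many merge points the control flow produces.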
// A map from size/align to free stack slots that are not being used
// anymore. This allows us to reuse stack slots across multiple values and
// helps cut down on the ultimate size of our stack frames.
let mut free_stack_slots: crate::HashMap<u32, SmallVec<[ir::StackSlot; 4]>> =
If it's not too bad to do, could we have an enum for stack-slot size rather than a bare `u32`? Something like `enum StackSlotSize { Slot4, Slot8 }` and perhaps `.bytes()` and `.align()` methods as needed; and a `TryFrom<ir::Type>` impl maybe too...

(To be clear, I'm suggesting this as a refactor wherever slot sizes appear; the hashmap key here just prompted the thought.)
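A sketch of the suggested shape, under the simplifying assumption that only 4- and 8-byte sizes exist (the thread below notes the real set would be larger) and with a plain byte count standing in for `ir::Type`:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum StackSlotSize {
    Slot4,
    Slot8,
}

impl StackSlotSize {
    fn bytes(self) -> u32 {
        match self {
            StackSlotSize::Slot4 => 4,
            StackSlotSize::Slot8 => 8,
        }
    }

    fn align(self) -> u32 {
        // Slots are naturally aligned in this sketch.
        self.bytes()
    }
}

// `TryFrom` rather than `From`: not every type needs a slot size, only the
// types that actually end up in user stack maps.
impl TryFrom<u32> for StackSlotSize {
    type Error = ();

    fn try_from(bytes: u32) -> Result<Self, ()> {
        match bytes {
            4 => Ok(StackSlotSize::Slot4),
            8 => Ok(StackSlotSize::Slot8),
            _ => Err(()),
        }
    }
}

fn main() {
    assert_eq!(StackSlotSize::try_from(8), Ok(StackSlotSize::Slot8));
    assert_eq!(StackSlotSize::Slot4.bytes(), 4);
    assert_eq!(StackSlotSize::Slot8.align(), 8);
    // A size with no slot variant (e.g. a 64-byte f64x8) fails to convert.
    assert!(StackSlotSize::try_from(64).is_err());
}
```

The typed key makes it impossible to insert a slot under a size that was never allocated, which is the error-proneness concern raised below about bare integers.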
We would actually need quite a few slot sizes, since we have types like `f64x8` in CLIF. I'm not fully convinced that we would actually benefit from such an `enum`, and even if we do want to go down that route, I'd want to do it in a separate PR.

I did waffle back and forth between a full hash map here vs. an array indexed by `log2(size)` like the old approach did all over the place. I opted to just go for the simplest, most obvious implementation and figured we can come back and optimize if it is ever a perf bottleneck, but I suspect it really isn't.

So given all that: are you okay with keeping this as-is for now at least?
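For reference, the `log2(size)`-indexed-array alternative mentioned above looks like this (a hypothetical sketch, assuming slot sizes are powers of two from 1 up to 128 bytes, enough to cover `f64x8`):

```rust
// Bucket index for a power-of-two slot size: 1 -> 0, 2 -> 1, 4 -> 2, ...
fn bucket(size_bytes: u32) -> usize {
    debug_assert!(size_bytes.is_power_of_two());
    size_bytes.trailing_zeros() as usize
}

fn main() {
    // 8 buckets cover sizes 1 through 128 bytes; no hashing involved.
    let mut free_lists: [Vec<u32>; 8] = Default::default();
    free_lists[bucket(8)].push(42); // a freed 8-byte slot
    assert_eq!(free_lists[bucket(8)].pop(), Some(42));
    assert_eq!(bucket(64), 6);
}
```

The array avoids hashing at the cost of fixing the set of sizes up front; the hash map keeps the code oblivious to which sizes actually occur.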
Sure, it doesn't need to happen as part of this PR. (The reason for `TryFrom<ir::Type>` rather than `From` was exactly to address this -- I don't think we need to define a slot size for every type, only for types that the `cranelift-frontend` user puts into stack maps -- and in general I find bare integers pretty error-prone; but happy to discuss further!)
Did this in #8991
I also thought about creating a new entity for these things, but I thought through how it would be used in [...]. For this reason, we decided it was better, at least for now, to play nice with the existing system and its existing entities, rather than overhaul everything.
Unfortunately this won't work because the stack slots and liveness and all that are not created until we are finishing the function builder, long after calls to [...].

Trevor and I talked about another solution for this specific problem, though: if we maintain a backwards map from [...]
I'm going to go ahead and enqueue this for merging, but I am still interested in your responses above and in the inline thread!
Rather than an open-coded `u32`. Addresses bytecodealliance#8978 (comment)
* Move safepoints and stack maps code to its own module

  This is purely mechanical code motion.

* Shorten log messages' prefix

  These are all inside a "safepoint" module, and `env_logger` prints the module before the log message, so the old "stack map" part of the log messages' prefix is a bit redundant now.

* Use a typed enum for stack slot sizes

  Rather than an open-coded `u32`. Addresses #8978 (comment)

* Update some comments

  These comments were referencing things that we did in the old implementation of stack maps and safepoints, and have been updated for the new approach.