Suboptimal codegen for ARM32 targets when performing offset load #125386
Description
`rustc` generates suboptimal code on 32-bit ARM targets when performing a load from a base + offset pointer. This seems to be a general issue; it has shown up in a number of programs I've written, including trivial examples.
Since this pattern - loading from a non-constant address offset by an index - is very common, particularly in inner loops, I'd be surprised if it didn't have a non-trivial impact on the performance of real code.
Note that LLVM doesn't seem to exhibit this poor behaviour on AArch64 (64-bit ARM) targets.
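To illustrate where this bites, here is a sketch (purely illustrative, not taken from any real program) of the kind of inner loop where the pattern occurs; each iteration performs a base + offset load, so an extra `add` per load lands directly in the hot path:

```rust
/// Sum `len` u16 values starting at `src`.
///
/// Safety: `src` must be valid for reads of `len * 2` bytes.
unsafe fn sum_halfwords(src: *const u16, len: usize) -> u32 {
    let mut total = 0u32;
    let mut off = 0usize;
    while off < len * 2 {
        // Base + byte-offset load: ideally a single `ldrh rD, [rB, rI]`,
        // but rustc currently emits an `add` followed by `ldrh rD, [rD]`.
        total += src.byte_add(off).read() as u32;
        off += 2;
    }
    total
}
```

The minimal reproduction: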
```rust
unsafe fn read(src: *const u16, n: usize) -> u16 {
    src.byte_add(n).read()
}
```
produces
```asm
read:
    add r0, r0, r1
    ldrh r0, [r0]
    bx lr
```
I'd expect it to produce
```asm
read:
    ldrh r0, [r0, r1]
    bx lr
```
as GCC does. I believe that on many targets (at the very least, ARMv4) the single `ldrh` with a register offset is always faster than the `add` + `ldrh` sequence.
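As a workaround sketch (my addition, assuming a toolchain with stable `asm!` support on ARM), the register-offset addressing mode can be forced with inline assembly; this is only meant to demonstrate that the desired instruction is directly expressible:

```rust
use core::arch::asm;

/// Load a u16 from `src + n` bytes using the register-offset
/// addressing mode (`ldrh rD, [rB, rI]`) directly.
///
/// Safety: `src.byte_add(n)` must be valid for a 2-byte read.
#[cfg(target_arch = "arm")]
unsafe fn read_fused(src: *const u16, n: usize) -> u16 {
    let out: u32;
    asm!(
        "ldrh {out}, [{src}, {n}]",
        src = in(reg) src,
        n = in(reg) n,
        out = lateout(reg) out,
        options(pure, readonly, nostack)
    );
    out as u16
}
```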
Note that this is an issue in LLVM itself: Clang exhibits the same poor code generation.
- Rust (rustc, bad): https://godbolt.org/z/7oPe8crM7
- C (Clang, bad): https://godbolt.org/z/4M9E7Kh91
- C (GCC, good): https://godbolt.org/z/639cxxKc8
I haven't been able to test Rust's new GCC backend, since I couldn't work out how to make it generate code for ARM32 targets.
`rustc --version --verbose`:

```text
rustc 1.79.0-nightly (8b2459c1f 2024-04-09)
```