Suboptimal codegen for ARM32 targets when performing offset load #125386
Description
`rustc` generates suboptimal code on 32-bit ARM targets when performing a load from a base + offset pointer. This seems to be a general issue; it has shown up in a number of programs I've written, including trivial examples.
Since this pattern - loading from a non-constant address offset by an index - is very common, particularly in inner loops, I'd be surprised if it didn't have a non-trivial impact on the performance of real code.
Note that LLVM doesn't seem to exhibit this poor behaviour on AArch64 (64-bit ARM) targets.
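To illustrate where this bites, here is a sketch (purely illustrative, not taken from any real program) of the kind of inner loop where the pattern occurs; each iteration performs a base + offset load, so an extra `add` per load lands directly in the hot path:

```rust
/// Sum `len` u16 values starting at `src`.
///
/// Safety: `src` must be valid for reads of `len * 2` bytes.
unsafe fn sum_halfwords(src: *const u16, len: usize) -> u32 {
    let mut total = 0u32;
    let mut off = 0usize;
    while off < len * 2 {
        // Base + byte-offset load: ideally a single `ldrh rD, [rB, rI]`,
        // but rustc currently emits an `add` followed by `ldrh rD, [rD]`.
        total += src.byte_add(off).read() as u32;
        off += 2;
    }
    total
}
```

The minimal reproduction: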
```rust
unsafe fn read(src: *const u16, n: usize) -> u16 {
    src.byte_add(n).read()
}
```
produces
```asm
read:
    add r0, r0, r1
    ldrh r0, [r0]
    bx lr
```
I'd expect it to produce
```asm
read:
    ldrh r0, [r0, r1]
    bx lr
```
as GCC does. I believe that on many targets (at the very least, ARMv4) the single `ldrh` with a register offset is always faster than the `add` + `ldrh` sequence.
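As a workaround sketch (my addition, assuming a toolchain with stable `asm!` support on ARM), the register-offset addressing mode can be forced with inline assembly; this is only meant to demonstrate that the desired instruction is directly expressible:

```rust
use core::arch::asm;

/// Load a u16 from `src + n` bytes using the register-offset
/// addressing mode (`ldrh rD, [rB, rI]`) directly.
///
/// Safety: `src.byte_add(n)` must be valid for a 2-byte read.
#[cfg(target_arch = "arm")]
unsafe fn read_fused(src: *const u16, n: usize) -> u16 {
    let out: u32;
    asm!(
        "ldrh {out}, [{src}, {n}]",
        src = in(reg) src,
        n = in(reg) n,
        out = lateout(reg) out,
        options(pure, readonly, nostack)
    );
    out as u16
}
```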
Note that this is an issue in LLVM itself: Clang exhibits the same poor code generation.
- Rust (rustc, bad): https://godbolt.org/z/7oPe8crM7
- C (Clang, bad): https://godbolt.org/z/4M9E7Kh91
- C (GCC, good): https://godbolt.org/z/639cxxKc8
I haven't been able to test Rust's new GCC backend, since I couldn't work out how to make it generate code for ARM32 targets.
`rustc --version --verbose`:

```text
rustc 1.79.0-nightly (8b2459c1f 2024-04-09)
```