Description
Reciprocal approximations
The following instructions are of special interest, and could make it into the B format encoding space:
- FRECPE - Floating-point reciprocal estimate (estimate 1.0/x)
- FRSQRTE - Floating-point reciprocal square root estimate (estimate 1.0/sqrt(x))
The main question is how accurate the approximation should be. A common choice in other ISAs appears to be ~8 bits (the CRAY-1 produced 30 bits, though). We could use different levels of accuracy for different floating-point formats.
Here are a few different possibilities:
Approximation | NR steps for float32 | NR steps for float16 | NR steps for float8 |
---|---|---|---|
8 | 2 | 1 | 0 |
11 | 2 | 0 | 0 |
12-13 | 1 | 0 | 0 |
For the instructions to be valuable, they should offer a reasonable improvement over the classic hack:
float approximate_rsqrt(float x) {
auto i = std::bit_cast<uint32_t>(x);
i = 0x5f3759df - (i >> 1);
return std::bit_cast<float>(i);
}
...which is very cheap on MRISC32 (only 4 integer instructions without any latencies):
ldi r2, #0x5f3759df ; ldi + or
lsr r1, r1, #1
sub r1, r2, r1
Reciprocal improvements
The approximation can be improved using Newton-Raphson iterations. These iterations can be implemented using regular floating-point arithmetic operations, but it's also possible to provide dedicated instructions that does the majority of the work of an iteration in a single instruction.
The main advantages of such instructions are:
- Reduce the latency of consecutive, dependent NR operations.
- Support 1/0 = Inf (during NR
Inf*0
must result inInf
rather thanNaN
, which would have been the case without a dedicated instruction). - Possibly improved accuracy as the instruction performs a fused multiply-add.
These instructions would have to go into the format A encoding space.
Reciprocal
One NR step for improving y = 1 / x
is: y = y * (2 - x * y)
.
A suitable instruction (based on FMA) should implement 2 - a * b
.
Reciprocal square root
One NR step for improving y = 1 / sqrt(x)
is: y = y * (1.5 - (0.5 * x * y * y))
.
A suitable instruction (based on FMA) should implement (3 - a * b) / 2
.