This is a portable, performant implementation of Poly1305, a "secret-key message-authentication code suitable for a wide variety of applications".
All assembler is PIC safe.
The library can be initialized, i.e. the most optimized implementation that passes internal tests will be automatically selected, in two ways, neither of which is thread safe:
- `int poly1305_startup(void);` explicitly initializes the library, and returns a non-zero value if no suitable implementation is found that passes internal tests.
- Do nothing and use the library like normal. It will auto-initialize itself when needed, and hard exit if no suitable implementation is found.
Common assumptions:
- When using the incremental functions, the `poly1305_state` struct is assumed to be word aligned, if necessary, for the system in use.

For the one-shot function, `in` is assumed to be word aligned. The incremental functions have no alignment requirements for `in`, but will slow down if pointers that are not word aligned are passed.
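A caller that cannot guarantee alignment may want to check it before choosing between the one-shot and incremental paths. The following is a minimal sketch of such a check. `is_aligned` and `is_word_aligned` are hypothetical helpers, not part of the library, and treating "word" as `sizeof(size_t)` bytes is an assumption, since the text does not define the word size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the library.
 * Returns 1 if p is aligned to `align` bytes, 0 otherwise.
 * `align` must be a power of two. */
static int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}

/* Assumption: a "word" is sizeof(size_t) bytes on the target system. */
static int is_word_aligned(const void *p) {
    return is_aligned(p, sizeof(size_t));
}
```

If `is_word_aligned(in)` returns 0, one option is to pass the buffer through the incremental functions instead, which accept unaligned input at a speed cost.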
void poly1305_auth(unsigned char *mac, const unsigned char *in, size_t inlen, const poly1305_key *key);
Creates an authenticator in `mac` under the key `key` with `inlen` bytes from `in`.
Incremental `in` buffers are not required to be word aligned. Unaligned buffers will require copying to aligned buffers however, which will obviously incur a speed penalty.
void poly1305_init(poly1305_state *S, const poly1305_key *key)
Initializes `S` with the key `key`.
void poly1305_init_ext(poly1305_state *S, const poly1305_key *key, size_t bytes_hint)
Initializes `S` with the key `key`, and the hint that no more than `bytes_hint` bytes will be authenticated. If more than `bytes_hint` bytes are passed, in total, the result may be undefined.
void poly1305_update(poly1305_state *S, const unsigned char *in, size_t inlen)
Updates the state `S` with `inlen` bytes from `in`.
void poly1305_finish(poly1305_state *S, unsigned char *mac)
Performs any finalizations on `S` and stores the resulting authenticator in `mac`.
size_t bytes = ...;
unsigned char data[...] = {...};
poly1305_key key = {{...}};
unsigned char mac[16];
poly1305_auth(mac, data, bytes, &key);
Hashing incrementally, i.e. with multiple calls to update the state.
size_t bytes = ...;
unsigned char data[...] = {...};
poly1305_key key = {{...}};
unsigned char mac[16];
poly1305_state state;
size_t i;
poly1305_init(&state, &key);
/* add one byte at a time, extremely inefficient */
for (i = 0; i < bytes; i++) {
    poly1305_update(&state, data + i, 1);
}
poly1305_finish(&state, mac);
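Both examples produce a 16-byte authenticator; a receiver would typically recompute it and compare it with the one it was sent. The sketch below shows one way to do that comparison without branching on the contents of the buffers. `mac_equal_16` is a hypothetical helper and is not part of the library.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the library.
 * Compares two 16-byte authenticators without an early exit, so the
 * running time does not depend on where the first mismatch occurs.
 * Returns 1 if they are equal, 0 otherwise. */
static int mac_equal_16(const unsigned char *a, const unsigned char *b) {
    unsigned int diff = 0;
    size_t i;

    /* Accumulate every differing bit; diff stays in [0, 255]. */
    for (i = 0; i < 16; i++)
        diff |= (unsigned int)(a[i] ^ b[i]);

    /* (diff - 1) wraps to all ones only when diff == 0, so bit 8 of the
     * result is set exactly when the buffers matched. */
    return (int)(((diff - 1u) >> 8) & 1u);
}
```

A receiver could call `poly1305_auth` on the received message to fill an `expected` buffer, then accept the message only if `mac_equal_16(expected, received_mac)` returns 1.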
There are 3 reference versions, specialized for increasingly capable systems from 8 bit-ish only operations (with the world's most inefficient portable carries, you really don't want to use this unless nothing else runs) to 64 bit.
- Generic 8-bit-ish: poly1305_ref
- Generic 32-bit with 64-bit compiler support: poly1305_ref
- Generic 64-bit: poly1305_ref
- 386 compatible: poly1305_x86
- SSE2: poly1305_sse2
- AVX: poly1305_avx
- AVX2: poly1305_avx2
The 386 compatible version is a modified version of djb's floating point public domain implementation.
The SSE2, AVX, and AVX2 versions of the one-shot function `poly1305_auth` will revert to the 386 compatible version if the number of bytes is below a certain threshold.
- x86-64 compatible: poly1305_x86
- SSE2: poly1305_sse2
- AVX: poly1305_avx
- AVX2: poly1305_avx2
The SSE2, AVX, and AVX2 versions of the one-shot function `poly1305_auth` will revert to the x86-64 compatible version if the number of bytes is below a certain threshold.
The x86-64 compatible version is only included for short messages. It is thoroughly beaten by SIMD versions above 64-128 bytes.
- ARMv6: poly1305_armv6
- NEON: poly1305_neon
The NEON version of the one-shot function `poly1305_auth` will revert to the ARMv6 version if the number of bytes is below a certain threshold.
See asm-opt#configuring for full configure options.
If you would like to use Yasm with a gcc-compatible compiler, pass `--yasm` to configure.
The Visual Studio projects are generated assuming Yasm is available. You will need to have Yasm.exe somewhere in your path to build them.
./configure
make lib
Then run `make install-lib`, or copy `bin/poly1305.lib` and `app/include/poly1305.h` to your desired location.
./configure --pic
make shared
make install-shared
./configure
make util
bin/poly1305-util [bench|fuzz]
Benchmarking will implicitly test every available version. If any fail, it will exit with an error indicating which versions did not pass. Features tested include:
- One-shot and Incremental authentication
- Results above 2^130 - 5 are properly normalized
- All potential block sizes in the underlying implementation are triggered
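The normalization item refers to the final reduction modulo the prime that Poly1305 works in. As a sketch, write $h$ for the accumulator before the final reduction, and assume it has already been partially reduced so that $0 \le h < 2p$ (this range is an assumption of the sketch; the text does not state it):

```latex
p = 2^{130} - 5, \qquad
\operatorname{normalize}(h) =
\begin{cases}
  h - p & \text{if } h \ge p, \\
  h     & \text{if } h < p,
\end{cases}
\qquad 0 \le h < 2p .
```

Because $h \ge p$ holds exactly when $h + 5 \ge 2^{130}$, the comparison can be done by adding 5 and checking for a carry out of bit 130. The test exercises inputs that land in the $h \ge p$ case.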
Fuzzing tests every available implementation for the current CPU against the reference implementation. Features tested are:
- One-shot and Incremental authentication
Only the top 3 benchmarks per mode will be shown. Anything past 3 or so is pretty irrelevant to the current architecture.
Implementation | 1 byte | 64 bytes | 576 bytes | 8192 bytes |
---|---|---|---|---|
SSE2-64 | 158 | 4.70 | 2.22 | 1.53 |
SSE2-32 | 275 | 7.42 | 2.54 | 1.80 |
x86-64 | 158 | 4.74 | 3.44 | 3.30 |
x86-32 | 275 | 7.08 | 3.74 | 3.33 |
Timings were taken with Turbo Boost and Hyperthreading on, so they are approximate. For reference, OpenSSL and Crypto++ give ~0.8 cpb for AES-128-CTR and ~1.1 cpb for AES-256-CTR, ~7.4 cpb for SHA-512, and ~4.5 cpb for MD5.
Implementation | 1 byte | 64 bytes | 576 bytes | 8192 bytes |
---|---|---|---|---|
AVX2-64 | 110 | 3.22 | 0.96 | 0.60 |
AVX2-32 | 223 | 4.37 | 1.15 | 0.67 |
AVX-64 | 110 | 3.22 | 1.39 | 1.06 |
AVX-32 | 223 | 4.37 | 1.51 | 1.04 |
SSE2-64 | 110 | 3.22 | 1.43 | 1.12 |
SSE2-32 | 223 | 4.33 | 1.55 | 1.10 |
Timings were taken with Turbo on, so they are approximate, and I'm not sure how to adjust for it. Depending on clock speed (3.1 GHz vs 4.0 GHz), OpenSSL gives between 0.73 and 0.94 cpb for AES-128-CTR, 1.03 and 1.33 cpb for AES-256-CTR, 10.96 and 14.1 cpb for SHA-512, and 4.7 and 5.16 cpb for MD5.
Implementation | 1 byte | 64 bytes | 576 bytes | 8192 bytes |
---|---|---|---|---|
AVX-64 | 175 | 5.27 | 1.35 | 0.80 |
SSE2-64 | 175 | 5.36 | 1.47 | 0.88 |
AVX-32 | 319 | 5.72 | 1.85 | 1.19 |
SSE2-32 | 320 | 5.78 | 1.94 | 1.31 |
x86-32 | 313 | 8.00 | 3.62 | 2.99 |
x86-64 | 175 | 5.30 | 4.03 | 3.83 |
I don't have access to the cycle counter yet, so cycles are computed by taking the microseconds times the clock speed (666 MHz) divided by 1 million. For comparison, on long messages, OpenSSL 1.0.0e gives 52.3 cpb for aes-128-cbc (woof), ~123 cpb for SHA-512 (really woof), and ~9.6 cpb for MD5.
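The conversion described above can be written out as a short sketch. The function names are hypothetical and are not taken from the benchmark utility; only the formula comes from the text.

```c
#include <stddef.h>

/* Hypothetical helper: estimated cycles for a run that took `microseconds`
 * on a core clocked at `clock_hz` (e.g. 666000000.0 for 666 MHz).
 * cycles = microseconds * clock_hz / 1,000,000 */
static double estimated_cycles(double microseconds, double clock_hz) {
    return microseconds * clock_hz / 1000000.0;
}

/* Hypothetical helper: cycles per byte for a message of `bytes` bytes.
 * `bytes` must be non-zero. */
static double estimated_cpb(double microseconds, double clock_hz, size_t bytes) {
    return estimated_cycles(microseconds, clock_hz) / (double)bytes;
}
```

At 666 MHz each microsecond corresponds to 666 cycles, so the estimate is only as fine-grained as the timer used to measure the microseconds.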
Implementation | 1 byte | 64 bytes | 576 bytes | 8192 bytes |
---|---|---|---|---|
Neon-32 | 290 | 9.53 | 3.33 | 2.26 |
ARMv6-32 | 290 | 9.53 | 6.99 | 6.73 |
Public Domain, or MIT