These are specially optimized primitives for Poly1305, a "secret-key message-authentication code suitable for a wide variety of applications".
A sample Poly1305 implementation built on the primitives is included. It provides one-pass and incremental support with CPU dispatching.
All assembler is PIC safe.
All pointers should probably be word aligned. I haven't implemented alignment checking on messages yet, as I don't know whether using misaligned pointers is common or not.
See: http://nacl.cace-project.eu/onetimeauth.html. Specifically (slightly plagiarized):
The poly1305_auth function, viewed as a function of the message for a uniform random key, is designed to meet the standard notion of unforgeability after a single message. After the sender authenticates one message, an attacker cannot find authenticators for any other messages.
The sender MUST NOT use poly1305_auth to authenticate more than one message under the same key. Authenticators for two messages under the same key should be expected to reveal enough information to allow forgeries of authenticators on other messages.
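Why the key is one-time can be sketched from the shape of the computation. Assuming the usual Poly1305 construction, in which the key splits into an evaluation point $r$ and a pad $s$ (notation introduced here for illustration, not used elsewhere in this README), the authenticator of a message with padded blocks $c_1, \dots, c_q$ is:

$$t = \Big(\Big(\sum_{i=1}^{q} c_i \, r^{\,q-i+1} \bmod (2^{130}-5)\Big) + s\Big) \bmod 2^{128}$$

For two messages with blocks $c_i$ and $c'_i$ authenticated under the same key, subtracting the two authenticators cancels the pad:

$$t - t' \equiv \Big(\sum_{i=1}^{q} c_i \, r^{\,q-i+1} \bmod (2^{130}-5)\Big) - \Big(\sum_{i=1}^{q'} c'_i \, r^{\,q'-i+1} \bmod (2^{130}-5)\Big) \pmod{2^{128}}$$

The right-hand side depends on $r$ alone, so a second authenticator gives an attacker an equation that constrains $r$, which is what a fresh key per message avoids.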
sh configure.sh (--compiler [*gcc,clang,icc,..])
[gcc,clang,icc,..] poly1305.c poly1305_extensions.S -O3 -o poly1305.o (-fPIC)
configure.sh creates poly1305_config.inc, which indicates what the compiler supports. Available options are:
- #define POLY1305_EXT_REF_8, support for 8x8=16 bit multiplications and 32 bit additions
- #define POLY1305_EXT_REF_32, support for 32x32=64 bit multiplications and 64 bit additions
- #define POLY1305_EXT_X86, support for 32 bit x86 instructions
- #define POLY1305_EXT_X86_64, support for 64 bit x86 instructions
- #define POLY1305_EXT_SSE2, support for SSE2
- #define POLY1305_EXT_AVX, support for AVX
- #define POLY1305_EXT_AVX2, support for AVX2
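Code layered on top of the primitives can test the same macros at compile time. The following is a minimal sketch, assuming the macros are visible to the translation unit (for example by including poly1305_config.inc, which the sketch itself does not do). compiled_extension_name is a hypothetical helper, not part of the library.

```c
#include <stddef.h>

/* Hypothetical helper: reports the widest extension that the generated
   configuration says the compiler supports. The macro names are the ones
   listed for poly1305_config.inc; with none defined it reports "none". */
static const char *compiled_extension_name(void) {
#if defined(POLY1305_EXT_AVX2)
    return "AVX2";
#elif defined(POLY1305_EXT_AVX)
    return "AVX";
#elif defined(POLY1305_EXT_SSE2)
    return "SSE2";
#elif defined(POLY1305_EXT_X86_64)
    return "x86-64";
#elif defined(POLY1305_EXT_X86)
    return "x86";
#elif defined(POLY1305_EXT_REF_32)
    return "Ref32";
#elif defined(POLY1305_EXT_REF_8)
    return "Ref8";
#else
    return "none";
#endif
}
```

This only reflects what the compiler was able to build. Which implementation actually runs on a given CPU is decided at run time by poly1305_detect.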
There are two ways to use the code: through the sample implementation in poly1305.c, or directly with the platform specific versions.
int poly1305_detect(void);
Before using the sample implementation, call poly1305_detect to determine the best implementation for the current CPU. poly1305_detect additionally calls poly1305_power_on_self_test for each implementation to verify they are working properly. poly1305_detect returns 1 if everything is working, or 0 if there is a failure.
The sample implementation provides two ways to calculate authenticators.
void poly1305_auth(unsigned char mac[16], const unsigned char *m, size_t bytes, const poly1305_key *key);
where mac is the buffer which receives the 16 byte authenticator, m is a pointer to the message to be processed, bytes is the number of bytes in the message, and key is the poly1305 key that is only used for this message and is discarded immediately after.
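On the receiving side, the recomputed authenticator has to be compared with the one that arrived with the message. This README does not describe a verification function, so the following is only a sketch of a comparison helper under that assumption. mac_equal is a hypothetical name, and the comparison is written to examine every byte regardless of where the first difference is.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the library): returns 1 if the two
   16 byte authenticators are equal, 0 otherwise. All 16 bytes are always
   examined, so the loop does not exit early on the first mismatch. */
static int mac_equal(const unsigned char a[16], const unsigned char b[16]) {
    unsigned int diff = 0;
    size_t i;
    for (i = 0; i < 16; i++)
        diff |= (unsigned int)(a[i] ^ b[i]);
    /* diff is in 0..255; (diff - 1) >> 8 has its low bit set only
       when diff == 0, because the subtraction wraps around */
    return (int)(((diff - 1u) >> 8) & 1u);
}
```

A receiver would call poly1305_auth on the received message into a local buffer, then accept the message only if mac_equal(local, received) returns 1.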
poly1305_context is declared in poly1305.h and is an opaque structure large enough to support every underlying platform specific implementation. It has no alignment requirements, but must not be copied, as it is aligned internally and a different base address will result in a different alignment.
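Why a copy is unsafe can be seen with a small sketch. It assumes, hypothetically, that the context locates its working area by rounding its own address up to a 64 byte boundary. The actual internal layout and alignment are not described here, and both function names are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not the library's code: round an address up
   to the next multiple of alignment (which must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment) {
    return (addr + (alignment - 1)) & ~(alignment - 1);
}

/* Offset from the start of a structure located at base to its first
   aligned byte. */
static size_t aligned_offset(uintptr_t base, uintptr_t alignment) {
    return (size_t)(align_up(base, alignment) - base);
}
```

Under that assumption, a context at address 0x1008 keeps its state 56 bytes in, while a byte-for-byte copy placed at 0x1010 would look for it 48 bytes in and read the wrong bytes.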
void poly1305_init(poly1305_context *ctx, const poly1305_key *key);
void poly1305_init_ext(poly1305_context *ctx, const poly1305_key *key, unsigned long long bytes_hint);
where key is the poly1305 key that is only used for this message and is discarded immediately after, and, when using poly1305_init_ext, bytes_hint is the total length of the message that will be processed. This allows the underlying implementation to skip some pre-calculations if the message will not be long enough to warrant them.
void poly1305_update(poly1305_context *ctx, const unsigned char *m, size_t bytes);
where m is a pointer to the message fragment to be processed, and bytes is the length of the message fragment.
void poly1305_finish(poly1305_context *ctx, unsigned char mac[16]);
where mac is the buffer which receives the 16 byte authenticator. After calling finish, the underlying implementation will zero out ctx.
The platform specific implementations provide a single-call function and a set of functions which are used to build an incremental Poly1305. Each implementation suffixes all functions, e.g. poly1305_auth_ref, poly1305_auth_x86, poly1305_auth_avx, etc.
void poly1305_auth_xxx(unsigned char mac[16], const unsigned char *m, size_t bytes, const poly1305_key *key);
where mac is the buffer which receives the 16 byte authenticator, m is a pointer to the message to be processed, bytes is the number of bytes in the message, and key is the poly1305 key that is only used for this message and is discarded immediately after.
The platform specific incremental functions take a void pointer to a context, which varies in size and alignment requirements:
- 8 bit reference: 52 bytes, no alignment requirement
- 32 bit reference: (14 * sizeof(unsigned long)) + 1 byte, sizeof(unsigned long) byte alignment requirement
- SSE2: 240 bytes + sizeof(size_t), 16 byte alignment
- AVX: 240 bytes + sizeof(size_t), 16 byte alignment
- AVX2: 320 bytes + sizeof(size_t), 32 byte alignment
The cover-all choice for x86 is a 328 byte buffer with at least 32 byte alignment.
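One way to obtain such a buffer is a statically allocated array with an explicit alignment. This is a sketch assuming a C11 compiler (for alignas from stdalign.h). The macro names, the buffer name and is_aligned are hypothetical and not part of the library.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the text above: 328 bytes, 32 byte alignment.
   The macro names themselves are hypothetical. */
#define POLY1305_X86_CTX_BYTES 328
#define POLY1305_X86_CTX_ALIGN 32

/* Hypothetical scratch buffer usable as the void *ctx argument of any of
   the x86 incremental implementations. */
static alignas(POLY1305_X86_CTX_ALIGN)
    unsigned char x86_ctx_buf[POLY1305_X86_CTX_BYTES];

/* Returns 1 if p is aligned to alignment bytes, 0 otherwise. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}
```

A plain array is used rather than a struct member because an aligned struct would have its size rounded up to a multiple of the alignment.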
size_t poly1305_block_size_xxx(void);
returns the block size of the underlying implementation.
void poly1305_init_ext_xxx(void *ctx, const poly1305_key *key, unsigned long long bytes_hint);
where key is the poly1305 key that is only used for this message and is discarded immediately after, and bytes_hint is the total length of the message that will be processed. Implementations that don't use precomputations are free to ignore this.
void poly1305_blocks_xxx(void *ctx, const unsigned char *m, size_t bytes);
where m is a pointer to the message fragment to be processed, and bytes is the number of bytes in the message fragment. bytes must be a multiple of the block size; it is not possible to feed arbitrary length fragments in.
void poly1305_finish_ext_xxx(void *ctx, const unsigned char *m, size_t remaining, unsigned char mac[16]);
where m is a pointer to the remaining bytes of the message, remaining is the number of bytes in m, and mac is the buffer which receives the 16 byte authenticator. Note that remaining can be 0, and must be less than the block size of the underlying implementation. Before returning, poly1305_finish_ext_xxx will zero out the context.
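The two length rules (a multiple of the block size for the blocks call, less than one block for the finish call) mean a caller holding a whole message has to split its length first. A minimal sketch of that arithmetic follows. split_message is a hypothetical helper, not a library function.

```c
#include <stddef.h>

/* Hypothetical helper: split a message of `bytes` bytes into a leading
   part whose length is a multiple of block_size and a tail that is
   shorter than block_size. block_size must be non-zero. */
static void split_message(size_t bytes, size_t block_size,
                          size_t *full, size_t *remaining) {
    *remaining = bytes % block_size;   /* always < block_size, may be 0 */
    *full = bytes - *remaining;        /* always a multiple of block_size */
}
```

With block_size taken from poly1305_block_size_xxx() and ctx set up by poly1305_init_ext_xxx, the calls would then be poly1305_blocks_xxx(ctx, m, full) followed by poly1305_finish_ext_xxx(ctx, m + full, remaining, mac). Skipping the blocks call when full is 0 is a cautious choice, since the behaviour for a zero length is not stated here.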
poly1305_key key = {{...}};
const uint8_t msg[100] = {...};
uint8_t mac[16];
poly1305_auth(mac, msg, 100, &key);
poly1305_key key = {{...}};
const uint8_t msg[100] = {...};
poly1305_context ctx;
uint8_t mac[16];
size_t i;
poly1305_init(&ctx, &key);
/* update one byte at a time, extremely inefficient */
for (i = 0; i < 100; i++)
    poly1305_update(&ctx, msg + i, 1);
poly1305_finish(&ctx, mac);
Timings are in cycles (rdtsc). Raw cycles are reported for 1 byte to give an idea of the overhead for very short messages, and cycles/byte for 64 bytes and above.
Results sorted by long message performance.
Ref32, and especially Ref8, have fairly poor performance, but as both are provided for portability on un-optimized platforms this is not an issue.
bench-x86.sh is provided to easily test implementations. It uses gcc, but any gcc compatible compiler can be used.
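The two units used in the tables are related by a simple division. The helpers below are a hypothetical sketch of that conversion and are not taken from bench-x86.sh.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert the raw cycle count of one call into
   cycles/byte. bytes must be non-zero. */
static double cycles_per_byte(uint64_t cycles, size_t bytes) {
    return (double)cycles / (double)bytes;
}

/* Inverse direction: approximate total cycles for a message of the
   given length at the given cycles/byte rate. */
static double total_cycles(double rate, size_t bytes) {
    return rate * (double)bytes;
}
```

For example, a rate of 1.54 cycles/byte on a 4096 byte message corresponds to roughly 6308 cycles for the whole call.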
Impl. | 1 byte | 64 bytes | 576 bytes | 4096 bytes |
---|---|---|---|---|
x86-64 | 262 | 5.66 | 2.06 | 1.54 |
SSE2-32 | 300 | 7.81 | 2.43 | 1.81 |
x86-32 | 287 | 7.23 | 3.75 | 3.36 |
Ref32-64 | 275 | 7.42 | 4.95 | 4.62 |
Ref32-32 | 412 | 14.45 | 11.22 | 10.88 |
Ref8-32 | 2250 | 124.02 | 116.73 | 116.89 |
Ref8-64 | 2662 | 149.81 | 145.07 | 144.48 |
Timings are with Turbo Boost and Hyperthreading enabled, so the figures are not exact.
Impl. | 1 byte | 64 bytes | 576 bytes | 4096 bytes |
---|---|---|---|---|
AVX2-64 | 194 | 3.73 | 0.98 | 0.63 |
AVX2-32 | 218 | 5.88 | 1.25 | 0.73 |
AVX-64 | 173 | 3.27 | 1.34 | 1.08 |
AVX-32 | 194 | 5.03 | 1.48 | 1.08 |
x86-64 | 176 | 3.69 | 1.41 | 1.20 |
SSE2-32 | 194 | 5.03 | 1.54 | 1.14 |
x86-32 | 206 | 4.80 | 2.56 | 2.34 |
Ref32-64 | 143 | 3.69 | 2.62 | 2.55 |
Ref32-32 | 286 | 9.73 | 8.01 | 7.84 |
Ref8-32 | 1025 | 54.69 | 51.85 | 51.56 |
Ref8-64 | 1165 | 64.19 | 61.54 | 61.22 |
Timings are with Turbo on, so the figures are not exact.
Impl. | 1 byte | 64 bytes | 576 bytes | 4096 bytes |
---|---|---|---|---|
AVX-64 | 268 | 5.08 | 1.33 | 0.85 |
x86-64 | 266 | 5.05 | 1.43 | 0.96 |
AVX-32 | 306 | 7.31 | 1.86 | 1.25 |
SSE2-32 | 301 | 7.31 | 1.90 | 1.28 |
x86-32 | 351 | 8.39 | 3.61 | 3.03 |
Ref32-64 | 291 | 9.09 | 6.77 | 6.60 |
Ref32-32 | 412 | 13.75 | 11.34 | 10.68 |
Ref8-32 | 2060 | 117.20 | 113.40 | 113.01 |
Ref8-64 | 2310 | 131.92 | 128.22 | 127.75 |
Fuzzing against the reference implementations is available. See fuzz/README.
poly1305_power_on_self_test is run for every implementation available when poly1305_detect is called. There are unfortunately no official test vectors available, so it tests against the test vector in NaCl, a message/key pair that causes the internal hash to come out larger than the prime modulus, and messages of length 0 to 256.
sources/ contains the base source files that were used to put everything together.
- Ref8 is based on poly1305/ref by djb
- Ref32 is mine
- x86 is hand altered from poly1305/x86 by djb
- SSE2 and AVX are based on SSE2 (32bit / 64bit) by me
- AVX2 is based on AVX2 (32bit / 64bit) by me
Public Domain, or MIT