Optimized implementations of Poly1305, a fast message-authentication-code

floodyberry/poly1305-opt

ABOUT

These are specially optimized primitives for Poly1305, a "secret-key message-authentication code suitable for a wide variety of applications".

A sample Poly1305 implementation utilizing the primitives which provides one-pass and incremental support with CPU dispatching is included.

All assembler is PIC safe.

All pointers should probably be word aligned. I haven't implemented alignment checking on messages yet, as I don't know whether misaligned pointers are common.

Usage

See http://nacl.cace-project.eu/onetimeauth.html. Specifically (slightly plagiarized):

The poly1305_auth function, viewed as a function of the message for a uniform random key, is designed to meet the standard notion of unforgeability after a single message. After the sender authenticates one message, an attacker cannot find authenticators for any other messages.

The sender MUST NOT use poly1305_auth to authenticate more than one message under the same key. Authenticators for two messages under the same key should be expected to reveal enough information to allow forgeries of authenticators on other messages.

Compiling

sh configure.sh (--compiler [*gcc,clang,icc,..])
[gcc,clang,icc,..] poly1305.c poly1305_extensions.S -O3 -o poly1305.o (-fPIC)

Configuring by Hand

configure.sh creates poly1305_config.inc, which indicates what the compiler supports. Available options are

  • #define POLY1305_EXT_REF_8, support for 8x8=16 bit multiplications and 32 bit additions
  • #define POLY1305_EXT_REF_32, support for 32x32=64 bit multiplications and 64 bit additions
  • #define POLY1305_EXT_X86, support for 32 bit x86 instructions
  • #define POLY1305_EXT_X86_64, support for 64 bit x86 instructions
  • #define POLY1305_EXT_SSE2, support for SSE2
  • #define POLY1305_EXT_AVX, support for AVX
  • #define POLY1305_EXT_AVX2, support for AVX2

Calling

There are two ways to use the code: through the sample implementation in poly1305.c, or directly with the platform specific versions.

Sample Implementation

int poly1305_detect(void);

Before using the sample implementation, call poly1305_detect to determine the best implementation for the current CPU. poly1305_detect additionally calls poly1305_power_on_self_test for each implementation to verify they are working properly.

poly1305_detect returns 1 if everything is working, or 0 if there is a failure.

The sample implementation provides two ways to calculate authenticators.

1. Single Call version

void poly1305_auth(unsigned char mac[16], const unsigned char *m, size_t bytes, const poly1305_key *key);

where mac is the buffer which receives the 16 byte authenticator,

m is a pointer to the message to be processed,

bytes is the number of bytes in the message, and

key is the poly1305 key that is only used for this message and is discarded immediately after.

2. Incremental version

poly1305_context is declared in poly1305.h and is an opaque structure large enough to support every underlying platform specific implementation. It has no alignment requirements, but must not be copied as it is aligned internally and a different base address will result in a different alignment.

void poly1305_init(poly1305_context *ctx, const poly1305_key *key);
void poly1305_init_ext(poly1305_context *ctx, const poly1305_key *key, unsigned long long bytes_hint);

where

key is the poly1305 key that is only used for this message and is discarded immediately after,

and, when using poly1305_init_ext, bytes_hint is the total length of the message that will be processed. This allows the underlying implementation to skip some pre-calculations if the message will not be long enough to warrant them.

void poly1305_update(poly1305_context *ctx, const unsigned char *m, size_t bytes);

where m is a pointer to the message fragment to be processed, and

bytes is the length of the message fragment.

void poly1305_finish(poly1305_context *ctx, unsigned char mac[16]);

where mac is the buffer which receives the 16 byte authenticator. After calling finish, the underlying implementation will zero out ctx.

Platform Specific Implementations

The platform specific implementations provide a single call function, and a set of functions which are used to build an incremental Poly1305. Each implementation adds its own suffix to every function name, e.g. poly1305_auth_ref, poly1305_auth_x86, poly1305_auth_avx, etc.

1. Single Call version

void poly1305_auth_xxx(unsigned char mac[16], const unsigned char *m, size_t bytes, const poly1305_key *key);

where mac is the buffer which receives the 16 byte authenticator,

m is a pointer to the message to be processed,

bytes is the number of bytes in the message, and

key is the poly1305 key that is only used for this message and is discarded immediately after.

2. Incremental functions

The platform specific incremental functions take a void pointer to a context, which varies in size and alignment requirements:

  • 8 bit reference: 52 bytes, no alignment requirement
  • 32 bit reference: (14 * sizeof(unsigned long)) + 1 byte, sizeof(unsigned long) byte alignment requirement
  • SSE2: 240 bytes + sizeof(size_t), 16 byte alignment
  • AVX: 240 bytes + sizeof(size_t), 16 byte alignment
  • AVX2: 320 bytes + sizeof(size_t), 32 byte alignment

The cover-all choice for x86 is a 328 byte buffer with at least 32 byte alignment.
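
As a rough sketch of the cover-all buffer described above, C11 `alignas` can reserve 328 bytes at 32 byte alignment. The type name `poly1305_x86_ctx_buf` and the helper `is_aligned` are made up for illustration and are not part of this library. The 328 figure matches 320 + sizeof(size_t) only where size_t is 8 bytes.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper type (not part of poly1305-opt): a 328 byte buffer
 * aligned to 32 bytes, large enough for any of the x86 contexts listed above. */
typedef struct {
    alignas(32) unsigned char bytes[328];
} poly1305_x86_ctx_buf;

/* Hypothetical helper: returns 1 if p is aligned to `alignment`, else 0.
 * `alignment` must be a non-zero power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A `poly1305_x86_ctx_buf buf;` could then be passed as `buf.bytes` wherever a platform specific function expects its void pointer context.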

size_t poly1305_block_size_xxx(void);

returns the block size of the underlying implementation

void poly1305_init_ext_xxx(void *ctx, const poly1305_key *key, unsigned long long bytes_hint);

where key is the poly1305 key that is only used for this message and is discarded immediately after,

and, bytes_hint is the total length of the message that will be processed. Implementations that don't use precomputations are free to ignore this.

void poly1305_blocks_xxx(void *ctx, const unsigned char *m, size_t bytes);

where m is a pointer to the message fragment to be processed, and

bytes is the number of bytes in the message fragment. bytes must be a multiple of the block size; it is not possible to feed in arbitrary length fragments.

void poly1305_finish_ext_xxx(void *ctx, const unsigned char *m, size_t remaining, unsigned char mac[16]);

where m is a pointer to the remaining bytes of the message,

remaining is the number of bytes in m. Note that remaining can be 0, and must be less than the block size of the underlying implementation, and

mac is the buffer which receives the 16 byte authenticator.

Before returning, poly1305_finish_ext will zero out the context.
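
Because poly1305_blocks_xxx only accepts whole blocks and poly1305_finish_ext_xxx takes the tail, a caller has to split the message length first. The following is a minimal sketch of that arithmetic only; `message_split` and `split_message` are hypothetical names, not functions from this library, and the block size would come from poly1305_block_size_xxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type (not part of poly1305-opt). */
typedef struct {
    size_t block_bytes; /* multiple of block_size, for poly1305_blocks_xxx */
    size_t remaining;   /* always < block_size, for poly1305_finish_ext_xxx */
} message_split;

/* Hypothetical helper: split a message length into the part made of whole
 * blocks and the leftover tail. */
static message_split split_message(size_t bytes, size_t block_size)
{
    message_split s;
    assert(block_size != 0);
    s.remaining = bytes % block_size;
    s.block_bytes = bytes - s.remaining;
    return s;
}
```

With such a split, the first block_bytes bytes at m would go to poly1305_blocks_xxx, and the pointer m + block_bytes together with remaining would go to poly1305_finish_ext_xxx.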

Examples

Creating an authenticator, single call:

poly1305_key key = {{...}};
const uint8_t msg[100] = {...};
uint8_t mac[16];

poly1305_auth(mac, msg, 100, &key);

Creating an authenticator, incrementally:

poly1305_key key = {{...}};
const uint8_t msg[100] = {...};
poly1305_context ctx;
uint8_t mac[16];
size_t i;

poly1305_init(&ctx, &key);

/* update one byte at a time, extremely inefficient */
for (i = 0; i < 100; i++)
    poly1305_update(&ctx, msg + i, 1);

poly1305_finish(&ctx, mac);
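
The examples above only create authenticators. A receiver would typically recompute the authenticator and compare it with the one received, and that comparison is usually done without data-dependent branches. A minimal sketch of such a comparison follows; `mac_equal_16` is a hypothetical helper written for this example, not a function this README documents.

```c
#include <stddef.h>

/* Hypothetical helper (not part of poly1305-opt): returns 1 if the two
 * 16 byte authenticators are equal, 0 otherwise. Every byte is always
 * examined, so the loop does not exit early on the first mismatch. */
static int mac_equal_16(const unsigned char a[16], const unsigned char b[16])
{
    unsigned int diff = 0;
    size_t i;

    for (i = 0; i < 16; i++)
        diff |= (unsigned int)(a[i] ^ b[i]);

    /* diff is in 0..255; (diff - 1) >> 8 has its low bit set only when
     * diff is 0, because the subtraction then wraps around. */
    return (int)(1 & ((diff - 1) >> 8));
}
```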

Performance

Timings are in cycles (rdtsc). Raw cycles are reported for 1 byte to give an idea of the overhead for very short messages, and cycles/byte for 64 bytes and above.

Results sorted by long message performance.

Ref32, and especially Ref8, have fairly poor performance, but as both are provided for portability on un-optimized platforms this is not an issue.

bench-x86.sh is provided to easily test implementations. It uses gcc, but any gcc compatible compiler can be used.
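
To relate cycles/byte figures to throughput, divide a clock rate by the cycles/byte value. The sketch below assumes a fixed clock and that one rdtsc tick equals one core cycle, which need not hold with frequency scaling; `bytes_per_second` is a hypothetical helper and the 3.0 GHz used in the note after it is an illustrative figure, not the clock of any CPU benchmarked here.

```c
#include <assert.h>

/* Hypothetical helper (not part of poly1305-opt): converts a cycles/byte
 * figure into bytes/second at the given clock rate in Hz. */
static double bytes_per_second(double cycles_per_byte, double clock_hz)
{
    assert(cycles_per_byte > 0.0 && clock_hz > 0.0);
    return clock_hz / cycles_per_byte;
}
```

Under those assumptions, 1.5 cycles/byte at 3.0 GHz would work out to about 2.0e9 bytes per second.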

Impl.      1 byte   64 bytes   576 bytes   4096 bytes
x86-64     262      5.66       2.06        1.54
SSE2-32    300      7.81       2.43        1.81
x86-32     287      7.23       3.75        3.36
Ref32-64   275      7.42       4.95        4.62
Ref32-32   412      14.45      11.22       10.88
Ref8-32    2250     124.02     116.73      116.89
Ref8-64    2662     149.81     145.07      144.48

Timings are with Turbo Boost and Hyperthreading enabled, so they are not exact.

Impl.      1 byte   64 bytes   576 bytes   4096 bytes
AVX2-64    194      3.73       0.98        0.63
AVX2-32    218      5.88       1.25        0.73
AVX-64     173      3.27       1.34        1.08
AVX-32     194      5.03       1.48        1.08
x86-64     176      3.69       1.41        1.20
SSE2-32    194      5.03       1.54        1.14
x86-32     206      4.80       2.56        2.34
Ref32-64   143      3.69       2.62        2.55
Ref32-32   286      9.73       8.01        7.84
Ref8-32    1025     54.69      51.85       51.56
Ref8-64    1165     64.19      61.54       61.22

AMD FX-8120

Timings are with Turbo enabled, so they are not exact.

Impl.      1 byte   64 bytes   576 bytes   4096 bytes
AVX-64     268      5.08       1.33        0.85
x86-64     266      5.05       1.43        0.96
AVX-32     306      7.31       1.86        1.25
SSE2-32    301      7.31       1.90        1.28
x86-32     351      8.39       3.61        3.03
Ref32-64   291      9.09       6.77        6.60
Ref32-32   412      13.75      11.34       10.68
Ref8-32    2060     117.20     113.40      113.01
Ref8-64    2310     131.92     128.22      127.75

Testing

Fuzzing against the reference implementations is available. See fuzz/README.

poly1305_power_on_self_test is run for every available implementation when poly1305_detect is called. There are unfortunately no official test vectors available, so it tests against the test vector in NaCl, a message/key pair that causes the internal hash to come out larger than the prime modulus, and messages of length 0 to 256.

Sources

sources/ contains the base source files that were used to put everything together.

LICENSE

Public Domain, or MIT
