LZ4_memcpy_using_offset small optimization #1347

Nicoshev · 2024-01-14T20:49:15Z

Sending this PR as a subset of #1222, trying to isolate an optimization.

Tested it on an 12th gen Intel CPU.
Showed to be faster on gcc-9, gcc-10, gcc-11 and gcc-12.
Tied on clang-12 and clang-13, with a tiny favorable trend on clang-12.
Seems to incur a small performance regression on clang-11 and clang-14.

Overall it seems to be faster, and in theory it should be faster as well.

Might worth giving it a try.

Cyan4973 · 2024-01-15T01:35:23Z

It's likely that the impact of this patch depends on the data to decompress.
Are there specific data sources you have used to observe the speed impact ?
While I'll make some complementary tests on my side, I'm concerned I may employ data sources that don't use the modified code path often enough to matter.

Cyan4973 · 2024-01-15T02:16:49Z

So, I made some tests, across a variety of compilers and compressed files,
the log is pretty long, but I guess it's necessary to observe the problem:

PR1347                                                                          | dev
compile clean lz4 with gcc-7                                                    │compile clean lz4 with gcc-7
48180333f03e8039bc6e3ff6f12b0044  lz4                                           │7166618239f6ad30fd7f9c235e94590c  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2805.0 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2821.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2554.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2558.9 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2370.1 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2381.8 MB/s
compile clean lz4 with gcc-8                                                    │compile clean lz4 with gcc-8
47b72f3375c6ad7034fcd9af1bf5d16d  lz4                                           │6194b574ac6a7440e97889b1c3f9e830  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2974.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3667.2 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2732.6 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3578.8 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2523.1 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3276.3 MB/s
compile clean lz4 with gcc-9                                                    │compile clean lz4 with gcc-9
d2630412a7a2a4230b0dfabcfcd59c4e  lz4                                           │315df72b525703f37d4bbd6af3b27f3e  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3690.0 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3241.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3588.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3108.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3282.0 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2851.8 MB/s
compile clean lz4 with gcc-10                                                   │compile clean lz4 with gcc-10
1fba9c8bc1ab783a294183aa54b5f722  lz4                                           │a77a7732928e276031a09349ddd776a9  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3272.6 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2823.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3127.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2577.9 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2879.5 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2381.8 MB/s
compile clean lz4 with gcc-11                                                   │compile clean lz4 with gcc-11
bd59aa081d1f38562a036761ce8e14b5  lz4                                           │88529879f018d0362464f68b50261f24  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3598.1 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3109.8 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3529.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2922.5 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3246.3 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2694.0 MB/s
compile clean lz4 with clang-6.0                                                │compile clean lz4 with clang-6.0
88dd2dc19aa4a1da762db3100c7e1a98  lz4                                           │51deea9a1a98484f8f749afad1db1791  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2863.1 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3159.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2691.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3057.9 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2479.8 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2820.5 MB/s
compile clean lz4 with clang-7                                                  │compile clean lz4 with clang-7
3beabd9eab6d4c1e81a08a2839cde928  lz4                                           │2779891a8438c6ddaf81242a7e517ed9  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3421.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3575.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3401.6 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3476.2 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3088.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3200.8 MB/s
compile clean lz4 with clang-8                                                  │compile clean lz4 with clang-8
7290b35d81112e4b22cc3d5a551b7944  lz4                                           │9b168401b32f61dc7a7a9a379622389e  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3428.7 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3598.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3372.5 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3494.8 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3083.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3219.7 MB/s
compile clean lz4 with clang-9                                                  │compile clean lz4 with clang-9
ad68c393416f20b5014fb6ed5fca9183  lz4                                           │f73fbbff66bbd65b07c5474b19679ce3  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3530.9 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3743.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3423.0 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3585.1 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3151.7 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3297.4 MB/s
compile clean lz4 with clang-10                                                 │compile clean lz4 with clang-10
541da45e800263c843f540f49d6c2878  lz4                                           │a7b71fcce7c9b725f901db97abf32033  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3502.0 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3714.1 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3408.3 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3568.5 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3118.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3281.4 MB/s
compile clean lz4 with clang-11                                                 │compile clean lz4 with clang-11
3c11be5c1be103610a7486d9bf7d1e07  lz4                                           │87d5efd84c2e4be7d010951f1980e4f0  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3144.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3333.5 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2931.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3156.1 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2720.9 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2916.3 MB/s
compile clean lz4 with clang-12                                                 │compile clean lz4 with clang-12
66ea97c739c5294322295a8e0d4c2e55  lz4                                           │bdf35b1ceaa0e6b1537ecc2962da16db  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3263.7 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3298.1 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3125.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3128.6 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2871.8 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2897.2 MB/s

Summary :
While I can observe speed gains for gcc 9, 10 and 11, as announced,
this is not universal.
For example, the decompression speed impact is very bad for gcc-8. And it's negative for multiple versions of clang.

Ultimately, the wild performance differences measured probably have nothing to do with the patch itself, at least not directly.
A more likely cause is random instruction alignment side effects as a consequence of tiny changes in the binary size of some functions.

Which means, I don't think this way of measuring performance is good enough to analyze the benefits of the patch.
So we need another approach.

A few potential ideas, for discussion:

find and compare the generated assembly, prove that the new assembly is "better"
- might not be easy in this case, because there is a change from a static write pattern to a multiplication, so comparison is not straightforward
find a use case where this code path is used more often, often enough to measure a regular difference
- eventually, create the pattern manually
- there is always a risk that the created synthetic pattern is "too artificial", and does not represent any real use case, but I think the point is to analyse the impact of the change in isolation, so it's fine to focus on that.

Nicoshev added 2 commits January 14, 2024 11:04

Optimize LZ4_memcpy_using_offset

7fa674d

Fix pointer arithmetic on 32-bit systems

5b92b38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LZ4_memcpy_using_offset small optimization #1347

LZ4_memcpy_using_offset small optimization #1347

Nicoshev commented Jan 14, 2024

Cyan4973 commented Jan 15, 2024

Cyan4973 commented Jan 15, 2024 •

edited

Loading

LZ4_memcpy_using_offset small optimization #1347

Are you sure you want to change the base?

LZ4_memcpy_using_offset small optimization #1347

Conversation

Nicoshev commented Jan 14, 2024

Cyan4973 commented Jan 15, 2024

Cyan4973 commented Jan 15, 2024 • edited Loading

Cyan4973 commented Jan 15, 2024 •

edited

Loading