Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LZ4_memcpy_using_offset small optimization #1347

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

Nicoshev
Copy link
Contributor

Sending this PR as a subset of #1222, trying to isolate an optimization.

Tested it on an 12th gen Intel CPU.
Showed to be faster on gcc-9, gcc-10, gcc-11 and gcc-12.
Tied on clang-12 and clang-13, with a tiny favorable trend on clang-12.
Seems to incur a small performance regression on clang-11 and clang-14.

Overall it seems to be faster, and in theory it should be faster as well.

Might worth giving it a try.

@Cyan4973
Copy link
Member

It's likely that the impact of this patch depends on the data to decompress.
Are there specific data sources you have used to observe the speed impact ?
While I'll make some complementary tests on my side, I'm concerned I may employ data sources that don't use the modified code path often enough to matter.

@Cyan4973
Copy link
Member

Cyan4973 commented Jan 15, 2024

So, I made some tests, across a variety of compilers and compressed files,
the log is pretty long, but I guess it's necessary to observe the problem:

PR1347                                                                          | dev
compile clean lz4 with gcc-7                                                    │compile clean lz4 with gcc-7
48180333f03e8039bc6e3ff6f12b0044  lz4                                           │7166618239f6ad30fd7f9c235e94590c  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2805.0 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2821.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2554.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2558.9 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2370.1 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2381.8 MB/s
compile clean lz4 with gcc-8                                                    │compile clean lz4 with gcc-8
47b72f3375c6ad7034fcd9af1bf5d16d  lz4                                           │6194b574ac6a7440e97889b1c3f9e830  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2974.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3667.2 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2732.6 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3578.8 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2523.1 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3276.3 MB/s
compile clean lz4 with gcc-9                                                    │compile clean lz4 with gcc-9
d2630412a7a2a4230b0dfabcfcd59c4e  lz4                                           │315df72b525703f37d4bbd6af3b27f3e  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3690.0 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3241.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3588.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3108.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3282.0 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2851.8 MB/s
compile clean lz4 with gcc-10                                                   │compile clean lz4 with gcc-10
1fba9c8bc1ab783a294183aa54b5f722  lz4                                           │a77a7732928e276031a09349ddd776a9  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3272.6 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2823.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3127.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2577.9 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2879.5 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2381.8 MB/s
compile clean lz4 with gcc-11                                                   │compile clean lz4 with gcc-11
bd59aa081d1f38562a036761ce8e14b5  lz4                                           │88529879f018d0362464f68b50261f24  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3598.1 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3109.8 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3529.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2922.5 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3246.3 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2694.0 MB/s
compile clean lz4 with clang-6.0                                                │compile clean lz4 with clang-6.0
88dd2dc19aa4a1da762db3100c7e1a98  lz4                                           │51deea9a1a98484f8f749afad1db1791  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 2863.1 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3159.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2691.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3057.9 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2479.8 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2820.5 MB/s
compile clean lz4 with clang-7                                                  │compile clean lz4 with clang-7
3beabd9eab6d4c1e81a08a2839cde928  lz4                                           │2779891a8438c6ddaf81242a7e517ed9  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3421.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3575.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3401.6 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3476.2 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3088.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3200.8 MB/s
compile clean lz4 with clang-8                                                  │compile clean lz4 with clang-8
7290b35d81112e4b22cc3d5a551b7944  lz4                                           │9b168401b32f61dc7a7a9a379622389e  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3428.7 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3598.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3372.5 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3494.8 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3083.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3219.7 MB/s
compile clean lz4 with clang-9                                                  │compile clean lz4 with clang-9
ad68c393416f20b5014fb6ed5fca9183  lz4                                           │f73fbbff66bbd65b07c5474b19679ce3  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3530.9 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3743.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3423.0 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3585.1 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3151.7 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3297.4 MB/s
compile clean lz4 with clang-10                                                 │compile clean lz4 with clang-10
541da45e800263c843f540f49d6c2878  lz4                                           │a7b71fcce7c9b725f901db97abf32033  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3502.0 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3714.1 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3408.3 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3568.5 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3118.2 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3281.4 MB/s
compile clean lz4 with clang-11                                                 │compile clean lz4 with clang-11
3c11be5c1be103610a7486d9bf7d1e07  lz4                                           │87d5efd84c2e4be7d010951f1980e4f0  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3144.2 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3333.5 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 2931.1 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3156.1 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2720.9 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2916.3 MB/s
compile clean lz4 with clang-12                                                 │compile clean lz4 with clang-12
66ea97c739c5294322295a8e0d4c2e55  lz4                                           │bdf35b1ceaa0e6b1537ecc2962da16db  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present       │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3263.7 MB/s  │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3298.1 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3125.9 MB/s  │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3128.6 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2871.8 MB/s  │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 2897.2 MB/s

Summary :
While I can observe speed gains for gcc 9, 10 and 11, as announced,
this is not universal.
For example, the decompression speed impact is very bad for gcc-8. And it's negative for multiple versions of clang.

Ultimately, the wild performance differences measured probably have nothing to do with the patch itself, at least not directly.
A more likely cause is random instruction alignment side effects as a consequence of tiny changes in the binary size of some functions.

Which means, I don't think this way of measuring performance is good enough to analyze the benefits of the patch.
So we need another approach.

A few potential ideas, for discussion:

  1. find and compare the generated assembly, prove that the new assembly is "better"
    • might not be easy in this case, because there is a change from a static write pattern to a multiplication, so comparison is not straightforward
  2. find a use case where this code path is used more often, often enough to measure a regular difference
    • eventually, create the pattern manually
    • there is always a risk that the created synthetic pattern is "too artificial", and does not represent any real use case, but I think the point is to analyse the impact of the change in isolation, so it's fine to focus on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants