Call benchmark method directly #2334

Closed
wants to merge 2 commits into from

Conversation

timcassell
Collaborator

Fixes #1133

This wraps the workload call with a NoInlining | NoOptimization method instead of a delegate.
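To make the difference concrete, here is a minimal sketch of the two invocation styles. It is not BenchmarkDotNet's generated code: the class and member names (`InvocationSketch`, `RunViaDelegate`, `WorkloadWrapper`, `RunViaWrapper`) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a generated runnable type; not BenchmarkDotNet's actual code.
public sealed class InvocationSketch
{
    private int _field;
    private readonly Action _workloadDelegate;

    public InvocationSketch() => _workloadDelegate = Increment;

    public int Field => _field;

    // Stand-in for the user's benchmark method.
    public void Increment() => _field++;

    // Delegate style: each iteration is an indirect call through a delegate.
    public void RunViaDelegate(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
            _workloadDelegate();
    }

    // Wrapper style: the benchmark is called directly from a method that the JIT
    // is asked neither to inline nor to optimize.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    private void WorkloadWrapper() => Increment();

    public void RunViaWrapper(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
            WorkloadWrapper();
    }
}
```

Both loops perform the same work; only the call mechanism that surrounds the benchmark method differs, and that mechanism is what the overhead measurement has to cancel out.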

Mac Intel x64 results

BenchmarkDotNet=v0.13.5.20230619-develop, OS=macOS Monterey 12.3 (21E230) [Darwin 21.4.0]
Intel Core i9-9880H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.5.23303.2
  [Host]     : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT AVX2

Master

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.0013 ns | 0.0037 ns | 0.0048 ns |
| TwoIncrement   | 0.0034 ns | 0.0043 ns | 0.0064 ns |
| ThreeIncrement | 0.0000 ns | 0.0000 ns | 0.0000 ns |
| FourIncrement  | 0.1652 ns | 0.0053 ns | 0.0047 ns |
| FiveIncrement  | 0.3950 ns | 0.0047 ns | 0.0039 ns |
| SixIncrement   | 0.6129 ns | 0.0038 ns | 0.0035 ns |

This PR

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.2852 ns | 0.0063 ns | 0.0056 ns |
| TwoIncrement   | 0.3822 ns | 0.0477 ns | 0.0742 ns |
| ThreeIncrement | 0.4122 ns | 0.0071 ns | 0.0063 ns |
| FourIncrement  | 0.4845 ns | 0.0080 ns | 0.0071 ns |
| FiveIncrement  | 0.5980 ns | 0.0082 ns | 0.0068 ns |
| SixIncrement   | 1.1828 ns | 0.0078 ns | 0.0069 ns |

Windows AMD x64 results

BenchmarkDotNet=v0.13.5.20230619-develop, OS=Windows 10 (10.0.19045.3086/22H2/2022Update)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
.NET SDK=8.0.100-preview.5.23303.2
  [Host]     : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT SSE3
  DefaultJob : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT SSE3

Master

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.1833 ns | 0.0359 ns | 0.0300 ns |
| TwoIncrement   | 0.6950 ns | 0.0523 ns | 0.0538 ns |
| ThreeIncrement | 0.7524 ns | 0.0062 ns | 0.0051 ns |
| FourIncrement  | 1.0793 ns | 0.0154 ns | 0.0129 ns |
| FiveIncrement  | 1.1073 ns | 0.0589 ns | 0.0630 ns |
| SixIncrement   | 1.6137 ns | 0.0683 ns | 0.0787 ns |

This PR

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.6927 ns | 0.0182 ns | 0.0179 ns |
| TwoIncrement   | 1.1581 ns | 0.0104 ns | 0.0087 ns |
| ThreeIncrement | 1.2368 ns | 0.0057 ns | 0.0050 ns |
| FourIncrement  | 1.4841 ns | 0.0054 ns | 0.0045 ns |
| FiveIncrement  | 1.7893 ns | 0.0918 ns | 0.1128 ns |
| SixIncrement   | 2.3818 ns | 0.1026 ns | 0.0909 ns |

@timcassell
Collaborator Author

timcassell commented Jul 24, 2023

With this PR, I am seeing what looks like more accurate results in the default toolchain, but the InProcessEmitToolchain is now showing faster times than out-of-process toolchains. The IL is identical (as confirmed by the IL comparison tests), so I'm not really sure why this is. I tried to disassemble to see what's going on, but DisassemblyDiagnoser apparently doesn't work with InProcessEmitToolchain (I got errors).

Looking at the logs, it appears that the overhead is now measured as taking more time.
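Why a larger overhead measurement lowers the reported times can be sketched as follows. This assumes, as a simplification and not as a copy of the engine's actual code, that the reported time is the measured workload time per operation minus the measured overhead time per operation, floored at zero. `OverheadSketch` and `ReportedNanoseconds` are hypothetical names.

```csharp
using System;

public static class OverheadSketch
{
    // Assumption: reported time = workload time - overhead time, never below zero.
    public static double ReportedNanoseconds(double workloadNs, double overheadNs)
        => Math.Max(0.0, workloadNs - overheadNs);
}
```

Under that assumption, any bias in the overhead estimate shifts every reported mean by the same amount, and a benchmark whose workload time is at or below the overhead estimate shows up as 0.0000 ns.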

@ig-sinicyn Any ideas?

Master:

    Runtime=.NET 7.0  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.1382 ns | 0.0093 ns | 0.0073 ns |
|   TwoIncrement | 0.7758 ns | 0.0281 ns | 0.0235 ns |
| ThreeIncrement | 0.7911 ns | 0.0304 ns | 0.0445 ns |
|  FourIncrement | 1.3554 ns | 0.0681 ns | 0.1040 ns |

    Toolchain=InProcessEmitToolchain  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.2270 ns | 0.0279 ns | 0.0261 ns |
|   TwoIncrement | 0.8583 ns | 0.0539 ns | 0.0477 ns |
| ThreeIncrement | 0.4007 ns | 0.0578 ns | 0.1071 ns |
|  FourIncrement | 1.4215 ns | 0.0096 ns | 0.0080 ns |

PR:

    Runtime=.NET 7.0  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.5118 ns | 0.0697 ns | 0.0977 ns |
|   TwoIncrement | 1.1762 ns | 0.0813 ns | 0.0903 ns |
| ThreeIncrement | 1.4889 ns | 0.0869 ns | 0.1300 ns |
|  FourIncrement | 1.6251 ns | 0.0893 ns | 0.1028 ns |

    Toolchain=InProcessEmitToolchain  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.0000 ns | 0.0000 ns | 0.0000 ns |
|   TwoIncrement | 0.1269 ns | 0.0088 ns | 0.0078 ns |
| ThreeIncrement | 0.5865 ns | 0.0064 ns | 0.0060 ns |
|  FourIncrement | 0.5677 ns | 0.0074 ns | 0.0062 ns |

@timcassell timcassell marked this pull request as draft July 24, 2023 14:03
@timcassell
Collaborator Author

timcassell commented Jul 28, 2023

I reverted the ClrMd disassembler back to v1 on my local machine so I could actually inspect the asm. The only difference I see that might affect the result is this.

Default toolchain

call      qword ptr [BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()]

InProcessEmit

call      BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()

The IL is exactly the same for those calls, so it seems the JIT treats emitted IL slightly differently. I don't know enough asm to say what effect that difference has, but it seems that qword ptr is faster. (Only the overhead and wrapper calls are different; the workload call uses qword ptr for both toolchains.)

call-direct-default-asm.md
call-direct-inprocess-asm.md

@timcassell
Collaborator Author

timcassell commented Aug 1, 2023

The assembly issue with InProcessEmit is only in net7+. The overhead measurement is off by about 2-3 clock cycles, which isn't far off from the current measurement in all toolchains. I don't think it should block this from being merged.

@timcassell timcassell marked this pull request as ready for review August 1, 2023 06:51
@timcassell timcassell added this to the v0.14.0 milestone Jan 14, 2024
@timcassell
Collaborator Author

@AndreyAkinshin I would also like to get this into v0.14.0 if you don't mind (followed by #2336). These two PRs will likely change the results of long-term measurements (like dotnet/performance) in exchange for higher accuracy.

@AndreyAkinshin AndreyAkinshin modified the milestones: v0.14.x, v0.14.0 Jan 22, 2024
@timcassell timcassell linked an issue Mar 6, 2024 that may be closed by this pull request
@timcassell timcassell force-pushed the call-direct branch 2 times, most recently from 1982b8a to 6ba4993 Compare March 10, 2024 04:25
@AndreyAkinshin
Member

@timcassell could you please rebase on master one more time? I introduced a bug in Perfolizer 0.3.16 that was fixed in 0.3.17. I just pushed Perfolizer 0.3.17 to BenchmarkDotNet master.

@AndreyAkinshin
Member

It seems I found a problem. Let's consider the following environment:

BenchmarkDotNet v0.13.13-develop (2024-03-11), Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

The original benchmark was extended to the following form:

Source code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main() => BenchmarkRunner.Run<OverheadTests>();
}

[DisassemblyDiagnoser]
public class OverheadTests
{
    private int _field;

    [Benchmark]
    public void Increment01()
    {
        _field++;
    }

    [Benchmark]
    public void Increment02()
    {
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment03()
    {
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment04()
    {
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment05()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment06()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment07()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment08()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment09()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment10()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment20()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }
}

Here is the generated assembly:

Assembly

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment01()
       inc       dword ptr [rdi+8]
       ret
; Total bytes of code 4

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment02()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 14

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment03()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 19

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment04()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 24

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment05()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 29

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment06()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 34

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment07()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 39

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment08()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 49

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment09()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 54

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment10()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 59

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment20()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 109

An interesting observation: starting with Increment08, .NET 8.0.23.53103 wraps the method body with

push      rbp
mov       rbp,rsp
...
pop       rbp

Here are my results with the latest master:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0282 ns | 0.0000 ns | 0.0000 ns |       4 B |
| Increment02 | 0.0329 ns | 0.0000 ns | 0.0000 ns |      14 B |
| Increment03 | 0.0339 ns | 0.0005 ns | 0.0004 ns |      19 B |
| Increment04 | 0.0000 ns | 0.0000 ns | 0.0000 ns |      24 B |
| Increment05 | 0.1551 ns | 0.0001 ns | 0.0001 ns |      29 B |
| Increment06 | 0.1588 ns | 0.0007 ns | 0.0006 ns |      34 B |
| Increment07 | 0.3427 ns | 0.0022 ns | 0.0021 ns |      39 B |
| Increment08 | 0.5363 ns | 0.0002 ns | 0.0002 ns |      49 B |
| Increment09 | 0.7391 ns | 0.0005 ns | 0.0005 ns |      54 B |
| Increment10 | 0.9274 ns | 0.0015 ns | 0.0013 ns |      59 B |
| Increment20 | 2.7696 ns | 0.0004 ns | 0.0004 ns |     109 B |

The results are quite consistent, stable, and reproducible. For Increment01..04, we have "instant" results, but at least the "Mean" time is not decreasing with an increased number of increments.

Now let's run the same set of benchmarks using BenchmarkDotNet from this PR:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0113 ns | 0.0061 ns | 0.0057 ns |       4 B |
| Increment02 | 0.0073 ns | 0.0001 ns | 0.0000 ns |      14 B |
| Increment03 | 0.0030 ns | 0.0000 ns | 0.0000 ns |      19 B |
| Increment04 | 0.0145 ns | 0.0001 ns | 0.0001 ns |      24 B |
| Increment05 | 0.0175 ns | 0.0001 ns | 0.0001 ns |      29 B |
| Increment06 | 0.0251 ns | 0.0017 ns | 0.0016 ns |      34 B |
| Increment07 | 0.5500 ns | 0.0014 ns | 0.0013 ns |      39 B |
| Increment08 | 0.6812 ns | 0.0442 ns | 0.1007 ns |      49 B |
| Increment09 | 0.3456 ns | 0.0001 ns | 0.0001 ns |      54 B |
| Increment10 | 0.3746 ns | 0.0035 ns | 0.0033 ns |      59 B |
| Increment20 | 2.2034 ns | 0.0020 ns | 0.0018 ns |     109 B |

Observations:

  • "Instant"-result problem for Increment01..04 is not resolved (plus Increment05..06 are always "instant" now)
  • The mean time estimate drops after Increment08 (from 0.6812 ns for Increment08 to 0.3456 ns for Increment09)

While the "correct" results are a controversial thing in this case, the non-monotonic Mean column definitely feels wrong, and it's a clear regression compared to the master. These results are also reproducible on my machine: Increment09 and Increment10 are always reported to be faster than Increment07 and Increment08.
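The non-monotonic pattern can be checked mechanically. The following is a small sketch, not part of BenchmarkDotNet, that flags every position where a mean is lower than the one before it; `MonotonicityCheck` and `FindDecreases` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class MonotonicityCheck
{
    // Returns every index i where means[i] is smaller than means[i - 1].
    public static List<int> FindDecreases(IReadOnlyList<double> means)
    {
        var decreases = new List<int>();
        for (int i = 1; i < means.Count; i++)
        {
            if (means[i] < means[i - 1])
                decreases.Add(i);
        }
        return decreases;
    }
}
```

Fed the Mean column of the PR table above (0.0113, 0.0073, 0.0030, 0.0145, 0.0175, 0.0251, 0.5500, 0.6812, 0.3456, 0.3746, 2.2034), it returns indices 1, 2, and 8, that is Increment02, Increment03, and Increment09. The first two sit in the "instant" range where the noise is comparable to the value; the drop at Increment09 is the one that matters here.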

I'm ready to collect any additional diagnostic info if needed.

@timcassell
Collaborator Author

@AndreyAkinshin Unfortunately I don't have a Ryzen CPU to run benchmarks on, but I ran those benchmarks again on both of my machines and got results that look mostly good (the only outlier is the drop from inc3 to inc4 on Intel on both branches). There must be a CPU-architecture reason for those results.

Master

BenchmarkDotNet v0.13.13-develop (2024-03-11), Windows 10 (10.0.19045.4046/22H2/2022Update)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.200
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3


| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.4739 ns | 0.0395 ns | 0.0369 ns |       4 B |
| Increment02 | 0.7638 ns | 0.0169 ns | 0.0158 ns |      14 B |
| Increment03 | 0.7994 ns | 0.0283 ns | 0.0236 ns |      19 B |
| Increment04 | 1.0731 ns | 0.0076 ns | 0.0059 ns |      24 B |
| Increment05 | 1.1564 ns | 0.0615 ns | 0.0683 ns |      29 B |
| Increment06 | 1.4360 ns | 0.0692 ns | 0.1097 ns |      34 B |
| Increment07 | 1.4596 ns | 0.0457 ns | 0.0382 ns |      39 B |
| Increment08 | 1.9861 ns | 0.0282 ns | 0.0220 ns |      44 B |
| Increment09 | 2.0474 ns | 0.0542 ns | 0.0507 ns |      49 B |
| Increment10 | 2.2247 ns | 0.0239 ns | 0.0200 ns |      54 B |
| Increment20 | 5.6803 ns | 0.0725 ns | 0.0643 ns |     104 B |

BenchmarkDotNet v0.13.13-develop (2024-03-11), macOS Monterey 12.6 (21G115) [Darwin 21.6.0]
Intel Core i9-9880H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


| Method      | Mean      | Error     | StdDev    |
|------------ |----------:|----------:|----------:|
| Increment01 | 0.0168 ns | 0.0127 ns | 0.0119 ns |
| Increment02 | 0.0366 ns | 0.0104 ns | 0.0081 ns |
| Increment03 | 0.2366 ns | 0.0150 ns | 0.0133 ns |
| Increment04 | 0.1721 ns | 0.0175 ns | 0.0164 ns |
| Increment05 | 0.4294 ns | 0.0123 ns | 0.0103 ns |
| Increment06 | 0.7268 ns | 0.0299 ns | 0.0279 ns |
| Increment07 | 1.0112 ns | 0.0359 ns | 0.0336 ns |
| Increment08 | 1.2435 ns | 0.0271 ns | 0.0253 ns |
| Increment09 | 1.4853 ns | 0.0309 ns | 0.0274 ns |
| Increment10 | 1.7526 ns | 0.0231 ns | 0.0216 ns |
| Increment20 | 4.3590 ns | 0.0467 ns | 0.0437 ns |

PR

BenchmarkDotNet v0.13.13-develop (2024-03-11), Windows 10 (10.0.19045.4046/22H2/2022Update)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.201
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3


| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.6082 ns | 0.0716 ns | 0.1757 ns |       4 B |
| Increment02 | 1.1070 ns | 0.0182 ns | 0.0142 ns |      14 B |
| Increment03 | 1.2486 ns | 0.0261 ns | 0.0231 ns |      19 B |
| Increment04 | 1.4787 ns | 0.0125 ns | 0.0104 ns |      24 B |
| Increment05 | 1.7511 ns | 0.0104 ns | 0.0092 ns |      29 B |
| Increment06 | 2.4044 ns | 0.0249 ns | 0.0221 ns |      34 B |
| Increment07 | 2.5753 ns | 0.1115 ns | 0.1145 ns |      39 B |
| Increment08 | 3.8930 ns | 0.0206 ns | 0.0172 ns |      44 B |
| Increment09 | 4.0305 ns | 0.0762 ns | 0.0713 ns |      49 B |
| Increment10 | 4.2039 ns | 0.0184 ns | 0.0172 ns |      54 B |
| Increment20 | 6.9911 ns | 0.0497 ns | 0.0465 ns |     104 B |

BenchmarkDotNet v0.13.13-develop (2024-03-11), macOS Monterey 12.6 (21G115) [Darwin 21.6.0]
Intel Core i9-9880H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


| Method      | Mean      | Error     | StdDev    |
|------------ |----------:|----------:|----------:|
| Increment01 | 0.3142 ns | 0.0238 ns | 0.0223 ns |
| Increment02 | 0.4408 ns | 0.0141 ns | 0.0132 ns |
| Increment03 | 0.6237 ns | 0.0428 ns | 0.0380 ns |
| Increment04 | 0.5075 ns | 0.0271 ns | 0.0240 ns |
| Increment05 | 0.6775 ns | 0.0271 ns | 0.0240 ns |
| Increment06 | 1.2777 ns | 0.0309 ns | 0.0289 ns |
| Increment07 | 1.4158 ns | 0.0417 ns | 0.0390 ns |
| Increment08 | 1.6902 ns | 0.0253 ns | 0.0197 ns |
| Increment09 | 2.0321 ns | 0.0342 ns | 0.0267 ns |
| Increment10 | 2.2494 ns | 0.0456 ns | 0.0404 ns |
| Increment20 | 4.9814 ns | 0.0795 ns | 0.0704 ns |

There is no assembly for the Intel chip (the disassembler has no support for macOS), but here is the assembly for the old AMD chip:

Assembly

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment01()
       inc       dword ptr [rcx+8]
       ret
; Total bytes of code 4

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment02()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 14

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment03()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 19

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment04()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 24

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment05()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 29

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment06()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 34

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment07()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 39

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment08()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 44

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment09()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 49

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment10()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 54

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment20()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 104

I see no logical reason for there to be a flaw with wrapping the call in a NoInlining method rather than a delegate, but I could be missing something.

@timcassell
Collaborator Author

@AndreyAkinshin I know it'll spoil the "instant" results, but I wonder what results you will get if you make the field volatile to prevent any cpu optimization shenanigans.
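For reference, the suggested change amounts to something like the sketch below. It is a hypothetical, trimmed-down variant of the benchmark class posted earlier: only two methods are shown, and the `[Benchmark]` attributes are left out so that the sketch compiles without the BenchmarkDotNet package.

```csharp
// Hypothetical variant of the benchmark class; attributes omitted on purpose.
public class VolatileOverheadTests
{
    // volatile: each read and write of the field must actually be performed,
    // so the JIT cannot keep the value in a register across increments.
    private volatile int _field;

    public int Field => _field;

    public void Increment01()
    {
        _field++;
    }

    public void Increment02()
    {
        _field++;
        _field++;
    }
}
```

The only change relative to the original class is the `volatile` modifier on the field.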

@timcassell
Collaborator Author

@AndreyAkinshin Also, can you disassemble *Workload* and *Overhead* methods? I'm curious if the assembly calls match or if there's some differences like I saw with the IL emit.

@AndreyAkinshin
Member

@timcassell

> I wonder what results you will get if you make the field volatile to prevent any cpu optimization shenanigans.

master:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0272 ns | 0.0005 ns | 0.0005 ns |       4 B |
| Increment02 | 0.0138 ns | 0.0151 ns | 0.0141 ns |       7 B |
| Increment03 | 0.0326 ns | 0.0000 ns | 0.0000 ns |      10 B |
| Increment04 | 0.0632 ns | 0.0018 ns | 0.0015 ns |      13 B |
| Increment05 | 0.1947 ns | 0.0001 ns | 0.0001 ns |      16 B |
| Increment06 | 0.4581 ns | 0.0017 ns | 0.0014 ns |      24 B |
| Increment07 | 0.5812 ns | 0.0037 ns | 0.0033 ns |      27 B |
| Increment08 | 0.8946 ns | 0.0183 ns | 0.0171 ns |      30 B |
| Increment09 | 0.9137 ns | 0.0008 ns | 0.0007 ns |      33 B |
| Increment10 | 1.0144 ns | 0.0023 ns | 0.0022 ns |      36 B |
| Increment20 | 3.1494 ns | 0.0025 ns | 0.0023 ns |      66 B |

PR:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0103 ns | 0.0018 ns | 0.0016 ns |       4 B |
| Increment02 | 0.0068 ns | 0.0022 ns | 0.0021 ns |       7 B |
| Increment03 | 0.0148 ns | 0.0001 ns | 0.0001 ns |      10 B |
| Increment04 | 0.0147 ns | 0.0000 ns | 0.0000 ns |      13 B |
| Increment05 | 0.0186 ns | 0.0015 ns | 0.0014 ns |      16 B |
| Increment06 | 0.0418 ns | 0.0003 ns | 0.0003 ns |      24 B |
| Increment07 | 0.2448 ns | 0.0010 ns | 0.0009 ns |      27 B |
| Increment08 | 0.5236 ns | 0.0049 ns | 0.0046 ns |      30 B |
| Increment09 | 0.5196 ns | 0.0183 ns | 0.0171 ns |      33 B |
| Increment10 | 0.6923 ns | 0.0062 ns | 0.0058 ns |      36 B |
| Increment20 | 2.7928 ns | 0.0044 ns | 0.0041 ns |      66 B |

@AndreyAkinshin
Member

@timcassell

> Also, can you disassemble *Workload* and *Overhead* methods? I'm curious if the assembly calls match or if there's some differences like I saw with the IL emit.

Could you please remind me what is the easiest way to do this on Linux nowadays?

@timcassell
Collaborator Author

> Could you please remind me what is the easiest way to do this on Linux nowadays?

You can use the --disasmFilter *Workload* *Overhead* command-line argument, or

config.AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(filters: ["*Workload*", "*Overhead*"])))
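Spelled out as a complete entry point, the config variant might look like the sketch below. It assumes the BenchmarkDotNet package is referenced and that `OverheadTests` is the benchmark class posted earlier in this thread; building the config from `DefaultConfig.Instance` is one option among several, not something the thread prescribes.

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;

public static class Program
{
    public static void Main()
    {
        // Only disassemble methods whose names match these glob patterns.
        var disassembly = new DisassemblyDiagnoser(
            new DisassemblyDiagnoserConfig(filters: new[] { "*Workload*", "*Overhead*" }));

        // Assumption: start from the default config and add the diagnoser to it.
        var config = DefaultConfig.Instance.AddDiagnoser(disassembly);

        // OverheadTests is defined elsewhere (the class shown earlier in the thread).
        BenchmarkRunner.Run<OverheadTests>(config);
    }
}
```

If the benchmark class already carries a `[DisassemblyDiagnoser]` attribute, it is probably simplest to remove it so that only the filtered diagnoser is active.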

@timcassell
Collaborator Author

timcassell commented Mar 11, 2024

PR:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0103 ns | 0.0018 ns | 0.0016 ns |       4 B |
| Increment02 | 0.0068 ns | 0.0022 ns | 0.0021 ns |       7 B |
| Increment03 | 0.0148 ns | 0.0001 ns | 0.0001 ns |      10 B |
| Increment04 | 0.0147 ns | 0.0000 ns | 0.0000 ns |      13 B |
| Increment05 | 0.0186 ns | 0.0015 ns | 0.0014 ns |      16 B |
| Increment06 | 0.0418 ns | 0.0003 ns | 0.0003 ns |      24 B |
| Increment07 | 0.2448 ns | 0.0010 ns | 0.0009 ns |      27 B |
| Increment08 | 0.5236 ns | 0.0049 ns | 0.0046 ns |      30 B |
| Increment09 | 0.5196 ns | 0.0183 ns | 0.0171 ns |      33 B |
| Increment10 | 0.6923 ns | 0.0062 ns | 0.0058 ns |      36 B |
| Increment20 | 2.7928 ns | 0.0044 ns | 0.0041 ns |      66 B |

Well, those results look more stable (an almost consistent increase after Increment06). It looks like a nearly constant 0.4 ns was subtracted from your master results. If your CPU runs at 5 GHz, that's 2 clock cycles, which is almost exactly what I see with the InProcessEmitToolchain on my older machine. I'll be curious to see if the assembly shows it.
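To make the clock-cycle estimate concrete: a core running at 1 GHz completes one cycle per nanosecond, so cycles = nanoseconds × GHz. A minimal sketch of that arithmetic follows. The `CycleMath` class and its method name are hypothetical, and the 5 GHz clock is the assumption from the comment, not a measured value.

```csharp
using System;

public static class CycleMath
{
    // 1 GHz equals one cycle per nanosecond, so the product is a cycle count.
    public static double NanosecondsToCycles(double nanoseconds, double frequencyGHz)
        => nanoseconds * frequencyGHz;

    public static void Main()
    {
        // 0.4 ns offset at an assumed 5 GHz clock: roughly 2 cycles.
        Console.WriteLine(NanosecondsToCycles(0.4, 5.0));
    }
}
```

The same helper can be used to sanity-check other rows, as long as the real clock speed during the run is known (boost clocks make this only an approximation).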

It's also interesting that adding volatile shrank the code size. 🤔

@timcassell
Collaborator Author

timcassell commented Mar 12, 2024

Other things to check out:

- Results with net6.0 runtime
- Results with full Framework runtime (if you can, I know you said you're on Linux)
- Results with another cpu (if you have another cpu you can test with)

@AndreyAkinshin
Member

Status update: measurements are in progress (I want to collect a comprehensive set of summary tables and share them at once)

@timcassell
Collaborator Author

I borrowed another computer to run these tests with a Ryzen cpu. These are the results I got with net481 and net8.0:

// * Summary *

BenchmarkDotNet v0.13.13-develop (2024-07-06), Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 7 5700U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
  [Host]     : .NET Framework 4.8.1 (4.8.9241.0), X64 RyuJIT VectorSize=256
  DefaultJob : .NET Framework 4.8.1 (4.8.9241.0), X64 RyuJIT VectorSize=256


| Method      | Mean      | Error     | StdDev    | Median    | Code Size |
|------------ |----------:|----------:|----------:|----------:|----------:|
| Increment01 | 0.1276 ns | 0.0459 ns | 0.0805 ns | 0.1055 ns |       4 B |
| Increment02 | 0.1577 ns | 0.0472 ns | 0.0975 ns | 0.1116 ns |      14 B |
| Increment03 | 0.4134 ns | 0.0531 ns | 0.1132 ns | 0.3539 ns |      19 B |
| Increment04 | 0.5874 ns | 0.0575 ns | 0.1299 ns | 0.5092 ns |      24 B |
| Increment05 | 0.9165 ns | 0.0591 ns | 0.0633 ns | 0.8915 ns |      29 B |
| Increment06 | 1.1740 ns | 0.0476 ns | 0.0422 ns | 1.1952 ns |      34 B |
| Increment07 | 1.1266 ns | 0.0347 ns | 0.0271 ns | 1.1292 ns |      39 B |
| Increment08 | 1.7088 ns | 0.0800 ns | 0.1521 ns | 1.6548 ns |      44 B |
| Increment09 | 2.0595 ns | 0.0795 ns | 0.1694 ns | 1.9913 ns |      49 B |
| Increment10 | 2.2026 ns | 0.0754 ns | 0.0589 ns | 2.1720 ns |      54 B |
| Increment20 | 4.9901 ns | 0.0930 ns | 0.0777 ns | 4.9950 ns |     104 B |

// * Summary *

BenchmarkDotNet v0.13.13-develop (2024-07-06), Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 7 5700U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2


| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0000 ns | 0.0000 ns | 0.0000 ns |       4 B |
| Increment02 | 0.2017 ns | 0.0021 ns | 0.0019 ns |      14 B |
| Increment03 | 0.3340 ns | 0.0063 ns | 0.0056 ns |      19 B |
| Increment04 | 0.6301 ns | 0.0087 ns | 0.0081 ns |      24 B |
| Increment05 | 0.8555 ns | 0.0050 ns | 0.0039 ns |      29 B |
| Increment06 | 1.1705 ns | 0.0040 ns | 0.0033 ns |      34 B |
| Increment07 | 1.6550 ns | 0.0710 ns | 0.1334 ns |      39 B |
| Increment08 | 1.5983 ns | 0.0315 ns | 0.0280 ns |      44 B |
| Increment09 | 1.7839 ns | 0.0256 ns | 0.0214 ns |      49 B |
| Increment10 | 2.1949 ns | 0.0621 ns | 0.0581 ns |      54 B |
| Increment20 | 4.7125 ns | 0.1352 ns | 0.2297 ns |     104 B |

Besides the zero measurement for Increment01 on net8.0, the results are pretty much what I would expect. My guess is it's a JIT codegen issue either with Zen 4 (I ran the tests with Zen 3), or with Ubuntu (I ran on Windows 11). Without the same setup you have, I can't dig deeper.

I ran this to get the assembly of both runtimes:

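using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class Program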
{
    public static void Main() => BenchmarkRunner.Run<OverheadTests>(
        DefaultConfig.Instance
            .AddDiagnoser(
                new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(filters:
                    [
                        "*Runnable_0.__Overhead*",
                        "*Runnable_0.__OverheadWrapper*",
                        "*Runnable_0.__WorkloadWrapper*",
                        "*Runnable_0.OverheadActionUnroll*",
                        "*Runnable_0.WorkloadActionUnroll*",
                        "*OverheadTests.Increment*",
                    ]))
            )
            .AddJob(Job.Default.WithRuntime(ClrRuntime.Net481))
            .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80))
        );
}

public class OverheadTests
{
    private int _field;

    [Benchmark]
    public void Increment01()
    {
        _field++;
    }
}
asm

.NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2

; BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
       ret
; Total bytes of code 1
; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       push      rbp
       sub       rsp,20
       lea       rbp,[rsp+20]
       mov       [rbp+10],rcx
       mov       rcx,[rbp+10]
       call      qword ptr [7FFA930A5440]; BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
       nop
       add       rsp,20
       pop       rbp
       ret
; Total bytes of code 31
; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       push      rbp
       sub       rsp,20
       lea       rbp,[rsp+20]
       mov       [rbp+10],rcx
       mov       rcx,[rbp+10]
       call      qword ptr [7FFA930A5380]; OverheadTests.Increment01()
       nop
       add       rsp,20
       pop       rbp
       ret
; Total bytes of code 31
; BenchmarkDotNet.Autogenerated.Runnable_0.OverheadActionUnroll(Int64)
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       mov       rbx,rcx
       mov       rsi,rdx
       xor       edi,edi
       test      rsi,rsi
       jle       near ptr M03_L01
M03_L00:
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       inc       rdi
       cmp       rdi,rsi
       jl        near ptr M03_L00
M03_L01:
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 188
; BenchmarkDotNet.Autogenerated.Runnable_0.WorkloadActionUnroll(Int64)
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       mov       rbx,rcx
       mov       rsi,rdx
       xor       edi,edi
       test      rsi,rsi
       jle       near ptr M04_L01
M04_L00:
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rbx
       call      qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       inc       rdi
       cmp       rdi,rsi
       jl        near ptr M04_L00
M04_L01:
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 188
; OverheadTests.Increment01()
       inc       dword ptr [rcx+8]
       ret
; Total bytes of code 4

.NET Framework 4.8.1 (4.8.9241.0), X64 RyuJIT VectorSize=256

; BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
       ret
; Total bytes of code 1
; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       push      rbp
       sub       rsp,20
       lea       rbp,[rsp+20]
       mov       [rbp+10],rcx
       mov       rcx,[rbp+10]
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
       nop
       lea       rsp,[rbp]
       pop       rbp
       ret
; Total bytes of code 30
; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       push      rbp
       sub       rsp,20
       lea       rbp,[rsp+20]
       mov       [rbp+10],rcx
       mov       rcx,[rbp+10]
       call      OverheadTests.Increment01()
       nop
       lea       rsp,[rbp]
       pop       rbp
       ret
; Total bytes of code 30
; BenchmarkDotNet.Autogenerated.Runnable_0.OverheadActionUnroll(Int64)
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       mov       rsi,rcx
       mov       rdi,rdx
       xor       ebx,ebx
       test      rdi,rdi
       jle       near ptr M03_L01
M03_L00:
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
       inc       rbx
       cmp       rbx,rdi
       jl        near ptr M03_L00
M03_L01:
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 172
; BenchmarkDotNet.Autogenerated.Runnable_0.WorkloadActionUnroll(Int64)
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       mov       rsi,rcx
       mov       rdi,rdx
       xor       ebx,ebx
       test      rdi,rdi
       jle       near ptr M04_L01
M04_L00:
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       mov       rcx,rsi
       call      BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
       inc       rbx
       cmp       rbx,rdi
       jl        near ptr M04_L00
M04_L01:
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 172
; OverheadTests.Increment01()
       inc       dword ptr [rcx+8]
       ret
; Total bytes of code 4

If you could run the same (omitting the net481 runtime of course), maybe we could get a clearer picture of what's going on with your machine. @AndreyAkinshin

@timcassell timcassell modified the milestones: v0.14.0, v0.15.x Aug 6, 2024
AndreyAkinshin added a commit that referenced this pull request Aug 28, 2024
@timcassell
Collaborator Author

So, I recently got a new 9800X3D CPU, and I was able to repro your results. They seem quite dependent on the CPU architecture. I tried changing the wrapper method to use AggressiveInlining instead of NoInlining. The results were great, but the disassembly broke because the JIT ended up inlining the entire benchmark method, ignoring the NoOptimization flag. I'm going to open an issue in dotnet/runtime and see what they have to say about it.
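For readers following along, the two wrapper variants discussed here can be sketched roughly as below. This is an illustration only: `RunnableSketch` and its member names are hypothetical stand-ins modeled on the names in the disassembly, not the actual BenchmarkDotNet-generated code, and the unroll factor is reduced for brevity.

```csharp
using System;
using System.Runtime.CompilerServices;

public class OverheadTests
{
    public int Field;

    public void Increment01() => Field++;
}

// Hypothetical stand-in for the autogenerated runnable class.
public class RunnableSketch
{
    public readonly OverheadTests Instance = new OverheadTests();

    // Variant described in the PR: the wrapper is neither inlined nor optimized,
    // so every invocation pays for one extra call.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    private void WorkloadWrapperNoInlining() => Instance.Increment01();

    // Variant from the experiment: the JIT is asked to inline the wrapper,
    // which may also pull the benchmark body into the caller.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void WorkloadWrapperInlined() => Instance.Increment01();

    public void RunNoInlining(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
        {
            WorkloadWrapperNoInlining();
        }
    }

    public void RunInlined(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
        {
            WorkloadWrapperInlined();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var runnable = new RunnableSketch();
        runnable.RunNoInlining(10);
        runnable.RunInlined(10);
        Console.WriteLine(runnable.Instance.Field);
    }
}
```

Both loops do the same observable work; the difference only shows up in the generated machine code, which is why the disassembly is the thing to compare.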

@timcassell
Collaborator Author

@AndyAyersMS I experimented with changing the AggressiveOptimization to NoOptimization on the WorkloadAction*Unroll methods, and calling the benchmark method directly (without using a wrapper method). This change fixed the measurement problems on Zen 5, but I'm not sure what other effects it will have. Reading through #1934, it seems you originally applied AggressiveOptimization to fix a heap issue with invoking delegates, but if we remove the delegates and call the method directly instead, will there still be an issue?
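A rough sketch of the direct-call shape described above, assuming the unroll method carries NoOptimization and invokes the benchmark without a delegate or wrapper. `DirectCallSketch` is a hypothetical name, the method only mirrors the `WorkloadActionUnroll` name seen in the disassembly, and the real generated code may differ.

```csharp
using System;
using System.Runtime.CompilerServices;

public class OverheadTests
{
    public int Field;

    public void Increment01() => Field++;
}

// Hypothetical stand-in for the autogenerated runnable class.
public class DirectCallSketch
{
    public readonly OverheadTests Instance = new OverheadTests();

    // Assumption: NoOptimization takes the place of AggressiveOptimization here,
    // and the benchmark method is called directly from the unrolled loop.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    public void WorkloadActionUnroll(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
        {
            // Unroll factor of 4 for brevity; the disassembly in this thread shows 16 calls per iteration.
            Instance.Increment01();
            Instance.Increment01();
            Instance.Increment01();
            Instance.Increment01();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var runnable = new DirectCallSketch();
        runnable.WorkloadActionUnroll(5);
        Console.WriteLine(runnable.Instance.Field);
    }
}
```

Whether the JIT keeps each call as a real call in this shape is exactly what the disassembly diagnoser would need to confirm on a given runtime and CPU.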

@AndyAyersMS
Member

If I remember correctly, the main issue was that if those methods are not optimized, two things may cause disruptive changes in some benchmark measurements (typically benchmarks that do a fair amount of allocation):

  1. The extra allocations done by the unoptimized code
  2. Unoptimized code does less aggressive GC tracking. So in addition to allocating more, more allocations end up live while the benchmark is running (and this cannot be fixed by explicit GC collect calls or setting things to null)

So this shifts the values BDN reports for some benchmarks.

For most point-in-time measurements this shift may not be a big deal, but we like to see continuity in benchmark results as we update BDN versions, since we are tracking performance of the same benchmark over long stretches of time. That being said, we can cope with shifts if necessary (e.g., we saw one when we enabled CET by default during the .NET 9 cycle).

…thods.

Count down loops instead of count up.
@timcassell
Collaborator Author

timcassell commented Jan 12, 2025

>   1. The extra allocations done by the unoptimized code

I believe we do not allocate ourselves, so this should be a non-issue. Also, I'm pretty sure NoOptimization means the JIT will not instrument the method or do anything special, so unlike normal methods in tier 0, we shouldn't get any extra allocations inserted, if I understand it correctly.

>   2. Unoptimized code does less aggressive GC tracking. So in addition to allocating more, more allocations end up live while the benchmark is running (and this cannot be fixed by explicit GC collect calls or setting things to null)

In #2336 we no longer store any returned values at all. Would that alleviate GC concerns?

@timcassell
Collaborator Author

@AndyAyersMS I ran the Append_Strings benchmark you mentioned in dotnet/performance#2214 (comment).

Master:

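| Method         | repeat | Mean         | Error      | StdDev     | Allocated |
|--------------- |-------:|-------------:|-----------:|-----------:|----------:|
| Append_Strings |      1 |     57.58 ns |   0.317 ns |   0.281 ns |   1.22 KB |
| Append_Strings |   1000 | 12,344.97 ns | 168.074 ns | 157.217 ns | 546.16 KB |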

This PR:

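| Method         | repeat | Mean         | Error      | StdDev     | Allocated |
|--------------- |-------:|-------------:|-----------:|-----------:|----------:|
| Append_Strings |      1 |     60.22 ns |   0.223 ns |   0.186 ns |   1.22 KB |
| Append_Strings |   1000 | 16,915.48 ns | 109.032 ns | 101.988 ns | 546.16 KB |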

This PR + #2336 changes:

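| Method         | repeat | Mean         | Error     | StdDev    | Allocated |
|--------------- |-------:|-------------:|----------:|----------:|----------:|
| Append_Strings |      1 |     57.87 ns |  0.311 ns |  0.275 ns |   1.22 KB |
| Append_Strings |   1000 | 12,625.00 ns | 48.200 ns | 42.728 ns | 546.16 KB |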

So it looks like not consuming the return value does fix the GC issue. With these results, I will close this PR and include all changes in #2336, as they clearly need to go together.

@timcassell timcassell closed this Jan 12, 2025
@timcassell timcassell removed this from the v0.15.x milestone Jan 12, 2025
Development

Successfully merging this pull request may close these issues.

Inaccurate results reported for small methods BenchmarkDotNet (arguably) slightly overcorrects for overhead