-
-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Call benchmark method directly #2334
Conversation
43367bf
to
b262565
Compare
With this PR, I am seeing what looks like more accurate results in the default toolchain, but the Looking at the logs, it appears that overhead is measured at more time. @ig-sinicyn Any ideas? Master:
PR:
|
This comment was marked as outdated.
This comment was marked as outdated.
I reverted the ClrMd disassembler back to v1 on my local so I could actually inspect the asm. The only difference I see that might affect the result is this. Default toolchain call qword ptr [BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()] InProcessEmit call BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead() The IL is exactly the same for those calls, so it seems the JIT treats IL emit slightly different. I don't have any asm knowledge to know what effect that difference makes, but it seems that |
The assembly issue with InProcessEmit is only in net7+. The overhead measurement is off by about 2-3 clock cycles, which isn't far off from the current measurement in all toolchains. I don't think it should block this from being merged. |
@AndreyAkinshin I would also like to get this in v0.14.0 if you don't mind (followed by #2336). These 2 PRs will likely change the results of long-term measurements for higher accuracy (like dotnet/performance). |
tests/BenchmarkDotNet.IntegrationTests.ManualRunning/ExpectedBenchmarkResultsTests.cs
Outdated
Show resolved
Hide resolved
1982b8a
to
6ba4993
Compare
@timcassell could you please rebase on master one more time? I introduced a bug in Perfolizer 0.3.16 that was fixed in 0.3.17. I just pushed Perfolizer 0.3.17 to BenchmarkDotNet master. |
It seems I found a problem. Let's consider the following environment: BenchmarkDotNet v0.13.13-develop (2024-03-11), Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI The original benchmark was extended to the following form: Source codepublic class Program
{
public static void Main() => BenchmarkRunner.Run<OverheadTests>();
}
[DisassemblyDiagnoser]
public class OverheadTests
{
private int _field;
[Benchmark]
public void Increment01()
{
_field++;
}
[Benchmark]
public void Increment02()
{
_field++;
_field++;
}
[Benchmark]
public void Increment03()
{
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment04()
{
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment05()
{
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment06()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment07()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment08()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment09()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment10()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment20()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
} Here is the generated assembly: Assembly.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment01()
inc dword ptr [rdi+8]
ret
; Total bytes of code 4 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment02()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 14 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment03()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 19 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment04()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 24 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment05()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 29 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment06()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 34 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment07()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 39 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment08()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 49 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment09()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 54 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment10()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 59 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment20()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 109 An interesting observation: since push rbp
mov rbp,rsp
...
pop rbp Here are my results with the latest | Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0282 ns | 0.0000 ns | 0.0000 ns | 4 B |
| Increment02 | 0.0329 ns | 0.0000 ns | 0.0000 ns | 14 B |
| Increment03 | 0.0339 ns | 0.0005 ns | 0.0004 ns | 19 B |
| Increment04 | 0.0000 ns | 0.0000 ns | 0.0000 ns | 24 B |
| Increment05 | 0.1551 ns | 0.0001 ns | 0.0001 ns | 29 B |
| Increment06 | 0.1588 ns | 0.0007 ns | 0.0006 ns | 34 B |
| Increment07 | 0.3427 ns | 0.0022 ns | 0.0021 ns | 39 B |
| Increment08 | 0.5363 ns | 0.0002 ns | 0.0002 ns | 49 B |
| Increment09 | 0.7391 ns | 0.0005 ns | 0.0005 ns | 54 B |
| Increment10 | 0.9274 ns | 0.0015 ns | 0.0013 ns | 59 B |
| Increment20 | 2.7696 ns | 0.0004 ns | 0.0004 ns | 109 B | The results are quite consistent, stable, and reproducible. For Now let's run the same set of benchmarks using BenchmarkDotNet from this PR: | Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0113 ns | 0.0061 ns | 0.0057 ns | 4 B |
| Increment02 | 0.0073 ns | 0.0001 ns | 0.0000 ns | 14 B |
| Increment03 | 0.0030 ns | 0.0000 ns | 0.0000 ns | 19 B |
| Increment04 | 0.0145 ns | 0.0001 ns | 0.0001 ns | 24 B |
| Increment05 | 0.0175 ns | 0.0001 ns | 0.0001 ns | 29 B |
| Increment06 | 0.0251 ns | 0.0017 ns | 0.0016 ns | 34 B |
| Increment07 | 0.5500 ns | 0.0014 ns | 0.0013 ns | 39 B |
| Increment08 | 0.6812 ns | 0.0442 ns | 0.1007 ns | 49 B |
| Increment09 | 0.3456 ns | 0.0001 ns | 0.0001 ns | 54 B |
| Increment10 | 0.3746 ns | 0.0035 ns | 0.0033 ns | 59 B |
| Increment20 | 2.2034 ns | 0.0020 ns | 0.0018 ns | 109 B | Observations:
While the "correct" results are a controversial thing in this case, the non-monotonic Mean column definitely feels wrong, and it's a clear regression compared to the master. These results are also reproducible on my machine: I'm ready to collect any additional diagnostic info if needed. |
@AndreyAkinshin Unfortunately I don't have a Ryzen cpu to run benchmarks on, but I ran those benchmarks again on both of my machines and got results that look mostly good (the only outlier is the drop from inc3 to inc4 on Intel on both branches). It must be a cpu architectural reason for those results. Master
PR
No assembly for the Intel chip (no support for MacOS), but assembly for the old AMD chip: Assembly
.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment01()
inc dword ptr [rcx+8]
ret
; Total bytes of code 4 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment02()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 14 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment03()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 19 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment04()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 24 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment05()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 29 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment06()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 34 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment07()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 39 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment08()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 44 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment09()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 49 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment10()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 54 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment20()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 104 I see no logical reason for there to be a flaw with wrapping the call in a |
@AndreyAkinshin I know it'll spoil the "instant" results, but I wonder what results you will get if you make the field |
@AndreyAkinshin Also, can you disassemble |
| Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0272 ns | 0.0005 ns | 0.0005 ns | 4 B |
| Increment02 | 0.0138 ns | 0.0151 ns | 0.0141 ns | 7 B |
| Increment03 | 0.0326 ns | 0.0000 ns | 0.0000 ns | 10 B |
| Increment04 | 0.0632 ns | 0.0018 ns | 0.0015 ns | 13 B |
| Increment05 | 0.1947 ns | 0.0001 ns | 0.0001 ns | 16 B |
| Increment06 | 0.4581 ns | 0.0017 ns | 0.0014 ns | 24 B |
| Increment07 | 0.5812 ns | 0.0037 ns | 0.0033 ns | 27 B |
| Increment08 | 0.8946 ns | 0.0183 ns | 0.0171 ns | 30 B |
| Increment09 | 0.9137 ns | 0.0008 ns | 0.0007 ns | 33 B |
| Increment10 | 1.0144 ns | 0.0023 ns | 0.0022 ns | 36 B |
| Increment20 | 3.1494 ns | 0.0025 ns | 0.0023 ns | 66 B | PR: | Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0103 ns | 0.0018 ns | 0.0016 ns | 4 B |
| Increment02 | 0.0068 ns | 0.0022 ns | 0.0021 ns | 7 B |
| Increment03 | 0.0148 ns | 0.0001 ns | 0.0001 ns | 10 B |
| Increment04 | 0.0147 ns | 0.0000 ns | 0.0000 ns | 13 B |
| Increment05 | 0.0186 ns | 0.0015 ns | 0.0014 ns | 16 B |
| Increment06 | 0.0418 ns | 0.0003 ns | 0.0003 ns | 24 B |
| Increment07 | 0.2448 ns | 0.0010 ns | 0.0009 ns | 27 B |
| Increment08 | 0.5236 ns | 0.0049 ns | 0.0046 ns | 30 B |
| Increment09 | 0.5196 ns | 0.0183 ns | 0.0171 ns | 33 B |
| Increment10 | 0.6923 ns | 0.0062 ns | 0.0058 ns | 36 B |
| Increment20 | 2.7928 ns | 0.0044 ns | 0.0041 ns | 66 B | |
Could you please remind me what is the easiest way to do this on Linux nowadays? |
You can use config.AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(filters: ["*Workload*", "*Overhead*"]))) |
Well those results look more stable (almost a consistent increase after inc6). It looks like almost a constant time of 0.4ns was subtracted from your master results. If your cpu is at 5ghz, that's 2 clock cycles. That's almost exactly the same as what I see with the InProcessEmitToolchain on my older machine. Will be curious to see if the assembly shows it. It's also interesting that adding |
Other things to check out: Results with |
Status update: measurements are in progress (I want to collect a comprehensive set of summary tables and share them at once) |
I borrowed another computer to run these tests with a Ryzen cpu. These are the results I got with net481 and net8.0: // * Summary *
BenchmarkDotNet v0.13.13-develop (2024-07-06), Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 7 5700U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
[Host] : .NET Framework 4.8.1 (4.8.9241.0), X64 RyuJIT VectorSize=256
DefaultJob : .NET Framework 4.8.1 (4.8.9241.0), X64 RyuJIT VectorSize=256
| Method | Mean | Error | StdDev | Median | Code Size |
|------------ |----------:|----------:|----------:|----------:|----------:|
| Increment01 | 0.1276 ns | 0.0459 ns | 0.0805 ns | 0.1055 ns | 4 B |
| Increment02 | 0.1577 ns | 0.0472 ns | 0.0975 ns | 0.1116 ns | 14 B |
| Increment03 | 0.4134 ns | 0.0531 ns | 0.1132 ns | 0.3539 ns | 19 B |
| Increment04 | 0.5874 ns | 0.0575 ns | 0.1299 ns | 0.5092 ns | 24 B |
| Increment05 | 0.9165 ns | 0.0591 ns | 0.0633 ns | 0.8915 ns | 29 B |
| Increment06 | 1.1740 ns | 0.0476 ns | 0.0422 ns | 1.1952 ns | 34 B |
| Increment07 | 1.1266 ns | 0.0347 ns | 0.0271 ns | 1.1292 ns | 39 B |
| Increment08 | 1.7088 ns | 0.0800 ns | 0.1521 ns | 1.6548 ns | 44 B |
| Increment09 | 2.0595 ns | 0.0795 ns | 0.1694 ns | 1.9913 ns | 49 B |
| Increment10 | 2.2026 ns | 0.0754 ns | 0.0589 ns | 2.1720 ns | 54 B |
| Increment20 | 4.9901 ns | 0.0930 ns | 0.0777 ns | 4.9950 ns | 104 B | // * Summary *
BenchmarkDotNet v0.13.13-develop (2024-07-06), Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 7 5700U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.302
[Host] : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0000 ns | 0.0000 ns | 0.0000 ns | 4 B |
| Increment02 | 0.2017 ns | 0.0021 ns | 0.0019 ns | 14 B |
| Increment03 | 0.3340 ns | 0.0063 ns | 0.0056 ns | 19 B |
| Increment04 | 0.6301 ns | 0.0087 ns | 0.0081 ns | 24 B |
| Increment05 | 0.8555 ns | 0.0050 ns | 0.0039 ns | 29 B |
| Increment06 | 1.1705 ns | 0.0040 ns | 0.0033 ns | 34 B |
| Increment07 | 1.6550 ns | 0.0710 ns | 0.1334 ns | 39 B |
| Increment08 | 1.5983 ns | 0.0315 ns | 0.0280 ns | 44 B |
| Increment09 | 1.7839 ns | 0.0256 ns | 0.0214 ns | 49 B |
| Increment10 | 2.1949 ns | 0.0621 ns | 0.0581 ns | 54 B |
| Increment20 | 4.7125 ns | 0.1352 ns | 0.2297 ns | 104 B | Besides the zero measurement for Increment01 on net8.0, the results are pretty much what I would expect. My guess is it's a jit codegen issue either with Zen 4 (I ran the tests with Zen 3), or with ubuntu (I ran on Windows 11). Without the same setup you have, I can't dig deeper. I ran this to get the assembly of both runtimes: public class Program
{
public static void Main() => BenchmarkRunner.Run<OverheadTests>(
DefaultConfig.Instance
.AddDiagnoser(
new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(filters:
[
"*Runnable_0.__Overhead*",
"*Runnable_0.__OverheadWrapper*",
"*Runnable_0.__WorkloadWrapper*",
"*Runnable_0.OverheadActionUnroll*",
"*Runnable_0.WorkloadActionUnroll*",
"*OverheadTests.Increment*",
]))
)
.AddJob(Job.Default.WithRuntime(ClrRuntime.Net481))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core80))
);
}
public class OverheadTests
{
private int _field;
[Benchmark]
public void Increment01()
{
_field++;
}
} asm
.NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2; BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
ret
; Total bytes of code 1 ; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
push rbp
sub rsp,20
lea rbp,[rsp+20]
mov [rbp+10],rcx
mov rcx,[rbp+10]
call qword ptr [7FFA930A5440]; BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
nop
add rsp,20
pop rbp
ret
; Total bytes of code 31 ; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
push rbp
sub rsp,20
lea rbp,[rsp+20]
mov [rbp+10],rcx
mov rcx,[rbp+10]
call qword ptr [7FFA930A5380]; OverheadTests.Increment01()
nop
add rsp,20
pop rbp
ret
; Total bytes of code 31 ; BenchmarkDotNet.Autogenerated.Runnable_0.OverheadActionUnroll(Int64)
push rdi
push rsi
push rbx
sub rsp,20
mov rbx,rcx
mov rsi,rdx
xor edi,edi
test rsi,rsi
jle near ptr M03_L01
M03_L00:
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5458]; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
inc rdi
cmp rdi,rsi
jl near ptr M03_L00
M03_L01:
add rsp,20
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 188 ; BenchmarkDotNet.Autogenerated.Runnable_0.WorkloadActionUnroll(Int64)
push rdi
push rsi
push rbx
sub rsp,20
mov rbx,rcx
mov rsi,rdx
xor edi,edi
test rsi,rsi
jle near ptr M04_L01
M04_L00:
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rbx
call qword ptr [7FFA930A5470]; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
inc rdi
cmp rdi,rsi
jl near ptr M04_L00
M04_L01:
add rsp,20
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 188 ; OverheadTests.Increment01()
inc dword ptr [rcx+8]
ret
; Total bytes of code 4 .NET Framework 4.8.1 (4.8.9241.0), X64 RyuJIT VectorSize=256; BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
ret
; Total bytes of code 1 ; BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
push rbp
sub rsp,20
lea rbp,[rsp+20]
mov [rbp+10],rcx
mov rcx,[rbp+10]
call BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()
nop
lea rsp,[rbp]
pop rbp
ret
; Total bytes of code 30 ; BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
push rbp
sub rsp,20
lea rbp,[rsp+20]
mov [rbp+10],rcx
mov rcx,[rbp+10]
call OverheadTests.Increment01()
nop
lea rsp,[rbp]
pop rbp
ret
; Total bytes of code 30 ; BenchmarkDotNet.Autogenerated.Runnable_0.OverheadActionUnroll(Int64)
push rdi
push rsi
push rbx
sub rsp,20
mov rsi,rcx
mov rdi,rdx
xor ebx,ebx
test rdi,rdi
jle near ptr M03_L01
M03_L00:
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__OverheadWrapper()
inc rbx
cmp rbx,rdi
jl near ptr M03_L00
M03_L01:
add rsp,20
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 172 ; BenchmarkDotNet.Autogenerated.Runnable_0.WorkloadActionUnroll(Int64)
push rdi
push rsi
push rbx
sub rsp,20
mov rsi,rcx
mov rdi,rdx
xor ebx,ebx
test rdi,rdi
jle near ptr M04_L01
M04_L00:
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
mov rcx,rsi
call BenchmarkDotNet.Autogenerated.Runnable_0.__WorkloadWrapper()
inc rbx
cmp rbx,rdi
jl near ptr M04_L00
M04_L01:
add rsp,20
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 172 ; OverheadTests.Increment01()
inc dword ptr [rcx+8]
ret
; Total bytes of code 4 If you could run the same (omitting the net481 runtime of course), maybe we could get a clearer picture of what's going on with your machine. @AndreyAkinshin |
IntroSmokeIncrements was taken from #2334 (comment)
f8f1050
to
8b7d8d9
Compare
8b7d8d9
to
f020b10
Compare
f020b10
to
e09268c
Compare
So, I recently got a new 9800x3d cpu, and I was able to repro your results. It seems quite dependent on the cpu architecture. I tried changing the wrapper method to use |
a1ce4c5
to
e09268c
Compare
@AndyAyersMS I experimented with changing the |
If I remember, the main issue was that if those methods are not optimized two things may cause disruptive changes in some benchmark measurements (typically benchmarks that do a fair amount of allocation):
So this shifts the values BDN reports for some benchmarks. For most point-in-time measurements this shift may not be a big deal, but we like to see continuity in benchmark results as we update BDN versions, since we are tracking performance of the same benchmark over long stretches of time. That being said, we can cope with shifts if necessary (eg we saw one when we enabled CET by default during the .NET 9 cycle). |
…thods. Count down loops instead of count up.
I believe we do not allocate ourselves, so this should be a non-issue. Also,
In #2336 we no longer store any returned values at all. Would that alleviate GC concerns? |
@AndyAyersMS I ran the Append_Strings benchmark you mentioned in dotnet/performance#2214 (comment). Master:
This PR:
This PR + #2336 changes:
So it looks like not consuming the return value does fix the GC issue. With these results, I will close this PR and include all changes in #2336, as they clearly need to go together. |
Fixes #1133
This wraps the workload call with a
NoInlining | NoOptimization
method instead of a delegate.Mac Intel x64 results
Master
This PR
Windows AMD x64 results
Master
This PR