Description
Context
Bevy folks measured that wgpu's trace-level logging when enabled at build time but disabled at run time can have a significant overhead (15%). This overhead comes from checking a global atomic variable that defines the log level.
In #6017, @Elabajaba proposes gating api logs and resource logs (trace level by default) behind a build flag.
I'm not fond of adding another feature flag for this, so I would like to explore other options first.
Disable trace level logs at build time
The most straightforward option is to note that trace level logging is the most verbose out of 5 logging levels. If the most verbose logging level is too verbose for a bevy application, then bevy could disable trace-level logging in optimized builds. This removes the performance concerns for bevy while letting its dependencies and bevy itself have a log level that has potentially verbose logs that are useful for development or for debugging particularly tricky issues. Firefox, for example, disable both trace and debug log levels in optimized builds by default for the same reason.
Debug builds being orders of magnitude slower than release builds, I believe it is safe to assume trace-level logging overhead to be negligible (unless someone has benchmarks to show otherwise).
I'm tempted to leave it at that. That said, 15% overhead is a lot and there should be ways to mitigate some of it without requiring more build configurations on the wgpu side.
Reduce the overhead of some of the logs
For context, wgpu's repository has a few benchmarks. Removing all trace-level logs yields improvements ranging from 0 to 5% on these benchmarks. This doesn't mean the that reducing the logging overhead will improve bevy by at most 5%, it only means that the highest improvement we can observe through these benchmarks is in that range. We'll have to scale these numbers a bit to transpose to other projects and getting some performance measurements from bevy itself would be valuable.
Resource logs
Resource logs only happen when creating and destroying API objects. For a typical application or game that's fairly rare (in the order of a few of times per frame). Not worth optimizing. Removing them entirely does not affect wgpu's benchmarks.
API logs
API logs can be very verbose. Most of the API log traffic in a typical frame would be in render and compute passes with one log per command (and games tend to submit a lot of commands per render pass). In #6010 I proposed to check the log level once at the beginning of the render pass instead of once per command. The idea was that checking a local variable instead of a global atomic at a high-ish frequency should be faster. I ran the wgpu's render pass benchmarks on two computes and the numbers (1) are puzzling. It is a mixed bag of improvements and regressions in the [-1.4% .. +1.7%] range.
Removing the api logs entirely was just as puzzling. The results (3) show either no changes or regressions.
Why? I don't know, I'm curious. It's easy to speculate but digging into it would take time. I ran the measurements multiple times on two computers and the results are consistent.
Other trace-level logs
There's a fair amount of trace-level logging outside of api and resource logs. Some of them might not be useful anymore and could be removed.
I'll open a PR to remove some of it and maintainers can chime in to "save" any log they care about. We already discussed keeping api, resource and initialization related logs.
Conclusions
Hard to draw conclusions at this point other that optimized builds are faster without trace-level logs, but that focusing on resource and api logs yields inconclusive results.
Measurements
Focusing on the render pass benchmarks here since other benchmarks were mostly not affected by the changes.
(1) Disable all trace-level logs
diff --git a/Cargo.lock b/Cargo.lock
index 8b4e604f4..19dfdc784 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4234,6 +4234,7 @@ dependencies = [
"bincode",
"bytemuck",
"criterion",
+ "log",
"naga",
"nanorand",
"once_cell",
diff --git a/benches/Cargo.toml b/benches/Cargo.toml
index 82207d510..a0391d838 100644
--- a/benches/Cargo.toml
+++ b/benches/Cargo.toml
@@ -44,3 +44,4 @@ profiling.workspace = true
rayon.workspace = true
tracy-client = { workspace = true, optional = true }
wgpu = { workspace = true, features = ["wgsl", "metal", "dx12"] }
+log = { version = "0.4", features = ["release_max_level_debug"] }
[nical](~/dev/rust/wgpu)$ cargo bench -p wgpu-benchmark -- renderpass
Compiling log v0.4.22
Compiling wgpu-core v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-core)
Compiling naga v22.0.0 (/home/nical/dev/rust/wgpu/naga)
Compiling wgpu-hal v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-hal)
Compiling wgpu v22.0.0 (/home/nical/dev/rust/wgpu/wgpu)
Compiling wgpu-benchmark v22.0.0 (/home/nical/dev/rust/wgpu/benches)
Finished bench [optimized + debuginfo] target(s) in 29.22s
Running benches/root.rs (target/release/deps/root-093d36038254a381)
Gnuplot not found, using plotters backend
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon Pro WX 3200 Series (RADV POLARIS12)", vendor: 4098, device: 27009, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2", backend: Vulkan }
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Collecting 100 samples in estRenderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
time: [24.859 ms 24.907 ms 24.958 ms]
thrpt: [400.68 Kelem/s 401.49 Kelem/s 402.26 Kelem/s]
change:
time: [-1.8921% -1.5875% -1.3066%] (p = 0.00 < 0.05)
thrpt: [+1.3239% +1.6132% +1.9286%]
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
11 (11.00%) high mild
4 (4.00%) high severe
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
time: [27.182 ms 27.212 ms 27.244 ms]
thrpt: [367.05 Kelem/s 367.48 Kelem/s 367.89 Kelem/s]
change:
time: [-5.2213% -4.7559% -4.2893%] (p = 0.00 < 0.05)
thrpt: [+4.4815% +4.9934% +5.5089%]
Performance has improved.
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
time: [27.899 ms 27.992 ms 28.091 ms]
thrpt: [355.99 Kelem/s 357.25 Kelem/s 358.44 Kelem/s]
change:
time: [-3.4614% -3.0480% -2.6443%] (p = 0.00 < 0.05)
thrpt: [+2.7161% +3.1439% +3.5855%]
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
time: [28.401 ms 28.489 ms 28.581 ms]
thrpt: [349.88 Kelem/s 351.01 Kelem/s 352.10 Kelem/s]
change:
time: [-0.9584% -0.5177% -0.0746%] (p = 0.02 < 0.05)
thrpt: [+0.0746% +0.5204% +0.9677%]
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time): Collecting 100 samples in estimatRenderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
time: [2.1742 ms 2.1788 ms 2.1838 ms]
thrpt: [4.5792 Melem/s 4.5896 Melem/s 4.5995 Melem/s]
change:
time: [-2.8352% -2.5681% -2.3122%] (p = 0.00 < 0.05)
thrpt: [+2.3669% +2.6358% +2.9179%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
time: [3.6249 ms 3.6396 ms 3.6553 ms]
thrpt: [2.7358 Melem/s 2.7475 Melem/s 2.7587 Melem/s]
change:
time: [-1.4089% -0.9266% -0.4124%] (p = 0.00 < 0.05)
thrpt: [+0.4141% +0.9353% +1.4290%]
Change within noise threshold.
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
time: [4.6839 ms 4.6959 ms 4.7082 ms]
thrpt: [2.1239 Melem/s 2.1295 Melem/s 2.1350 Melem/s]
change:
time: [-1.7518% -1.3117% -0.8691%] (p = 0.00 < 0.05)
thrpt: [+0.8767% +1.3291% +1.7831%]
Change within noise threshold.
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
time: [6.2658 ms 6.2730 ms 6.2807 ms]
thrpt: [1.5922 Melem/s 1.5941 Melem/s 1.5960 Melem/s]
change:
time: [-1.7147% -1.4655% -1.2216%] (p = 0.00 < 0.05)
thrpt: [+1.2367% +1.4873% +1.7446%]
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high mild
(2) Reduce the overhead of api logs in render passes
By checking the global log level once per render pass instead of once per command (#6010).
[nical](~/dev/rust/wgpu)$ cargo bench -p wgpu-benchmark -- renderpass
Compiling naga v22.0.0 (/home/nical/dev/rust/wgpu/naga)
Compiling wgpu-core v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-core)
Compiling wgpu-hal v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-hal)
Compiling wgpu v22.0.0 (/home/nical/dev/rust/wgpu/wgpu)
Compiling wgpu-benchmark v22.0.0 (/home/nical/dev/rust/wgpu/benches)
Finished bench [optimized + debuginfo] target(s) in 29.31s
Running benches/root.rs (target/release/deps/root-c662ca2c1cfd646e)
Gnuplot not found, using plotters backend
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon Pro WX 3200 Series (RADV POLARIS12)", vendor: 4098, device: 27009, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2", backend: Vulkan }
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Collecting 100 samples in estRenderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
time: [25.429 ms 25.497 ms 25.564 ms]
thrpt: [391.17 Kelem/s 392.21 Kelem/s 393.25 Kelem/s]
change:
time: [+1.4530% +1.7563% +2.0686%] (p = 0.00 < 0.05)
thrpt: [-2.0267% -1.7260% -1.4321%]
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
11 (11.00%) high mild
2 (2.00%) high severe
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
time: [27.644 ms 27.655 ms 27.669 ms]
thrpt: [361.42 Kelem/s 361.60 Kelem/s 361.75 Kelem/s]
change:
time: [-1.7414% -1.4481% -1.1609%] (p = 0.00 < 0.05)
thrpt: [+1.1745% +1.4694% +1.7723%]
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
time: [28.426 ms 28.475 ms 28.528 ms]
thrpt: [350.54 Kelem/s 351.18 Kelem/s 351.79 Kelem/s]
change:
time: [-0.3652% -0.0600% +0.2200%] (p = 0.68 > 0.05)
thrpt: [-0.2195% +0.0601% +0.3666%]
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
time: [28.222 ms 28.261 ms 28.318 ms]
thrpt: [353.13 Kelem/s 353.84 Kelem/s 354.34 Kelem/s]
change:
time: [+0.8179% +0.9914% +1.1953%] (p = 0.00 < 0.05)
thrpt: [-1.1811% -0.9817% -0.8112%]
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time): Collecting 100 samples in estimatRenderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
time: [2.1659 ms 2.1688 ms 2.1720 ms]
thrpt: [4.6041 Melem/s 4.6108 Melem/s 4.6171 Melem/s]
change:
time: [+1.5074% +1.6509% +1.8064%] (p = 0.00 < 0.05)
thrpt: [-1.7743% -1.6241% -1.4850%]
Performance has regressed.
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
time: [3.5947 ms 3.6027 ms 3.6115 ms]
thrpt: [2.7689 Melem/s 2.7757 Melem/s 2.7819 Melem/s]
change:
time: [+0.1382% +0.5272% +0.8838%] (p = 0.01 < 0.05)
thrpt: [-0.8760% -0.5245% -0.1381%]
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
time: [4.7490 ms 4.7583 ms 4.7677 ms]
thrpt: [2.0975 Melem/s 2.1016 Melem/s 2.1057 Melem/s]
change:
time: [+1.3025% +1.6018% +1.8908%] (p = 0.00 < 0.05)
thrpt: [-1.8557% -1.5765% -1.2857%]
Performance has regressed.
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
time: [6.3299 ms 6.3500 ms 6.3718 ms]
thrpt: [1.5694 Melem/s 1.5748 Melem/s 1.5798 Melem/s]
change:
time: [+0.0756% +0.4846% +0.8959%] (p = 0.03 < 0.05)
thrpt: [-0.8879% -0.4822% -0.0755%]
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
9 (9.00%) high mild
1 (1.00%) high severe
(3) Remove api logs at build time
diff --git a/wgpu-core/src/lib.rs b/wgpu-core/src/lib.rs
index c46a8f103..e563a4407 100644
--- a/wgpu-core/src/lib.rs
+++ b/wgpu-core/src/lib.rs
@@ -320,7 +320,7 @@ macro_rules! api_log {
}
#[cfg(not(feature = "api_log_info"))]
macro_rules! api_log {
- ($($arg:tt)+) => (log::trace!($($arg)+))
+ ($($arg:tt)+) => ()
}
[nical](~/dev/rust/wgpu)$ cargo bench -p wgpu-benchmark -- renderpass
Compiling wgpu-core v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-core)
Compiling wgpu v22.0.0 (/home/nical/dev/rust/wgpu/wgpu)
Compiling wgpu-benchmark v22.0.0 (/home/nical/dev/rust/wgpu/benches)
Finished bench [optimized + debuginfo] target(s) in 20.61s
Running benches/root.rs (target/release/deps/root-c662ca2c1cfd646e)
Gnuplot not found, using plotters backend
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon Pro WX 3200 Series (RADV POLARIS12)", vendor: 4098, device: 27009, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2", backend: Vulkan }
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Collecting 100 samples in estRenderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
time: [25.254 ms 25.309 ms 25.367 ms]
thrpt: [394.21 Kelem/s 395.12 Kelem/s 395.98 Kelem/s]
change:
time: [-1.0935% -0.7374% -0.4044%] (p = 0.00 < 0.05)
thrpt: [+0.4061% +0.7428% +1.1056%]
Change within noise threshold.
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
time: [28.435 ms 28.571 ms 28.711 ms]
thrpt: [348.30 Kelem/s 350.00 Kelem/s 351.68 Kelem/s]
change:
time: [+2.8122% +3.3149% +3.8114%] (p = 0.00 < 0.05)
thrpt: [-3.6715% -3.2086% -2.7353%]
Performance has regressed.
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
time: [28.811 ms 28.872 ms 28.933 ms]
thrpt: [345.63 Kelem/s 346.36 Kelem/s 347.09 Kelem/s]
change:
time: [+1.1175% +1.3940% +1.6723%] (p = 0.00 < 0.05)
thrpt: [-1.6448% -1.3748% -1.1051%]
Performance has regressed.
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
time: [28.552 ms 28.637 ms 28.726 ms]
thrpt: [348.12 Kelem/s 349.19 Kelem/s 350.24 Kelem/s]
change:
time: [+0.9958% +1.3312% +1.6786%] (p = 0.00 < 0.05)
thrpt: [-1.6509% -1.3138% -0.9859%]
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time): Collecting 100 samples in estimatRenderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
time: [2.2331 ms 2.2363 ms 2.2396 ms]
thrpt: [4.4651 Melem/s 4.4718 Melem/s 4.4780 Melem/s]
change:
time: [+2.8865% +3.1087% +3.3260%] (p = 0.00 < 0.05)
thrpt: [-3.2189% -3.0150% -2.8055%]
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
time: [3.6632 ms 3.6737 ms 3.6845 ms]
thrpt: [2.7140 Melem/s 2.7221 Melem/s 2.7299 Melem/s]
change:
time: [+1.5783% +1.9687% +2.3443%] (p = 0.00 < 0.05)
thrpt: [-2.2906% -1.9307% -1.5537%]
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
time: [4.7413 ms 4.7583 ms 4.7764 ms]
thrpt: [2.0936 Melem/s 2.1016 Melem/s 2.1091 Melem/s]
change:
time: [-0.4009% +0.0016% +0.4750%] (p = 0.97 > 0.05)
thrpt: [-0.4728% -0.0016% +0.4025%]
No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) high mild
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
time: [6.3527 ms 6.3663 ms 6.3800 ms]
thrpt: [1.5674 Melem/s 1.5708 Melem/s 1.5741 Melem/s]
change:
time: [-0.1435% +0.2569% +0.6557%] (p = 0.21 > 0.05)
thrpt: [-0.6514% -0.2562% +0.1437%]
No change in performance detected.
Metadata
Assignees
Type
Projects
Status
Todo