Implement experimental GPU two-phase occlusion culling for the standard 3D mesh pipeline. #17413

Open
wants to merge 24 commits into
base: main

pcwalton
Contributor

Occlusion culling allows the GPU to skip the vertex and fragment shading overhead for objects that can be quickly proved to be invisible because they're behind other geometry. A depth prepass already eliminates most fragment shading overhead for occluded objects, but the vertex shading overhead, as well as the cost of testing and rejecting fragments against the Z-buffer, is presently unavoidable for standard meshes. We currently perform occlusion culling only for meshlets. But other meshes, such as skinned meshes, can benefit from occlusion culling too in order to avoid the transform and skinning overhead for unseen meshes.

This commit adapts the same two-phase occlusion culling technique that meshlets use to Bevy's standard 3D mesh pipeline when the new OcclusionCulling component, as well as the DepthPrepass component, are present on the camera. It has these steps:

  1. Early depth prepass: We use the hierarchical Z-buffer from the previous frame to cull meshes for the initial depth prepass, effectively rendering only the meshes that were visible in the last frame.

  2. Early depth downsample: We downsample the depth buffer to create another hierarchical Z-buffer, this time with the current view transform.

  3. Late depth prepass: We use the new hierarchical Z-buffer to test all meshes that weren't rendered in the early depth prepass. Any meshes that pass this check are rendered.

  4. Late depth downsample: Again, we downsample the depth buffer to create a hierarchical Z-buffer in preparation for the early depth prepass of the next frame. This step is done after all the rendering, in order to account for custom phase items that might write to the depth buffer.
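The four steps above can be sketched on the CPU. The following is a minimal illustration, not Bevy's implementation: it assumes a hypothetical 1D "screen" of `WIDTH` columns, uses view-space distance (larger means farther) instead of Bevy's actual depth encoding, and stands in a full-resolution depth buffer for the hierarchical Z-buffer. The names `Object`, `passes_depth_test`, `rasterize`, and `render_frame` are made up for this sketch.

```rust
/// An object covering screen columns `x0..x1` at a constant view-space distance.
#[derive(Clone, Copy)]
struct Object {
    x0: usize,
    x1: usize,
    distance: f32,
}

const WIDTH: usize = 8;
const FAR: f32 = f32::INFINITY;

/// Conservative test: the object passes unless every depth sample under its
/// footprint is strictly nearer than the object itself.
fn passes_depth_test(depth: &[f32], obj: &Object) -> bool {
    depth[obj.x0..obj.x1].iter().any(|&d| obj.distance <= d)
}

/// Writes the object into the depth buffer, keeping the nearest distance.
fn rasterize(depth: &mut [f32], obj: &Object) {
    for d in &mut depth[obj.x0..obj.x1] {
        *d = (*d).min(obj.distance);
    }
}

/// Runs one frame. `prev_depth` plays the role of last frame's hi-Z buffer.
/// Returns the indices of rendered objects and the depth for the next frame.
fn render_frame(objects: &[Object], prev_depth: &[f32]) -> (Vec<usize>, Vec<f32>) {
    let mut depth = vec![FAR; WIDTH];
    let mut rendered = Vec::new();
    let mut deferred = Vec::new();

    // 1. Early depth prepass: cull against the previous frame's depth.
    for (i, obj) in objects.iter().enumerate() {
        if passes_depth_test(prev_depth, obj) {
            rasterize(&mut depth, obj);
            rendered.push(i);
        } else {
            deferred.push(i);
        }
    }

    // 2. Early depth downsample: snapshot the depth written so far.
    let early_depth = depth.clone();

    // 3. Late depth prepass: retest everything the early pass skipped.
    for i in deferred {
        if passes_depth_test(&early_depth, &objects[i]) {
            rasterize(&mut depth, &objects[i]);
            rendered.push(i);
        }
    }

    // 4. Late depth downsample: the final depth feeds the next frame.
    rendered.sort_unstable();
    (rendered, depth)
}
```

In this toy model, a box hidden behind a full-width wall is skipped once the previous frame's depth is available, and it is picked up again by the late pass in the same frame in which the wall moves aside, which is the property the two-phase scheme is after.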

Note that this patch has no effect on the per-mesh CPU overhead for occluded objects, which remains high for a GPU-driven renderer due to the lack of cold-specialization and retained bins. If cold-specialization and retained bins weren't on the horizon, then a more traditional approach like potentially visible sets (PVS) or low-res CPU rendering would probably be more efficient for most scenes than the GPU-driven approach that this patch implements. However, at this point the effort required to implement a PVS baking tool or a low-res CPU renderer would probably be greater than the effort of landing cold-specialization and retained bins, and the GPU-driven approach is the more modern one anyway. It does mean that the performance improvements from occlusion culling as implemented in this patch today are likely to be limited, because of the high CPU overhead for occluded meshes.

Note also that this patch currently doesn't implement occlusion culling for 2D objects or shadow maps. Those can be addressed in a follow-up. Additionally, note that the techniques in this patch require compute shaders, which excludes support for WebGL 2.

This PR is marked experimental because of known precision issues with the downsampling approach when applied to non-power-of-two framebuffer sizes (i.e. most of them). These precision issues can, in rare cases, cause objects that are in fact visible to be judged occluded. (I've never seen this in practice, but I know it's possible; it tends to be likelier to happen with small meshes.) As a follow-up to this patch, we'd like to switch to the SPD-based hi-Z buffer shader from the Granite engine, which doesn't suffer from these problems, at which point we should be able to graduate this feature from experimental status. I opted not to include that rewrite in this patch for two reasons: (1) @JMS55 is planning on doing the rewrite to coincide with the new availability of image atomic operations in Naga; (2) to reduce the scope of this patch.
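To illustrate how a non-power-of-two size can make a depth pyramid non-conservative, here is a small sketch. It is an assumption-laden toy, not the shader in this PR: it works on a single 1D row, uses view-space distance (larger means farther) so each coarse texel should hold the farthest distance it covers, and the function names are hypothetical. The exact failure mode in Bevy's downsampling shader may differ.

```rust
/// Naive 2:1 reduction: the output has `floor(n / 2)` texels, each the max of
/// two source texels. When `n` is odd, the last source texel is never read.
fn downsample_naive(src: &[f32]) -> Vec<f32> {
    (0..src.len() / 2)
        .map(|i| src[2 * i].max(src[2 * i + 1]))
        .collect()
}

/// Conservative variant: when `n` is odd, the last output texel also folds in
/// the trailing source texel, so no source depth is dropped.
fn downsample_conservative(src: &[f32]) -> Vec<f32> {
    let n = src.len() / 2;
    (0..n)
        .map(|i| {
            let mut d = src[2 * i].max(src[2 * i + 1]);
            if i == n - 1 && src.len() % 2 == 1 {
                d = d.max(src[2 * i + 2]);
            }
            d
        })
        .collect()
}

/// Tests a one-pixel-wide object at full-resolution `column` against mip 1.
/// The texel index is clamped to the last texel of the coarse level.
fn is_occluded(mip: &[f32], column: usize, distance: f32) -> bool {
    let texel = (column / 2).min(mip.len() - 1);
    distance > mip[texel]
}
```

With a 5-pixel row `[1, 1, 1, 1, 9]` (a wall at distance 1 with a gap in the last column), the naive reduction yields `[1, 1]`, so a small object at distance 5 in column 4 is wrongly judged occluded; the conservative reduction yields `[1, 9]` and the object passes.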

A new example, occlusion_culling, has been added. It demonstrates objects becoming quickly occluded and disoccluded by dynamic geometry and shows the number of objects that are actually being rendered. Also, a new --occlusion-culling switch has been added to scene_viewer, in order to make it easy to test this patch with large scenes like Bistro.

Migration guide

  • When enqueuing a custom mesh pipeline, work item buffers are now created with bevy::render::batching::gpu_preprocessing::get_or_create_work_item_buffer, not PreprocessWorkItemBuffers::new. See the specialized_mesh_pipeline example.

Showcase

Occlusion culling example:
Screenshot 2025-01-15 175051

Bistro zoomed out, before occlusion culling:
Screenshot 2025-01-16 185425

Bistro zoomed out, after occlusion culling:
Screenshot 2025-01-16 184949

In this scene, occlusion culling reduces the number of meshes Bevy has to render from 1591 to 585.

@pcwalton pcwalton force-pushed the occlusion-culling-4 branch from 7cd3abd to fd03dd0 Compare January 17, 2025 03:04
@pcwalton pcwalton added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jan 17, 2025
@pcwalton pcwalton added this to the 0.16 milestone Jan 17, 2025
@pcwalton pcwalton force-pushed the occlusion-culling-4 branch 5 times, most recently from f2c4a5e to 357d4ad Compare January 17, 2025 04:24
[*two-phase occlusion culling*]:
https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501

[Aaltonen SIGGRAPH 2015]:
https://www.advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf

[Some literature]:
https://gist.github.com/reduz/c5769d0e705d8ab7ac187d63be0099b5?permalink_comment_id=5040452#gistcomment-5040452

[SPD-based hi-Z buffer shader from the Granite engine]:
https://github.com/Themaister/Granite/blob/master/assets/shaders/post/hiz.comp
@pcwalton pcwalton force-pushed the occlusion-culling-4 branch from 357d4ad to 6aec99d Compare January 17, 2025 05:21
Contributor

@bushrat011899 bushrat011899 left a comment

Code looks good, just a minor comment around the experimental module and marking it as doc(hidden) for Sem-Ver reasons. I unfortunately couldn't get the new occlusion_culling example to run on my laptop (Intel i5-1240p iGPU Windows 10) with either DX12 or the Vulkan backends.

Comment on lines +11 to +13
#endif // MULTISAMPLE
#endif // MESHLET
#endif // MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT
Contributor

I am reminded of how spoiled I am getting to just write Rust.

@@ -14,6 +14,7 @@ pub mod core_2d;
pub mod core_3d;
pub mod deferred;
pub mod dof;
pub mod experimental;
Contributor

Might be good to annotate this #[doc(hidden)]. This makes it sem-ver compatible to include breaking changes in this module.

Contributor Author

@pcwalton pcwalton Jan 17, 2025

Do we care about semver compatibility here though if we aren't shipping this in a point release? My concern about #[doc(hidden)] is that it makes the feature less discoverable, and we want testing on it as it's the kind of thing that could have a lot of bugs.

Contributor

Fair point! While Bevy is pre-1.0 it's probably not important anyway, since every release is a breaking release.

Contributor

@bushrat011899 bushrat011899 left a comment

Can confirm the example now runs on my i5-1240p. In the DX12 backend it says my platform doesn't support occlusion culling, but runs the example fine otherwise. On Vulkan it works as expected, culling approximately 30 meshes. Nice work!

@pcwalton pcwalton self-assigned this Jan 18, 2025
@BenjaminBrienen BenjaminBrienen added D-Complex Quite challenging from either a design or technical perspective. Ask for help! D-Shaders This code uses GPU shader languages labels Jan 19, 2025
Contributor

@tychedelia tychedelia left a comment

Seeing a panic on my M2 MBP:

2025-01-20T01:09:54.639116Z ERROR wgpu::backend::wgpu_core: Handling wgpu errors as fatal by default
thread 'Compute Task Pool (4)' panicked at /Users/char/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-23.0.1/src/backend/wgpu_core.rs:996:18:
wgpu error: Validation Error

Caused by:
  In Device::create_bind_group, label = 'preprocess_late_indexed_gpu_occlusion_culling_bind_group'
    Buffer offset 320 does not respect device's requested `min_storage_buffer_offset_alignment` limit 256


note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Encountered a panic in system `bevy_pbr::render::gpu_preprocess::prepare_preprocess_bind_groups`!
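
For context on the validation error above: wgpu requires a storage-buffer binding offset to be a multiple of the device's `min_storage_buffer_offset_alignment` limit (256 on this device), and 320 is not. One common way to satisfy such a limit is to round each offset up to the next multiple of the alignment. The helper below is a generic sketch with a hypothetical name, and is not necessarily how this PR resolved the crash.

```rust
/// Rounds `offset` up to the next multiple of `alignment`.
/// `alignment` must be a power of two, which holds for the usual GPU limits.
fn align_up(offset: u64, alignment: u64) -> u64 {
    debug_assert!(alignment.is_power_of_two());
    (offset + alignment - 1) & !(alignment - 1)
}
```

With an alignment of 256, `align_up(320, 256)` gives 512, while offsets that are already aligned are returned unchanged.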

texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::WriteOnly),
texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::WriteOnly),
texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::WriteOnly),
texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::ReadWrite),
Contributor

Is there a reason this one is marked ReadWrite?

Contributor Author

We call textureStore on it. See mip_6 in downsample_depth.wgsl.

Contributor

Ah yup, I see it's the handoff point between first and second.

@@ -1,8 +1,16 @@
#ifdef MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT
@group(0) @binding(0) var<storage, read> mip_0: array<u64>; // Per pixel
Contributor

@JMS55 JMS55 Jan 20, 2025

Just as a note: I believe all the meshlet-specific stuff is going to disappear here once wgpu 24 is merged and I can switch back to an image-based visbuffer.

@JMS55
Contributor

JMS55 commented Jan 20, 2025

Did part of my review, will do the rest another time.

Focused mainly on meshlets, the depth downsample, and culling test parts. Haven't yet looked at the code for applying the occlusion culling to our main pipeline.

@pcwalton pcwalton requested review from tychedelia and JMS55 January 22, 2025 03:22
@pcwalton
Contributor Author

I believe I've fixed all the issues that I know about and have addressed the review comments. See the commit descriptions for more information.

Contributor

@tychedelia tychedelia left a comment

Crash on mac fixed. Lgtm! ✨

@tychedelia tychedelia added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jan 22, 2025
Contributor

@atlv24 atlv24 left a comment

Another herculean effort from you on all fronts, the docs are great thanks.

a few questions

  • what's the path forward for occlusion culling shadow views? not planned, not worth it, or worth it but annoying to do/not now?
  • how does hzb test interact with TAA jitter?
  • what frustum are you culling by for early depth pass? culling by only prev view frustum will render more than necessary, and culling only by current view frustum will test against unwritten parts of the prev hzb, likely resulting in overdraw. this overdraw cost will have to be eaten either in the early pass or the depth pass though because we do not have information for the newly disoccluded region of the screen, so i don't think it matters, meaning culling by current view is ideal.

}
}
}

fn preprocess_direct_bind_group_layout_entries() -> DynamicBindGroupLayoutEntries {
DynamicBindGroupLayoutEntries::sequential(
DynamicBindGroupLayoutEntries::new_with_indices(
Contributor

I wasn't aware we had a new_with_indices, this is much nicer to use

let uv_pos = ndc_to_uv(ndc_pos.xy);

// Update the AABB and maximum view-space depth.
if (i == 0u) {
Contributor

this i == 0u case can be removed by initializing max_depth_view to -inf, and aabb to vec4(inf, inf, -inf, -inf) i think

Contributor Author

OK, I did that. Note that I had to use a bitcast because Naga complained if I tried 1.0 / 0.0 or -1.0 / 0.0.
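
The idea in this thread can be sketched in Rust: build the infinities from their IEEE 754 bit patterns (the analogue of a WGSL `bitcast`) and seed the accumulators with them, so the loop needs no first-iteration special case. The function name and the point layout (`[x, y, depth]`) are assumptions made for this sketch, not the shader's actual code.

```rust
// IEEE 754 single-precision bit patterns for +infinity and -infinity.
const POS_INF_BITS: u32 = 0x7f80_0000;
const NEG_INF_BITS: u32 = 0xff80_0000;

/// Returns `([min_x, min_y, max_x, max_y], max_depth)` over `points`, where
/// each point is `[x, y, depth]`. Seeding with infinities means the first
/// point is handled exactly like every other point.
fn bounds(points: &[[f32; 3]]) -> ([f32; 4], f32) {
    let inf = f32::from_bits(POS_INF_BITS);
    let neg_inf = f32::from_bits(NEG_INF_BITS);

    let mut aabb = [inf, inf, neg_inf, neg_inf];
    let mut max_depth = neg_inf;
    for p in points {
        aabb[0] = aabb[0].min(p[0]);
        aabb[1] = aabb[1].min(p[1]);
        aabb[2] = aabb[2].max(p[0]);
        aabb[3] = aabb[3].max(p[1]);
        max_depth = max_depth.max(p[2]);
    }
    (aabb, max_depth)
}
```

One consequence of this design is that an empty input yields an inverted box (`min > max`), which callers may want to treat as "nothing to test".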

let depth_quad_a = textureLoad(depth_pyramid, aabb_top_left, depth_level).x;
let depth_quad_b = textureLoad(depth_pyramid, aabb_top_left + vec2(1u, 0u), depth_level).x;
let depth_quad_c = textureLoad(depth_pyramid, aabb_top_left + vec2(0u, 1u), depth_level).x;
let depth_quad_d = textureLoad(depth_pyramid, aabb_top_left + vec2(1u, 1u), depth_level).x;
Contributor

i believe it is recommended to sample 16 pixels from a 1-step-finer mip level. this is definitely something to punt to another pr though

Contributor

I haven't heard of that. I have seen using a 3x3 sample based off a condition (I forget what) though. Something we can improve on in a future PR.
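
For readers following the four-texel test discussed here, the sketch below shows the usual shape of such a check: pick a mip level at which the screen-space AABB overlaps at most 2x2 texels, load those four texels, and compare the object's nearest point against the farthest stored depth. This is a hedged illustration only. It assumes view-space distance (larger means farther), and the `Mip` type, the level-selection rule, and the function names are hypothetical rather than taken from the shader in this PR.

```rust
/// One level of a depth pyramid; each texel stores the farthest distance
/// found in the screen region it covers.
struct Mip {
    width: usize,
    height: usize,
    texels: Vec<f32>,
}

impl Mip {
    /// Loads a texel, clamping coordinates to the edge of the level.
    fn load(&self, x: usize, y: usize) -> f32 {
        let x = x.min(self.width - 1);
        let y = y.min(self.height - 1);
        self.texels[y * self.width + x]
    }
}

/// Smallest level whose texel size (2^level pixels) is at least the larger
/// AABB extent, so the AABB overlaps at most 2x2 texels at that level.
fn mip_level_for(width_px: f32, height_px: f32) -> u32 {
    let size = width_px.max(height_px).max(1.0).ceil() as u32;
    size.next_power_of_two().trailing_zeros()
}

/// Returns true if the object's nearest point is behind all four texels.
fn is_occluded(mip: &Mip, level: u32, aabb_min_px: (f32, f32), nearest: f32) -> bool {
    let x = (aabb_min_px.0 as usize) >> level;
    let y = (aabb_min_px.1 as usize) >> level;
    let farthest = mip
        .load(x, y)
        .max(mip.load(x + 1, y))
        .max(mip.load(x, y + 1))
        .max(mip.load(x + 1, y + 1));
    nearest > farthest
}
```

The alternatives mentioned in this thread (16 samples from a one-step-finer level, or a conditional 3x3 footprint) would trade extra texture loads for a tighter bound; they are not shown here.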

/// a significant slowdown.
///
/// Occlusion culling currently requires a `DepthPrepass`. If no depth prepass
/// is present on the view, the [`OcclusionCulling`] component will be ignored.
Contributor

Maybe have OcclusionCulling required component DepthPrepass?

Contributor

i realize that the current way is just aligning with how all the other prepasses do it. lets punt on this to a required-components prepass migration pr

Contributor Author

I tried to do that but the problem is that DepthPrepass lives in bevy_core_pipeline which is downstream of bevy_render, so OcclusionCulling can't refer to it.

@pcwalton
Contributor Author

> what's the path forward for occlusion culling shadow views? not planned, not worth it, or worth it but annoying to do/not now?

Planned, but didn't want to do it in this patch since it'd make it bigger and more complex.

> how does hzb test interact with TAA jitter?

I think TAA jitter is basically just a different view matrix, so it's essentially just the same as a regular camera movement (i.e. it should just work).

> what frustum are you culling by for early depth pass? culling by only prev view frustum will render more than necessary, and culling only by current view frustum will test against unwritten parts of the prev hzb, likely resulting in overdraw. this overdraw cost will have to be eaten either in the early pass or the depth pass though because we do not have information for the newly disoccluded region of the screen, so i don't think it matters, meaning culling by current view is ideal.

It tests against the current frame frustum.

@pcwalton
Contributor Author

pcwalton commented Jan 22, 2025

Bevy example runner output looks good; all the changed references seem to be false positives.

Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! D-Shaders This code uses GPU shader languages S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: In Review
6 participants