Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Metal from crashing when a still-open encoder is deallocated, resolve issue #2077. #4023

Merged
merged 22 commits into from
Aug 29, 2023

Conversation

bradwerth
Copy link
Contributor

@bradwerth bradwerth commented Aug 9, 2023

Checklist

  • Run cargo clippy.
  • Run cargo clippy --target wasm32-unknown-unknown if applicable.
  • Add change to CHANGELOG.md. See simple instructions inside file.

Connections
None

Description
Metal requires that a MTLCommandEncoder must be closed before it is deallocated, otherwise Metal will assert that endEncoding was not called. This manifests as a crash in CTS tests that might otherwise be failures or timeouts.

Testing
CTS tests that are expected fail on Metal will no longer crash.

@ErichDonGubler ErichDonGubler self-assigned this Aug 9, 2023
Copy link
Member

@ErichDonGubler ErichDonGubler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me; I think the shape of it might change a bit, but using Drop to implment this seems like an elegant way to solve the problem.

If it weren't for the CHANGELOG entry, I'd "only" be leaving a Comment review, not a Request changes review.

CHANGELOG.md Show resolved Hide resolved
wgpu-hal/src/metal/command.rs Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
wgpu-hal/src/metal/command.rs Show resolved Hide resolved
Copy link
Contributor

@kpreid kpreid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR fix #2077 and particularly the repro case I described in #2077 (comment) ?

If so, it should be mentioned in the PR description.

CHANGELOG.md Outdated Show resolved Hide resolved
@bradwerth bradwerth changed the title Prevent Metal from crashing when a still-open encoder is deallocated. Prevent Metal from crashing when a still-open encoder is deallocated, resolve issue #2077. Aug 11, 2023
Copy link
Member

@jimblandy jimblandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a test case, probably in tests/tests/encoder.rs.

@jimblandy
Copy link
Member

jimblandy commented Aug 11, 2023

@ErichDonGubler asked:

Is trying to solve this problem for other platforms something you intend to resolve or track in another place?

As long as the fix is in wgpu-core, the encoders will get closed on all backends, as wgpu-core is simply driving whatever wgpu_hal::Api implementation you gave it.

Perhaps the right place to document this requirement is in wgpu-hal/src/lib.rs. I would like that file to start recording the invariants that wgpu-core works so hard to ensure. @cwfitzgerald, does that sound right to you?

@ErichDonGubler
Copy link
Member

@jimblandy: I'm confused. Why are we talking about wgpu-core? This fix isn't in wgpu-core; this is a wgpu-hal change specific to the metal module family.

@jimblandy
Copy link
Member

@jimblandy: I'm confused. Why are we talking about wgpu-core? This fix isn't in wgpu-core; this is a wgpu-hal change specific to the metal module family.

Sorry - Brad had shown me an earlier version of this patch that addressed the problem in generic code. This PR is completely different from that. This is indeed a Metal-only fix.

wgpu-hal/src/metal/command.rs Outdated Show resolved Hide resolved
@jimblandy
Copy link
Member

jimblandy commented Aug 13, 2023

It seems like wgpu-hal's Vulkan backend doesn't have any similar requirement: wgpu_hal::vulkan::Device::destroy_command_encoder calls vkDestroyCommandPool, which only seems to require that none of the vkCommandBuffers in the pool be actively in use by the GPU (the "pending" state).

But it's a little hard to follow, which is one reason I'd like to have a test case in this PR that runs on all platforms but reproduces the problem on Metal.

CommandEncoder such that we can call the existing discard_encoding
function. Also add a test of dropping a CommandEncoder after it has
errored on a command.
@bradwerth bradwerth requested a review from jimblandy August 15, 2023 15:30
@bradwerth
Copy link
Contributor Author

From the failures in the attempted merge, it appears that DX12 is failing the new test and therefore needs a similar intervention. The DX12 error is ID3D12CommandAllocator::Reset: The command allocator cannot be reset because a command list is currently being recorded with the allocator. [ EXECUTION ERROR #543: COMMAND_ALLOCATOR_CANNOT_RESET]. That sounds very much like the Metal issue with failing to call endEncoding. I'll move the Drop implementation further up the class hierarchy so it covers all adapters.

Copy link
Member

@cwfitzgerald cwfitzgerald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after nit

wgpu-hal/src/metal/command.rs Outdated Show resolved Hide resolved
@cwfitzgerald
Copy link
Member

cwfitzgerald commented Aug 16, 2023

D3D12 isn't happy with this change, unfortunately:

[2023-08-16T13:41:19Z ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: The command allocator cannot be reset because a command list is currently being recorded with the allocator. [ EXECUTION ERROR #543: COMMAND_ALLOCATOR_CANNOT_RESET]

@bradwerth
Copy link
Contributor Author

I'll move the Drop implementation further up the class hierarchy so it covers all adapters.

Since all the superclass implementations were empty, I just re-implemented it for DX12.

@ErichDonGubler ErichDonGubler self-requested a review August 16, 2023 19:47
Copy link
Member

@ErichDonGubler ErichDonGubler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minus Connor's concern.

wgpu-hal/src/dx12/command.rs Outdated Show resolved Hide resolved
@bradwerth
Copy link
Contributor Author

I'm not confident I'll be able to resolve the DX12 test failure with my setup in a reasonable time. I think I will disable the new test for DX12 in this PR, and treat it as a follow-up issue for somebody who is in a better position to fix it.

Copy link
Member

@jimblandy jimblandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fantastic.

Please file a follow-up issue for fixing the test on DX12. You can just leave it for others to triage.

@jimblandy
Copy link
Member

There's an unrelated timeout on mach aarch64. I'm going to try re-running, and if it goes away we'll rebase and merge this.

@jimblandy
Copy link
Member

I filed #4096 for the aarch64 timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants