Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix shared provider unload crash #5523

Merged
merged 6 commits into from
Oct 20, 2020
Merged

Conversation

RyanUnderhill
Copy link
Member

@RyanUnderhill RyanUnderhill commented Oct 17, 2020

Short Description: On Linux, when calling dlclose() a library won't always be unloaded immediately (and have its global variables be destroyed). This causes a problem with shared providers, since they hold a KernelRegistry object in a static and they need to call into the core OnnxRuntime code to destroy it.

The fix adds a Shutdown() method to the shared providers and calls this on OrtEnv destruction in the core onnxruntime code. This way these globals are destroyed in a deterministic way before library unloading gets involved.

Long Description:Here's what would happen before this change on Linux:

  1. Customer code dynamically loads onnxruntime.so
  2. Customer code then adds TensorRT provider to session
  3. Onnxruntime dynamically loads onnxruntime_provider_tensorrt.so due to (2)
  4. Customer code destroys all onnxruntime resources and unloads the onnxruntime.so library
  5. onnxruntime global variables are destroyed, one of which unloads onnxruntime_provider_tensorrt.so
  6. onnxruntime fully unloads
  7. onnxruntime_provider_tensorrt.so global variables are destroyed, and try to call into onnxruntime.so code (to release a KernelRegistry object)
  8. Crash!

On Windows, unloading onnxruntime_provider_tensorrt.so causes it to unload immediately, but on Linux it doesn't unload until after onnxruntime.so unloads.

Even if dlclose() unloaded and shutdown the onnxruntime_provider_tensorrt.so library immediately the shutdown order is still undefined and could result in bad behavior. My change makes the shutdown order deterministic so that the OS ordering of global destruction and when the library unloads will work in both cases.

Secondly.. unloading libraries during library unload is bad app behavior on windows (should never FreeLibrary during DllMain). So this is just the right thing to do in general.

My first fix was just changing the code so that we unload libraries on OrtEnv destruction. This isn’t enough to fix the issue as on linux dlclose() doesn’t guarantee when it unloads a library, it just marks it for unload at some later point when it feels like it. So the explicit Shutdown() method is needed in the provider for that.

#5384 - Original customer bug

@RyanUnderhill RyanUnderhill marked this pull request as ready for review October 19, 2020 20:44
@RyanUnderhill RyanUnderhill requested a review from a team as a code owner October 19, 2020 20:44
@RyanUnderhill RyanUnderhill requested review from yuslepukhin and stevenlix and removed request for snnn October 19, 2020 20:45
void Shutdown_DeleteRegistry() {
s_kernel_registry.reset();
}

std::shared_ptr<Provider_KernelRegistry> DNNLExecutionProvider::Provider_GetKernelRegistry() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::shared_ptr<Provider_KernelRegistry> [](start = 0, length = 40)

The issue with giving away shared_ptr is that even if you destroy your copy of the ptr, someone will cache it someplace and then will still try to call after unloading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but onnxruntime is the only one with it at that point, so it's good. Since it is the only user of this and it is the one that destroys them.

Copy link
Contributor

@pranavsharma pranavsharma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

@@ -543,35 +544,35 @@ struct CPUIDInfo {
bool HasAVX2() const { return g_host->CPUIDInfo__HasAVX2(this); }
bool HasAVX512f() const { return g_host->CPUIDInfo__HasAVX512f(this); }

PROVIDER_DISALLOW_ALL(CPUIDInfo)
PROVIDER_DISALLOW_ALL(CPUIDInfo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: formatting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like clang format fixed it and this is correct now? (I might be misunderstanding it)

return ret;
static std::shared_ptr<onnxruntime::Provider_KernelRegistry> s_kernel_registry;

void Shutdown_DeleteRegistry() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is calling it ShutdownRegistry enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for this issue. If we have other Onnxruntime objects that need to be destroyed we can do it here too. I've been thinking of ways to prove it's clean on shutdown after this fix to prevent future issues.

@RyanUnderhill
Copy link
Member Author

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

@pranavsharma
Copy link
Contributor

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them?

@RyanUnderhill
Copy link
Member Author

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them?

The provider factory is created and added to the session, but the provider shared library itself is just held in a static variable. It's just owned by the onnxruntime library itself. To safely unload the provider shared library we'd need to refcount it in the session so that it's not unloaded until the last session using it goes away.

Just not sure if it's worth it as it might also lead to perf issues if a user creates a session using TensorRT, destroys it, then creates another one. It'd be loading/unloading the TensorRT provider & libraries every time.

@pranavsharma
Copy link
Contributor

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them?

The provider factory is created and added to the session, but the provider shared library itself is just held in a static variable. It's just owned by the onnxruntime library itself. To safely unload the provider shared library we'd need to refcount it in the session so that it's not unloaded until the last session using it goes away.

Just not sure if it's worth it as it might also lead to perf issues if a user creates a session using TensorRT, destroys it, then creates another one. It'd be loading/unloading the TensorRT provider & libraries every time.

Spoke offline and understood it better. LGTM

@RyanUnderhill RyanUnderhill merged commit 6106762 into master Oct 20, 2020
@RyanUnderhill RyanUnderhill deleted the ryanunderhill/fix_crash branch October 20, 2020 01:08
snnn added a commit that referenced this pull request Oct 20, 2020
snnn added a commit that referenced this pull request Oct 20, 2020
This reverts commit 6106762. Because Linux DNNL pipeline is failing.
RyanUnderhill added a commit that referenced this pull request Oct 20, 2020
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
RyanUnderhill added a commit that referenced this pull request Oct 27, 2020
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants