Fix shared provider unload crash #5523

RyanUnderhill · 2020-10-17T01:03:32Z

Short Description: On Linux, when calling dlclose() a library won't always be unloaded immediately (and have its global variables be destroyed). This causes a problem with shared providers, since they hold a KernelRegistry object in a static and they need to call into the core OnnxRuntime code to destroy it.

The fix adds a Shutdown() method to the shared providers and calls this on OrtEnv destruction in the core onnxruntime code. This way these globals are destroyed in a deterministic way before library unloading gets involved.

Long Description:Here's what would happen before this change on Linux:

Customer code dynamically loads onnxruntime.so
Customer code then adds TensorRT provider to session
Onnxruntime dynamically loads onnxruntime_provider_tensorrt.so due to (2)
Customer code destroys all onnxruntime resources and unloads the onnxruntime.so library
onnxruntime global variables are destroyed, one of which unloads onnxruntime_provider_tensorrt.so
onnxruntime fully unloads
onnxruntime_provider_tensorrt.so global variables are destroyed, and try to call into onnxruntime.so code (to release a KernelRegistry object)
Crash!

On Windows, unloading onnxruntime_provider_tensorrt.so causes it to unload immediately, but on Linux it doesn't unload until after onnxruntime.so unloads.

Even if dlclose() unloaded and shutdown the onnxruntime_provider_tensorrt.so library immediately the shutdown order is still undefined and could result in bad behavior. My change makes the shutdown order deterministic so that the OS ordering of global destruction and when the library unloads will work in both cases.

Secondly.. unloading libraries during library unload is bad app behavior on windows (should never FreeLibrary during DllMain). So this is just the right thing to do in general.

My first fix was just changing the code so that we unload libraries on OrtEnv destruction. This isn’t enough to fix the issue as on linux dlclose() doesn’t guarantee when it unloads a library, it just marks it for unload at some later point when it feels like it. So the explicit Shutdown() method is needed in the provider for that.

#5384 - Original customer bug

…ry unload

…o ryanunderhill/fix_crash

onnxruntime/core/session/ort_env.cc

onnxruntime/core/framework/provider_bridge_ort.cc

yuslepukhin · 2020-10-19T21:07:33Z

onnxruntime/core/providers/dnnl/dnnl_execution_provider.cc

+void Shutdown_DeleteRegistry() {
+  s_kernel_registry.reset();
+}
+
 std::shared_ptr<Provider_KernelRegistry> DNNLExecutionProvider::Provider_GetKernelRegistry() const {


std::shared_ptr<Provider_KernelRegistry> [](start = 0, length = 40)

The issue with giving away shared_ptr is that even if you destroy your copy of the ptr, someone will cache it someplace and then will still try to call after unloading.

Yes, but onnxruntime is the only one with it at that point, so it's good. Since it is the only user of this and it is the one that destroys them.

pranavsharma

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

pranavsharma · 2020-10-19T21:04:10Z

onnxruntime/core/providers/shared_library/provider_interfaces.h

@@ -543,35 +544,35 @@ struct CPUIDInfo {
  bool HasAVX2() const { return g_host->CPUIDInfo__HasAVX2(this); }
  bool HasAVX512f() const { return g_host->CPUIDInfo__HasAVX512f(this); }

-PROVIDER_DISALLOW_ALL(CPUIDInfo)
+  PROVIDER_DISALLOW_ALL(CPUIDInfo)


nit: formatting

Looks like clang format fixed it and this is correct now? (I might be misunderstanding it)

pranavsharma · 2020-10-19T21:05:14Z

onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc

-  return ret;
+static std::shared_ptr<onnxruntime::Provider_KernelRegistry> s_kernel_registry;
+
+void Shutdown_DeleteRegistry() {


nit: Is calling it ShutdownRegistry enough?

It is for this issue. If we have other Onnxruntime objects that need to be destroyed we can do it here too. I've been thinking of ways to prove it's clean on shutdown after this fix to prevent future issues.

RyanUnderhill · 2020-10-19T22:42:40Z

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

…o ryanunderhill/fix_crash

…bugs.

pranavsharma · 2020-10-19T23:57:02Z

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them?

RyanUnderhill · 2020-10-20T00:19:30Z

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them?

The provider factory is created and added to the session, but the provider shared library itself is just held in a static variable. It's just owned by the onnxruntime library itself. To safely unload the provider shared library we'd need to refcount it in the session so that it's not unloaded until the last session using it goes away.

Just not sure if it's worth it as it might also lead to perf issues if a user creates a session using TensorRT, destroys it, then creates another one. It'd be loading/unloading the TensorRT provider & libraries every time.

pranavsharma · 2020-10-20T00:32:26Z

The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?

Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly.

Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them?

The provider factory is created and added to the session, but the provider shared library itself is just held in a static variable. It's just owned by the onnxruntime library itself. To safely unload the provider shared library we'd need to refcount it in the session so that it's not unloaded until the last session using it goes away.

Just not sure if it's worth it as it might also lead to perf issues if a user creates a session using TensorRT, destroys it, then creates another one. It'd be loading/unloading the TensorRT provider & libraries every time.

Spoke offline and understood it better. LGTM

This reverts commit 6106762.

This reverts commit 6106762. Because Linux DNNL pipeline is failing.

* Change shared providers so that they are shutdown before shared library unload * Move UnloadSharedProviders declaration into a shared header to avoid bugs.

RyanUnderhill added 4 commits October 16, 2020 17:07

Change shared providers so that they are shutdown before shared libra…

02d6565

…ry unload

Merge branch 'master' of https://github.com/microsoft/onnxruntime

478cae1

Merge branch 'master' of https://github.com/microsoft/onnxruntime int…

dc92f47

…o ryanunderhill/fix_crash

Remove useless line

e0dceb5

RyanUnderhill mentioned this pull request Oct 17, 2020

ORT 1.5.1 TensorRT EP segfault on unloading shared library #5384

Closed

RyanUnderhill requested review from snnn, jywu-msft and pranavsharma October 17, 2020 01:03

RyanUnderhill marked this pull request as ready for review October 19, 2020 20:44

RyanUnderhill requested a review from a team as a code owner October 19, 2020 20:44

RyanUnderhill requested review from yuslepukhin and stevenlix and removed request for snnn October 19, 2020 20:45

yuslepukhin reviewed Oct 19, 2020

View reviewed changes

onnxruntime/core/session/ort_env.cc Outdated Show resolved Hide resolved

yuslepukhin reviewed Oct 19, 2020

View reviewed changes

onnxruntime/core/framework/provider_bridge_ort.cc Show resolved Hide resolved

yuslepukhin reviewed Oct 19, 2020

View reviewed changes

pranavsharma reviewed Oct 19, 2020

View reviewed changes

RyanUnderhill added 2 commits October 19, 2020 16:20

Merge branch 'master' of https://github.com/microsoft/onnxruntime int…

48add2d

…o ryanunderhill/fix_crash

Move UnloadSharedProviders declaration into a shared header to avoid …

b8b31d0

…bugs.

pranavsharma approved these changes Oct 20, 2020

View reviewed changes

RyanUnderhill merged commit 6106762 into master Oct 20, 2020

RyanUnderhill deleted the ryanunderhill/fix_crash branch October 20, 2020 01:08

snnn added a commit that referenced this pull request Oct 20, 2020

Revert "Fix shared provider unload crash (#5523)"

81a86f1

This reverts commit 6106762.

snnn added a commit that referenced this pull request Oct 20, 2020

Revert "Fix shared provider unload crash (#5523)" (#5547)

280cdf3

This reverts commit 6106762. Because Linux DNNL pipeline is failing.

RyanUnderhill added a commit that referenced this pull request Oct 20, 2020

Fix shared provider unload crash (#5523)

acb7922

* Change shared providers so that they are shutdown before shared library unload * Move UnloadSharedProviders declaration into a shared header to avoid bugs.

RyanUnderhill mentioned this pull request Oct 20, 2020

Fix shared provider unload crash #5553

Merged

RyanUnderhill added a commit that referenced this pull request Oct 27, 2020

Fix shared provider unload crash (#5523)

90274be

* Change shared providers so that they are shutdown before shared library unload * Move UnloadSharedProviders declaration into a shared header to avoid bugs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix shared provider unload crash #5523

Fix shared provider unload crash #5523

RyanUnderhill commented Oct 17, 2020 •

edited

Loading

yuslepukhin Oct 19, 2020

RyanUnderhill Oct 19, 2020

pranavsharma left a comment

pranavsharma Oct 19, 2020

RyanUnderhill Oct 19, 2020

pranavsharma Oct 19, 2020

RyanUnderhill Oct 19, 2020

RyanUnderhill commented Oct 19, 2020

pranavsharma commented Oct 19, 2020

RyanUnderhill commented Oct 20, 2020

pranavsharma commented Oct 20, 2020

Fix shared provider unload crash #5523

Fix shared provider unload crash #5523

Conversation

RyanUnderhill commented Oct 17, 2020 • edited Loading

yuslepukhin Oct 19, 2020

Choose a reason for hiding this comment

RyanUnderhill Oct 19, 2020

Choose a reason for hiding this comment

pranavsharma left a comment

Choose a reason for hiding this comment

pranavsharma Oct 19, 2020

Choose a reason for hiding this comment

RyanUnderhill Oct 19, 2020

Choose a reason for hiding this comment

pranavsharma Oct 19, 2020

Choose a reason for hiding this comment

RyanUnderhill Oct 19, 2020

Choose a reason for hiding this comment

RyanUnderhill commented Oct 19, 2020

pranavsharma commented Oct 19, 2020

RyanUnderhill commented Oct 20, 2020

pranavsharma commented Oct 20, 2020

RyanUnderhill commented Oct 17, 2020 •

edited

Loading