-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix shared provider unload crash #5523
Conversation
void Shutdown_DeleteRegistry() { | ||
s_kernel_registry.reset(); | ||
} | ||
|
||
std::shared_ptr<Provider_KernelRegistry> DNNLExecutionProvider::Provider_GetKernelRegistry() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::shared_ptr<Provider_KernelRegistry> [](start = 0, length = 40)
The issue with giving away shared_ptr is that even if you destroy your copy of the ptr, someone will cache it someplace and then will still try to call after unloading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but onnxruntime is the only one with it at that point, so it's good. Since it is the only user of this and it is the one that destroys them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The providers are registered with the session, not the env. Can the providers be unloaded by the session instead of the env?
@@ -543,35 +544,35 @@ struct CPUIDInfo { | |||
bool HasAVX2() const { return g_host->CPUIDInfo__HasAVX2(this); } | |||
bool HasAVX512f() const { return g_host->CPUIDInfo__HasAVX512f(this); } | |||
|
|||
PROVIDER_DISALLOW_ALL(CPUIDInfo) | |||
PROVIDER_DISALLOW_ALL(CPUIDInfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like clang format fixed it and this is correct now? (I might be misunderstanding it)
return ret; | ||
static std::shared_ptr<onnxruntime::Provider_KernelRegistry> s_kernel_registry; | ||
|
||
void Shutdown_DeleteRegistry() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is calling it ShutdownRegistry enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is for this issue. If we have other Onnxruntime objects that need to be destroyed we can do it here too. I've been thinking of ways to prove it's clean on shutdown after this fix to prevent future issues.
Potentially but that would be much more complex since we'd need to reference count the providers since they could be used by multiple sessions at once. Before this change they were just unloaded when onnxruntime is unloaded. So this doesn't really change when they're unloaded. It just makes it work properly. |
…o ryanunderhill/fix_crash
Didn't get the refcounting comment. The provider instances are created as part of session creation. They're not shared between sessions. We register factories with session options and then these factories are used to create the provider instances as part of session creation. If the lifetime of a provider is constrained by the session, why is the env unloading and destroying them? |
The provider factory is created and added to the session, but the provider shared library itself is just held in a static variable. It's just owned by the onnxruntime library itself. To safely unload the provider shared library we'd need to refcount it in the session so that it's not unloaded until the last session using it goes away. Just not sure if it's worth it as it might also lead to perf issues if a user creates a session using TensorRT, destroys it, then creates another one. It'd be loading/unloading the TensorRT provider & libraries every time. |
Spoke offline and understood it better. LGTM |
* Change shared providers so that they are shutdown before shared library unload * Move UnloadSharedProviders declaration into a shared header to avoid bugs.
* Change shared providers so that they are shutdown before shared library unload * Move UnloadSharedProviders declaration into a shared header to avoid bugs.
Short Description: On Linux, when calling dlclose() a library won't always be unloaded immediately (and have its global variables be destroyed). This causes a problem with shared providers, since they hold a KernelRegistry object in a static and they need to call into the core OnnxRuntime code to destroy it.
The fix adds a Shutdown() method to the shared providers and calls this on OrtEnv destruction in the core onnxruntime code. This way these globals are destroyed in a deterministic way before library unloading gets involved.
Long Description:Here's what would happen before this change on Linux:
On Windows, unloading onnxruntime_provider_tensorrt.so causes it to unload immediately, but on Linux it doesn't unload until after onnxruntime.so unloads.
Even if dlclose() unloaded and shutdown the onnxruntime_provider_tensorrt.so library immediately the shutdown order is still undefined and could result in bad behavior. My change makes the shutdown order deterministic so that the OS ordering of global destruction and when the library unloads will work in both cases.
Secondly.. unloading libraries during library unload is bad app behavior on windows (should never FreeLibrary during DllMain). So this is just the right thing to do in general.
My first fix was just changing the code so that we unload libraries on OrtEnv destruction. This isn’t enough to fix the issue as on linux dlclose() doesn’t guarantee when it unloads a library, it just marks it for unload at some later point when it feels like it. So the explicit Shutdown() method is needed in the provider for that.
#5384 - Original customer bug