ExtendedTools: Fix a thread handle leak #913
Conversation
Both this PR and #914 are the result of me combing through all the PH code as best I could, looking for any plausible thread handle leaks that could possibly be happening. I found just two potential ones, and they both seem to be relatively obscure, I think.

I should explain what prompted me to look into this. Something that's been bothering me lately is that I've noticed that after 5 days of uptime, my machine has 282,000 handles open, of which 116,000 are in the System process. I'm pretty sure that's not normal... right? This is on a system with about 360 processes and 4,800 threads at the moment. (A reasonable number of Chrome tabs open, sure; but I've had far more things running at once in the past without this happening.)

The thing that actually caught me off guard to begin with was that I was seeing 6-digit PIDs, which I had never seen before on this system (or on any Windows system, really). It made my PID column overflow. I did upgrade this system from W10 1909 to 21H1 a week and a half ago. So at first I thought, "huh, maybe in 21H1 they increased the upper bound on PIDs, kind of like what happened with the change to..."

I also started wondering if maybe KProcessHacker was doing something naughty, since it's a kernel driver and runs in the System process's context. So, at this point, I'm thinking it must be some new and exciting bug added to Windows 10 recently, or else some other kernel driver is screwing around leaking handles willy-nilly. (I suppose it could be NVIDIA... because I did happen to update to 466.47 at effectively the same time I did the OS update. 🤔)

Am I at least on the right track in thinking that, for all these handles to be piling up in the System process, the culprit pretty much has to be kernel-mode code?
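(For anyone who wants to spot-check that same number on their own machine: here's a minimal sketch, assuming an elevated prompt, that reads the System process's handle count through the documented GetProcessHandleCount API. PID 4 is the System process.)

```c
// Minimal check of the System process's handle count (PID 4).
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE process = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, 4);

    if (process)
    {
        DWORD count = 0;

        if (GetProcessHandleCount(process, &count))
            printf("System process handle count: %lu\n", count);
        CloseHandle(process);
    }
    return 0;
}
```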
@dmex Any thoughts on this?

One additional thing I realized after my previous comment here is that (based on the handle names, which show the process/thread attribution for each handle) these handles to long-dead threads are not just sitting leaked in the System process; the threads they refer to also belonged to the System process itself.

So it appears that it's not that some kernel component or driver is leaking thread handles to random unrelated threads in other processes; instead, some kernel component or driver is creating a ton of probably-short-lived threads within the System process and then leaking handles to them.

If true, then it might actually be somewhat easier to track down who's responsible for this activity, since it seems probable that the same entity that's doing the thread handle leaking is also the entity that's creating the threads in question. And so therefore I can presumably just monitor ongoing thread creation and termination within the System process; a sketch of one way to do that is below.
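(To make that concrete: a minimal sketch of such monitoring using the NT Kernel Logger's thread events. This is my own illustration, not Process Hacker code. It assumes administrator rights and assumes the ProcessId-then-TThreadId field order of the documented Thread_TypeGroup1 payload; verify against the MOF for your OS version. Only one NT Kernel Logger session can exist system-wide.)

```c
// Watch thread create/exit events in the System process (PID 4) via ETW.
#define INITGUID // pull in GUID definitions such as SystemTraceControlGuid
#include <windows.h>
#include <string.h>
#include <evntrace.h>
#include <evntcons.h>
#include <stdio.h>
#include <stdlib.h>

#pragma comment(lib, "advapi32.lib")

// Kernel Thread provider GUID (the documented MOF "Thread" class GUID).
static const GUID ThreadClassGuid =
    { 0x3d6fa8d1, 0xfe05, 0x11d0, { 0x9d, 0xda, 0x00, 0xc0, 0x4f, 0xd7, 0xba, 0x7c } };

static void WINAPI EventRecordCallback(PEVENT_RECORD record)
{
    UCHAR opcode = record->EventHeader.EventDescriptor.Opcode;
    ULONG pid, tid;

    if (!IsEqualGUID(&record->EventHeader.ProviderId, &ThreadClassGuid))
        return;
    if (opcode != EVENT_TRACE_TYPE_START && opcode != EVENT_TRACE_TYPE_END)
        return;
    if (record->UserDataLength < 2 * sizeof(ULONG))
        return;

    // Assumed payload layout: ProcessId first, then TThreadId (both ULONG).
    pid = ((PULONG)record->UserData)[0];
    tid = ((PULONG)record->UserData)[1];

    if (pid == 4) // System process
        printf("%s TID %lu in PID %lu\n",
               opcode == EVENT_TRACE_TYPE_START ? "create" : "exit", tid, pid);
}

int main(void)
{
    ULONG size = sizeof(EVENT_TRACE_PROPERTIES) + sizeof(KERNEL_LOGGER_NAME);
    PEVENT_TRACE_PROPERTIES props = calloc(1, size);
    EVENT_TRACE_LOGFILE logfile = { 0 };
    TRACEHANDLE session = 0;
    TRACEHANDLE consumer;

    props->Wnode.BufferSize = size;
    props->Wnode.Guid = SystemTraceControlGuid;
    props->Wnode.Flags = WNODE_FLAG_TRACED_GUID;
    props->Wnode.ClientContext = 1; // QPC timestamps
    props->EnableFlags = EVENT_TRACE_FLAG_THREAD;
    props->LogFileMode = EVENT_TRACE_REAL_TIME_MODE;
    props->LoggerNameOffset = sizeof(EVENT_TRACE_PROPERTIES);

    // Fails with ERROR_ALREADY_EXISTS if the kernel logger is already running.
    if (StartTrace(&session, KERNEL_LOGGER_NAME, props) != ERROR_SUCCESS)
        return 1;

    logfile.LoggerName = KERNEL_LOGGER_NAME;
    logfile.ProcessTraceMode = PROCESS_TRACE_MODE_REAL_TIME |
                               PROCESS_TRACE_MODE_EVENT_RECORD;
    logfile.EventRecordCallback = EventRecordCallback;

    consumer = OpenTrace(&logfile);
    ProcessTrace(&consumer, 1, NULL, NULL); // blocks; stop the session later
                                            // with ControlTrace(...,
                                            // EVENT_TRACE_CONTROL_STOP)
    CloseTrace(consumer);
    return 0;
}
```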
I did notice that the Handle properties window does show some rather useful information (in particular: the creation and exit times of the thread being referred to by a thread handle). So I could potentially figure out some time correlations, and I did notice that there seemed to be some time clustering. But it's impractical for me to open the Handle properties window for about 100,000 different thread handles.

So I coded up a bit of a hack that basically duplicated the code for displaying the handle's thread creation time and exit time from the Handle properties window, in the form of additional columns in the Process properties window's Handles table instead. That way, I could see all that information at once and potentially copy it out into a text file, etc.

Then, after coding this up, I realized that I was an idiot, since it wouldn't actually work. Because, of course, KProcessHacker doesn't like unsigned custom-built ProcessHacker.exe files, so the whole thing fails at the very first step. I don't even know for sure if this is something I could effectively do with the plugin SDK. 🤔❔ I suppose in theory I could reboot with test signing enabled and build a hacked KProcessHacker driver that doesn't check the ProcessHacker.exe signature... but rebooting would mean eliminating all the leaked handles that I know for sure are there right now... ugh.
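(Setting the PH plumbing aside, the core of the hack is just two documented Win32 calls. A minimal sketch, where ownerPid and handleValue are hypothetical inputs taken from the Handles table; getting PROCESS_DUP_HANDLE access to the System process is exactly the part where KProcessHacker normally helps out.)

```c
// Duplicate a thread handle out of another process and query the thread's
// creation/exit times -- the same data the Handle properties window shows.
#include <windows.h>
#include <stdio.h>

static void PrintThreadTimes(DWORD ownerPid, HANDLE handleValue)
{
    HANDLE owner = OpenProcess(PROCESS_DUP_HANDLE, FALSE, ownerPid);
    HANDLE thread = NULL;

    if (!owner)
        return; // likely ERROR_ACCESS_DENIED for the System process

    if (DuplicateHandle(owner, handleValue, GetCurrentProcess(), &thread,
                        THREAD_QUERY_LIMITED_INFORMATION, FALSE, 0))
    {
        FILETIME created, exited, kernelTime, userTime;

        // For a long-dead thread the exit time is nonzero; both values are
        // FILETIMEs in 100ns units.
        if (GetThreadTimes(thread, &created, &exited, &kernelTime, &userTime))
        {
            printf("create=%08lx%08lx exit=%08lx%08lx\n",
                   created.dwHighDateTime, created.dwLowDateTime,
                   exited.dwHighDateTime, exited.dwLowDateTime);
        }
        CloseHandle(thread);
    }
    CloseHandle(owner);
}
```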
You would have to open the thread context menu more than 80,000 times to generate that same amount of handle leaking, and even then the handles would only be leaking in the ProcessHacker.exe process, not in other processes like System.
The last KPH update was in 2016, and if it was doing something like that we'd know by now 😂
The kernel doesn't leak handles - it's 100% a third-party driver.
You can enable tracing and capture a live kernel dump. You'll need a debugger somewhere for this, but not on that machine.
No. This type of object tracing requires a debugger.
Windows doesn't record object traces by default. You have to enable the feature and capture a live kernel dump.

First, enable object reference tracing only for objects tagged as threads (a reconstructed sketch of the registry file for this is below). It will only work for threads created after it's enabled... the documentation says this should start immediately after adding the registry keys, but that didn't work for me, and you'll more than likely have to reboot.

After it's enabled, just wait until you can see a few leaked threads, then use the Process Hacker > Tools menu > Create Live Dump option to capture a kernel memory dump; the only requirement is that you're an administrator.

Now make sure you disable the handle snapshot setting. When the snapshot feature is disabled, open the System process and copy the object address of one of the leaked thread handles. Note: if the object address is blank/empty, it's because the handle snapshot setting wasn't disabled.

Now open the dump file and execute !obtrace on that object address. This should indicate the driver/process that created the thread and/or referenced the thread, and with the reference counters you can determine which calls forgot to close/dereference the handle. Do this for a few different threads that have leaked and you should be able to identify the culprit (and don't forget to undo the registry changes later). If you need some help understanding the stacks, then feel free to post them here.
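(The original comment attached the actual files; here's a reconstructed sketch. The value names under the Kernel key are assumptions based on how I recall GFlags persisting its "Object Reference Tracing" settings, so verify them by setting the options in the GFlags GUI; "Thre" is the pool tag for thread objects.)

```reg
Windows Registry Editor Version 5.00

; Assumed value names -- confirm via the GFlags GUI (Kernel Flags tab,
; Object Reference Tracing). The multi-string is the pool tag "Thre".
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel]
"ObTraceSelectedPoolTags"=hex(7):54,00,68,00,72,00,65,00,00,00,00,00
"ObTracePermanent"=dword:00000001
```

Then, with the live dump open in WinDbg, where the address is whatever you copied from Process Hacker's Object address column:

```
0: kd> !obtrace <ObjectAddress>
```

The output lists each reference/dereference of the object along with a captured stack, which is what points at the responsible driver.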
Thanks for the advice! 👍👍👍 Right now it seems that whatever the trigger for this is, it isn't happening (naturally, it stops just when I want it to occur so I can debug it); and I had an unintended system reboot. So I'm monitoring, and I'll see if this manifests again.

Incidentally, the crash I mentioned was related to an NVIDIA driver problem. Basically, it's randomly having fits occasionally, mostly when the system is idle with the displays auto-powered-down: it undergoes something like 100 TDRs, every 15 seconds, one after another, and then the display doesn't work anymore and the system is almost entirely locked up (all input devices are broken, the Num Lock LED is stuck, it won't respond to shutdown requests; but it apparently does continue to run processes and, bizarrely, still happily serves stuff up over SMB). So I'm increasingly suspicious that perhaps recent versions of the NVIDIA driver have had one or more bugs that could be causing both the leaked thread handles and the TDRs. (They've released a pretty ridiculous number of hotfix drivers lately, including one for a DPC Watchdog Violation BSoD some people were reporting. Dunno what's going on with them lately.)
Is this a recent issue? I did recently start getting some very weird csrss/dwm/logoff crash after updating the NVIDIA driver on my other machine.
It does actually seem that NVIDIA has had some weird stuff going on with their drivers over the past few weeks to a month, roughly. They've had an unusual number of driver updates (including hotfix releases) in that timeframe, including release notes that acknowledge some DPC Watchdog Violation BSoDs, first as a known issue and then as something that's allegedly been fixed. Also, a close friend told me that people using SteamVR over the past couple of weeks have basically been told by Valve, "we really, really, really recommend you roll back your driver if you updated recently," because of some sort of system crashing/freezing issue(s) that have been going on.
Prior to May 28, I was on 1909 with NVIDIA 466.27, and all was well. No Hardware-Accelerated GPU Scheduling, because the OS version didn't support it. On May 28, I updated to 21H1. On May 30, I updated NVIDIA to 466.47. Within a few days, I enabled HAGS and rebooted. Nothing seemed particularly wrong.

Around June 6-8, I caught on to the crazy number of thread handles in the System process and started looking into that particular weirdness. The system had been up since June 3, and it's likely that the thread handle leakage happened before I really caught on to it.

Then, just after midnight on June 10, there was a TDR flood, and the system was frozen up with no displays. I forced a shutdown at around 1:00am and updated NVIDIA to 466.74, I believe. There was another TDR flood around 8:00am, and I noticed the system was frozen when I attempted to use it at 1:00pm. I forced a reboot once again and then updated to NVIDIA 466.79. (Because within like a single-digit number of hours, NVIDIA had pushed two new drivers, lol.)

Since then, running on 466.79 with HAGS enabled, I've had uninterrupted uptime of about two days with no system freezes or thread handle leakage in the System process. I do see four randomly scattered TDR reports in the event log over the past 18 hours or so, but with no overall negative impact on the system that I can discern. But I still get the sense that NVIDIA hasn't entirely cleared up all of the fuck-ups related to whatever they messed around with recently. What a mess. (I also kinda suspect that at least some portion of the problems here might be related to HAGS, but I have no real insight into that, so I'm just sorta guessing.)
Well, it was actually looking pretty good for a while. But then, on the past two consecutive nights, the system did the chain of TDRs followed by that odd almost-totally-but-not-100%-frozen state. Both times. Unfortunately, I haven't had a chance to get any useful information about whether System process thread handle leaks are connected to this stuff.

At one point, I had Sysinternals Process Monitor logging all process and thread creation and termination events to a file. But the file was "corrupt" after the forced reboot, so Process Monitor wouldn't let me open it. And the time when the SHTF doesn't seem to correlate with any plausible thing I can think of, which of course makes it painful to track down.

I guess I could probably do some kernel debugger stuff on this, particularly since it seems that the network stack stays functional in spite of the system's otherwise vegetative state. But if I do do that, it'll have to be on a future boot. I did make a new boot menu entry with kernel debugging enabled a few days back, so perhaps I'll give that some usage.

At this point, I'm probably gonna downgrade from 466.79 back to 466.27 and see how things fare. An alternate plan would be to turn HAGS off and see if stuff magically stops having problems. *grumbling noises*
Okay, well, NVIDIA 471.11 is out today, and apparently this is their first driver release to officially support 21H1. Which is somewhat baffling to me, given that 21H1 has been rolling out to customers for well over a month, that there are constant betas, and that 21H1 isn't even a particularly big or change-heavy update... but what do I know. Maybe I'll stop seeing random TDRs. Or maybe not.

By the way: the thread timeline column in 7c7b7a6 is neat 😛
I found what looks to be a thread handle leak here. I believe this should be the right fix for it.
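(For readers skimming the conversation: the actual change is in this PR's diff, which isn't reproduced here. As a generic illustration of the bug class only, not the PR's code, the shape of such a leak and its fix is a thread handle opened for a one-off query that must be closed on every path.)

```c
// Illustrative only -- not the code changed by this PR.
#include <windows.h>

ULONG64 QueryThreadCycles(DWORD threadId)
{
    HANDLE thread = OpenThread(THREAD_QUERY_LIMITED_INFORMATION, FALSE, threadId);
    ULONG64 cycles = 0;

    if (thread)
    {
        if (!QueryThreadCycleTime(thread, &cycles))
            cycles = 0;
        CloseHandle(thread); // the fix: close the handle on every path,
                             // not only on the success path
    }
    return cycles;
}
```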