
ExtendedTools: Fix a thread handle leak #913

Merged · 1 commit into winsiderss:master · Jun 10, 2021

Conversation

jgottula
Contributor

@jgottula jgottula commented Jun 9, 2021

I found what looks to be a thread handle leak here. I believe this should be the right fix for it.
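For context, the general shape of this bug class (sketched here with plain Win32 calls and a made-up helper, not the actual ExtendedTools code) is a thread handle that gets opened for a quick query and then dropped on one of the exit paths:

```c
/* Hypothetical sketch of the bug class, not the ExtendedTools code itself:
 * a thread handle opened for a quick query has to be closed on every exit
 * path, including the failure path. */
#include <windows.h>

BOOL QueryThreadStartTime(DWORD threadId, FILETIME *createTime)
{
    FILETIME exitTime, kernelTime, userTime;
    HANDLE threadHandle;
    BOOL result;

    threadHandle = OpenThread(THREAD_QUERY_LIMITED_INFORMATION, FALSE, threadId);
    if (!threadHandle)
        return FALSE;

    result = GetThreadTimes(threadHandle, createTime, &exitTime, &kernelTime, &userTime);

    /* The leaky variant returns early on failure without this call, so every
     * trip through the forgotten path permanently costs one thread handle. */
    CloseHandle(threadHandle);

    return result;
}
```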

@jgottula
Contributor Author

jgottula commented Jun 9, 2021

Both this PR and #914 are the result of me combing through all the PH code as best I could, looking for any plausible thread handle leaks. I found just two potential ones, and both of them seem relatively obscure.


I should explain what prompted me to look into this. Something that's been bothering me lately is that I've noticed that after 5 days of uptime, my machine has 282,000 handles open, of which 116,000 are in the System process. And over 109k of the 116k open handles in System are Thread handles referring to various terminated unnamed threads.

[screenshot]

I'm pretty sure that's not normal... right? This is on a system with about 360 processes and 4800 threads at the moment. (A reasonable number of Chrome tabs open, sure; but I've had far more things running at once in the past without this happening.)

The thing that actually caught me off guard to begin with was that I was seeing 6-digit PIDs, which I had never seen before on this system (or on any Windows system really). It made my PID column overflow.

I did upgrade this system from W10 1909 to 21H1 a week and a half ago. So at first I thought, "huh, maybe in 21H1 they increased the upper bound on PIDs; kind of like what happened with the change to /proc/sys/kernel/pid_max in Arch Linux somewhat recently". But then I go look it up, and find that nope, that's not how that works on Windows; it's always pretty much been unbounded within the range of a DWORD. (But presumably you don't get PIDs climbing this high unless you either have a ton of processes/threads actually running at once, or else there's a bunch of "zombie" threads taking up PID slots, I imagine.)

I started wondering if maybe KProcessHacker was doing something naughty, since it's a kernel driver and runs in the System "process". But after looking through all the code, that doesn't really appear to be the case. I don't see anything that looks like it'd be leaking handles anywhere near this badly.

So, at this point I'm thinking it must be some new and exciting bug added to Windows 10 recently; or else some other kernel driver is screwing around leaking handles willy-nilly. (I suppose it could be NVIDIA... because I did happen to update to 466.47 at effectively the same time I did the OS update. 🤔)

Am I at least on the right track in thinking that for all these handles to be piling up in the System process, it must be the handiwork of one of the kernel drivers running on the system (if not the main kernel itself)? Also do you know offhand if there's any relatively easy way to trace back to which particular kernel thread or driver or whatever was responsible for opening the handle? Bonus points if it's possible without having to fire up the kernel debugger 😜

@jgottula
Contributor Author

Am I at least on the right track in thinking that for all these handles to be piling up in the System process, it must be the handiwork of one of the kernel drivers running on the system (if not the main kernel itself)? Also do you know offhand if there's any relatively easy way to trace back to which particular kernel thread or driver or whatever was responsible for opening the handle? Bonus points if it's possible without having to fire up the kernel debugger 😜

@dmex Any thoughts on this?

One additional thing I realized, after my previous comment here, is that (based on the handles' names, which show the process/thread attribution for each handle) these handles to long-dead threads are not just being created and leaked by the System process; they are also all handles to threads which were themselves created in the System process back when they were alive.

So it appears that it's not just that some kernel component or driver is leaking thread handles to random unrelated threads in other processes; but instead, that some kernel component or driver is creating a ton of probably-short-lived threads within the System process itself, and on top of that, is also leaking handles to those threads. Does that sound right?

If true, then it might actually be somewhat easier to track down who's responsible for this activity, since it seems probable that the same entity that's leaking the thread handles is also the entity creating the threads in question. So I can presumably just monitor ongoing thread creation and termination within the System process, log all of the creations and terminations, and get stack traces, which ought to have a good chance of shedding light on who's at fault. Maybe...
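As a first pass (purely as a sketch, since it polls instead of using real event tracing, will miss very short-lived threads, and can't capture stack traces), something along these lines would at least log System-process thread churn with timestamps:

```c
/* Rough polling sketch: periodically snapshot all threads in the system,
 * keep only those owned by the System process (PID 4), and diff the TID
 * sets to report creations and exits. */
#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>
#include <string.h>

#define MAX_TIDS 65536

static DWORD prevTids[MAX_TIDS];
static size_t prevCount;

static size_t SnapshotSystemThreads(DWORD *tids, size_t max)
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    THREADENTRY32 te = { .dwSize = sizeof(te) };
    size_t n = 0;

    if (snap == INVALID_HANDLE_VALUE)
        return 0;

    if (Thread32First(snap, &te))
    {
        do
        {
            if (te.th32OwnerProcessID == 4 && n < max) /* System process */
                tids[n++] = te.th32ThreadID;
        } while (Thread32Next(snap, &te));
    }

    CloseHandle(snap);
    return n;
}

static int ContainsTid(const DWORD *tids, size_t count, DWORD tid)
{
    for (size_t i = 0; i < count; i++)
        if (tids[i] == tid)
            return 1;
    return 0;
}

int main(void)
{
    static DWORD curr[MAX_TIDS];

    prevCount = SnapshotSystemThreads(prevTids, MAX_TIDS);

    for (;;)
    {
        Sleep(1000);
        size_t currCount = SnapshotSystemThreads(curr, MAX_TIDS);

        for (size_t i = 0; i < currCount; i++)
            if (!ContainsTid(prevTids, prevCount, curr[i]))
                printf("%lu: System thread created, TID %lu\n", GetTickCount(), curr[i]);

        for (size_t i = 0; i < prevCount; i++)
            if (!ContainsTid(curr, currCount, prevTids[i]))
                printf("%lu: System thread exited,  TID %lu\n", GetTickCount(), prevTids[i]);

        memcpy(prevTids, curr, currCount * sizeof(DWORD));
        prevCount = currCount;
    }
}
```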

@jgottula
Contributor Author

I did notice that the Handle properties window does show some rather useful information (in particular: the creation and exit time of the thread being referred to by a thread handle). So I could potentially figure out some time correlations. And I did notice that there seemed to be some time clustering.

But it's impractical for me to open the Handle properties window for about 100,000 different thread handles. So I coded up a bit of a hack that basically duplicated the code for displaying the handle's thread creation time and exit time in the Handle properties window, in the form of additional columns in the Process properties window Handles table instead. That way, I could see all that information at once, and potentially copy it out into a text file etc.
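Roughly, what each of those extra columns has to do per thread handle looks like this (a bare Win32 sketch with a made-up helper name, not the actual Process Hacker plumbing):

```c
/* Hypothetical helper (not actual Process Hacker code): given a thread
 * handle that lives in another process, duplicate it into our own process
 * and query that thread's creation and exit times. */
#include <windows.h>

BOOL GetRemoteThreadTimes(DWORD ownerPid, HANDLE remoteHandle,
                          FILETIME *createTime, FILETIME *exitTime)
{
    HANDLE ownerProcess;
    HANDLE localHandle = NULL;
    FILETIME kernelTime, userTime;
    BOOL ok = FALSE;

    /* Opening the owning process for handle duplication is the step that,
     * via PhOpenProcess, failed with STATUS_ACCESS_DENIED in my case (see
     * the next paragraph). */
    ownerProcess = OpenProcess(PROCESS_DUP_HANDLE, FALSE, ownerPid);
    if (!ownerProcess)
        return FALSE;

    if (DuplicateHandle(ownerProcess, remoteHandle, GetCurrentProcess(),
                        &localHandle, THREAD_QUERY_LIMITED_INFORMATION,
                        FALSE, 0) &&
        GetThreadTimes(localHandle, createTime, exitTime, &kernelTime, &userTime))
    {
        ok = TRUE;
    }

    if (localHandle)
        CloseHandle(localHandle);
    CloseHandle(ownerProcess);
    return ok;
}
```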

Then, after coding this up, I realized that I was an idiot, since it wouldn't actually work: KProcessHacker, of course, doesn't like unsigned custom-built ProcessHacker.exe files. So the very first step, PhOpenProcess(... PROCESS_DUP_HANDLE ...) on the System process, immediately fails with STATUS_ACCESS_DENIED and nothing works. Yeah, that felt stupid.

I don't even know for sure if this is something I could effectively do with the plugin SDK. 🤔❔

I suppose in theory I could reboot with test signing enabled, and build a hacked KProcessHacker driver that doesn't check the ProcessHacker.exe signature... but rebooting would mean eliminating all the leaked handles that I know for sure are there right now... ugh.

@dmex dmex merged commit 59c3fff into winsiderss:master Jun 10, 2021
@dmex
Member

dmex commented Jun 10, 2021

I found just two potential ones, and they both seem to be relatively obscure

You would have to open the thread context menu more than 80,000 times to generate that amount of handle leaking, and even then the handles would only be leaking in the ProcessHacker.exe process, not in other processes like System.

I started wondering if maybe KProcessHacker was doing something naughty. Since it's a kernel driver and runs in the System "process". But that doesn't really appear to be the case, after looking through all the code. I don't see anything that looks like it'd be leaking handles anywhere near this badly.

The last KPH update was 2016 and if it was doing something like that we'd know by now 😂

all these handles to be piling up in the System process, it must be the handiwork of one of the kernel drivers running on the system (if not the main kernel itself)?

The kernel doesn't leak handles - it's 100% a third-party driver.

Also do you know offhand if there's any relatively easy way to trace back to which particular kernel thread or driver or whatever was responsible for opening the handle? Bonus points if it's possible without having to fire up the kernel debugger

You can enable tracing and capture a live kernel dump. You'll need a debugger somewhere for this but not on that machine.

I don't even know for sure if this is something I could effectively do with the plugin SDK.

No. This type of object tracing requires a debugger.

I suppose in theory I could reboot with test signing enabled, and build a hacked KProcessHacker driver that doesn't check the ProcessHacker.exe signature... but rebooting would mean eliminating all the leaked handles that I know for sure are there right now... ugh.

Windows doesn't record object traces by default. You have to enable the feature and capture a live kernel dump:
https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/object-reference-tracing

This registry file will enable object tracing only for objects tagged for Threads:
kernel.zip

It will only work for threads created after it's enabled... the documentation says tracing should start immediately after adding the registry keys, but that didn't work for me, and you'll more than likely have to reboot.

After it's enabled, just wait until you can see a few leaked threads, then use Process Hacker > Tools menu > Create Live Dump to capture a kernel memory dump - the only requirement is that you're an administrator:

[screenshot]

Now make sure you disable the EnableHandleSnapshot option in the Process Hacker settings:

[screenshot]

With the snapshot feature disabled, open the System process and copy the Object address:

[screenshot]

Note: If the object address is blank/empty then it's because the handle snapshot setting wasn't disabled.

Now open the dump file and execute !objtrace <object address> and you'll be able to see stack traces for everything that ever accessed/queried/touched that thread:

[screenshot]

This should indicate the driver/process that created and/or referenced the thread, and with the reference counters you can determine which calls forgot to close/dereference the handle.

Do this for a few different threads that have leaked and you should be able to identify the culprit (and don't forget to undo the registry changes later). If you need some help understanding the stacks then feel free to post them here.

@jgottula
Contributor Author

Thanks for the advice! 👍👍👍

Right now, it seems that whatever the trigger for this is isn't happening (naturally it stops happening right when I want it to occur so I can debug it); and I had an unintended system reboot. So I'm monitoring, and I'll see if this manifests again.


Incidentally, the unintended reboot I mentioned was related to an NVIDIA driver problem... basically the driver is randomly having fits occasionally, mostly when the system is idle with the displays auto-powered-down: it undergoes like 100 TDRs, every 15 seconds, one after another, and then the display doesn't work anymore and the system is almost entirely locked up (all input devices are broken, the Num Lock LED is stuck, it won't respond to shutdown requests; but it apparently does continue to run processes and, bizarrely, still happily serves stuff up over SMB).

So I'm increasingly suspicious that perhaps recent versions of the NVIDIA driver have had one or more bugs that could be causing both the leaked thread handles and the TDRs. (They've released a pretty ridiculous number of hotfix drivers lately, including one for a DPC Watchdog Violation BSoD some people were reporting. Dunno what's going on with them lately.)

@dmex
Member

dmex commented Jun 11, 2021

it undergoes like 100 TDRs, every 15 seconds

Is this a recent issue? I did recently start getting some very weird csrss/dwm/logout crash after updating the nvidia driver on my other machine.

@jgottula
Contributor Author

Is this a recent issue? I did recently start getting some very weird csrss/dwm/logout crash after updating the nvidia driver on my other machine.

It does actually seem that NVIDIA's had some weird stuff going on with their drivers over the past... few weeks to a month, roughly? They've had an unusual number of driver updates (including hotfix driver releases) in that timeframe. Including release notes that acknowledge some DPC Watchdog Violation BSoDs first as a known issue, and then as something that's allegedly been fixed. Also a close friend told me that people using SteamVR over the past couple weeks have basically been told by Valve "we really really really recommend you roll back your driver if you updated recently" because of some sort of system crashing/freezing issue(s) that have been going on.

| Date | Version | Type | Notes |
| --- | --- | --- | --- |
| 2021-04-29 | 466.27 | Normal | Everything was fine with this driver version, AFAIK. |
| 2021-05-18 | 466.47 | Normal | People seem to have started reporting more serious problems with this release. Scattered reports of game crashes, system crashes, displays losing signal, etc. |
| 2021-05-24 | 466.55 | Hotfix | Minor fixes that seem unrelated to the main problems. |
| 2021-06-03 | 466.63 | Normal | This is where even bigger problems started occurring. Specifically: DPC Watchdog Violation BSoDs on Kepler (600/700 series) and Turing (16 and 20 series). This driver release has been removed from the NVIDIA downloads page as if it never existed. |
| 2021-06-08 | 466.74 | Hotfix | Allegedly fixes the DPC Watchdog Violation BSoDs. |
| 2021-06-10 | 466.77 | Normal | Seemingly a do-over of 466.63, this time with the BSoD fix from 466.74. And support for the 3080 Ti and 3070 Ti. |
| 2021-06-10 | 466.79 | Hotfix | Apparently addresses some issue with displays flickering or just losing signal entirely in multi-monitor configurations. |

Prior to May 28, I was on 1909 with NVIDIA 466.27, and all was well. No Hardware Accelerated GPU Scheduling, because the OS version didn't support it.

On May 28, I updated to 21H1. On May 30, I updated NVIDIA to 466.47. Within a few days, I enabled HAGS and rebooted. Nothing seemed particularly wrong.

Around June 6~8, I caught onto the crazy number of thread handles in the System process and started looking into that particular weirdness. The system had been up since June 3, and it's likely that the thread handle leakage happened prior to when I really caught on to it.

Then, just after midnight on June 10, there was a TDR flood, and the system was frozen up with no displays. I forced shutdown at around 1:00am, and updated NVIDIA to 466.74, I believe.

There was another TDR flood around 8:00am, and I noticed the system was frozen when I attempted to use it at 1:00pm. Forced a reboot once again, and then updated to NVIDIA 466.79. (Because within like a single-digit number of hours, NVIDIA had pushed two new drivers, lol.)

Since then, running on 466.79 with HAGS enabled, I've had uninterrupted uptime of about two days with no system freezes or thread handle leakage in the System process. I do see four randomly scattered TDR reports in the event log over the past 18 hours or so, but with no overall negative impact on the system that I can discern.

But I still get the sense that NVIDIA hasn't entirely cleared up all of the fuck-ups related to whatever things they messed around with recently. What a mess.

(I also kinda suspect that at least some portion of the problems here might be related to HAGS, but I have no real insight into that so I'm just sorta guessing.)

@jgottula
Contributor Author

Well, it was actually looking pretty good for a while.

But then, on each of the past two consecutive nights, the system did the chain of TDRs followed by that odd almost-totally-but-not-100%-frozen state.

Unfortunately, I haven't had a chance to get any useful information about whether System-process thread handle leaks are connected to this stuff. At one point, I had Sysinternals Process Monitor logging all process and thread creation and termination events to a file. But the file was "corrupt" after the forced reboot, and so Procmon wouldn't let me open it.

And the time when the SHTF doesn't seem to correlate with any plausible thing I can think of. Which of course makes it painful to track down.

I guess I could probably do some kernel debugger stuff on this, particularly since it seems that the network stack stays functional in spite of the system's otherwise vegetative state. But if I do do that, it'll have to be on a future boot. I did make a new boot menu entry with kernel debugging enabled a few days back, so perhaps I'll put that to use.

At this point, I'm probably gonna downgrade from 466.79 back to 466.27 and see how things fare. Alternate plan would be to turn HAGS off and see if stuff magically stops having problems.

grumbling noises

@jgottula
Contributor Author

Okay well so NVIDIA 471.11 is out today, and apparently this is their first driver release to officially support 21H1. Which is somewhat baffling to me, given that 21H1 has been rolling out to customers for well over a month and that there are constant betas and that 21H1 isn't even a particularly big/change-heavy update... but what do I know.

Maybe I'll stop seeing random TDRs. Or maybe not.

By the way: the thread timeline column in 7c7b7a6 is neat 😛
