[MPS] Memory leak in nn.Linear
#132332
Comments
I believe there is already an issue about it.
I have a regression test prepared. I'll get a PR out and link to this bug.
PR for the regression test is in #132355. We need to investigate whether the memory leak is in PyTorch or MPS. If the former, the fix should be implemented in PyTorch, not MPS, and the regression test needs to be updated. Awaiting confirmation that this is a live bug even with the changes in macOS from #125217 (ref #125217 (comment)). We had anecdotal evidence that the memory leak in `MaxPool2d` was fixed there.
We have repro'ed the MaxPool2d issue. This was a bug in the refcounting of one of the MaxPool2dGrad intermediate tensors, which is fixed in the MPSGraph framework. The fix will be in upcoming releases...
Hi @hvaara! Thanks for the repro case. I can verify that this is a live bug, as you say, even with the changes that target the MaxPool2d memory leak issue, so this should be something different. Based on an initial look, it does show some similarities to the earlier problem in the sense that it's not recognized as a leak as such; instead, something in MPSGraph seems to be holding onto the memory, thinking it's doing the right thing.
@jhavukainen Thanks a lot for testing the repro case for `nn.Linear`! FYI, switching from `COMMIT_ADAPTIVE` to `COMMIT_AND_WAIT` makes the leaking issue in `nn.Linear` go away.
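A rough way to approximate the `COMMIT_AND_WAIT` behavior from the Python side, without patching PyTorch, is to force a sync after each forward pass. This is only a sketch; that `torch.mps.synchronize()` follows the same internal commit-and-wait path is an assumption:

```python
import torch

device = torch.device("mps")
linear = torch.nn.Linear(1024, 1024, device=device)

for step in range(100):
    y = linear(torch.randn(1024, 1024, device=device))
    # Committing the outstanding command buffer and waiting on it avoids
    # accumulating encoded work until the low-watermark flush kicks in.
    torch.mps.synchronize()
```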
@hvaara Oh, that's a great find, thanks! I'll convene with the MPSGraph experts to see if it rings any bells on what the root cause might be, but this definitely narrows it down.
@jhavukainen This is between `COMMIT_ADAPTIVE` and `COMMIT_AND_WAIT`, so it does not seem to be related to MPSGraph, correct?
@kulinseth I added that comment before we discussed the implication of COMMIT_ADAPTIVE and COMMIT_AND_WAIT having different behavior here. So yes, you are correct: it does not seem to be related to MPSGraph, as we concluded in our chat, but instead to how we keep encoding to the command buffer on the PyTorch side until we hit the low-watermark value and flush.
Ok @hvaara, thanks for your patience! Here are the results from my deep dive into the traces of Metal resources generated while executing the code snippet, and my current understanding of what's going on:
So in summary, it seems to me like the memory is managed as intended in this case. In its current form, COMMIT_ADAPTIVE is a bit opaque to the user, since it can make the memory seem like it's behaving erratically: the commit is automated to only happen once the local device hits the low memory watermark, which depends on the device you are running on. Additionally, there's the interplay of when the underlying Python objects get garbage collected, which is what finally releases the memory buffers assigned to them.

Let me know if this sounds reasonable, or if you think there's still something we missed here, or if this is causing a concrete issue on your side that would warrant changes to how the adaptive commit currently works. Here's also the adjusted script that I used to check that the underlying memory is eventually freed as the objects are garbage collected:
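The script itself did not survive in this thread; the sketch below reconstructs that kind of check, assuming the point is to show that dropping the Python references, collecting garbage, and emptying the MPS cache brings allocations back down (the `report` helper and tensor sizes are illustrative, not from the original):

```python
import gc
import torch

device = torch.device("mps")
linear = torch.nn.Linear(1024, 1024, device=device)

def report(tag):
    # current = bytes held by live tensors; driver = total bytes the MPS
    # driver has allocated for this process, including cached blocks.
    print(f"{tag}: current={torch.mps.current_allocated_memory() / 1e6:.1f} MB "
          f"driver={torch.mps.driver_allocated_memory() / 1e6:.1f} MB")

for step in range(10):
    x = torch.randn(1024, 1024, device=device)
    y = linear(x)
report("after loop")

# Drop the references, garbage-collect, and release cached blocks so the
# buffers can actually be returned to the driver.
del x, y, linear
gc.collect()
torch.mps.empty_cache()
torch.mps.synchronize()
report("after gc + empty_cache")
```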
I'll close this for now, since based on what I'm seeing this is not a memory leak in nn.Linear. @hvaara, please don't hesitate to reopen if it looks like this is not the case from your point of view, or to open a feature request if there is a need for additional controls limiting how much memory the COMMIT_ADAPTIVE approach can use in your application.
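For reference, the MPS allocator already exposes watermark controls through environment variables. A usage sketch follows; the specific ratios are illustrative, and whether these knobs also move the adaptive-commit flush point is an assumption:

```python
import os

# These must be set before torch initializes the MPS allocator.
# HIGH caps total allocations as a fraction of the recommended working
# set size; LOW sets the point at which cached memory starts being freed.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.7"
os.environ["PYTORCH_MPS_LOW_WATERMARK_RATIO"] = "0.5"

import torch
x = torch.ones(1024, 1024, device="mps")
```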
@hvaara, can you please take a look and comment on whether this addresses the issue?
Thanks a lot for investigating, everybody, and for the detailed notes from your deep dive, @jhavukainen! Highly appreciated! There are a couple of things that are still not clear to me. I'll prepare a notebook with some examples to better illustrate what I mean.
No problem! Sure, I'm happy to take a look once you have the notebook with examples ready.
🐛 Describe the bug
Under certain circumstances, `nn.Linear` will leak memory on MPS. The exact failure mode and the conditions that lead to the leak are unclear at this moment. I'll give an update when I have more information. Possibly related to #125217.
Steps to reproduce
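(The original reproduction script was lost in extraction. Below is a sketch of a minimal repro of this kind, on the assumption that the leak shows up as steadily growing driver-allocated memory; layer sizes and iteration counts are illustrative.)

```python
import torch

device = torch.device("mps")
linear = torch.nn.Linear(1024, 1024, device=device)

for step in range(10):
    # No references are kept across iterations, so driver-allocated
    # memory should stay flat; steady growth suggests a leak (or work
    # deferred by COMMIT_ADAPTIVE that has not been flushed yet).
    y = linear(torch.randn(1024, 1024, device=device))
    print(f"step {step}: driver_allocated="
          f"{torch.mps.driver_allocated_memory() / 1e6:.1f} MB")
```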
Example output
Versions
PyTorch version: 2.5.0a0+git5406e46
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.30.1
Libc version: N/A
Python version: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:49:57) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M3 Max
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git5406e46
[pip3] torchvision==0.20.0a0+61bd547
[conda] numpy 1.24.4 pypi_0 pypi
[conda] optree 0.11.0 pypi_0 pypi
[conda] torch 2.5.0a0+git5406e46 dev_0
[conda] torchvision 0.20.0a0+61bd547 dev_0
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen