Description
NVIDIA Open GPU Kernel Modules Version
550.90.07
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Linux Mint 21.3
Kernel Release
6.8.0-50-generic #51~22.04.1-Ubuntu
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
RTX 4090 and RTX 4060 Ti
Describe the bug
Is it possible to enable P2P between 4090 and 4060Ti cards?
My motherboard is an Asus Pro WS X299 SAGE II. I turned on large BAR and disabled the IOMMU in the BIOS.
Next I installed open-gpu-kernel-modules-550.90.07-p2p using the install.sh script, then the driver: NVIDIA-Linux-x86_64-550.90.07.run --no-kernel-modules
nvidia-smi works fine, but p2pBandwidthLatencyTest gives the following output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 1a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4060 Ti, pciBusID: 68, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 0
1 0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 911.08 6.27
1 6.26 244.87
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 913.21 6.27
1 6.25 245.25
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 915.33 8.49
1 8.63 244.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 918.31 8.43
1 8.63 244.56
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.44 20.43
1 20.54 1.20
CPU 0 1
0 2.25 6.10
1 5.98 2.24
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.45 20.61
1 11.36 1.20
CPU 0 1
0 2.22 5.93
1 6.19 2.23
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
This result indicates that P2P is not working. The manual says that both the 4090 and the 4060 Ti should be supported. Is there anything that can be done to enable P2P?
To Reproduce
I followed the installation instructions for driver version 550.
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response
Activity
mylesgoose commentedon Jan 3, 2025
I've seen a few people having issues with mixed cards. One user has an A6000 and four 4090s, and P2P didn't work until he disabled the A6000.
Ivan04012025 commentedon Jan 4, 2025
Thank you for the reply. It's a pity that P2P doesn't work with mixed cards.
ilovesouthpark commentedon Jan 5, 2025
"Manual says that both 4090 and 4060Ti should be supported" which manual says that? You may mess up with the p2p mod and the original open-gpu-kernel-modules.
_"Normally, P2P on NVIDIA cards uses MAILBOXP2P. This is some hardware interface designed to allow GPUs to transfer memory back in the days of small BAR. It is not present or disabled in hardware on the 4090s, and that's why P2P doesn't work. There was a bug in early versions of the driver that reported that it did work, and it was actually sending stuff on the PCIe bus. However, because the mailbox hardware wasn't present, these copies wouldn't go to the right place. You could even crash the system by doing something like torch.zeros(10000,10000).cuda().to("cuda:1")
In some 3090s and all 4090s, NVIDIA added large BAR support."_
Ivan04012025 commentedon Jan 5, 2025
For example, here. At the end there is a table of compatible GPUs:
https://github.com/tinygrad/open-gpu-kernel-modules/tree/535.54.03
However, I installed the 550.90.07 version of the driver and the open GPU kernel modules, so I am wondering: should I set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module parameter to 1 in this version, or is it already set by default? I didn't do that.
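For reference, a minimal sketch of how that parameter could be set persistently; the config file name here is arbitrary, and the `update-initramfs` step applies to Debian/Ubuntu-based distros such as Mint:

```shell
# Set NVreg_OpenRmEnableUnsupportedGpus=1 whenever the nvidia module is loaded.
echo 'options nvidia NVreg_OpenRmEnableUnsupportedGpus=1' | sudo tee /etc/modprobe.d/nvidia-open.conf

# Rebuild the initramfs so the option also applies when the module loads
# early in boot (Debian/Ubuntu-based systems).
sudo update-initramfs -u

# After a reboot, the active RM parameters can be inspected here:
grep OpenRmEnableUnsupportedGpus /proc/driver/nvidia/params
```

This is a config fragment only; whether the p2p fork already defaults the parameter to 1 would need to be checked in its source.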
mylesgoose commentedon Jan 5, 2025
@Ivan04012025 I think he means that the list of compatible GPUs refers to cards compatible with the original NVIDIA open driver. For P2P he mentioned you need a GPU with large BAR support. Perhaps the 4060 does not support that.
Ivan04012025 commentedon Jan 5, 2025
Then is there a way to find out whether the 4060 has large BAR support or not?
mylesgoose commentedon Jan 5, 2025
@Ivan04012025 The spec page for the 4060 says it supports Resizable BAR:
"Resizable BAR is an advanced PCI Express feature that enables the CPU to access the entire GPU frame buffer at once, improving performance in many games." You can see the BAR information from dmesg or system info, or with CPU-X on Linux.
Ivan04012025 commentedon Jan 5, 2025
In system info or CPU-X I do not see any info about BAR support. dmesg gives a lot of output and I don't know which line is about BAR support on the 4060. I copied this output, maybe you can help with that?
dmesg.log
NVIDIA settings software shows: "Resizable BAR: Yes" on both 4090 and 4060 GPUs
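One way to check directly, sketched below against sample `lspci -v` memory lines (the sample values are illustrative, copied from a 4090; on a live system run `sudo lspci -s 68:00.0 -v` for the 4060, whose bus ID `68` comes from the test output above):

```shell
# Illustrative lspci -v "Memory at" lines; on a real system replace the
# here-variable with:  sudo lspci -s 68:00.0 -v
sample='Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
Memory at 28800000000 (64-bit, prefetchable) [size=32G]
Memory at 28400000000 (64-bit, prefetchable) [size=32M]'

# BAR1 is the large 64-bit prefetchable region; extract its size.
bar1_size=$(printf '%s\n' "$sample" |
  awk '/64-bit, prefetchable/ {
         match($0, /size=[0-9]+[KMG]/)
         print substr($0, RSTART + 5, RLENGTH - 5)
         exit
       }')
echo "BAR1 size: $bar1_size"
```

If the reported BAR1 size covers the card's full VRAM (e.g. 32G on a 24 GB 4090), large BAR is active; a small value like 256M means the CPU can only see a window of the frame buffer.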
mylesgoose commentedon Jan 5, 2025
@Ivan04012025 "In some 3090s and all 4090s, NVIDIA added large BAR support."
tiny@tiny14:~$ lspci -s 01:00.0 -v
01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 510b
Physical Slot: 49
Flags: bus master, fast devsel, latency 0, IRQ 377
Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
Memory at 28800000000 (64-bit, prefetchable) [size=32G]
Memory at 28400000000 (64-bit, prefetchable) [size=32M]
I/O ports at 3000 [size=128]
Expansion ROM at b3000000 [virtual] [disabled] [size=512K]
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
Notice how BAR1 is size 32G. In H100, they also added support for a PCIe mode that uses the BAR directly instead of the mailboxes, called BAR1P2P. So, what happens if we try to enable that on a 4090?
We do this by bypassing the HAL and calling a bunch of the GH100 methods directly. Methods like kbusEnableStaticBar1Mapping_GH100, which maps the entire VRAM into BAR1. This mostly just works, but we had to disable the use of that region in the MapAperture function for some reason. Shouldn't matter.
[ 3491.654009] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
[ 3491.793389] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
Perfect, we now have the VRAM mapped. However, it's not that easy to get P2P. When you run ./simpleP2P from cuda-samples, you get this error.
[ 3742.840689] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU2 and GPU3
[ 3742.840762] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU3 and GPU2
[ 3742.841089] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
[ 3742.841106] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
[ 3742.841281] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
[ 3742.841292] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
[ 3742.865948] NVRM: GPU at PCI:0000:01:00: GPU-49c7a6c9-e3a8-3b48-f0ba-171520d77dd1
[ 3742.865956] NVRM: Xid (PCI:0000:01:00): 31, pid=21804, name=simpleP2P, Ch 00000013, intr 00000000. MMU Fault: ENGINE CE3 HUBCLIENT_CE1 faulted @ 0x7f97_94000000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED_KIND ACCESS_TYPE_VIRT_WRITE
Failing with an MMU fault. So you dive into this and find that it's using GMMU_APERTURE_PEER as the mapping type. That doesn't seem supported on the 4090. So let's see what types are supported: GMMU_APERTURE_VIDEO, GMMU_APERTURE_SYS_NONCOH, and GMMU_APERTURE_SYS_COH. We don't care about being coherent with the CPU's L2 cache, but it does have to go out the PCIe bus, so we rewrite GMMU_APERTURE_PEER to GMMU_APERTURE_SYS_NONCOH. We also no longer set the peer id that was corrupting the page table.
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.21GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.00000
mylesgoose commentedon Jan 5, 2025
Also, why is one of your GPUs only running at PCIe x8?