[BUG] DataEngineV2 Unable to attach a PV to a pod in the newer kernel #7190
Comments
@xinydev:
Yes, I have already inserted it. If the kernel module is not inserted, there will be an error log like:
Thanks @xinydev.
Would the module version be an issue? The instance manager pod is using nvme-cli v2.5 while the host is using v1.9.
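As a quick cross-check (a sketch; the namespace and the pod-name placeholder are assumptions), the two versions can be compared like this:
# Version reported by nvme-cli inside the instance-manager pod
kubectl -n longhorn-system exec <instance-manager-pod> -- nvme version
# Version reported by nvme-cli on the host
nvme version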
Tried to reproduce the issue and found the error when executing
dmesg shows
A new check of hostid and hostnqn was introduced in Linux kernel v6.5.
Probably in the
Yes, this is a solution, but a random
Don't know, but from the comment in the code it should be
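For context, nvme-cli reads these values from /etc/nvme/hostnqn and /etc/nvme/hostid. A minimal sketch of creating persistent values on the host (assuming nvme-cli is installed and uuidgen is available):
# Generate and persist a host NQN where nvme-cli expects it
nvme gen-hostnqn > /etc/nvme/hostnqn
# Generate and persist a host ID alongside it; per the discussion above,
# kernels since v6.5 check these values for consistency
uuidgen > /etc/nvme/hostid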
Pre Ready-For-Testing Checklist
longhorn/go-spdk-helper#60
Verified passed on master-head (longhorn-instance-manager c61da18). Created an ubuntu-22.04 cluster and upgraded the kernel version to
Then followed https://longhorn.io/docs/1.5.3/spdk/quick-start/ to set up the v2 engine environment and create a v2 volume with a pod. Everything works without problems.
This issue can be triggered if the 'extras' kernel module package is not up to date. In a recent case, nodes provided by the cloud vendor were rebooted and their OS kernel version was updated automatically, which triggered the error described in this ticket.
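On Ubuntu, a hedged sketch of bringing the extra modules back in line with the running kernel (the package naming follows Ubuntu's convention):
# Install the extra kernel modules matching the currently running kernel
sudo apt-get install linux-modules-extra-$(uname -r)
# Reload the NVMe-over-TCP module afterwards
sudo modprobe nvme-tcp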
Describe the bug (🐛 if you encounter this issue)
When using data engine v2, the pod stays in the ContainerCreating state, and the instance-manager logs an nvme discover execution failure. This problem only occurs on 6.5.0-060500rc6-generic; everything works normally after rolling back to 5.15.0-88-generic.
Logs are in the additional context below.
It seems there might be a compatibility issue between the NVMe userspace tool and the NVMe driver, but I am not sure in which version this issue first appeared.
To Reproduce
I found this problem when following the quick start on Ubuntu 20.04 (6.5.0-060500rc6-generic); after rolling back to kernel 5.15.0-88-generic, the problem was gone. A minimal sketch for confirming the symptom follows below.
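The pod and namespace names here are assumptions based on the quick start, not values from the original report:
uname -r                            # expect 6.5.0-060500rc6-generic
kubectl get pod volume-test         # stays in ContainerCreating
kubectl -n longhorn-system logs <instance-manager-pod> | grep -i discover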
Expected behavior
The pod reaches the Running state.
Support bundle for troubleshooting
Environment
Longhorn version: v1.5.3
Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
helm install longhorn . -n longhorn --create-namespace --set="defaultSettings.defaultReplicaCount=1,defaultSettings.v2DataEngine=true,longhornUI.replicas=1,persistence.defaultClassReplicaCount=1,csi.attacherReplicaCount=1,csi.provisionerReplicaCount=1,csi.resizerReplicaCount=1,csi.snapshotterReplicaCount=1"
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
kubeadm, flannel, single node
Node config
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM
Number of Longhorn volumes in the cluster: 1
Impacted Longhorn resources:
Additional context
instance-manager log
k describe pod volume-test
In the instance-manager pod
On the host
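For reference, the discovery call in question generally takes the form below; the transport address and service ID are placeholders, not values from the original logs:
nvme discover -t tcp -a <traddr> -s <trsvcid>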