[BUG] DataEngineV2 Unable to attach a PV to a pod in the newer kernel #7190

Closed
xinydev opened this issue Nov 23, 2023 · 14 comments
Assignees
Labels
area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. area/v2-data-engine v2 data engine (SPDK) kind/bug require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage
Milestone

Comments

@xinydev

xinydev commented Nov 23, 2023

Describe the bug (🐛 if you encounter this issue)

With the v2 data engine, the pod stays in the ContainerCreating state, and the instance-manager log shows that nvme discover fails. The problem occurs only on kernel 6.5.0-060500rc6-generic; everything works after rolling back to 5.15.0-88-generic.

The logs are in the Additional context section below.

There may be a compatibility issue between the NVMe userspace tool and the NVMe driver, but I am not sure which version introduced it.

To Reproduce

I hit this problem while following the quickstart on Ubuntu 20.04 with kernel 6.5.0-060500rc6-generic. After rolling back to kernel 5.15.0-88-generic, the problem was gone.

Expected behavior

The pod reaches the Running state.

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.5.3

  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
    helm install longhorn . -n longhorn --create-namespace --set="defaultSettings.defaultReplicaCount=1,defaultSettings.v2DataEngine=true,longhornUI.replicas=1,persistence.defaultClassReplicaCount=1,csi.attacherReplicaCount=1,csi.provisionerReplicaCount=1,csi.resizerReplicaCount=1,csi.snapshotterReplicaCount=1"

  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    kubeadm, flannel, single node

  • Node config

    • OS type and version: ubuntu 20.04.6 LTS
    • Kernel version: 6.5.0-060500rc6-generic does not work; 5.15.0-88-generic works well
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD): NVMe
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM

  • Number of Longhorn volumes in the cluster: 1

  • Impacted Longhorn resources:

    • Volume names:

Additional context

instance-manager log

[longhorn-instance-manager] time="2023-11-23T02:51:41Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:87" backendStoreDriver=v2 name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-r-7d684fee type=replica
[longhorn-instance-manager] time="2023-11-23T02:51:42Z" level=info msg="Creating a lvol bdev for the new replica" func="spdk.(*Replica).Create" file="replica.go:484" lvsName=nvme1n1 lvsUUID=c3f7b73f-c6db-4b07-8cbe-6474abcae7be replicaName=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-r-7d684fee
[longhorn-instance-manager] time="2023-11-23T02:51:42Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:87" backendStoreDriver=v2 name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0 type=engine
[2023-11-23 02:51:42.497368] tcp.c: 631:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
[2023-11-23 02:51:42.531600] tcp.c: 856:nvmf_tcp_listen: *NOTICE*: *** NVMe/TCP Target Listening on 172.168.0.211 port 20006 ***
[longhorn-instance-manager] time="2023-11-23T02:51:42Z" level=warning msg="Failed to get devices for address : and nqn nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0" func=nvme.GetDevices.func1 file="nvme.go:45" error="cannot find a valid nvme device with subsystem NQN nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0 and address :"
[longhorn-instance-manager] time="2023-11-23T02:51:42Z" level=info msg="Stopping NVMe initiator blindly before starting" func="nvme.(*Initiator).Start" file="initiator.go:98" name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16 subsystemNQN="nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0"
[longhorn-instance-manager] time="2023-11-23T02:51:42Z" level=info msg="Launching NVMe initiator" func="nvme.(*Initiator).Start" file="initiator.go:103" name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16 subsystemNQN="nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0"
[longhorn-instance-manager] time="2023-11-23T02:51:42Z" level=warning msg="Failed to discover" func="nvme.(*Initiator).Start" file="initiator.go:109" error="failed to execute: nvme [discover -q nqn.2014-08.org.nvmexpress:uuid:95c4f779-634f-4d33-826b-967ffc57096d -t tcp -a 172.168.0.211 -s 20006 -o json], output , stderr Failed to write to /dev/nvme-fabrics: Invalid argument\nfailed to add controller, error invalid arguments/configuration\n: exit status 1" name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16 subsystemNQN="nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0"
[longhorn-instance-manager] time="2023-11-23T02:51:45Z" level=warning msg="Failed to discover" func="nvme.(*Initiator).Start" file="initiator.go:109" error="failed to execute: nvme [discover -q nqn.2014-08.org.nvmexpress:uuid:95c4f779-634f-4d33-826b-967ffc57096d -t tcp -a 172.168.0.211 -s 20006 -o json], output , stderr Failed to write to /dev/nvme-fabrics: Invalid argument\nfailed to add controller, error invalid arguments/configuration\n: exit status 1" name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16 subsystemNQN="nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0"
[longhorn-instance-manager] time="2023-11-23T02:51:48Z" level=warning msg="Failed to discover" func="nvme.(*Initiator).Start" file="initiator.go:109" error="failed to execute: nvme [discover -q nqn.2014-08.org.nvmexpress:uuid:95c4f779-634f-4d33-826b-967ffc57096d -t tcp -a 172.168.0.211 -s 20006 -o json], output , stderr Failed to write to /dev/nvme-fabrics: Invalid argument\nfailed to add controller, error invalid arguments/configuration\n: exit status 1" name=pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16 subsystemNQN="nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0"

k describe pod volume-test

  Normal   Scheduled           3m9s               default-scheduler        Successfully assigned default/volume-test to xin-csi-1
  Warning  FailedAttachVolume  28s (x8 over 96s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16" : rpc error: code = Aborted desc = volume pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16 is not ready for workloads

in instance-manager pod

➜  ~ k exec -n longhorn instance-manager-4e940cc46a64d7999373770995c8ad95 -it -- bash
instance-manager-4e940cc46a64d7999373770995c8ad95:/ # nvme discover -q nqn.2014-08.org.nvmexpress:uuid:95c4f779-634f-4d33-826b-967ffc57096d -t tcp -a 172.168.0.211 -s 20006 -o json
Failed to write to /dev/nvme-fabrics: Invalid argument
failed to add controller, error invalid arguments/configuration
instance-manager-4e940cc46a64d7999373770995c8ad95:/ # nvme version
nvme version 2.5 (git 2.5)
libnvme version 1.5 (git 1.5)

in host

➜  ~ sudo nvme discover -q nqn.2014-08.org.nvmexpress:uuid:95c4f779-634f-4d33-826b-967ffc57096d -t tcp -a 172.168.0.211 -s 20006

Discovery Log Number of Records 1, Generation counter 1
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not required
portid:  0
trsvcid: 20006
subnqn:  nqn.2023-01.io.longhorn.spdk:pvc-c8f16e8d-afe3-4cc8-b00f-7b8763fa6f16-e-0
traddr:  172.168.0.211
sectype: none
➜  ~ sudo nvme version
nvme version 1.9
@xinydev xinydev added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage labels Nov 23, 2023
@derekbit
Member

@xinydev
Has the nvme-tcp kernel module been loaded?

@derekbit derekbit added investigation-needed Identified the issue but require further investigation for resolution (won't be stale) area/v2-data-engine v2 data engine (SPDK) labels Nov 23, 2023
@xinydev
Author

xinydev commented Nov 23, 2023

@xinydev Has the nvme-tcp kernel module been loaded?

Yes, I have already inserted it.

If the kernel module is not loaded, the error log looks like this instead:

Failed to open /dev/nvme-fabrics: No such file or directory
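The two failure modes mentioned here ("Failed to open" when the module is missing versus "Failed to write" when it is loaded) could be told apart up front with a check like the one below. This is only a sketch: the function name check_nvme_tcp and its two optional path arguments are illustrative and not part of Longhorn, and it assumes nvme_tcp is built as a loadable module (a built-in driver would not appear in /proc/modules).

```shell
#!/bin/sh
# Sketch: report whether the NVMe/TCP initiator prerequisites look present.
# Both paths can be overridden so the check can be exercised without root.
check_nvme_tcp() {
    fabrics_dev="${1:-/dev/nvme-fabrics}"
    modules_file="${2:-/proc/modules}"

    # /proc/modules lists one loaded module per line, name first.
    if ! grep -q '^nvme_tcp ' "$modules_file" 2>/dev/null; then
        echo "nvme_tcp module not loaded (try: modprobe nvme-tcp)"
        return 1
    fi

    # The fabrics control device is what nvme discover opens and writes to.
    if [ ! -e "$fabrics_dev" ]; then
        echo "$fabrics_dev is missing"
        return 1
    fi

    echo "nvme_tcp loaded and $fabrics_dev present"
}
```

Calling check_nvme_tcp with no arguments on the node inspects the real /proc/modules and /dev/nvme-fabrics. A passing result only rules out the missing-module case; it says nothing about the "Invalid argument" failure reported in this issue.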

@derekbit
Member

Thanks @xinydev.
Sounds like an issue related to the interaction between the kernel and the userspace tools. We will try to reproduce it first.

@derekbit
Member

cc @shuo-wu @DamiaSan

@derekbit derekbit added the area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. label Nov 23, 2023
@derekbit derekbit added this to the v1.6.0 milestone Nov 23, 2023
@ejweber ejweber moved this from New to Backlog Candidates in Community Review Sprint Nov 27, 2023
@innobead innobead changed the title [BUG]DataEngineV2 Unable to attach a PV to a pod in the newer kernel [BUG] DataEngineV2 Unable to attach a PV to a pod in the newer kernel Nov 29, 2023
@shuo-wu
Contributor

shuo-wu commented Nov 30, 2023

Could the module version be the issue? The instance-manager pod uses nvme-cli v2.5, while the host uses v1.9.
Could you check the module versions?
In my test env, the versions are:

$ modinfo nvme-tcp
filename:       /lib/modules/5.15.0-88-generic/kernel/drivers/nvme/host/nvme-tcp.ko
license:        GPL v2
srcversion:     7FFE7D1724F52D771EB5D26
depends:        nvme-core,nvme-fabrics
retpoline:      Y
intree:         Y
name:           nvme_tcp
vermagic:       5.15.0-88-generic SMP mod_unload modversions
......

$ modinfo nvme-fabrics
filename:       /lib/modules/5.15.0-88-generic/kernel/drivers/nvme/host/nvme-fabrics.ko
license:        GPL v2
srcversion:     9D0BFE90C6484EDBAE5E0FF
depends:        nvme-core
retpoline:      Y
intree:         Y
name:           nvme_fabrics
vermagic:       5.15.0-88-generic SMP mod_unload modversions
......

$ modinfo nvme-core
filename:       /lib/modules/5.15.0-88-generic/kernel/drivers/nvme/host/nvme-core.ko
version:        1.0
license:        GPL
srcversion:     DA0DD0E371F36C4A3E52F58
depends:
retpoline:      Y
intree:         Y
name:           nvme_core
vermagic:       5.15.0-88-generic SMP mod_unload modversions
......

$ nvme version
nvme version 2.5 (git 2.5)
libnvme version 1.5 (git 1.5)

@xinydev
Author

xinydev commented Dec 4, 2023

➜  ~ modinfo nvme-tcp
filename:       /lib/modules/6.5.11-060511-generic/kernel/drivers/nvme/host/nvme-tcp.ko.zst
license:        GPL v2
srcversion:     A5B1C9B60F86F438D4705E4
depends:        nvme-core,nvme-fabrics
retpoline:      Y
intree:         Y
name:           nvme_tcp
vermagic:       6.5.11-060511-generic SMP preempt mod_unload modversions

➜  ~ modinfo nvme-fabrics
filename:       /lib/modules/6.5.11-060511-generic/kernel/drivers/nvme/host/nvme-fabrics.ko.zst
license:        GPL v2
srcversion:     EB6EE0CE0E17086935C440A
depends:        nvme-core
retpoline:      Y
intree:         Y
name:           nvme_fabrics
vermagic:       6.5.11-060511-generic SMP preempt mod_unload modversions

➜  ~ modinfo nvme-core
filename:       /lib/modules/6.5.11-060511-generic/kernel/drivers/nvme/host/nvme-core.ko.zst
version:        1.0
license:        GPL
srcversion:     3E9317C3EF9ABEB3D74F2C0
depends:        nvme-common
retpoline:      Y
intree:         Y
name:           nvme_core
vermagic:       6.5.11-060511-generic SMP preempt mod_unload modversions

@derekbit
Member

derekbit commented Dec 22, 2023

I reproduced the issue. Executing nvme discover fails with this error:

[longhorn-instance-manager] time="2023-12-22T02:35:20Z" level=warning msg="Failed to discover" func="nvme.(*Initiator).Start" file="initiator.go:151" error="failed to execute: /usr/bin/nsenter [nsenter --mount=/proc/1/ns/mnt --ipc=/proc/1/ns/ipc --net=/proc/1/ns/net nvme discover -q nqn.2014-08.org.nvmexpress:uuid:5cd441a4-2841-499f-a06c-7b976fa3f88d -t tcp -a 10.42.2.225 -s 20006 -o json], output , stderr Failed to write to /dev/nvme-fabrics: Invalid argument\nfailed to add controller, error invalid arguments/configuration\n: exit status 1" name=pvc-1aeab07f-aefc-43c4-bf72-13239d811f78 subsystemNQN="nqn.2023-01.io.longhorn.spdk:pvc-1aeab07f-aefc-43c4-bf72-13239d811f78-e-0"

dmesg shows

[ 3657.165204] nvme_fabrics: found same hostid 8dc3ffce-d16d-457b-8011-dfd504e34cea but different hostnqn nqn.2014-08.org.nvmexpress:uuid:5cd441a4-2841-499f-a06c-7b976fa3f88d
[ 3660.172225] nvme_fabrics: found same hostid 8dc3ffce-d16d-457b-8011-dfd504e34cea but different hostnqn nqn.2014-08.org.nvmexpress:uuid:5cd441a4-2841-499f-a06c-7b976fa3f88d
[ 3662.533314] nvme_fabrics: found same hostid 8dc3ffce-d16d-457b-8011-dfd504e34cea but different hostnqn nqn.2014-08.org.nvmexpress:uuid:5cd441a4-2841-499f-a06c-7b976fa3f88d
[ 3663.178411] nvme_fabrics: found same hostid 8dc3ffce-d16d-457b-8011-dfd504e34cea but different hostnqn nqn.2014-08.org.nvmexpress:uuid:5cd441a4-2841-499f-a06c-7b976fa3f88d

@innobead innobead assigned derekbit and unassigned DamiaSan Dec 22, 2023
@derekbit
Member

Linux kernel v6.5 introduced a new check on the hostid and hostnqn pair.
https://elixir.bootlin.com/linux/v6.5/source/drivers/nvme/host/fabrics.c#L46

@DamiaSan
Contributor

Probably in the nvme connect command we have to specify hostid too: https://lore.kernel.org/linux-nvme/czfmbbq3cqzimjtckrtc3fctg2zar2rsly3prnzt45d6drlyjp@v7cc7srlhi7c/T/

@derekbit
Member

Probably in the nvme connect command we have to specify hostid too: https://lore.kernel.org/linux-nvme/czfmbbq3cqzimjtckrtc3fctg2zar2rsly3prnzt45d6drlyjp@v7cc7srlhi7c/T/

Yes, this is a solution, but is it OK to use a random hostid generated when the IM pod starts?

@DamiaSan
Contributor

Probably in the nvme connect command we have to specify hostid too: https://lore.kernel.org/linux-nvme/czfmbbq3cqzimjtckrtc3fctg2zar2rsly3prnzt45d6drlyjp@v7cc7srlhi7c/T/

Yes, this is a solution, but is it OK to use a random hostid generated when the IM pod starts?

I don't know, but judging from the comment in the code, it should be:

	/*
	 * We have defined a host as how it is perceived by the target.
	 * Therefore, we don't allow different Host NQNs with the same Host ID.
	 * Similarly, we do not allow the usage of the same Host NQN with
	 * different Host IDs. This'll maintain unambiguous host identification.
	 */
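One way to satisfy the rule in that comment is to pass a hostid that is always tied to the same hostnqn. The sketch below derives the hostid from the UUID already embedded in the hostnqn and passes both to nvme discover with the nvme-cli options --hostnqn and --hostid. This is an assumption for illustration only: the function names are hypothetical, and the thread does not say whether the linked fix derives the hostid this way or generates it separately.

```shell
#!/bin/sh
# Sketch: extract the UUID from a host NQN of the form
# "nqn.2014-08.org.nvmexpress:uuid:<uuid>" so it can be reused as the hostid.
hostid_from_hostnqn() {
    case "$1" in
        nqn.2014-08.org.nvmexpress:uuid:*) printf '%s\n' "${1##*:uuid:}" ;;
        *) return 1 ;;   # not a UUID-based host NQN
    esac
}

# Hypothetical wrapper: run discovery with a hostid that always matches the
# hostnqn, so the kernel never sees one hostid paired with two hostnqns.
discover_with_hostid() {
    hostnqn="$1"; addr="$2"; port="$3"
    hostid=$(hostid_from_hostnqn "$hostnqn") || return 1
    nvme discover --hostnqn="$hostnqn" --hostid="$hostid" \
        -t tcp -a "$addr" -s "$port" -o json
}
```

For example, `discover_with_hostid nqn.2014-08.org.nvmexpress:uuid:95c4f779-634f-4d33-826b-967ffc57096d 172.168.0.211 20006` would repeat the failing command from the report with a matching hostid. Deriving the hostid from the hostnqn keeps the pair stable across restarts, which a freshly generated random hostid would not unless the hostnqn were regenerated with it.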

@derekbit derekbit removed the investigation-needed Identified the issue but require further investigation for resolution (won't be stale) label Dec 22, 2023
@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 25, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Does the PR include the explanation for the fix or the feature?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

longhorn/go-spdk-helper#60
longhorn/longhorn-instance-manager#344

  • Which areas/issues this PR might have potential impacts on?
    Area: v2 volume attachment/detachment
    Issues

@yangchiu
Member

Verified passed on master-head (longhorn-instance-manager c61da18). Created an ubuntu-22.04 cluster and upgraded kernel version to 6.6.8-060608-generic:

$ wget https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh
$ sudo install ubuntu-mainline-kernel.sh /usr/local/bin/
$ sudo ubuntu-mainline-kernel.sh -c
$ sudo ubuntu-mainline-kernel.sh -i
$ sudo reboot
$ uname -r
# 6.6.8-060608-generic

Then I followed https://longhorn.io/docs/1.5.3/spdk/quick-start/ to set up the v2 engine environment and create a v2 volume with a pod. Everything works without problems.

@c3y1huang c3y1huang moved this from Backlog Candidates to Resolved/Scheduled in Community Review Sprint Jan 30, 2024
@shuo-wu
Contributor

shuo-wu commented Aug 1, 2024

This issue can also be triggered if the 'extras' kernel module package is not up to date.

In my recent case, the nodes provided by the cloud vendor were rebooted, and their OS kernel version may be updated automatically on reboot. The error described in this ticket then appears.
The solution is to upgrade the package to the one matching the running kernel:

apt install linux-modules-extra-$(uname -r)
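Whether a node needs that package can be checked by looking for the nvme-tcp module file under the running kernel's module directory, which is where the modinfo output earlier in this thread locates it. The sketch below assumes the module ships as a file named nvme-tcp.ko, optionally with a compression suffix; the function name has_nvme_tcp_module is illustrative, and a kernel with the driver built in would fail this check without actually being broken.

```shell
#!/bin/sh
# Sketch: succeed if an nvme-tcp module file exists for the given kernel's
# module tree (defaults to the running kernel).
has_nvme_tcp_module() {
    moddir="${1:-/lib/modules/$(uname -r)}"
    [ -n "$(find "$moddir" -name 'nvme-tcp.ko*' 2>/dev/null | head -n 1)" ]
}
```

Running `has_nvme_tcp_module || echo "nvme-tcp module missing for $(uname -r)"` after an unattended kernel update would flag nodes where the extras package still has to be installed.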

Projects
Status: Resolved
Status: Closed

6 participants