-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix azure disk mount failure on coreos and some other distros #54334
fix azure disk mount failure on coreos and some other distros #54334
Conversation
/test pull-kubernetes-unit |
/test pull-kubernetes-kubemark-e2e-gce |
/test pull-kubernetes-unit |
glog.Errorf("failed to parse target from %v (%v), err %v", arr[0], name, err) | ||
continue | ||
} | ||
// as observed, targets 0-3 are used by OS disks. Skip them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used to use this heuristics but it is not reliable: on certain instances (last time I used F instance and found this test was different than D instance), this test fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which test are you refering to? You mean ""/sys/bus/scsi/devices"" on different VM instance is different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @rootfs I have tried coreos on Standard F1s
VM, it works well. Do you have any other instance suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paste all the info and logs on F instance, I attached two disks and try to detect Lun 1 disk:
azureuser@andy-coreosf /tmp $ ./detect-new-lun1
Azure sys disks paths: []there is nothing under /dev/disk/azure/
there is nothing under /dev/disk/azure/
there is nothing under /dev/disk/azure/
there is nothing under /dev/disk/azure/
/dev/sdd <nil>
azureuser@andy-coreosf /tmp $ ls /sys/bus/scsi/devices -lt
total 0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 target0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 target3:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 target5:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5/target5:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 5:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5/target5:0:0/5:0:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 5:0:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5/target5:0:0/5:0:0:1
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host5 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 3:0:1:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host3 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host4 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host4
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 0:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host1 -> ../../../devices/pci0000:00/0000:00:07.1/ata1/host1
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host2 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rootfs , thanks for the check.
The latest code already contains #44295, while current coreos issue is: since /dev/disk/azure/
is not populated in coreos, which means there would be no filtering, see: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/azure_dd/azure_common_linux.go#L119-L124. In the end, findDiskByLun
would return a wrong dev name which is actually os disk. So my fix is add back the skip logic if azureDisks
is empty.
I have also tested on RHEL F instance, my PR works well. Let me know if there is any other issue, thx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andyzhangx thanks, please glog a message here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rootfs done
setting consistent udev rules on coreos is a more robust fix |
WALinuxAgent should populate the same udev rules |
Hi @rootfs , I agree WALinuxAgent should create same udev rule, while current coreos image in azure does not have this udev rule, so lots of customers are complaining about azure disk mount failure of coreos. In this situation, I would suggest we should provide k8s side fix for this issue. You could refer to #50150 for more details, lots of customers complaining in the thread. add @jdumars for your comments, thx. |
The CoreOS I am using have this udev rules file already populated, does not fix anything in my case at least. |
@andyzhangx I don't know what is causing it, many times it works flawlessly. Sometimes when disks get detached it looks like the VM get stuck in {
"...": "...",
"provisioningState" : "updating"
} Why this happend is beyond my knowledge, perhaps the agent service? It renders the node useless at least. I currently have one node in this updating state (due to chart upgrade which moved the PVC from one node to another, source node still in "updating" state). Hopefully this times out, but I've seen it last for over 24h. In my current situation it has been waiting for 45 minutes - which is unacceptable. |
add log message
6442102
to
db3c5a7
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyzhangx, rootfs Associated issue: 50150 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Automatic merge from submit-queue (batch tested with PRs 54336, 54470, 54334, 54175). If you want to cherry-pick this change to another branch, please follow the instructions here. |
What this PR does / why we need it:
azure disk mount failure on coreos due to
/dev/disk/azure
not populated on coreos, and also on some other distros.After a long time back and forth discussion, I have decided to fix this issue on k8s side. Customers are leaving azure because of this udev issue, you could read details from #50150 . I have verifed the code works well on current coreos on azure.
Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged):fixes #50150: azure_dd: managed disks don't pass "FormatAndMount"
In coreos on azure and some old distros, one udev rule is not applied, which means
/dev/disk/azure
not populated, while current disk detection code logic relies on/dev/disk/azure
, so it would lead to azure disk mount failure, for more details, pls refer to #50150Special notes for your reviewer:
In v1.6, there is no such issue becuase it's not using /dev/azure/disk to detect the LUN of new disk, so I refer the code logic of v1.6:
https://github.com/kubernetes/kubernetes/blob/release-1.6/pkg/volume/azure_dd/vhd_util.go#L59-L66
while from v1.7, it's using /dev/azure/disk, the code logic does not apply to distro which does not have udev rules(/dev/azure/disk), see:
https://github.com/kubernetes/kubernetes/blob/release-1.7/pkg/volume/azure_dd/azure_common.go#L231-L236
Release note:
/sig azure
@rootfs @brendanburns @colemickens
@tlyng @karataliu @szarkos