
Have Cinder mount tasks use new kubelet functions #49786

Closed
wants to merge 1 commit

Conversation

jamiehannaford
Contributor

What this PR does / why we need it:

Updates Cinder mount and unmount tasks to use new kubelet checks introduced in #48402.

Since the current implementation relies on the old IsLikelyNotMountPoint, we're seeing a lot of problems: the kubelet ignores existing mounts (it thinks they're not mount points) and therefore fails to unmount/remount them. This means that when nodes are rebooted, pods with PVs enter a crash loop because they reference the wrong device paths.

I also added more logging to help with debugging mount/unmount.
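For context, here's a minimal sketch of the heuristic IsLikelyNotMountPoint relies on (illustrative only, not the actual kubelet code): it compares the device number of a path with that of its parent directory, which is cheap but misses bind mounts and can be fooled by stale mounts.

// Sketch of the device-number heuristic (not the kubelet implementation).
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

func likelyNotMountPoint(path string) (bool, error) {
    stat, err := os.Stat(path)
    if err != nil {
        // Any stat error is treated as "not a mount point".
        return true, err
    }
    parentStat, err := os.Stat(filepath.Dir(path))
    if err != nil {
        return true, err
    }
    // Same device number as the parent => probably not a mount point.
    return stat.Sys().(*syscall.Stat_t).Dev == parentStat.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
    notMnt, err := likelyNotMountPoint("/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts")
    fmt.Println("notMnt:", notMnt, "err:", err)
}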

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged):

Related to #47548 and #49503.

Release note:

Improved stability for Cinder mounts/unmounts

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label on Jul 28, 2017
@k8s-ci-robot
Contributor

Hi @jamiehannaford. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) label on Jul 28, 2017
@jamiehannaford
Contributor Author

/sig openstack

@k8s-ci-robot added the area/provider/openstack (Issues or PRs related to openstack provider) label on Jul 28, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamiehannaford
We suggest the following additional approver: anguslees

Assign the PR to them by writing /assign @anguslees in a comment when ready.

Associated issue: 47548

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) and release-note (Denotes a PR that will be considered when it comes time to generate release notes.) labels on Jul 28, 2017
@dims
Member

dims commented Jul 28, 2017

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) label on Jul 28, 2017
@msau42
Member

msau42 commented Jul 28, 2017

IsLikelyNotMountPoint doesn't detect bind mounts based off of directories, which could be the case for local storage, but I didn't think that would be true for most other plugins. Could you clarify how the Cinder mounts work?

@jamiehannaford
Contributor Author

@msau42 My understanding (taken from here) is that:

  1. A Cinder volume is attached to an instance through the nova volume-attach command. This command creates a unique iSCSI IQN that is exposed to the compute node.
  2. The compute node, which runs the instance, now has an active iSCSI session and new local storage (usually a /dev/sdX disk).
  3. Libvirt uses that local storage as storage for the instance. The instance gets a new disk, usually a /dev/vdX disk.

The /dev/vdX disk is symlinked in all the usual places in /dev/disk/*. So for all intents and purposes, the kubelet will see it as a dir, I think. If IsLikelyNotMountPoint doesn't detect dirs, it leads to a lot of weird issues where the reconciler can neither unmount nor remount the disks that Cinder gives the host.

/cc @coreypobrien

@k8s-ci-robot
Contributor

@jamiehannaford: GitHub didn't allow me to request PR reviews from the following users: coreypobrien.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@jamiehannaford: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-e2e-gce-etcd3
Commit: 7f9c5ef
Rerun command: /test pull-kubernetes-e2e-gce-etcd3

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42
Member

msau42 commented Jul 28, 2017

Does /dev/sdX already have the filesystem mounted, or do you need to do an additional mount for the filesystem?

Other volume plugins generally have a flow like this (rough sketch after the list):

  1. Attach disk to node at /dev/X
  2. Mount /dev/X to <attacher's instructed global mount path>
  3. Bind mount to <pod's mount path>
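A rough sketch of steps 2 and 3 in code (device name and paths below are made up, not any particular plugin's values; running it needs root):

// Rough sketch of the generic "mount to global path, then bind mount into the
// pod dir" flow. Device and paths are illustrative.
package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    devicePath := "/dev/vdb"                                           // step 1: device the cloud provider attached
    globalMountPath := "/var/lib/kubelet/plugins/example/mounts/vol-1" // step 2 target (made-up path)
    podMountPath := "/var/lib/kubelet/pods/pod-uid/volumes/vol-1"      // step 3 target (made-up path)

    // Step 2: mount the device's filesystem at the attacher's global mount path.
    if err := os.MkdirAll(globalMountPath, 0750); err != nil {
        fmt.Println(err)
        return
    }
    if err := syscall.Mount(devicePath, globalMountPath, "ext4", 0, ""); err != nil {
        fmt.Println("mount failed:", err)
        return
    }

    // Step 3: bind mount the global mount into the pod's volume directory.
    if err := os.MkdirAll(podMountPath, 0750); err != nil {
        fmt.Println(err)
        return
    }
    if err := syscall.Mount(globalMountPath, podMountPath, "", syscall.MS_BIND, ""); err != nil {
        fmt.Println("bind mount failed:", err)
    }
}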

@jamiehannaford
Contributor Author

@msau42 AFAIK the /dev/sdX is already there on the compute host and looks like another block device (this is handled by Nova and doesn't concern Kubernetes). The flow you described is how the Cinder plugin works too. KVM mounts the disk to the VM under /dev/X, and then the kubelet binds the symlinks in /dev/disk/by-id to the pod's mount path. I'm not too familiar with the "attacher's instructed global mount path", so I'm not sure about the second step.

@msau42
Member

msau42 commented Jul 28, 2017

I think I'm missing the step between /dev/X, which is a block device, and getting to some /path, which is the filesystem mount.

@jamiehannaford
Contributor Author

@msau42 I was under the assumption that the Cinder plugin handled that step:

https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/cinder/attacher.go#L264

@msau42
Member

msau42 commented Jul 28, 2017

Ah ok, so then I don't think you should be hitting the issue with bind mounting directories, because this should be doing a real mount of the filesystem from the disk.

@jamiehannaford
Contributor Author

@msau42 We're definitely running into issues with IsLikelyNotMountPoint in production, though.

If you reboot a node and a pod is rescheduled to a different node, the volume manager applies the attach/detach logic twice to the new node (this is a separate bug tracked in #49503). This means that in quick succession, there will be different device paths (for example /dev/vda the first time, and /dev/vdb the second time) on the new node.

The kubelet however never updates to /dev/vdb because IsLikelyNotMountPoint does not think /dev/vda is a mount. When the attacher uses the new IsNotMountPoint function, everything works as expected.

@msau42
Member

msau42 commented Jul 28, 2017

/dev/vda and /dev/vdb shouldn't be mounts because they're device files and not mount points. IIUC, /dev/vda or /dev/vdb should be devicePath. IsLikelyNotMountPoint is run on deviceMountPath, which is different. I'm just trying to understand the root cause, and the current theory doesn't seem quite right to me. Changing the call to IsNotMountPoint could impact performance if there's a lot of mounts, so we should be careful that we only use it if absolutely necessary.
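For comparison, a sketch of the extra work a mount-table-based check like IsNotMountPoint has to do (illustrative, not the actual mount util code): after any cheap heuristic it falls back to scanning every entry in the mount table, which is why it gets expensive on nodes with many mounts.

// Sketch of a mount-table-based check: resolve symlinks, then look for the
// path among all mount targets in /proc/mounts. Not the kubelet source.
package main

import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func isNotMountPoint(path string) (bool, error) {
    resolved, err := filepath.EvalSymlinks(path)
    if err != nil {
        return true, err
    }
    f, err := os.Open("/proc/mounts")
    if err != nil {
        return true, err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        // fields[1] is the mount target; a match means the path is a mount point.
        if len(fields) >= 2 && fields[1] == resolved {
            return false, nil
        }
    }
    return true, sc.Err()
}

func main() {
    for _, p := range os.Args[1:] {
        notMnt, err := isNotMountPoint(p)
        fmt.Println(p, "notMnt:", notMnt, "err:", err)
    }
}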

@msau42
Member

msau42 commented Jul 28, 2017

/sig storage

Maybe some Cinder experts here understand this better than I do.

@k8s-ci-robot added the sig/storage (Categorizes an issue or PR as relevant to SIG Storage.) label on Jul 28, 2017
@dims
Member

dims commented Jul 31, 2017

@jungleboyj @stmcginnis Can you please take a look?

@dims
Member

dims commented Jul 31, 2017

/cc @jsafrane @anguslees

@stmcginnis
Contributor

stmcginnis commented Jul 31, 2017

@msau42 is correct that /dev/vd* are device paths, not mount points.

The expected flow described above appears to be correct from my understanding.

I think you would need to use IsNotMountPoint, but I do not have the full background on these commands, so that is just based on my <5 minute read through of the history here.

@patrick-east

The kubelet however never updates to /dev/vdb because IsLikelyNotMountPoint does not think /dev/vda is a mount. When the attacher uses the new IsNotMountPoint function, everything works as expected.

To echo @stmcginnis, the /dev/vdX paths are the raw block devices, which means they should always end up with a true notMnt value. If it is actually seeing /dev/vdX paths for the deviceMountPath check, then something else is wrong. That would mean it is always going to do the format and mount... which doesn't sound particularly safe.

Looking at https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/cinder/attacher.go#L264 and assuming that the deviceMountPath passed into the IsNotMountPoint method is coming from
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/cinder/attacher.go#L253, you should never see it trying to decide whether /dev/vda is a mount or not. It should be some /var/lib/kubelet/plugins/foo path, which is confusing given the info in this thread.

I'd be curious to see exactly what paths are being used, and what the system looks like when this bug occurs. It might be that this change is "fixing" the issue by covering up some other underlying problem.
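For reference, the deviceMountPath should be derived roughly like this, keyed by the Cinder volume ID rather than by /dev/vdX (a sketch with an assumed layout and a hypothetical helper name, not the attacher source):

// Rough sketch of deriving the per-volume global mount dir under the kubelet
// plugin dir. The helper and layout are assumptions, not the attacher source.
package main

import (
    "fmt"
    "path/filepath"
)

func cinderDeviceMountPath(kubeletDir, volumeID string) string {
    return filepath.Join(kubeletDir, "plugins", "kubernetes.io/cinder", "mounts", volumeID)
}

func main() {
    fmt.Println(cinderDeviceMountPath("/var/lib/kubelet", "0328a7b2-4b0a-4e58-a396-b87247d51b18"))
    // /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18
}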

@j-griffith
Contributor

Also, to clarify: the /dev/vdX is the internal libvirt device (the device seen from the instance). The attached device will in fact be assigned to a new /dev/sdX device on the compute node. You can verify/correlate these things by looking in /dev/disk/by-path on your compute node.

The device files when attached (in disk/by-path) will look like this for LVM:
ip-192.168.33.29:3260-iscsi-iqn.2010-10.org.openstack:volume-f068a1a5-2bff-4c1f-a4b5-dcd90693585d-lun-0

Note that the target iqn is set in your cinder.conf with a default: iqn.2010-10.org.openstack

We then form the rest of the string above with the tgt-ip and the volume-<id>. The initiator iqn is going to come from /etc/iscsi/initiatorname.iscsi on the compute node.

@jamiehannaford
Contributor Author

@stmcginnis @patrick-east Apologies for the confusion. I meant to say that the kubelet doesn't seem to be recognizing the mount paths under /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts, not the block devices under /dev/vdX.

To rephrase the problem, this is what's typically happening:

  1. A pod with a Cinder PV exists on a node. That node is rebooted, and the pod is rescheduled to a new node.
  2. The volume manager handles the detach/attach, but ends up duplicating the logic (as described here: Volume reconciler duplicates attach/detach logic #49503). This means that Nova actually performs 2 attach calls.
  3. Kubelet only ever seems to mount to the first device path, not the second.

Here are the full logs of what we're seeing. See the post_reboot file.

You'll notice some discrepancies between what's actually mounted by Nova (vdf) and what the kubelet is pointing to (vdc):

core@jamie-terraform-worker-0 ~ $ mount | grep var/lib/kubelet/plugins | grep -v rkt
...
/dev/vdc on /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18 type ext4 (ro,relatime,seclabel,data=ordered)

core@jamie-terraform-worker-0 ~ $ ls -l /dev/disk/by-id/
total 0
...
lrwxrwxrwx. 1 root root  9 Jul 24 11:55 virtio-0328a7b2-4b0a-4e58-a -> ../../vdf

which made me think it's not handling unmounts correctly, possibly because the "is a mount" logic isn't handled well for Cinder volumes.
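Here's the check I've effectively been doing by hand above, as a small sketch (the paths are taken from our logs; the helper itself is just illustrative):

// Diagnostic sketch: compare the device the by-id symlink points at with the
// device actually backing the kubelet mount. Not kubelet code.
package main

import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func mountedDevice(target string) (string, error) {
    f, err := os.Open("/proc/mounts")
    if err != nil {
        return "", err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) >= 2 && fields[1] == target {
            return fields[0], nil
        }
    }
    return "", fmt.Errorf("%s is not mounted", target)
}

func main() {
    byID := "/dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a"
    mountPath := "/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18"

    actual, _ := filepath.EvalSymlinks(byID) // e.g. /dev/vdf after the second attach
    mounted, err := mountedDevice(mountPath) // e.g. /dev/vdc left over from the stale mount
    fmt.Println("by-id resolves to:", actual)
    fmt.Println("mount is backed by:", mounted, err)
}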

@j-griffith
Contributor

One more thing... the /dev/vdX is completely arbitrary; it can't be forced or controlled. Libvirt will always just grab the next available letter (a, b, c, d, ...) in the instance. This is an old annoyance, which is why we typically discourage using the device specification in the attach argument altogether. You can put whatever you like there, but Nova/libvirt will always just grab the next available path and use it.

@msau42
Member

msau42 commented Jul 31, 2017

Could the double attach be causing the initial mount to go stale, and then os.Stat could fail on the directory? If os.Stat fails, then IsLikelyNotMountPoint returns true. When is unmount being invoked? I did not see any unmount logs.
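To illustrate what I mean (illustrative only; the exact errno depends on the filesystem and kernel): if the stale mount makes the stat fail, the heuristic answers "not a mount point" and the kubelet never unmounts or remounts it.

// Tiny illustration of the stale-mount hypothesis; not the kubelet code.
package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    mountPath := "/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18"

    if _, err := os.Stat(mountPath); err != nil {
        if pe, ok := err.(*os.PathError); ok && (pe.Err == syscall.ESTALE || pe.Err == syscall.EIO) {
            fmt.Println("mount looks stale:", err)
        }
        // In the heuristic, any error here means "not a mount point", so the
        // stale mount is silently left in place.
        fmt.Println("heuristic would report notMnt = true")
        return
    }
    fmt.Println("stat succeeded; the check falls through to the device comparison")
}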

@mtanino

mtanino commented Jul 31, 2017

It seems the main discussion point is Nova's two attach calls via the attach/detach controller.

FYI:

From the following logs, the kubelet on worker1 mounts Cinder disk "0328a7b2-4b0a-4e58-a396-b87247d51b18" under the deviceMountPath dir (*1) before the reboot, and the kubelet on worker0 mounts the same Cinder disk under the same deviceMountPath dir after worker1 is rebooted. This seems like normal kubelet behavior.

(*1) deviceMountPath: /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18

Kubelet only ever seems to mount to the first device path, not the second.

The persistent volume pvc-1514cd73-7064-11e7-8eca-fa163e1096d9 has volumeID parameter, and the value is 0328a7b2-4b0a-4e58-a396-b87247d51b18. Thus the mount path is "/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/" + volumeID.

  • pre_reboot_kubelet_worker1
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: I0724 11:35:00.282457    3181 attacher.go:176] Successfully found attached Cinder disk "0328a7b2-4b0a-4e58-a396-b87247d51b18" at /dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a.
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: I0724 11:35:00.282519    3181 operation_generator.go:512] MountVolume.WaitForAttach succeeded for volume "kubernetes.io/cinder/0328a7b2-4b0a-4e58-a396-b87247d51b18" (spec.Name: "pvc-1514cd73-7064-11e7-8eca-fa163e1096d9") pod "14755a99-7064-11e7-8eca-fa163e1096d9" (UID: "14755a99-7064-11e7-8eca-fa163e1096d9").
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: Mounting arguments: /dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18 ext4 [defaults]
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: I0724 11:35:00.422776    3181 mount_linux.go:372] Disk "/dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a" appears to be unformatted, attempting to format as type: "ext4" with options: [-F /dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a]
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: I0724 11:35:00.853221    3181 mount_linux.go:377] Disk successfully formatted (mkfs): ext4 - /dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: I0724 11:35:00.873322    3181 operation_generator.go:551] MountVolume.MountDevice succeeded for volume "kubernetes.io/cinder/0328a7b2-4b0a-4e58-a396-b87247d51b18" (spec.Name: "pvc-1514cd73-7064-11e7-8eca-fa163e1096d9") pod "14755a99-7064-11e7-8eca-fa163e1096d9" (UID: "14755a99-7064-11e7-8eca-fa163e1096d9") device mount path "/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18"
Jul 24 11:35:00 jamie-terraform-worker-1 kubelet-wrapper[3181]: I0724 11:35:00.905380    3181 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/cinder/0328a7b2-4b0a-4e58-a396-b87247d51b18" (spec.Name: "pvc-1514cd73-7064-11e7-8eca-fa163e1096d9") pod "14755a99-7064-11e7-8eca-fa163e1096d9" (UID: "14755a99-7064-11e7-8eca-fa163e1096d9").

  • post_reboot_kubelet_worker0
Jul 24 11:51:39 jamie-terraform-worker-0 kubelet-wrapper[3142]: I0724 11:51:39.738110    3142 attacher.go:176] Successfully found attached Cinder disk "0328a7b2-4b0a-4e58-a396-b87247d51b18" at /dev/disk/by-id/virtio-0328a7b2-4b0a-4e58-a.
Jul 24 11:51:39 jamie-terraform-worker-0 kubelet-wrapper[3142]: I0724 11:51:39.738159    3142 operation_generator.go:512] MountVolume.WaitForAttach succeeded for volume "kubernetes.io/cinder/0328a7b2-4b0a-4e58-a396-b87247d51b18" (spec.Name: "pvc-1514cd73-7064-11e7-8eca-fa163e1096d9") pod "602a6139-7066-11e7-ad14-fa163e82edc0" (UID: "602a6139-7066-11e7-ad14-fa163e82edc0").
Jul 24 11:51:40 jamie-terraform-worker-0 kubelet-wrapper[3142]: I0724 11:51:40.204893    3142 operation_generator.go:551] MountVolume.MountDevice succeeded for volume "kubernetes.io/cinder/0328a7b2-4b0a-4e58-a396-b87247d51b18" (spec.Name: "pvc-1514cd73-7064-11e7-8eca-fa163e1096d9") pod "602a6139-7066-11e7-ad14-fa163e82edc0" (UID: "602a6139-7066-11e7-ad14-fa163e82edc0") device mount path "/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18"
Jul 24 11:51:40 jamie-terraform-worker-0 kubelet-wrapper[3142]: I0724 11:51:40.329824    3142 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/cinder/0328a7b2-4b0a-4e58-a396-b87247d51b18" (spec.Name: "pvc-1514cd73-7064-11e7-8eca-fa163e1096d9") pod "602a6139-7066-11e7-ad14-fa163e82edc0" (UID: "602a6139-7066-11e7-ad14-fa163e82edc0").

@jamiehannaford
Contributor Author

jamiehannaford commented Aug 1, 2017

@msau42 Yeah, that's what I'm suspecting too. The initial mount goes stale (becomes RO), but the kubelet does not update its mount to point to the newest device path. So /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18 will still exist, but it still points to /dev/vdc (which no longer exists after the second attach), rather than the new /dev/vdf.

We thought this might be happening because AttachDisk assumes that if the target path is already a mount point, it must already be mounted correctly, and it skips the mount process. So when Nova effectively renames the device from vdc to vdf, the mount stays on vdc:

https://github.com/kubernetes/kubernetes/blob/v1.6.7/pkg/volume/cinder/cinder_util.go#L74-L93
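Paraphrased, that skip path looks roughly like this (a sketch of the flow with a stand-in interface, not the verbatim source):

// Paraphrased sketch of the AttachDisk skip logic linked above; the interface
// is a stand-in so the flow is visible, not the real plugin types.
package sketch

import "fmt"

type mounter interface {
    IsLikelyNotMountPoint(path string) (bool, error)
    FormatAndMount(device, target, fstype string, options []string) error
}

func attachDisk(m mounter, devicePath, globalPDPath string) error {
    notMnt, err := m.IsLikelyNotMountPoint(globalPDPath)
    if err != nil {
        return err
    }
    if !notMnt {
        // The path already looks mounted, so the (possibly stale) mount to the
        // old device is kept and the new device path is never mounted.
        fmt.Println("skipping mount, already mounted:", globalPDPath)
        return nil
    }
    return m.FormatAndMount(devicePath, globalPDPath, "ext4", []string{"defaults"})
}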

Also, if you look at the logic of that IsLikelyNotMountPoint check itself, we think it might be failing for Cinder mounts. This is what AttachDisk seems to be doing end-to-end:

  1. Figure out where the mount should be (e.g. /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18)
  2. Stat /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18 (which in our case points to /dev/vdc) for its device
  3. Stat its parent /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts for its device
  4. 2 != 3 so the volume must already be attached. Skip the mount.

If you look at a healthy Cinder volume, the device doesn't match the parent dir either:

core@jamie-terraform-worker-2 ~ $ sudo stat /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/
  File: '/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/'
  Size: 4096      	Blocks: 16         IO Block: 4096   directory
Device: fd09h/64777d	Inode: 1571806     Links: 6
Access: (0750/drwxr-x---)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:var_lib_t:s0
Access: 2017-07-31 15:48:42.000000000 +0000
Modify: 2017-07-31 15:51:16.000000000 +0000
Change: 2017-07-31 15:51:16.000000000 +0000
 Birth: -
core@jamie-terraform-worker-2 ~ $ sudo stat /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/02a71021-4dbf-435b-82d2-3871708a19c3
  File: '/var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/02a71021-4dbf-435b-82d2-3871708a19c3'
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: fd30h/64816d	Inode: 2           Links: 4
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2017-07-31 15:50:43.000000000 +0000
Modify: 2017-07-31 15:50:52.458551232 +0000
Change: 2017-07-31 15:50:52.458551232 +0000
 Birth: -

Is this expected behaviour?

Due to the buggy logic in the volume manager, we know there's a high chance of race conditions. So I'm wondering whether the detach calls are not properly deleting /var/lib/kubelet/plugins/kubernetes.io/cinder/mounts/0328a7b2-4b0a-4e58-a396-b87247d51b18 by the time the next attach happens. The presence of an existing mount seems to be what's triggering the above.

@msau42
Member

msau42 commented Aug 1, 2017

Yes, it is definitely expected that the mount point device not match the parent. It sounds like the real underlying issue is the 2nd attach causing the 1st mount to go stale. Can we detect if the volume has already been attached and not trigger a second attach?

@jamiehannaford
Contributor Author

@msau42 Yeah, I agree it's probably best to fix the volume manager. Somebody proposed a potential fix here. Based on that comment, could this race condition also lead to the stale mount not being cleaned up? It seems the volume manager is responsible for handling unmounts, right?

@msau42
Member

msau42 commented Aug 1, 2017

I think the stale mount can be avoided if the Cinder plugin could detect that the volume is already attached to this node.

@jamiehannaford
Contributor Author

@msau42 Yeah, looks like AWS plugin does exactly that:

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1595

Another problem area is that the OpenStack plugin's AttachDisk tries to detach the disk from other nodes if it thinks the state is wrong:

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L228-L238

We should remove that because that logic should be delegated to the volume manager and is probably contributing to the race conditions.
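Roughly the direction I mean, with hypothetical helper names rather than the real cloud provider API (DiskIsAttachedToInstance, AttachedInstance, GetAttachmentDiskPath, and NovaAttach are stand-ins): reuse an existing attachment on this node, and report a conflict rather than silently detaching when the disk is attached elsewhere.

// Rough sketch of an attach guard for the Cinder path; all helper names are
// hypothetical stand-ins, not the real OpenStack provider methods.
package sketch

import "fmt"

type cinderClient interface {
    DiskIsAttachedToInstance(volumeID, instanceID string) (bool, error)
    GetAttachmentDiskPath(volumeID, instanceID string) (string, error)
    AttachedInstance(volumeID string) (instanceID string, attached bool, err error)
    NovaAttach(volumeID, instanceID string) (string, error)
}

func attachDisk(c cinderClient, volumeID, instanceID string) (string, error) {
    // Already attached to this node: reuse the existing device path instead of
    // issuing a second nova volume-attach (avoids the vdc -> vdf shuffle).
    if attached, err := c.DiskIsAttachedToInstance(volumeID, instanceID); err != nil {
        return "", err
    } else if attached {
        return c.GetAttachmentDiskPath(volumeID, instanceID)
    }

    // Attached elsewhere: surface the conflict and let the attach/detach
    // controller decide, instead of detaching from inside the cloud provider.
    if other, attached, err := c.AttachedInstance(volumeID); err != nil {
        return "", err
    } else if attached && other != instanceID {
        return "", fmt.Errorf("volume %s is still attached to instance %s", volumeID, other)
    }

    return c.NovaAttach(volumeID, instanceID)
}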
