Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix WaitForAttach failure issue for azure disk #62612

Merged
merged 1 commit into from
Apr 17, 2018

Conversation

andyzhangx
Copy link
Member

What this PR does / why we need it:
From v1.10, devicePath will be updated due to following code change:

markDeviceMountedErr := actualStateOfWorld.MarkDeviceAsMounted(
volumeToMount.VolumeName, devicePath, deviceMountPath)

So in v1.10.0, MountVolume.WaitForAttach will fail in the azure disk remount, error logs would be like following:

MountVolume.WaitForAttach failed for volume "pvc-f1562ecb-3e5f-11e8-ab6b-000d3af9f967" : azureDisk - Wait for attach expect device path as a lun number, instead got: /dev/disk/azure/scsi1/lun1 (strconv.Atoi: parsing "/dev/disk/azure/scsi1/lun1": invalid syntax)
  Warning  FailedMount             1m (x10 over 21m)   kubelet, k8s-agentpool-66825246-0  Unable to mount volumes for pod  

This PR does not use devicePath anymore since it could be changed, instead, it use diskController.GetDiskLun(diskName, volumeSource.DataDiskURI, nodeName) to get disk LUN, this ARM api call would cost about 0.12s

The GCE disk won't have this issue since devicePath is not used in WaitForAttach func, while aws disk is also using devicePath in WaitForAttach func, I think there is potentical issue for aws_ebs

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62540

Special notes for your reviewer:
should cherry-pick to v1.10

Release note:

fix WaitForAttach failure issue for azure disk

/assign @feiskyer
/sig azure

FYI @khenidak

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/azure size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 16, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2018
@andyzhangx
Copy link
Member Author

@jingxu97

Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change lgtm. Let's wait a while for @jingxu97's opinion.

@feiskyer
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 62676, 62612). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit bf3cda6 into kubernetes:master Apr 17, 2018
@jingxu97
Copy link
Contributor

Hi @andyzhangx sorry for the delay. According to the comments for WaitforAttach(), the device path is returned. In your case, what is the different between the passing device path and the returned one? I think we should have better comments to explain the difference.
cc @gnufied

// WaitForAttach blocks until the device is attached to this
// node. If it successfully attaches, the path to the device
// is returned. Otherwise, if the device does not attach after
// the given timeout period, an error will be returned.

@andyzhangx
Copy link
Member Author

@jingxu97

  • In the beginning, devicePath is a LUN number(e.g. 0, 1, 2...) returned by azure_dd.Attach func, then WaitForAttach will use findDiskByLun to return real device path(e.g. in Linux, it could be /dev/disk/azure/scsi1/lun0):
    if newDevicePath, err = findDiskByLun(int(lun), io, exec); err != nil {
  • Then devicePath(/dev/disk/azure/scsi1/lun0) will be updated by actualStateOfWorld.MarkDeviceAsMounted:
    markDeviceMountedErr := actualStateOfWorld.MarkDeviceAsMounted(
    volumeToMount.VolumeName, devicePath, deviceMountPath)

    so in the second time of WaitForAttach func invocation, devicePath becomes /dev/disk/azure/scsi1/lun0 in Linux node now, then we will get error in azure_dd

I think same logic would happens in aws_ebs.WaitForAttach func, although it uses verifyDevicePath(devicePaths) since in the first and second time invocation of WaitForAttach func, devicePath value has been changed.

@gnufied
Copy link
Member

gnufied commented Apr 17, 2018

I think EBS will be fine because EBS does not rely on device path to create new device paths in the way Azure is doing. There is a small ambiguity around NVMe disks but I think those should be fine too

@gnufied
Copy link
Member

gnufied commented Apr 18, 2018

@andyzhangx do you want to backport the fix to 1.10? It appears that - attach operation for AzureDisks might be broken on 1.10? cc @saad-ali

@andyzhangx
Copy link
Member Author

@gnufied thanks for remindar, I have already cherry-pick this PR to 1.10: #62693

k8s-github-robot pushed a commit that referenced this pull request Apr 20, 2018
…2612-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #62612: fix devicePath update issue in Azure WaitForAttach func

Cherry pick of #62612 on release-1.10.

#62612: fix devicePath update issue in Azure WaitForAttach func
@huydinhle
Copy link

how would I fixing this if I got pods running and needs to be remounted ? @andyzhangx

@andyzhangx
Copy link
Member Author

To be clear, this issue has been fixed in k8s v1.10.2, so only v1.10.0 & v1.10.1 would have this issue, and there is no workaround in those two minor versions, only way to upgrade to a higer version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WaitForAttach failed for azure disk: parsing "/dev/disk/azure/scsi1/lun1": invalid syntax
7 participants