
Add AzureDisk support for vmss nodes #59716

Merged
11 commits merged into kubernetes:master on Feb 14, 2018

Conversation

feiskyer (Member) commented Feb 11, 2018:

What this PR does / why we need it:

This PR adds AzureDisk support for vmss nodes. Changes include:

  • Upgrade the vmss API to 2017-12-01
  • Upgrade the vmss clients to the new API version
  • Abstract AzureDisk operations for vmss and vmas
  • Add AzureDisk support for vmss
  • Fix unit tests and fake clients

Which issue(s) this PR fixes (will close the issue(s) when the PR gets merged):
Fixes #43287

Special notes for your reviewer:

Depends on #59652 (the first two commits are from that PR).

Release note:

Add AzureDisk support for vmss nodes

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 11, 2018
@feiskyer feiskyer added kind/feature Categorizes issue or PR as related to a new feature. sig/azure labels Feb 11, 2018
@feiskyer feiskyer added this to the v1.10 milestone Feb 11, 2018

khenidak (Contributor):

/assign @khenidak

andyzhangx (Member) left a comment:

Some functions, like GetNextDiskLun, duplicate a lot of code from the original vmas implementation.
Functions like AttachDisk and DetachDisk should depend on #59693, which removes the duplicated code.

func (ss *scaleSet) AttachDisk(isManagedDisk bool, diskName, diskURI string, nodeName types.NodeName, lun int32, cachingMode compute.CachingTypes) error {
    ssName, instanceID, vm, err := ss.getVmssVM(string(nodeName))
    if err != nil {
        if err == ErrorNotVmssInstance {

Member:

This is weird: the code logic uses vmss.AttachDisk, while the error says it should not use vmss.AttachDisk and should fall back to vmas.AttachDisk. Is there a real case for this issue?

Member Author:

If master nodes are managed by availability sets, and pods are running on those master nodes, then this case happens.

Contributor:

This is fine. It also allows us to support users who have mixed clusters (avsets + scale sets). I do, however, strongly believe that we should use controller common as the abstraction layer.
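
For readers following along, here is a minimal, self-contained sketch of the fallback pattern under discussion: the scale-set path detects that a node is not a VMSS instance and hands the operation to the availability-set path. The types, the errNotVmssInstance sentinel, and getVmssVM below are simplified stand-ins for the real provider code, not the actual implementation.

package main

import (
    "errors"
    "fmt"
)

// errNotVmssInstance mimics the ErrorNotVmssInstance sentinel in the PR (assumption).
var errNotVmssInstance = errors.New("not a vmss instance")

type availabilitySet struct{}

func (as *availabilitySet) AttachDisk(nodeName, diskURI string) error {
    fmt.Printf("attach %s via availability-set path for %s\n", diskURI, nodeName)
    return nil
}

type scaleSet struct {
    availabilitySet *availabilitySet // fallback for nodes not managed by any VMSS
    vmssNodes       map[string]bool
}

// getVmssVM is a stand-in for the real lookup; it fails with the sentinel
// error when the node does not belong to a scale set (e.g. a master VM).
func (ss *scaleSet) getVmssVM(nodeName string) error {
    if !ss.vmssNodes[nodeName] {
        return errNotVmssInstance
    }
    return nil
}

func (ss *scaleSet) AttachDisk(nodeName, diskURI string) error {
    if err := ss.getVmssVM(nodeName); err != nil {
        if errors.Is(err, errNotVmssInstance) {
            // Mixed cluster (avsets + scale sets): hand off to the availability-set code.
            return ss.availabilitySet.AttachDisk(nodeName, diskURI)
        }
        return err
    }
    fmt.Printf("attach %s via vmss path for %s\n", diskURI, nodeName)
    return nil
}

func main() {
    ss := &scaleSet{availabilitySet: &availabilitySet{}, vmssNodes: map[string]bool{"vmss-node-0": true}}
    _ = ss.AttachDisk("vmss-node-0", "disk-a") // vmss path
    _ = ss.AttachDisk("master-0", "disk-b")    // falls back to the availability-set path
}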


// GetDiskLun finds the lun on the host that the vhd is attached to, given a vhd's diskName and diskURI
func (ss *scaleSet) GetDiskLun(diskName, diskURI string, nodeName types.NodeName) (int32, error) {
    _, _, vm, err := ss.getVmssVM(string(nodeName))

Member:

Only this line differs from the original GetNextDiskLun; I would encourage combining these two GetNextDiskLun implementations.
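
A sketch of what the suggested consolidation could look like: factor the LUN search out of both GetNextDiskLun implementations so that only the VM lookup differs. The dataDisk type and the LUN range are simplified assumptions, not the provider's actual types.

package main

import "fmt"

// dataDisk is a simplified stand-in for compute.DataDisk; only the LUN matters here.
type dataDisk struct {
    Lun int32
}

// nextDiskLun returns the first unused LUN given the disks already attached to a
// VM, or -1 if every LUN up to maxLuns is taken. Both the availability-set and the
// scale-set implementations could call this after fetching their own VM object.
func nextDiskLun(disks []dataDisk, maxLuns int32) int32 {
    used := make(map[int32]bool, len(disks))
    for _, d := range disks {
        used[d.Lun] = true
    }
    for lun := int32(0); lun < maxLuns; lun++ {
        if !used[lun] {
            return lun
        }
    }
    return -1
}

func main() {
    attached := []dataDisk{{Lun: 0}, {Lun: 1}, {Lun: 3}}
    fmt.Println(nextDiskLun(attached, 64)) // prints 2
}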


func (ss *scaleSet) DetachDiskByName(diskName, diskURI string, nodeName types.NodeName) error {
    ssName, instanceID, vm, err := ss.getVmssVM(string(nodeName))
    if err != nil {
        if err == ErrorNotVmssInstance {

Member:

same as above

// GetNextDiskLun searches all vhd attachment on the host and find unused lun
// return -1 if all luns are used
func (ss *scaleSet) GetNextDiskLun(nodeName types.NodeName) (int32, error) {
    _, _, vm, err := ss.getVmssVM(string(nodeName))

Member:

same as above

        attached[diskName] = false
    }

    _, _, vm, err := ss.getVmssVM(string(nodeName))

Member:

same as above

feiskyer (Member Author) commented Feb 11, 2018:

Functions like AttachDisk and DetachDisk should depend on #59693, which removes the duplicated code.

Please note that VirtualMachineScaleSetVM and VirtualMachine are different data structures (not interfaces). Although the disk operations follow the same logic as the original VirtualMachine code, some duplication is still required for now.

Update: VirtualMachineScaleSetVM and VirtualMachine are using different API versions now. They should use the same API version in the future, and we can merge the functions together then.

andyzhangx (Member) commented Feb 11, 2018:

@rootfs this commit 0ca2690 removes the instance ID truncating code:

	if ind := strings.LastIndex(instanceid, "/"); ind >= 0 {
		instanceid = instanceid[(ind + 1):]
	}

Do you know why the instance ID was truncated, e.g. from /subscriptions/4be8920b-2978-43d7-ab14-04d8549c1d00/resourceGroups/andy-k8s192/providers/Microsoft.Compute/virtualMachines/k8s-agentpool-87187153-0 to k8s-agentpool-87187153-0? Was there any particular consideration at that time?

I have checked the code and it looks OK without the instance ID truncation, but I still want to confirm with you. Thanks.
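
For illustration, a small sketch of why the removed truncation matters for VMSS (the later commit message in this PR makes the same point): the trailing segment of a scale-set instance's resource ID is just a numeric index, so truncation can collide across scale sets. The resource IDs below are made up.

package main

import (
    "fmt"
    "strings"
)

// lastSegment reproduces the removed truncation: keep only what follows the final "/".
func lastSegment(resourceID string) string {
    if i := strings.LastIndex(resourceID, "/"); i >= 0 {
        return resourceID[i+1:]
    }
    return resourceID
}

func main() {
    // Two instances from two different scale sets in the same cluster (IDs invented
    // for illustration). Truncation collapses both to "0", so the full resource ID
    // has to be kept for VMSS nodes.
    a := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-pool-a/virtualMachines/0"
    b := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-pool-b/virtualMachines/0"
    fmt.Println(lastSegment(a) == lastSegment(b)) // true: ambiguous after truncation
}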

khenidak (Contributor):

Re #59716 (comment):

Controller common is the abstraction layer, and that is where we should decide whether a node is vmss or avset. Let us not default to VMSS for now.


    if resp != nil {
        // HTTP 4xx or 5xx suggests we should retry
        if 399 < resp.StatusCode && resp.StatusCode < 600 {

Contributor:

That is not correct: status codes such as 403 are terminal in all cases. They occur when the service principal has expired, or when the principal/MSI/EMSI doesn't have the proper permissions.

Member Author:

If the principal has expired, we should surely retry the API call. Isn't this expected?

Contributor:

Retrying will exhaust the API quota more quickly, so we need to be more frugal.

Member Author:

The logic here is the same as shouldRetryAPIRequest, just with a different parameter (http.Response instead of autorest.Response). It doesn't change the existing retry logic.
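
A minimal sketch of the retry check being discussed, assuming it mirrors shouldRetryAPIRequest but takes an *http.Response; the function name here is illustrative, not the provider's actual helper.

package main

import (
    "fmt"
    "net/http"
)

// shouldRetryHTTPRequest treats any 4xx or 5xx status as retryable, matching the
// hunk above. (As the review notes, a stricter version would exclude terminal
// codes such as 403 to avoid burning API quota.)
func shouldRetryHTTPRequest(resp *http.Response, err error) bool {
    if err != nil {
        return true
    }
    if resp != nil && 399 < resp.StatusCode && resp.StatusCode < 600 {
        return true
    }
    return false
}

func main() {
    fmt.Println(shouldRetryHTTPRequest(&http.Response{StatusCode: 503}, nil)) // true
    fmt.Println(shouldRetryHTTPRequest(&http.Response{StatusCode: 200}, nil)) // false
    fmt.Println(shouldRetryHTTPRequest(&http.Response{StatusCode: 403}, nil)) // true, which is the reviewer's concern
}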


// AttachDisk attaches a vhd to vm
// the vhd must exist, can be identified by diskName, diskURI, and lun.
func (ss *scaleSet) AttachDisk(isManagedDisk bool, diskName, diskURI string, nodeName types.NodeName, lun int32, cachingMode compute.CachingTypes) error {

Contributor:

We have to find a way to determine whether a VM is part of an availability set or a scale set. We cannot try, fail, and then retry. This information should be part of the VM, the config, or a node label.

feiskyer (Member Author) commented Feb 12, 2018:

This is solved by availabilitySetNodesCache, which holds a list of VMs not managed by vmss.
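
A rough sketch of the availabilitySetNodesCache idea, under the assumption that it is a periodically refreshed set of node names not managed by any VMSS; the field names, TTL handling, and listNodes hook are invented for illustration.

package main

import (
    "fmt"
    "sync"
    "time"
)

// availabilitySetNodes caches the node names that are NOT managed by any VMSS,
// refreshed periodically, so disk operations can pick the right code path up
// front instead of try-and-retry.
type availabilitySetNodes struct {
    mu        sync.Mutex
    nodes     map[string]bool
    expires   time.Time
    ttl       time.Duration
    listNodes func() []string // e.g. an ARM list call in the real provider
}

func (c *availabilitySetNodes) isAvailabilitySetNode(name string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    if time.Now().After(c.expires) {
        // Refresh the cached set once the TTL has passed.
        c.nodes = map[string]bool{}
        for _, n := range c.listNodes() {
            c.nodes[n] = true
        }
        c.expires = time.Now().Add(c.ttl)
    }
    return c.nodes[name]
}

func main() {
    cache := &availabilitySetNodes{
        ttl:       time.Minute,
        listNodes: func() []string { return []string{"k8s-master-0"} },
    }
    fmt.Println(cache.isAvailabilitySetNode("k8s-master-0"))   // true: use the vmas path
    fmt.Println(cache.isAvailabilitySetNode("vmss-agent-000")) // false: use the vmss path
}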


        computerName := strings.ToLower(*vm.OsProfile.ComputerName)
        localCache[computerName] = ssName
    }

Contributor:

break?

Member Author:

Why break? This for loop is meant to collect all VMs.
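
For clarity, a small sketch of what that loop is doing: building a complete computerName-to-scale-set map, which is why an early break would be wrong. The vmssVM type and helper are simplified stand-ins, not the provider's real types.

package main

import (
    "fmt"
    "strings"
)

// vmssVM is a simplified stand-in for a scale-set VM with its OS profile computer name.
type vmssVM struct {
    scaleSetName string
    computerName string
}

// buildNodeNameCache maps every node's (lower-cased) computer name to the scale set
// that owns it. No early break: the cache is only useful if it covers all VMs in
// all scale sets, since any of them may be looked up later.
func buildNodeNameCache(vms []vmssVM) map[string]string {
    cache := make(map[string]string, len(vms))
    for _, vm := range vms {
        cache[strings.ToLower(vm.computerName)] = vm.scaleSetName
    }
    return cache
}

func main() {
    cache := buildNodeNameCache([]vmssVM{
        {"vmss-pool-a", "VMSS-POOL-A000000"},
        {"vmss-pool-b", "VMSS-POOL-B000000"},
    })
    fmt.Println(cache["vmss-pool-b000000"]) // vmss-pool-b
}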

feiskyer (Member Author) commented Feb 12, 2018:

Controller common is the abstraction layer, and that is where we should decide whether a node is vmss or avset. Let us not default to VMSS for now.

If vmType is set to vmss, we default to vmss because most nodes are expected to be running on vmss in that case.

Updated the comments in #59716 (comment). We could merge them together in the future once vmss and vm use the same compute API version, but currently they should still be kept separate.

Opened #59736 to track this issue.

This is because the last part of a VMSS instance name is a number, which may be
the same when there are multiple VMSS within the same cluster.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 12, 2018

feiskyer (Member Author):

The vmss cache PR has been merged. Rebased again.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 13, 2018

feiskyer (Member Author):

Controller common is the abstraction layer, and that is where we should decide whether a node is vmss or avset. Let us not default to VMSS for now.

Talked offline with @khenidak: the vmType check is better placed in controllerCommon.

@khenidak Added a new commit to address this issue. PTAL

khenidak (Contributor):

/LGTM Let's merge :-)

brendandburns (Contributor):

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2018
@@ -0,0 +1,182 @@
/*
Copyright 2017 The Kubernetes Authors.

Contributor:

2018

Member Author:

ack

    // vmType is Virtual Machine Scale Set (vmss).
    ss, ok := c.cloud.vmSet.(*scaleSet)
    if !ok {
        return fmt.Errorf("error of converting vmSet (%q) to scaleSet", c.cloud.vmSet)

Contributor:

also dump VMType for diagnostics.

Member Author:

ack
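
A tiny sketch of the suggestion above: include the configured vmType in the error message when the vmSet-to-scaleSet assertion fails. The interface and types below are simplified placeholders, not the provider's real ones.

package main

import "fmt"

// vmSet and scaleSet are simplified stand-ins for the provider's abstraction and
// its VMSS implementation; vmType mirrors the cloud config field.
type vmSet interface{ Name() string }

type scaleSet struct{}

func (s *scaleSet) Name() string { return "vmss" }

type availabilitySet struct{}

func (a *availabilitySet) Name() string { return "standard" }

// asScaleSet shows the reviewer's suggestion: when the assertion fails, include the
// configured vmType in the error so misconfiguration is obvious from the logs.
func asScaleSet(vs vmSet, vmType string) (*scaleSet, error) {
    ss, ok := vs.(*scaleSet)
    if !ok {
        return nil, fmt.Errorf("error converting vmSet (%q) to scaleSet, vmType is %q", vs.Name(), vmType)
    }
    return ss, nil
}

func main() {
    if _, err := asScaleSet(&availabilitySet{}, "standard"); err != nil {
        fmt.Println(err)
    }
}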

}

// AttachDisk attaches a vhd to vm. The vhd must exist, can be identified by diskName, diskURI, and lun.
func (c *controllerCommon) AttachDisk(isManagedDisk bool, diskName, diskURI string, nodeName types.NodeName, lun int32, cachingMode compute.CachingTypes) error {

Contributor:

Add comments on how this works. TBH, this flow looks quite cryptic to me.

feiskyer (Member Author) commented Feb 14, 2018:

ack, will do

Member Author:

Added in the new commit
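
To complement the comments added in the PR, here is a hedged sketch of the kind of dispatch controllerCommon performs: choose the vmas or vmss implementation based on the configured vmType and the availability-set node cache, then delegate. All names and the exact decision rule are simplified assumptions rather than the actual code.

package main

import "fmt"

// diskOps is a simplified stand-in for the per-VM-type disk operations
// (the real code has separate availabilitySet and scaleSet implementations).
type diskOps interface {
    AttachDisk(nodeName, diskURI string, lun int32) error
}

type fakeOps struct{ kind string }

func (f *fakeOps) AttachDisk(nodeName, diskURI string, lun int32) error {
    fmt.Printf("[%s] attach %s to %s at lun %d\n", f.kind, diskURI, nodeName, lun)
    return nil
}

// controllerCommon sketches the dispatch the reviewer asked to have documented:
// pick the vmas or vmss implementation once, based on configuration and on whether
// the node is actually managed by a scale set, then let that implementation do the work.
type controllerCommon struct {
    vmType                string  // "standard" or "vmss", from the cloud config
    vmasOps, vmssOps      diskOps // availability-set and scale-set implementations
    isAvailabilitySetNode func(string) bool
}

func (c *controllerCommon) AttachDisk(nodeName, diskURI string, lun int32) error {
    // Default to the availability-set path unless the cluster is configured for VMSS
    // and this particular node is not in the availability-set node cache.
    ops := c.vmasOps
    if c.vmType == "vmss" && !c.isAvailabilitySetNode(nodeName) {
        ops = c.vmssOps
    }
    return ops.AttachDisk(nodeName, diskURI, lun)
}

func main() {
    c := &controllerCommon{
        vmType:                "vmss",
        vmasOps:               &fakeOps{kind: "vmas"},
        vmssOps:               &fakeOps{kind: "vmss"},
        isAvailabilitySetNode: func(n string) bool { return n == "k8s-master-0" },
    }
    _ = c.AttachDisk("vmss-agent-000000", "disk-a", 0) // vmss path
    _ = c.AttachDisk("k8s-master-0", "disk-b", 1)      // vmas path
}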

"time"

"github.com/Azure/azure-sdk-for-go/arm/compute"
"github.com/Azure/azure-sdk-for-go/arm/disk"
"github.com/Azure/azure-sdk-for-go/arm/network"
"github.com/Azure/azure-sdk-for-go/arm/storage"
computepreview "github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2017-12-01/compute"

Contributor:

Any reason for renaming it? The rename causes a lot of diffs.

Member Author:

There is already an existing package named compute ("github.com/Azure/azure-sdk-for-go/arm/compute"), so the new package is imported as computepreview to avoid the name collision.
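
For readers unfamiliar with the pattern, import aliasing is the standard way to resolve such a package-name collision in Go; the unrelated standard-library example below shows the same technique the PR applies with computepreview.

package main

import (
    crand "crypto/rand"
    "fmt"
    mrand "math/rand"
)

func main() {
    // Without aliases these two packages would both be referenced as "rand",
    // which is the same collision the PR avoids by importing the 2017-12-01
    // compute package as "computepreview".
    buf := make([]byte, 4)
    _, _ = crand.Read(buf)
    fmt.Println(buf, mrand.Intn(10))
}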

feiskyer (Member Author):

@rootfs @khenidak Thanks for reviewing. Addressed comments. PTAL

khenidak (Contributor):

Let's clear the tests and merge. Thanks a lot for this (and the rest of the VMSS work); it's been a long time coming!

brendandburns (Contributor):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2018

k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, feiskyer, khenidak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these OWNERS Files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-github-robot:

/test all [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot:

Automatic merge from submit-queue (batch tested with PRs 59489, 59716). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit d89e641 into kubernetes:master Feb 14, 2018
@feiskyer feiskyer deleted the vmss-disk branch February 14, 2018 08:23