
Add sync state loop in master's volume reconciler #34859

Merged: 1 commit into kubernetes:master on Oct 29, 2016

Conversation

jingxu97
Contributor

@jingxu97 jingxu97 commented Oct 14, 2016

At the master volume reconciler, the information about which volumes are
attached to nodes is cached in the actual state of world. However, this
information can be out of date when a node is terminated (its volumes are
detached automatically). In that situation the reconciler assumes the volume
is still attached and will not issue an attach operation when the node comes
back, so pods created on that node fail to mount.

This PR adds logic to periodically sync the attached-volume information kept
in the actual state cache with the real state. If a volume is no longer
attached to the node, the actual state is updated to reflect that, and the
reconciler then takes action as needed.

To avoid issuing many concurrent operations against the cloud provider, this PR
adds a batch operation that checks whether a list of volumes is attached to a
node, instead of sending one request per volume.
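
Roughly, the sync loop amounts to the sketch below; the names GetAttachedVolumesPerNode and VerifyVolumesAreAttached are taken from the diff discussed further down, while the error handling here is an assumption.

    // Sketch only: periodically re-verify, per node, that the cached
    // "attached" volumes really are still attached, using one batch call
    // per node rather than one cloud-provider request per volume.
    func (rc *reconciler) syncStates() {
        volumesPerNode := rc.actualStateOfWorld.GetAttachedVolumesPerNode()
        for nodeName, volumes := range volumesPerNode {
            // The operation executor checks the list against the cloud provider
            // and marks anything no longer attached as detached in the actual
            // state of world, so the normal reconcile loop can re-attach it.
            err := rc.attacherDetacher.VerifyVolumesAreAttached(volumes, nodeName, rc.actualStateOfWorld)
            if err != nil {
                glog.Errorf("Error verifying volumes are attached to node %q: %v", nodeName, err)
            }
        }
    }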

Add sync state loop in master's volume reconciler

Detailed design is in #33760



@jingxu97 jingxu97 added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Oct 14, 2016
@jingxu97 jingxu97 added this to the v1.4 milestone Oct 14, 2016
@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note-label-needed labels Oct 14, 2016
@jingxu97 jingxu97 added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 17, 2016
return err
}
// Give an empty UniqueVolumeName so that this operation could be executed concurrently.
return oe.pendingOperations.Run("", "", volumesAreAttachedFunc)
Member

nit: To improve readability, put variable name in a comment when it is not obvious what the value is for:

    return oe.pendingOperations.Run("" /* volumeName */, "" /* podName */, volumesAreAttachedFunc)

Contributor Author

done

// Note that this operation could be operated concurrently with other attach/detach operations.
// In theory (but very unlikely in practice), race condition among these operations might mark volume as detached
// even if it is attached. But reconciler can correct this in a short period of time.
CheckVolumesAreAttached(AttachedVolumes []AttachedVolume, nodeName types.NodeName, actualStateOfWorld ActualStateOfWorldAttacherUpdater) error
Member

nit: Check -> Verify

Contributor Author

done

return err
}
// Give an empty UniqueVolumeName so that this operation could be executed concurrently.
return oe.pendingOperations.Run("", "", volumesAreAttachedFunc)
Member

Since we do not care about pending operations for this function, how about we execute it serially inline? That way 1) we can avoid having to make changes to nestedpendingoperations. And 2) we won't start other (potentially racy) operations until this verification step is complete.

Contributor Author

Since I am trying to use a batch operation to check whether volumes are attached, there is no single unique volume ID to serialize the operation on.

Member

Right, and that's fine. So the suggestion is to therefore not use nestedpendingoperations at all.

Contributor Author

I thought about using goroutineMap, but then we would need to put a goroutineMap into operation_executor in addition to nestedpendingoperations, which is not needed if we choose to use nestedpendingoperations.

Member

What I'm trying to say is that since we are not trying to coordinate with other operations, we don't need nestedpendingoperations, goroutineMap, or operation_executor at all. We can do everything that the method returned by generateVolumesAreAttachedFunc does serially, in the same file as syncStates.

Member

Ping on this comment

Member

Spoke offline, ok to leave as is.
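
For reference, the inline alternative suggested above would look roughly like the sketch below: generate the check function and call it directly instead of handing it to nestedpendingoperations (assuming, as in the surrounding code, that the generated closure has the shape func() error).

    // Hypothetical serial variant: skip pendingOperations entirely and run
    // the generated verification closure inline from the sync loop.
    volumesAreAttachedFunc, err := oe.generateVolumesAreAttachedFunc(
        attachedVolumes, nodeName, actualStateOfWorld)
    if err != nil {
        return err
    }
    return volumesAreAttachedFunc() // executed serially, no operation tracking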

for spec, check := range attached {
if !check {
actualStateOfWorld.MarkVolumeAsDetached(volumeSpecMap[spec], nodeName)
glog.V(2).Infof("Volume %q is no longer attached, mark it as detached", volumeSpecMap[spec])
Member

Let's make this V(1). And make the message more explicit: VerifyVolumesAreAttached determined volume %q (spec.Name: %q) is no longer attached to node %q, therefore it was marked as detached.
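
Applied to the loop above, the suggested line would read roughly as follows (spec.Name() as the accessor for the spec name is an assumption):

    // Sketch of the more explicit, lower-verbosity log message.
    glog.V(1).Infof(
        "VerifyVolumesAreAttached determined volume %q (spec.Name: %q) is no longer attached to node %q, therefore it was marked as detached",
        volumeSpecMap[spec], spec.Name(), nodeName)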

Contributor Author

done

volumesPerPlugin := make(map[string][]*volume.Spec)
// volumeSpecMap maps from a volume spec to its unique volumeName which will be used
// when calling MarkVolumeAsDetached
volumeSpecMap := make(map[*volume.Spec]api.UniqueVolumeName)
Member

You are using a pointer as a key in a map. This means that the map will compare memory addresses for equality--not the contents of volume.Spec. Is that what you want? Looks like you are doing this in multiple places.

Up to you how you want to handle it. I suggest passing around api.UniqueVolumeName as needed and using that as the key where possible.
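
A self-contained illustration of the point about pointer keys: two values with identical contents are still different keys, because the map compares addresses.

    package main

    import "fmt"

    type spec struct{ name string }

    func main() {
        a := &spec{name: "vol-1"}
        b := &spec{name: "vol-1"} // same contents, separate allocation

        m := map[*spec]bool{a: true}
        fmt.Println(m[a]) // true: same pointer
        fmt.Println(m[b]) // false: equal contents, different address
    }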

Contributor Author

Yes, I am trying to compare whether they are the same object (volume.Spec), not the content. The functions implementing the Attacher interface all take volume.Spec as a parameter, so it is not very convenient here, since operations on the actual/desired states use the volume ID.

Member

Ack

@jingxu97
Contributor Author

@saad-ali PTAL

@jingxu97
Contributor Author

@saad-ali PTAL

@@ -321,6 +321,9 @@ type Volumes interface {

// Check if the volume is already attached to the node with the specified NodeName
DiskIsAttached(diskName string, nodeName types.NodeName) (bool, error)

// Check if a list of volumes are attached to the node with the specified NodeName
DisksAreAttached(diskNames []string, nodeName types.NodeName) ([]bool, error)
Member

I'm a little uncomfortable with this method returning an array where the value of each element corresponds to the value at the same index in the input array. I'd prefer the return value to be understandable (contain all necessary information) independent of the input. But that's just a queasy feeling, no hard-and-fast rule--so it's fine if you want to leave it like this.

Contributor Author

I changed it to a map.
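
As a sketch, the map-based shape would look something like the following (the exact final signature in the PR may differ); each result is keyed by disk name, so it can be read independently of the input order.

    // Check whether each of the given volumes is attached to the node with
    // the specified NodeName; the result maps disk name to attachment state.
    DisksAreAttached(diskNames []string, nodeName types.NodeName) (map[string]bool, error)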

@@ -240,6 +240,21 @@ func (os *OpenStack) DiskIsAttached(diskName, instanceID string) (bool, error) {
return false, nil
}

// query if a list of volumes are attached to a compute instance
func (os *OpenStack) DisksAreAttached(diskNames []string, instanceID string) ([]bool, error) {
Member

@anguslees @dagnello Can you verify this logic for openstack. The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached when it should be, the controller will try to reattach the volume.

Member

@anguslees or @dagnello Can you take a look at the OpenStack changes by EOD?

@@ -660,3 +660,21 @@ func (rs *Rackspace) DiskIsAttached(diskName, instanceID string) (bool, error) {
}
return false, nil
}

// query if a list of volumes are attached to a compute instance
func (rs *Rackspace) DisksAreAttached(diskNames []string, instanceID string) ([]bool, error) {
Member

@anguslees, @jamiehannaford, @jrperritt @mkulke Can you verify this logic for rackspace. The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached when it should be, the controller will try to reattach the volume.

Member

@anguslees, @jamiehannaford, @jrperritt @mkulke: can one of you verify the rackspace changes by EOD?

@@ -1774,6 +1777,38 @@ func (c *Cloud) DiskIsAttached(diskName string, nodeName types.NodeName) (bool,
return false, nil
}

func (c *Cloud) DisksAreAttached(diskNames []string, nodeName types.NodeName) ([]bool, error) {
Member

@justinsb Can you take a look at the AWS bits here. The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached when it should be, the controller will try to reattach the volume.

Member

@justinsb Could you take a look at the AWS bits by EOD?

@@ -937,6 +941,61 @@ func (vs *VSphere) DiskIsAttached(volPath string, nodeName k8stypes.NodeName) (b
return attached, err
}

// DisksAreAttached returns if disks are attached to the VM using controllers supported by the plugin.
func (vs *VSphere) DisksAreAttached(volPaths []string, nodeName k8stypes.NodeName) ([]bool, error) {
Member

VSphere maintainers (@dagnello @abithap @imkin @abrarshivani @kerneltime @luomiao): Can one of you verify this logic for VSphere. The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached, then the controller will try to reattach the volume.

Contributor

LGTM. Thanks for doing this.

Member

Thanks @abrarshivani

@jingxu97
Contributor Author

@saad-ali PTAL

@jingxu97
Contributor Author

But this way, syncStates will block the reconcile function.

@jingxu97
Contributor Author

@saad-ali PTAL

@saad-ali
Member

Will LGTM by EOD. Giving volume plugin owners a chance to take a look in the meantime.

for _, blockDevice := range info.BlockDeviceMappings {
volumeId := aws.StringValue(blockDevice.Ebs.VolumeId)
for _, diskName := range diskNames {
if volumeId == diskName {
Member

I worry this isn't going to be quite right... I think a diskname normally looks like aws:///vol-abcdef

Contributor Author

But I followed the same logic as the existing DiskIsAttached function. Is that function fine?

Member

I believe not, unless I am missing something.

Some background here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1277-L1312

A volumeid from AWS is going to look like vol-abcdef0 (or longer). A volumeid in the volume spec could look like that, or could be aws:///vol-abcdef0. We are supposed to normalize before comparing to raw API ids, but it looks like I missed that check.
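
A hedged sketch of the normalization described here; normalizeAWSVolumeID is a hypothetical helper, not the aws provider's actual function.

    // normalizeAWSVolumeID (hypothetical) reduces both the spec forms
    // "aws:///vol-abcdef0" and "aws://<zone>/vol-abcdef0" and the raw form
    // "vol-abcdef0" to the bare EBS volume ID so the values compare safely.
    func normalizeAWSVolumeID(name string) string {
        if !strings.HasPrefix(name, "aws://") {
            return name // already a bare volume ID
        }
        parts := strings.Split(name, "/")
        return parts[len(parts)-1] // keep only the last path segment
    }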

Member

If you want, we can merge this as is, and then I might do the same thing I did with types.NodeName, with aws.VolumeNameSpec or something, so I can catch any more of these.

I don't think we gate on AWS e2e failures?

Member

Let's do it in a follow-up. Created issue #35746 to follow up.

Member
@justinsb justinsb left a comment

This is greatly needed, and I think will solve a lot of problems :-)

A few comments ... I'll check on the big one which is whether that is a volume id in the aws provider.

"Instance %q does not exist. DisksAreAttached will assume PD %v are not attached to it.",
instanceName,
diskNames)
return attached, nil
Member
@justinsb justinsb Oct 26, 2016

Do we want to require that we return InstanceNotFound here, as we do for ExternalID?

Contributor Author

Could you please point me to the ExternalID you mean? The logic here is that if the instance no longer exists in the cloud provider, we assume all its volumes are detached. Please let me know if you see any issue with this assumption.

Member

What you're doing is fine. In ExternalID in the cloudprovider, we just always return cloudprovider.InstanceNotFound if the instance is not found. Here it would avoid the duplication of code for this case across the providers.
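
As a rough sketch of that pattern (the instance-lookup helper below is hypothetical), the provider would surface the shared sentinel error and let the caller translate "instance gone" into "disks detached":

    // Hypothetical shape: return the shared sentinel instead of having each
    // provider report every disk as detached when the instance is missing.
    instance, err := c.getInstanceByNodeName(nodeName) // hypothetical lookup
    if err != nil {
        return nil, err
    }
    if instance == nil {
        return nil, cloudprovider.InstanceNotFound
    }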

@@ -56,6 +56,10 @@ const (
// desiredStateOfWorldPopulatorLoopSleepPeriod is the amount of time the
// DesiredStateOfWorldPopulator loop waits between successive executions
desiredStateOfWorldPopulatorLoopSleepPeriod time.Duration = 1 * time.Minute

// reconcilerSyncDuration is the amount of time the reconciler sync states loop
// wait between successive exeutions
Member

typo: executions

Contributor Author

corrected

// If it fails to update node status, skip detach volume
rc.actualStateOfWorld.RemoveVolumeFromReportAsAttached(attachedVolume.VolumeName, attachedVolume.NodeName)
func (rc *reconciler) sync() {
defer rc.updateSyncTime()
Member

Any reason to use defer?

Contributor Author

I think I tried to mirror the similar implementation in the kubelet reconciler. Any reason defer is a concern?

Member

Just a little surprising to see a 2 line function with a defer :-) If it's consistent with usage elsewhere though, that feels right!
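
For reference, the two-line function under discussion is roughly the following (the syncStates call is inferred from the rest of the diff):

    func (rc *reconciler) sync() {
        // Record the sync time regardless of how syncStates returns.
        defer rc.updateSyncTime()
        rc.syncStates()
    }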

func (rc *reconciler) syncStates() {
volumesPerNode := rc.actualStateOfWorld.GetAttachedVolumesPerNode()
for nodeName, volumes := range volumesPerNode {
err := rc.attacherDetacher.VerifyVolumesAreAttached(volumes, nodeName, rc.actualStateOfWorld)
Member

Cloud providers can probably implement this more efficiently if you allow them to do a bulk query...

Contributor Author

you mean to query multiple nodes with one call?

Member

Yes exactly ... we can add this later - I suspect we'll hit rate limits on AWS with big clusters

Member

(and GCE, probably!)

Member

Opened #35748 for follow up
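
For context, the bulk shape discussed here might look something like the hypothetical signature below; it is not part of this PR and only sketches the follow-up idea.

    // Hypothetical future extension: verify attachments for many nodes in a
    // single cloud-provider round trip to stay under API rate limits.
    DisksAreAttachedByNodes(diskNamesByNode map[types.NodeName][]string) (map[types.NodeName]map[string]bool, error)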

@@ -96,6 +96,40 @@ func (attacher *gcePersistentDiskAttacher) Attach(spec *volume.Spec, nodeName ty
return path.Join(diskByIdPath, diskGooglePrefix+pdName), nil
}

func (attacher *gcePersistentDiskAttacher) VolumesAreAttached(specs []*volume.Spec, nodeName types.NodeName) (map[*volume.Spec]bool, error) {
Member

Might be time to create the Volumes interface on the cloudprovider :-) The objection last time was that there could be different types; maybe if the volumes interface took a string argument for the "volume type" that would overcome that objection? cc @thockin


for spec, check := range attached {
if !check {
actualStateOfWorld.MarkVolumeAsDetached(volumeSpecMap[spec], nodeName)
Member

What is the thread safety guarantee here (genuinely asking - I just don't know how we're keeping things sane here)? Do we serialize operations on a node?

Contributor Author

actualStateOfWorld has a sync.RWMutex to protect concurrent reads/writes. Mounting/unmounting of attachable volumes is serialized by nestedpendingoperations per volume. But checking whether volumes are attached can run concurrently with other operations. Please let me know if there is any concern.
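
As a rough illustration of that guarantee (the field layout below is an assumption, simplified from the real cache): each mutation of the actual state takes the write lock, so concurrent MarkVolumeAsDetached calls are safe even though a whole check-then-update sequence is not atomic.

    // Simplified sketch of the locking inside the actual-state cache.
    type actualStateOfWorld struct {
        sync.RWMutex
        // Simplified: the real cache tracks more per-volume/per-node detail.
        attachedVolumes map[api.UniqueVolumeName]types.NodeName
    }

    func (asw *actualStateOfWorld) MarkVolumeAsDetached(
        volumeName api.UniqueVolumeName, nodeName types.NodeName) {
        asw.Lock()
        defer asw.Unlock()
        delete(asw.attachedVolumes, volumeName) // simplified: ignores nodeName
    }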

Member

My concern is that we do asw.GetAttachedVolumesPerNode, which has a read-lock, but that state of the world is then immediately out of date. I think what you're doing here is safe, but I find it hard to think through all the edge cases :-)

Member

This code is racy: like you said, after doing GetAttachedVolumesPerNode the state could immediately be stale (there could be pending operations that complete, or, if sync takes long enough, other operations could start and complete in the meantime). That said, after speaking with @jingxu97 we think that should be ok, because in the worst-case scenario a volume is incorrectly reported as detached when it is actually attached, in which case the controller will try to reattach it; since the operation is idempotent, that should succeed. There shouldn't be any case where a volume is detached and falsely reported as attached due to this code.

@jingxu97
Contributor Author

Updated, @saad-ali PTAL

@saad-ali saad-ali added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Oct 28, 2016
@saad-ali
Member

LGTM

Thanks Jing

@saad-ali
Member

@jessfraz This is not a 1.4.x release blocker. Will be ok to get this cherry picked later next week.

@jessfraz
Contributor

ok cool thanks!

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 3e7172d into kubernetes:master Oct 29, 2016
@jessfraz jessfraz added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Oct 29, 2016
@saad-ali
Member

saad-ali commented Nov 7, 2016

@jingxu97 Can you please cherrypick this to the 1.4 branch for the next 1.4.x release.

@jessfraz
Contributor

jessfraz commented Nov 9, 2016

@jingxu97 can you open the cherry pick so this will be in the release on friday

jessfraz added a commit that referenced this pull request Nov 12, 2016
#35883-upstream-release-1.4

Automated cherry pick of #34859 #35883 upstream release 1.4
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ck-of-#34859-kubernetes#35883-upstream-release-1.4

Automated cherry pick of kubernetes#34859 kubernetes#35883 upstream release 1.4
dims pushed a commit to dims/kubernetes that referenced this pull request Feb 8, 2018
Automatic merge from submit-queue

Add sync state loop in master's volume reconciler
