Add sync state loop in master's volume reconciler #34859
Conversation
	return err
}
// Give an empty UniqueVolumeName so that this operation could be executed concurrently.
return oe.pendingOperations.Run("", "", volumesAreAttachedFunc)
nit: To improve readability, put the variable name in a comment when it is not obvious what the value is for:
return oe.pendingOperations.Run("" /* volumeName */, "" /* podName */, volumesAreAttachedFunc)
done
// Note that this operation could be operated concurrently with other attach/detach operations.
// In theory (but very unlikely in practise), race condition among these operations might mark volume as detached
// even if it is attached. But reconciler can correct this in a short period of time.
CheckVolumesAreAttached(AttachedVolumes []AttachedVolume, nodeName types.NodeName, actualStateOfWorld ActualStateOfWorldAttacherUpdater) error
nit: Check -> Verify
done
	return err
}
// Give an empty UniqueVolumeName so that this operation could be executed concurrently.
return oe.pendingOperations.Run("", "", volumesAreAttachedFunc)
Since we do not care about pending operations for this function, how about we execute it serially inline? That way: 1) we avoid having to make changes to nestedpendingoperations, and 2) we won't start other (potentially racy) operations until this verification step is complete.
Since I am trying to use a batch operation to check whether volumes are attached, there is no single unique volume ID to serialize the operation on.
Right, and that's fine. So the suggestion is to therefore not use nestedpendingoperations at all.
I thought about using goroutineMap, but then we would need to put a goroutineMap into operation_executor alongside nestedPendingOperations, which is not needed if we choose to use nestedpendingoperations.
What I'm trying to say is that since we are not trying to coordinate with other operations, we don't need to use nestedpendingoperations, goroutineMap, or operation_executor at all. We can do everything that the method returned by generateVolumesAreAttachedFunc does serially, in the same file as syncStates.
Ping on this comment
Spoke offline, ok to leave as is.
for spec, check := range attached {
	if !check {
		actualStateOfWorld.MarkVolumeAsDetached(volumeSpecMap[spec], nodeName)
		glog.V(2).Infof("Volume %q is no longer attached, mark it as detached", volumeSpecMap[spec])
Let's make this V(1), and make the message more explicit: VerifyVolumesAreAttached determined volume %q (spec.Name: %q) is no longer attached to node %q, therefore it was marked as detached.
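For reference, the adjusted call would look roughly like this (a sketch based on the variables in the quoted diff; spec.Name() as the spec-name accessor is an assumption):

glog.V(1).Infof(
	"VerifyVolumesAreAttached determined volume %q (spec.Name: %q) is no longer attached to node %q, therefore it was marked as detached.",
	volumeSpecMap[spec], spec.Name(), nodeName)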
done
volumesPerPlugin := make(map[string][]*volume.Spec)
// volumeSpecMap maps from a volume spec to its unique volumeName which will be used
// when calling MarkVolumeAsDetached
volumeSpecMap := make(map[*volume.Spec]api.UniqueVolumeName)
You are using a pointer as a key in a map. This means that the map will compare memory addresses for equality--not the contents of volume.Spec. Is that what you want? It looks like you are doing this in multiple places.
Up to you how you want to handle it. I suggest passing around api.UniqueVolumeName as needed and using that as the key where possible.
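To illustrate the pointer-key behavior with a standalone sketch (not code from this PR): two specs with identical contents do not collide when the map is keyed by pointer, because only addresses are compared.

package main

import "fmt"

// Spec stands in for volume.Spec here, purely for illustration.
type Spec struct{ Name string }

func main() {
	a := &Spec{Name: "vol-1"}
	b := &Spec{Name: "vol-1"} // same contents, different address

	byPointer := map[*Spec]bool{a: true}
	fmt.Println(byPointer[b]) // false: pointer keys compare addresses, not contents

	byName := map[string]bool{a.Name: true}
	fmt.Println(byName[b.Name]) // true: a value key compares contents
}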
Yes, I am trying to compare whether they are the same object (volume.Spec), not the contents. The functions implemented for the Attacher interface all use volume.Spec as parameters, so it is not very convenient here since the operations on the actual/desired states use the volume ID.
Ack
@saad-ali PTAL
@saad-ali PTAL
@@ -321,6 +321,9 @@ type Volumes interface {

	// Check if the volume is already attached to the node with the specified NodeName
	DiskIsAttached(diskName string, nodeName types.NodeName) (bool, error)

	// Check if a list of volumes are attached to the node with the specified NodeName
	DisksAreAttached(diskNames []string, nodeName types.NodeName) ([]bool, error)
I'm a little uncomfortable with this method returning an array where each element corresponds to the value at the same index in the input array. I'd prefer the return value to be understandable (to contain all necessary information) independent of the input. But that's just a queasy feeling, not a hard-and-fast rule--so it's fine if you want to leave it like this.
I changed it to a map.
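For illustration, a minimal sketch of what the map-based signature looks like (paraphrasing the change; the exact comment wording in the PR may differ):

// Check if a list of volumes are attached to the node with the specified NodeName.
// The result is keyed by disk name, so each entry can be understood without
// referring back to the position in the input slice.
DisksAreAttached(diskNames []string, nodeName types.NodeName) (map[string]bool, error)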
@@ -240,6 +240,21 @@ func (os *OpenStack) DiskIsAttached(diskName, instanceID string) (bool, error) {
	return false, nil
}

// query if a list of volumes are attached to a compute instance
func (os *OpenStack) DisksAreAttached(diskNames []string, instanceID string) ([]bool, error) {
@anguslees @dagnello Can you verify this logic for OpenStack? The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached when it should be, the controller will try to reattach the volume.
@anguslees or @dagnello Can you take a look at the OpenStack changes by EOD?
@@ -660,3 +660,21 @@ func (rs *Rackspace) DiskIsAttached(diskName, instanceID string) (bool, error) {
	}
	return false, nil
}

// query if a list volumes are attached to a compute instance
func (rs *Rackspace) DisksAreAttached(diskNames []string, instanceID string) ([]bool, error) {
@anguslees, @jamiehannaford, @jrperritt @mkulke Can you verify this logic for Rackspace? The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached when it should be, the controller will try to reattach the volume.
@anguslees, @jamiehannaford, @jrperritt @mkulke: can one of you verify the rackspace changes by EOD?
@@ -1774,6 +1777,38 @@ func (c *Cloud) DiskIsAttached(diskName string, nodeName types.NodeName) (bool,
	return false, nil
}

func (c *Cloud) DisksAreAttached(diskNames []string, nodeName types.NodeName) ([]bool, error) {
@justinsb Can you take a look at the AWS bits here. The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached when it should be, the controller will try to reattach the volume.
@justinsb Could you take a look at the AWS bits by EOD?
@@ -937,6 +941,61 @@ func (vs *VSphere) DiskIsAttached(volPath string, nodeName k8stypes.NodeName) (b
	return attached, err
}

// DisksAreAttached returns if disks are attached to the VM using controllers supported by the plugin.
func (vs *VSphere) DisksAreAttached(volPaths []string, nodeName k8stypes.NodeName) ([]bool, error) {
VSphere maintainers (@dagnello @abithap @imkin @abrarshivani @kerneltime @luomiao): Can one of you verify this logic for VSphere. The purpose of this method is to verify whether the specified disks are still attached to the specified node. If this check determines the disk is no longer attached, then the controller will try to reattach the volume.
LGTM. Thanks for doing this.
Thanks @abrarshivani
@saad-ali PTAL
But this way, syncStates will block the reconcile function.
@saad-ali PTAL
Will LGTM by EOD. Giving volume plugin owners a chance to take a look in the meantime.
for _, blockDevice := range info.BlockDeviceMappings {
	volumeId := aws.StringValue(blockDevice.Ebs.VolumeId)
	for _, diskName := range diskNames {
		if volumeId == diskName {
I worry this isn't going to be quite right... I think a diskname normally looks like aws:///vol-abcdef
But I follow the same logic as in the existing DiskIsAttached function? Is that function fine?
I believe not, unless I am missing something.
Some background here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1277-L1312
A volume ID from AWS is going to look like vol-abcdef0 (or longer). A volume ID in the volume spec could look like that, or could be aws:///vol-abcdef0. We are supposed to normalize before comparing to raw API IDs, but it looks like I missed that check.
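A rough sketch of the normalization being described, under the assumption that spec IDs may carry an aws:// prefix; normalizeAWSVolumeID is a hypothetical helper name, not the provider's actual API:

import "strings"

// normalizeAWSVolumeID strips an "aws://<zone>/" (or "aws:///") prefix from a
// volume ID taken from a volume spec, so it can be compared against the raw
// EC2 volume ID (e.g. "vol-abcdef0"). Hypothetical helper for illustration only.
func normalizeAWSVolumeID(id string) string {
	id = strings.TrimPrefix(id, "aws://")
	if i := strings.LastIndex(id, "/"); i >= 0 {
		id = id[i+1:]
	}
	return id
}

// Usage inside the loop above, instead of comparing volumeId == diskName directly:
//   if volumeId == normalizeAWSVolumeID(diskName) { ... }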
If you want, we can merge this as is, and then I might do the same thing I did with types.NodeName, with aws.VolumeNameSpec or something, so I can catch any more of these.
I don't think we gate on AWS e2e failures?
Let's do it in a follow-up. Created issue #35746 to follow up.
This is greatly needed, and I think will solve a lot of problems :-)
A few comments... I'll check on the big one, which is whether that is a volume ID in the AWS provider.
	"Instance %q does not exist. DisksAreAttached will assume PD %v are not attached to it.",
	instanceName,
	diskNames)
return attached, nil
Do we want to require that we return InstanceNotFound here, as we do for ExternalID?
Could you please point me to the ExternalID you mean? The logic here is that if the instance does not exist according to the cloud provider, we assume the volumes are all detached. Please let me know if you see any issue with this assumption.
What you're doing is fine. In ExternalID in the cloudprovider, we just always return cloudprovider.InstanceNotFound if the instance is not found. Here it would avoid the duplication of code for this case across the providers.
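A sketch of the ExternalID-style behavior being referenced (getInstanceByNodeName is a hypothetical helper name, used only to show where the sentinel error would be returned):

instance, err := c.getInstanceByNodeName(nodeName) // hypothetical helper
if err != nil {
	return nil, err
}
if instance == nil {
	// ExternalID-style behavior: surface a sentinel error rather than
	// silently assuming the disks are detached.
	return nil, cloudprovider.InstanceNotFound
}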
@@ -56,6 +56,10 @@ const (
	// desiredStateOfWorldPopulatorLoopSleepPeriod is the amount of time the
	// DesiredStateOfWorldPopulator loop waits between successive executions
	desiredStateOfWorldPopulatorLoopSleepPeriod time.Duration = 1 * time.Minute

	// reconcilerSyncDuration is the amount of time the reconciler sync states loop
	// wait between successive exeutions
typo: executions
corrected
// If it fails to update node status, skip detach volume
rc.actualStateOfWorld.RemoveVolumeFromReportAsAttached(attachedVolume.VolumeName, attachedVolume.NodeName)

func (rc *reconciler) sync() {
	defer rc.updateSyncTime()
Any reason to use defer?
I think I tried to follow the similar implementation in the kubelet reconciler. Is there any reason defer is a concern?
Just a little surprising to see a 2 line function with a defer :-) If it's consistent with usage elsewhere though, that feels right!
func (rc *reconciler) syncStates() {
	volumesPerNode := rc.actualStateOfWorld.GetAttachedVolumesPerNode()
	for nodeName, volumes := range volumesPerNode {
		err := rc.attacherDetacher.VerifyVolumesAreAttached(volumes, nodeName, rc.actualStateOfWorld)
Cloud providers can probably implement this more efficiently if you allow them to do a bulk query...
You mean to query multiple nodes with one call?
Yes exactly ... we can add this later - I suspect we'll hit rate limits on AWS with big clusters
(and GCE, probably!)
Opened #35748 for follow-up.
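As a rough illustration of the kind of cross-node batching the follow-up issue is about (a hypothetical interface method, not something this PR implements):

// DisksAreAttachedByNode checks attachment for many nodes in a single cloud
// call, reducing API round-trips and rate-limit pressure on large clusters.
// Hypothetical signature for illustration only.
DisksAreAttachedByNode(diskNamesByNode map[types.NodeName][]string) (map[types.NodeName]map[string]bool, error)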
@@ -96,6 +96,40 @@ func (attacher *gcePersistentDiskAttacher) Attach(spec *volume.Spec, nodeName ty
	return path.Join(diskByIdPath, diskGooglePrefix+pdName), nil
}

func (attacher *gcePersistentDiskAttacher) VolumesAreAttached(specs []*volume.Spec, nodeName types.NodeName) (map[*volume.Spec]bool, error) {
Might be time to create the Volumes interface on the cloudprovider :-) The objection last time was that there could be different types; maybe if the volumes interface took a string argument for the "volume type" that would overcome that objection? cc @thockin
for spec, check := range attached {
	if !check {
		actualStateOfWorld.MarkVolumeAsDetached(volumeSpecMap[spec], nodeName)
What is the thread safety guarantee here (genuinely asking - I just don't know how we're keeping things sane here)? Do we serialize operations on a node?
actualStateOfWorld has a sync.RWMutex to protect against concurrent reads/writes. Mounting/unmounting attachable volumes is serialized by nestedpendingoperations for the same volume. But checking whether volumes are attached can run concurrently with other operations. Please let me know if there is any concern.
My concern is that we do asw.GetAttachedVolumesPerNode, which has a read-lock, but that state of the world is then immediately out of date. I think what you're doing here is safe, but I find it hard to think through all the edge cases :-)
This code is racy; like you said, after doing GetAttachedVolumesPerNode the state could immediately be stale (there could be pending operations that complete, or, if sync takes long enough, other operations could be started and completed in the meantime). That said, after speaking with @jingxu97 we think that should be ok, because in the worst-case scenario a volume is incorrectly reported as detached when it is actually attached, in which case the controller will try to reattach it--since the operation is idempotent, that should succeed. There shouldn't be any case where a volume is detached and falsely reported as attached due to this code.
Updated, @saad-ali PTAL
LGTM. Thanks Jing
@jessfraz This is not a 1.4.x release blocker. It will be ok to get this cherry-picked later next week.
ok cool thanks!
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
@jingxu97 Can you please cherry-pick this to the 1.4 branch for the next 1.4.x release.
@jingxu97 Can you open the cherry pick so this will be in the release on Friday?
…ck-of-#34859-kubernetes#35883-upstream-release-1.4 Automated cherry pick of kubernetes#34859 kubernetes#35883 upstream release 1.4
Automatic merge from submit-queue: Add sync state loop in master's volume reconciler
In the master's volume reconciler, the information about which volumes are
attached to nodes is cached in the actual state of the world. However, this
information might be out of date if a node is terminated, because its volumes
are detached automatically. In this situation, the reconciler assumes the
volume is still attached and will not issue an attach operation when the node
comes back, so pods created on that node will fail to mount.
This PR adds logic to periodically sync the cached attached-volume state
against the truth. If a volume is no longer attached to the node, the actual
state is updated to reflect that, and the reconciler then takes action as
needed.
To avoid issuing many concurrent operations against the cloud provider, this
PR adds a batch operation that checks whether a list of volumes is attached
to a node, instead of one request per volume.
Detailed design is in #33760
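A condensed sketch of the sync loop described above, assuming the reconciler fields and method names quoted earlier in this conversation (simplified; timing details omitted):

func (rc *reconciler) syncStates() {
	// Group cached attachments by node so each node needs only one batched
	// VerifyVolumesAreAttached call against the cloud provider.
	volumesPerNode := rc.actualStateOfWorld.GetAttachedVolumesPerNode()
	for nodeName, volumes := range volumesPerNode {
		err := rc.attacherDetacher.VerifyVolumesAreAttached(volumes, nodeName, rc.actualStateOfWorld)
		if err != nil {
			glog.Errorf("Error verifying volumes are attached to node %q: %v", nodeName, err)
		}
	}
	// Volumes found to be detached are marked detached in the actual state of
	// the world; the reconcile loop then re-issues attach operations as needed.
	rc.updateSyncTime()
}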