AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones #27553
Changes from 2 commits
@@ -120,9 +120,11 @@ type Disks interface {
 	// DeleteDisk deletes PD.
 	DeleteDisk(diskToDelete string) error
 
-	// GetAutoLabelsForPD returns labels to apply to PeristentVolume
+	// GetAutoLabelsForPD returns labels to apply to PersistentVolume
 	// representing this PD, namely failure domain and zone.
-	GetAutoLabelsForPD(name string) (map[string]string, error)
+	// zone can be provided to specify the zone for the PD,
+	// if empty all managed zones will be searched.
+	GetAutoLabelsForPD(name string, zone string) (map[string]string, error)
 }
 
 type instRefSlice []*compute.InstanceReference
@@ -2066,6 +2068,38 @@ func (gce *GCECloud) List(filter string) ([]string, error) {
 	return instances, nil
 }
 
+// GetAllZones returns all the zones in which nodes are running
+func (gce *GCECloud) GetAllZones() (sets.String, error) {
Review comment: I'm guessing it's easier to do this than make it a private method in the same module, like we did for AWS?

Reply: Correct. The GCE PD creation code is split between here and the volumes package; AWS has a cleaner separation of code. So the zone is determined in the cloudprovider in AWS, and in the volumes code in GCE. Note that this method is public, but is not part of the public interface, i.e. it is public on GCECloud, not cloudprovider.Interface.
+	if len(gce.managedZones) == 1 {
+		return sets.NewString(gce.managedZones...), nil
+	}
+
+	// TODO: Caching, but this is currently only called when we are creating a volume,
+	// which is a relatively infrequent operation, and this is only # zones API calls
+	zones := sets.NewString()
+
+	// TODO: Parallelize, although O(zones) so not too bad (N <= 3 typically)
+	for _, zone := range gce.managedZones {
+		// We only retrieve one page in each zone - we only care about existence
+		listCall := gce.service.Instances.List(gce.projectID, zone)
Review comment: Is there a reason we're listing all instances here and only running instances above? Is it simpler to not account for instance state, since one can't guess whether all instances will go down right after we allocate the PV anyway?

Reply: I debated this, and am happy to change. I tried to sum it up in "No filter: we assume that a zone is either used or unused". My view was that here we should probably just keep it simple, and if an instance was starting up we would still count it as a zone in use. The above method is constrained to exclude them, but I thought it probably simpler to just keep this method simple for now (we can always filter out instances that are shutting down later...)
+
+		// No filter: We assume that a zone is either used or unused
+
+		// Just a minimal set of fields - we only care about existence
+		listCall = listCall.Fields("items(name)")
Review comment: @zmerlynn FYI, another list call on Fields, but I think this one's OK since it's essentially just checking for exit code 0.

Reply: This looks fine. The function could be a bit more pessimistic about which zones it uses if the …
+
+		res, err := listCall.Do()
+		if err != nil {
+			return nil, err
+		}
+		if len(res.Items) != 0 {
+			zones.Insert(zone)
+		}
+	}
+
+	return zones, nil
+}
+
 func getMetadataValue(metadata *compute.Metadata, key string) (string, bool) {
 	for _, item := range metadata.Items {
 		if item.Key == key {
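As the reply above notes, GetAllZones is public on GCECloud but not part of cloudprovider.Interface. A minimal sketch of how code in the volumes package could reach it, assuming a type assertion on the generic cloud provider (the function name and wiring here are illustrative, not taken from this PR):

```go
package example

import (
	"fmt"

	"k8s.io/kubernetes/pkg/cloudprovider"
	gcecloud "k8s.io/kubernetes/pkg/cloudprovider/providers/gce"
	"k8s.io/kubernetes/pkg/util/sets"
)

// zonesForProvisioning reaches the GCE-specific GetAllZones method, which is
// not declared on cloudprovider.Interface, via a type assertion.
func zonesForProvisioning(cloud cloudprovider.Interface) (sets.String, error) {
	gceCloud, ok := cloud.(*gcecloud.GCECloud)
	if !ok {
		return nil, fmt.Errorf("expected GCE cloud provider, got %T", cloud)
	}
	return gceCloud.GetAllZones()
}
```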
@@ -2218,13 +2252,27 @@ func (gce *GCECloud) DeleteDisk(diskToDelete string) error {
 // Builds the labels that should be automatically added to a PersistentVolume backed by a GCE PD
 // Specifically, this builds FailureDomain (zone) and Region labels.
 // The PersistentVolumeLabel admission controller calls this and adds the labels when a PV is created.
-func (gce *GCECloud) GetAutoLabelsForPD(name string) (map[string]string, error) {
-	disk, err := gce.getDiskByNameUnknownZone(name)
-	if err != nil {
-		return nil, err
+// If zone is specified, the volume will only be found in the specified zone,
+// otherwise all managed zones will be searched.
+func (gce *GCECloud) GetAutoLabelsForPD(name string, zone string) (map[string]string, error) {
+	var disk *gceDisk
+	var err error
+	if zone == "" {
+		disk, err = gce.getDiskByNameUnknownZone(name)
Review comment: How bad would it be to always get by unknown zone, then match the zone and throw an error/log? It might save us in the case where something fishy happens and we end up with cross-matched zones.

Reply: Ah... so this is subtle. This is actually a workaround for #27656 and #27657. The problem is that in the volumes admission controller the cloudprovider is not configured with multizone=true. This would prevent getDiskByNameUnknownZone from working at all unless the PD was in the master zone. However, the workaround here is to allow creation in a non-master zone by looking at this zone parameter, which is derived by the admission controller when the zone annotation is already set. Also, we actually want to be moving away from getDiskByNameUnknownZone and towards volumes that always have an associated zone on GCE, because volume names on GCE are unique per zone, not per region; i.e. the correct specifier for a GCE PD is probably the zone plus the volume name. I guess this needs a comment!
+		if err != nil {
+			return nil, err
+		}
+		zone = disk.Zone
+	} else {
+		// We could assume the disks exists; we have all the information we need
+		// However it is more consistent to ensure the disk exists,
+		// and in future we may gather addition information (e.g. disk type, IOPS etc)
+		disk, err = gce.getDiskByName(name, zone)
+		if err != nil {
+			return nil, err
+		}
 	}
 
-	zone := disk.Zone
 	region, err := GetGCERegion(zone)
 	if err != nil {
 		return nil, err
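For context on the last lines above: GetGCERegion derives the region label from the zone name, which in GCE's naming scheme simply means trimming the trailing zone suffix (e.g. us-central1-f sits in region us-central1). A small illustrative sketch of that derivation, not the PR's own code:

```go
package example

import (
	"fmt"
	"strings"
)

// regionFromZone illustrates the zone-to-region derivation used for the
// PersistentVolume labels: "us-central1-f" -> "us-central1".
func regionFromZone(zone string) (string, error) {
	ix := strings.LastIndex(zone, "-")
	if ix == -1 {
		return "", fmt.Errorf("unexpected zone format: %q", zone)
	}
	return zone[:ix], nil
}

func main() {
	region, err := regionFromZone("us-central1-f")
	if err != nil {
		panic(err)
	}
	fmt.Println(region) // prints "us-central1"
}
```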
@@ -342,6 +342,6 @@ func (testcase *testcase) DeleteDisk(diskToDelete string) error {
 	return errors.New("Not implemented")
 }
 
-func (testcase *testcase) GetAutoLabelsForPD(name string) (map[string]string, error) {
+func (testcase *testcase) GetAutoLabelsForPD(name string, zone string) (map[string]string, error) {
Review comment: How hard would a unit test be that checks the zone for a name like "foo-0"? Optionally, you can actually create a shell petset (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/fakes.go#L73), create a name identity mapper (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/identity_mappers.go#L52), set the identity (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/identity_mappers.go#L64), then pass it through the provisioner and check its zone. That way I won't be able to get away with breaking the identity mapper without fixing this :)

Reply: I guess we could have two tests: one that tests that PetSets create PVCs with names like ClaimName-PetSetName-Id, and one that checks the zone chosen for names of that form.
 	return map[string]string{}, errors.New("Not implemented")
 }
@@ -49,6 +49,8 @@ type VolumeOptions struct {
 	// PV.Name of the appropriate PersistentVolume. Used to generate cloud
 	// volume name.
 	PVName string
+	// PVC.Name of the PersistentVolumeClaim, if we are auto-provisioning.
Review comment: Is this only set when we're auto-provisioning?

Reply: All of VolumeOptions is only used for auto-provisioning. I guess "if we are auto-provisioning" is technically superfluous; I thought it was clearer, but I guess it begs the question of when VolumeOptions is used other than for auto-provisioning.
+	PVCName string
 	// Unique name of Kubernetes cluster.
 	ClusterName string
 	// Tags to attach to the real volume in the cloud provider - e.g. AWS EBS
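To show why PVCName is worth plumbing through: a cloud-specific provisioner can feed it, together with the zone set from the cloud provider, into the zone-choosing heuristic added later in this PR (ChooseZoneForVolume). The sketch below is hedged: the function name and exact wiring are illustrative, not the PR's provisioner code.

```go
package example

import (
	gcecloud "k8s.io/kubernetes/pkg/cloudprovider/providers/gce"
	"k8s.io/kubernetes/pkg/volume"
)

// chooseZoneForClaim sketches how a GCE PD provisioner might use the new
// PVCName field: list the zones that currently contain nodes, then pick one
// of them based on the claim name (with PetSet-aware spreading).
func chooseZoneForClaim(cloud *gcecloud.GCECloud, options volume.VolumeOptions) (string, error) {
	zones, err := cloud.GetAllZones()
	if err != nil {
		return "", err
	}
	return volume.ChooseZoneForVolume(zones, options.PVCName), nil
}
```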
@@ -28,8 +28,13 @@ import (
 	"k8s.io/kubernetes/pkg/watch"
 
 	"github.com/golang/glog"
+	"hash/fnv"
 	"k8s.io/kubernetes/pkg/api/errors"
 	"k8s.io/kubernetes/pkg/api/resource"
+	"k8s.io/kubernetes/pkg/util/sets"
+	"math/rand"
+	"strconv"
+	"strings"
 )
 
 // RecycleVolumeByWatchingPodUntilCompletion is intended for use with volume
@@ -187,3 +192,47 @@ func GenerateVolumeName(clusterName, pvName string, maxLength int) string {
 	}
 	return prefix + "-" + pvName
 }
+
+// ChooseZone implements our heuristics for choosing a zone for volume creation based on the volume name
+func ChooseZoneForVolume(zones sets.String, pvcName string) string {
Review comment: Can you document that if anyone creates volumes like foo-index they'll get spreading across zones? (Just a comment is fine.)

Reply: Will do!
+	// We create the volume in a zone determined by the name
+	// Eventually the scheduler will coordinate placement into an available zone
+	var hash uint32
+	var index uint32
+
+	if pvcName == "" {
+		// We should always be called with a name; this shouldn't happen
+		glog.Warningf("No Name defined during volume create; choosing random zone")
Review comment: Please lower-case 'Name'.
+
+		hash = rand.Uint32()
+	} else {
+		hashString := pvcName
+
+		// Heuristic to make sure that volumes in a PetSet are spread across zones
+		// PetSet PVCs are (currently) named ClaimName-PetSetName-Id,
+		// where Id is an integer index
+		lastDash := strings.LastIndexByte(pvcName, '-')
+		if lastDash != -1 {
+			petIDString := pvcName[lastDash+1:]
+			petID, err := strconv.ParseUint(petIDString, 10, 32)
Review comment: Can you please add a comment above where we insert the pet name identity mentioning that it's used to detect the zone? We should consider isolating this into a common function called …

Reply: Sure. I'm using the PVC name though - it seemed superfluous to add another annotation, and also like it would box us in for the future (we'd have to support it forever, whereas now we just have a heuristic that means that PetSet volumes do a sensible thing in most cases today, and in future they can do the optimal thing in all cases).
+			if err == nil {
+				// Offset by the pet id, so we round-robin across zones
+				index = uint32(petID)
+				// We still hash the volume name, but only the base
+				hashString = pvcName[:lastDash]
+				glog.V(2).Infof("Detected PetSet-style volume name %q; index=%d", pvcName, index)
+			}
Review comment: Log the error at Warningf, if any?

Reply: We expect an error though if the PVC name is not a PetSet-created name (e.g. "my-cool-pvc").
+		}
+
+		// We hash the (base) volume name, so we don't bias towards the first N zones
+		h := fnv.New32()
+		h.Write([]byte(hashString))
+		hash = h.Sum32()
+	}
+
+	zoneSlice := zones.List()
Review comment: Does the spreading property depend on zones.List() returning the zones in the same order every time it's called? If so, do we know this will actually be true?

Reply: It does depend on it, but List() is sorted: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/sets/string.go#L167 This could probably use a comment though. The case where it fails is if the set of zones changes while volumes are being created for a PetSet. I think the most likely case here is that we grow a PetSet from N=1 -> N=3 some time after initial creation, and in the meantime the user has also expanded their k8s cluster. However, I don't really know what we can practically do better here in 1.3 (suggestions welcome). We really should look at the set of PVCs and pick a zone based on the other allocations (like we would with pod anti-affinity), but those scheduler-like changes seem impractical for 1.3 (and we know we want to do them later, which is why we are favoring the name-based hack over an annotation that might constrain us in future). I hypothesize though that what we have here will be good enough in practice. I don't think AWS or GCE have any regions with 5 zones. I suspect most HA clusters will be launched into 3 zones, and then the set of zones will not change.

Reply: Yeah, that's fine, I was just curious. Thanks for the thorough explanation!

Reply: Yeah, I think we can simply document around scaling zones while a PetSet is scaling.
+	zone := zoneSlice[(hash+index)%uint32(len(zoneSlice))]
+
+	glog.V(2).Infof("Creating volume for PVC %q; chose zone=%q from zones=%q", pvcName, zone, zoneSlice)
+	return zone
+}
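To make the heuristic concrete, and in the spirit of the unit test the reviewer asks for above, here is a hypothetical example (not from the PR): PVC names that share a base and differ only in the trailing index hash to the same starting zone, and the index then round-robins them across the sorted zone list.

```go
package example

import (
	"fmt"

	"k8s.io/kubernetes/pkg/util/sets"
	"k8s.io/kubernetes/pkg/volume"
)

func main() {
	zones := sets.NewString("us-central1-a", "us-central1-b", "us-central1-c")

	// PetSet-style claim names: ClaimName-PetSetName-Id. The shared base
	// ("data-web") hashes to a starting zone; the trailing index then
	// spreads the volumes across consecutive entries of the sorted zone list.
	for i := 0; i < 3; i++ {
		pvcName := fmt.Sprintf("data-web-%d", i)
		fmt.Printf("%s -> %s\n", pvcName, volume.ChooseZoneForVolume(zones, pvcName))
	}
}
```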
Review comment: How bad is this in large clusters in terms of delay? Are we doing it per PVC? I'm guessing it's not bad enough in terms of QPS to put us into backoff?

Reply: I don't think creating PDs is a terribly common operation. We used to list all instances every 10 seconds (but we no longer do that), so I think we should be OK. Originally I did try caching, but now that we no longer poll Instances List, that becomes much harder. I propose that we run this as-is for now, and then if/when we identify it as a problem we can add time-based caching. Ideally the cloudprovider should have access to a watched set of Nodes, in which case this would be a trivial operation. This actually comes up in a number of places (e.g. LoadBalancer). Maybe we do that in 1.4?

Reply: SG
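The time-based caching mentioned in the reply is explicitly left out of this PR. Purely as a sketch of what such a follow-up could look like (all names here are hypothetical), a small TTL cache in front of an expensive zone lookup like GetAllZones:

```go
package example

import (
	"sync"
	"time"

	"k8s.io/kubernetes/pkg/util/sets"
)

// zoneCache is a hypothetical TTL cache in front of an expensive
// "which zones have nodes" lookup such as GCECloud.GetAllZones.
type zoneCache struct {
	mutex   sync.Mutex
	ttl     time.Duration
	lookup  func() (sets.String, error)
	cached  sets.String
	fetched time.Time
}

// Zones returns the cached zone set while it is fresh, refreshing it
// via the lookup function once the TTL has expired.
func (c *zoneCache) Zones() (sets.String, error) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	if c.cached != nil && time.Since(c.fetched) < c.ttl {
		return c.cached, nil
	}
	zones, err := c.lookup()
	if err != nil {
		return nil, err
	}
	c.cached = zones
	c.fetched = time.Now()
	return zones, nil
}
```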