Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones #27553

Merged
merged 3 commits into from
Jun 22, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 52 additions & 4 deletions pkg/cloudprovider/providers/aws/aws.go
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ import (
aws_credentials "k8s.io/kubernetes/pkg/credentialprovider/aws"
"k8s.io/kubernetes/pkg/types"
"k8s.io/kubernetes/pkg/util/sets"
"k8s.io/kubernetes/pkg/volume"
)

const ProviderName = "aws"
Expand Down Expand Up @@ -198,6 +199,7 @@ type EC2Metadata interface {
type VolumeOptions struct {
CapacityGB int
Tags map[string]string
PVCName string
}

// Volumes is an interface for managing cloud-provisioned volumes
Expand Down Expand Up @@ -914,6 +916,49 @@ func (aws *AWSCloud) List(filter string) ([]string, error) {
return aws.getInstancesByRegex(filter)
}

// getAllZones retrieves a list of all the zones in which nodes are running
// It currently involves querying all instances
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how bad is this in large clusters in terms of delay? are we doing it per pvc, I'm guessing its not bad enough in terms of qps to put us into backoff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think creating PDs is a terribly common operation. We used to list all instances every 10 seconds (but we no longer do that) so I think we should be OK. Originally I did try caching, but now that we no longer poll Instances List, that becomes much harder.

I propose that we run this as-is for now, and then if/when we identify it as a problem we can add time-based caching.

Ideally the cloudprovider should have access to a watched set of Nodes, in which case this would be a trivial operation. This actually comes up in a number of places (e.g. LoadBalancer). Maybe we do that in 1.4?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG

func (c *AWSCloud) getAllZones() (sets.String, error) {
// We don't currently cache this; it is currently used only in volume
// creation which is expected to be a comparatively rare occurence.

// TODO: Caching / expose api.Nodes to the cloud provider?
// TODO: We could also query for subnets, I think

filters := []*ec2.Filter{newEc2Filter("instance-state-name", "running")}
filters = c.addFilters(filters)
request := &ec2.DescribeInstancesInput{
Filters: filters,
}

instances, err := c.ec2.DescribeInstances(request)
if err != nil {
return nil, err
}
if len(instances) == 0 {
return nil, fmt.Errorf("no instances returned")
}

zones := sets.NewString()

for _, instance := range instances {
// Only return fully-ready instances when listing instances
// (vs a query by name, where we will return it if we find it)
if orEmpty(instance.State.Name) == "pending" {
glog.V(2).Infof("Skipping EC2 instance (pending): %s", *instance.InstanceId)
continue
}

if instance.Placement != nil {
zone := aws.StringValue(instance.Placement.AvailabilityZone)
zones.Insert(zone)
}
}

glog.V(2).Infof("Found instances in zones %s", zones)
return zones, nil
}

// GetZone implements Zones.GetZone
func (c *AWSCloud) GetZone() (cloudprovider.Zone, error) {
return cloudprovider.Zone{
Expand Down Expand Up @@ -1387,11 +1432,14 @@ func (aws *AWSCloud) DetachDisk(diskName string, instanceName string) (string, e
return hostDevicePath, err
}

// Implements Volumes.CreateVolume
// CreateDisk implements Volumes.CreateDisk
func (s *AWSCloud) CreateDisk(volumeOptions *VolumeOptions) (string, error) {
// Default to creating in the current zone
// TODO: Spread across zones?
createAZ := s.selfAWSInstance.availabilityZone
allZones, err := s.getAllZones()
if err != nil {
return "", fmt.Errorf("error querying for all zones: %v", err)
}

createAZ := volume.ChooseZoneForVolume(allZones, volumeOptions.PVCName)

// TODO: Should we tag this with the cluster id (so it gets deleted when the cluster does?)
request := &ec2.CreateVolumeInput{}
Expand Down
78 changes: 71 additions & 7 deletions pkg/cloudprovider/providers/gce/gce.go
Original file line number Diff line number Diff line change
Expand Up @@ -120,9 +120,11 @@ type Disks interface {
// DeleteDisk deletes PD.
DeleteDisk(diskToDelete string) error

// GetAutoLabelsForPD returns labels to apply to PeristentVolume
// GetAutoLabelsForPD returns labels to apply to PersistentVolume
// representing this PD, namely failure domain and zone.
GetAutoLabelsForPD(name string) (map[string]string, error)
// zone can be provided to specify the zone for the PD,
// if empty all managed zones will be searched.
GetAutoLabelsForPD(name string, zone string) (map[string]string, error)
}

type instRefSlice []*compute.InstanceReference
Expand Down Expand Up @@ -2066,6 +2068,48 @@ func (gce *GCECloud) List(filter string) ([]string, error) {
return instances, nil
}

// GetAllZones returns all the zones in which nodes are running
func (gce *GCECloud) GetAllZones() (sets.String, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing it's easier to do this than make it a private method in the same module, like we did for aws ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The GCE PD creation code is split between here and the volumes package; AWS has a cleaner separation of code. So the zone is determine in the cloudprovider in AWS, and in the volumes code in GCE.

Note that this method is public, but is not part of the public interface. i.e. it is public on GCECloud, not cloudprovider.Interface

// Fast-path for non-multizone
if len(gce.managedZones) == 1 {
return sets.NewString(gce.managedZones...), nil
}

// TODO: Caching, but this is currently only called when we are creating a volume,
// which is a relatively infrequent operation, and this is only # zones API calls
zones := sets.NewString()

// TODO: Parallelize, although O(zones) so not too bad (N <= 3 typically)
for _, zone := range gce.managedZones {
// We only retrieve one page in each zone - we only care about existence
listCall := gce.service.Instances.List(gce.projectID, zone)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reson we're listing all instances here and only running instances above ? is it simpler to not account for instance state, since one can't guess if all instances will go down right after we allocate the pv anyway

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated this, and am happy to change. I tried to sum it up in "No filter: we assume that a zone is either used or unused". My view was that here we should probably just keep it simple, and if an instance was starting up we would still count it as a zone in use. The above method is constrained to exclude them, but I thought it probably simpler to just keep this method simple for now (we can always filter out instances that are shutting down later...)


// No filter: We assume that a zone is either used or unused
// We could only consider running nodes (like we do in List above),
// but probably if instances are starting we still want to consider them.
// I think we should wait until we have a reason to make the
// call one way or the other; we generally can't guarantee correct
// volume spreading if the set of zones is changing
// (and volume spreading is currently only a heuristic).
// Long term we want to replace GetAllZones (which primarily supports volume
// spreading) with a scheduler policy that is able to see the global state of
// volumes and the health of zones.

// Just a minimal set of fields - we only care about existence
listCall = listCall.Fields("items(name)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zmerlynn fyi another list call on fields, but I think this ones ok since it's just checking for exit code 0, essentially

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. The function could be a bit more pessimistic about which zones it uses if the nodeInstancePrefix I added in #27741 wasn't actually advisory, but right now we let people join arbitrary garbage to a master, so this sort of check is about all that you can do.


res, err := listCall.Do()
if err != nil {
return nil, err
}
if len(res.Items) != 0 {
zones.Insert(zone)
}
}

return zones, nil
}

func getMetadataValue(metadata *compute.Metadata, key string) (string, bool) {
for _, item := range metadata.Items {
if item.Key == key {
Expand Down Expand Up @@ -2218,13 +2262,33 @@ func (gce *GCECloud) DeleteDisk(diskToDelete string) error {
// Builds the labels that should be automatically added to a PersistentVolume backed by a GCE PD
// Specifically, this builds FailureDomain (zone) and Region labels.
// The PersistentVolumeLabel admission controller calls this and adds the labels when a PV is created.
func (gce *GCECloud) GetAutoLabelsForPD(name string) (map[string]string, error) {
disk, err := gce.getDiskByNameUnknownZone(name)
if err != nil {
return nil, err
// If zone is specified, the volume will only be found in the specified zone,
// otherwise all managed zones will be searched.
func (gce *GCECloud) GetAutoLabelsForPD(name string, zone string) (map[string]string, error) {
var disk *gceDisk
var err error
if zone == "" {
// We would like as far as possible to avoid this case,
// because GCE doesn't guarantee that volumes are uniquely named per region,
// just per zone. However, creation of GCE PDs was originally done only
// by name, so we have to continue to support that.
// However, wherever possible the zone should be passed (and it is passed
// for most cases that we can control, e.g. dynamic volume provisioning)
disk, err = gce.getDiskByNameUnknownZone(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how bad would it be to always get by unknown zone, then match zone and throw an error/log? it might save us in the case where something fishy happens and we end up with cross matched zones

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... so this is subtle. This is actually a workaround for #27656 and #27657. The problem is that in the volumes admission controller the cloudprovider is not configured with multizone=true. This would prevent getDiskByNameUnknownZone from working at all unless the PD was in the master zone. However, the workaround here is to allow creation in a non-master zone by looking to this zone parameter, which is derived by the admission controller when the zone annotation is already set.

Also, we actually want to be moving away getDiskByNameUnknownZone and towards volumes that always have an associated zone on GCE, because volume names on GCE are unique per zone, not per region. i.e. the correct specifier for a GCE PD is probably <zone>/<name>, not <name>. Thankfully with dynamic provisioning we can just use the annotation (which is painful for a human but fine for code), and static provisioning never really made sense on a cloud anyway...

I guess this needs a comment!

if err != nil {
return nil, err
}
zone = disk.Zone
} else {
// We could assume the disks exists; we have all the information we need
// However it is more consistent to ensure the disk exists,
// and in future we may gather addition information (e.g. disk type, IOPS etc)
disk, err = gce.getDiskByName(name, zone)
if err != nil {
return nil, err
}
}

zone := disk.Zone
region, err := GetGCERegion(zone)
if err != nil {
return nil, err
Expand Down
1 change: 1 addition & 0 deletions pkg/controller/persistentvolume/controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -1125,6 +1125,7 @@ func (ctrl *PersistentVolumeController) provisionClaimOperation(claimObj interfa
CloudTags: &tags,
ClusterName: ctrl.clusterName,
PVName: pvName,
PVCName: claim.Name,
}

// Provision the volume
Expand Down
1 change: 1 addition & 0 deletions pkg/volume/aws_ebs/aws_util.go
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@ func (util *AWSDiskUtil) CreateVolume(c *awsElasticBlockStoreProvisioner) (strin
volumeOptions := &aws.VolumeOptions{
CapacityGB: requestGB,
Tags: tags,
PVCName: c.options.PVCName,
}

name, err := cloud.CreateDisk(volumeOptions)
Expand Down
2 changes: 1 addition & 1 deletion pkg/volume/gce_pd/attacher_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -342,6 +342,6 @@ func (testcase *testcase) DeleteDisk(diskToDelete string) error {
return errors.New("Not implemented")
}

func (testcase *testcase) GetAutoLabelsForPD(name string) (map[string]string, error) {
func (testcase *testcase) GetAutoLabelsForPD(name string, zone string) (map[string]string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how hard is a "name: foo-0" check zone type unittests?

Optionally, you can actually create a shell petset (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/fakes.go#L73), create a name identity mapper (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/identity_mappers.go#L52), set the identity (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/identity_mappers.go#L64), then pass it through the provisioner and check it's zone. That way I won't be able to get away with breaking the identity mapper without fixing this :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could have two tests: one that tests that Petsets creates PVCs with names like <prefix>-<ordinal>, and the other that checks that PVCs with names like <prefix>-<ordinal> are round-robin-ed across zones.

return map[string]string{}, errors.New("Not implemented")
}
8 changes: 5 additions & 3 deletions pkg/volume/gce_pd/gce_util.go
Original file line number Diff line number Diff line change
Expand Up @@ -82,20 +82,22 @@ func (gceutil *GCEDiskUtil) CreateVolume(c *gcePersistentDiskProvisioner) (strin

// The disk will be created in the zone in which this code is currently running
// TODO: We should support auto-provisioning volumes in multiple/specified zones
zone, err := cloud.GetZone()
zones, err := cloud.GetAllZones()
if err != nil {
glog.V(2).Infof("error getting zone information from GCE: %v", err)
return "", 0, nil, err
}

err = cloud.CreateDisk(name, zone.FailureDomain, int64(requestGB), *c.options.CloudTags)
zone := volume.ChooseZoneForVolume(zones, c.options.PVCName)

err = cloud.CreateDisk(name, zone, int64(requestGB), *c.options.CloudTags)
if err != nil {
glog.V(2).Infof("Error creating GCE PD volume: %v", err)
return "", 0, nil, err
}
glog.V(2).Infof("Successfully created GCE PD volume %s", name)

labels, err := cloud.GetAutoLabelsForPD(name)
labels, err := cloud.GetAutoLabelsForPD(name, zone)
if err != nil {
// We don't really want to leak the volume here...
glog.Errorf("error getting labels for volume %q: %v", name, err)
Expand Down
2 changes: 2 additions & 0 deletions pkg/volume/plugins.go
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@ type VolumeOptions struct {
// PV.Name of the appropriate PersistentVolume. Used to generate cloud
// volume name.
PVName string
// PVC.Name of the PersistentVolumeClaim; only set during dynamic provisioning.
PVCName string
// Unique name of Kubernetes cluster.
ClusterName string
// Tags to attach to the real volume in the cloud provider - e.g. AWS EBS
Expand Down
61 changes: 61 additions & 0 deletions pkg/volume/util.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,13 @@ import (
"k8s.io/kubernetes/pkg/watch"

"github.com/golang/glog"
"hash/fnv"
"k8s.io/kubernetes/pkg/api/errors"
"k8s.io/kubernetes/pkg/api/resource"
"k8s.io/kubernetes/pkg/util/sets"
"math/rand"
"strconv"
"strings"
)

// RecycleVolumeByWatchingPodUntilCompletion is intended for use with volume
Expand Down Expand Up @@ -187,3 +192,59 @@ func GenerateVolumeName(clusterName, pvName string, maxLength int) string {
}
return prefix + "-" + pvName
}

// ChooseZone implements our heuristics for choosing a zone for volume creation based on the volume name
// Volumes are generally round-robin-ed across all active zones, using the hash of the PVC Name.
// However, if the PVCName ends with `-<integer>`, we will hash the prefix, and then add the integer to the hash.
// This means that a PetSet's volumes (`claimname-petsetname-id`) will spread across available zones,
// assuming the id values are consecutive.
func ChooseZoneForVolume(zones sets.String, pvcName string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document that if anyone creates volumes like foo-index they'll get spreading across zones? (just comment is fine)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

// We create the volume in a zone determined by the name
// Eventually the scheduler will coordinate placement into an available zone
var hash uint32
var index uint32

if pvcName == "" {
// We should always be called with a name; this shouldn't happen
glog.Warningf("No name defined during volume create; choosing random zone")

hash = rand.Uint32()
} else {
hashString := pvcName

// Heuristic to make sure that volumes in a PetSet are spread across zones
// PetSet PVCs are (currently) named ClaimName-PetSetName-Id,
// where Id is an integer index
lastDash := strings.LastIndexByte(pvcName, '-')
if lastDash != -1 {
petIDString := pvcName[lastDash+1:]
petID, err := strconv.ParseUint(petIDString, 10, 32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add a comment above where we insert the pet name identityt mentioning that it's used to detect zone? we should consider isolating this into a common function called getOrdinality or something that just returns an index for any given name that other parts of the system can reuse, but we can do that later and just put in a comment for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'm using the PVC name though - it seemed superfluous to add another annotation, and also like it would box us in for the future (we'd have to support it forever, whereas now we just have a heuristic that means that PetSet volumes do a sensible thing in most cases today, and in future they can do the optimal thing in all cases)

if err == nil {
// Offset by the pet id, so we round-robin across zones
index = uint32(petID)
// We still hash the volume name, but only the base
hashString = pvcName[:lastDash]
glog.V(2).Infof("Detected PetSet-style volume name %q; index=%d", pvcName, index)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log the error at warningf, if any?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We expect an error though if the PVC name is not a PetSet created name (e.g. "my-cool-pvc").

}

// We hash the (base) volume name, so we don't bias towards the first N zones
h := fnv.New32()
h.Write([]byte(hashString))
hash = h.Sum32()
}

// Zones.List returns zones in a consistent order (sorted)
// We do have a potential failure case where volumes will not be properly spread,
// if the set of zones changes during PetSet volume creation. However, this is
// probably relatively unlikely because we expect the set of zones to be essentially
// static for clusters.
// Hopefully we can address this problem if/when we do full scheduler integration of
// PVC placement (which could also e.g. avoid putting volumes in overloaded or
// unhealthy zones)
zoneSlice := zones.List()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the spreading property depend on zones.List() returning the zones in the same order every time it's called? If so, do we know this will actually be true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does depend on it, but List() is sorted: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/sets/string.go#L167

This could probably use a comment though.

The case where it fails is if the set of zone changes while volumes are being created for a PetSet. I think the most likely case here is that we grow a PetSet from N=1 -> N=3 some time after initial creation, and in the meantime the user has also expanded their k8s cluster. However, I don't really know what we can practically do better here in 1.3 (suggestions welcome). We really should look at the set of PVCs and pick a zone based on the other allocations (like we would with pod anti affinity), but those scheduler-like changes seem impractical for 1.3 (and we know we want to do them later, which is why we are favoring the name-based hack over an annotation that might constrain us in future).

I hypothesize though that what we have here will be good enough in practice. I don't think AWS or GCE have any regions with 5 zones. I suspect most HA clusters will be launched into 3 zones, and then the set of zones will not change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fine, I was just curious. Thanks for the thorough explanation!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i think we can simply document around scaling zones while a pet is scaling

zone := zoneSlice[(hash+index)%uint32(len(zoneSlice))]

glog.V(2).Infof("Creating volume for PVC %q; chose zone=%q from zones=%q", pvcName, zone, zoneSlice)
return zone
}
6 changes: 5 additions & 1 deletion plugin/pkg/admission/persistentvolume/label/admission.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (

"k8s.io/kubernetes/pkg/admission"
"k8s.io/kubernetes/pkg/api"
"k8s.io/kubernetes/pkg/api/unversioned"
"k8s.io/kubernetes/pkg/cloudprovider"
"k8s.io/kubernetes/pkg/cloudprovider/providers/aws"
"k8s.io/kubernetes/pkg/cloudprovider/providers/gce"
Expand Down Expand Up @@ -159,7 +160,10 @@ func (l *persistentVolumeLabel) findGCEPDLabels(volume *api.PersistentVolume) (m
return nil, fmt.Errorf("unable to build GCE cloud provider for PD")
}

labels, err := provider.GetAutoLabelsForPD(volume.Spec.GCEPersistentDisk.PDName)
// If the zone is already labeled, honor the hint
zone := volume.Labels[unversioned.LabelZoneFailureDomain]

labels, err := provider.GetAutoLabelsForPD(volume.Spec.GCEPersistentDisk.PDName, zone)
if err != nil {
return nil, err
}
Expand Down