
Support spotFleet for instances groups #1784

Closed
ese opened this issue Feb 4, 2017 · 28 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Milestone

Comments

@ese (Contributor) commented Feb 4, 2017

Spot fleets allow using a wide range of instance types to provide a certain amount of resources by requesting spot instances. This is great for providing capacity for workloads that require large amounts of resources, don't require real-time interaction, and are able to recover if a node disappears.
kubernetes/kubernetes#24472
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html
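As a rough illustration of the request model described above (a sketch, not part of this issue: the role ARN, AMI, and subnet IDs below are placeholder values), a spot fleet request with the AWS CLI looks roughly like:

```shell
# Minimal sketch of a diversified spot fleet request via the AWS CLI.
# All identifiers below (role ARN, AMI, subnet) are placeholders.
cat > spot-fleet-config.json <<'EOF'
{
  "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
  "TargetCapacity": 3,
  "AllocationStrategy": "diversified",
  "LaunchSpecifications": [
    {"ImageId": "ami-00000000", "InstanceType": "m4.large",  "SubnetId": "subnet-00000000"},
    {"ImageId": "ami-00000000", "InstanceType": "c4.xlarge", "SubnetId": "subnet-00000000"}
  ]
}
EOF
# Sanity-check the JSON before submitting the request.
python3 -m json.tool spot-fleet-config.json > /dev/null && echo "config OK"
# aws ec2 request-spot-fleet --spot-fleet-request-config file://spot-fleet-config.json
```

With `AllocationStrategy: diversified`, the fleet spreads the target capacity across the listed instance types rather than always picking the cheapest one.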

@ese ese changed the title Support spot fleet fot instances groups Support spotFleet fot instances groups Feb 4, 2017
@ese ese changed the title Support spotFleet fot instances groups Support spotFleet for instances groups Feb 4, 2017
@justinsb justinsb added this to the 1.5.1 milestone Feb 5, 2017
@justinsb (Member) commented Feb 5, 2017

I agree this would be a great feature. There were some limitations around tagging of instances previously which made this non-trivial. An interesting option (IMO) is to have the autoscaler be "spot fleet" aware, either by directly using spot fleet or by reimplementing it.

The advantage of reimplementing spot fleet is that the autoscaler might better know what resources are required, for example whether we need CPU, memory, or GPUs. Also, it would be cloud-agnostic.

@jpcope commented May 11, 2017

The user mumoshu in the public kubernetes-incubator/kube-aws project has created an implementation that works around the tagging limitations: kubernetes-retired/kube-aws#112

As an individual interested in running kubernetes in a production context, the target-capacity limitation of CloudFormation with the aforementioned implementation, documented at https://github.com/kubernetes-incubator/kube-aws/blob/master/Documentation/kubernetes-on-aws-node-pool.md#known-limitations, is troubling.

> Running kube-aws update to increase or decrease targetCapacity of a spot fleet results in a complete replacement of the Spot Fleet, hence some downtime. This is due to how CloudFormation works for updating a Spot Fleet.

I do believe that a re-implementation of spot fleet would be beneficial, especially if it means I can mix node lifecycle types and define pods that are mission-critical (never use spot) vs. highly available (always have at least X pods on non-spot nodes; scale out using spot) vs. interruption-tolerant (only use spot).

In a production context, zone coverage of application pods is also important. Ideally this should be adjustable so you can get the biggest bang for your buck if you are not running in production.

Also, draining pods when a termination is signaled (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices) would make interruptions less visible to end users of services running in the cluster.
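The drain-on-termination idea can be sketched as a small script run on each node. This is a sketch under assumptions: the metadata endpoint is the one from the AWS doc linked above, while the node-name convention and kubectl access are hypothetical and cluster-specific.

```shell
# Sketch: poll the EC2 instance metadata for a spot termination notice and
# drain the node when one appears. Functions only; the loop at the bottom
# is what a systemd unit or background process on the node would run.
TERMINATION_URL="http://169.254.169.254/latest/meta-data/spot/termination-time"

check_termination_notice() {
  # Returns 0 (success) once a termination time has been published.
  curl -sf -m 2 "${1:-$TERMINATION_URL}" > /dev/null
}

drain_self() {
  # Hypothetical node-name convention; adjust for your cluster.
  local node="$(hostname -s).ec2.internal"
  kubectl drain "$node" --ignore-daemonsets --force
}

# Example loop (commented out so sourcing this file has no side effects):
# while ! check_termination_notice; do sleep 5; done
# drain_self
```

AWS publishes the termination time roughly two minutes before the instance is reclaimed, so the poll interval needs to leave enough headroom for the drain to finish.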

@bcorijn (Contributor) commented Aug 2, 2017

As mentioned in kubernetes/kubernetes#24472, AWS now supports propagating tags on Spot Fleets!

@Globegitter (Contributor)

Would this then also allow supporting https://github.com/cristim/autospotting? Or would it just be a matter of that tool setting the right tag on each instance it attaches to the autoscaling group?

@ajohnstone (Contributor)

@justinsb (Member)

So spot fleet actually works with kubernetes, now that we have tagging support. Changing a spot fleet configuration though is different from how ASGs work, so we need another approach.

I'm hoping we can look at this in the 1.10 timeframe, probably using the machines API.

@justinsb justinsb modified the milestones: 1.5.2, 1.9 Nov 18, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2018
@cdenneen commented Feb 17, 2018

/remove-lifecycle stale

@justinsb are there any updated docs on how to configure kops to do this?

@chrislovecnm (Contributor)

@cdenneen this needs to be implemented; we only support spot instances via the ASG API calls.

/lifecycle frozen
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 17, 2018
@chrislovecnm chrislovecnm added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. labels Feb 17, 2018
@chrislovecnm (Contributor)

I am marking this with the help wanted label since it is a bit of a project: not a great place to start, but a great feature to add.

@itskingori (Member)

@justinsb I have two comments about #1784 (comment) ...

> So spot fleet actually works with kubernetes, now that we have tagging support. Changing a spot fleet configuration though is different from how ASGs work, so we need another approach.

  1. We now have EC2 Fleets which supersede Spot Fleets. I'm not sure if we should keep this open and maybe change the title ... or close it and open another about EC2 Fleets.
  2. At the bottom of that post, they mention that they are planning on connecting EC2 Fleet and EC2 Auto Scaling groups ... IMO, this would significantly affect (simplify) the design/implementation.

@chrislovecnm And on #1784 (comment) ...

> I am marking with help wanted label since this is a bit of a project, not a great place to start, but a great feature to add.

I'd posted on Slack that I'm looking to contribute by tackling this, but based on my research I'm convinced that the solution is not trivial, a bit premature to tackle, and probably best addressed at the autoscaler level. And if AWS indeed connects EC2 Fleets with ASGs, then even the autoscaler won't need to reinvent the wheel. I guess the question is when! 😅


FYI, there are two issues on autoscaler about fleets. There's kubernetes/autoscaler#519 requesting support for Spot Fleets and kubernetes/autoscaler#838 requesting support for EC2 Fleets.

@justinsb justinsb modified the milestones: 1.9.0, 1.10 May 26, 2018
@cjbottaro

Is there any documentation on how to manually make your own instance groups?

In EKS, I made my own spot fleet, but that is because the userdata for nodes joining an EKS cluster is pretty simple and I could easily change it up for my needs.

The userdata for nodes launched with kops is a bit more involved... :/

@cdenneen commented Aug 8, 2018

@cjbottaro how did you make your own spot fleet for EKS? I'm interested in how you implemented this and the benefits you see in doing so. Thanks

@cjbottaro

> the benefits

We currently run in ECS and we have two ASGs (using launch configs) populating our cluster: one for on-demand instances and one for spot. More than a few times, we've seen spot prices spike and thus wipe out 3/4 of our cluster, causing an outage before we realize what's going on and up our on-demand instances to take over.

Spot fleets (which require launch templates instead of launch configs) let you specify multiple machine types. So if prices spike for one type and those instances get wiped out, the fleet will automatically fulfill capacity with other machine types you can afford. Furthermore, you can tell the spot fleet to diversify across machine types and availability zones, so there is less chance of mass extinction.
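The multi-machine-type diversification described above can be expressed with launch template overrides in the fleet request config. A sketch with placeholder names and IDs (the launch template name, role ARN, and subnets below are not taken from this thread):

```shell
# Sketch: a spot fleet request that diversifies one launch template across
# several instance types and availability zones (one subnet per AZ).
# All names and IDs below are placeholders.
cat > diversified-fleet.json <<'EOF'
{
  "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
  "TargetCapacity": 6,
  "AllocationStrategy": "diversified",
  "LaunchTemplateConfigs": [
    {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "k8s-node-template",
        "Version": "1"
      },
      "Overrides": [
        {"InstanceType": "m4.large",  "SubnetId": "subnet-aaaa0000"},
        {"InstanceType": "m5.large",  "SubnetId": "subnet-bbbb0000"},
        {"InstanceType": "c5.xlarge", "SubnetId": "subnet-cccc0000"}
      ]
    }
  ]
}
EOF
# Sanity-check the JSON before submitting the request.
python3 -m json.tool diversified-fleet.json > /dev/null && echo "config OK"
# aws ec2 request-spot-fleet --spot-fleet-request-config file://diversified-fleet.json
```

Each override is a candidate pool; with the diversified strategy the fleet spreads capacity across pools, so losing one instance type or zone only takes out a fraction of the nodes.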

How in EKS

I followed the tutorial to get an EKS cluster up and running, which includes creating a launch config and ASG for nodes. I simply converted the launch config to a launch template (by hand, in the AWS web console), copy/pasted the userdata which configures and starts kubelet, and then made a spot fleet using that launch template. The userdata for EKS nodes is pretty small and easy to understand. To add taints and labels, you can simply append something like this to the end of the userdata:

```shell
# Wait until the node has registered, then label and taint it.
NODE_NAME="$(hostname -s).ec2.internal"
export KUBECONFIG=/var/lib/kubelet/kubeconfig
until kubectl get node "$NODE_NAME"; do sleep 1; done
kubectl label node "$NODE_NAME" lifecycle=od
kubectl taint node "$NODE_NAME" dedicated=datastores:NoExecute
```

Issue with kops

Kops has these first-class instancegroup objects, and it seems like the userdata for nodes is the output of a template plus the instancegroup manifest. The problem is that a lot of values from the manifest end up in the userdata... too many for a simple copy/paste of the resulting userdata not to be error-prone.

It would be nice if there was a command to output the user data for a given instancegroup:

kops export userdata my_instance_group

So we can create our own launch configs / launch templates.

Thanks!

@rbtcollins (Contributor)

I think this also needs some proper k8s core integration - I've filed kubernetes/kubernetes#70342 about that.

@gambol99 (Contributor)

related #6277

@disha1104

@gambol99, I checked the PR and it looks great.
Does it mean we'll be able to use launch templates and smart ASGs with kops, as kops currently only supports launch configurations?

@gambol99 (Contributor)

@disha1104 ...

> Does it mean, we'll be able to use launch templates and smart ASG's with Kops?

yep :-) ..

@disha1104

That's really great.
Is this planned for the next release? By when can I expect this feature?
Also, is the launch-configuration-to-launch-template conversion handled? And what about spot-termination handling, in terms of the reliability of the nodes?

@oded-dd commented Feb 11, 2019

This is amazing. I was thinking of using Spotinst, but once this is deployed I would rather use it.

When is it planned to be merged?

@gambol99 (Contributor)

@disha1104

> Is this planned for next-release, like by when can I expect this feature?

So I don't cut the releases ... you'd need to speak to @justinsb

> Also, is the launch-configuration conversion to launch-template handled?

The launch-configuration -> launch-template conversion is done automatically (though admittedly it leaves the LC hanging, so it just needs to be deleted manually post-update).

> And also the Spot-termination handler thing, like in terms of reliability of the nodes?

So this doesn't add a spot-termination-handler addon, as I'd rather let users choose how they want it done.

@disha1104

@justinsb which release is this planned for?

@wanghanlin (Contributor) commented Feb 25, 2019

#6277 is merged, I think we can close this one 🎉

@k8s-ci-robot (Contributor)

@wanghanlin: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

> #6277 is merged, I think we can close this one 🎉
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justinsb justinsb modified the milestones: 1.10, 1.12 Mar 14, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 12, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
