[WIP][DO NOT MERGE] Proposal: Auto-scaling #546

Closed. Wants to merge 8 commits.
Changes from 1 commit.
auto scaling proposal
pweil- committed Dec 8, 2014
commit 035e0351a838a6ec6dcd65cc1977c85fca98ae5f
185 changes: 185 additions & 0 deletions docs/autoscaling.md
@@ -0,0 +1,185 @@
## Auto-Scaling

Contributor: This proposal needs to be generalized to be Kubernetes specific, not OpenShift specific.


Auto-scaling is a feature that allows users to increase capacity as needed by controlling the number of replicas deployed
within the system based on request statistics. Auto-scaling is bidirectional: it will scale up to a specified limit as
requests increase and scale down to minimal capacity as requests decrease.
Contributor: Need to spend more time talking about what the goals of this proposal are and reference use cases.


### The Auto-Scaling Architecture

* `ReplicationController` - the building block of auto-scaling. Pods are deployed and scaled by a `ReplicationController`.
* router - a request control point. The routing layer handles outside-to-inside traffic and maintains statistics for the scaler.
* kube proxy - a request control point. The proxy handles internal traffic and maintains statistics for the scaler.
* data aggregator - gathers data from the control points.
* auto-scaler - scales replicas up and down by increasing or decreasing the replica count of the `ReplicationController` according
to scaling rules.

### Auto-Scaling Data

Data used for auto scaling is based upon routes and the requests for those routes. The auto-scaler cares about the following
data points:

1. network statistics (Requests Per Second, Sessions Per Second, Bytes In, Bytes Out, CPU Usage, Avg Request Duration)

Contributor: There will be other statistics types potentially - anticipated load, historical data, etc. Reference those as other types of input.

1. how the route relates to replication controllers (and by definition, the pods they control)

Contributor: Not just routes, services.

1. whether or not the replication controller is auto-scalable and the thresholds set
1. auto-scaling thresholds for requests, both up and down.

Example: A `ReplicationController` is configured to auto-scale up when it reaches a threshold of 50 requests per second that is sustained
for at least 30 seconds.

#### Data Retention

- we care about real-time data and data patterns over time
- we only care about the route and summary statistics for that route. More granular data used to calculate statistics
is not important to retain and should remain available in the router if needed
- statistics must be maintained over time to determine whether sustained request thresholds have been broken


### Data Aggregator

Data aggregation is a function of two components: the routers and the data aggregator itself.

Routers will implement a function that will serve stats in a specified format. This format will be the same for all implementations and allows

Contributor: Recommend that you don't try to do this - instead, allow adapters to gather and record stats into a format you need.

the data aggregator to deal with each implementation in the same manner. OpenShift will provide a base image for each
supported implementation (HAProxy, NGINX, and Apache) that will provide the server install binary as well as a default
implementation of stats gathering that works with the functions defined below.

The data aggregator is responsible for periodically pinging each known router's function to gather the statistics for a
time period (last request until now). Then, if configured, the data aggregator may make a prediction about the expected
traffic and store the statistics information so that the auto-scaler can respond by scaling applications up or down.

The data aggregator will be implemented as a controller and be given a registry to store statistics.

Contributor: If it has a registry to store statistics, other services can store those stats as well.


    type RouterStatsController struct {
        // Method to incrementally gather statistics in a loop. Delegates to the router implementation, which may
        // be socket based, HTTP based, etc.
        GatherStats func()
        // Injectable list of routers to deal with. Injectable for testability.
        RouterList []router.Router
        // Algorithm for generating the next requests-per-second prediction.
        GeneratePrediction func() float64
    }
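
A minimal sketch of how a single gathering pass might look, assuming the `RouterStatistics` type below and two hypothetical stand-ins (`statsSource` for the router stats hook and `statsRegistry` for the registry the controller is given). This is illustrative only, not part of the proposed API:

    // statsSource is a hypothetical stand-in for whatever hook a router exposes for stats gathering.
    type statsSource interface {
        Stats() []RouterStatistics
    }

    // statsRegistry is a hypothetical stand-in for the registry the controller stores statistics in.
    type statsRegistry interface {
        Record(stat RouterStatistics) error
    }

    // gatherOnce performs one pass over the known routers and records their per-interval
    // statistics; GatherStats would call something like this in a loop on a timer.
    func gatherOnce(routers []statsSource, registry statsRegistry) {
        for _, r := range routers {
            for _, stat := range r.Stats() {
                if err := registry.Record(stat); err != nil {
                    // a real controller would log and retry here
                    continue
                }
            }
        }
    }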

In order to allow the stats aggregator to delegate the details of statistics gathering, each router must provide a way to access
its statistics. First, the routers must be configured as proposed in the [Router Sharding Proposal](https://github.com/openshift/origin/pull/506).
Each router will implement a function that serves to gather statistics. The underlying implementation may vary, from
embedded endpoints that serve data to centralized storage locations that the function simply reads from (while the router
writes the data).

    type RouterStatistics struct {
        // .... important info noted below, some already exists in reusable structures ....
        // application namespace
        Namespace string
        // name of the app/repl controller/id used to uniquely identify what is being monitored in the namespace
        AppName string
        // route being monitored
        RouteName string
        // interval start date/time
        StartTime time.Time
        // interval stop date/time
        StopTime time.Time
        // actual values used for scaling, populated from real-time data
        Statistics Stats
        // predicted values used for scaling
        Predictions Stats
    }

    type Stats struct {
        RequestsPerSecond  int
        SessionsPerSecond  int
        BytesIn            int
        BytesOut           int
        CPUUsage           int
        AvgRequestDuration int
    }

    type Router interface {
        // .... fields omitted ....
        // pluggable implementation for retrieving stats from a router that can be called from the aggregator.
        // This implementation may be reading from a centralized store or pinging a router directly through an exposed
        // endpoint periodically. It is up to each router implementation (or the base implementation) to provide the
        // mechanism for providing stats.
        Stats func() []RouterStatistics

Contributor: That's not really normal for an interface - returning a function.

Contributor (author): I need to update it. The intention was that Stats was a pluggable function returning the statistics slice, not that it was returning a function. Will fix.

    }

#### Predictive Analysis

The stats aggregator may provide predictive analysis based on data trends. When scaling, the auto-scaler may use both
actual and predicted data to scale an application up. `GeneratePrediction` is a pluggable algorithm that will have two
initial implementations:

* No-op: make no predictive analysis, use actual values only, return `nil`
* Theil-Sen: make a prediction based on the [Theil-Sen estimator](http://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator) (see the sketch below)
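
A minimal sketch of the Theil-Sen idea (median of pairwise slopes, then a median intercept) applied to recent per-interval samples; the function and parameter names here are illustrative, not part of the proposed API:

    import "sort"

    // theilSenPredict fits a robust line through the (times[i], values[i]) samples and
    // returns the predicted value at nextTime, e.g. the next requests-per-second estimate.
    func theilSenPredict(times, values []float64, nextTime float64) float64 {
        var slopes []float64
        for i := 0; i < len(times); i++ {
            for j := i + 1; j < len(times); j++ {
                if times[j] != times[i] {
                    slopes = append(slopes, (values[j]-values[i])/(times[j]-times[i]))
                }
            }
        }
        if len(slopes) == 0 {
            return 0
        }
        sort.Float64s(slopes)
        slope := slopes[len(slopes)/2] // median of the pairwise slopes

        // intercept: median of the residuals values[i] - slope*times[i]
        residuals := make([]float64, len(times))
        for i := range times {
            residuals[i] = values[i] - slope*times[i]
        }
        sort.Float64s(residuals)
        intercept := residuals[len(residuals)/2]

        return slope*nextTime + intercept
    }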

### Auto-Scaler

The Auto-Scaler is a state reconciler responsible for checking predictions and real data against configured scaling thresholds
and manipulating the `ReplicationControllers` to change the number of replicas an application is running. The scaler will
use a client/cache implementation to receive watch data from the `RouterStatsController` and respond to them by

Contributor: It's receiving data from the REST objects underlying router stats, not the controller directly. Should clarify.

scaling the application.

    type AutoScaler struct {
        // Get the next statistic to check against thresholds.
        NextStat func() RouterStatistics

Contributor: What's with the returning functions?

Contributor (author): Same as above; this was indicating that NextStat is a function that returns RouterStatistics, which is how client/cache is implemented.

        // Adjust an application's repl controller replica count.
        ScaleApplication(appInfo, numReplControllers) error
    }
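
A minimal sketch of the threshold check the scaler might perform for a single statistic, using the configuration types defined in the next section. `observedValue` and `desiredReplicas` are illustrative helpers, not part of the proposed API, and the sustained-duration check is omitted for brevity:

    // observedValue picks the measured value that corresponds to a threshold's type.
    // Only two of the AutoScaleType values are handled here for brevity.
    func observedValue(s Stats, t AutoScaleType) int {
        switch t {
        case "RequestsPerSecond":
            return s.RequestsPerSecond
        case "SessionsPerSecond":
            return s.SessionsPerSecond
        default:
            return 0
        }
    }

    // desiredReplicas returns the replica count the scaler would move to for one
    // observed statistic, clamped to the configured min/max. A real implementation
    // would also require the threshold to be exceeded for threshold.Duration.
    func desiredReplicas(current int, stat RouterStatistics, cfg AutoScaleConfig) int {
        desired := current
        for _, threshold := range cfg.AutoScaleThresholds {
            if observedValue(stat.Statistics, threshold.Type) > threshold.Value {
                desired = current + 1
            }
        }
        if cfg.MaxAutoScaleCount > 0 && desired > cfg.MaxAutoScaleCount {
            desired = cfg.MaxAutoScaleCount
        }
        if desired < cfg.MinAutoScaleCount {
            desired = cfg.MinAutoScaleCount
        }
        return desired
    }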

### Auto-Scaling Configuration

Auto-scaling config parameters will be added to the `ReplicationController`

Contributor: Upstream has already stated this won't happen - this config needs to be part of either annotations, or in a separate resource you create "AutoScaleController". That's part of the reason why this proposal should be targeted at upstream, because they are likely to have very specific opinions about some of these constructs.


    type AutoScaleConfig struct {
        // turn auto scaling on or off
        Enabled bool
        // max replicas that the auto-scaler can use, empty is unlimited
        MaxAutoScaleCount int
        // min replicas that the auto-scaler can use, empty == 0 (idle)
        // an idle configuration overrides the ReplicationControllerState.Replicas setting
        MinAutoScaleCount int
        // Thresholds
        AutoScaleThresholds []AutoScaleThreshold
    }

    // RequestsPerSecond, SessionsPerSecond, BytesIn, BytesOut, CPUUsage, AvgRequestDuration
    type AutoScaleType string

    // holds the settings for when to scale an application
    // example: scale when RequestsPerSecond (Type) are above 15 (Value) for 10 minutes (Duration)
    type AutoScaleThreshold struct {
        // scale based on this threshold
        Type AutoScaleType
        // after this duration
        Duration time.Duration
        // when this value is passed
        Value int
    }

    type ReplicationControllerState struct {
        AutoScaleConfig
    }
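
For illustration, the threshold example from the Auto-Scaling Data section could be expressed with these types roughly as follows (the values and the use of `time.Second` are illustrative only):

    // scale between 1 and 10 replicas; add capacity when requests per second
    // stay above 50 for at least 30 seconds
    var config = AutoScaleConfig{
        Enabled:           true,
        MinAutoScaleCount: 1,
        MaxAutoScaleCount: 10,
        AutoScaleThresholds: []AutoScaleThreshold{
            {Type: "RequestsPerSecond", Duration: 30 * time.Second, Value: 50},
        },
    }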

### Implementations

#### HAProxy Implementation
TODO
http://cbonte.github.io/haproxy-dconv/configuration-1.4.html#9 (socat on a socket can be used to pull stats)

[example from Rajat using socat](https://github.com/rajatchopra/geard-router-haproxy/blob/rc/autoscaler.rb)
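
As a rough sketch of the socket-based approach (not part of this proposal; the socket path is an assumption and requires a `stats socket` entry in haproxy.cfg), the stats CSV could be pulled over HAProxy's admin socket directly from Go instead of shelling out to socat:

    import (
        "io"
        "net"
    )

    // haproxyShowStat connects to HAProxy's admin socket, issues "show stat",
    // and returns the raw CSV that HAProxy replies with.
    func haproxyShowStat(socketPath string) (string, error) {
        conn, err := net.Dial("unix", socketPath)
        if err != nil {
            return "", err
        }
        defer conn.Close()

        if _, err := conn.Write([]byte("show stat\n")); err != nil {
            return "", err
        }
        out, err := io.ReadAll(conn) // HAProxy closes the connection after responding
        if err != nil {
            return "", err
        }
        return string(out), nil
    }

Usage would be something like `csv, err := haproxyShowStat("/var/run/haproxy.sock")`, where the socket path depends on the HAProxy configuration.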

#### NGINX Implementation
TODO

#### Apache Implementation
TODO
mod_status http://httpd.apache.org/docs/2.2/mod/mod_status.html, http://www.tecmint.com/monitor-apache-web-server-load-and-page-statistics/, http://freecode.com/projects/apachetop/, http://www.opennms.org/wiki/Monitoring_Apache_with_the_HTTP_collector

#### Kube Proxy Implementation

TODO (will need to make a stats implementation for the proxy and register it as a piece of infrastructure not managed by a
replication controller - i.e. do not try to manage these routers)

The intention is to have the OpenShift router be able to run in a mode that will allow it to replace the Kube Proxy in

Contributor: Why? I think we'll continue using the Kube proxy, and it will eventually also do layer 3 sorts of things.

Contributor (author): The line about replacing the proxy with a router was pulled from another PR that I can't seem to find so I guess we can scratch it. That said, it seems to make sense that if we end up moving routers upstream that we could use a base impl for the proxy to allow us to gather stats.

Contributor: Gathering qps and other stats may take many forms. We'll use Kube proxy as the inter-service connection, and it may have optional features that we add at a later point.

the future. This proposal assumes that if we want to gather statistics from the Kube Proxy prior to the OpenShift router
implementation, we will be able to modify it to expose statistics like any other router. This proposal also assumes
that we will be able to automatically register the Kube Proxy as a non-controlled (not managed by Kubernetes, part of the infrastructure)
router and have the stats-gathering mechanism ping its endpoint like any other implementation.