SRE-vocabulary

A vocabulary collection for SREs (mostly influenced by Google SREs)

Dictionary

Site Reliability Engineering

"Fundamentally, it’s what happens when you ask a software engineer to design an operations function" -- Ben Treynor, VP of Engineering @ Google[1].

Uptime

Availability %	Downtime per year	Downtime per month	Downtime per week
90%	36.5 days	72 hours	16.8 hours
95%	18.25 days	36 hours	8.4 hours
98%	7.30 days	14.4 hours	3.36 hours
99%	3.65 days	7.20 hours	1.68 hours
99.5%	1.83 days	3.60 hours	50.4 minutes
99.8%	17.52 hours	86.4 minutes	20.16 minutes
99.9%	8.76 hours	43.2 minutes	10.1 minutes
99.95%	4.38 hours	21.6 minutes	5.04 minutes
99.99%	52.6 minutes	4.32 minutes	1.01 minutes
99.999%	5.26 minutes	25.9 seconds	6.05 seconds
99.9999%	31.5 seconds	2.59 seconds	0.605 seconds

Downtime per month is calculated for a 30-day month[2]; downtime per year assumes a 365-day year.
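The figures in the table can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 365-day year, a 30-day month, and a 7-day week; the function name `allowed_downtime` and the `PERIODS` mapping are illustrative and not part of any library.

```python
from datetime import timedelta

# Period lengths assumed for this sketch (they match the table above).
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent):
    """Return the maximum downtime per period for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Fraction of each period during which the service may be down.
    unavailable_fraction = (100 - availability_percent) / 100
    return {name: length * unavailable_fraction for name, length in PERIODS.items()}


budget = allowed_downtime(99.9)
for period, downtime in budget.items():
    print(f"99.9% allows about {downtime.total_seconds() / 60:.1f} minutes per {period}")
```

Using `timedelta` keeps the units explicit, so the same result can be printed in seconds, minutes, or hours without extra conversion constants.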

Error Budget

Four Golden Signals

Saturation

Latency

Errors

Traffic

Monitoring

Alerts

Tickets

Logs

MTTR

Mean Time To Recover (also written Mean Time To Recovery). MTTR is the average time that a device or system takes to recover from a failure[4].
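As an illustration of the definition, MTTR can be estimated by averaging the recovery time over past incidents. This is a minimal sketch: the incident log below is made-up sample data, and `mean_time_to_recover` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure started, service recovered).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 2, 11, 22, 15), datetime(2024, 2, 11, 23, 45)),
    (datetime(2024, 3, 7, 4, 0), datetime(2024, 3, 7, 5, 0)),
]


def mean_time_to_recover(incidents):
    """Average the recovery duration over all recorded incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((recovered - failed for failed, recovered in incidents), timedelta())
    return total / len(incidents)


mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr}")
```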

MTBF

Mean Time Between Failures. MTBF is the predicted elapsed time between inherent failures of a mechanical or electronic system, during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems, while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system[5].
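The arithmetic mean mentioned above can be sketched as follows. The example assumes that "time between failures" means the uptime from one recovery to the next failure; the outage log is made-up sample data and `mean_time_between_failures` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical outage log, in hours since the start of observation:
# each pair is (time the system went down, time it came back up).
outages = [(100.0, 102.0), (350.0, 351.0), (700.0, 704.0)]


def mean_time_between_failures(outages):
    """Average the uptime between each recovery and the following failure."""
    if len(outages) < 2:
        raise ValueError("at least two outages are required")
    ordered = sorted(outages)
    uptimes = [
        next_down - previous_up
        for (_, previous_up), (next_down, _) in zip(ordered, ordered[1:])
    ]
    return mean(uptimes)


mtbf = mean_time_between_failures(outages)
print(f"MTBF: {mtbf} hours")
```

Some teams measure from failure start to failure start instead, which includes the repair time; which convention applies should be stated alongside any reported MTBF.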

MTTF

Mean Time To Failure. MTTF denotes the expected time to failure for a non-repairable system[5].

SLA

Service Level Agreement. An SLA is a (legal) agreement that carries repercussions for failing to meet it[3].

SLI

Service Level Indicator. An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided[3].

SLO

Service Level Objective. An SLO is a target value or range of values for a service level that is measured by an SLI[3].
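The relationship between an SLI and an SLO can be shown with a small example. This sketch assumes a request-based availability SLI (successful requests divided by total requests) and an SLO expressed as a minimum percentage; the request counts are made-up, and `availability_sli` and `meets_slo` are hypothetical helper names.

```python
def availability_sli(good_requests, total_requests):
    """Return the share of successful requests as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * good_requests / total_requests


def meets_slo(sli_percent, slo_target_percent):
    """An SLO is met when the measured SLI reaches the target."""
    return sli_percent >= slo_target_percent


# Hypothetical measurements for one reporting window.
sli = availability_sli(good_requests=999_532, total_requests=1_000_000)
print(f"SLI: {sli:.4f}%")
print("Meets a 99.9% SLO:", meets_slo(sli, 99.9))
print("Meets a 99.99% SLO:", meets_slo(sli, 99.99))
```

The same measured SLI can satisfy one objective and miss a stricter one, which is why the target has to be chosen and written down separately from the measurement.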

Sources

[1] Google SRE Interview, Niall Murphy and Ben Treynor, "What is 'Site Reliability Engineering'?", 2018-09-26, https://landing.google.com/sre/interview/ben-treynor.html

[2] https://interworks.com/blog/rclapp/2010/05/06/what-does-availabilityuptime-mean-real-world/

[3] Google Cloud Next 2018: Nori and Dan, "Best Practices from Google SRE", 2018-07-26, https://www.youtube.com/watch?v=XPtoEjqJexs

[4] https://en.wikipedia.org/wiki/Mean_time_to_recovery

[5] https://en.wikipedia.org/wiki/Mean_time_between_failures
