Building a culture of system reliability
Even the briefest system outages can have outsized consequences for businesses. Learn best practices from industry leaders who manage reliability at scale, and go inside the culture and systems at Stripe that maintain reliability while processing 1% of global GDP.
Speakers
Rahul Patil, Deputy CTO, Stripe
Farhan Thawar, VP, Head of Engineering, Shopify
RAHUL PATIL: Reliable systems are not noticed until they fail. For instance, when you call the fire department, you expect to be able to reach them, and you expect them to be able to show up. But this wasn't always the case.
Almost 100 years ago on the streets of London, a house fire erupted. Neighbors saw the blaze and called the fire department. Unfortunately, the telephone system was down that hour, which meant the neighbors could not reach the fire department despite repeated attempts. By the time the fire department responded, tragically, five people had died. The tragedy could have been prevented if only the telephone system had worked that day.
What did we learn from the tragedy? That reliability is not a property of the best system on its best day or even on its average day. Reliability is a property of the worst system on its worst day when you need it the most under load, under strain. Put more simply, under pressure, you're only as good as your weakest link.
This tragedy gave birth to the world's first emergency hotline in London, which in turn began the reliability campaign for the entire telecommunications stack. To this day, emergency response systems around the world target 99.9995%, or five and a half nines, of availability. Five and a half nines means that in a month we can only tolerate 13 seconds of unavailability.
That is 13 seconds out of 2.6 million seconds. 13 seconds—that is not a lot of room for error. It is common for consumer applications to target three and a half nines, which is roughly 22 minutes of downtime a month. At Stripe, we don't think that's good enough. We look to emergency response systems as the standard and to meet that standard we invest proactively.
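(As a back-of-the-envelope check on those figures, here is a small illustrative sketch, assuming a 30-day month, that converts an availability target into a monthly downtime budget. It is not from the talk.)

```python
# Downtime budget for a 30-day month (assumption: 30 * 24 * 60 * 60 seconds).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # ~2.6 million seconds

def downtime_budget_seconds(availability: float) -> float:
    """Seconds of tolerable downtime per month at the given availability."""
    return SECONDS_PER_MONTH * (1 - availability)

print(f"3.5 nines (99.95%):   {downtime_budget_seconds(0.9995) / 60:.0f} minutes")
print(f"5.5 nines (99.9995%): {downtime_budget_seconds(0.999995):.0f} seconds")
```

Running it reproduces the numbers quoted here: roughly 22 minutes a month at three and a half nines, and about 13 seconds at five and a half nines.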
We believe it's an imperative to be leading the industry in reliability. During our last Black Friday Cyber Monday, Stripe processed $18.6 billion in transactions. We had over 30,000 Stripe users achieve their highest and best sales day ever, and volume peaked at 93,000 transactions per minute. Even with this load, we maintained our five and a half nines target.
My name is Rahul Patil. I'm the deputy CTO here. I build tier zero systems at Stripe and I'm responsible for the end-to-end reliability, security, performance, global operations, and financial rigor. A lot of my friends like to call me the “chief worst day officer.” So, I get asked often, how do we build such a highly available system?
For me, the real question is not how but why? 40%. Research shows that 40% of customers whose payment is declined will abandon that business entirely. Imagine how much money is left on the table.
Our mission at Stripe is to grow the GDP of the internet and to deliver the infrastructure that powers commerce around the world. More than just APIs, we orchestrate interactions that drive value exchange in real time. And when it comes to you, our users, every transaction matters and it matters every second.
The impact of unavailability goes beyond just money. It is the long-lasting reputational impact that is at stake. The news is littered with examples of companies that lost customers forever because of outages. At Stripe, our product is integrated into your applications. This means our faults will result in negative press for you, our users. We don't want your customers to walk away forever. That's why being the most reliable component in your business is so critical to us. We treat reliability with the same seriousness as the emergency response systems. So back to the original question, what's the secret sauce?
Yes, we do all the standard stuff. There are tons of talks and papers on how to build highly available systems through excellent network design, data design, compute resiliency, cell-based architecture, and so on. We fully adopt the best technologies and design patterns in this space. But these alone are not enough.
We need will and perseverance to do this consistently. At its heart, reliability is a mindset and not just a metric. The idea of treating reliability this seriously isn't new. The approach started long before software was eating the world, in systems such as the emergency hotline we discussed. We are emulating standards that literally deal with life and death because reliability is so important to us.
Today I will share with you the top three practices for building a culture of system reliability, and to provide an additional perspective, my friend Farhan Thawar will join me to discuss how Shopify approaches reliability as well. First, practice. Practice your worst days every day. At Stripe, we exhibit a childlike curiosity and push our systems to the breaking point to understand how they fail. Only after we do this do we leverage these systems in production. In other words, I'm skeptical about running a system in production that has not been broken before. So why do we do this?
A key to preventing user impact is accepting that failures are inevitable. So, it's important to design for failure from the ground up. Every system fails. It's not if but how. A system should never fall apart under failure, but instead it should allow you to predictably steer it back to stability. Like a circuit breaker, when a system fails, a system needs to rapidly adapt by isolating failures and by converting total failures to partial outages and automatically recover to minimize impact.
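(To make the circuit-breaker analogy concrete, here is a minimal sketch of the general pattern in Python; it is not Stripe's implementation, and the thresholds are arbitrary. After a few consecutive failures the breaker opens and callers fail fast, and a cooldown later lets one probe call through to test recovery.)

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast while a dependency is unhealthy."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before probing again
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # If the breaker is open, fail fast until the cooldown has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```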
One simple technique to practice our worst days is to apply a ton of pressure and load to our systems. We use massive synthetic traffic to punish our systems at 20x the worst-case scenario.
This enables us to do two things: uncover our weakest link immediately, and identify the symptoms of resource exhaustion before failure. We simulate this and then repeat it in production, load testing our systems further and further, ensuring that we know we will be successful when users need us the most.
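(A hypothetical sketch of that kind of synthetic load test: submit a multiple of the worst observed peak against a target call and watch tail latency for early signs of resource exhaustion. The client call, worker count, and defaults are illustrative, not Stripe's tooling.)

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def hammer(call, worst_case_rps: int, multiplier: int = 20, seconds: int = 10):
    """Submit multiplier x the worst observed peak worth of synthetic requests
    and report p99 latency, a crude early signal of resource exhaustion."""
    total_requests = worst_case_rps * multiplier * seconds
    latencies = []

    def one_request():
        start = time.monotonic()
        call()                      # the system under test (synthetic traffic)
        latencies.append(time.monotonic() - start)

    with ThreadPoolExecutor(max_workers=128) as pool:
        for _ in range(total_requests):
            pool.submit(one_request)
    return statistics.quantiles(latencies, n=100)[98]   # p99 latency

# Hypothetical usage: 93,000 transactions/minute is roughly 1,550 per second.
# p99 = hammer(lambda: client.create_charge(), worst_case_rps=1550)
```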
So, we go beyond just brute-force pressure testing and become a bit mischievous. We go around injecting faults throughout the system just to see how it reacts. We add network latency, we drop network packets, we turn off computers, we pull out disks, we make a service completely unavailable and see what happens. We then build our systems to adapt and gracefully work around these faults, but this still only gets us to four nines of availability.
To reach five and a half nines at our scale, we need to be resilient to multiple concurrent faults. For this, we go from mischief and mischievous mayhem to complete chaos. Using our chaos testing engine, we trigger multiple complex faults in production. Not only does this help reproduce failures, but also it helps discover interesting new ones that we've never seen before. At Stripe, we want to control our own destiny. So, to prevent failures, we practice a healthy amount of pessimism and paranoia.
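(The faults listed above map naturally onto a small fault-injection wrapper. The sketch below is illustrative, not Stripe's chaos engine: it randomly drops calls, raises errors, or adds latency around a dependency so you can watch how the caller degrades.)

```python
import random
import time

class FaultInjector:
    """Wrap a dependency call and randomly inject the faults named above."""

    def __init__(self, call, latency_p=0.05, drop_p=0.02, error_p=0.02, seed=None):
        self.call = call
        self.latency_p, self.drop_p, self.error_p = latency_p, drop_p, error_p
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        r = self.rng.random()
        if r < self.drop_p:
            raise TimeoutError("injected: dropped request")       # packet loss / timeout
        if r < self.drop_p + self.error_p:
            raise ConnectionError("injected: dependency down")    # service unavailable
        if r < self.drop_p + self.error_p + self.latency_p:
            time.sleep(self.rng.uniform(0.2, 2.0))                # added network latency
        return self.call(*args, **kwargs)
```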
Failure is inevitable and we want to be ready to keep failures from impacting our users. By pretending that every day is the worst day, we discover and prevent issues before they ever happen. Our second practice, perhaps my favorite one: never send a human to do a machine's job.
I mentioned earlier that we reach more than five and a half nines of availability. So, my reaction budget to any potential issue is 13 seconds. So, let's think about this. At best, humans are going to react in minutes. So, we cannot rely on humans to react and mitigate within 13 seconds. Machines can react in milliseconds. Thus, only machines can react and prevent the failure before the impact is felt by the users. So, what does it take to have machines react in milliseconds?
We have more than a million CPUs dedicated just to monitoring our systems. Our production systems emit millions of health indicators per second. We use this telemetry to detect degradations and enable machines to react quickly. Measuring so much real-time data allows us to constantly calibrate our intelligence systems: to prevent more, to react faster, to resolve in real time.
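(A toy sketch of the machines-react-first idea, under stated assumptions: health signals stream into a sliding window, an error-rate threshold is checked on every sample, and a mitigation callback, for example shifting traffic away from a failing cell, fires without waiting for a human. The window size, threshold, and mitigation are all hypothetical.)

```python
from collections import deque

class AutoMitigator:
    """Watch a sliding window of health signals and trigger mitigation fast."""

    def __init__(self, mitigate, window=1000, max_error_rate=0.01):
        self.mitigate = mitigate            # e.g. shift traffic to a healthy cell
        self.window = deque(maxlen=window)  # most recent success/failure signals
        self.max_error_rate = max_error_rate
        self.tripped = False

    def record(self, ok: bool):
        self.window.append(ok)
        if self.tripped or len(self.window) < self.window.maxlen:
            return
        error_rate = 1 - sum(self.window) / len(self.window)
        if error_rate > self.max_error_rate:
            self.tripped = True
            self.mitigate(error_rate)       # the machine reacts, no pager needed

# Hypothetical usage, called on every request:
# monitor = AutoMitigator(mitigate=lambda rate: failover_to("cell-b"))
# monitor.record(response_status < 500)
```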
So where do humans come into play? It's the heart. Humans build the machines that provide defense in depth, and we obsess over improving them. So regardless of the user's integration, we monitor their experience as perceived by them. So rather than a singular API response rate, we expand our definition of success to what we call EQ metrics, or experience quality metrics.
We drive a cycle of measuring and improving all product surface areas through a 360-degree view of the user experience. In summary, we assign the responsibility of measurement, detection, and response to the robots, while we humans interpret and design for the end-to-end user experience with the heart.
So, we talked about breaking our systems, and we talked about reacting to failures in seconds. Now, let's talk about the most important thing, building an accountability mindset.
Great ideas are just that—ideas. It requires an act of will and perseverance to see them through. Earlier I shared the story of the UK's emergency hotline. Guess when it was built? 1937. So how long did it take for that system to come to the US? 20 years. More than 20 years later, the US had its first emergency 911 system and then it took another 30 years from there for widespread coverage. So why did it take so long to implement?
I believe it was a lack of single-threaded extreme ownership. Reliability is not owned by a committee. It is a pride that every leader and engineer puts into the systems. Quality requires leadership.
The word accountability is often synonymous with punishment or placing blame. For us, it reflects empowering ownership and agency. Builders are owners of all outcomes and incidents. Developers, managers, and leaders at all levels are pageable when a user needs us the most. Incident reviews are not done by some central SRE team; they're done by the service-owning engineering team. Words like “doggedness,” “will,” and “perseverance” all sound good, but how exactly do we institutionalize them?
I said earlier, builders are owners of all outcomes and incidents, but we audit for that ownership. If an incident requires humans for remediation, we force a deep dive with the teams. We run a Stripe-wide senior executive review and here we randomly call on an executive. We like to call it “spin the wheel,” but instead of a prize, an executive gets to give a detailed overview of all incidents that happen under their watch.
Not surprisingly, they don't like this. But, this ensures preparation by all leaders, not just engineers on the ground, as our executives are expected to speak to all of the incidents, the user impact, the root cause, and the remediations we are taking to prevent it from happening again. Leaders are accountable for the whole outcome as perceived by the user, not just their little slice.
If the root causes have been seen before, we hold these leaders aggressively accountable. Now, we're not all sticks; we do embrace carrots. We actively reward leaders who reduce incidents and improve experiences for our users. Little trick: carrots are actually small sticks, by the way. I'll admit up front that our executive ops review is not the most popular. No one likes being cold-called, but it's a tangible forcing function that has enabled us to instill and, very importantly, scale reliability as a cultural pillar among our leadership. And this ownership keeps solutions from being constrained to the immediate problem; owners instead address the end-to-end experience and the outcome.
So, wrapping this up, I want you to leave with three takeaways. First, practice your worst days every day. Second, never send a human to do a machine's job. Remember, design by humans but protect by machines. And finally, practice extreme ownership. We have to go beyond words here. Implementing processes that drive these pillars to the forefront of your team's prioritization is the only way we can instill change. Yes, it requires actions and accountability that are uncomfortable, but this discomfort breeds culture. And culture is not taught; it's caught.
So now we're going to do a fun Q&A with Farhan Thawar from Shopify about reliability. Shopify, as you all know, is an all-in-one commerce platform to start, run, and grow a business. Millions of businesses in around 175 countries are powered by Shopify. Farhan has been VP and Head of Engineering at Shopify for the last five and a half years. He's engineered Shopify to new heights of ubiquity, scale, and, critically, reliability.
Please welcome Farhan Thawar onto the stage.
RAHUL PATIL: So, Farhan, I love Shopify's scale story, and what a fun story it's been over the last few years. One particularly interesting trend is flash sales. Can you tell us more about flash sales?
FARHAN THAWAR: Yeah, so flash sales are the idea whereby a celebrity or influencer will drop a product and it drives massive traffic to that one merchant for a short amount of time. So, you can imagine a Kim Kardashian or, you know, a Lady Gaga. I mean, recently we did Taylor Swift. They'll drop some merch or drop an album, and all of a sudden we'll send immense amounts of traffic through one channel for a short amount of time. And we have to be able to provide a great experience for both the merchant who's managing the sale and all the buyers.
RAHUL PATIL: And how do you design and how do you prepare for flash sales?
FARHAN THAWAR: Yeah, so sometimes, as you can imagine, celebrities tell us what they're doing, in which case we can prepare; maybe we want to pre-warm some machines and make sure the experience will be good by actually spending time getting their infrastructure ready. But a lot of times they don't tell us, in which case our environment must automatically be ready. Which means we have auto-scaling built in, like you mentioned, never send a human to do a machine's job, so that as demand starts accruing, our systems will scale up over time to meet that demand.
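(A simplified sketch of demand-driven auto-scaling of the kind described here; real platforms lean on their orchestrator's autoscaler, and Shopify's actual setup isn't covered in the talk, so the target-sizing math below is just the generic pattern with made-up numbers.)

```python
import math

def desired_replicas(current_replicas: int, current_rps: float,
                     rps_per_replica: float, headroom: float = 1.5,
                     max_replicas: int = 1000) -> int:
    """Size the fleet for current load plus headroom, ahead of the spike."""
    needed = math.ceil((current_rps * headroom) / rps_per_replica)
    # Simplification: never scale down mid-spike, and cap at a safety limit.
    return min(max(needed, current_replicas), max_replicas)

# Hypothetical numbers: a flash sale ramps a shop from 2,000 to 40,000 rps.
print(desired_replicas(current_replicas=20, current_rps=40_000, rps_per_replica=500))
```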
RAHUL PATIL: Got it. And I hear Thursdays are fun for you.
FARHAN THAWAR: Yeah, so we onboarded a company called Supreme, and every Thursday at 11:00 a.m. they do a sale, like clockwork, with different products. And what's amazing is we end up breaking the record for the largest sale in internet history practically every Thursday. It's because every new sale has more and more buyers, so it ends up breaking records every single week.
RAHUL PATIL: Every Thursday is new internet history.
FARHAN THAWAR: Exactly. Basically.
RAHUL PATIL: Sounds like a fun job. The next question is around measuring reliability from a user perspective. So, we have UI elements and checkout and so on, but they're all orchestrated on top of APIs, so we obsess over API success rate. What do customer-centric metrics look like at Shopify?
FARHAN THAWAR: Yeah, so our customer is the merchant, right? So, the brand that is building the storefront on top of Shopify. And so paradoxically, we actually want the merchant to not be inside of Shopify.
With some products you want more time being spent; we want less time being spent. So, we actually use a metric called number of seconds, or number of years, saved in the Shopify admin. When we reduce latency, or we remove a step from the process, or we enable them to not have to do something anymore, we count it as time saved.
We want our merchants to be spending time building great products and curating their community, and any extra minute we can give them allows them to do that. So, we actively look for a downward-trending chart of how much time they have to spend in our product.
Our merchants' customer, our customer's customer, is the buyer. So, if you buy something from a Shopify store, the buyer is you, and we want you to have a great experience also. And so, for every piece of latency we can remove for the buyer, we remove it; we want to make sure they have a great experience as well. And that means going through the checkout flow, having no errors, and being able to get their product.
RAHUL PATIL: Awesome. And you know, when you follow the user, the consumer who's actually purchasing something, their flows can be complex; they can be multi-minute; their whole journey can be long, and you're orchestrating across multiple systems. And then we go back to, you know, reliability depending on the weakest link. So how do you think about reliability in those scenarios, and what are some common patterns that are helpful there?
FARHAN THAWAR: Yeah, I mean you mentioned earlier like chaos engineering or just trying to unplug systems. So, we have the same thing, right?
If one part of the checkout journey is not going to happen for the user, we have to either fall back to something or put them into a situation where they might come back, right? So, for example, we have something called a checkout queue.
So, if a large number of buyers comes to a storefront and the system is taking time to auto-scale, because, you know, if you're Taylor Swift and you drop a T-shirt, you're going to have millions of users hitting that thing very, very quickly, they might enter a queue, and the queue will give them a great experience, saying, “Hey, you're like 30 seconds away from getting great Taylor merch.”
But at the same time, that has to actually be funneled through. So, the system is scaling up, the consuming of the queue is going faster and faster as our systems scale up, and then the user enters the checkout journey. So it is a journey, there are multiple steps, but like you said, if a piece fails, the whole thing can't fail.
We have to make sure that there's still a good experience to be had. And sometimes, by the way, if you're trying to buy merch from Supreme, it's out of stock and there's nothing we can do. You've missed the sale, and now you have to, hopefully, believe in the brand and the experience you had and come back next time.
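(A minimal sketch of a checkout-queue front door under simple assumptions: buyers beyond current admission capacity get a place in line and an estimated wait, and as capacity scales up the queue drains faster. The class, names, and rates are illustrative, not Shopify's implementation.)

```python
import collections
import itertools

class CheckoutQueue:
    """Admit buyers up to current capacity; queue the rest with an ETA."""

    def __init__(self, admits_per_second: int):
        self.admits_per_second = admits_per_second   # grows as backends scale up
        self.waiting = collections.deque()
        self._ids = itertools.count(1)

    def arrive(self):
        buyer_id = next(self._ids)
        self.waiting.append(buyer_id)
        eta = len(self.waiting) / self.admits_per_second
        return buyer_id, f"You're about {eta:.0f} seconds away from checkout"

    def drain(self):
        """Called once per second: admit as many buyers as capacity allows."""
        admitted = []
        for _ in range(min(self.admits_per_second, len(self.waiting))):
            admitted.append(self.waiting.popleft())
        return admitted

    def scale_up(self, new_rate: int):
        self.admits_per_second = new_rate            # queue now drains faster
```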
RAHUL PATIL: Right. Yeah, especially when orchestrating across lots of complex systems, across partners, across the internet, it's definitely interesting to see what it takes to build a highly reliable system. One of the patterns that we've adopted is fail open, right? If you have a lamp with 10 lights and one of the lights goes down, you don't say, “Okay cool, the whole lamp is busted.” What we really want to know is: what is the minimum quality of experience we can provide despite single faults? I see with the queue analogy that there are lots of options to fail open there.
FARHAN THAWAR: Yeah, and I think you're right, because we have to have these short circuits. So, for example, if you're going to be down 13 seconds, that means we have to add in all of the downtime of all the people we rely upon, and that sum is the best that Shopify can do.
So we can't be at, you know, four nines or five nines; we have to be less, because we rely on other people. And so part of it is that I have high expectations of Stripe to make sure this commerce experience will be great, but that means all of the partners, for, you know, shipping rates and tax calculations, all of those things, must be fast.
Which means we build in short circuits, meaning: I'm going to go to a shipping provider, I need a shipping rate for this ZIP code, and if you don't provide it to me, I'm going to use this short-circuit rate, because I need to get this back in 300 milliseconds. That's an example of failing open.
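(Farhan's shipping-rate example is essentially a deadline plus a fallback value. Below is a minimal sketch of that short circuit; the 300 ms budget comes from his example, while the provider call and the flat fallback rate are hypothetical.)

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=8)

def shipping_rate(fetch_rate, zip_code: str,
                  fallback_rate: float = 9.99, budget_s: float = 0.300) -> float:
    """Ask the provider for a rate, but never wait past the latency budget."""
    future = _pool.submit(fetch_rate, zip_code)
    try:
        return future.result(timeout=budget_s)   # provider answered in time
    except Exception:                            # timed out or provider error
        future.cancel()
        return fallback_rate                     # fail open with a flat rate
```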
RAHUL PATIL: And I think that's why, selfishly, the whole industry shooting for the highest nines of availability is going to be so important, because we all orchestrate across each other.
FARHAN THAWAR: Right, I would say one last thing is that if Stripe fails, they don't blame Stripe; they blame Shopify.
RAHUL PATIL: Yeah, exactly, exactly. This is why we want to be the most reliable component: because the negative press falls on our users, right?
So, our favorite topic in the industry: it's frequently believed that reliability comes at the cost of meeting users' needs quickly. Developers always want to move fast. They're like, “Why are we shooting for these higher nines of availability when I could be shipping more software?” Curious what the internal debates have been like at Shopify.
FARHAN THAWAR: Yeah, so we're lucky in that our CEO is an engineer, and we really focus our engineering culture around shipping on quality. What that means is, if we've got a list of priorities of all the things we want to build, instead of just cutting it short and saying we're going to ship, we will only ship the things that are hitting the quality bar.
Famously, we will actually unship things that our merchants are using and that are providing value if we don't think they're built well enough to be reliable, or built as a foundation we can build on top of. And so, we spend a lot of time making sure the architecture is right. We spend a lot of time ensuring the underlying code is right; you remember Steve Jobs, “the inside of the MacBook must be beautiful.” We are very similar in that the inside of the modules we are building must be beautiful; otherwise we will just unship it, or not ship it at all, because it's not ready.
Famously as well, we try to delete a lot of what we build, which means taking that beginner's mindset of saying, “Well, if we could start over today, how would we build it?” Would it be simpler? Can we rely on that library instead of this one? Can we delete the code and start over? Usually you end up with something more elegant, and so we want to make sure we're using that mindset all the way through engineering. And yes, it can be frustrating if you're somebody who wants to, you know, build things that are not infrastructure. But for us, if we categorize the work, right… whether you're building a feature, or an experiment, or a piece of infrastructure, we want you to think differently about how you spend those timelines. And so, when it's infrastructure, we will spend much more time making sure it's working.
RAHUL PATIL: Yeah, deleting is also shipping in many ways. For me, it comes down to this: every incident not only takes away time, but a user on the other side is feeling pain. Averages can hide that. Even if you hit five and a half nines, that last bit could be one user who's really struggling, hidden in our averages. And so it's very important to surface it.
For us, the debate is around, “Hey, this is an opinionated infrastructure.” Especially when you hire from across the industry, everyone's like, “I want to cargo-cult this type of system from there, that type of system from here.” And we can end up with the worst of all the weaknesses but none of the strengths. So we care deeply about opinionated infrastructure.
One of the principles we share in common is never send a human to do a machine's job, right? It's very powerful as we start to apply that everywhere; it then comes into deployment as well. As an example: even today, during Sessions, there's no code freeze. As developers check in code, it goes through all the automation, gets tested automatically, gets deployed, and staged rollouts roll back automatically, and so on. So when we apply the principles, our tools and infrastructure should give developers that confidence so they can move faster as well.
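(A schematic sketch of that kind of pipeline stage, with placeholder deploy, health-check, and rollback functions rather than Stripe's actual tooling: each stage bakes for a while, machines watch the health signal, and a regression rolls the release back automatically instead of paging a human first.)

```python
import time

def staged_rollout(deploy, health_ok, rollback,
                   stages=(1, 10, 50, 100), bake_seconds=300):
    """Roll a release out in percentage stages; auto-rollback on bad health."""
    for percent in stages:
        deploy(percent)                      # e.g. shift % of traffic to new build
        deadline = time.monotonic() + bake_seconds
        while time.monotonic() < deadline:   # bake: let machines watch the metrics
            if not health_ok():
                rollback()                   # machines undo it, no human in the loop
                return False
            time.sleep(5)
    return True                              # fully rolled out
```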
FARHAN THAWAR: Yeah, I'll give you a different example as well. When we're thinking about building infrastructure from the beginning of the project, we will actually imagine what is the end, you know, use case. And I remember this example where an engineer said it would probably take three weeks to build something and we said well instead of building it in three weeks, why don't you build the infrastructure layer?
Maybe it'll take you two or three months, such that someone can then build the feature in an hour on top of your infrastructure. So, when you think that way, it takes longer, but now many more use cases, and much better reliability, are going to come from it, because it's built as an infrastructure layer rather than just a feature on top of the existing infra.
RAHUL PATIL: Yeah. So we talked a lot about how, you know, we can scale engineering systems and we can think about engineering systems being designed for reliability and so on. But also, reliability is a people scale problem as well. So, for our final question, we just have a little bit of time, we have a saying, “We don't rise to the level of our expectations, we fall to the level of our training.” So, I'm kind of curious, what has worked best for you to scale this entire workforce around reliability?
FARHAN THAWAR: Yeah, so I'll say two things for this. One is that we celebrate the things that don't normally get celebrated, right? Usually you celebrate features: this came out, and here's the press release, and this merchant is using this. Instead, we try to celebrate the cleanup.
RAHUL PATIL: Yeah.
FARHAN THAWAR: Right? So we have a channel that's inside Shopify called “cleanup wins.” It's my favorite channel. It's like when you delete a module or you delete code or you've cleaned up something that was technically hairy, maybe you haven't actually enabled something specifically that a merchant can see today, but you're enabling the infrastructure to move much more quickly for the future. So that's a favorite channel of mine.
The second thing is we actually looked at our performance management system, and we realized that those who talked about features ended up being credited with higher impact than those who worked on the glue code behind the scenes.
RAHUL PATIL: Oh, that's a good thing.
FARHAN THAWAR: And so we actively made changes to make sure we are calling out, and having managers reward, those who are working on the hard behind-the-scenes work. So it's those two things: highlighting the great work that happens that is maybe not necessarily shown to a merchant, and then also celebrating, from an impact perspective, all of the glue work that might happen behind the scenes. And that culturally helps people feel like, “Hey, this is important, this is what we are rewarding, and you are actually doing important work.”
RAHUL PATIL: Yeah. A celebration of all the uncool work, if you will. But behind the scenes, that's what's delivering so much user impact. I love the stories around celebrating both difficult days as well as unshipping and keeping things simple. For me, it comes down to practicing our worst days. There's a lot of accountability and pressure, but gamifying it and being ready every day matters to us a lot.
That's all the time we have, folks. Farhan, I've really enjoyed all the years of partnership and the shared scars, if you will, from reliability work. Thank you so much for joining us, and thank you, audience.
FARHAN THAWAR: Thanks.