Resiliency patterns for cloud-based applications — Part I

Aline Souza — Backend Engineer

People expect to be able to use their applications anytime they want to. To accomplish this, engineering teams need to keep resiliency in mind while building their applications. Resilience is the ability of a system to manage and gracefully recover from failures. Resiliency patterns aim to ensure that applications are available whenever users need them.

Developing and designing cloud applications comes with many challenges. Throughout this article, we will walk through some resiliency patterns you may want to consider when building a cloud-based application to keep it up and running.

Parallel Availability

Resilience can be estimated in terms of a system’s availability at any given time. System availability is determined by the availability of all of its components. These components can be linked in serial or parallel connections [1].

In a serial connection, if one of the components fails, the entire system fails. For instance, if a system consists of two components operating in series, a failure of either component leads to a system failure [1], [2].

In a parallel connection, if one of two parallel components fails, the system keeps running without failure (or at least it should). For example, if a system comprises two components operating in parallel, a failure of one component leads to the other taking over the operations of the failed component [1], [2].

A serial system is operational only if all of its components are available. Hence, the availability of a serial system is the product of the availabilities of its components. For example, in a system with components X and Y, you multiply the availability of component X by the availability of component Y. The following equation represents the availability of the system: A = Aₓ · Aᵧ [2], [3].

From the above equation, we can conclude that the combined availability of two components in series is always lower than the availability of either individual component. The table below shows the availability and downtime for individual components and the combined system [2], [3], [4].

As we can see from the above table, even though a very high availability component Y was used, the low availability of component X pulls down the overall availability of the system. As the saying goes, a chain is no stronger than its weakest link. In this case, the chain is even weaker than its weakest link [2].
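As a quick sanity check, the serial formula can be evaluated in a few lines (the figures below are illustrative examples, not values from the original table):

```python
# Combined availability of components in series: A = Ax * Ay * ...
def serial_availability(*components: float) -> float:
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Illustrative example: a 99% component in series with a 99.99% component.
combined = serial_availability(0.99, 0.9999)
print(f"Serial availability: {combined:.4%}")  # 98.9901%, below either component
```

Even the near-perfect component cannot compensate: the product is always dragged below the weakest factor.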

Now, assume that you have a system consisting of two components in parallel. This system is operational if either component is available. So, the combined unavailability is the product of the individual unavailabilities, and the combined availability is that product subtracted from 1. The following equation represents the combined availability of the system: A = 1 − (1 − Aₓ)(1 − Aᵧ) [2], [3].

That means that the combined availability of two components in parallel is always higher than the availability of either individual component. The table below shows the availability and downtime for individual components and the parallel combinations [2], [3], [4], [5].

Looking at the above table, it is clear that even though a very low availability component X was used, the overall availability of the system is much higher. In this way, parallel availability provides a very powerful mechanism for building a more reliable and resilient system [2].
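The parallel formula can be checked the same way; the sketch below uses two copies of a deliberately modest component (illustrative numbers, not values from the original table):

```python
# Combined availability of components in parallel: A = 1 - (1-Ax)(1-Ay)...
def parallel_availability(*components: float) -> float:
    unavailability = 1.0
    for availability in components:
        unavailability *= 1.0 - availability
    return 1.0 - unavailability

# Illustrative example: two 99% components in parallel.
combined = parallel_availability(0.99, 0.99)
print(f"Parallel availability: {combined:.2%}")  # 99.99%
```

Duplicating a "two nines" component yields "four nines" overall, which is exactly why redundancy is the workhorse of resilient design.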

Multi-AZ Deployment

With the previous pattern, we learned that duplicating a system's components maximizes its total availability. In the cloud, this means deploying the application across several availability zones (multi-AZ) and, in some situations, across multiple regions.

Availability zones (AZs) are unique physical locations within regions, where public cloud computing resources are hosted. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking. Each region is a separate geographic area. The physical separation of AZs within a region protects applications and data from data center failures [6], [7].

A system built on a Multi-AZ architecture takes advantage of zone-redundant services that replicate applications and data across AZs to protect against single points of failure [6].

For example, if your system has a Multi-AZ database instance, there is a primary database instance that synchronously replicates the data to a standby instance in a different AZ. In case of an infrastructure failure, there will be an automatic failover to the standby instance, so that you can resume database operations as soon as the failover is complete. Since the endpoint for your database instance remains the same after a failover, your application can resume database operation without the need for manual intervention [8].
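Since the endpoint stays the same, the application-side handling of a failover can be as simple as retrying transient errors with backoff, roughly as in this sketch (`TransientDBError` and `run_query` are hypothetical stand-ins for your database driver's transient-failure exception and query call):

```python
import time

class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's transient-failure exception."""

def query_with_retry(run_query, sql, retries=5, backoff_s=1.0):
    # Because the endpoint is unchanged after a Multi-AZ failover, the
    # client simply retries the same endpoint with exponential backoff
    # until the standby has been promoted.
    for attempt in range(retries):
        try:
            return run_query(sql)
        except TransientDBError:
            time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("database still unavailable after failover window")
```

The exact exception type and sensible retry budget depend on your driver and your failover time objective; the point is that no endpoint reconfiguration is needed.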

Therefore, Multi-AZ deployment increases the availability of a system and its tolerance to faults.

Stateless Services

As we have seen, if a component of your application fails, you can have a copy of that component ready to take over. You can achieve that goal with stateless services.

A stateless app or service does not hold any data or state, so any copy of that service can serve the same function as the original. A stateless model ensures that any request or interaction with the service can be handled independently of previous requests. This model facilitates auto-scaling and recoverability, as new instances of the service can be created dynamically as the need arises, or restarted without losing the data required to handle running processes or requests [9].
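As a minimal illustration (with made-up request fields), a stateless handler derives its entire response from the request itself, so any instance of the service can serve any request:

```python
# Sketch of a stateless request handler: everything needed to answer the
# request travels in the request itself, so no instance-local session
# state is read or written.
def handle_request(request: dict) -> dict:
    user = request["user_id"]
    items = request["cart_items"]
    total = sum(item["price"] * item["qty"] for item in items)
    return {"user_id": user, "total": total}

# Two different instances given the same request produce the same response:
req = {"user_id": 42, "cart_items": [{"price": 10.0, "qty": 2}]}
assert handle_request(req) == handle_request(req)
```

If per-user state is genuinely needed (a shopping cart, a session), a stateless service pushes it into the request or into shared external storage rather than into instance memory.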

The widely used REST (REpresentational State Transfer) paradigm is a stateless model; in fact, statelessness is one of the key criteria for whether an interface is RESTful at all. Roy Fielding's original dissertation details the REST definition and says [10]:

“Each request from client to server must contain all of the information necessary to understand the request, and cannot take advantage of any stored context on the server. Session state is therefore kept entirely on the client.”

While you might argue that using stateless services is not a resilience strategy per se, it is still an important and valid technique to improve the resilience of a system.

Asynchronous Decoupling

Although REST APIs are popular and useful in designing applications, they tend to be built around synchronous communication, where a response is required. A request from an end-user client can trigger a complex communication journey within your services architecture, effectively introducing coupling between the services at runtime [11].

Asynchronous messaging is the most common decoupling technique. Take, for example, the need to deliver orders, generated on different external systems, to Component X and Component Y.

From a high-availability perspective, the loosely-coupled asynchronous approach enables Component X and Component Y to be unavailable as a result of a planned or unplanned outage, without affecting the external systems. The external systems can send the order creation request messages to a message queue [12].

On the other hand, if the communication is synchronous, Component X and Component Y must both be available for the external system to create the request. In this architecture, the availability requirement of Component X and Component Y must be at least the greatest availability requirement among all the tightly coupled systems combined [12].

Under a higher load, your services will need to scale out to process the requests. You then have to consider the scale-out latency, as it takes a few moments from when an auto-scaling policy triggers the creation of additional instances until they are ready for action. It takes time to initiate new container tasks, too [11].

In a synchronous communication approach, if the scaling event happens late, you may be unable to handle all incoming requests with the available resources. Such requests can be lost or answered with HTTP status code 5xx [11].

In contrast, in an asynchronous communication approach, you can use message queues that buffer messages during a scaling event to help avoid this. This is the more robust architecture, even in use cases where the end-user client is waiting for an immediate response. When your infrastructure takes time to scale out and you cannot process all requests in a timely manner, the requests simply persist in the queue until they can be handled [11].
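The buffering behavior can be sketched with an in-memory queue (a stand-in for a real message broker such as SQS or RabbitMQ):

```python
from queue import Queue

# Sketch of asynchronous decoupling: producers enqueue order requests and
# return immediately; consumers drain the queue at their own pace, so a
# slow or briefly unavailable consumer does not lose requests.
orders = Queue()

def submit_order(order: dict) -> None:
    orders.put(order)          # the external system is not blocked on processing

def process_pending(handler) -> int:
    processed = 0
    while not orders.empty():  # drain whatever buffered while consumers were away
        handler(orders.get())
        processed += 1
    return processed

# Requests submitted while no consumer is running simply buffer:
for i in range(3):
    submit_order({"order_id": i})
print(process_pending(lambda order: None))  # 3
```

With a real broker, the queue also survives process restarts, which an in-memory `Queue` of course does not; the decoupling principle is the same.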

Prioritize traffic with queues

Although it may be easy to see the benefits of a queue asynchronously processing messages, the drawbacks of using a queue are subtle. In a queue-based system, during intervals of high traffic, messages can arrive faster than your services can process them. And if processing stops while messages keep coming in, the message debt grows into a huge backlog, pushing up the processing time [13], [14].

To put it another way, a queue-based system has two operating modes or bimodal behavior. The latency of the system is low when there’s no backlog in the queue, and the system is in steady mode. However, if a failure or a higher load causes the rate of arrival to surpass the processing limit, it easily flips into a more sinister mode of operation. The end-to-end latency in this mode increases exponentially, and it can take a lot of time to work through the backlog to get back into the steady mode [13].

Below are a few design techniques that can help you prevent long queue backlogs and recovery times:

  • Protect the system at every layer. In an asynchronous system, each part of the system needs to protect itself against overload and prevent one workload from consuming an excessive share of resources. We do this by implementing throttling and admission control [13].
  • Using multiple queues helps to control traffic. Asynchronous systems are often multitenant, performing work on behalf of a large number of different customers. In some respects, a single queue is at odds with multitenancy: once work is queued up in a shared queue, isolating one customer's workload from another's is difficult [13].
  • Turn to LIFO behavior instead of FIFO when faced with a backlog. For most real-time systems, it is preferable to process fresh data immediately when a backlog happens. Any data accumulated during an outage or spike can then be processed when capacity becomes available [13].

In certain cases, it is too late to prioritize traffic after a backlog has built up in a queue. However, if processing a message is quite costly or time-consuming, being able to transfer messages into a separate queue can still be worthwhile. For example, during a spike, expensive messages can be moved to a low-priority queue. The same approach applies to messages that meet certain age criteria. The system works on the low-priority queue only when resources are available [13], [15].
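A minimal sketch of this two-queue approach, with a made-up `cost` field standing in for whatever marks a message as expensive in your system:

```python
from collections import deque

# Sketch of traffic prioritization with two queues: expensive messages are
# parked in a low-priority queue that is only drained once the
# high-priority queue is empty.
high, low = deque(), deque()

def enqueue(msg: dict, cost_threshold: float = 1.0) -> None:
    (low if msg["cost"] > cost_threshold else high).append(msg)

def next_message():
    if high:
        return high.popleft()
    if low:
        return low.popleft()  # reached only once the cheap work is done
    return None

enqueue({"id": 1, "cost": 0.2})
enqueue({"id": 2, "cost": 5.0})   # expensive: parked in the low-priority queue
enqueue({"id": 3, "cost": 0.3})
print([next_message()["id"] for _ in range(3)])  # [1, 3, 2]
```

Real brokers express the same idea with separate physical queues and consumer weighting rather than an in-process conditional, but the routing decision is the same.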

There are many strategies to make asynchronous systems resilient to workload changes, such as shuffle-sharding, dropping old messages (message time-to-live), heartbeating long-running messages, and so on. We are not going to cover all of them here, but you can take a look at the next section for further learning resources.

Find out more!

That’s all for today

Today we discussed five of the most popular resiliency patterns out there. You probably want to consider them when building your resilient cloud-based application.

I hope you enjoyed part I. In the next part, we're going to discuss resiliency patterns related to databases. Stay tuned!

[1] Oggerino, C., High Availability Network Fundamentals. Cisco, 2001.

[2] System Reliability and Availability

[3] Building Global, Multi-Region Serverless Backends

[4] Uptime & downtime conversion cheat sheet

[5] Uptime & downtime tool

[6] Regions and Availability Zones in Azure

[7] Regions, Availability Zones, and Local Zones

[8] Amazon RDS Multi-AZ Deployments

[9] Patterns for scalable and resilient apps

[10] Roy Fielding’s Dissertation — Chapter 5: Representational State Transfer (REST)

[11] Understanding asynchronous messaging for microservices

[12] Asynchronous integration as a decoupling technique

[13] Avoiding insurmountable queue backlogs

[14] Cisco IOS QoS Solutions Configuration Guide, Release 12.2SR

[15] Priority Queue pattern


Resiliency patterns for cloud-based applications — Part I was originally published in New Work Development on Medium, where people are continuing the conversation by highlighting and responding to this story.

Artificial intelligence and work in the future

In recent centuries, we have been living through real processes of change in society and in the world. From the first wave of industrialization in 1765, with the introduction of coal, steam engines, and production lines, to what we consider the third industrial revolution, with the arrival of atomic energy, computers, and telecommunications, the world has always had a new, more powerful source of energy as its force of change. But will energy be the engine of the next industrial revolution? I don't believe it…

I believe that we are already experiencing a new industrial revolution but, today, it is based on a new paradigm. For the first time in history, it is not energy that is changing the world, but computing and computers. As Marc Andreessen so aptly summarized: “software is eating the world”. It is with this in mind that I invite you on a philosophical journey about the impact of software, and more specifically of artificial intelligence, on the world and on society.

I like to define artificial intelligence as the ability of a digital machine to perform a task normally associated with intelligent beings, namely the ability to see, act, communicate, infer and learn.

In order to understand how this technology can impact humanity, I believe we must take refuge in philosophy, more specifically in the schools of thought of dualism and monism.

Dualism arises in the Greek school of thought with Plato and Aristotle, and is later taken up by Descartes; they argue that the mind and the body are two distinct, non-identical entities. Plato speculates on the existence of a world of ideas that contains things that do not exist in the real world, such as a perfect circle. A truly perfect circle will never exist in the real world, but it does exist in our imagination.

Monism, in turn, appears with Heraclitus and is developed several centuries later by thinkers such as Spinoza and Berkeley, who argue that the mind and the body are manifestations of a single entity. Fundamentally, everything in the universe is made up of atoms and everything is subject to the same set of laws of physics. Even our emotions and dreams are just brain synapses and chemical reactions. In this view, everything can be broken down into its fundamental components: a chair is made of the same fundamental elements as a human being.

Are we humans, then, machines ourselves, or something more? Based on the two currents of thought, we can infer that, on one hand, a machine could behave like a human or, on the other hand, that a machine will never have access to the world of creativity and emotions that characterizes us so well. But what does all this have to do with artificial intelligence and its impact on the job market? In fact, everything!

Monism defends a union between the body and the mind, which leads me to think that artificial intelligence could leave humanity without work, with the world having to reorder itself and find new ways of physical and emotional sustainability (a universal basic income, for example). If we are made of exactly the same fundamental elements as the rest of the universe, there are solid reasons to believe that a machine could one day take our place, in many cases being better than us humans.

Now, if we look at the same problem with a dualistic lens, I believe that not all jobs will be taken over by machines. Since humans are fundamentally different from machines, there will always be work that machines will not be able to do. Thus, there will always be a haven for humanity, where machines will not be able to access our mental, abstract and creative world. I believe that many sectors will be transformed by artificial intelligence (such as industry, retail, services, among others), but many more will be created based on human sensitivity.

It is with this duality of perspectives that I leave you. With the certainty that we are on the verge of the greatest industrial and societal revolution in recent centuries. It is important to be aware of this duality of thinking that shapes our beliefs, conversations and predictions about impact on the world. I, personally, am a humanist. I believe in dualism and the existence of a creative, emotional and imaginative world. Although artificial intelligence will wipe out many jobs, many more will be created precisely based on these characteristics that make us human.

This article was originally published at Dinheiro Vivo.

Author: Rui Barreira, VP Engineering at kununu

Artificial intelligence and work in the future was originally published in New Work Development on Medium, where people are continuing the conversation by highlighting and responding to this story.
