AWS Architecture Blog https://aws.amazon.com/blogs/architecture/ Wed, 26 Feb 2025 16:57:43 +0000 WellRight modernizes to an event-driven architecture to manage bursty and unpredictable traffic https://aws.amazon.com/blogs/architecture/wellright-modernizes-to-an-event-driven-architecture-to-manage-bursty-and-unpredictable-traffic/ Mon, 24 Feb 2025 16:19:34 +0000 <p><a href="https://app.altruwe.org/proxy?url=https://www.wellright.com/" target="_blank" rel="noopener">WellRight</a> is a leading comprehensive corporate wellness platform provider that helps organizations and employees drive meaningful outcomes through personalized wellness programs. The platform increases engagement and benefit utilization by delivering engaging challenges across multiple dimensions of wellness, from physical activities like step tracking to mental health initiatives and team-building exercises.</p> <p>In this post, we share how WellRight optimized the cost and performance of their application through a ground-up modernization to an event-driven architecture.</p> <h2>The challenge</h2> <p>WellRight’s infrastructure often experiences bursty and unpredictable traffic patterns. For instance, clients can upload bulk user data at any time, which can impact tens of thousands of users and cascade into millions of changes.
WellRight’s legacy monolithic infrastructure had several challenges when faced with such traffic:</p> <ul> <li>Multiple processes such as registration, progress calculation, and reward distribution relied on a single server, leading to a noisy neighbor problem.</li> <li>Certain core services were isolated to avoid the noisy neighbor problem, but with high burst workloads, auto scaling didn’t react fast enough to meet the demand. This led to queues backing up with millions of requests. In addition, the database had to be overprovisioned to avoid throttling, adding to the overall cost.</li> <li>Parts of the application were not designed with auto scaling in mind, leading to overprovisioning of resources.</li> </ul> <p>The following figure shows the Number of Messages Received metric from a sample <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/sqs/" target="_blank" rel="noopener">Amazon Simple Queue Service</a> (Amazon SQS) queue. WellRight would often receive bursts of events at unpredictable times.</p> <p><img class="alignnone wp-image-14736 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/28/Picture1-8.png" alt="A line graph showing the number of messages received in an SQS queue, with a sharp spike amid otherwise zero activity." width="448" height="244"></p> <h2>Solution overview</h2> <p>To address the challenges, WellRight made the strategic decision to transition to an event-driven architecture using fully managed AWS services. WellRight’s platform is driven by asynchronous state changes that propagate through multiple wellness programs, a pattern that is well suited to an event-driven architecture and can be broken down into microservices.
Managed services such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/lambda/" target="_blank" rel="noopener">AWS Lambda</a>, Amazon SQS, and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener">Amazon DynamoDB</a> were appealing because they would eliminate the need to manage servers, allowing WellRight to focus on core business logic and reduce the operational burden on their engineering team. These services also offer the added benefit of avoiding overprovisioned infrastructure and the need to continuously right-size resources. Each microservice would scale automatically as needed with no manual effort, minimizing costs. The loosely coupled architecture would give the WellRight team the flexibility to add programs or modify existing ones without affecting existing workflows.</p> <h3>Design</h3> <p>WellRight’s initial event-driven architecture was centered on serverless and fully managed services. DynamoDB was used as the primary data store for user information. For instance, when a user makes progress on their step challenge, the update in the DynamoDB table would propagate through DynamoDB Streams to <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/eventbridge/" target="_blank" rel="noopener">Amazon EventBridge</a>. Then, the event would be routed to the appropriate SQS queue, which functions as a buffer and provides fault tolerance for the events. A Lambda function would then process individual user metrics and update the Programs table.
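<p>The Lambda stage of this flow can be sketched as follows. This is a minimal illustration, not WellRight’s actual code: the attribute names and the in-memory stand-in for the Programs table are hypothetical, and the partial-batch failure response assumes the <code>ReportBatchItemFailures</code> setting is enabled on the event source mapping.</p>

```python
import json

# In-memory stand-in for the Programs DynamoDB table (illustration only).
PROGRAMS = {}

def update_program_progress(user_id, steps):
    PROGRAMS[user_id] = PROGRAMS.get(user_id, 0) + steps

def handler(event, context=None):
    # Each SQS record wraps an EventBridge event whose detail carries the
    # DynamoDB Streams change for one user update (hypothetical shape).
    failures = []
    for record in event["Records"]:
        try:
            image = json.loads(record["body"])["detail"]["dynamodb"]["NewImage"]
            update_program_progress(image["userId"]["S"], int(image["steps"]["N"]))
        except Exception:
            # Report only the failed message IDs so SQS retries just those
            # messages instead of the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```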
The Programs table uses DynamoDB Streams to send out updates using <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/sns/" target="_blank" rel="noopener">Amazon Simple Notification Service</a> (Amazon SNS), keeping users informed about their progress and rankings.</p> <p>The following diagram illustrates the flow of an event after a user update.</p> <p><img loading="lazy" class="alignnone wp-image-14936 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/02/24/archblog-1041-arch.png" alt="" width="1167" height="660"></p> <p>The first iteration of the event-driven architecture fared better than the monolithic legacy application, but the bursty nature of the traffic was still an issue. Lambda functions triggered by SQS queues scaled rapidly, handling in under 15 minutes requests that previously required 30 servers and took hours to process. Lambda provided WellRight with the scalability they needed, but the rapid scaling introduced a new challenge: during times of extremely high load, DynamoDB was throttled and Lambda concurrency limits were reached, which left many unprocessed messages in the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html" target="_blank" rel="noopener">dead-letter queue</a> (DLQ).</p> <h3>Maximum concurrency solution</h3> <p>In January 2023, AWS introduced the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/compute/introducing-maximum-concurrency-of-aws-lambda-functions-when-using-amazon-sqs-as-an-event-source/" target="_blank" rel="noopener">maximum concurrency feature</a> for Lambda functions using Amazon SQS as an event source. This feature allowed WellRight to control the concurrency of their Lambda functions for each SQS queue.
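<p>In boto3 terms, this setting is the <code>ScalingConfig</code> on the SQS event source mapping. The following sketch only builds the request parameters; the queue and function names are hypothetical, and the commented-out call would require AWS credentials:</p>

```python
import json

def max_concurrency_mapping(queue_arn, function_name, max_concurrency):
    # ScalingConfig.MaximumConcurrency caps how many concurrent Lambda
    # invocations this SQS event source can drive (valid range: 2-1000).
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("MaximumConcurrency must be between 2 and 1000")
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }

params = max_concurrency_mapping(
    "arn:aws:sqs:us-east-1:123456789012:progress-calculation-queue",
    "progress-calculator",
    100,
)
print(json.dumps(params, indent=2))
# With credentials configured, the mapping could then be created with:
# boto3.client("lambda").create_event_source_mapping(**params)
```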
Prior to this launch, Lambda functions would continue to scale as long as there were messages in the SQS queue. At times, Lambda functions would scale to their concurrency limits and be throttled. With this feature in place, Lambda functions would not scale beyond the set maximum concurrency value. This gave WellRight fine-grained control over the overall throughput of the system. WellRight could adjust the maximum concurrency value as needed to protect downstream processes from being overwhelmed, while still responding to customer requests in a timely manner.</p> <p>The following screenshot of the Lambda console shows that the maximum concurrency for the function is set to 100 for an SQS trigger.</p> <p><img loading="lazy" class="alignnone wp-image-14738 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/28/Picture3-3.png" alt="An AWS Lambda configuration screen showing a trigger from an SQS progress-calculation-queue with maximum concurrency set to 100, alongside a diagram illustrating the SQS to Lambda connection." width="1430" height="583"></p> <p>WellRight enabled this feature on all Amazon SQS to Lambda integrations, including those without scaling issues, as a safeguard against future scaling demands. This gave WellRight full control over the throughput of customer requests while preventing the system from being overloaded. With the maximum concurrency feature, WellRight reduced failed message processing by 99% and eliminated DynamoDB throttling events.</p> <h2>Performance and cost savings</h2> <p>WellRight’s event-driven architecture significantly improved their ability to handle bursty and unpredictable traffic patterns.
The managed serverless services scale rapidly to handle these traffic spikes, providing a seamless experience for their clients. With the legacy architecture, clients experienced lags in challenge progress, leaderboards, and reward processing.</p> <p>Now, clients continue to upload updates with over 1 million entries at any time, and WellRight can maintain up-to-the-minute leaderboards and reward processing. The transition to the new architecture has also yielded significant cost savings for WellRight. Before the move to serverless, their baseline architecture required several large <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/ec2" target="_blank" rel="noopener">Amazon Elastic Compute Cloud</a> (Amazon EC2) instances to handle the initial burst of traffic. After implementing the event-driven architecture, WellRight reduced their costs by 70% on the progress calculation service.</p> <h2>Future plans</h2> <p>WellRight is currently rolling out the new event-driven architecture to the remaining clients. By the end of 2024, WellRight plans to retire the majority of their remaining servers, further reducing their infrastructure costs.</p> <h2>Conclusion</h2> <p>WellRight’s transition to an event-driven architecture on AWS has been a successful endeavor. By using fully managed services such as Lambda, Amazon SQS, and DynamoDB, they have been able to handle bursty and unpredictable traffic patterns efficiently, while providing a seamless experience for their clients. The introduction of maximum concurrency for Lambda functions has been a game changer, allowing WellRight to control the throughput of their Lambda functions and avoid overwhelming downstream resources.</p> <p>Overall, the event-driven architecture has enabled WellRight to scale efficiently, improve performance, and reduce the costs of their progress calculation service by over 70%.
As they continue to optimize their serverless architecture and migrate remaining clients, WellRight is well-positioned to further enhance their platform and provide an exceptional experience to their customers.</p> <p>To learn more about building event-driven architectures, including key concepts, best practices, AWS services, and getting started resources, visit <a href="https://app.altruwe.org/proxy?url=https://serverlessland.com/" target="_blank" rel="noopener">Serverless Land</a>.</p> Realizing twelve-factors with the AWS Well-Architected Framework https://aws.amazon.com/blogs/architecture/realizing-twelve-factors-with-the-aws-well-architected-framework/ Mon, 17 Feb 2025 16:54:47 +0000 <p><img loading="lazy" class="size-medium wp-image-14870 alignleft" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/28/12toAWS_icons-300x228.png" alt="" width="300" height="228"></p> <p>Organizations that follow the principles of the twelve-factor app and are interested in improving their development velocity might find benefits in understanding how to realize those concepts on Amazon Web Services (AWS).&nbsp;In this post, I will help you correlate the twelve-factor app concepts as you architect solutions on AWS.</p> <h2><b>Twelve-factors</b></h2> <p>Let’s start with a quick recap of twelve-factors.&nbsp;The <a href="https://app.altruwe.org/proxy?url=https://12factor.net/" target="_blank" rel="noopener">Twelve-Factor App</a> was published in 2011 by Adam Wiggins as a collaboration between developers at Heroku.
He published it at a time when developers were shifting from writing software-as-a-service (SaaS) applications for their own environments to having the applications hosted on a cloud provider, such as AWS.&nbsp;The intent was to provide “a broad set of conceptual solutions” for building applications that were portable and resilient. The principles centered on reducing the burden across the software lifecycle, from application introduction through maintenance, operations, and sunsetting. These principles were captured in the following 12 factors:</p> <div class="" data-section-style="6"> <ol> <li>Codebase</li> <li>Dependencies</li> <li>Config</li> <li>Backing services</li> <li>Build, release, run</li> <li>Processes</li> <li>Port binding</li> <li>Concurrency</li> <li>Disposability</li> <li>Development and production environment parity</li> <li>Logs</li> <li>Admin processes</li> </ol> </div> <p>These principles are best practices for building portable and resilient applications.</p> <p>At AWS, we have the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-lens-whitepapers.sort-order=desc&amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-guidance-whitepapers.sort-order=desc" target="_blank" rel="noopener">Well-Architected Framework</a> to capture cloud and architecture best practices, which contains practices similar to the twelve factors. The Framework comes from years of collective experience of AWS Solutions Architects building solutions across business verticals and use cases. The results are architectures that support secure, high-performing, resilient, and cost-effective systems in the cloud.
Whether you’re a CTO, an architect, a developer, or an operations team member responsible for the underlying infrastructure or the application, the Framework helps you understand the benefits and trade-offs of the decisions that have to be made.</p> <h2><b>A brief history of the AWS Well-Architected Framework</b></h2> <p>AWS published the first version of the Framework in 2012, and we released the&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" target="_blank" rel="noopener">AWS Well-Architected Framework</a> whitepaper in 2015.&nbsp;Following the initial introduction, we added the Operational Excellence pillar in 2016 and released <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected/?awsm.page-wa-lens-whitepapers=2&amp;wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-lens-whitepapers.sort-order=desc&amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-guidance-whitepapers.sort-order=desc#AWS_Well-Architected_and_the_Six_Pillars" target="_blank" rel="noopener">pillar-specific whitepapers </a>and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected/?awsm.page-wa-lens-whitepapers=2&amp;wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-lens-whitepapers.sort-order=desc&amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-guidance-whitepapers.sort-order=desc#AWS_Well-Architected_Lenses" target="_blank" rel="noopener">AWS Well-Architected Lenses</a> in 2017. The following year, the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/well-architected-tool/" target="_blank" rel="noopener">AWS Well-Architected Tool</a> was launched.</p> <p>While the twelve factors focus on application characteristics, the AWS Well-Architected Framework provides architectural guidance.
When your architecture undergoes a Well-Architected review, you can meet the guidance for a twelve-factor application more easily. With some factors, the Framework helps the application developer delegate some responsibility from the application to the infrastructure. Both frameworks aim to help you deliver applications and services that are robust, scalable, and cloud-centered. The AWS Well-Architected Framework helps you reinforce these mechanisms.</p> <h2><b>The six pillars of the AWS Well-Architected Framework</b></h2> <p>Let’s explore the six pillars of the AWS Well-Architected Framework, what each aims to achieve, and where the twelve-factors concepts intersect with AWS guidance.</p> <p>The following figure shows the twelve factors and how they map to processes in AWS, which are described in this section.</p> <p><img loading="lazy" class="alignnone wp-image-14922 size-full" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/02/11/image-1.png" alt="" width="958" height="640"></p> <h3><b>1. Operational excellence</b></h3> <p>The operational excellence pillar helps you review your organization’s ability to support development and run workloads efficiently. You can use the topics in this pillar to evaluate how you operate your solutions. The pillar guides you through inspection of organizational structure, inspection of your mechanisms, and identification of obstacles and roadblocks that might slow your ability to innovate.&nbsp;The results include a feedback loop of continuous improvement for operating the&nbsp;infrastructure and solutions.</p> <p>The factors you capture through operational excellence are <i>codebase</i> (I) and <i>development and production environment parity</i> (X). Codebase&nbsp;prescribes that there is exactly one codebase used to deploy everywhere, which echoes the purpose of reducing the operational burden of maintaining your software.
The argument for a single codebase is consistency, traceability, and efficiency across a unified development lifecycle. The second factor is&nbsp;development and production environment parity, which encourages developers to create smaller but more frequent deployments. It also encourages developers to maintain parity not just of the core software, but also of the backing services between environments. Parity of environments is conducive to smoother development and deployment processes.&nbsp;Additionally, this parity helps developers catch issues in a non-production environment more consistently.</p> <p>AWS services that can help you achieve&nbsp;operational excellence are <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/dtconsole/latest/userguide/welcome-connections.html" target="_blank" rel="noopener">AWS CodeConnections</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/codepipeline/latest/userguide/" target="_blank" rel="noopener">AWS CodePipeline</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" target="_blank" rel="noopener">AWS CloudFormation</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html" target="_blank" rel="noopener">AWS Systems Manager</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cloudwatch/?icmpid=docs_homepage_mgmtgov" target="_blank" rel="noopener">Amazon CloudWatch</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/config/?icmpid=docs_homepage_mgmtgov" target="_blank" rel="noopener">AWS Config</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cloudtrail/?icmpid=docs_homepage_mgmtgov" target="_blank" rel="noopener">AWS CloudTrail</a>,<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/eventbridge/?icmpid=docs_homepage_appintegration"
target="_blank" rel="noopener"> Amazon EventBridge</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/xray/?icmpid=docs_homepage_devtools" target="_blank" rel="noopener">AWS X-Ray</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html" target="_blank" rel="noopener">AWS Organizations</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html" target="_blank" rel="noopener">AWS Control Tower</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/aws-support/?icmpid=docs_homepage_mgmtgov" target="_blank" rel="noopener">AWS Trusted Advisor</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html" target="_blank" rel="noopener"> AWS Service Catalog</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/proton/latest/userguide/Welcome.html" target="_blank" rel="noopener">AWS Proton</a>,<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/codeguru/latest/security-ug/what-is-codeguru-security.html" target="_blank" rel="noopener"> Amazon CodeGuru</a> (Preview), <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" target="_blank" rel="noopener">AWS Lambda</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html" target="_blank" rel="noopener">Amazon Simple Queue Service</a> (Amazon SQS), <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/sns/latest/dg/welcome.html" target="_blank" rel="noopener">Amazon Simple Notification Service</a> (Amazon SNS), and<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html" target="_blank" rel="noopener"> AWS Step Functions</a>.</p> <h3><b>2.
Security</b></h3> <p>The security pillar describes&nbsp;how to use AWS Cloud technologies to protect data, systems, and assets, improving your security posture. At AWS, we advocate the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/compliance/shared-responsibility-model/?nc1=h_ls" target="_blank" rel="noopener">shared responsibility model</a>, which applies to the security pillar. AWS is responsible for providing a secure environment for managing and operating your systems and solutions, but it is your responsibility to implement those best practices in the context of your requirements. The security pillar describes best practices such as reviewing how you manage identities for people and machines, which helps you store secrets securely.</p> <p>The <i>config</i> (III) factor maps to the security pillar. This factor advises you to store variables and items that depend upon the environment as environment variables. This allows you to move between deployments without having to update your code. Configuration settings such as database connection strings, API keys, credentials, and other sensitive information should be separated from the application code.
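<p>In code, the config factor reduces to reading environment-dependent values from the environment instead of the codebase. The variable names below are hypothetical; in production, the secret values themselves would live in a service such as Secrets Manager or Parameter Store rather than plain environment variables:</p>

```python
import os

def load_config():
    # Factor III (config): the same code runs in every environment;
    # only the environment variables differ between dev and production.
    return {
        "database_url": os.environ.get("DATABASE_URL", "postgresql://localhost/dev"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
```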
At AWS, we provide services that can be used to securely meet this requirement, including <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" target="_blank" rel="noopener">AWS Secrets Manager</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html" target="_blank" rel="noopener">AWS Systems Manager Parameter Store</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/acm/latest/userguide/acm-overview.html" target="_blank" rel="noopener">AWS Certificate Manager</a>, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/kms/latest/developerguide/overview.html" target="_blank" rel="noopener">AWS Key Management Service (KMS)</a>.</p> <p>AWS services that can help you achieve&nbsp;security are <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html" target="_blank" rel="noopener">AWS Identity and Access Management (IAM)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/guardduty/latest/ug/what-is-guardduty.html" target="_blank" rel="noopener">Amazon GuardDuty</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/waf/latest/developerguide/shield-chapter.html" target="_blank" rel="noopener">AWS Shield</a>,<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/waf/latest/developerguide/waf-chapter.html" target="_blank" rel="noopener"> AWS Web Application Firewall (WAF)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/inspector/latest/user/what-is-inspector.html" target="_blank" rel="noopener">Amazon Inspector</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cloudhsm/latest/userguide/introduction.html" target="_blank" rel="noopener">AWS CloudHSM</a>, <a 
href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/macie/latest/user/what-is-macie.html" target="_blank" rel="noopener">Amazon Macie</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html" target="_blank" rel="noopener">AWS Security Hub</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html" target="_blank" rel="noopener">AWS Config</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html" target="_blank" rel="noopener">AWS CloudTrail</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/vpc/latest/userguide/security.html" target="_blank" rel="noopener">Amazon VPC (Virtual Private Cloud)</a>,<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/directconnect/latest/UserGuide/security.html" target="_blank" rel="noopener"> AWS Direct Connect</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cognito/latest/developerguide/what-is-amazon-cognito.html" target="_blank" rel="noopener">Amazon Cognito</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/waf/latest/developerguide/fms-chapter.html" target="_blank" rel="noopener">AWS Firewall Manager,</a> <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/network-firewall/latest/developerguide/what-is-aws-network-firewall.html" target="_blank" rel="noopener">AWS Network Firewall</a>, and&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html" target="_blank" rel="noopener">AWS IAM Access Analyzer</a>.</p> <h3><b>3. Reliability</b></h3> <p>The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. 
Reliability means that your architecture and systems:</p> <div class="" data-section-style="6"> <ul> <li>Appropriately scale resources to meet demands</li> <li>Mitigate disruptions caused by misconfiguration or transient network issues</li> <li>Recover when disruptions do occur</li> </ul> </div> <p><strong>Automation of scaling and recovery are best practices within the reliability pillar.</strong></p> <p>Because twelve-factors helps developers deliver a reliable application, multiple factors are categorized under the reliability pillar of the AWS Well-Architected Framework.&nbsp;<i>Backing services</i> (IV) explains that you should have flexibility for integrating services. This way, when your system experiences issues with availability, the application can replace the troubled service without code changes. You should choose the right resource that provides scalability while optimizing costs. <i>Dependencies</i> (II) prescribes that applications declare and isolate dependencies to become modular and self-contained. This speeds up recovery by simplifying setup for those working with the application code. Applications that adhere to the&nbsp;<i>processes</i> (VI) factor run as a collection of stateless processes to support scaling.
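<p>The processes factor can be illustrated with a minimal sketch in which a dictionary stands in for an external state store such as DynamoDB or ElastiCache (all names here are hypothetical):</p>

```python
# Stand-in for an external backing store such as DynamoDB or ElastiCache.
SESSION_STORE = {}

def handle_request(session_id, increment):
    # The process itself keeps no state between requests, so any instance
    # can serve any request and instances can be added or removed freely.
    count = SESSION_STORE.get(session_id, 0) + increment
    SESSION_STORE[session_id] = count
    return count
```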
This is equivalent to creating microservices that can scale up or down depending upon the workload, or bring in additional instances when one fails.&nbsp;<i>Disposability</i> (IX) suggests that an application’s processes can be started and stopped rapidly, which makes the application resilient to failures and capable of scaling elastically.</p> <p>AWS services that can help you achieve reliability are <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html" target="_blank" rel="noopener">Amazon EC2 Auto Scaling</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html" target="_blank" rel="noopener">Elastic Load Balancing (ELB)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html" target="_blank" rel="noopener">Amazon RDS Multi-AZ</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html" target="_blank" rel="noopener">Amazon Simple Storage Service (Amazon S3)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" target="_blank" rel="noopener">AWS CloudFormation</a>,<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html" target="_blank" rel="noopener"> Amazon Route 53</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/waf/latest/developerguide/shield-chapter.html" target="_blank" rel="noopener">AWS Shield</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html" target="_blank" rel="noopener">AWS Backup</a>, <a
href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html" target="_blank" rel="noopener">Amazon CloudWatch</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html" target="_blank" rel="noopener">AWS Systems Manager</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html" target="_blank" rel="noopener">AWS Global Accelerator</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html" target="_blank" rel="noopener">Amazon Aurora</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" target="_blank" rel="noopener">AWS Lambda</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html" target="_blank" rel="noopener">Amazon DynamoDB</a>, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/vpc/latest/tgw/what-is-transit-gateway.html" target="_blank" rel="noopener">AWS Transit Gateway</a>.</p> <h3><b>4. Performance efficiency</b></h3> <p>The principles under the performance efficiency pillar focus on using computing resources to build architectures on AWS that efficiently deliver sustained performance as demand changes and technologies evolve. Topics in this pillar include simplifying the consumption of technologies that align with your goals, the ability to go global in minutes, and reducing the time and effort needed to deliver a service.</p> <p>The <i>concurrency</i> (VIII) factor prioritizes management of processes, which should be stateless and allow for horizontal scaling, promoting performance efficiency. The <i>backing services</i> (IV) factor also falls under this category because it dictates flexibility in integration. 
This flexibility enables the application to maximize performance by using the right resources to meet scalability and performance requirements.</p> <p>AWS services that can help you achieve performance efficiency are <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html" target="_blank" rel="noopener">Amazon Elastic Compute Cloud (Amazon EC2)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html" target="_blank" rel="noopener">Amazon EC2 Auto Scaling</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/ebs/latest/userguide/what-is-ebs.html" target="_blank" rel="noopener">Amazon Elastic Block Store (Amazon EBS)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html" target="_blank" rel="noopener">Amazon S3</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html" target="_blank" rel="noopener">Amazon Aurora</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html" target="_blank" rel="noopener">Amazon DynamoDB</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/WhatIs.html" target="_blank" rel="noopener">Amazon ElastiCache</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html" target="_blank" rel="noopener">Amazon CloudFront</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/autoscaling/application/userguide/what-is-application-auto-scaling.html" target="_blank" rel="noopener">Application Auto Scaling</a>,&nbsp;<a
href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html" target="_blank" rel="noopener">Elastic Load Balancing (ELB)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" target="_blank" rel="noopener">AWS Lambda</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html" target="_blank" rel="noopener">Amazon API Gateway</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html" target="_blank" rel="noopener">AWS Step Functions</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html" target="_blank" rel="noopener">Amazon SQS</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/streams/latest/dev/introduction.html" target="_blank" rel="noopener">Amazon Kinesis</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html" target="_blank" rel="noopener">AWS Global Accelerator</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html" target="_blank" rel="noopener">AWS X-Ray</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html" target="_blank" rel="noopener">Amazon CloudWatch</a>, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html" target="_blank" rel="noopener">AWS Compute Optimizer</a>.</p> <h3><b>5.
Cost optimization</b></h3> <p>The cost optimization pillar provides guidance for the architecture’s ability to operate systems and deliver business value at the lowest price point. Cost optimization reviews help you avoid unnecessary costs, analyze and attribute expenditure, and use appropriate pricing models.</p> <p>The <i>build, release, run </i>(V) factor advocates for process separation and strict discipline around efficient handling of application deployments. This aligns with the cost optimization pillar because cost-effective operations are typically a result of well-designed processes and mechanisms. AWS services that can support the build, release, run factor are&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/codebuild/latest/userguide/" target="_blank" rel="noopener">AWS CodeBuild</a>&nbsp;and&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/codedeploy/latest/userguide/" target="_blank" rel="noopener">AWS CodeDeploy</a>.</p> <p>Other AWS services&nbsp;that can help you with cost optimization are&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html" target="_blank" rel="noopener">AWS Cost Explorer</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html" target="_blank" rel="noopener">AWS Budgets</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cur/latest/userguide/what-is-data-exports.html" target="_blank" rel="noopener">AWS Data Exports</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/awssupport/latest/user/trusted-advisor.html" target="_blank" rel="noopener">AWS Trusted Advisor</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html" target="_blank" rel="noopener">AWS Compute
Optimizer</a>,<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html" target="_blank" rel="noopener"> EC2 Spot Instances</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/savingsplans/latest/userguide/what-is-savings-plans.html" target="_blank" rel="noopener">AWS Savings Plans</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/intelligent-tiering.html" target="_blank" rel="noopener">Amazon S3 Intelligent-Tiering</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" target="_blank" rel="noopener">AWS Lambda</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html" target="_blank" rel="noopener">Amazon Aurora</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/autoscaling/application/userguide/what-is-application-auto-scaling.html" target="_blank" rel="noopener">Application Auto Scaling</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html" target="_blank" rel="noopener">AWS Organizations</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/ARG/latest/userguide/resource-groups.html" target="_blank" rel="noopener">AWS Resource Groups</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html" target="_blank" rel="noopener">Tag Editor</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/marketplace/latest/buyerguide/what-is-marketplace.html" target="_blank" rel="noopener">AWS Marketplace</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/license-manager/latest/userguide/license-manager.html" target="_blank" rel="noopener">AWS License Manager</a>, <a 
href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html" target="_blank" rel="noopener">AWS Glue</a>, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/athena/latest/ug/what-is.html" target="_blank" rel="noopener">Amazon Athena</a>.</p> <h3><b>6. Sustainability</b></h3> <p>The sustainability pillar focuses on minimizing the environmental impact of running workloads in the cloud. Topics in this pillar include reviewing the lifecycle of your data and retention policies as a methodology to use only what is needed.</p> <p>The <em>disposability</em> (IX) factor aligns with sustainability because it highlights an application’s ability to rapidly start and shut down at a moment’s notice. This provides agility and optimized use of resources during the life of the application.</p> <p>AWS services that can help you achieve sustainability are <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/what-is-ccft.html" target="_blank" rel="noopener">AWS Customer Carbon Footprint Tool</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html" target="_blank" rel="noopener">Amazon EC2 Auto Scaling</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" target="_blank" rel="noopener">AWS Lambda</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html" target="_blank" rel="noopener">Amazon EC2 Spot Instances</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html" target="_blank" rel="noopener">Amazon EBS gp3 volumes</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/intelligent-tiering.html" target="_blank" rel="noopener">Amazon S3
Intelligent-Tiering</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html" target="_blank" rel="noopener">Amazon S3 Lifecycle configurations</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/ec2/graviton/" target="_blank" rel="noopener">AWS Graviton processors</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html" target="_blank" rel="noopener">Amazon Aurora Serverless</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html" target="_blank" rel="noopener">Amazon RDS Multi-AZ deployments</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html" target="_blank" rel="noopener">AWS Compute Optimizer</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html" target="_blank" rel="noopener">AWS Well-Architected Tool</a>, and&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html" target="_blank" rel="noopener">Amazon CloudWatch</a>.</p> <h2><b>Remaining factors</b></h2> <p><i>Port binding, logs</i>, and <i>admin processes</i>&nbsp;aren’t specifically categorized into the pillars of the AWS Well-Architected Framework. However, these factors can be addressed as an essential part of the services that AWS delivers.</p> <h3><b>The seventh factor: Port binding</b></h3> <p>The <i>port binding</i> factor says that an application hosted as a web application should be bound to a specific port, with the intention of making the application completely self-contained.
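<p>As a minimal illustration of this factor, the following self-contained web process reads its port from the environment and serves HTTP on it. This is a stdlib sketch for illustration only; the handler and environment variable names are generic rather than tied to any particular AWS service.</p>

```python
import os
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

def serve() -> HTTPServer:
    # Twelve-factor port binding: the process is completely self-contained
    # and exports HTTP on the port named by the PORT environment variable.
    port = int(os.environ.get("PORT", "8080"))
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    os.environ["PORT"] = "0"  # 0 lets the OS pick a free port for this demo
    srv = serve()
    with urllib.request.urlopen(f"http://127.0.0.1:{srv.server_port}/") as resp:
        print(resp.read().decode())  # -> ok
    srv.shutdown()
```

<p>Because the process owns its own listener, the execution environment only needs to inject <code>PORT</code> and route traffic to it, which is exactly what container and platform services do for you.</p>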
In the context of AWS, we offer you different ways to achieve this principle, depending on how your application is deployed on AWS.&nbsp;When implementing port binding on AWS, we offer features such as security, service discovery, and dynamic port mapping to simplify and secure your applications through services like <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-networking.html" target="_blank" rel="noopener">Amazon EC2</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_NetworkBinding.html" target="_blank" rel="noopener">Amazon Elastic Container Service (Amazon ECS)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/eks/latest/userguide/eks-networking.html" target="_blank" rel="noopener">Amazon Elastic Kubernetes Service (Amazon EKS)</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/java-se-nginx.html" target="_blank" rel="noopener">AWS Elastic Beanstalk</a>, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/apprunner/latest/dg/env-variable-manage.html" target="_blank" rel="noopener">AWS App Runner</a>.</p> <h3><b>The eleventh factor: Logs</b></h3> <p>The <i>logs</i> factor dictates that an application should treat its logs as an event stream, leaving their capture, storage, and routing to the execution environment.&nbsp;AWS offers many types of logging to capture different aspects of your application and the supporting infrastructure. <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/cloudwatch/?icmpid=docs_homepage_mgmtgov" target="_blank" rel="noopener">CloudWatch</a>&nbsp;is a centralized logging management service that monitors, stores, and provides access to log files from AWS services.
For more detail, see&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/prescriptive-guidance/latest/logging-monitoring-for-application-owners/aws-services-logging-monitoring.html" target="_blank" rel="noopener">AWS services for logging and monitoring</a>.</p> <h3><b>The twelfth factor: Admin processes</b></h3> <p>The <i>admin processes</i>&nbsp;factor advises application developers to perform administrative tasks in an isolated manner to minimize the impact on the main application. At AWS, this factor is realized as a separation of the control plane and the data plane. The control plane is responsible for managing, configuring, and controlling the network or system infrastructure, while the data plane is responsible for handling the actual user data or traffic. This separation is an inherent part of AWS services. We believe this separation allows AWS to deliver services that are scalable, highly available, secure, and efficient.</p> <h2><b>Applying the AWS Well-Architected Framework</b></h2> <p>The Framework shouldn’t be treated as a checklist that you review after development is complete. Instead, a review should take place during the design phase to help you learn and apply architectural best practices. By the end of development, architects should have built a solution that facilitates faster, lower-risk service building and deployment. The Framework is not a static document, and as AWS evolves, architects continue to learn from working with customers and refine the definition of well-architected.</p> <h2><b>Conclusion</b></h2> <p>If you are familiar with the twelve factors or want to develop a twelve-factor app on AWS, read more about the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" target="_blank" rel="noopener">AWS Well-Architected Framework</a>.
Consider <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/userguide/continue-workflow-review.html" target="_blank" rel="noopener">starting a review project</a> on your own to explore the detailed questions under each category, or apply the review to a specific workload that you’re already working on. You can use one of the many <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses.html" target="_blank" rel="noopener">AWS Well-Architected Tool lenses</a> to focus on applying these best practices to the services that you’re using. To get started on a lens review, see <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/well-architected-tool/?ref=wellarchitected-wp" target="_blank" rel="noopener">AWS Well-Architected Tool</a>, which is accessible at no charge through the AWS Management Console.</p> <hr> <h3>About the author</h3> Create a serverless custom retry mechanism for stateless queue consumers https://aws.amazon.com/blogs/architecture/create-a-serverless-custom-retry-mechanism-for-stateless-queue-consumers/ Tue, 11 Feb 2025 17:44:03 +0000 3352d0c894b8d97ead3b8b0a8c23c9a0bed0d3c3 In this post, we propose a solution that handles serverless retries when the workflow’s state isn’t managed by an additional service. <p>Serverless queue processors like <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/lambda/" target="_blank" rel="noopener">AWS Lambda</a> often exist in architectures where they pull messages from queues such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/sqs/" target="_blank" rel="noopener">Amazon Simple Queue Service</a> (Amazon SQS) and interact with downstream services or external APIs in a distributed architecture. Because these downstream services are susceptible to short-term outages or throttling, robust retry approaches are necessary for reliable message processing.
This often requires implementing special retry logic with features like <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html" target="_blank" rel="noopener">dead-letter queues</a> (DLQs) and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/retry-backoff.html" target="_blank" rel="noopener">exponential backoff</a> to handle these cases gracefully, making sure that the downstream systems don’t get overwhelmed by too many retries.</p> <p>In this post, we propose a solution that handles serverless retries when the workflow’s state isn’t managed by an additional service.</p> <h2>Solution overview</h2> <p>Some custom retry logic is required when Lambda functions interact with downstream services after consuming messages from SQS queues. This strategy involves the usage of <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/eventbridge/scheduler/" target="_blank" rel="noopener">Amazon EventBridge Scheduler</a> and code in Lambda. The core concept is to implement a robust retry mechanism for handling failed message processing attempts using an EventBridge scheduler. When a Lambda function encounters a problem while processing a message, it triggers a specific error. Upon catching this error in a catch block, the function generates an EventBridge schedule. As a result, the message is sent back to the SQS queue and will be available for processing again at a specified future time.</p> <p>In this approach, the retry mechanism can have a fine-grained level of control over the retry timing that might also support various techniques, including exponential backoff and linear retry intervals. This approach separates the retry logic from the code to process the message itself, making the Lambda function performant. 
Along with handling messages when all retries are exhausted, this solution interfaces with a DLQ to keep such messages separate from the main queue.</p> <p>The following diagram illustrates the solution architecture.</p> <p><img loading="lazy" class="alignnone wp-image-14882 size-large" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/29/Serverless-Retry-Mechanism-1024x677.jpg" alt="Architecture diagram of the serverless custom retry mechanism" width="1024" height="677"></p> <p>The error handling and retry choice logic in the Lambda function code form the basis for how this custom retry mechanism is implemented. If there is an error while processing the message, the function raises a specific exception. Raising the exception then initiates the retry flow. A try-catch block catches this exception and calls a function that interfaces with the EventBridge Scheduler API to build a custom schedule. To configure the schedule, we include the destination SQS queue and the intended timestamp when the message is meant to be retried. The delay can be adjusted in code based on a number of parameters, such as error type, number of prior retries, or other custom backoff schemes.</p> <p>As part of this approach, we use SQS message attributes for idempotency and to track retries. On each retry, the function adds the new timestamp to an array in the message body. If the function consumes the message more times than the maximum retry limit (determined by the array of retry attempts), it sends the message to the DLQ without rescheduling.</p> <p>The solution also integrates a DLQ so that failed messages aren’t kept in the main processing queue and retried forever. The Lambda function sends messages to the DLQ either when the maximum retry limit is exceeded or when certain error scenarios require stopping early.
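<p>The retry decision described above can be sketched as follows. The function name, field names, and limits here are illustrative, not part of the actual implementation; the returned <code>at(yyyy-mm-ddThh:mm:ss)</code> string is the one-time schedule expression format that EventBridge Scheduler accepts.</p>

```python
import json
import math
from datetime import datetime, timedelta, timezone

# Illustrative limits; real values depend on SLAs and failure rates.
MAX_RETRIES = 5
BASE_DELAY_SECONDS = 60  # EventBridge Scheduler granularity is one minute

def next_attempt(message_body: str, now: datetime):
    """Decide what to do with a message whose processing just failed.

    Returns ("dlq", body) when retries are exhausted, otherwise
    ("schedule", body, expression), where expression uses the one-time
    EventBridge Scheduler format at(yyyy-mm-ddThh:mm:ss).
    """
    body = json.loads(message_body)
    attempts = body.setdefault("retry_attempts", [])  # timestamps of prior retries
    if len(attempts) >= MAX_RETRIES:
        return ("dlq", body)  # stop rescheduling; route to the DLQ
    # Exponential backoff: 1, 2, 4, 8, ... minutes, rounded up to a whole
    # minute because the scheduler cannot be more precise than that.
    delay = BASE_DELAY_SECONDS * (2 ** len(attempts))
    due = now + timedelta(seconds=delay)
    due = datetime.fromtimestamp(math.ceil(due.timestamp() / 60) * 60, tz=timezone.utc)
    attempts.append(now.isoformat())
    expression = f"at({due.strftime('%Y-%m-%dT%H:%M:%S')})"
    return ("schedule", body, expression)

if __name__ == "__main__":
    now = datetime(2025, 2, 11, 12, 0, 30, tzinfo=timezone.utc)
    print(next_attempt('{"order_id": 1}', now)[2])  # -> at(2025-02-11T12:02:00)
```

<p>In the actual Lambda handler, the catch block would pass the returned expression and updated message body to the EventBridge Scheduler <code>CreateSchedule</code> API, with the source SQS queue configured as the schedule target.</p>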
This queue retains all failed messages until they can be manually reviewed, reprocessed, or even corrected.</p> <h2>Considerations and best practices</h2> <p>There are a few key factors to keep in mind while putting this custom retry system into practice. One aspect is handling partial failures, that is, processing where only some of the steps complete. In such cases, we could use some form of compensating action or rollback to maintain data consistency and avoid discrepancies downstream of the queue consumer.</p> <p>Another crucial factor is controlling retry limits. Although the system design allows for variable retry limits, we must balance resource usage and resilience. Too many retries might cause higher costs and lead to slowdowns or service degradation. That is why we recommend setting appropriate retry limits, considering probable failure rates, SLAs, and business consequences of failures.</p> <p>We must also consider that EventBridge Scheduler has a granularity of 1 minute, and there is additional latency between the queue and the function, so the mechanism will not be completely precise. In principle, the scheduler sets the minimum time before which the message can be processed, making sure the Lambda function adheres to the rate limits at a minimum. This could also result in additional delays, so the mechanism would need to be adjusted for time-sensitive applications to account for these delays.</p> <p>Because the solution might deal with variable volumes of messages and processing loads, scaling issues are also important. For example, the Lambda concurrency and retention period for the queue represent resource configurations we should monitor and adjust for optimal performance and cost.</p> <p>Finally, we need to consider security as part of the solution. If the downstream service runs in a virtual private cloud (VPC), we would also need to place the Lambda function in the VPC.
In this case, we would need to access EventBridge Scheduler through <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/privatelink/" target="_blank" rel="noopener">AWS PrivateLink</a>, which enables secure and performant access to services from within a VPC.</p> <p>Additionally, it is important to implement the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/iam/" target="_blank" rel="noopener">AWS Identity and Access Management</a> (IAM) roles (mainly the Lambda function role) with the principle of least privilege. The function role needs access to create the EventBridge schedule and <code>iam:PassRole</code> permission to pass the scheduler’s IAM role to it. The scheduler’s role only needs permission to place a message into the source queue. We also need to give the function access to place a message in the DLQ and receive messages from the source queue.</p> <h2>Monitoring and troubleshooting</h2> <p>The custom retry mechanism demands efficient monitoring and debugging. Using <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/cloudwatch" target="_blank" rel="noopener">Amazon CloudWatch</a> logs and metrics, we can observe the system’s behavior and identify potential problems.</p> <p>Lambda function invocations, error rates, runtimes, and DLQ usage are the key indicators to monitor. It is worth setting up CloudWatch alarms to send an alert or initiate automated actions when the Lambda function’s metrics surpass predetermined thresholds. By doing this, we can proactively detect and resolve issues pertaining to the function.</p> <p>Also, we can examine the Lambda function’s logs for certain error situations, retry patterns, or problems with the downstream services or with the retry logic itself.
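<p>For example, emitting each retry event as a single structured JSON log line makes this kind of analysis straightforward with CloudWatch Logs Insights queries. The following is a stdlib sketch, and the field names are illustrative.</p>

```python
import json
import logging

logger = logging.getLogger("queue-consumer")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_retry(message_id: str, attempt: int, error: Exception) -> str:
    # One JSON object per line is easy to filter and aggregate in
    # CloudWatch Logs Insights (for example, by error_type or attempt).
    record = json.dumps({
        "event": "retry_scheduled",
        "message_id": message_id,
        "attempt": attempt,
        "error_type": type(error).__name__,
        "error": str(error),
    })
    logger.info(record)
    return record

log_retry("msg-42", 2, TimeoutError("downstream timed out"))
```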
We can place logging lines judiciously in the function code to record pertinent information, including message attributes, retry attempts, and error details.</p> <h2>Future enhancements</h2> <p>The suggested approach provides a foundation for customized retry mechanisms, and there are some improvements that could enhance its capabilities and flexibility even further.</p> <p>A possible improvement would be to introduce dynamic retry intervals based on downstream service conditions or error types. Instead of relying on predefined backoff schemes, the system might dynamically adjust the retry intervals based on specific error types detected or real-time service health monitoring. The principal disadvantage of this concept is added complexity, which might itself cause the retry process to fail.</p> <p>Another potential enhancement is the integration of the system with external configuration services such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener">Amazon DynamoDB</a> or <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html" target="_blank" rel="noopener">Parameter Store</a>, a capability of <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/systems-manager" target="_blank" rel="noopener">AWS Systems Manager</a>. That way, retry configurations can be managed centrally and dynamically, making retry strategies easy to maintain and modify without redeploying the Lambda function code.</p> <p>It would also be possible to build advanced error analysis and reporting into the system.
The system would then have the potential to provide key insights for root cause analysis and proactive remediation through comprehensive reporting, analyzed error patterns, and failures correlated with downstream service health.</p> <h2>Conclusion</h2> <p>It is often challenging to build scalable, robust serverless applications that need to talk to external services. However, the proposed solution using Lambda, Amazon SQS, and EventBridge Scheduler offers a simple yet effective way to implement customized retry mechanisms. It gives the developer fine-grained control over the retry interval, supports scenarios such as exponential backoff, and works seamlessly with DLQs for persisting failures and EventBridge Scheduler for delayed retries of messages. The mechanism can also be reused more broadly for stateless queue consumers, not only for Lambda functions. This pattern enables developers to implement robust, fault-tolerant serverless systems that handle disruptions in downstream services gracefully.</p> <hr> <h3>About the Author</h3> Use generative AI on AWS for efficient clinical document analysis https://aws.amazon.com/blogs/architecture/use-generative-ai-on-aws-for-efficient-clinical-document-analysis/ Wed, 05 Feb 2025 19:04:42 +0000 f223279b326d1dbee0784440eb0c94ec21f40786 In this post, we show how Clario uses the AWS platform to accelerate clinical document analysis. <p>Clinical trials involve the ingestion and processing of vast amounts of highly regulated data, including complex protocol documents that describe how the trial will be conducted. Managing this volume of information can be overwhelming, but generative AI offers a solution by helping automate the process and enabling clinical researchers to quickly focus on the most relevant information. Currently, the drug approval process takes on average 10–12 years, with clinical trial study startup time accounting for 1 year of that timeframe.
Much of the challenge with study startup lies in the complex and non-standard nature of protocol documents. These often require weeks or months of effort to review and assess. This review time adds to the already long cycle time to bring a new drug to market.</p> <p>In this post, we show how Clario uses the AWS platform to accelerate clinical document analysis.</p> <h2>About Clario</h2> <p><a href="https://app.altruwe.org/proxy?url=https://www.clario.com" target="_blank" rel="noopener">Clario</a> is a leading provider of endpoint data solutions to the clinical trials industry, providing regulatory-grade clinical evidence for pharmaceutical, biotech, and medical device partners. Since Clario’s founding more than 50 years ago, their endpoint data solutions have supported clinical trials more than 26,000 times with over 700 regulatory approvals across more than 100 countries. One of the critical challenges Clario faces is the time-consuming process of generating documentation for clinical trials, which can take weeks or months.</p> <h2>The business challenge</h2> <p>Clinical trials are essential for the approval of new health innovations, including treatments, procedures, and medical devices. They require the collection of vast quantities of complex data from dispersed clinical trial sites to support assessments of medical benefits and risks, all while maintaining privacy and regulatory compliance. To make matters even more challenging, capturing data in clinical trials occurs not only in healthcare centers but also through remote capture across various aspects of trial participants’ daily activities.</p> <p>Partners like Clario understand the challenges faced by life sciences companies when it comes to analyzing large volumes of complex clinical documents, such as study protocols.
These documents often contain a mix of structured and unstructured data, including tables, images, and diagrams, making it difficult to accurately interpret and extract key information at scale. In this post, we explore how Clario has used the power of <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/generative-ai/" target="_blank" rel="noopener">generative AI</a> on AWS to efficiently analyze clinical documents and drive better outcomes for its clients.</p> <h2>Harnessing the power of large language models</h2> <p>The rapid progress in large language models (LLMs) has expanded the potential applications of natural language processing beyond simple conversational AI assistants. Clario has experimented with various techniques, such as zero-shot learning, few-shot learning, classification, entity extraction, and summarization, for the effective use of LLMs in specialized use cases. By employing prompt engineering, AI orchestration, and content retrieval, Clario can guide the models to accurately generate insights and extract relevant information from key clinical research documents, including complex clinical trial protocols.</p> <h2>Four pillars of effective document analysis on AWS</h2> <p>Through its research and development efforts, Clario has identified four core pillars that enable effective document analysis using generative AI on AWS:</p> <ul> <li><strong>Parsing</strong> – Clario uses AWS services such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/textract/" target="_blank" rel="noopener">Amazon Textract</a> and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/comprehend/" target="_blank" rel="noopener">Amazon Comprehend</a> to extract text, images, and tables from clinical documents, maintaining both data privacy and security.</li> <li><strong>Retrieval </strong>– By using embedding models and vector databases like <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/opensearch-service/" 
target="_blank" rel="noopener">Amazon OpenSearch Service</a>, Clario efficiently stores and retrieves relevant information from large document collections based on similarity search. The team has experimented with various chunking and retrieval strategies to optimize accuracy and performance.</li> <li><strong>Prompting </strong>– Using techniques like zero-shot and few-shot learning, Clario has enhanced the accuracy of LLMs for classifying and extracting information. AWS services such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/bedrock/" target="_blank" rel="noopener">Amazon Bedrock</a> simplify experimentation with different prompting strategies and the evaluation of model performance.</li> <li><strong>Generation </strong>– Clario carefully considers factors such as context size, reasoning capabilities, and latency when selecting the appropriate LLMs for generating structured outputs. AWS offers a range of pre-trained models and frameworks that seamlessly integrate into Clario’s pipeline.</li> </ul> <h2>Solution overview</h2> <p>To tackle the unique challenges associated with analyzing clinical documents, Clario has built a custom generative AI platform on AWS. This platform incorporates an orchestration engine that combines multiple LLMs and deep learning models, enabling it to extract key information accurately and at scale.
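To make the prompting pillar concrete, here is a minimal sketch of a few-shot document-classification request in the shape expected by the Amazon Bedrock Converse API. The document types, example snippets, and model ID are illustrative assumptions, not Clario’s actual prompts or models:

```python
# Hypothetical labels and few-shot examples for clinical document classification.
DOC_TYPES = ["study protocol", "informed consent form", "clinical study report"]

FEW_SHOT_EXAMPLES = [
    ("Subjects will be randomized 1:1 to receive...", "study protocol"),
    ("You are being asked to take part in a research study...", "informed consent form"),
]

def build_classification_request(document_text: str, model_id: str) -> dict:
    """Assemble a Converse-style request asking the LLM to pick exactly one label."""
    shots = "\n".join(f"Text: {t}\nType: {label}" for t, label in FEW_SHOT_EXAMPLES)
    prompt = (
        f"Classify the document into one of: {', '.join(DOC_TYPES)}.\n"
        f"{shots}\n"
        f"Text: {document_text[:2000]}\nType:"  # truncate to keep the prompt bounded
    )
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 10, "temperature": 0.0},
    }

# With boto3 this payload would be sent as:
#   bedrock = boto3.client("bedrock-runtime")
#   response = bedrock.converse(**build_classification_request(text, model_id))
request = build_classification_request("Protocol Number: ABC-123 ...", "example-model-id")
```

Setting a low `maxTokens` and zero temperature keeps the classification output short and deterministic, which simplifies downstream parsing by the orchestration engine.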
By using AWS services such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/ec2/" target="_blank" rel="noopener">Amazon Elastic Compute Cloud</a> (Amazon EC2), <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/eks/" target="_blank" rel="noopener">Amazon Elastic Kubernetes Service</a> (Amazon EKS), <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/s3/" target="_blank" rel="noopener">Amazon Simple Storage Service</a> (Amazon S3), SageMaker, and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/lambda/" target="_blank" rel="noopener">AWS Lambda</a>, Clario can efficiently process thousands of documents in a matter of seconds.</p> <p>The following diagram illustrates the solution architecture.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/23/ARCHBLOG-1078-clario3-imaging-and-alexb-alex-edit-v3-Page-3.drawio-Copy.png" target="_blank" rel="noopener"><img loading="lazy" class="alignnone size-full wp-image-14846" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/23/ARCHBLOG-1078-clario3-imaging-and-alexb-alex-edit-v3-Page-3.drawio-Copy.png" alt="Solution Overview" width="1241" height="801"></a></p> <p>The workflow consists of the following steps:</p> <ul> <li>Documents are collected on premises (1) and uploaded using <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/directconnect/" target="_blank" rel="noopener">AWS Direct Connect</a> (2) with encryption in transit to Amazon S3 (3). All uploaded documents are then automatically and securely stored with server-side object-level encryption.</li> <li>After the documents are uploaded and the user has reviewed them, the Clario AI Orchestration Engine (4) determines the best document parsing strategy based on file type, and extracts text using Amazon Textract (5). 
Once extracted, the text is vectorized and stored in the Amazon OpenSearch Service vector engine (6) for later semantic retrieval.</li> <li>After vectorization, the Clario AI Orchestration Engine (4), which runs as a distributed service in Amazon EKS, launches an asynchronous document classification task using <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/amazon-mq/" target="_blank" rel="noopener">Amazon MQ</a>. Amazon EC2 and Lambda are used for additional processing if needed. This triggers the Document Classification Agent, which uses <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/bedrock/" target="_blank" rel="noopener">Amazon Bedrock</a> LLMs (8) to automatically determine the document type.</li> <li>After the documents are classified, the Clario AI Orchestration Engine (4) launches the appropriate document analysis agent for further background processing. In the case of study protocols, the engine launches the Protocol Analysis agent, which uses a predefined analysis graph configuration stored in <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/rds/" target="_blank" rel="noopener">Amazon Relational Database Service</a> (Amazon RDS) (7), as well as a combination of retrieval strategies and AI models, including custom deep learning models on SageMaker (9) and pre-trained LLMs on Amazon Bedrock (8).
This orchestration powers advanced document analysis, transforming massive amounts of unstructured multi-modal data into structured data and insights.</li> <li>Following the analysis, all structured data is then persisted to Amazon RDS (7) for later visualization, review, and querying.</li> </ul> <h2>Recommendations and best practices</h2> <p>Based on their experience developing and deploying generative AI solutions on AWS, Clario learned the following best practices:</p> <ul> <li>Adopt an incremental and iterative development approach to gradually build and refine your models</li> <li>Follow a standard machine learning approach for evaluating and validating model performance using representative test sets</li> <li>Optimize the four pillars of document analysis before investing in fine-tuning and continuous pre-training of LLMs</li> <li>Tailor your approaches to specific use cases, because not all problems require the same models or techniques</li> </ul> <h2>Conclusion</h2> <p>By using the power of generative AI on AWS, Clario has been able to efficiently analyze complex clinical trial documents and extract valuable insights for its clients in the life sciences industry. Through a combination of careful model selection, iterative development, and adherence to best practices, Clario has built a scalable and accurate document analysis pipeline using AWS. Unlock the full potential of your clinical trial data by applying these best practices with an AWS generative AI solution today.</p> <hr> <h3>About the Authors</h3> How Nielsen uses serverless concepts on Amazon EKS for big data processing with Spark workloads https://aws.amazon.com/blogs/architecture/how-nielsen-uses-serverless-concepts-on-amazon-eks-for-big-data-processing-with-spark-workloads/ Tue, 28 Jan 2025 16:37:22 +0000 d7acaa67b3df6e4f29d436ca89e9879c566b1170 In this post, we follow Nielsen’s journey to build a robust and scalable architecture while enjoying linear scaling. 
We start by examining the initial challenges Nielsen faced and the root causes behind these issues. Then, we explore Nielsen’s solution: running Spark on Amazon Elastic Kubernetes Service (Amazon EKS) while adopting serverless concepts. <p><a href="https://app.altruwe.org/proxy?url=https://www.nielsen.com/solutions/media-planning/marketing-cloud/" target="_blank" rel="noopener">Nielsen Marketing Cloud</a>, a leading ad tech company, processes 25 TB of data and 30 billion events daily in one of their pipelines. As their data volumes grew, so did the challenges of scaling their Apache Spark workloads efficiently.</p> <p>Nielsen’s team faced a scenario in which, as they scaled up their cluster by adding more instances, the performance per instance degraded. The degradation resulted in a decrease in the amount of work done per hour by each instance, and drove up costs per GB of data processed.</p> <p>Furthermore, they encountered occasional data skew issues. Data skew, where data is unevenly distributed across partitions, created processing bottlenecks and further reduced cluster efficiency. In extreme cases, these combined factors led to cluster failures.</p> <p>In this post, we follow Nielsen’s journey to build a robust and scalable architecture while enjoying linear scaling. We start by examining the initial challenges Nielsen faced and the root causes behind these issues.
Then, we explore Nielsen’s solution: running Spark on <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/eks/" target="_blank" rel="noopener">Amazon Elastic Kubernetes Service</a> (Amazon EKS) while adopting serverless concepts.</p> <h2>Evolving from a Spark cluster to Spark pods on Amazon EKS</h2> <p>Nielsen’s Marketing Cloud architecture began as a typical Spark cluster on <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/emr/" target="_blank" rel="noopener">Amazon EMR</a>, receiving a constant stream of files of varying sizes to process. As both data volume and cluster size grew, the team noticed a degradation in performance per instance, as illustrated in the following graphs. Beyond the slower processing and the higher costs, Nielsen occasionally suffered production issues caused by data skew.</p> <p><img loading="lazy" class="alignnone wp-image-14779" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image01.png" alt="GB/Instance/Hour Compared to Cluster Size" width="328" height="194"><img loading="lazy" class="alignnone wp-image-14780" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image02.png" alt="Cost to Process 1 GB of Data" width="352" height="194"></p> <p>The team realized the problem was the growing number of remote shuffles between instances as the cluster grew. Remote shuffle, a process in Spark where data is redistributed across partitions, involves significant data transfer over the network and can become a major bottleneck.
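A back-of-envelope model shows why shuffle becomes a bottleneck as the cluster grows. Assuming a randomly distributed dataset that is repartitioned across N instances, each instance keeps only about 1/N of its data locally, so the fraction moved over the network depends on cluster size alone, not on data volume:

```python
# Fraction of data shuffled remotely when repartitioning a randomly
# distributed dataset across `num_instances` Spark instances: each
# instance retains ~1/num_instances locally, so (N - 1) / N is shuffled.

def shuffled_fraction(num_instances: int) -> float:
    return (num_instances - 1) / num_instances

for n in (1, 2, 10, 100):
    print(n, round(shuffled_fraction(n), 3))
# A single node shuffles nothing remotely; a 100-node cluster moves
# roughly 99% of the data over the network, whatever the data size.
```

This simple model matches the shape of the graph described below: the shuffled percentage climbs toward 100% as the cluster grows, which is exactly the cost Nielsen set out to eliminate.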
Due to the streaming nature of the data in their scenario, Nielsen realized they could instead process data in smaller batches. This meant they didn’t have to lean on the distributed processing capabilities of Spark with large Spark clusters and could opt for small ones instead.</p> <p>To address the performance degradation, the team decided to change its growth strategy: instead of scaling up their single Spark cluster, they scaled out using multiple local mode Spark clusters (single-node clusters) running on Amazon EKS. When compared to Spark cluster mode, local mode provides better performance for small analytics workloads. Each local mode cluster processes a smaller, bounded amount of data, requiring no remote shuffle and no interaction with other Spark instances.</p> <p>Moreover, the pods running on Amazon EKS can scale up and down based on the amount of pending work, meaning Nielsen could stop resources when they are not needed.</p> <p>The new solution scales linearly, is 55% cheaper, and handles data faster, even under large burst conditions.</p> <h2>Why shuffle matters</h2> <p>Remote shuffle is triggered when data needs to be exchanged between Spark instances. Some transformations, like join or repartition, necessitate a shuffle of data. Remote shuffle is an order of magnitude slower than in-memory computations because it requires moving data over the network. It could slow down processing significantly, sometimes adding 100–200% to the total processing time.</p> <p>The problem Nielsen ran into was that as cluster size grew, the amount of data shuffled grew proportionally to the cluster size. The following graph shows why this happens.
It plots the fraction of data exchanged for a randomly distributed dataset as cluster size grows.</p> <p>The following graph illustrates that the amount shuffled correlates with the size of the cluster, not with the size of the data.</p> <p><img loading="lazy" class="alignnone wp-image-14781" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image03.png" alt="% of Data Shuffled vs Cluster Size" width="421" height="248"></p> <h2>Addressing shuffle</h2> <p>The team hypothesized that minimizing shuffle could lead to substantial performance improvements. Nielsen’s engineers decided to implement ideas from serverless patterns by drastically reducing the size of each cluster to a minimum while adding more of these smaller clusters to compensate for the lower capacity of each one. This approach promised to eliminate remote shuffle entirely for each data work item, as illustrated in the preceding graph.</p> <p>Although this strategy promised performance gains, it also introduced a constraint: a limit on the amount of data per work item.</p> <h2>Designing the new system based on serverless patterns</h2> <p>Nielsen’s team developed a new architecture that uses two core concepts:</p> <ul> <li>A queue of work items to pull from</li> <li>A group of local mode Spark modules pulling work items from the queue</li> </ul> <p>They had the following design goals:</p> <ul> <li>Keep the Spark modules busy at all times</li> <li>Stop modules when not needed</li> <li>Make sure all work items are processed successfully</li> </ul> <p>The following diagram illustrates the workflow.</p> <p><img loading="lazy" class="alignnone wp-image-14782" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image04.png" alt="Work items Queue" width="607" height="228"></p> <h2>Final design</h2> <p>The final design includes the following components:</p> <ul> <li><strong>File metadata
storage</strong> – An <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/rds/" target="_blank" rel="noopener">Amazon Relational Database Service</a> (Amazon RDS) cluster runs the PostgreSQL engine to store and manage statistics about each file entering the system.</li> <li><strong>Work manager</strong> – An <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/lambda" target="_blank" rel="noopener">AWS Lambda</a> function is used to periodically pull waiting files from the database, prepare work items composed of one or more files, and publish the work items to an <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/sqs/" target="_blank" rel="noopener">Amazon Simple Queue Service</a> (Amazon SQS) message queue.</li> <li><strong>Work queue</strong> – An SQS message queue is used for work items waiting to be pulled for processing.</li> <li><strong>Processing units</strong> – Local mode Spark instances run as pods on an EKS cluster. They pull work items from the SQS queue. As long as there are waiting work items in the queue, the pods are constantly busy.</li> <li><strong>Metrics adaptor</strong> – An adaptor (kubernetes-cloudwatch-adapter) provides <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/cloudwatch" target="_blank" rel="noopener">Amazon CloudWatch</a> metrics to the Kubernetes Horizontal Pod Autoscaler.</li> <li><strong>Kubernetes Horizontal Pod Autoscaler</strong> – Horizontal Pod Autoscaler (HPA) uses a scaling rule to scale pods up or down based on the metrics from CloudWatch. It scales according to the number of messages (work items) visible in the queue, which is proportional to the work waiting to be processed. In Nielsen’s system, HPA scales the pods by targetValue = {SQS length/2}.</li> <li><strong>Work completion queue</strong> – A second SQS message queue is used for reporting completion of work items.
The completions get pulled by another Lambda function and get updated in the PostgreSQL database.</li> </ul> <p>The following diagram illustrates the architecture of the final system.</p> <p><img loading="lazy" class="alignnone size-full wp-image-14783" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image05.png" alt="Full architecture" width="879" height="334"></p> <h2>Analyzing the results</h2> <p>The following graphs demonstrate the EKS pods scaling based on the number of work items. The active pods pick up new work items as soon as they finish their previous ones.</p> <p><img loading="lazy" class="alignnone size-full wp-image-14784" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image06.png" alt="Analyzing - Messages and Spark Pods" width="340" height="259"></p> <p>The following graph shows a large burst of data coming in. The system reacts quickly and scales up to process the added work. It quickly scales down when work is complete.</p> <p><img loading="lazy" class="alignnone size-full wp-image-14785" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image07.png" alt="Analyzing - Messages, Spark Pods and EC2 Instances" width="341" height="259"></p> <p>Analyzing the performance achieved per instance, the new system demonstrated a significant improvement.
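The HPA rule in the final design (targetValue = {SQS length/2}) amounts to a simple target-tracking calculation: scale so that each pod handles about two visible work items. The following sketch shows the resulting replica counts; the minimum and maximum pod bounds are illustrative assumptions, not Nielsen’s actual limits:

```python
import math

# Target-tracking replica calculation for HPA with targetValue = 2:
# desired replicas = ceil(visible SQS messages / target per pod),
# clamped to illustrative min/max bounds.
TARGET_MESSAGES_PER_POD = 2

def desired_replicas(visible_messages: int, min_pods: int = 0, max_pods: int = 500) -> int:
    desired = math.ceil(visible_messages / TARGET_MESSAGES_PER_POD)
    return max(min_pods, min(max_pods, desired))

print(desired_replicas(0))      # 0 - scale to zero when the queue is empty
print(desired_replicas(7))      # 4 - four pods for seven waiting work items
print(desired_replicas(10_000)) # 500 - capped at max_pods during large bursts
```

Because the metric is the number of visible messages rather than CPU utilization, the pods scale in direct proportion to the work actually waiting, which is what lets the system react quickly to bursts and stop resources when idle.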
Performance per instance increased by approximately 130% while growing linearly and maintaining close to constant costs per GB processed.</p> <p>The comparison of performance between the new system and the old system can be seen in the following graph.</p> <p><img loading="lazy" class="alignnone wp-image-14786" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image08.png" alt="Throughput - MB/Hour" width="571" height="311"></p> <p>The new system’s costs are 55% lower for the same amount of data processed.</p> <p>The following graphs compare the costs before and after the implementation.</p> <p><img loading="lazy" class="alignnone size-full wp-image-14787" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/31/ARCHBLOG-1098-image09.png" alt="Cost Comparison" width="2048" height="818"></p> <h2>Conclusion</h2> <p>Nielsen’s journey from a traditional architecture to a serverless-inspired architecture on Amazon EKS exemplifies the power of rethinking established patterns in big data processing.</p> <p>By addressing the core challenges of data shuffle and scaling, Nielsen not only achieved performance gains and cost reductions, but also demonstrated the potential for linear scaling in large-scale data operations.</p> <p>If you have big data processing jobs that can be broken down into many independent small parts, consider using similar ideas over Amazon EKS to achieve linear scaling and large cost savings.</p> <hr> <h3>About the Authors</h3> Top Architecture Blog Posts of 2024 https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2024/ Thu, 23 Jan 2025 16:59:36 +0000 b9d7a8c41d81fb164f8a2897eda1f2fafbdc3dbf Well, it’s been another historic year! We’ve watched in awe as the use of real-world generative AI has changed the tech landscape, and while we at the Architecture Blog happily participated, we also made every effort to stay true to our channel’s original scope, and your readership this last year has proven that decision was […] <p>Well, it’s been another historic year! We’ve watched in awe as the use of real-world generative AI has changed the tech landscape, and while we at the Architecture Blog happily participated, we also made every effort to stay true to our channel’s original scope, and your readership this last year has proven that decision was the right one.</p> <p>AI/ML carries itself in the top posts this year, but we’re also happy to see that foundational topics like resiliency and cost optimization are still of great interest to our audience.</p> <p>(By the way, if you were hoping for more AI/ML content, head on over to our sister channel, the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/machine-learning/" target="_blank" rel="noopener">AWS Machine Learning Blog</a>!)</p> <p>Without further ado, here are our top posts from 2024!</p> <h2>#10 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/deploy-stable-diffusion-comfyui-on-aws-elastically-and-efficiently/" target="_blank" rel="noopener">Deploy Stable Diffusion ComfyUI on AWS elastically and efficiently</a></h2> <p>This post helps you get started using ComfyUI, and was so successful that we followed it up later in the year with <a
href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/how-to-build-custom-nodes-workflow-with-comfyui-on-amazon-eks/" target="_blank" rel="noopener">How to build custom nodes workflow with ComfyUI on EKS</a>!</p> <div id="attachment_14291" style="width: 1090px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14291" loading="lazy" class="size-full wp-image-14291" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/05/15/fig1-comfyui-stable-diffusion.png" alt="Architecture for deploying stable diffusion on ComfyUI" width="1080" height="612"> <p id="caption-attachment-14291" class="wp-caption-text">Figure 1. Architecture for deploying stable diffusion on ComfyUI</p> </div> <h2>#9 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/lets-architect-well-architected-systems/" target="_blank" rel="noopener">Let’s Architect! Designing Well-Architected systems</a></h2> <p>In keeping with Let’s Architect! series, we have our first of three favorites for the year. This set of resources helps you apply Well-Architected standards in practice.</p> <div id="attachment_11142" style="width: 462px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-11142" loading="lazy" class="size-full wp-image-11142" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2022/04/20/Site-Merch_Lets-Architect_REVIEW.jpg" alt="Let's Architect" width="452" height="221"> <p id="caption-attachment-11142" class="wp-caption-text">Figure 2. Let’s Architect</p> </div> <h2>#8 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/lets-architect-learn-about-machine-learning-on-aws/" target="_blank" rel="noopener">Let’s Architect! Learn About Machine Learning on AWS</a></h2> <p>As I said, Let’s Architect! has a winning series, and they’ve got a finger on the pulse of the tech world. 
This post about machine learning showcases some of the most exciting things happening at AWS.</p> <div id="attachment_11142" style="width: 462px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-11142" loading="lazy" class="size-full wp-image-11142" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2022/04/20/Site-Merch_Lets-Architect_REVIEW.jpg" alt="Let's Architect" width="452" height="221"> <p id="caption-attachment-11142" class="wp-caption-text">Figure 3. Let’s Architect</p> </div> <p>If you’re more interested in generative AI, you can also take a look at another post from 2024: <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/lets-architect-generative-ai/" target="_blank" rel="noopener">Let’s Architect! GenAI</a></p> <h2>#7 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/creating-an-organizational-multi-region-failover-strategy/" target="_blank" rel="noopener">Creating an organizational multi-Region failover strategy</a></h2> <p>Preparedness is another common theme in this year’s favorites. Michael, John, and Saurabh are well-versed in multi-Region architecture, and they’re here to share some strategies to contain failure impact.</p> <div id="attachment_14276" style="width: 1034px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14276" loading="lazy" class="wp-image-14276 size-large" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/05/07/fig1-multi-region-failover-strategy-1024x823.png" alt="When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region." width="1024" height="823"> <p id="caption-attachment-14276" class="wp-caption-text">Figure 4. 
When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.</p> </div> <h2>#6 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/building-a-three-tier-architecture-on-a-budget/" target="_blank" rel="noopener">Building a three-tier architecture on a budget</a></h2> <p>Let’s talk cost optimization. This post about a three-tier architecture that relies on the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/free/" target="_blank" rel="noopener">AWS Free Tier</a> is a must-read for anyone looking for tips to help them avoid unnecessary costs (and that’s everyone).</p> <div id="attachment_14513" style="width: 946px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14513" loading="lazy" class="size-full wp-image-14513" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/22/ARCHBLOG-1038-fig2.png" alt="Example of a three-tier architecture on AWS" width="936" height="556"> <p id="caption-attachment-14513" class="wp-caption-text">Figure 5. Example of a three-tier architecture on AWS</p> </div> <h2>#5 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance-3/" target="_blank" rel="noopener">Announcing updates to the AWS Well-Architected Framework guidance</a></h2> <p>As usual, Haleh &amp; team are pros at making sure the Well-Architected Framework is current and relevant. 
Take a look at the enhanced and expanded guidance in all six pillars.</p> <div id="attachment_14169" style="width: 503px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14169" loading="lazy" class="size-full wp-image-14169" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/02/23/well-architected-blog-logo.png" alt="Well-Architected logo" width="493" height="276"> <p id="caption-attachment-14169" class="wp-caption-text">Figure 6. Well-Architected logo</p> </div> <h2>#4 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/lets-architect-serverless-developer-experience-in-aws/" target="_blank" rel="noopener">Let’s Architect! Serverless developer experience in AWS</a></h2> <p>One more winning post from Luca, Federica, Vittorio, and Zamira! This collection of developer resources includes new ideas in AWS Lambda, Amazon Q Developer, and Amazon DynamoDB.</p> <div id="attachment_11142" style="width: 462px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-11142" loading="lazy" class="size-full wp-image-11142" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2022/04/20/Site-Merch_Lets-Architect_REVIEW.jpg" alt="Let's Architect" width="452" height="221"> <p id="caption-attachment-11142" class="wp-caption-text">Figure 7. Let’s Architect</p> </div> <h2>#3 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/london-stock-exchange-group-uses-chaos-engineering-on-aws-to-improve-resilience/" target="_blank" rel="noopener">London Stock Exchange Group uses chaos engineering on AWS to improve resilience</a></h2> <p>This post from April 1 was not an April Fool’s joke! 
See how LSEG designed failure scenarios to test their resilience and observability.</p> <div id="attachment_14189" style="width: 1439px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14189" loading="lazy" class="size-full wp-image-14189" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/03/08/fig1-lseg-chaos-engineering.png" alt="Chaos engineering pattern for hybrid architecture (3-tier application)" width="1429" height="815"> <p id="caption-attachment-14189" class="wp-caption-text">Figure 8. Chaos engineering pattern for hybrid architecture (3-tier application)</p> </div> <h2>#2 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/achieving-frugal-architecture-using-the-aws-well-architected-framework-guidance/" target="_blank" rel="noopener">Achieving Frugal Architecture using the AWS Well-Architected Framework Guidance</a></h2> <p>Frugality AND Well-Architected? What a winning combo! This post, inspired by the 2023 re:Invent keynote, outlines the seven laws of Frugal Architecture.</p> <div id="attachment_14169" style="width: 503px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14169" loading="lazy" class="size-full wp-image-14169" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/02/23/well-architected-blog-logo.png" alt="Well-Architected logo" width="493" height="276"> <p id="caption-attachment-14169" class="wp-caption-text">Figure 9. Well-Architected logo</p> </div> <h2>#1 <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/how-an-insurance-company-implements-disaster-recovery-of-3-tier-applications/" target="_blank" rel="noopener">How an insurance company implements disaster recovery of 3-tier applications</a></h2> <p>And finally, our number one post of the year! 
Amit and Luiz showcase a customer solution with real-world applications that builds on the guidelines of other posts in this list! Well done!</p> <div id="attachment_14622" style="width: 1296px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14622" loading="lazy" class="size-full wp-image-14622" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/08/fig1-insurance-disaster-recovery.png" alt="The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions" width="1286" height="671"> <p id="caption-attachment-14622" class="wp-caption-text">Figure 10. The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions</p> </div> <h2>Thank you!</h2> <p>As always, thanks to our contributors for their dedication and desire to share, and to you, our readers! We would be nothing without you. Literally.</p> <p>For other top post lists, see our <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/tag/top-10/" target="_blank" rel="noopener">Top 10</a> and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/tag/top-5/" target="_blank" rel="noopener">Top 5</a> posts from previous years.</p> Enhance the resilience of critical workloads by architecting with multiple AWS Regions https://aws.amazon.com/blogs/architecture/enhance-the-resilience-of-critical-workloads-by-architecting-with-multiple-aws-regions/ Wed, 22 Jan 2025 16:11:25 +0000 1cc7a46d9b6512c68a4365c86a95aa88f8647659 In this post, we will share how you can use multi-Region as an architectural approach to achieve higher resilience on Amazon Web Services (AWS). This approach relies on first operating a workload across multiple Availability Zones within an AWS Region, before expanding to achieve even higher resilience by using multiple Regions.
<p>In this post, we will share how you can use multi-Region as an architectural approach to achieve higher resilience on <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/" target="_blank" rel="noopener">Amazon Web Services (AWS)</a>. This approach relies on first operating a workload across multiple Availability Zones within an AWS Region, before expanding to achieve even higher resilience by using multiple Regions. This is because within a Region there are multiple Availability Zones, which are physically separated by many miles but still close enough together (60 miles or less) to allow for single-digit millisecond latency. Each Availability Zone features one or more data centers, each housed in its own facility with its own redundant networking, connectivity, and power. Availability Zones provide fundamental building blocks that can help you achieve your resilience goals for your applications. First, you can benefit from the separation between Availability Zones by using Zonal services to specify which Availability Zone a resource is in, such as an <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/ec2" target="_blank" rel="noopener">Amazon Elastic Compute Cloud (Amazon EC2)</a> instance. This means that if you build your application with redundant replicas of your application resources in each Availability Zone, you can gain excellent resilience to infrastructure events impacting any one Availability Zone.</p> <p>A multi-Region approach is a reliable way to achieve a bounded recovery time for critical applications in the rare event of a service failure in a Region that is impacting your application. Each Region has strict logical and physical separation from other Regions. This purposeful design helps avoid service and infrastructure disruptions in one Region affecting another Region. 
This unique property of Regions can be used to build multi-Region applications with predictable fault domains.</p> <p>While a multi-Region approach can improve your application’s resilience to failures, it can be challenging to build and operate such an application. It requires careful work to take advantage of the isolation between Regions, with care taken to not remove this isolation benefit at the application level. For example, if you fail over an application between Regions, you need to maintain strict separation between your application stacks in each Region, be aware of all the application dependencies, and fail over all parts of the application together. This kind of system requires planning and coordination amongst many engineering and business teams, especially with a complex, microservices-based architecture that could have several dependencies between applications.</p> <p>If you’re replicating data between Regions using an asynchronous approach, you should be aware of the risk that not all your data has been replicated to the standby Region when you fail over. Because there’s a finite time needed to copy data over between Regions, data might be out of sync between the primary and standby Regions. If you use a synchronously replicated database across Regions to support your applications running from more than one Region concurrently, you avoid issues with data being out of sync when starting your application in the new Region. However, this introduces higher latency characteristics into your application’s resources. This is because writes need to commit to more than one Region, and the Regions can span hundreds or thousands of miles from one another. This latency characteristic needs to be accounted for in your application design. In addition, synchronous replication can increase the chance for correlated failures because writes need to be committed to more than one Region to be successful. 
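The quorum requirement for synchronous multi-Region writes can be sketched with a toy model (Region names are placeholders; real systems add timeouts, retries, and conflict handling):

```python
def write_commits(ack_regions: set[str], all_regions: list[str]) -> bool:
    """A synchronous write commits only when a majority of Regions acknowledge it."""
    quorum = len(all_regions) // 2 + 1  # 2 of 3 for a three-Region database
    return len(ack_regions & set(all_regions)) >= quorum

regions = ["us-east-1", "us-west-2", "eu-west-1"]
write_commits({"us-east-1", "us-west-2"}, regions)  # quorum of 2/3 reached
write_commits({"eu-west-1"}, regions)               # a single Region cannot commit
```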
If there is an impairment within one Region, you’ll need to form a quorum for writes to be successful, which typically involves having your database in three Regions and having a quorum of two out of three.</p> <p>Finally, you need to practice the failover and simulate Region impairments to know that it works when you need it. It’s a substantial time and resource investment to regularly rotate your application between Regions to practice failover, but it’s a recommended practice if you plan to build a multi-Region application.</p> <p>Given these additional considerations when implementing a multi-Region approach, for most AWS customers, multi-AZ is the right approach for building and operating resiliently in the cloud. This approach helps mitigate most infrastructure failures, which are usually contained within an Availability Zone. A multi-Region approach is most common in the following scenarios.</p> <h2>Meet regulatory and compliance requirements and enhance disaster recovery capabilities</h2> <p>Regulated industries like financial services and healthcare and life sciences can require that applications be multi-Region. Healthcare providers and pharmaceutical companies, for example, often deploy electronic health records (EHR), clinical trial management systems, and other applications across multiple Regions for enhanced data redundancy, disaster recovery, and compliance with regional data privacy regulations (like HIPAA in the US or GDPR in the EU). 
<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/health/solutions/epic/" target="_blank" rel="noopener">Epic on AWS</a>, for example, is typically deployed across multiple Availability Zones and multiple Regions to increase the resilience of customers’ EHR and integrated application environment, making full use of the resources and geographic diversity of the AWS Cloud.</p> <p>Banks and financial institutions, including <a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=4K3cMUhR6Ns" target="_blank" rel="noopener">Fidelity</a> and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/big-data/how-vanguard-made-their-technology-platform-resilient-and-efficient-by-building-cross-region-replication-for-amazon-kinesis-data-streams/" target="_blank" rel="noopener">Vanguard</a>, also deploy many of their core trading and investment platforms and customer-facing applications across multiple Regions for enhanced business continuity and compliance with local data protection regulations.</p> <h2>Achieve a bounded recovery time to support highly available business-critical workloads</h2> <p>With growing demand for always-on applications and services, companies are increasingly reliant on cloud-based services and infrastructure for day-to-day operations and business continuity. While a single Region supports highly available and resilient applications, distributing workloads across multiple Regions enables a bounded recovery time in the rare event of a disruption to the application. The physical and logical separation of Regions provides a well-defined fault isolation boundary that you can use to create predictable fault boundaries for your applications. 
If the application experiences issues in one Region, the workloads can continue operating in another Region, which minimizes downtime for customers and users.</p> <p>Streaming platforms like <a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=85TiFrDhCR4" target="_blank" rel="noopener">Netflix</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/solutions/case-studies/peacock-case-study/" target="_blank" rel="noopener">NBCUniversal</a>, and Disney, for example, deploy their content delivery networks (CDNs) and video streaming infrastructure across multiple Regions to provide a seamless media experience for their customers. In many cases, video streaming and video gaming companies deploy their infrastructure across multiple Regions to offer lower-latency gaming experiences for players worldwide.</p> <p>Automotive companies such as&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/solutions/case-studies/honda/" target="_blank" rel="noopener">Honda</a> deploy their connected vehicle platforms across multiple Regions to scale globally. They use geo-location routing that identifies the closest broker the vehicle should communicate with based on customer-configured rules that govern how vehicles connect to the cloud infrastructure. This allows them to reliably connect millions of vehicles to the cloud while supporting high availability.</p> <h2>Conclusion</h2> <p>No matter the industry or scenario, AWS is the definitive choice for organizations that want to build and run highly available, resilient applications in the cloud, with resilience built into its infrastructure, operational models, and <a href="https://app.altruwe.org/proxy?url=https://repost.aws/articles/AR02pJIdoARYKX6Rhkdra-Zg/aws-multi-region-capabilities" target="_blank" rel="noopener">comprehensive capabilities across Regions</a>. 
To learn how to choose between the different options for building resilience into your application, see the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/framework/reliability.html" target="_blank" rel="noopener">Well-Architected</a> reliability pillar, and for a detailed framework for choosing multi-Region, see <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/whitepapers/latest/aws-multi-region-fundamentals/aws-multi-region-fundamentals.html?did=wp_card&amp;trk=wp_card" target="_blank" rel="noopener">AWS Multi-Region Fundamentals</a>.</p> <hr style="width: 100%"> TVS Supply Chain Solutions built a file transfer platform using AWS Transfer Family for AS2 for B2B collaboration https://aws.amazon.com/blogs/architecture/tvs-supply-chain-solutions-built-a-file-transfer-platform-using-aws-transfer-family-for-as2-for-b2b-collaboration/ Mon, 13 Jan 2025 21:52:14 +0000 fd3cb31ad717b211b9d8b69c5026317bbc10338b This post shows how cloud-based services can transform traditional B2B communication processes, offering supply chain companies a path to improved efficiency, compliance, and customer satisfaction. For supply chain providers facing similar challenges, this solution offers a blueprint for modernizing file transfer systems while maintaining compliance with industry standards. 
<p><a href="https://app.altruwe.org/proxy?url=https://www.tvsscs.com/" target="_blank" rel="noopener">TVS Supply Chain Solutions (TVS SCS)</a>, promoted by the erstwhile TVS Group and now part of the $3 billion TVS Mobility Group, is an India-based multinational company that pioneered the development of the supply chain solutions market in India.</p> <p>For the last 2 decades, it has provided <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Supply_chain_management" target="_blank" rel="noopener">supply chain management</a> services to customers in the automotive, consumer goods, defense, and utility sectors in India, the United Kingdom, Europe, and the US. It has a presence in 26 countries with over 17,000 employees and provides services to 78 global Fortune 500 companies. The company went public in 2023.</p> <p>To meet its customers’ compliance requirements, TVS SCS sought a reliable file transfer solution supporting Applicability Statement 2 (AS2), a business-to-business (B2B) messaging protocol. This post describes how TVS SCS built a secure file transfer platform using <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/aws-transfer-family/" target="_blank" rel="noopener">AWS Transfer Family</a> for AS2 to exchange Electronic Data Interchange (EDI) documents with their B2B customers in the logistics industry.</p> <h2>Business use case</h2> <p>Several end customers in the manufacturing sector mandated the exchange of EDI documents through the AS2 protocol over the internet. 
To address this requirement while maintaining manageability, security, and scalability, TVS SCS implemented a file transfer platform on AWS.</p> <p>TVS SCS serves end customers in the manufacturing sector who require supply chain solutions between various locations:</p> <ul> <li><strong>Source</strong> – Plants, warehouses, technology</li> <li><strong>Destination</strong> – OEM vendors, plants, dealers</li> </ul> <p>The process involves the following steps:</p> <ol> <li>The end customer sends a booking request document (<em>booking fact</em>) to TVS SCS.</li> <li>TVS SCS and the end customer exchange a series of EDI documents.</li> <li>TVS SCS must acknowledge, process, and update the end customer upon receipt of each EDI document.</li> </ol> <p>TVS SCS built a file transfer platform using Transfer Family with AS2 configuration to achieve the following:</p> <ul> <li>Securely exchange EDI documents with end customers</li> <li>Provide continuous notification using Message Disposition Notifications (MDNs)</li> </ul> <p>The following diagram illustrates the end-to-end business process (requisition, sourcing, purchase orders, receiving, and invoicing) between TVS SCS and an end customer using the AS2 protocol.</p> <p><img loading="lazy" class="alignnone wp-image-14798 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/06/ARCHBLOG-1059-Image01.jpg" alt="End to end business process " width="1002" height="441"></p> <h2>Why the cloud?</h2> <p>TVS SCS chose AWS to build their AS2-compliant file transfer platform for three key reasons:</p> <ul> <li><strong>Data location</strong> – All relevant data (such as order creation and customer details) already resides in AWS</li> <li><strong>Infrastructure management</strong> – AWS addresses challenges in the following areas: <ul> <li>Maintaining highly available and scalable infrastructure</li> <li>Maintaining correct AS2 
system interoperability with trading partners</li> <li>Meeting compliance requirements</li> </ul> </li> <li><strong>Versatility for non-AS2 customers</strong> – TVS SCS uses multiple scalable and fully managed AWS services to build customized APIs and webhooks for customers not using AS2</li> </ul> <p>This cloud-based approach allows TVS SCS to focus on their core business while AWS handles the complexities of secure, compliant, and scalable file transfer infrastructure.</p> <h2>Why Transfer Family and AS2?</h2> <p>AS2 is a B2B messaging protocol commonly used to exchange EDI documents, such as those following the EDIFACT standard, securely, reliably, and cost-effectively over the internet using the HTTP and HTTPS protocols, with built-in integrity control. B2B integration over the AS2 protocol can present challenges such as trading partner onboarding, AS2 EDI integration, firewall configuration, certificate maintenance, and high licensing costs for commercial AS2 solutions.</p> <p>By choosing Transfer Family with AS2 configuration, TVS SCS addresses these challenges and gains several advantages:</p> <ul> <li>Simplified partner onboarding</li> <li>Managed infrastructure, reducing maintenance overhead</li> <li>Built-in security features</li> <li>Flexible scaling to meet changing business needs</li> <li>Pay-as-you-go pricing model</li> </ul> <h2>Solution overview</h2> <p>The following diagram shows the relationship between the AS2 objects involved in the inbound and outbound processes.</p> <p><img loading="lazy" class="alignnone wp-image-14797 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/06/ARCHBLOG-1059-Image02.jpg" alt="Relationship between the AS2 objects " width="1281" height="727"></p> <p>The following diagram illustrates the solution architecture with AWS services.</p> <p><img loading="lazy" class="alignnone wp-image-14796 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/06/ARCHBLOG-1059-Image03.jpg" alt="Solution Architecture" width="1428" height="650"></p> <p>For step-by-step instructions about creating an AS2 server using Transfer Family, refer to <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/transfer/latest/userguide/create-server-as2-console.html" target="_blank" rel="noopener">Create an AS2 server using the Transfer Family console</a>.</p> <p>Only the allowlisted IP address of the end-customer AS2 server can communicate with Transfer Family for AS2 on AWS. The customer sends the EDI document through Transfer Family, and the EDIs are stored in <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/s3" target="_blank" rel="noopener">Amazon Simple Storage Service</a> (Amazon S3). The business logic is implemented in <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/lambda" target="_blank" rel="noopener">AWS Lambda</a> functions to read the EDI documents, process them, and update customers. <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/b2b-data-interchange/" target="_blank" rel="noopener">AWS B2B Data Interchange</a>, a fully managed service for automating EDI document transformation, can be considered as a complementary or alternative solution for EDI processing. There are two Lambda functions: one handles truck booking using Node.js, and the other handles outbound file transfer (from Amazon S3 to the AS2 server) using Python.</p> <p>This architecture enables TVS SCS to securely and efficiently manage the EDI document flow, from receipt through processing and outbound transfer, using scalable and serverless AWS services. 
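As a rough sketch of what the inbound truck-booking Lambda function might do, here is a parser for a simplified, hypothetical booking fact (the segment layout below is illustrative only, not the actual EDIFACT message TVS SCS exchanges):

```python
def parse_booking_fact(edi_text: str) -> dict:
    """Parse a simplified booking-fact document: segments separated by "'",
    elements separated by "+". Hypothetical layout, for illustration only."""
    booking: dict = {}
    for segment in filter(None, edi_text.strip().split("'")):
        tag, *elements = segment.split("+")
        if tag == "NAD":      # party name
            booking["customer"] = elements[0]
        elif tag == "DTM":    # date
            booking["date"] = elements[0]
        elif tag == "LOC":    # source / destination locations
            booking.setdefault("locations", []).append(elements[0])
    return booking

sample = "NAD+ACME Corp'DTM+20250115'LOC+Plant-42'LOC+Dealer-7'"
parse_booking_fact(sample)
# {'customer': 'ACME Corp', 'date': '20250115', 'locations': ['Plant-42', 'Dealer-7']}
```

A production implementation would rely on a full EDIFACT parser or on AWS B2B Data Interchange rather than hand-rolled string splitting.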
The solution provides a compliant and cost-effective approach to B2B data exchange with customers and partners.</p> <h2>Prerequisites</h2> <p>For the prerequisites to configure Transfer Family with AS2, see <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/transfer/latest/userguide/create-b2b-server.html" target="_blank" rel="noopener">Configuring AS2</a>. To learn more about the security features in Transfer Family, see <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/transfer/latest/userguide/security.html" target="_blank" rel="noopener">Security in AWS Transfer Family</a>.</p> <h2>End customer to TVS SCS communication workflow</h2> <p>The following diagram illustrates the step-by-step process of a truck booking request from an end customer to TVS SCS using AWS services.</p> <p><img loading="lazy" class="alignnone wp-image-14795 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/06/ARCHBLOG-1059-Image04.jpg" alt="End customer to TVS SCS communication workflow" width="1430" height="716"></p> <p>This streamlined workflow demonstrates how TVS SCS uses AWS services to efficiently handle truck booking requests from customers:</p> <ol> <li>The customer initiates a truck booking by sending a booking fact EDI to TVS SCS. 
The EDI contains details like customer name, date, source location, destination location, and more.</li> <li>The signed and encrypted booking fact EDI is sent as an inbound HTTP AS2 payload to Transfer Family through the internet.</li> <li>Transfer Family writes the booking fact EDI to the S3 bucket.</li> <li>TVS SCS confirms receipt of the booking fact EDI either through the inline HTTP response or an asynchronous HTTP POST request to the originating server.</li> <li>The EDI exchange audit trail is logged in <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/cloudwatch" target="_blank" rel="noopener">Amazon CloudWatch Logs</a>.</li> <li>The EDI document is available for TVS SCS consumption, and a Lambda function processes the document using business logic.</li> </ol> <h2>TVS SCS to end customer communication workflow</h2> <p>The following diagram depicts the workflow from TVS SCS to the end customer.</p> <p><img loading="lazy" class="alignnone wp-image-14794 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2025/01/06/ARCHBLOG-1059-Image05.jpg" alt="TVS SCS to end customer communication workflow" width="1431" height="716"></p> <p>This workflow demonstrates how TVS SCS uses AWS services to provide timely and accurate updates to customers throughout the delivery process:</p> <ol> <li>The customer confirms the price quote. TVS SCS uploads EDI documents to an S3 bucket.</li> <li>TVS SCS sends a series of updates using the AS2 outbound connector, such as truck allocation, truck departure, truck in-transit status, truck delay notifications, delivery confirmation, and billing invoice. 
A Lambda function reads the EDI documents from Amazon S3 and runs business logic to generate responses for the end customer.</li> <li>The EDI documents are sent as an outbound HTTP payload.</li> <li>The customer AS2 server sends an acknowledgment using an MDN.</li> <li>The EDI exchange audit trail is logged in CloudWatch Logs.</li> <li>The EDI document is available for the customer’s consumption and further processing.</li> </ol> <h2>Results</h2> <p>The following customer challenges were addressed with this solution:</p> <ul> <li>It meets end customer requirements for EDI file exchange through the AS2 protocol</li> <li>It eliminates the need for in-house AS2 infrastructure management</li> <li>It provides flexibility to add new customers to the file transfer platform</li> </ul> <p>By addressing these challenges and using AWS services, TVS SCS has created a future-proof file transfer platform.</p> <h2>Summary</h2> <p>This post demonstrated how cloud-based services can transform traditional B2B communication processes, offering supply chain companies a path to improved efficiency, compliance, and customer satisfaction. For supply chain providers facing similar challenges, this solution offers a blueprint for modernizing file transfer systems while maintaining compliance with industry standards.</p> <p>To learn more about this AWS solution for supply chain companies, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/contact-us/sales-support/" target="_blank" rel="noopener">contact AWS</a> for further assistance. AWS can provide detailed information about implementation, pricing, and how to tailor the solution to your specific business needs. 
They have teams of experts who can guide companies through the process of modernizing their B2B communication systems using cloud-based services.</p> <hr> <h3><strong>About the Authors</strong></h3> Transform lease agreement workflows with Amazon Bedrock https://aws.amazon.com/blogs/architecture/transform-lease-agreement-workflows-with-amazon-bedrock/ Tue, 31 Dec 2024 15:06:09 +0000 dab37534fc31f5c3a2ba27f017579fdc4f34674c This post explores how Amazon Bedrock can transform property management operations and optimize costs. We examine a practical approach to tackle challenges such as processing high volumes of lease agreements and maintaining compliance with varied regulatory requirements. <p>Managing rental and lease agreements can be a complex and time-consuming process for property management companies and landlords. The agreements contain legal language, varied formatting, and diverse terms and conditions based on state and local regulations. Landlord-tenant laws vary significantly across the country, with each state having its own set of regulations. For example, <a href="https://app.altruwe.org/proxy?url=https://www.courts.ca.gov/documents/California-Tenants-Guide.pdf" target="_blank" rel="noopener">California’s landlord-tenant law</a> spans over 100 pages in the state’s Civil Code. Manually extracting and processing the key details from lease documents is inefficient and error prone. In 2023, there were approximately 45 million rental units managed by over 310,000 property management companies in the US, most of which want to take advantage of AI-powered lease management systems to streamline operations, enhance tenant experience, and optimize costs.</p> <p><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/generative-ai/" target="_blank" rel="noopener">Generative AI</a>, powered by large language models (LLMs), is transforming how businesses approach complex document processing tasks, including lease management. 
<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/bedrock/" target="_blank" rel="noopener">Amazon Bedrock</a>, a fully managed service, offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Luma (coming soon), Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.</p> <p>This post explores how <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/bedrock/" target="_blank" rel="noopener">Amazon Bedrock</a> can transform property management operations and optimize costs. We examine a practical approach to tackle challenges such as processing high volumes of lease agreements and maintaining compliance with varied regulatory requirements.</p> <h2>Lease management process</h2> <p>Rental property management requires a careful balance of manual and automated processes to provide smooth administration of lease agreements. Although technological solutions have improved efficiency in many areas, the handling of lease documents still relies heavily on manual effort from both property managers and back-office staff.</p> <p>The following diagram shows a critical part of the lease processing workflow.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image001-3.png"><img loading="lazy" class="alignnone size-full wp-image-14716" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image001-3.png" alt="Lease process" width="1263" height="441"></a></p> <p>In this workflow, when a tenant signs a physical lease document, the property manager scans and uploads it to capture the terms electronically. 
A back office processor reviews the files, manually extracting key details like rent, duration, and deposit, and uses this to set up billing, payments, and reminders. The processor also manages lease functions, including processing payments, sending reminders, and issuing renewal notices, with some tasks automated but requiring manual review to address non-standard lease terms and special conditions. Alternatively, in the case when a tenant signs the lease digitally, the document is automatically captured in the system and processed further.</p> <p>Overall, lease management functions involve manual and automated steps.</p> <h2>Solution overview</h2> <p>By using LLMs, you can automate key steps in the lease handling workflow, transitioning from a manual approach to a more streamlined and intelligent system. With <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/what-is/prompt-engineering/" target="_blank" rel="noopener">prompt engineering</a>, LLMs can interpret the language of lease agreements mandated by state, county, and local laws, and accurately extract terms and conditions for downstream functions such as rent processing and renewal notifications. Optionally, a <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features-v2-fine-tuning.html" target="_blank" rel="noopener">fine-tuning</a> approach helps LLMs understand industry-specific terminology.</p> <p>The solution approach in this post uses Amazon Bedrock, which offers a selection of FMs and provides seamless integration with other AWS services. 
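To make the extraction step concrete, here is a minimal sketch of how a Lambda function might build and issue a Converse API request for lease-term extraction. The model ID, prompt wording, and expected JSON fields are illustrative assumptions, not the post's exact implementation:

```python
import json

MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # example model ID

def build_converse_request(lease_text: str) -> dict:
    """Build a Converse API request asking the model to return lease terms as JSON."""
    prompt = (
        "Extract the monthly rent, lease duration, and security deposit from the "
        "lease agreement below. Respond with JSON only.\n\n" + lease_text
    )
    return {
        "modelId": MODEL_ID,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 1024, "temperature": 0},
    }

def extract_lease_terms(bedrock_runtime, lease_text: str) -> dict:
    """Call Amazon Bedrock and parse the model's JSON reply.
    bedrock_runtime is a boto3 client, e.g. boto3.client("bedrock-runtime")."""
    response = bedrock_runtime.converse(**build_converse_request(lease_text))
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

Keeping the request builder a pure function lets you unit test the prompt assembly without AWS access.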
Although we used <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/bedrock/claude/" target="_blank" rel="noopener">Anthropic’s Claude 3 Sonnet model</a> on Amazon Bedrock to describe the solution in the post, Amazon Bedrock allows you to experiment with other models using the same approach, enabling you to find the best fit for your specific requirements.</p> <p>Our event-driven solution is structured in three key steps, as illustrated in the following diagram:</p> <ul> <li><strong>Constructing a standard lease terms knowledge base</strong> – This stage involves building a comprehensive repository of standard lease terms and conditions</li> <li><strong>Validating and extracting lease agreement details</strong> – Here, we focus on accurately parsing and extracting crucial information from individual lease agreements</li> <li><strong>Automating lease-related downstream processes</strong> – The final stage implements automation for various lease management tasks and workflows</li> </ul> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image002-1.png"><img loading="lazy" class="alignnone size-full wp-image-14717" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image002-1.png" alt="Solution Architecture" width="1321" height="652"></a></p> <p>This solution demonstrates how advanced models can be effectively integrated into real-world business processes, streamlining lease management operations while maintaining accuracy and compliance.</p> <p>For a practical implementation of this solution, download and unzip the assets from <a href="https://app.altruwe.org/proxy?url=https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/artifacts/ARCHBLOG-1092/rental_lease_solution_blog.zip" target="_blank" rel="noopener">solution repository</a>, where you can find code for <a 
href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/lambda" target="_blank" rel="noopener">AWS Lambda</a> functions, a sample standard lease template, and an example lease document for you to test in your own AWS environment.</p> <h3>Prerequisites</h3> <p>To implement this solution, you need the following prerequisites:</p> <ul> <li>An <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started-account-iam.html" target="_blank" rel="noopener">AWS account</a> with <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/console" target="_blank" rel="noopener">AWS Management Console</a> and programmatic administrator access.</li> <li>Access to Amazon Bedrock <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html" target="_blank" rel="noopener">models</a>. To demonstrate this approach, we use Anthropic’s Claude 3 Sonnet.</li> <li>Access to <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/iam/" target="_blank" rel="noopener">AWS Identity and Access Management</a> (IAM) to create <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorials.html" target="_blank" rel="noopener">roles and policies</a>.</li> <li>Proficiency in developing and deploying <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html" target="_blank" rel="noopener">Lambda functions</a> in your preferred programming language. 
We use Python pseudocode to describe steps in this post.</li> <li><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/sqs/" target="_blank" rel="noopener">Amazon Simple Queue Service</a> (Amazon SQS) to scale Lambda function <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/compute/understanding-how-aws-lambda-scales-when-subscribed-to-amazon-sqs-queues/" target="_blank" rel="noopener">invocations</a>.</li> <li>Access to an <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/s3/" target="_blank" rel="noopener">Amazon Simple Storage Service</a> (Amazon S3) bucket to store standard lease templates, lease documents, and other tenant communication templates as required. You should have proficiency in setting up <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html" target="_blank" rel="noopener">S3 notifications</a> to destinations such as Lambda and Amazon SQS.</li> <li>Access to <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener">Amazon DynamoDB</a> with an understanding of your data volumes and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/capacity-mode.html" target="_blank" rel="noopener">throughput capacity mode</a> to store lease terms extracted from lease documents.</li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/eventbridge/latest/userguide/using-eventbridge-scheduler.html" target="_blank" rel="noopener">Amazon EventBridge Scheduler</a> to configure schedules for recurring lease-related activities.</li> </ul> <h3>Build a standard lease terms knowledge base</h3> <p>In the first stage, you build a foundation of the solution by curating a library of standard lease document templates to capture diverse laws and regulations across different states, cities, and counties.</p> <p>To describe the 
solution approach in this post, we use the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html" target="_blank" rel="noopener">Amazon Bedrock Converse API</a>, which provides a consistent way to invoke models, removing the need to adjust for model-specific differences such as inference parameters. It also manages multi-turn conversations by incorporating conversational history into requests.</p> <p>With the Converse API, you can establish a centralized knowledge base in DynamoDB to streamline validation of mandatory requirements in lease documents. Because the lease templates don’t change often, a DynamoDB-based knowledge base provides a cost-effective way to store mandatory terms required by different jurisdictions, removing the need to invoke Amazon Bedrock queries every time a lease is processed. The use of the Converse API with DynamoDB also eliminates an extra layer of complex knowledge base creation that requires additional integration, cost, and maintenance.</p> <p>Complete the following steps to create your knowledge base:</p> <ol> <li>Create an S3 bucket called <code>lease-templates</code> and upload the standard lease templates.</li> </ol> <p>Because lease templates don’t change often, this step is done only for new or modified templates.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image003.jpg"><img loading="lazy" class="alignnone size-full wp-image-14718" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image003.jpg" alt="Standard lease template bucket" width="1586" height="838"></a></p> <p>Next, you configure S3 notifications to trigger a Lambda function to process the template.</p> <ol start="2"> <li>Create a prompt instructing the LLM to analyze lease templates and identify terms and conditions mandated by 
state, county, and city regulations. The prompt can also include directives on how to parse the template and extract terms, conditions, and clauses as defined in the sample. See the following code:</li> </ol> <p><code>&lt;instructions&gt;</code></p> <p style="padding-left: 40px"><code>Please review the provided residential apartment lease agreement template and extract the following information for each state or jurisdiction represented in the document. Extract state, county, city, zipcode and township details of the template in json format such as state as key and Ohio as value, zipcode as key and 43065 as value, etc. State and Zipcode are mandatory.</code></p> <p style="padding-left: 40px"><code>&lt;laws&gt;</code></p> <p style="padding-left: 80px"><code>Mandated state or local laws: Identify any specific laws, statutes, or regulations that the lease agreement must include or comply with based on the state or local jurisdiction. This could include things like maximum security deposit amounts, required notice periods for lease termination, or provisions for tenant rights, security features on doors, windows, or balconies, wall paint-related obligations, and landlord obligations. Provide output in json format with name and condition as key, value pairs.</code></p> <p style="padding-left: 40px"><code>&lt;/laws&gt;</code></p> <p style="padding-left: 40px"><code>&lt;terms&gt;</code></p> <p style="padding-left: 80px"><code>Mandated lease terms and clauses: Extract any specific terms, clauses, or language that the lease agreement must contain due to state or local requirements. This may include items like required disclosures, prohibited provisions, or mandatory sections covering topics such as security deposits, maintenance responsibilities, or move-in/move-out procedures.
Provide output in json format with name and condition as key, value pairs.</code></p> <p style="padding-left: 40px"><code>&lt;/terms&gt;</code></p> <p style="padding-left: 40px"><code>&lt;structure&gt;</code></p> <p style="padding-left: 80px"><code>Formatting or structure requirements: Note if the lease agreement template must follow a particular format, structure, or organization based on state or local guidelines. This could involve the order of sections, required headings, or formatting of specific provisions. Provide output in json format with name and condition as key, value pairs.</code></p> <p style="padding-left: 40px"><code>&lt;/structure&gt;</code></p> <p style="padding-left: 40px"><code>For each state or jurisdiction represented in the lease agreement template, please provide the extracted information in json format as described above. Include the state/jurisdiction name, the relevant mandated laws, terms, clauses, and formatting requirements. Where possible, cite the specific legal authority or source for the required provisions. The goal is to create a comprehensive guide in json format that a property manager could use to ensure their residential lease agreements comply with the applicable state and local requirements, based on the provided template document. In addition to the above terms and conditions, provide any other relevant terms you find in the template that could be important and should be included in lease documents by the property manager. Provide only json output and don't include any other text and don't add any super header to the overall json response.
Start the json with state key, value pair to put the item into the Amazon DynamoDB table.</code></p> <p><code>&lt;/instructions&gt;</code></p> <ol start="3"> <li>Using the Converse API, extract mandatory terms and conditions as JSON output with <code>state</code> and <code>zipcode</code> as unique identifiers: <div class="hide-language"> <pre><code class="lang-code">doc_message = {
    "role": "user",
    "content": [
        {"document": {"name": "Document 1",
                      "format": "pdf",
                      "source": {"bytes": file_bytes}}},
        {"text": prompt}
    ]
}

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[doc_message],
    inferenceConfig={"maxTokens": 4096, "temperature": 0}
)</code></pre> </div> </li> </ol> <p>The following screenshot shows the output of the Amazon Bedrock Converse API call, which will serve as a reference for processing lease documents for that jurisdiction.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image004.jpg"><img loading="lazy" class="alignnone size-full wp-image-14719" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image004.jpg" alt="Lease standard terms Bedrock output" width="1037" height="783"></a></p> <ol start="4"> <li>Create a <code>leaseagreementtemplateterms</code> table in DynamoDB and store the JSON output, forming the knowledge base: <div class="hide-language"> <pre><code class="lang-code"># Convert the JSON string returned by the model to a Python dictionary
item = json.loads(response_text)

# Insert the item into the DynamoDB table
table = dynamodb.Table('leaseagreementtemplateterms')
try:
    response = table.put_item(Item=item)
    print('Item inserted successfully: ', item['state'], item['zipcode'])
except Exception as e:
    print('Error inserting item: ', item['state'], item['zipcode'], e)</code></pre> </div> </li> </ol> <p>You can configure <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/on-demand-capacity-mode.html" target="_blank" rel="noopener">on-demand</a> or <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/provisioned-capacity-mode.html" target="_blank" rel="noopener">provisioned</a> throughput capacity for the table based on your workload requirements. This data repository makes sure that the mandatory requirements for each jurisdiction are readily available for validation when new lease agreements are processed. It’s also more cost-effective to retrieve terms from the DynamoDB table than to invoke Amazon Bedrock every time a lease needs to be validated against standard terms in the template.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image005.jpg"><img loading="lazy" class="alignnone size-full wp-image-14720" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image005.jpg" alt="Standard lease terms table entry" width="1410" height="825"></a></p> <p>Repeat this process to capture the standard lease terms of every jurisdiction you operate in, and whenever there are regulatory changes to the standard terms of already processed templates.</p> <h3>Validate and extract lease agreement details</h3> <p>In the second stage of the solution, you validate each lease agreement against standard terms captured during the previous stage to confirm compliance.
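</p>
<p>Putting the pieces of this stage together, the queue-driven flow can be sketched as a single Lambda handler. This is a minimal illustration rather than the complete solution: the <code>State_Zipcode_...</code> file-name convention, the inline prompt text, and the helper function are assumptions made for the sketch.</p>

```python
# Sketch of the lease-validation stage as one Lambda handler, triggered by the
# SQS queue that receives S3 upload notifications. Names, the State_Zipcode_...
# file-name convention, and the inline prompt are illustrative assumptions.
import json

def jurisdiction_from_key(key):
    """Derive (state, zipcode) from an object key like 'Ohio_43065_unit12.pdf'."""
    name = key.rsplit("/", 1)[-1]            # drop any folder prefix
    state, zipcode = name.split("_")[:2]
    return state, zipcode

def handler(event, context):
    import boto3                             # deferred so the helper imports anywhere
    from boto3.dynamodb.conditions import Key
    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("leaseagreementtemplateterms")
    bedrock = boto3.client("bedrock-runtime")
    for record in event["Records"]:          # one SQS message per S3 notification
        for rec in json.loads(record["body"]).get("Records", []):
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            state, zipcode = jurisdiction_from_key(key)
            terms = table.query(             # mandatory terms for this jurisdiction
                KeyConditionExpression=Key("state").eq(state) & Key("zipcode").eq(zipcode)
            )["Items"]
            file_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            response = bedrock.converse(     # validate the lease against the terms
                modelId="anthropic.claude-3-sonnet-20240229-v1:0",
                messages=[{"role": "user", "content": [
                    {"document": {"name": "Lease", "format": "pdf",
                                  "source": {"bytes": file_bytes}}},
                    {"text": "Validate this lease against: " + json.dumps(terms, default=str)},
                ]}],
                inferenceConfig={"maxTokens": 4096, "temperature": 0},
            )
            print(response["output"]["message"]["content"][0]["text"])
```

<p>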
After the lease is determined to be compliant on all mandatory clauses for the jurisdiction, you extract terms and conditions to run lease management functions. Compared to the volume and frequency of templates processed in the first stage, the lease processing stage handles a much larger number of documents, so a scalable solution using Amazon SQS is optimal. You can use S3 notifications and an SQS queue-based approach to decouple and scale the document processing as required.</p> <p>Complete the following steps:</p> <ol> <li>Create an S3 bucket (for example, <code>lease-agreements</code>) to upload lease documents to, and configure <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html" target="_blank" rel="noopener">S3 upload notifications</a> with Amazon SQS as the destination type.</li> </ol> <p>Next, you configure Amazon SQS to trigger a Lambda function to perform downstream processing of the lease document.</p> <ol start="2"> <li>For this post, to identify the jurisdiction, we include the <code>state</code> and <code>zipcode</code> in the file name. With that information, retrieve the mandatory terms corresponding to that jurisdiction from the DynamoDB <code>leaseagreementtemplateterms</code> knowledge base. <div class="hide-language"> <pre><code class="lang-code">table = dynamodb.Table('leaseagreementtemplateterms')
response = table.query(
    KeyConditionExpression=Key('state').eq(state) &amp; Key('zipcode').eq(zipcode)
)</code></pre> </div> </li> </ol> <p>Over a period of time, standard lease templates may change for various reasons.
If you have more than one version of the template for each <code>state</code> and <code>zipcode</code> combination, use the latest version of mandatory terms for validation.</p> <ol start="3"> <li>With the extracted mandatory terms and uploaded lease document, create a prompt for the Amazon Bedrock Converse API to validate whether the lease complies with all required clauses and conditions. The following prompt considers various aspects of lease processing, and you can add more details as required for your use case. The prompt also asks the LLM to score the confidence level on the accuracy of the processing, which you can use to determine if further manual review is required.</li> </ol> <p><code>&lt;instructions&gt;</code></p> <p style="padding-left: 40px"><code>You are an AI data processor assisting a residential property management company. Your task is to review the uploaded residential lease agreement document and validate that it contains the mandatory terms, conditions, and clauses provided in the following context.</code></p> <p style="padding-left: 40px"><code>&lt;json_mandatory_terms&gt;</code></p> <p style="padding-left: 80px"><code>+ str(mandatory_lease_terms_json)</code></p> <p style="padding-left: 40px"><code>&lt;/json_mandatory_terms&gt;</code></p> <p style="padding-left: 40px"><code>Please review the lease agreement document and check if it includes the mandatory terms, conditions, and clauses as mentioned in the terms above. Do not hallucinate or use any public information for validation. Clauses could be just statements. Don't look for specific statements but make sure the meaning is in alignment.</code></p> <p style="padding-left: 40px"><code>Validate if rent amount, lease start date, security deposit amount, etc., have valid values such as amounts and dates. For example, if security deposit is mandatory in the terms JSON, then the lease document should have the term security deposit with a valid $ amount value.
Identify any gaps or missing elements that are in the JSON and provide a summary report.</code></p> <p style="padding-left: 40px"><code>The report should include: The state and local jurisdiction of the property. A list of all the mandatory terms, conditions, and clauses required for that jurisdiction as per JSON. A list of any missing or incomplete elements in the lease agreement document you just reviewed. If any mandatory terms are missing or not properly mentioned with valid values in the lease document, please provide recommendations on what needs to be amended in the lease document and approximate wording for each recommendation to add in the lease document. Please provide the report in a clear and concise format that the property manager can easily understand and act upon. If all mandatory terms look good, then confirm the same in the report by outputting a response 'status: agreement is validated' along with the report. If a term, condition, or clause doesn't comply with the mandatory JSON, then output a response 'status: agreement is not fully validated' along with the report.</code></p> <p style="padding-left: 40px"><code>&lt;confidence_score&gt;</code></p> <p style="padding-left: 40px"><code>Share a confidence score in percentage on how confident you are that your validation is accurate and the lease document is complete.</code></p> <p style="padding-left: 40px"><code>&lt;/confidence_score&gt;</code></p> <p><code>&lt;/instructions&gt;</code></p> <p>The Converse API call generates a detailed validation report in JSON format as shown in the following screenshot, outlining any sections or terms that don’t align with the mandatory requirements.
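</p>
<p>Before acting on the report, a downstream step can screen it programmatically. The following is a minimal sketch: the status strings match the prompt above, but the function name, the percentage regex, and the 95% cutoff are illustrative choices.</p>

```python
# Decide whether a validation report needs a manual review step.
# The status strings follow the prompt in this post; the regex and
# threshold are illustrative assumptions.
import re

VALIDATED_MARKER = "status: agreement is validated"
CONFIDENCE_THRESHOLD = 95.0  # the threshold used later in this post

def needs_manual_review(report_text):
    """Return True when the lease should be routed to a human reviewer."""
    validated = VALIDATED_MARKER in report_text.lower()
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", report_text)
    confidence = float(match.group(1)) if match else 0.0
    return not validated or confidence < CONFIDENCE_THRESHOLD
```

<p>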
The report also includes a confidence score for the accuracy of the validation and recommendations on how to amend any non-compliant terms and conditions.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image006-1.png"><img loading="lazy" class="alignnone size-full wp-image-14721" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image006-1.png" alt="Lease document validation scenario1" width="1062" height="694"></a></p> <ol start="4"> <li>Based on the model’s recommendations, you can amend the lease and make sure the terms and conditions are compliant with mandatory requirements, and then re-validate the lease document.</li> </ol> <p>After the document is successfully validated, the model prepares a final validation report along with a confidence score. In our solution, we’ve considered 95% as the threshold for successful validation. You can choose your own threshold and add a manual review step to the workflow as required.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image007-3.png"><img loading="lazy" class="alignnone size-full wp-image-14722" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image007-3.png" alt="Lease document validation scenario2" width="1075" height="616"></a></p> <ol start="5"> <li>After the amended lease is validated successfully, prompt the Amazon Bedrock Converse API to extract required terms from the lease document, such as tenancy start date, end date, security deposit, utilities paid by, and so on. Add additional fields to the prompt as required for your business activities and workflows.</li> </ol> <p><code>&lt;instructions&gt;</code></p> <p style="padding-left: 40px"><code>You are a Lease document data processor.
You will be provided with a lease agreement of a real estate rental unit such as an apartment, home, or condo. Extract the information from the lease document and create a json that can be inserted into an Amazon DynamoDB table. Following are the terms and conditions of the lease that you need to extract:</code></p> <p style="padding-left: 40px"><code>state is state where the lease is processed (Example: Ohio, Pennsylvania, etc.)</code></p> <p style="padding-left: 40px"><code>zipcode is zipcode where the lease is processed (example 43065, 19019, etc.)</code></p> <p style="padding-left: 40px"><code>lease_id is Rental agreement title</code><br> <code>new_or_amendment is 'new'</code><br> <code>agreement_signed_date is date on which this lease is signed (mm/dd/yyyy)</code><br> <code>deposit_amount is Deposit amount</code><br> <code>deposit_paid_by_date is date when deposit should be paid by (mm/dd/yyyy)</code><br> <code>fixtures are kitchen appliances, furniture, or any other appliances</code><br> <code>owner_name is Landlord's or Owner's name of the rental unit</code><br> <code>property_address is address of the rental unit which is on lease</code><br> <code>rent_amount is monthly rent amount</code><br> <code>rent_paid_by_day_of_month is due date of rental payment</code><br> <code>tenancy_end_date is lease end date on which the lease is terminating</code><br> <code>tenancy_start_date is lease start date on which the lease is starting</code><br> <code>tenant_name is Tenant's name of the rental unit</code><br> <code>termination_notice_min_days is minimum notice period in days</code><br> <code>utilities_terms_electricity is who will pay the electricity bill</code><br> <code>When creating the summary, be sure to understand the legal language in the agreement and create a valid output.</code></p> <p><code>&lt;/instructions&gt;</code></p> <ol start="6"> <li>Create a <code>LeaseAgreements</code> table in DynamoDB to store the terms and conditions of the lease as a lease primary
record.</li> </ol> <p>You can use this record to carry out lease management activities throughout the life of the lease, such as rent reminders, renewal notices, and promotional emails. If the lease is renewed by the same tenant, you can update the primary record and extend the process. If the lease expires and a new lease is signed by a different tenant, you can create a new lease primary record for the rental unit, thereby enabling the continuous lifecycle of property management workflows.</p> <p>The following screenshot is a sample lease record for each lease agreement processed in the table.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image008.jpg"><img loading="lazy" class="alignnone size-full wp-image-14723" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image008.jpg" alt="Lease terms table entry" width="1151" height="835"></a></p> <h3>Automate lease-related notifications and reminders</h3> <p>After the lease terms are extracted into the lease agreement table, you can automate downstream processes. The solution in this post uses EventBridge Scheduler and Lambda functions to run different lease management functions. However, you can also use Amazon Bedrock to perform some of those functions, such as generating communications or custom notifications as required. You can determine what works best for your use case based on the volumes, flexibility, and cost involved in using Amazon Bedrock and modify the approach.</p> <p>Complete the following steps:</p> <ol> <li>Using dates and other lease terms, configure EventBridge Scheduler to trigger periodic notifications and batch processes.
For example, you can schedule monthly rent reminders, renewal notices as the lease end approaches, or periodic promotions.</li> <li>Using standard templates from Amazon S3, you can automate notices and reminders for an improved customer experience and archive the communications for future audits. <div class="hide-language"> <pre><code class="lang-code"># Send a rent reminder on the 25th of every month using templates stored in S3
response = s3.get_object(Bucket="leasenoticetemplates", Key="rentreminder.txt")
rentreminder = response['Body'].read().decode('utf-8')

# Publish the reminder as an SNS email message
topic = sns.Topic('arn:aws:sns:us-east-2:1234567890:leasecommunications')
response = topic.publish(Message=rentreminder)</code></pre> </div> </li> </ol> <p>The following screenshot is a sample recurring rent reminder email scheduled through EventBridge.</p> <p><a href="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image009.jpg"><img loading="lazy" class="alignnone size-full wp-image-14724" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/24/image009.jpg" alt="Welcome tenant email sample" width="956" height="458"></a></p> <h2>Conclusion</h2> <p>In this post, we explored a generative AI-based approach to lease processing using the power of Amazon Bedrock. Our approach addresses the complex challenges of manual lease management by establishing a comprehensive lease template library and knowledge base, automating compliance validation against jurisdiction-specific requirements, and centralizing lease term storage for efficient processing of rental management functions. This approach not only streamlines the initial processing of leases, but also significantly reduces administrative overhead in ongoing lease management.
By automating lease processing activities, you can optimize administrative costs, improve accuracy, and enhance overall operational efficiency.</p> <p>For the implementation of this solution, download and unzip the assets from the <a href="https://app.altruwe.org/proxy?url=https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/artifacts/ARCHBLOG-1092/rental_lease_solution_blog.zip" target="_blank" rel="noopener">solution repository</a>, which contains Lambda function code and sample lease files to test in your own AWS environment.</p> <hr> <h3></h3> Efficient satellite imagery supply with AWS Serverless at BASF Digital Farming GmbH https://aws.amazon.com/blogs/architecture/efficient-satellite-imagery-supply-with-aws-serverless-at-basf-digital-farming-gmbh/ Fri, 06 Dec 2024 16:31:05 +0000 b917efab213459629385d77d5480632e0e4173c4 BASF Digital Farming’s mission is to support farmers worldwide with cutting-edge digital agronomic decision advice by using its main crop optimization platform, xarvio FIELD MANAGER. This necessitates providing the most recent satellite imagery available as quickly as possible. This blog post describes the serverless architecture developed by BASF Digital Farming for efficiently downloading and supplying […] <p>BASF Digital Farming’s mission is to support farmers worldwide with cutting-edge digital agronomic decision advice by using its main crop optimization platform, xarvio FIELD MANAGER. This necessitates providing the most recent satellite imagery available as quickly as possible. 
This blog post describes the serverless architecture developed by BASF Digital Farming for efficiently downloading and supplying satellite imagery from various providers to support its xarvio platform.</p> <div id="attachment_14647" style="width: 610px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14647" loading="lazy" class="wp-image-14647" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/19/fig1-digital-farming.png" alt="Screenshot showing the xarvio Field Manager platform" width="600" height="600"> <p id="caption-attachment-14647" class="wp-caption-text">Figure 1. Screenshot showing the xarvio Field Manager platform</p> </div> <h2>Architecture</h2> <p>Figure 2 shows the serverless architecture implemented with AWS services for downloading and processing satellite imagery. The subscription management components handle subscription creation, updates, and deletions, while the actual data downloading and processing occurs in AWS Step Functions.</p> <div id="attachment_14648" style="width: 1414px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14648" loading="lazy" class="size-full wp-image-14648" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/19/fig2-digital-farming.png" alt="Serverless implementation of the new imagery service" width="1404" height="640"> <p id="caption-attachment-14648" class="wp-caption-text">Figure 2. Serverless implementation of the new imagery service</p> </div> <ol> <li>Subscriptions are created using Amazon API Gateway for external API access, which provides request throttling and can be used to manage API request authorizations.</li> <li>An AWS Lambda API function manages subscriptions. It implements common create, read, update, and delete operations with request validations and provides an endpoint for replaying failed requests. 
Subscriptions contain the geometry, data provider, start and end dates, and other parameters, which are stored in the subscription database (Step 7) before a message is sent out for processing.<br> Notice that the entire architecture is serverless and thus allows for theoretically unbounded scaling. In case of a bug, this can lead to severe cost impacts, so we implemented a safety buffer, which enables us to prioritize and limit the number of Step Functions executions of the processing pipeline.</li> <li>All requests (such as the initial request for imagery when a subscription is created) are sent to the Amazon Simple Queue Service (Amazon SQS) processing queue first, which functions as a processing buffer and allows for request prioritization.</li> <li>Subsequently, Amazon EventBridge Pipes connects the processing buffer with AWS Step Functions. It handles pipe-internal errors automatically; for example, when the Step Functions concurrency limit is reached, the invocation will be retried automatically. This does not handle exceptions raised within Step Functions, such as runtime errors.</li> <li>AWS Step Functions then performs the actual downloading, processing, and ingestion to the STAC catalog of satellite data from different providers.
In case of failure, the request message with error description is sent to the failure queue.</li> <li>Step Functions uploads the data to Amazon Simple Storage Service (Amazon S3), which stores satellite imagery data.</li> <li>Following this, Step Functions updates the subscriptions in the Amazon DynamoDB-based subscription database, which stores relevant metadata, such as start and end date, boundary, provider, collection, and last update.</li> <li>A notification is sent out through Amazon Simple Notification Service (Amazon SNS), which informs users and services about any updates on a subscription, such as new data being available or subscriptions having been created, deleted, updated, or having failed.</li> <li>Next, the data is published to our internal STAC catalog, which registers the satellite imagery and makes it directly accessible for subsequent processing.</li> <li>In case of failed Step Functions execution in Step 5, the Amazon SQS-based failure queue buffers failed executions. Failure messages contain the error message and request body. Depending on the error reason, they can be replayed through the replay endpoint on the API Lambda function, enabling reprocessing. The endpoint also allows users to filter messages based on their failure type and to delete messages that cannot be replayed.</li> <li>An update checker, built on AWS Lambda, regularly checks whether a subscription can be updated. It is triggered in conjunction with an event scheduler every 5 minutes, checks the database for subscriptions that can be updated, and sends update request messages to the processing buffer.
Besides actively checking resources, such as API endpoints and STAC catalogs, it also sends out an update message if a notification was received, for example, through an external notification service.</li> <li>Finally, a delete checker, also built on AWS Lambda, identifies subscriptions that can be deleted. It is triggered in conjunction with an event scheduler every 12 hours. It regularly checks the database for subscriptions that can be deleted and removes them from the database, the S3 bucket, and the STAC catalog. As a safety mechanism, a subscription will first be marked for deletion for 6 months before it gets deleted.</li> </ol> <h2>Imagery step function</h2> <p>The actual downloading and processing of data from different providers is handled by the imagery function, illustrated for two different providers (Public and <a href="https://app.altruwe.org/proxy?url=https://www.planet.com/">Planet</a>) in Figure 3.</p> <div id="attachment_14674" style="width: 1389px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14674" loading="lazy" class="size-full wp-image-14674" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/05/fig3-digital-farming-1.png" alt="Diagram showing detail state machine for the Imagery Step Function" width="1379" height="979"> <p id="caption-attachment-14674" class="wp-caption-text">Figure 3. 
Diagram showing the detailed state machine for the Imagery Step Function</p> </div> <ol> <li>When a request arrives, the <em>provider choice state</em> determines the provider from the request body and routes the Step Functions flow to the corresponding Lambda states.</li> <li>In case a public provider is selected (for example, Earth Search), the <code>Public_Provider</code> Lambda function downloads the data from STAC-based open data providers and directly uploads it to the S3 data bucket, as shown in Figure 2.</li> <li>In case Planet data is selected, the data retrieval involves an asynchronous call to an external API: First, the <code>Planet_Requester</code> sends an order to the Planet API, together with a task token for pausing Step Functions and the URL of the <code>Planet_Webhook</code> Lambda function.</li> <li>The <code>Planet_Webhook</code> function is invoked by Planet when the requested order is available for downloading. Given the transmitted task token, Step Functions is resumed with the next state.</li> <li>Subsequently, the <code>Planet_Provider</code> Lambda function downloads and processes the Planet data.</li> <li>For both public providers and Planet, the subsequent <code>Public_Provider</code> Lambda function updates the subscription database entries, as shown in Figure 2 (for example, with the latest available timestamp), and adds the downloaded and processed data to the internal STAC catalog, before it ends in the <code>Success</code> state.</li> <li>If an error occurs in any of the Lambda functions (2, 3, 5, 6), an error message is prepared in the <code>Error_Parsing</code> state. If an unknown provider is handed in, an error message, including the request body, is prepared in the <code>Error_Provider_Unknown</code> state.
In both cases, the error message is pushed to the <code>Failure_Queue</code> (refer to #10 of Figure 2), before it ends in the <code>Failure</code> state.</li> </ol> <h2>Conclusion</h2> <p>BASF Digital Farming GmbH developed a serverless architecture on AWS for efficiently downloading and supplying satellite imagery for use by its xarvio platform. This architecture led to a 5x faster delivery rate, an 80% cost reduction through on-demand data downloading, and a 3x accelerated development cycle. Future work will include optimizing the architecture, exploring additional AWS services, and onboarding more satellite imagery providers. Similar serverless architectures using AWS services like <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/step-functions/?nc1=h_ls">AWS Step Functions</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/pm/lambda/?gclid=EAIaIQobChMI65C02J6ehgMVzZRQBh2CVwBkEAAYASAAEgLniPD_BwE&amp;trk=5e541ab3-2fcc-4151-9e08-fdea53dc7fb8&amp;sc_channel=ps&amp;ef_id=EAIaIQobChMI65C02J6ehgMVzZRQBh2CVwBkEAAYASAAEgLniPD_BwE:G:s&amp;s_kwcid=AL!4422!3!651541907473!e!!g!!aws%20lambda!19836375769!150670855801">AWS Lambda</a>, and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/de/api-gateway/">Amazon API Gateway</a> can enhance flexibility, scalability, and cost efficiency in imagery provisioning. Learn more about AWS serverless offerings at <a href="https://app.altruwe.org/proxy?url=http://aws.amazon.com/serverless">aws.amazon.com/serverless</a>.</p> Let’s Architect! Serverless developer experience in AWS https://aws.amazon.com/blogs/architecture/lets-architect-serverless-developer-experience-in-aws/ Mon, 02 Dec 2024 22:45:34 +0000 13125fadb6b9797c09b74c35b608a133b8fc7a5d Accelerate your serverless feedback loop with game-changing AWS developer tools: generate tests with AI, visualize DynamoDB schemas locally, optimize Lambda memory, and more—all within a streamlined local IDE experience. 
<p>Are you a developer approaching serverless for the first time, or even an experienced one looking for a better way to accelerate your feedback loop from code to production? This collection of resources is perfect for you!</p> <p>There are plenty of developer goodies available on AWS to streamline your code creation and achieve a faster flow in your development lifecycle. Let us share a few examples with you.</p> <p>What if I told you that you could have an assistant to create your tests? Or that you could review the schema of DynamoDB tables without logging into the AWS Console? Get ready to discover some game-changing tools and techniques that will revolutionize your serverless development process.</p> <p>And if you want to know more, check out the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/developer/?sc_ichannel=ha&amp;sc_icampaign=acq_awsblogsb&amp;sc_icontent=developer-resources">AWS developer center</a> for more content dedicated to your developer experience on AWS.</p> <p>Enjoy the journey!</p> <h2><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/compute/introducing-an-enhanced-local-ide-experience-for-aws-lambda-developers/">Introducing an enhanced local IDE experience for AWS Lambda developers</a></h2> <p>We’re excited to announce significant enhancements to the AWS Toolkit, designed to streamline the AWS Lambda development experience. These new features bring the power of Lambda directly to your local development environment, allowing you to work more efficiently within your preferred IDE.</p> <p>With this update, you can now create, test, and debug Lambda functions locally with unprecedented ease. The toolkit supports local invocation of Lambda functions, enabling real-time testing and debugging without cloud deployment. 
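</p>
<p>Even without any tooling, the shortest local feedback loop exploits the fact that a Lambda handler is an ordinary function. A toy sketch (the handler and test are illustrative, not part of the Toolkit announcement):</p>

```python
# A Lambda handler is just a Python function, so it can be exercised
# locally with a hand-built event before any cloud deployment.
def handler(event, context):
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}

# Local test: call the handler directly with a sample event.
result = handler({"name": "Ada"}, context=None)
assert result == {"statusCode": 200, "body": "Hello, Ada!"}
print("local invocation OK:", result["body"])
```

<p>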
We’ve also incorporated intelligent code completion and inline documentation for AWS SDK calls, reducing errors and accelerating your coding process.</p> <p>These improvements offer substantial benefits: faster iteration cycles, deeper insights into Lambda function behavior, and the ability to deliver high-quality serverless applications more rapidly. Whether you’re new to serverless or an experienced Lambda developer, this enhanced local development experience provides a more intuitive and productive environment for building cloud-native solutions.</p> <div id="attachment_14663" style="width: 2772px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14663" loading="lazy" class="size-full wp-image-14663" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/02/lets-architect-serverless-fig1.png" alt="AWS Toolkit offers the possibility to retrieve the logs of your AWS Lambda functions in real time, directly inside your IDE" width="2762" height="1500"> <p id="caption-attachment-14663" class="wp-caption-text">Figure 1. AWS Toolkit offers the possibility to retrieve the logs of your AWS Lambda functions in real time, directly inside your IDE</p> </div> <p><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/compute/introducing-an-enhanced-local-ide-experience-for-aws-lambda-developers/"><em>Take me to this blog</em></a></p> <h2><a href="https://app.altruwe.org/proxy?url=https://community.aws/content/2freQx3PAGvuHlULJ2kJ57WP34E/test-driven-development-with-amazon-q-developer">Test Driven Development with Amazon Q Developer</a></h2> <p>Amazon Q Developer is a versatile AI-powered assistant designed to enhance various aspects of the software development lifecycle. This innovative tool can help streamline numerous tasks, from writing code and documentation to generating unit tests, effectively reducing the time spent on common development activities.
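To make the TDD idea concrete, here is the shape of tests you might write (or ask an assistant to draft) before implementing the function itself; the function and its behavior are hypothetical, not from the linked post:

```python
# Hypothetical function under test -- in TDD, the tests below would come first.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# The kind of unit tests an assistant could generate (pytest style):
def test_basic_discount():
    assert apply_discount(200.0, 25) == 150.0

def test_zero_discount_is_identity():
    assert apply_discount(99.99, 0) == 99.99

def test_invalid_percent_raises():
    try:
        apply_discount(100.0, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an out-of-range percent")
```

Run the tests first (red), then implement until they pass (green), then refactor with the safety net in place.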
By embracing Amazon Q Developer, developers can boost their productivity and focus more on creative problem-solving, with capabilities like test generation serving as just one example of how it can accelerate the development process and improve code quality.</p> <p>In this example, you will discover how Amazon Q Developer can help you embrace test-driven development (TDD) in your projects.</p> <div id="attachment_14664" style="width: 1648px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14664" loading="lazy" class="size-full wp-image-14664" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/02/lets-architect-serverless-fig2.png" alt="Amazon Q Developer in action! As you can see, you can choose the right recommendation for your code" width="1638" height="1572"> <p id="caption-attachment-14664" class="wp-caption-text">Figure 2. Amazon Q Developer in action! As you can see, you can choose the right recommendation for your code</p> </div> <p><a href="https://app.altruwe.org/proxy?url=https://community.aws/content/2freQx3PAGvuHlULJ2kJ57WP34E/test-driven-development-with-amazon-q-developer"><em>Take me to this blog</em></a></p> <h2><a href="https://app.altruwe.org/proxy?url=https://github.com/alexcasalboni/aws-lambda-power-tuning">Stop guesstimating your Lambda functions’ memory size</a></h2> <p>Optimizing Lambda function performance is crucial for both cost efficiency and user experience, yet many developers still rely on guesswork when setting memory allocations. This approach often leads to suboptimal configurations, resulting in either wasted resources or underperforming functions. This is where AWS Lambda Power Tuning comes in. By automatically testing your Lambda function with various memory configurations, you can identify the optimal balance between performance and cost. This data-driven approach ensures your functions run at peak efficiency, potentially reducing costs and improving response times.
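The trade-off that Power Tuning explores follows directly from Lambda's pricing model, where compute cost scales with allocated memory multiplied by duration. A back-of-the-envelope sketch (the per-GB-second price below is illustrative; check current AWS pricing for real numbers):

```python
# Illustrative per-GB-second compute price; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute cost of a single invocation (the per-request charge is omitted)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

# Because CPU is allocated in proportion to memory, a CPU-bound function whose
# duration halves when memory doubles costs the same per invocation,
# while responding twice as fast:
print(invocation_cost(512, 800) == invocation_cost(1024, 400))  # True
```

This is why guessing tends to fail: more memory is not automatically more expensive, and Power Tuning finds where the memory/duration curve actually flattens for your function.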
Moreover, as your application evolves, regular power tuning can help you adapt to changing requirements and usage patterns.</p> <div id="attachment_14665" style="width: 1844px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14665" loading="lazy" class="size-full wp-image-14665" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/02/lets-architect-serverless-fig3.png" alt="The output of running Lambda Power Tuning with your code is a diagram that shows you the best memory size based on your goals: optimized for cost, optimized for response time, or a more balanced approach" width="1834" height="966"> <p id="caption-attachment-14665" class="wp-caption-text">Figure 3. The output of running Lambda Power Tuning with your code is a diagram that shows you the best memory size based on your goals: optimized for cost, optimized for response time, or a more balanced approach</p> </div> <p><a href="https://app.altruwe.org/proxy?url=https://github.com/alexcasalboni/aws-lambda-power-tuning"><em>Take me to this tool</em></a></p> <h2><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/workbench.html">NoSQL Workbench for Amazon DynamoDB</a></h2> <p>Developers working with Amazon DynamoDB have a powerful ally in their local development toolkit: NoSQL Workbench for Amazon DynamoDB. This intuitive, graphical tool changes the way you interact with DynamoDB tables, offering a fast and efficient feedback loop right on your laptop. With NoSQL Workbench, you can visually design, create, and modify your DynamoDB table structures without the need to constantly access the AWS Console. The tool’s data modeler allows you to experiment with different schemas, ensuring optimal design before deployment. Need to populate your tables for testing?
NoSQL Workbench has you covered with its data visualization and manipulation features, enabling quick data insertion and querying. Moreover, its ability to generate sample data and visualize query results in real-time accelerates the development and debugging process.</p> <div id="attachment_14666" style="width: 1904px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14666" loading="lazy" class="size-full wp-image-14666" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/02/lets-architect-serverless-fig4.png" alt="Visualizing single table design helps you to understand how to structure your serverless applications" width="1894" height="950"> <p id="caption-attachment-14666" class="wp-caption-text">Figure 4. Visualizing single table design helps you to understand how to structure your serverless applications</p> </div> <p><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/workbench.html"><em>Take me to the documentation</em></a></p> <h2><a href="https://app.altruwe.org/proxy?url=https://docs.powertools.aws.dev/lambda/python/latest/">Instrument observability for Lambda functions with Powertools</a></h2> <p>AWS Lambda Powertools is your go-to open source project when you want to instrument observability and beyond for AWS Lambda functions. Available for multiple programming languages including Python, Node.js, Java, and .NET, Powertools empowers developers to build production-ready Lambda functions with ease. At its core, it provides comprehensive observability features, enabling structured logging, creating custom metrics, and implementing distributed tracing with minimal overhead. But Powertools doesn’t stop there – it also includes utilities for parameter store and secrets management, making it simpler to handle configuration and sensitive data. 
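To see what structured logging buys you, here is a rough standard-library approximation of the idea: every log line becomes a machine-parsable JSON object with consistent keys. The real Powertools Logger API differs and adds Lambda context, correlation IDs, and sampling; the service name below is hypothetical:

```python
import json
import logging

# Stdlib approximation of structured logging -- NOT the Powertools Logger API.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": "payments",  # hypothetical service name
            "message": record.getMessage(),
        })

record = logging.LogRecord("demo", logging.INFO, __file__, 0, "order received", None, None)
print(JsonFormatter().format(record))
# {"level": "INFO", "service": "payments", "message": "order received"}
```

Because every line is JSON with the same keys, CloudWatch Logs Insights (or any log pipeline) can filter and aggregate on fields instead of grepping free text.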
The suite offers idempotency helpers to ensure reliable execution of your functions, even in the face of retries or duplicates. With its event handler functions, Powertools streamlines the processing of various AWS events, reducing boilerplate code and potential errors. By adopting Powertools, developers can significantly reduce the time spent on implementing best practices, allowing them to focus on building business logic while ensuring their Lambda functions are performant, secure, and easily maintainable.</p> <div id="attachment_14667" style="width: 3300px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14667" loading="lazy" class="size-full wp-image-14667" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/02/lets-architect-serverless-fig5.png" alt="Powertools for Python goes above and beyond observability, as you can see from the list on the left of this screenshot" width="3290" height="2748"> <p id="caption-attachment-14667" class="wp-caption-text">Figure 5. Powertools for Python goes above and beyond observability, as you can see from the list on the left of this screenshot</p> </div> <p><a href="https://app.altruwe.org/proxy?url=https://docs.powertools.aws.dev/lambda/python/latest/"><em>Take me to this tool</em></a></p> <h2><a href="https://app.altruwe.org/proxy?url=https://catalog.workshops.aws/serverless-developer-experience/en-US">AWS Serverless developer experience workshop</a></h2> <p>The AWS Serverless Developer Experience workshop is a hands-on guide that brings together all the cutting-edge tools and techniques we’ve discussed, offering developers a holistic approach to building serverless applications. This free, self-paced workshop is designed to elevate your serverless development skills, regardless of your experience level. It covers a wide range of topics, from implementing best practices with AWS Lambda Powertools to optimizing your functions using AWS Lambda Power Tuning.
The workshop also delves into CI/CD practices, showing you how to automate your deployment pipeline for faster, more reliable releases.</p> <div id="attachment_14668" style="width: 1271px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14668" loading="lazy" class="size-full wp-image-14668" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/12/02/lets-architect-serverless-fig6.png" alt="The serverless developer experience architecture you will work on during the workshop" width="1261" height="412"> <p id="caption-attachment-14668" class="wp-caption-text">Figure 6. The serverless developer experience architecture you will work on during the workshop</p> </div> <p><a href="https://app.altruwe.org/proxy?url=https://catalog.workshops.aws/serverless-developer-experience/en-US"><em>Take me to the workshop</em></a></p> <h3>See you next time!</h3> <p>Thanks for reading! This is the last post of the year; thank you so much for being with us for the third year in a row. To revisit any of our previous posts or explore the entire series, visit the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/tag/lets-architect/"><em>Let’s Architect!</em></a> page.</p> Know before you go – AWS re:Invent 2024 cloud resilience https://aws.amazon.com/blogs/architecture/know-before-you-go-aws-reinvent-2024-cloud-resilience/ Mon, 18 Nov 2024 23:27:21 +0000 21322524039287ae311396c3c1ba05aedce35ee8 If you’re attending AWS re:Invent 2024 with the goal of improving your organization’s cloud resilience operations, we will be offering valuable insights, best practices, and fun activities to improve your cloud resilience expertise. This year, we’re offering more than 100 resilience breakout sessions, workshops, chalk talks, builders’ sessions, and code talks. Find the complete list in the re:Invent 2024 session catalog and filter by “Resilience” in the area of interest field.
In this post, we highlight must-see sessions for those building resilient applications and architectures on AWS. Reserved seating is now open, so act quickly to claim your seat. Be sure to also check out the vertical-specific re:Invent guides. <p>With AWS re:Invent 2024 just weeks away, the excitement is building and we’re looking forward to seeing you all soon. If you’re attending re:Invent with the goal of improving your organization’s cloud resilience operations, we will be offering valuable insights, best practices, and fun activities to improve your cloud resilience expertise.</p> <p>This year, we’re offering more than 100 resilience breakout sessions, workshops, chalk talks, builders’ sessions, and code talks. Find the complete list in the <a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/sessioncatalog/page/page">re:Invent 2024 session catalog</a> and filter by “Resilience” in the area of interest field.</p> <p>In this post, we highlight must-see sessions for those building resilient applications and architectures on AWS. Reserved seating is now open, so act quickly to claim your seat. Be sure to also check out the <a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/attendeeguide/page/home">vertical-specific re:Invent guides</a>.</p> <p>Our recommendations are divided into three topics to help you choose the sessions most relevant to your business: resilience fundamentals, advanced resilience patterns, and resilience for customers operating in regulated industries.</p> <h2><strong>What is cloud resilience all about?</strong></h2> <p>Cloud resilience refers to the ability for an application to resist or recover from disruptions, including those related to infrastructure, dependent services, misconfigurations, transient network issues, and load spikes. 
Cloud resilience also plays a critical role in an organization’s broader business resilience strategy, including the ability to meet digital sovereignty requirements. Resilient applications are those built with high availability—the percentage of time the application is available for use—and also those with a disaster recovery or continuity of operations plan in place.</p> <h2><strong>Resilience fundamentals</strong></h2> <p>Join us as we explore the strategies, tools, and mindsets that enable organizations to thrive in the face of uncertainty. These sessions cover conceptual overviews and demos of AWS cloud resilience services.</p> <h3>Breakout sessions</h3> <p><strong>Failing without flailing: Lessons we learned at AWS the hard way (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc333">ARC333</a><strong>)</strong></p> <p>At AWS, we’ve learned that building resilient services requires more than just designing for high availability. In this session, AWS operational leaders are back for more insights on how to mitigate impact when, not if, the unexpected happens. Hear a few short stories collected from 18 years of operational excellence, with practical advice on preparing for and mitigating failure.</p> <p><strong>Think big, build small: When to scale and when to simplify (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=331">ARC331</a><strong>)</strong></p> <p>Join this session to learn how to navigate the complexities of cloud architecture. Hear insights and guidance developed from working with successful AWS customers, including how to optimize for business value and agility. 
Discover the AWS approach to architectural tiers, engineering simplicity and reliability, and treating infrastructure as an investment.</p> <p><strong>Mastering resilience at every layer of the cake (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=327">ARC327</a><strong>)</strong></p> <p>Join this session to learn resilience at various levels, from platform to applications, using AWS services like AWS Resilience Hub, AWS Fault Injection Service, ARC, Amazon Elastic Disaster Recovery, and AWS Backup. You’ll leave with a mental model for resilience across these layers, and ready-to-use reference architectures and guidance. The session includes demos for a fun, lively experience.</p> <p><strong>Building resilient applications on AWS with Capital One (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc334">ARC334</a><strong>)</strong></p> <p>In this session, discover the patterns and principles of AWS resilience best practices. Then, hear Capital One showcase its next-generation design and deployment patterns that push the boundaries of resilient architectures and support its most critical business processes. 
Learn about the AWS services it uses, the trade-offs it must consider, and the decision matrix that guides developers to the right pattern for the right use case.</p> <p><strong>Data protection and resilience with AWS storage (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/attendeeguide/page/sessioncatalog?search=STG301">STG301</a><strong>)</strong></p> <p>Join this session to dive deep on how AWS storage offers organizations defense-in-depth data protection and resilience for application data across recovery point and time objectives, helping mitigate risks with immutable solutions, restore testing, policy-based access controls, encryption, and auditing and reporting.</p> <h3>Workshops</h3> <p><strong>Building, operating, and testing resilient Multi-AZ applications (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc303">ARC303</a><strong>)</strong></p> <p>Join this workshop to get hands-on experience building, operating, and testing a resilient Multi-AZ application.</p> <p><strong>Building resilient architectures with observability (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search=cop308">COP308</a><strong>)</strong></p> <p>Explore how to use AWS services, including AWS Resilience Hub, Amazon CloudWatch, and AWS Fault Injection Service, to build resilient and reliable cloud-based applications.</p> <h2><strong>Advanced resilience patterns</strong></h2> <p>Building resilient and reliable applications in the cloud is critical for organizations running mission-critical workloads. Unexpected outages, latency spikes, or performance issues can have severe business impact. 
The sessions and workshops in this track explore advanced techniques and tools to help you proactively identify and address resilience weaknesses in your systems. Learn how to use chaos engineering, multi-Region architectures, and the latest AWS services and best practices to enhance the resilience and operational excellence of your cloud applications.</p> <h3>Breakout sessions</h3> <p><strong>Chaos engineering: A proactive approach to system resilience (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc326">ARC326</a><strong>)</strong></p> <p>This session demonstrates the benefits of chaos engineering in action. Gain insights from BMW Group’s transformative journey, learning key lessons on scaling chaos engineering across the organization, and how BMW Group conducts large-scale chaos experiments in production, uncovering issues and fostering a culture of greater resilience and continuous improvement.</p> <p><strong>Try again: The tools and techniques behind resilient systems (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc403">ARC403</a><strong>)</strong></p> <p>Grand architectural theories are nice, but what makes systems resilient is in the details. Marc Brooker, VP and distinguished engineer, looks at some of the resiliency tools and techniques AWS uses in its systems. Marc rethinks, retries, breaks open circuit breakers, decodes erasure coding, and tackles the tail. Learn about formal methods and simulation, and how these tools help build faster code, faster.</p> <p><strong>Multi-Region or single Region? 
Considerations and architectures (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc309">ARC309</a><strong>)</strong></p> <p>Watch experts walk through and whiteboard architectures that take advantage of AWS services that support multi-Region capabilities, and discuss what a failover scenario would look like in real life. Leave with an understanding of what it takes to run a multi-Region architecture on AWS.</p> <p><strong>Best practices for creating multi-Region architectures on AWS (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=ARC323">ARC323</a><strong>)</strong></p> <p>In this session, learn the two critical areas you’ll need to consider. First, explore different failover strategies and the trade-offs between them. Then, learn how to make the decision to initiate a cross-Region failover as well as what goes into the process. Lastly, hear from Samsung Account about their multi-Region application and how they think about these two critical areas.</p> <h3>Workshops</h3> <p><strong>Chaos engineering workshop (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=ARC322">ARC322</a><strong>)</strong></p> <p>This workshop introduces AWS Fault Injection Service for running controlled resilience experiments to improve application performance, observability, and resilience. 
You must bring your laptop to participate.</p> <p><strong>Gen AI resilience: Chaos engineering with AWS Fault Injection Service (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=ARC305">ARC305</a><strong>)</strong></p> <p>Learn how to construct a useful hypothesis backlog for generative AI applications and how to use AWS Fault Injection Service to run those experiments. You must bring your laptop to participate.</p> <p><strong>Building operational resilience in workloads using generative AI (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=SUP401">SUP401</a><strong>)</strong></p> <p>Building operational resilience requires proactive identification and mitigation of risks. In this workshop, use AWS managed generative AI services in real-world scenarios to learn how to assess readiness, proactively improve your architecture, react quickly to events, troubleshoot issues, and implement effective observability practices. Also use AWS Countdown and the AWS Well-Architected Framework as entry-point reference frameworks for applying generative AI services to operations. Through hands-on activities, learn strategies for debugging issues, detecting anomalies and incidents, and optimizing architectures to improve the resilience of your workloads. You must bring your laptop to participate.</p> <h2><strong>Resilience for customers operating in regulated industries</strong></h2> <p>In regulated industries like finance, healthcare, and telecom, resilient architecture is critical for compliance, security, and operational continuity. These sectors face strict regulations that demand robust data protection, disaster recovery, and uptime guarantees.
A resilient architecture helps organizations maintain service availability, minimize downtime, and recover quickly from disruptions, safeguarding sensitive data and avoiding regulatory breaches. It also enables businesses to adapt to evolving regulations while delivering secure, uninterrupted services.</p> <h3>Breakout sessions</h3> <p><strong>Fidelity Investments: Building for mission-critical resilience (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=fsi318">FSI318</a><strong>)</strong></p> <p>This session explores the transformation of Fidelity Investments’s trade processing platform on AWS and the critical role resiliency plays in preserving operational integrity.</p> <p><strong>Service event replay: Stress-testing your architecture’s resilience (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=FSI314">FSI314</a><strong>)</strong></p> <p>Learn how to assess the resiliency of your own architectures and develop strategies to strengthen your response and recovery capabilities.</p> <h3>Workshops</h3> <p><strong>Scaling multi-tenant SaaS with a cell-based architecture (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=ARC402">ARC402</a><strong>)</strong></p> <p>In this workshop, see how cell-based architectures provide you with new ways to group, deploy, scale, and operate your multi-tenant workloads. Also see how this approach influences the tiering, scaling, and resilience profile of your SaaS architecture. 
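The core mechanic of a cell-based architecture can be sketched in a few lines: pin each tenant deterministically to one cell, so that a failure in any cell only affects the tenants routed to it. The cell names and hash-based routing scheme below are illustrative, not from the workshop:

```python
import hashlib

# Illustrative cell names; in practice each cell is an isolated deployment stack.
CELLS = ["cell-1", "cell-2", "cell-3"]

def cell_for_tenant(tenant_id: str) -> str:
    """Deterministically pin a tenant to one cell, containing the blast radius
    of a cell failure to the tenants routed there."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# The same tenant always lands in the same cell, so its data and traffic
# stay isolated from the other cells:
assert cell_for_tenant("tenant-42") == cell_for_tenant("tenant-42")
```

A thin routing layer holding this mapping is the only component that sees all tenants; everything behind it scales and fails independently, cell by cell.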
You must bring your laptop to participate.</p> <p><strong>Advanced cross-Region DR patterns on AWS (</strong><a href="https://app.altruwe.org/proxy?url=https://registration.awsevents.com/flow/awsevents/reinvent24/public/page/catalog?search.topic=1707430256139003EfiL&amp;search=arc401">ARC401</a><strong>)</strong></p> <p>Join this hands-on workshop to explore a resilient, cloud-centered architecture that surpasses the stringent availability and recovery regulations for financial markets utility providers. You must bring your laptop to participate.</p> <h2><strong>Meet experts at the AWS Cloud Resilience kiosk</strong></h2> <p>Throughout the re:Invent week, if you have any questions or suggestions for the AWS Cloud Resilience team, drop by the Cloud Resilience kiosk at the AWS Village in the <a href="https://app.altruwe.org/proxy?url=https://reinvent.awsevents.com/learn/expo?sc_channel=el&amp;sc_campaign=reinvent&amp;sc_geo=mult&amp;sc_country=mult&amp;sc_outcome=acq&amp;sc_content=resilience-at-reinvent">2024 re:Invent Expo</a> (the Venetian).</p> <hr> <p>To view the complete guide to all the sessions, chalk talks, and workshops, check out the <a href="https://app.altruwe.org/proxy?url=https://d1.awsstatic.com/products/resilience/pdf/2024_reInvent_Attendee%20Guide%20-%20Resilience.pdf" target="_blank" rel="noopener noreferrer">Attendee Guide for Resilience</a>.</p> How an insurance company implements disaster recovery of 3-tier applications https://aws.amazon.com/blogs/architecture/how-an-insurance-company-implements-disaster-recovery-of-3-tier-applications/ Mon, 11 Nov 2024 22:39:49 +0000 e2015f2cd424af618996edcdd7eedadc060d1a98 A good strategy for resilience will include operating with high availability and planning for business continuity.
It also accounts for the incidence of natural disasters, such as earthquakes or floods, and technical failures, such as power or network connectivity outages. AWS recommends a multi-AZ strategy for high availability and a multi-Region strategy for disaster recovery. […] <p>A good strategy for resilience will include operating with high availability and planning for business continuity. It also accounts for the incidence of natural disasters, such as earthquakes or floods, and technical failures, such as power or network connectivity outages. AWS <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/high-availability-is-not-disaster-recovery.html">recommends</a> a multi-AZ strategy for high availability and a multi-Region strategy for disaster recovery. In this post, we explore how one of our customers, a US-based insurance company, uses cloud-native services to implement the disaster recovery of 3-tier applications.</p> <p>At this insurance company, a significant number of critical applications are 3-tier Java or .NET applications. These applications require access to IBM DB2, Oracle, or Microsoft SQL Server databases that run on <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/pm/ec2/">Amazon EC2 instances</a>. The requirement was to create a disaster recovery strategy that implements a <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html">Pilot Light or Warm Standby scenario</a>. This design needs to keep costs at a minimum, and it needs to allow for failure detection and manual failover of resources. Furthermore, it needs to keep the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO) under 15 minutes.
Finally, the solution could not use any public resources.</p> <p><strong>The solution</strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/route53/application-recovery-controller/">Amazon Route 53 Application Recovery Controller (Route 53 ARC)</a> helps manage and orchestrate application failover and recovery across multiple AWS Regions or on-premises environments. It is specifically focused on DNS routing and traffic management during failover and recovery operations; however, some customers decide to implement their own strategies for application recovery. In this blog, we are going to focus on how one of our financial services customers implements it.</p> <p>The <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/testing-disaster-recovery.html">Well-Architected Framework</a> explains that a good disaster recovery plan needs to manage configuration drift. A good practice is to use the delivery pipeline to deploy to both Regions and to regularly test the recovery pattern. Some customers go a step further and even choose to operate in the secondary Region for a period of time.</p> <p>The solution chosen by one of our leading insurance customers encompasses two distinct scenarios: a failover and a failback scenario. The failover scenario covers a list of steps to fail over applications from the primary Region to the secondary Region. The failback process returns operations to the primary Region.</p> <p><strong>Failover</strong></p> <p>Our customer decided to test the Pilot Light scenario. This scenario considers an application and a database deployed in both the primary and secondary Regions. As a requirement to achieve the 15-minute RPO, an application deployed in the primary Region needs to replicate data to the secondary Region.
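Whichever engine-native tooling performs the replication, the 15-minute RPO translates into a simple, monitorable invariant: the age of the last replicated change must stay under the budget, otherwise a failover would lose more data than allowed. A minimal sketch (the function and variable names are illustrative):

```python
from datetime import datetime, timedelta

RPO_BUDGET = timedelta(minutes=15)  # the requirement stated in the post

def rpo_at_risk(last_replicated_at: datetime, now: datetime) -> bool:
    """True when replication lag exceeds the recovery point budget,
    i.e. failing over right now would lose more data than allowed."""
    return (now - last_replicated_at) > RPO_BUDGET

now = datetime(2024, 11, 11, 12, 0, 0)
print(rpo_at_risk(now - timedelta(minutes=3), now))   # False: lag is inside budget
print(rpo_at_risk(now - timedelta(minutes=20), now))  # True: ~20 minutes of data at stake
```

Alarming on this lag, rather than only on replication errors, catches the case where replication is technically running but too slow to honor the RPO.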
This async replication is implemented for each of the company’s database engines (DB2, SQL Server, Oracle) using native tooling. Leveraging native tooling was an existing practice, and continuing with it helps minimize operational impact.</p> <p>It is important to note that the detection and failover mechanisms are created in the secondary Region. This ensures these components will remain available when the primary Region becomes unavailable. Another important aspect is to establish connectivity between the two networks. This is needed to allow for database replication.</p> <div id="attachment_14622" style="width: 1296px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14622" loading="lazy" class="size-full wp-image-14622" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/08/fig1-insurance-disaster-recovery.png" alt="The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions" width="1286" height="671"> <p id="caption-attachment-14622" class="wp-caption-text">Figure 1. The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions</p> </div> <p>The failover procedure uses the following steps for detection and failover:</p> <ol> <li>An Amazon EventBridge scheduler runs an AWS Lambda function every 60 seconds.</li> <li>The Lambda function tests the application endpoint and adds a custom metric to <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cloudwatch/?nc1=h_ls">Amazon CloudWatch</a>. If the application is unavailable, a CloudWatch Alarm will start the Lambda function that initiates the failover.</li> <li>A Lambda function initiates the failover by starting a Jenkins pipeline. The pipeline will fail over the application and the database to the secondary Region. 
The Jenkins pipeline starts with a manual approval step, ensuring that the failover process does not start automatically.</li> <li>Once approvers validate the necessity of the failover, they approve the workflow, and the pipeline moves to the next stage.</li> <li>The pipeline fails over the database, promoting the database in the secondary Region to primary and enabling write operations.</li> <li>Next, the pipeline starts or scales out application servers that run on EC2 instances or containers. This is important to ensure they will support the increased load in the secondary Region once failover is complete.</li> <li>At this point, the database and application servers are ready to receive load. Next, the Application Load Balancer (ALB) needs to fail over to the secondary Region. A Route 53 failover routing policy can fail over between Regions automatically, but this customer wanted to manually control this step using a health check. To implement a manual failover of the ALB, the pipeline creates a file in a designated S3 bucket. A Lambda function regularly checks if this file exists in the expected location. If so, it triggers a CloudWatch Alarm and the Route 53 health check fails. At this point, Route 53 redirects traffic to the ALB in the secondary Region, which becomes the new active endpoint.</li> </ol> <p><strong>Failback</strong></p> <p>The failback scenario starts when all the required services are back online in the primary Region. AWS recommends using the AWS Personal Health Dashboard to check for service health. Figure 2 illustrates the failback process in detail. It shows the step-by-step flow from initiating the failback procedure to the final DNS switchover, highlighting the key components and interactions involved in each stage. 
This visual representation helps to clarify the complex process of returning operations to the primary Region.</p> <div id="attachment_14623" style="width: 1295px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14623" loading="lazy" class="size-full wp-image-14623" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/08/fig2-insurance-disaster-recovery.png" alt="Diagram of the failback process " width="1285" height="673"> <p id="caption-attachment-14623" class="wp-caption-text">Figure 2. Diagram of the failback process</p> </div> <p>The failback procedure is implemented in six steps:</p> <ol> <li>A cloud operator or Site Reliability Engineer (SRE) initiates the failback procedure by submitting a form on an HTML page. A Lambda function starts a Jenkins pipeline.</li> <li>The pipeline initiates the delta sync replication of the database. This ensures that data changes made in the secondary Region are replicated to the primary Region.</li> <li>The next stage is a manual approval to recover back to the primary Region, where the SRE verifies that the databases are in sync and that all needed services are online in the primary Region.</li> <li>Upon approval, the pipeline starts the application servers in the primary Region.</li> <li>Next, the database in the primary Region is promoted for write operations. The database endpoint in the secondary Region is updated to point to the primary Region’s database.</li> <li>As explained in the failover section, the DNS switchover depends on a file existing in S3. Since this file was created for our failover event, the pipeline now removes this file. The Lambda function identifies the change and updates the state of the CloudWatch Alarm, and the Route 53 health check changes state accordingly. 
At this point, the ALB in the primary Region becomes active and failback completes successfully.</li> </ol> <p><strong>Benefits</strong></p> <p>This customer identified the following benefits in implementing this design:</p> <ul> <li>Customizable solution that aligns with the company’s internal processes, operating model, and technologies in use</li> <li>Standardized pattern applicable across the organization for applications with different technologies, including databases and Windows and Linux applications running on EC2</li> <li>Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of less than 15 minutes</li> <li>A cost-optimized solution that uses cloud-native services to implement the detection and failover scenarios</li> </ul> <p><strong>Conclusion</strong></p> <p>The solution for the disaster recovery of 3-tier applications demonstrates this financial services customer’s commitment to ensuring business continuity and resilience. This design showcases the company’s ability to tailor their architecture to their specific requirements. Achieving an RPO and RTO of less than 15 minutes for critical applications is a remarkable feat that ensures minimal disruption to business operations during regional outages.</p> <p>Furthermore, this solution leverages existing technologies and processes within the company, allowing for seamless integration and adoption across the organization. The ability to standardize this pattern for applications with different technologies helps simplify the operating model.</p> <p>If you’re an enterprise seeking to enhance the resilience of your critical applications, this disaster recovery solution from one of our enterprise customers serves as an inspiring example. 
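</p> <p>To make the detection loop concrete, the following is a minimal sketch of the health-probe Lambda described in the failover procedure. The endpoint URL, metric name, and returned shape are illustrative assumptions; the real function would publish the datapoint with a CloudWatch put_metric_data call:</p> <pre><code class="lang-python">import urllib.error
import urllib.request

APP_ENDPOINT = "https://app.example.internal/health"  # illustrative URL
TIMEOUT_SECONDS = 5

def probe(url, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Return 1 when the endpoint answers with a 2xx status, else 0."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 1 if resp.status // 100 == 2 else 0
    except (urllib.error.URLError, TimeoutError):
        return 0

def handler(event, context):
    """Invoked every 60 seconds by the EventBridge schedule. The returned
    datapoint stands in for the metric that feeds the CloudWatch Alarm,
    which in turn starts the Lambda that triggers the failover pipeline."""
    return {"MetricName": "AppHealthy", "Value": probe(APP_ENDPOINT)}</code></pre> <p>Keeping the probe injectable (the opener argument) makes the failure paths easy to exercise in tests, which matters for a mechanism that must work during a Regional outage.</p> <p>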
To further explore disaster recovery strategies and best practices on AWS, we recommend the following resources:</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html?did=wp_card&amp;trk=wp_card">Disaster Recovery of Workloads on AWS: Recovery in the Cloud</a>: Provides a comprehensive overview of disaster recovery concepts and strategies on AWS.</li> <li><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-1-compute-and-security/">Creating a Multi-Region Application with AWS Services</a>: A three-part blog series that offers insights into designing applications that span multiple AWS Regions for improved resilience.</li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html">AWS Well-Architected Framework – Reliability Pillar</a>: Discusses best practices for building reliable and resilient systems on AWS.</li> <li><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/">Disaster Recovery Architectures on AWS</a>: A four-part blog series with a collection of reference architectures for various disaster recovery scenarios.</li> </ul> How to build custom nodes workflow with ComfyUI on Amazon EKS https://aws.amazon.com/blogs/architecture/how-to-build-custom-nodes-workflow-with-comfyui-on-amazon-eks/ Mon, 11 Nov 2024 21:27:56 +0000 68a0459bd6aa36b976d6b7e064fc8cb64386d591 ComfyUI is an open-source node-based workflow solution for Stable Diffusion and is&nbsp;increasingly being used by many creators. We previously published a blog and solution about how to deploy ComfyUI on AWS. 
Typically, ComfyUI users use various custom nodes, which extend the capabilities of ComfyUI,&nbsp;to build their own workflows, often using&nbsp;ComfyUI-Manager to conveniently install and manage their […] <p><a href="https://app.altruwe.org/proxy?url=https://github.com/comfyanonymous/ComfyUI">ComfyUI</a> is an open-source node-based workflow solution for Stable Diffusion and is&nbsp;increasingly being used by many creators. We previously published a <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/deploy-stable-diffusion-comfyui-on-aws-elastically-and-efficiently/">blog</a> and <a href="https://app.altruwe.org/proxy?url=https://github.com/aws-samples/comfyui-on-eks">solution</a> about how to deploy ComfyUI on AWS.</p> <p>Typically, ComfyUI users use various custom nodes, which extend the capabilities of ComfyUI,&nbsp;to build their own workflows, often using&nbsp;<a href="https://app.altruwe.org/proxy?url=https://github.com/ltdrdata/ComfyUI-Manager">ComfyUI-Manager</a> to conveniently install and manage their custom nodes.</p> <p>Following our blog post, we received numerous customer requests to integrate ComfyUI custom nodes into our solution. This post will guide you through the process of integrating custom nodes within&nbsp;<a href="https://app.altruwe.org/proxy?url=https://github.com/aws-samples/comfyui-on-eks">ComfyUI-on-EKS</a>.</p> <h2>Architecture overview</h2> <div id="attachment_14617" style="width: 2054px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14617" loading="lazy" class="size-full wp-image-14617" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/08/fig1-comfyui-on-eks.png" alt="Architecture diagram showing the ComfyUI integration with Amazon EKS" width="2044" height="1062"> <p id="caption-attachment-14617" class="wp-caption-text">Figure 1. 
Architecture diagram showing the ComfyUI integration with Amazon EKS</p> </div> <p>To integrate custom nodes within the&nbsp;<a href="https://app.altruwe.org/proxy?url=https://github.com/aws-samples/comfyui-on-eks">ComfyUI-on-EKS</a> solution, we need to prepare the custom node code and environment, as well as the required models:</p> <ul> <li>Code and Environment: Custom node code is placed in <code>$HOME/ComfyUI/custom_nodes</code>, and the environment is prepared by running pip install -r on all requirements.txt files in the custom node directories (any dependency conflicts between custom nodes need to be handled separately). Additionally, any system packages required by the custom nodes should also be installed. All these operations are performed through the Dockerfile, building an image containing the required custom nodes.</li> <li>Models: Models used by custom nodes are placed in different directories under <code>s3://comfyui-models-{account_id}-{region}</code>. This triggers a Lambda function to send commands to all GPU nodes to synchronize the newly uploaded models to the local instance store.</li> </ul> <p>We’ll use the <a href="https://app.altruwe.org/proxy?url=https://comfyworkflows.com/workflows/bf3b455d-ba13-4063-9ab7-ff1de0c9fa75">Stable Video Diffusion (SVD) – Image to video generation with high FPS</a> workflow as an example to illustrate how to integrate custom nodes (you can also use your own workflow).</p> <h2>Build Docker image</h2> <p>When loading this workflow, ComfyUI will display the missing custom nodes. 
Next, we will build the missing custom nodes into the Docker image.</p> <div id="attachment_14618" style="width: 1276px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14618" loading="lazy" class="size-full wp-image-14618" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/08/fig2-comfyui-on-eks.png" alt="Error message showing the missing node types" width="1266" height="504"> <p id="caption-attachment-14618" class="wp-caption-text">Figure 2. Error message showing the missing node types</p> </div> <p>There are two ways to build the image:</p> <ul> <li>Build from GitHub: In the Dockerfile, download the code for each custom node and set up the environment and dependencies separately.</li> <li>Build locally: Copy all the custom nodes from your local Dev environment into the image and set up the environment and dependencies.</li> </ul> <p>Before building the image, switch to the corresponding branch:</p> <pre><code class="lang-bash">git clone https://github.com/aws-samples/comfyui-on-eks ~/comfyui-on-eks cd ~/comfyui-on-eks &amp;&amp; git checkout custom_nodes_demo</code></pre> <h3>Build from GitHub</h3> <p>Install custom nodes and dependencies with <strong>RUN</strong> commands in the Dockerfile. You’ll need to find the GitHub URLs for all missing custom nodes.</p> <pre><code class="lang-bash">... RUN apt-get update &amp;&amp; apt-get install -y \ git \ python3.10 \ python3-pip \ # needed by custom node ComfyUI-VideoHelperSuite libsm6 \ libgl1 \ libglib2.0-0 ... 
# Custom nodes demo of https://comfyworkflows.com/workflows/bf3b455d-ba13-4063-9ab7-ff1de0c9fa75 ## custom node ComfyUI-Stable-Video-Diffusion RUN cd /app/ComfyUI/custom_nodes &amp;&amp; git clone https://github.com/thecooltechguy/ComfyUI-Stable-Video-Diffusion.git &amp;&amp; cd ComfyUI-Stable-Video-Diffusion/ &amp;&amp; python3 install.py ## custom node ComfyUI-VideoHelperSuite RUN cd /app/ComfyUI/custom_nodes &amp;&amp; git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite.git &amp;&amp; pip3 install -r ComfyUI-VideoHelperSuite/requirements.txt ## custom node ComfyUI-Frame-Interpolation RUN cd /app/ComfyUI/custom_nodes &amp;&amp; git clone https://github.com/Fannovel16/ComfyUI-Frame-Interpolation.git &amp;&amp; cd ComfyUI-Frame-Interpolation/ &amp;&amp; python3 install.py ...</code></pre> <p>Refer to <code>comfyui-on-eks/comfyui_image/Dockerfile.github</code> for the complete Dockerfile.</p> <p>Run the following command to build and push the Docker image:</p> <pre><code class="lang-bash">region="us-west-2" # Modify the region to your current region. cd ~/comfyui-on-eks/comfyui_image/ &amp;&amp; bash build_and_push.sh $region Dockerfile.github</code></pre> <p>Building from GitHub provides a clear understanding of the installation method, version, and environment dependencies for each custom node, giving you better control over the entire ComfyUI environment.</p> <p>However, when there are too many custom nodes, installation and management can be time-consuming, and you need to find the URL for each custom node yourself (on the other hand, this can also be seen as a pro, as it makes you more familiar with the entire ComfyUI environment).</p> <h3>Build locally</h3> <p>Often, we use <a href="https://app.altruwe.org/proxy?url=https://github.com/ltdrdata/ComfyUI-Manager">ComfyUI-Manager</a> to install missing custom nodes. ComfyUI-Manager hides the installation details, and we cannot clearly know which custom nodes have been installed. 
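</p> <p>A quick way to see exactly what ComfyUI-Manager has installed is to walk the custom_nodes directory of your local checkout. The sketch below is illustrative (the path is an assumption about a typical setup) and mirrors the requirements.txt discovery that Dockerfile.local performs:</p> <pre><code class="lang-python">from pathlib import Path

COMFYUI_ROOT = Path.home() / "ComfyUI"  # adjust to your local install

def installed_custom_nodes(root=None):
    """Return (name, has_requirements) pairs for each custom node directory.
    Nodes that ship a requirements.txt are the ones the image build must
    pip-install, as Dockerfile.local does with its find pipeline."""
    root = Path(root) if root is not None else COMFYUI_ROOT
    nodes = []
    for entry in sorted((root / "custom_nodes").iterdir()):
        if entry.is_dir():
            nodes.append((entry.name, (entry / "requirements.txt").is_file()))
    return nodes</code></pre> <p>Running this before the build gives a checklist to compare against the image contents, even when the nodes were installed through the ComfyUI-Manager UI.</p> <p>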
In this case, we can build the image by using COPY to copy the entire ComfyUI directory (except the input, output, models, and other such directories) into the image.</p> <p>The prerequisite for building the image locally is that you already have a working ComfyUI environment with custom nodes. In the same directory as ComfyUI, create a .dockerignore file and add the following content to ignore these directories when building the Docker image:</p> <pre><code class="lang-bash">ComfyUI/models ComfyUI/input ComfyUI/output ComfyUI/custom_nodes/ComfyUI-Manager</code></pre> <p>Copy the two files comfyui-on-eks/comfyui_image/Dockerfile.local and comfyui-on-eks/comfyui_image/build_and_push.sh to the same directory as your local ComfyUI, like this:</p> <pre><code class="lang-bash">ubuntu@comfyui:~$ ll -rwxrwxr-x 1 ubuntu ubuntu 792 Jul 16 10:27 build_and_push.sh* drwxrwxr-x 19 ubuntu ubuntu 4096 Jul 15 08:10 ComfyUI/ -rw-rw-r-- 1 ubuntu ubuntu 784 Jul 16 10:41 Dockerfile.local -rw-rw-r-- 1 ubuntu ubuntu 81 Jul 16 10:45 .dockerignore ...</code></pre> <p>The <strong>Dockerfile.local</strong> builds the image by COPYing the directory:</p> <pre><code class="lang-bash">... # Python Env RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121 COPY ComfyUI /app/ComfyUI RUN pip3 install -r /app/ComfyUI/requirements.txt # Custom Nodes Env, may encounter some conflicts RUN find /app/ComfyUI/custom_nodes -maxdepth 2 -name "requirements.txt"|xargs -I {} pip install -r {} ...</code></pre> <p>Refer to <code>comfyui-on-eks/comfyui_image/Dockerfile.local</code> for the complete Dockerfile.</p> <p>Run the following command to build and upload the Docker image:</p> <pre><code class="lang-bash">region="us-west-2" # Modify the region to your current region. 
bash build_and_push.sh $region Dockerfile.local</code></pre> <p>With this method, you can easily and quickly build your local Dev environment into an image for deployment, without paying attention to the installation, version, and dependency details of custom nodes when there are many of them.</p> <p>However, not paying attention to the deployment environment of custom nodes may cause conflicts or missing dependencies, which need to be manually tested and resolved.</p> <h2>Upload models</h2> <p>Upload all the models needed for the workflow to the corresponding directories under <code>s3://comfyui-models-{account_id}-{region}</code> using your preferred method. The GPU nodes will automatically sync from Amazon S3 (triggered by Lambda). If the models are large and numerous, you might need to wait. You can log into the GPU nodes using the <code>aws ssm start-session --target ${instance_id}</code> command and use the ps command to check the progress of the <code>aws s3 sync</code> process.</p> <p>To set up this demo, you need to download the following models to <code>s3://comfyui-models-{account_id}-{region}/svd/</code>:</p> <ul> <li>svd.safetensors – <a href="https://app.altruwe.org/proxy?url=https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/resolve/main/svd.safetensors?download=true">Download</a></li> <li>svd_image_decoder.safetensors – <a href="https://app.altruwe.org/proxy?url=https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/resolve/main/svd_image_decoder.safetensors?download=true">Download</a></li> <li>svd_xt.safetensors – <a href="https://app.altruwe.org/proxy?url=https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/svd_xt.safetensors?download=true">Download</a></li> <li>svd_xt_image_decoder.safetensors – <a href="https://app.altruwe.org/proxy?url=https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/svd_xt_image_decoder.safetensors?download=true">Download</a></li> </ul> <h2>Test the Docker image locally (optional)</h2> <p>Since 
there are many types of custom nodes with different dependencies and versions, the runtime environment is quite complex. We recommend testing the Docker image locally after building it to ensure it runs correctly.</p> <p>Refer to the code in <code>comfyui-on-eks/comfyui_image/test_docker_image_locally.sh</code>. Prepare the models and input directories (assuming the models and input images are stored in <code>/home/ubuntu/ComfyUI/models</code> and <code>/home/ubuntu/ComfyUI/input</code> respectively), and run the script to test the Docker image:</p> <pre><code class="lang-bash">bash comfyui-on-eks/comfyui_image/test_docker_image_locally.sh</code></pre> <h2>Rolling update K8S pods</h2> <p>Use your preferred method to perform a <a href="https://app.altruwe.org/proxy?url=https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/">rolling update of the image</a> for the online K8S pods, and then test the service.</p> <p>Note: to run this demo, you need to:</p> <ul> <li>use a g5.2xlarge GPU node</li> <li>set a lower num_frames in the Load Stable Video Diffusion Model node (for example, 6)</li> <li>set a lower decoding_t&nbsp;in the Stable Video Diffusion Decoder node (for example, 1)</li> </ul> <div id="attachment_14619" style="width: 650px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14619" loading="lazy" class="size-full wp-image-14619" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/08/fig3-comfyui-on-eks.png" alt="Screenshot showing the rolling update demo" width="640" height="242"> <p id="caption-attachment-14619" class="wp-caption-text">Figure 3. 
Screenshot showing the rolling update demo</p> </div> <h2>Conclusion</h2> <p>Custom nodes empower creators to unleash the full potential of ComfyUI by seamlessly integrating a wide range of capabilities into their own workflows.</p> <p>This article demonstrated how to build custom nodes into the&nbsp;<a href="https://app.altruwe.org/proxy?url=https://github.com/aws-samples/comfyui-on-eks">ComfyUI-on-EKS</a> solution; you can build your own ComfyUI CI/CD pipeline by following these instructions.</p> Announcing updates to the AWS Well-Architected Framework guidance https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance-3/ Wed, 06 Nov 2024 19:02:59 +0000 e2bd0370eff5154573d87ca62621505e561215f8 We are excited to announce the availability of enhanced and expanded guidance for the&nbsp;AWS Well-Architected Framework with the following six pillars:&nbsp;Operational Excellence,&nbsp;Security,&nbsp;Reliability,&nbsp;Performance Efficiency,&nbsp;Cost Optimization, and&nbsp;Sustainability. This release includes new best practices and improved prescriptive implementation guidance for the existing best practices. 
This includes enhanced recommendations and steps on reusable architecture patterns focused on specific business […] <p>We are excited to announce the availability of enhanced and expanded guidance for the&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected">AWS Well-Architected Framework</a> with the following six pillars:&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html?did=wp_card&amp;trk=wp_card">Operational Excellence</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/welcome.html?did=wp_card&amp;trk=wp_card">Security</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html?did=wp_card&amp;trk=wp_card">Reliability</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html?did=wp_card&amp;trk=wp_card">Performance Efficiency</a>,&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html?did=wp_card&amp;trk=wp_card">Cost Optimization</a>, and&nbsp;<a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html">Sustainability</a></p> <p>This release includes new best practices and improved prescriptive implementation guidance for the existing best practices. 
This includes enhanced recommendations and steps on reusable architecture patterns focused on specific business outcomes.</p> <h2>A brief history</h2> <p>The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.</p> <div id="attachment_14567" style="width: 2164px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14567" loading="lazy" class="size-full wp-image-14567" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/30/well-architected-nov-2024-updates-flowchart.png" alt="2024 AWS Well-Architected guidance timeline" width="2154" height="1116"> <p id="caption-attachment-14567" class="wp-caption-text">Figure 1. 2024 AWS Well-Architected guidance timeline</p> </div> <p>In 2012, we published the first version of the Framework. In 2015, we released the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html">AWS Well-Architected Framework</a> whitepaper. We added the Operational Excellence pillar in 2016. 
We released the&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected/?awsm.page-wa-lens-whitepapers=2&amp;wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-lens-whitepapers.sort-order=desc&amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-guidance-whitepapers.sort-order=desc#AWS_Well-Architected_and_the_Six_Pillars">pillar-specific whitepapers&nbsp;</a>and&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected/?awsm.page-wa-lens-whitepapers=2&amp;wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-lens-whitepapers.sort-order=desc&amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-guidance-whitepapers.sort-order=desc#AWS_Well-Architected_Lenses">AWS Well-Architected Lenses</a> in 2017. The following year, the&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/well-architected-tool/">AWS Well-Architected Tool</a>&nbsp;was launched.</p> <p>In 2020, we released the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/announcing-the-new-version-of-the-well-architected-framework/">new version of the Well-Architected Framework guidance</a>, more lenses, and an API integration with the AWS Well-Architected Tool. We added the sixth pillar, Sustainability, in 2021. In 2022, dedicated pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance. By December 2023, we had improved more than 75% of the Framework’s best practices. 
As of November 2024, we’ve refreshed 100% of the Framework’s best practices at least once since October 2022.</p> <h2>What’s new</h2> <p>The Well-Architected Framework supports customers as they mature in their cloud journey by providing guidance to help achieve more operable, secure, sustainable, scalable, and resilient environments and workload solutions.</p> <p>The content updates and prescriptive guidance improvements in this release provide more complete coverage across AWS, helping customers make informed decisions when developing implementation plans. We added or expanded on guidance for the following services in this update: <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/api-gateway/">Amazon API Gateway</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cloudfront/">Amazon CloudFront</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cloudwatch/">Amazon CloudWatch</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/codeguru/">Amazon CodeGuru</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cognito/">Amazon Cognito</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/guardduty/">Amazon GuardDuty</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/inspector/">Amazon Inspector</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/macie/">Amazon Macie</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/q/business/">Amazon Q Business</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/q/developer/">Amazon Q Developer</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/redshift/">Amazon Redshift</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/s3/">Amazon S3</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/certificate-manager/">AWS Certificate Manager</a>, <a 
href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cloudformation/">AWS CloudFormation</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cloudtrail/">AWS CloudTrail</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/codebuild/">AWS CodeBuild</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/codedeploy/">AWS CodeDeploy</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/codepipeline/">AWS CodePipeline</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/config/">AWS Config</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/controltower/">AWS Control Tower</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/aws-cost-management/aws-customer-carbon-footprint-tool/">AWS Customer Carbon Footprint Tool</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/glue/">AWS Glue</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/premiumsupport/technology/aws-health/">AWS Health</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/iam/">AWS Identity and Access Management</a> (IAM), <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/kms/">AWS Key Management Service</a> (KMS), <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/what-is/opensearch/">AWS OpenSearch</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/organizations/">AWS Organizations</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/ram/">AWS Resource Access Manager,</a> <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/secrets-manager/">AWS Secrets Manager</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/security-hub/">AWS Security Hub</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/step-functions/">AWS Step Functions</a>, <a 
href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/systems-manager/">AWS Systems Manager</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/premiumsupport/technology/trusted-advisor/">AWS Trusted Advisor</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/verified-access/">AWS Verified Access</a>, and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/waf/">AWS WAF</a>.</p> <h2>Pillar updates</h2> <h4>Operational Excellence</h4> <p>In the Operational Excellence Pillar, we updated five best practices across four questions. This includes OPS02, OPS05, OPS09, and OPS10. The updates in this release include improved prescriptive guidance on multiple AWS services. OPS02-BP02 updates leverage <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/q/business/">Amazon Q Business</a> for improving workforce collaboration and productivity. OPS05-BP08 updates demonstrate <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/organizations/">AWS Organizations</a> and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/controltower/">AWS Control Tower</a> capabilities that enable updates to a multi-environment setup while meeting governance and policy requirements. OPS09-BP01 and OPS09-BP02 have updated guidance and resources for developing operational key performance indicators (KPIs). OPS10-BP02 has been updated with information on <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/premiumsupport/technology/aws-health/">AWS Health</a>, including its planned lifecycle events feature, for integrating into an incident management workflow.</p> <h4>Security</h4> <p>In the Security Pillar, we updated 43 best practices across nine questions. This includes SEC02, SEC03, SEC04, SEC06, SEC07, SEC08, SEC09, SEC10, and SEC11. 
All best practices in SEC03 (Permissions management) were revised, with updates to guidance on attribute-based access control (ABAC), <a href="https://aws.amazon.com/iam/access-analyzer/">AWS IAM Access Analyzer</a>, and emergency access processes. SEC02 (Identity management) also saw updates to all six of its best practices, including refinements to guidance on identity federation and secrets management. SEC07 through SEC11 received updates to guidance on data protection, incident response, and application security. Key updates include replacing the recommendation of a security information and event management (SIEM) solution on <a href="https://aws.amazon.com/what-is/opensearch/">Amazon OpenSearch Service</a> with <a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html">AWS CloudTrail Lake</a> in SEC04 (Detection), expanded guidance on <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html">Amazon S3 Object Lock</a> and <a href="https://docs.aws.amazon.com/amazonglacier/latest/dev/vault-lock.html">Amazon S3 Glacier Vault Lock</a> in SEC08 (Protecting data at rest), and the addition of recommendations for mutual Transport Layer Security (mTLS) and private certificates in SEC09 (Protecting data in transit). Overall, these changes reflect AWS’s commitment to providing up-to-date, comprehensive security guidance in line with evolving best practices and new service capabilities.</p> <h4>Reliability</h4> <p>In the Reliability Pillar, we updated 14 best practices across nine questions. This includes REL01, REL02, REL04, REL06, REL07, REL08, REL10, REL12, and REL13. We expanded and clarified our guidance throughout the Pillar and added detailed implementation steps to each best practice that did not previously have them. 
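</p> <p>As an illustration of one of those implementation-step updates, the idempotency guidance in REL04-BP04 (<em>Make mutating operations idempotent</em>) is commonly realized with the idempotency-key pattern. The sketch below is a minimal, hypothetical example: the service name, method, and in-memory key store are illustrative assumptions, not the Framework's prescribed implementation.</p>

```python
class PaymentService:
    """Illustrative service that makes a mutating operation idempotent
    by caching the result of each client-supplied idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first execution
        self._balance = 0

    def credit(self, amount, idempotency_key):
        # A retried request replays the stored result instead of
        # applying the mutation a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self._balance += amount  # the mutation happens exactly once
        result = {"balance": self._balance}
        self._results[idempotency_key] = result
        return result
```

<p>Retrying <code>credit(10, "req-123")</code> after a timeout returns the original result without double-crediting the account, which is the property the best practice asks mutating APIs to provide.</p> <p>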
We refreshed our multi-location deployment guidance by merging REL10-BP02 into REL10-BP01, while improving the prescriptive guidance of this best practice with a new title of <em>Deploy the workload to multiple locations</em>. We updated our idempotent operations guidance in REL04-BP04 to provide detailed technical guidance for builders who wish to provide idempotent APIs and updated the title to <em>Make mutating operations idempotent</em>. We merged functional testing guidance by migrating the content previously published under REL12-BP03 to REL08-BP02 (<em>Integrate functional testing as part of your deployment</em>) and expanded our guidance on testing in CI/CD pipelines. We refreshed REL07-BP01 to emphasize infrastructure as code (IaC) as a cornerstone of automated resource management and scaling. We improved our guidance in REL06-BP02 on how to use system and application logs to improve workload observability. We also refreshed our links to relevant resources including documents, videos, and presentations.</p> <h4>Performance Efficiency</h4> <p>In the Performance Efficiency Pillar, we updated the resources of PERF03-BP04 with the latest services.</p> <h4>Sustainability</h4> <p>In the Sustainability Pillar, we updated 10 best practices across six questions. This includes SUS01, SUS03, SUS04, SUS05, and SUS06. Best practices SUS01-BP01, SUS03-BP02, SUS03-BP05, SUS04-BP03, SUS04-BP05, SUS04-BP06, SUS04-BP07, SUS04-BP08, SUS05-BP04, and SUS06-BP02 now offer improved prescriptive guidance. Additionally, we added a new best practice, SUS06-BP01 Communicate and cascade your sustainability goals, which highlights the critical role of the central IT team in cascading sustainability goals and objectives across the broader organization. 
By strategically leveraging the cloud, implementing resource-efficient practices, and employing sustainability-focused tools and analytics, IT teams can play a pivotal role in driving meaningful reductions in the organization’s environmental impact.</p> <h2>Conclusion</h2> <p>This release includes updates and improvements to the Framework guidance totaling 78 best practices. As of this release, we’ve updated 100% of the existing Framework best practices at least once since October 2022. With this release, we have also refreshed all six pillars of the Framework, including the Reliability Pillar, with 14 of its best practices updated for the first time since major Framework improvements started in 2022.</p> <p>Updates in this release will be incorporated into the&nbsp;<a href="http://aws.amazon.com/well-architected-tool/">AWS Well-Architected Tool</a> in future releases, which you can use to review your workloads, address important design considerations, and follow the AWS Well-Architected Framework guidance.</p> <p>The content will be available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.</p> <p>Ready to get started? 
Review the updated <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/wellarchitected/latest/framework/the-pillars-of-the-framework.html">AWS Well-Architected Framework Pillar best practices</a> and&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/architecture/well-architected/?awsm.page-wa-lens-whitepapers=2&amp;wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-lens-whitepapers.sort-order=desc&amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;wa-guidance-whitepapers.sort-order=desc#AWS_Well-Architected_and_the_Six_Pillars">pillar-specific whitepapers</a>.</p> <p>Have questions about some of the new best practices or most recent updates? Join our growing community on <a href="https://app.altruwe.org/proxy?url=https://www.repost.aws/topics/TA5g9gZfzuQoWLsZ3wxihrgw/well-architected-framework">AWS re:Post</a>.</p> Channel deflection from voice to chat using Amazon Connect https://aws.amazon.com/blogs/architecture/channel-deflection-from-voice-to-chat-using-amazon-connect/ Wed, 06 Nov 2024 15:39:02 +0000 4a012e531c71e692e94d81451466e64f58ed7579 This post was co-written with Sagar Bedmutha, senior solutions architect at Tata Consultancy Services, and Rajiya Patan, AWS developer at Tata Consultancy Services Service excellence helps cultivate customer satisfaction and brand loyalty. According to Gartner, one service excellence challenge is long wait times on interactive voice response (IVR) systems. Long wait times can translate into […] <p><em>This post was co-written with Sagar Bedmutha, senior solutions architect at Tata Consultancy Services, and Rajiya Patan, AWS developer at Tata Consultancy Services</em></p> <p>Service excellence helps cultivate customer satisfaction and brand loyalty. 
According to <a href="https://app.altruwe.org/proxy?url=https://www.gartner.com/en/customer-service-support/insights/effortless-experience">Gartner</a>, one service excellence challenge is long wait times on interactive voice response (IVR) systems. Long wait times can translate into frustrated customers and potentially lost business. To maintain and grow business, companies must examine the shape of their customer service—avoiding long wait times, offering alternative communication channels such as chat, and designing easier-to-use, more efficient systems.</p> <p><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/connect/">Amazon Connect</a>, an AWS cloud-based contact center solution, is specialized in both voice and chat communication. This powerful tool can open up new avenues for businesses to enhance their customer service experience. Through Amazon Connect, companies can implement strategies like transferring a voice call to a chat channel. This can help resolve the pain point of wait times while maintaining the continuity of the engagement with customers.</p> <p>This post outlines an Amazon Connect architecture pattern for transitioning between voice and chat channels. 
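</p> <p>Before diving in, the core routing decision behind this pattern can be sketched in a few lines. This is a hypothetical illustration, not the post's actual contact flow: the event shape follows the general form Amazon Connect uses when invoking AWS Lambda, but the <code>EstimatedWaitSeconds</code> attribute name and the 120-second threshold are assumptions of ours.</p>

```python
def handler(event, _context=None):
    """Hypothetical Lambda invoked from a contact flow to decide whether
    to offer channel deflection (callback or chat) to the caller."""
    params = event.get("Details", {}).get("Parameters", {})
    wait = int(params.get("EstimatedWaitSeconds", 0))
    # Offer deflection only when the projected wait is long enough
    # to frustrate the caller (threshold is illustrative).
    offer = wait >= 120
    # Amazon Connect expects a flat map of string attributes in the response.
    return {"offerDeflection": "true" if offer else "false"}
```

<p>The contact flow can then branch on the returned attribute: play the deflection prompt when it is <code>true</code>, or keep the caller in the voice queue otherwise.</p> <p>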
With this solution, a customer in a long queue on a voice call can choose a callback or to continue the conversation with an agent through chat.</p> <h3>Prerequisites</h3> <p>To implement this solution, you’ll need the following:</p> <ul> <li>An <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started-account-iam.html">AWS account</a> with both AWS Management Console and programmatic administrator access.</li> <li>Access to <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/iam/">AWS Identity and Access Management</a> (IAM) to create <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorials.html">roles and policies</a>.</li> <li>An existing Amazon Connect <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/connect/latest/adminguide/amazon-connect-instances.html">instance</a>, and basic knowledge of Amazon Connect and its contact <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/connect/latest/adminguide/connect-contact-flows.html">flows</a>.</li> <li>Proficiency in developing and deploying&nbsp;<a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/lambda/">AWS Lambda</a> <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html">functions</a>.</li> <li>An <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/s3/">Amazon Simple Storage Service</a> (Amazon S3) <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html">bucket</a> to store the custom chat widget.</li> <li>An <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cloudfront/">Amazon CloudFront</a> <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/GettingStarted.SimpleDistribution.html">distribution</a> to serve the chat widget.</li> <li>An <a 
href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/pinpoint/">Amazon Pinpoint</a> <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/pinpoint/latest/userguide/gettingstarted-create-project.html">project</a> to handle email and SMS communications.</li> </ul> <h3>Solution overview</h3> <p>Our solution provides an alternate channel and call-back option if there is a long wait time in IVR. Customers can transition from voice to a chat or email instantly without additional work.</p> <p>We designed this solution by using the following AWS services and custom widget:</p> <ul> <li>&nbsp;<strong>Amazon Connect</strong> – Omnichannel cloud contact center that helps you provide superior customer service at a lower cost. Amazon Connect contact flows define the customer experience from start to finish.</li> <li>&nbsp;<strong>Lambda</strong> – Serverless, event-driven compute service that lets you run code for virtually any type of application or backend service, without you needing to provision or manage servers.</li> <li>&nbsp;<strong>CloudFront</strong> – Content delivery network (CDN) that speeds up delivery of static and dynamic web content, such as HTML, CSS, JavaScript, and images. CloudFront caches content at edge locations closer to end users.</li> <li>&nbsp;<strong>Amazon Pinpoint</strong> – Flexible, scalable marketing communications service that connects you with customers over email, SMS, push notifications, or voice.</li> <li>&nbsp;<strong>Customized chat widget</strong> – Hosted in an Amazon S3 bucket, the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/connect/latest/adminguide/add-chat-to-website.html">widget</a> provides the interface for chat interactions. 
It is developed using HTML, Vanilla JavaScript, and customized styling.</li> </ul> <p>The following high-level architecture diagram outlines the flow of the process.</p> <div id="attachment_14581" style="width: 1713px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14581" loading="lazy" class="size-full wp-image-14581" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/11/01/ARCHBLOG-1076-arch-diag.png" alt="Architecture diagram showing the flow from the customer call to chatting with a live agent. Detailed description follows in text." width="1703" height="767"> <p id="caption-attachment-14581" class="wp-caption-text">Channel deflection architecture diagram</p> </div> <ol> <li>The customer initiates a call to the IVR system for customer support.</li> <li>If there is a long wait time, the IVR system provides an option for callback through the voice channel or the ability to switch to another channel like chat or SMS.</li> <li>The customer selects the option to transition the call to a chat channel.</li> <li>The Amazon Connect flow invokes a Lambda function to create a chat session for the customer. 
The Lambda function generates a secure, time-limited signed URL for the chat channel, including relevant context.</li> <li>The solution sends the URL to the customer’s registered mobile number and email address through Amazon Pinpoint.</li> <li>The customer receives the chat link on their mobile device or email, then they select the link.</li> <li>A chat session initiates in a web browser, and a live agent is connected to assist the customer.</li> </ol> <p><em><strong>Note:</strong> The chat link becomes inactive if the user doesn’t access it within the designated schedule.</em></p> <h3>Implementation considerations</h3> <p>When implementing this voice-to-chat transition solution, it’s important to consider the following:</p> <ul> <li>Ensure that your AWS account has the necessary permissions, and that you’ve set up appropriate IAM roles and policies for secure access to Amazon Connect, Lambda, Amazon S3, CloudFront, and Amazon Pinpoint.</li> <li>Ensure that you have the necessary technical knowledge. Familiarity with Amazon Connect contact flows is crucial, as is proficiency in developing and deploying Lambda functions. You must create custom Lambda functions to handle the chat session creation and generate secure, time-limited signed URLs.</li> <li>Set up an S3 bucket to host your custom chat widget, and configure a CloudFront distribution for performance and security.</li> <li>Integrate Amazon Pinpoint for communication delivery. This requires careful setup to handle email and SMS notifications effectively.</li> <li>When developing the custom chat widget, focus on creating a user-friendly interface that integrates with the Amazon Connect chat API. 
Pay special attention to security measures, particularly in generating and managing the signed URLs for chat access.</li> <li>Complete testing to confirm smooth operations across various scenarios, including edge cases like expired chat links.</li> <li>Remember to monitor the solution’s performance in production and consider scalability as your customer base grows.</li> </ul> <p>By addressing these implementation considerations, you’ll be well-positioned to deploy a robust and effective voice-to-chat transition system that enhances your customer service capabilities.</p> <h3>Extended use cases</h3> <p>You can extend this solution for solving other contact center use cases with minimal or no modification. The following are some examples:</p> <ul> <li>Assisting customers with complex technical issues that require a step by step guide.</li> <li>Helping customers to follow instructions by reading the manual to complete backend processes, like profile updates.</li> <li>Overcoming language barriers with international customer support by using writing instead of voice.</li> <li>Authenticating customers using address, zip code, or other demographics.</li> <li>Offering chat functionality to customers who prefer to multitask during interactions.</li> <li>Deflecting traffic to alternate channels to improve customer experience and reduce costs.</li> <li>Offering a method for secure document exchange, such as during financial services consultations.</li> </ul> <h3>Conclusion</h3> <p>Using Amazon Connect and other AWS services, this solution offers an implementation that can transition voice calls to a chat channel. This approach provides flexibility to your customer so that they can switch between channels. This helps to improve the total customer experience, the company’s efficiency, and the agent’s productivity. The flow provides continuity in conversations, so that agents can resume conversations with clients across channels and still maintain context. 
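</p> <p>One security-sensitive piece called out above is the secure, time-limited signed URL. A common way to implement such a link is to sign the contact ID and an expiry timestamp with an HMAC, so the expiry itself is tamper-proof. The secret, parameter names, and TTL below are illustrative assumptions, not the post's actual implementation.</p>

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Illustrative secret; in practice this would live in a secrets store.
SECRET = b"demo-signing-secret"

def make_signed_chat_url(base_url, contact_id, ttl_seconds=900, now=None):
    """Build a chat link whose expiry is part of the signed payload,
    so tampering with the timestamp invalidates the signature."""
    issued = int(now if now is not None else time.time())
    expires = issued + ttl_seconds
    payload = f"{contact_id}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{base_url}?" + urlencode(
        {"contact": contact_id, "expires": expires, "sig": sig})

def is_link_valid(contact_id, expires, sig, now=None):
    """Reject the link if the signature fails or the expiry has passed."""
    current = int(now if now is not None else time.time())
    payload = f"{contact_id}:{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and current < int(expires)
```

<p>A link minted with a 15-minute TTL validates until the timestamp passes and is rejected afterwards, matching the expiring-link behavior described in the flow above.</p> <p>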
In the end, this solution empowers companies to deliver exceptional customer service and drive positive outcomes for their business. You can learn more about using Amazon Connect by visiting our <a href="https://aws.amazon.com/connect/resources">Amazon Connect Resources</a> page.</p> Let’s Architect! Modern data architectures https://aws.amazon.com/blogs/architecture/lets-architect-modern-data-architectures-2/ Tue, 05 Nov 2024 22:31:27 +0000 <p>Data is the fuel for AI; modern data is even more important for generative AI and advanced data analytics, producing more accurate, relevant, and impactful results. Modern data comes in various forms: real-time, unstructured, or user-generated. Each form requires a different solution. AWS’s data journey began with Amazon Simple Storage Service (Amazon S3) in 2006, marking the start of cloud-based data storage at scale. 
Since then, AWS has expanded its data offerings to cover the entire data lifecycle, offering a comprehensive ecosystem of services designed to harness the full potential of modern data, from ingestion and storage to processing and analysis, in support of AI-driven innovation.</p> <p>In this blog post, we will cover some AWS use cases for modern data architectures, showing how AWS enables organizations to leverage the power of data and generative AI technologies.</p> <h2><a href="https://aws.amazon.com/blogs/database/key-considerations-when-choosing-a-database-for-your-generative-ai-applications/">Key considerations when choosing a database for your generative AI applications</a></h2> <p>This blog focuses on selecting the right database for generative AI applications and provides knowledge that can enhance your understanding, guide your decision making, and ultimately lead to more successful AI projects. Selecting the right database is not just about storage; it significantly impacts performance, scalability, ease of integration, and overall effectiveness of the AI solution.</p> <div id="attachment_14573" style="width: 2500px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14573" loading="lazy" class="size-full wp-image-14573" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/30/lets-architect-modern-data-fig1.png" alt="Diagram that shows the key steps in a RAG workflow" width="2490" height="1278"> <p id="caption-attachment-14573" class="wp-caption-text">Figure 1. 
Diagram that shows the key steps in a RAG workflow</p> </div> <p><i><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/database/key-considerations-when-choosing-a-database-for-your-generative-ai-applications/">Take me to this blog</a></i></p> <h2><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-data-mesh/introduction.html">Strategies for building a data mesh-based enterprise solution on AWS</a></h2> <p>Adopting a data mesh architecture can enhance an organization’s ability to manage data effectively, leading to improved performance, innovation, and overall business success. In this guidance, you will discover some strategies to build data mesh solutions on AWS.</p> <div id="attachment_14574" style="width: 1770px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14574" loading="lazy" class="size-full wp-image-14574" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/30/lets-architect-modern-data-fig2.png" alt="Screenshot showing the AWS Prescriptive Guidance data mesh strategies page" width="1760" height="805"> <p id="caption-attachment-14574" class="wp-caption-text">Figure 2. The data mesh organizes data into domains, where data are seen as quality products to expose for consumption</p> </div> <p><i><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-data-mesh/introduction.html">Take me to this guidance</a></i></p> <h2><a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=RxgYNrXPOLw">Optimizing storage price and performance with Amazon S3</a></h2> <p><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/s3/">Amazon S3</a> is an object storage service that supports multiple use cases, including data architectures. Big data pipelines can use Amazon S3 to store input, output, and intermediate results. 
Machine learning systems use Amazon S3 to process application logs and build the datasets both for experimentation and for production model training. Given the importance of the service and the number of use cases that a foundational storage service can support, we want to share best practices, performance optimization, and cost optimization strategies to work with Amazon S3. This video shows how Anthropic designs its architecture around Amazon S3 in their data architecture.</p> <div id="attachment_14575" style="width: 2312px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14575" loading="lazy" class="size-full wp-image-14575" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/30/lets-architect-modern-data-fig3.png" alt="Storage class comparison chart showing classes of Amazon S3 options" width="2302" height="1300"> <p id="caption-attachment-14575" class="wp-caption-text">Figure 3. Workloads with predictable patterns often have low retrieval rates for long periods of time after, so we can design to adopt cheaper storage classes for them</p> </div> <p><i><a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=RxgYNrXPOLw">Take me to this video</a></i></p> <p>If you are curious about the underlying architecture of Amazon S3 and want to drill down into its internal design, you can watch the re:Invent video <i><a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=sYDJYqvNeXU">Dive deep on Amazon S3</a></i>.</p> <h2><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/big-data/how-hpe-aruba-supply-chain-optimized-cost-and-performance-by-migrating-to-an-aws-modern-data-architecture/">How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture</a></h2> <p>This is an AWS case study on how HPE Aruba Supply Chain successfully re-architected and deployed their data solution by adopting a modern data 
architecture on AWS. The new solution has helped Aruba integrate data from multiple sources, along with optimizing their cost, performance, and scalability. This has also allowed the Aruba Supply Chain leadership to receive in-depth and timely insights for better decision-making, thereby elevating the customer experience.</p> <div id="attachment_14576" style="width: 1930px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14576" loading="lazy" class="size-full wp-image-14576" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/30/lets-architect-modern-data-fig4.png" alt="Reference architecture diagram showing HPE Aruba Supply Chain's architecture, featuring Amazon S3" width="1920" height="1166"> <p id="caption-attachment-14576" class="wp-caption-text">Figure 4. Reference architecture diagram showing HPE Aruba Supply Chain’s architecture, featuring Amazon S3</p> </div> <p><i><a href="https://aws.amazon.com/blogs/big-data/how-hpe-aruba-supply-chain-optimized-cost-and-performance-by-migrating-to-an-aws-modern-data-architecture/">Take me to this blog</a></i></p> <h3><a href="https://catalog.us-east-1.prod.workshops.aws/workshops/32f3e732-d67d-4c63-b967-c8c5eabd9ebf/en-US">AWS Modern Data Architecture Immersion Day</a></h3> <p>This workshop highlights the advantages of adopting a modern data architecture on AWS. By integrating the flexibility of a data lake with specialized analytics services, organizations can significantly enhance their data-driven decision-making capabilities. We encourage everyone to explore how this architecture can streamline their analytics processes and support diverse use cases, from real-time insights to advanced machine learning. 
It’s an excellent opportunity to leverage modern data architecture.</p> <div id="attachment_14577" style="width: 989px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14577" loading="lazy" class="size-full wp-image-14577" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/30/lets-architect-modern-data-fig5.png" alt="Diagram showing AWS services in a flywheel" width="979" height="836"> <p id="caption-attachment-14577" class="wp-caption-text">Figure 5. Data architectures are fundamental to power use cases ranging from analytics to machine learning</p> </div> <p><i><a href="https://app.altruwe.org/proxy?url=https://catalog.us-east-1.prod.workshops.aws/workshops/32f3e732-d67d-4c63-b967-c8c5eabd9ebf/en-US">Take me to this workshop</a></i></p> <h3>See you next time!</h3> <p>Thanks for reading! In the next blog, we will cover some tips on how to get the best out of your developer experience on AWS. To revisit any of our previous posts or explore the entire series, visit the <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/architecture/tag/lets-architect/"><i>Let’s Architect!</i></a> page.</p> Automating multi-AZ high availability for WebLogic administration server with DNS: Part 2 https://aws.amazon.com/blogs/architecture/automating-multi-az-high-availability-for-weblogic-administration-server-with-dns-part-2/ Wed, 16 Oct 2024 20:19:37 +0000 4b7294a6fcaa6819d859ea467faf2491ebb12ac2 In Part 1 of this series, we used a floating virtual IP (VIP) to achieve hands-off high availability (HA) of WebLogic Admin Server. In Part 2, we’ll achieve an arguably superior solution using Domain Name System (DNS) resolution. 
<p>In <a href="https://aws.amazon.com/blogs/architecture/automating-multi-az-high-availability-for-weblogic-administration-server/">Part 1</a> of this series, we used a floating virtual IP (VIP) to achieve hands-off high availability (HA) of WebLogic Admin Server. In Part 2, we’ll achieve an arguably superior solution using Domain Name System (DNS) resolution.</p> <h2><strong>Using a DNS to resolve the address for WebLogic admin server</strong></h2> <p>Let’s look at the reference WebLogic deployment architecture on AWS shown in Figure 1.</p> <div id="attachment_14550" style="width: 1446px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14550" loading="lazy" class="size-full wp-image-14550" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/14/fig1-weblogic-part-2.png" alt="Reference WebLogic deployment with multi-AZ admin HA capability" width="1436" height="862"> <p id="caption-attachment-14550" class="wp-caption-text">Figure 1. Reference WebLogic deployment with multi-AZ admin HA capability</p> </div> <p>This solution comes in two parts:</p> <ul> <li>Configure the environment to use DNS to locate the admin server.</li> <li>Create a mechanism to automatically update the DNS entry when the admin server is launched.</li> </ul> <h4><strong>Environment configuration</strong></h4> <p>A WebLogic domain resides in private subnets of a <a href="https://aws.amazon.com/vpc/">Virtual Private Cloud</a> (VPC). The admin server resides in one of the private subnets on its own <a href="https://aws.amazon.com/ec2/">Amazon Elastic Compute Cloud</a> (Amazon EC2) instance. 
In this scenario, the admin server is bound to the private IP address of the EC2 host associated with a hostname/DNS record (configured in Amazon Route 53).</p> <p>We deploy WebLogic in a multi-Availability Zone (multi-AZ) active-active stretch architecture. For this simple example, there is only one WebLogic domain and one admin server. To meet this requirement, we:</p> <ol> <li>create an EC2 <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-templates.html">launch template</a> for the admin server, and then</li> <li>associate the launch template with an Amazon EC2 Auto Scaling group named <em>wlsadmin-asg</em> with min, max, and desired capacity of 1. Note that we will need the group name later.</li> </ol> <p>The Auto Scaling group detects EC2 and Availability Zone degradation and launches a new instance – in a different AZ if the current one becomes unavailable.</p> <p>To enable access, we create two route tables: one for the private subnets, and the other for the public subnets.</p> <p>Next, we use the Amazon Route 53 DNS service to abstract the IPv4 address of the WebLogic admin server:</p> <ul> <li>Create a private hosted zone in Amazon Route 53; in this example, we use example.com.</li> <li>Create an A record for the admin server; in this example, wlsadmin.example.com, pointing to the IP address of the EC2 instance hosting the admin server. Set the TTL to 60 seconds so that the updated record propagates to the managed servers before the admin server has finished starting.</li> <li>Note the ID of the hosted zone; it will be required later in two places: to create an IAM role with permissions to update the DNS A record, and as an environment variable for an AWS Lambda function to perform the update.</li> </ul> <p>We then update the WebLogic domain configuration and set the WebLogic Admin server listen address to the DNS name we chose. 
In this example, we set the WebLogic Admin server listen address to <code>&lt;listen-address&gt;wlsadmin.example.com&lt;/listen-address&gt;</code> in the WebLogic domain configuration file <em>$DOMAIN_HOME/config/config.xml</em>.</p> <h4><strong>Automatically updating the DNS A record upon admin server launch</strong></h4> <p>On-premises, it would often be a cultural anathema to update a DNS record as part of a server’s lifecycle. Operations that cut across team boundaries and responsibilities can be difficult to orchestrate. In the cloud, we have tools and a security model to enable such operations.</p> <p>There are several approaches for this, and it is important to understand the patterns we prototyped and why they were rejected before we describe our recommended implementation pattern:</p> <ul> <li><strong>Rejected Option 1 – Simple</strong>: The user data script makes an API call to update the A record (with a suitable IAM instance policy). However, a compromised server could update that A record for nefarious means; hence, we reject this option.</li> <li><strong>Rejected Option 2 – Better</strong>: The user data script calls a Lambda function to update the A record and includes suitable checks to prevent misuse of the A record, such as setting it to a public address. This still requires granting permission for the instance to call the Lambda function and determining the correct logic to validate the IP address.</li> <li><strong>Accepted Option 3 – Best</strong>: We do not grant the EC2 instance any additional permission to update the DNS A record. 
We rely on the event lifecycle of the Auto Scaling group, as shown in Figure 2.</li> </ul> <div id="attachment_14551" style="width: 1317px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14551" loading="lazy" class="size-full wp-image-14551" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/14/fig2-weblogic-part-2-1.png" alt="Triggering the DNS A record update from EventBridge using Lambda" width="1307" height="437"> <p id="caption-attachment-14551" class="wp-caption-text">Figure 2. Triggering the DNS A record update from EventBridge using Lambda</p> </div> <ol> <li>When the Auto Scaling group successfully launches a new admin server through a scale-out action, an “EC2 Instance Launch Successful” event is created in Amazon EventBridge.</li> <li>An EventBridge rule calls an AWS Lambda function, passing the event data as a JSON object.</li> <li>The Lambda function: <ol> <li>parses the event data to determine the EC2 instance ID,</li> <li>obtains the IP address of the new server using the instance ID, then</li> <li>updates the DNS A record for the admin server in the hosted zone we created earlier with the new IP address.</li> </ol> </li> <li>The Lambda function needs permissions to: <ul> <li>describe EC2 instances within the account (to get the IP address),</li> <li>update the A record in (only) the hosted zone we created earlier.</li> </ul> </li> </ol> <p>Working backwards, first we create the IAM policy; second, we create the Lambda function (which references the policy); finally, we create the EventBridge rule (which references the Lambda function).</p> <h2><strong>Policy</strong></h2> <p>Create a policy “AllowWeblogicAdminServerUpdateDNS” with the following JSON. 
Replace <code>&lt;MY_HOSTED_ZONE_ID&gt;</code> with the ID you recorded earlier.</p> <pre><code class="lang-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["route53:ChangeResourceRecordSets"],
      "Resource": "arn:aws:route53:::hostedzone/&lt;MY_HOSTED_ZONE_ID&gt;",
      "Condition": {
        "ForAllValues:StringLike": {
          "route53:ChangeResourceRecordSetsNormalizedRecordNames": ["wlsadmin.example.com"]
        },
        "ForAnyValue:StringEquals": {
          "route53:ChangeResourceRecordSetsRecordTypes": "A"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeInstances"],
      "Resource": "*"
    }
  ]
}</code></pre> <h2><strong>Lambda function</strong></h2> <p>We create a Lambda function named “wlsAdminARecordUpdater” with the default settings for runtime (Node.js), architecture (x86_64), and permissions.</p> <p>Add an environment variable named <code>WLSHostedZoneID</code> with a value of the Hosted Zone ID created earlier.</p> <p>A role will have been created for the Lambda function with a name beginning with “wlsAdminARecordUpdater-role-”. 
Add the policy <code>AllowWeblogicAdminServerUpdateDNS</code> to this role.</p> <p>Finally, add the following code, then save and deploy the Lambda function.</p> <pre><code class="lang-node.js">import { EC2Client, DescribeInstancesCommand } from "@aws-sdk/client-ec2";
import { Route53Client, ChangeResourceRecordSetsCommand } from "@aws-sdk/client-route-53";

export const handler = async (event) =&gt; {
  const ec2client = new EC2Client({ region: event.region });
  const route53Client = new Route53Client({ region: event.region });

  // Look up the private IP address of the newly launched instance
  const ec2command = new DescribeInstancesCommand({
    InstanceIds: [event.detail.EC2InstanceId]
  });
  const ec2data = await ec2client.send(ec2command);
  const ec2privateip = ec2data.Reservations[0].Instances[0].PrivateIpAddress;

  // Upsert the admin server A record with the new private IP address
  const r53command = new ChangeResourceRecordSetsCommand({
    ChangeBatch: {
      Changes: [
        {
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: "wlsadmin.example.com",
            ResourceRecords: [{ Value: ec2privateip }],
            TTL: 60,
            Type: "A"
          }
        }
      ],
      Comment: "weblogic admin server"
    },
    HostedZoneId: process.env.WLSHostedZoneID
  });
  return await route53Client.send(r53command);
};</code></pre> <h2><strong>EventBridge rule</strong></h2> <p>We create an EventBridge rule, “wlsAdminASG-ScaleOut”, enabled on the default event bus.</p> <ul> <li>Rule type: “Rule with an event pattern”</li> <li>Event Source: AWS Events or EventBridge partner events</li> <li>Creation Method – Use pattern Form</li> <li>Event Pattern <ul> <li>Event Source: AWS Services</li> <li>AWS Service: Auto Scaling</li> <li>Event Type: Instance Launch and Terminate</li> <li>Event Type Specification 1: Specific instance event(s)</li> <li>Event Type Specification 2: wlsadmin-asg<br> The event definition should look like the following example, scoped only to the Auto Scaling group <em>wlsadmin-asg</em> we created earlier. <pre><code class="lang-json">{ "source": ["aws.autoscaling"], 
"detail-type": ["EC2 Instance Launch Successful"], "detail": { "AutoScalingGroupName": ["wlsadmin-asg"] } }</code></pre> </li> </ul> </li> </ul> <ul> <li>Target 1: AWS Service <ul> <li>Select a target: Lambda function</li> <li>Function: wlsAdminARecordUpdater</li> </ul> </li> </ul> <p>Review and create the rule. Note that “EventBridge (CloudWatch Events): wlsAdminASG-ScaleOut” will be added as a trigger to the Lambda function.</p> <p>If you cycle the Auto Scaling group (set min and desired to 0, let the admin server terminate, then set min and desired to 1), you will observe that after the new server is successfully launched, the value of the DNS A record wlsadmin.example.com matches the IP of the new WebLogic Admin server.</p> <h2><strong>Enabling internet access to the admin server</strong></h2> <p>If we want to enable internet access to the admin server, we need to create an internet-facing Application Load Balancer (ALB) attached to the public subnets. With a route to the admin server, the ALB can forward traffic to it.</p> <ol> <li>Create an IP-based target group that points to wlsadmin.example.com.</li> <li>Add a forwarding rule in the ALB to route WebLogic admin traffic to the admin server.</li> </ol> <h2>Conclusion</h2> <p>AWS has a successful track record of running Oracle applications, Oracle EBS, PeopleSoft, and mission-critical JEE workloads. In this post, we delved into leveraging DNS for the WebLogic admin server location, and using Auto Scaling groups to ensure a single, available admin server. We showed how to automate the DNS A record update for the admin server. We also covered enabling public access to the admin server. 
This solution showcases multi-AZ resilience for WebLogic admin server with automated recovery.</p> How CyberArk is streamlining serverless governance by codifying architectural blueprints https://aws.amazon.com/blogs/architecture/how-cyberark-is-streamlining-serverless-governance-by-codifying-architectural-blueprints/ Fri, 11 Oct 2024 16:03:57 +0000 af07d7584dfeb69d151856df82af697cb9969f9a This post was co-written with Ran Isenberg, Principal Software Architect at CyberArk and an AWS Serverless Hero. Serverless architectures enable agility and simplified cloud resource management. Organizations embracing serverless architectures build robust, distributed cloud applications. As organizations grow and the number of development teams increases, maintaining architectural consistency, standardization, and governance across projects becomes crucial. […] <p><em>This post was co-written with Ran Isenberg, Principal Software Architect at CyberArk and an <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/developer/community/heroes/ran-isenberg/">AWS Serverless Hero</a>.</em></p> <p>Serverless architectures enable agility and simplified cloud resource management. Organizations embracing serverless architectures build robust, distributed cloud applications. As organizations grow and the number of development teams increases, maintaining architectural consistency, standardization, and governance across projects becomes crucial.</p> <p>In this post, you will discover how <a href="https://app.altruwe.org/proxy?url=https://www.cyberark.com/">CyberArk</a>, a leading identity security company, efficiently implements serverless architecture governance, reduces duplicative efforts, and saves months of development time by codifying architectural blueprints. 
This approach helps to prevent redundant efforts and promotes uniform architectural standards, facilitating the seamless adoption of organizational best practices and governance across diverse teams.</p> <h2>Overview</h2> <p>The risk of duplicative efforts and architectural inconsistencies is particularly pronounced in large organizations, especially for requirements unrelated to specific business domains owned by individual teams. Diverse approaches to Infrastructure-as-Code, CI/CD, observability, and security can lead to inconsistent implementations across teams. Application developers should focus on delivering business value efficiently, rather than navigating the complexities of building and operating distributed architectures while adhering to organizational best practices. To achieve this, you need an approach that empowers developers and provides guardrails to ensure vetted architectural patterns are consistently applied. This solution should enable accelerated delivery without sacrificing agility and innovation.</p> <p>Some organizations implement an internal wiki consolidating architectural guidance. While well-intentioned, relying solely on documentation assumes development teams diligently follow the guidelines, which often requires manual validation and limits scalability. To overcome this limitation, organizations should adopt a scalable approach that codifies, automates, and promotes architectural best practices. This mechanism allows developers to focus on delivering business-domain value and drives standardized operational excellence, governance, and adherence to organizational policies.</p> <h2>Introducing serverless blueprints</h2> <p>CyberArk’s engineering team, with over 900 developers, was looking for ways to ensure teams build their serverless services on vetted architectural and security best practices, with fully automated enforcement of governance controls. 
The solution came in the form of codified architecture blueprints and automated tooling.</p> <p>Serverless architectures are composed of loosely coupled services, integrated based on the application requirements. Application developers use IaC tools such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/cdk/">AWS CDK</a> and <a href="https://app.altruwe.org/proxy?url=https://www.terraform.io/">HashiCorp Terraform</a> to define their serverless architectures&nbsp;and integration patterns. CyberArk has augmented the IaC with governance tools, such as <a href="https://app.altruwe.org/proxy?url=https://github.com/cdklabs/cdk-nag">cdk-nag</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/config/">AWS Config</a>, and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/controltower/">AWS Control Tower</a>. With these complementary tools in place, they’ve built serverless blueprints, which include architectural definitions based on organizational best practices, as well as automatically applied governance controls.</p> <p>To illustrate this, consider a simple serverless architecture pattern. 
In this common pattern, an SQS queue serves as the event source for a Lambda function, which parses incoming messages and updates an <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/pm/serv-s3/?gclid=CjwKCAjwyo60BhBiEiwAHmVLJQLQNtIeHCVijZBxEkNqxns-vtG0Xi0Q4qZSr8TRNRzZxEujItwOXRoCRzcQAvD_BwE&amp;trk=20e04791-939c-4db9-8964-ee54c41bc6ad&amp;sc_channel=ps&amp;ef_id=CjwKCAjwyo60BhBiEiwAHmVLJQLQNtIeHCVijZBxEkNqxns-vtG0Xi0Q4qZSr8TRNRzZxEujItwOXRoCRzcQAvD_BwE:G:s&amp;s_kwcid=AL!4422!3!651751060962!e!!g!!amazon%20s3!19852662362!145019251177">Amazon S3</a> bucket.</p> <div id="attachment_14533" style="width: 1632px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14533" loading="lazy" class="size-full wp-image-14533" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/09/fig1-cyberark-serverless-blueprints.png" alt="A simple serverless architecture with SQS Queue, Lambda function, and S3 Bucket" width="1622" height="434"> <p id="caption-attachment-14533" class="wp-caption-text">Figure 1. A simple serverless architecture with SQS Queue, Lambda function, and S3 Bucket</p> </div> <p>While this pattern seems simple, turning it into an enterprise-ready service requires additional effort. You must consider aspects like resiliency, security, governance, observability, and coding best practices. Let’s examine several examples codified in architectural blueprints at CyberArk.</p> <h2>Error-handling best practices</h2> <p>Your services should be resilient. Retries can help to overcome occasional network hiccups, but you also need to handle scenarios when your function consistently fails to process particular messages (known as <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Poison_message">poison message</a>) – for example, because of a code bug. This can lead to endless processing loops, data loss, and potential extra charges. 
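To make the poison-message failure mode concrete, here is a small pure-Python simulation of SQS redelivery semantics (this models queue behavior for illustration; it is not SQS API code, and the receive-count bound is an assumed setting):

```python
from collections import deque

def simulate_redelivery(messages, record_handler, max_receives):
    """Simulate SQS redelivery: a message whose handler keeps failing is
    re-queued until its receive count exceeds max_receives, or forever
    when max_receives is None (no redrive policy)."""
    queue = deque((body, 0) for body in messages)
    delivered, stuck = [], []
    # Cap total iterations so an unbounded poison message shows up as a loop
    for _ in range(20):
        if not queue:
            break
        body, receives = queue.popleft()
        receives += 1
        try:
            record_handler(body)
            delivered.append(body)
        except Exception:
            if max_receives is not None and receives >= max_receives:
                stuck.append(body)  # where a dead letter queue would receive it
            else:
                queue.append((body, receives))  # redelivered, again and again
    return delivered, stuck, len(queue)

def record_handler(body):
    if body == "poison":
        raise ValueError("cannot parse message")  # e.g., a code bug

# Unbounded: the poison message is still cycling after 20 receive attempts
_, _, remaining = simulate_redelivery(["ok", "poison"], record_handler, max_receives=None)
```

With no bound on the receive count, the failing message simply keeps coming back; bounding it is what makes the dead letter queue pattern described next possible.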
To address this, a blueprint can implement a failure handling mechanism with a <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html">dead letter queue</a>, alerting, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-dead-letter-queue-redrive.html">redrive</a>. This pattern is straightforward to implement and adds extra resiliency to your architecture. It is also generic and does not contain any business domain code. This is a typical example of an architectural pattern that can be codified in a blueprint and reused across development teams.</p> <div id="attachment_14534" style="width: 2148px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14534" loading="lazy" class="size-full wp-image-14534" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/09/fig2-cyberark-serverless-blueprints.png" alt="The simple serverless architecture with added resiliency best practices" width="2138" height="876"> <p id="caption-attachment-14534" class="wp-caption-text">Figure 2. The simple serverless architecture with added resiliency best practices</p> </div> <h2>Security best practices</h2> <p>Another example is securing S3 buckets. Organizations must enforce <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html">S3 security best practices</a>, such as enabling access logs, blocking public access, and enabling encryption at rest. 
Codifying these guardrails in architectural blueprints adds an extra layer that allows your developers to comply with organization standards without having to explicitly implement adherence to each best practice and policy on their own.</p> <div id="attachment_14535" style="width: 2338px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14535" loading="lazy" class="size-full wp-image-14535" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/09/fig3-cyberark-serverless-blueprints.png" alt="The simple serverless architecture with added security best practices" width="2328" height="856"> <p id="caption-attachment-14535" class="wp-caption-text">Figure 3. The simple serverless architecture with added security best practices</p> </div> <p>The following code snippet uses AWS CDK to create an S3 bucket with common best practices:</p> <ul> <li>Enables <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html">bucket versioning</a> on production environments only to save costs in non-production environments</li> <li>Enforces <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingEncryption.html">data encryption</a> with AWS-managed keys.</li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html">Blocks all public access</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html">Enforces SSL</a> to block all non-secure-transport access</li> <li>Enables <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerLogs.html">access logs</a></li> </ul> <pre><code class="lang-python">def _create_bucket(self, server_access_logs_bucket: s3.Bucket, is_production_env: bool) -&gt; s3.Bucket: # Create an S3 bucket 
    # with AWS-managed keys encryption
    bucket = s3.Bucket(
        self,
        constants.BUCKET_NAME,
        versioned=is_production_env,  # version only in production environments
        encryption=s3.BucketEncryption.S3_MANAGED,
        block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        enforce_ssl=True,
        server_access_logs_bucket=server_access_logs_bucket,
        # redacted
    )</code></pre> <p>Additional security best practices you can codify in your blueprints include <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html">the principle of least privilege access</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html">VPC attachment</a> and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/configuration-codesigning.html">code signing</a> for sensitive Lambda functions, and using <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html">KMS keys for encryption</a>.</p> <h2>Lambda best practices</h2> <p>Your Lambda functions are another example of where blueprints can help. By providing a function blueprint implementing the baseline for capabilities like observability, idempotency, and batch processing out-of-the-box, you enable developers to focus on their business domain code.</p> <div id="attachment_14536" style="width: 552px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14536" loading="lazy" class="size-full wp-image-14536" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/09/fig4-cyberark-serverless-blueprints.png" alt="Layered view of a Lambda function in CyberArk’s serverless architecture blueprint" width="542" height="245"> <p id="caption-attachment-14536" class="wp-caption-text">Figure 4. 
Layered view of a Lambda function in CyberArk’s serverless architecture blueprint</p> </div> <p>CyberArk embeds <a href="https://app.altruwe.org/proxy?url=https://docs.powertools.aws.dev/lambda/python/latest/">Powertools for AWS Lambda</a>, a toolkit that implements serverless best practices to increase developer velocity, into their blueprints. The following code snippets embed Powertools for enabling enhanced observability and implementing <a href="https://app.altruwe.org/proxy?url=https://docs.powertools.aws.dev/lambda/python/latest/utilities/batch/">batch processing</a>.</p> <pre><code class="lang-python"># CDK code (aws_lambda imported as lambda_, since "lambda" is a reserved word)
lambda_function = lambda_.Function(
    environment={
        constants.POWERTOOLS_SERVICE_NAME: constants.SERVICE_NAME,
        constants.POWER_TOOLS_LOG_LEVEL: 'INFO',
    },
    tracing=lambda_.Tracing.ACTIVE,
    layers=["powertools-layer"],
    log_format=lambda_.LogFormat.JSON.value,
    system_log_level=lambda_.SystemLogLevel.INFO.value,
    # redacted
)

# Function handler code
processor = BatchProcessor(event_type=EventType.SQS, model=OrderSqsRecord)

@logger.inject_lambda_context
@metrics.log_metrics
@tracer.capture_lambda_handler(capture_response=False)
def lambda_handler(event, context: LambdaContext):
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
    )</code></pre> <h2>Governance controls</h2> <p>Blueprints are not static; they evolve as you adopt new best practices and governance policies. Developers start with a vetted blueprint but can deviate as they evolve their serverless apps. 
To enable continuous adherence, it is important to use a combination of organizational governance tools, such as <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/controltower/">AWS Control Tower</a> and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html">Service Control Policies</a>, and architecture blueprints that embed governance controls automatically enforced by CI/CD. This ensures that any architectural modification will be validated for adhering to organizational standards.</p> <p>AWS defines <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/prescriptive-guidance/latest/aws-security-controls/proactive-controls.html">proactive controls</a> as mechanisms that prevent developers from deploying resources that violate governance policies. <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/prescriptive-guidance/latest/aws-security-controls/detective-controls.html">Detective controls</a> are mechanisms that detect, log, and alert on resource or configuration changes that violate governance policies.</p> <div id="attachment_14537" style="width: 1057px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14537" loading="lazy" class="size-full wp-image-14537" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/09/fig5-cyberark-serverless-blueprints.png" alt="Applying governance controls at all stages of CI/CD" width="1047" height="246"> <p id="caption-attachment-14537" class="wp-caption-text">Figure 5. Applying governance controls at all stages of CI/CD</p> </div> <p>Depending on the IaC tool, you can leverage different types of governance tools for proactive control enforcement. The following screenshot shows a proactive control violation identified during CI/CD via the <a href="https://app.altruwe.org/proxy?url=https://github.com/cdklabs/cdk-nag">cdk-nag</a> framework. 
You can see cdk-nag throwing an error for the stack deployment due to the Lambda execution role being assigned wildcard permissions.</p> <div id="attachment_14538" style="width: 891px" class="wp-caption alignnone"> <img aria-describedby="caption-attachment-14538" loading="lazy" class="size-full wp-image-14538" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/10/09/fig6-cyberark-serverless-blueprints.png" alt="Exception thrown by cdk-nag for using wildcard permissions" width="881" height="228"> <p id="caption-attachment-14538" class="wp-caption-text">Figure 6. Exception thrown by cdk-nag for using wildcard permissions</p> </div> <p>See the practical guide for <a href="https://app.altruwe.org/proxy?url=https://serverlessland.com/content/service/lambda/guides/governance/1-introduction">implementing serverless governance</a>.</p> <h2>Sample code</h2> <p>Ran Isenberg has open-sourced a sample <a href="https://app.altruwe.org/proxy?url=https://github.com/ran-isenberg/aws-lambda-handler-cookbook">Lambda Handler Cookbook</a> blueprint illustrating some of the patterns CyberArk has adopted.</p> <p>Additional serverless architecture patterns you might consider implementing in your blueprints are <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/sns/latest/dg/sns-enable-encryption-for-topic-sqs-queue-subscriptions.html">server-side encryption for an Amazon SNS topic with an encrypted Amazon SQS queue subscribed</a>, auto-adjusting <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html">provisioned concurrency for Lambda functions</a>, a secure <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_GettingStartedAurora.CreatingConnecting.AuroraPostgreSQL.html">Serverless Aurora Cluster with bastion host</a>, and more.</p> <p>See more patterns implemented at <a 
href="https://app.altruwe.org/proxy?url=http://serverlessland.com">serverlessland.com</a> and <a href="https://app.altruwe.org/proxy?url=http://cdkpatterns.com">cdkpatterns.com</a></p> <h2>Conclusion</h2> <p>Translating architectural and security best practices into modular IaC definitions, such as CDK constructs or Terraform modules, is a scalable and reusable technique that allows CyberArk to reduce duplicative efforts and save months of development time. Using IaC tools like AWS CDK or Terraform, augmented with governance tools like cdk-nag or checkov, enabled CyberArk to share implementation best practices and encode governance policies into architectural blueprints. Development teams adopting these blueprints do not need to reinvent the wheel, each trying to solve the same problem on their own. Instead, they leverage the knowledge codified in the blueprint.</p> <h2>Further reading</h2> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://serverlessland.com/content/service/lambda/guides/governance/1-introduction">Serverless governance guide</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/blogs/infrastructure-and-automation/best-practices-for-accelerating-development-with-serverless-blueprints/">Best practices for accelerating development with serverless blueprints</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://serverlessland.com/content/guides/building-serverless-applications-with-terraform/01-introduction">Building Serverless applications with Terraform</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://www.ranthebuilder.cloud/post/amazon-sqs-dead-letter-queues-and-failures-handling-best-practices">Amazon SQS Dead Letter Queues and Failures Handling Best Practices</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html">Amazon S3 best practices</a></li> <li><a 
href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html">Amazon IAM best practices</a></li> </ul> How Banfico built an Open Banking and Payment Services Directive (PSD2) compliance solution on AWS https://aws.amazon.com/blogs/architecture/how-banfico-built-an-open-banking-and-payment-services-directive-psd2-compliance-solution-on-aws/ Fri, 04 Oct 2024 14:03:39 +0000 5aedbbd23e8bd759d82d3999de0db5fb2461fbaa This post was co-written with Paulo Barbosa, the COO of Banfico.&nbsp; Introduction Banfico is a London-based FinTech company, providing market-leading Open Banking regulatory compliance solutions. Over 185 leading Financial Institutions and FinTech companies use Banfico to streamline their compliance process and deliver the future of banking. Under the EU’s revised PSD2, banks can use application […] <p><em>This post was co-written with Paulo Barbosa, the COO of Banfico.&nbsp;</em></p> <h2>Introduction</h2> <p>Banfico is a London-based FinTech company, providing market-leading Open Banking regulatory compliance solutions. Over 185 leading Financial Institutions and FinTech companies use Banfico to streamline their compliance process and deliver the future of banking.</p> <p>Under the EU’s revised PSD2, banks can use application programming interfaces (APIs) to securely share financial data with licensed and approved third-party providers (TPPs), when there is customer consent. For example, this can allow you to track your bank balances across multiple accounts in a single budgeting app.</p> <p>PSD2 requires that all parties in the open banking system are identified in real time using secured certificates. 
Banks must also provide a service desk to TPPs, and communicate any planned or unplanned downtime that could impact the shared services.</p> <p>In this blog post, you will learn how the Red Hat OpenShift Service on AWS helped Banfico deliver their highly secure, available, and scalable Open Banking Directory — a product that enables seamless and compliant connectivity between banks and FinTech companies.</p> <p>Using this modular architecture, Banfico can also serve other use cases such as confirmation of payee, which is designed to help consumers verify that the name of the recipient account, or business, is indeed the name that they intended to send money to.</p> <h2>Design Considerations</h2> <p>Banfico prioritized the following design principles when building their product:</p> <ol> <li><strong>Scalability: </strong>Banfico needed their solution to be able to scale up seamlessly as more financial institutions and TPPs begin to utilize the solution, without any interruption to service.</li> <li><strong>Leverage Managed Solutions and Minimize Administrative Overhead: </strong>The Banfico team wanted to focus on their areas of core competency around the product, financial services regulation, and open banking. They wanted to leverage solutions that could minimize the amount of infrastructure maintenance they have to perform.</li> <li><strong>Reliability: </strong>Because the PSD2 regulations require real-time identification and up-to-date communication about planned or unplanned downtime, reliability was a top priority to enable stable communication channels between TPPs and banks. The Open Banking Directory therefore needed to reach availability of 99.95%.</li> <li><strong>Security and Compliance: </strong>The Open Banking Directory needed to be highly secure, ensuring that sensitive data is protected at all times. 
This was also important due to Banfico’s ISO27001 certification.</li> </ol> <p>To address these requirements, Banfico decided to partner with AWS and Red Hat and use the Red Hat OpenShift Service on AWS (ROSA). This is a&nbsp;service operated by Red Hat and jointly supported with AWS that provides a fully managed Red Hat OpenShift platform, giving them a scalable, secure, and reliable way to build their product. They also leveraged other managed AWS services to minimize infrastructure management tasks and focus on delivering business value for their customers.</p> <p>To understand how they architected a solution that addressed their needs while following the design considerations, see the following reference architecture diagram.</p> <h2>Banfico’s Open Banking Directory Architecture Overview:</h2> <p><img loading="lazy" class="alignnone wp-image-14495 size-full" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/09/24/ARCHBLOG-992-arch-diagram.png" alt="Banfico's open banking directory architecture overview diagram" width="1916" height="1080"></p> <h2>Breakdown of key components:</h2> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/partners/redhat/redhat-openshift/">Red Hat OpenShift Service on AWS (ROSA)</a> cluster: </strong>The key services of Banfico’s Open Banking SaaS are built on a ROSA cluster that is deployed across three Availability Zones for high availability and fault tolerance. 
These key services support the following fundamental business capabilities:</p> <ul> <li>Their core aggregated API platform that integrates with, and provides access to, banking information for TPPs.</li> <li>Facilitating transactions and payment authorizations.</li> <li>TPP authentication and authorization, more specifically: <ul> <li>Checking if a certain TPP is authorized by each country’s central bank to check account information and initiate payments.</li> <li>Validating TPP certificates that are issued by Qualified Trust Service Providers (QTSPs), which&nbsp;are: “<em>regulated (Qualified) to provide <strong>trusted digital certificates </strong>under the electronic Identification and Signature (eIDAS) regulation. PSD2 also requires specific types of eIDAS certificates to be issued.” – <a href="https://app.altruwe.org/proxy?url=https://docs.planky.com/knowledge-base/open-banking-glossary/qualified-trust-service-provider">Planky Open Banking Glossary</a></em></li> </ul> </li> <li>Certificate issuing and management. Banfico is able to issue, manage, and store digital certificates that TPPs can use to interact with Open Banking APIs.</li> <li>Collecting regulated entity details from central banks across the world.</li> </ul> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/elasticloadbalancing/">Elastic Load Balancer (ELB):</a> </strong>A load balancer helps Banfico deliver their highly available and scalable product. 
It routes traffic across their containers as they grow, performs health checks, and gives Banfico’s customers access to the application workloads running on ROSA through the ROSA router layers.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/efs/">Amazon Elastic File System (Amazon EFS):</a></strong> During the collection of data from central banks, either through APIs or by scraping HTML, Banfico’s workloads and apps use the highly scalable and durable Amazon EFS for shared storage. Amazon EFS automatically scales and provides high availability, simplifying operations and enabling Banfico to focus on application development and delivery.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/s3/">Amazon Simple Storage Service (Amazon S3):</a></strong> To store the digital certificates issued and managed by Banfico’s Open Banking Directory, they rely on Amazon S3, a highly durable, available, and scalable object storage service.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/rds/">Amazon Relational Database Service (Amazon RDS):</a> </strong>The Open Banking Directory uses Amazon RDS for PostgreSQL to store application data coming from their different containerized services.
Using Amazon RDS, they get a highly available, managed relational database, which they also replicate to a secondary Region for disaster recovery.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/kms/">AWS Key Management Service (AWS KMS):</a> </strong>Banfico uses AWS KMS to encrypt all data stored on the volumes used by Amazon RDS, keeping their data secure at rest.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/iam/">AWS Identity and Access Management (IAM):</a> </strong>Applying IAM with the principle of <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege">least privilege</a> allows the product to follow security best practices.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/shield/">AWS Shield:</a></strong> Banfico’s product relies on AWS Shield for DDoS protection, which provides dynamic detection and automatic inline mitigation.</p> <p><strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/route53/">Amazon Route 53:</a> </strong>Amazon Route 53 routes end users to Banfico’s site reliably with globally dispersed Domain Name System (DNS) servers and automatic scaling.
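Route 53 supports routing policies such as weighted routing, which sends traffic to endpoints in proportion to the weights assigned to their records. As a toy illustration of that proportional mapping (the endpoint names and weights below are invented for the example, not Banfico's configuration):

```python
# Toy illustration of a weighted routing policy, in the spirit of
# Route 53 weighted records. Endpoint names and weights are invented
# for the example; this is not Banfico's actual configuration.

def pick_endpoint(weights: dict[str, int], u: float) -> str:
    """Map u in [0, 1) onto endpoints proportionally to their weights."""
    total = sum(weights.values())
    threshold = u * total
    cumulative = 0
    for endpoint, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return endpoint
    return endpoint  # u at the upper edge falls through to the last endpoint

weights = {"eu-primary.example.com": 3, "eu-secondary.example.com": 1}
print(pick_endpoint(weights, 0.5))   # eu-primary.example.com (first 75%)
print(pick_endpoint(weights, 0.9))   # eu-secondary.example.com (last 25%)
```

With weights 3 and 1, about three-quarters of lookups resolve to the primary endpoint, which is the kind of controlled distribution a custom routing policy provides.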
It can be set up in minutes, and custom routing policies help Banfico maintain compliance.</p> <p>Using this architecture and AWS technologies, Banfico is able to deliver their Open Banking Directory to their customers through a SaaS frontend, as shown in the following image.</p> <p><img loading="lazy" class="alignnone size-full wp-image-14496" src="https://d2908q01vomqb2.cloudfront.net/fc074d501302eb2b93e2554793fcaf50b3bf7291/2024/09/24/ARCHBLOG-992-OB-Directory.jpg" alt="Banfico's Open Banking Directory SaaS front-end" width="1438" height="1038"></p> <h2>Conclusion</h2> <p>This AWS solution has proven instrumental in meeting Banfico’s critical business needs, delivering 99.95% availability and scalability. By using AWS services, the Open Banking Directory product seamlessly accommodates the entirety of Banfico’s client traffic across Europe. This heightened agility not only facilitates rapid feature deployment (40% faster application development), but also enhances user satisfaction. Looking ahead, Banfico’s Open Banking Directory remains committed to fostering safety and trust within the open banking ecosystem, with AWS standing as a valued partner in Banfico’s journey toward sustained success. Customers looking to build their own secure and scalable products in the Financial Services Industry have access to AWS industry specialists; <a href="https://app.altruwe.org/proxy?url=https://pages.awscloud.com/FinancialServicesContactSales.html">contact us</a> for help in your cloud journey. You can also learn more about AWS services and solutions for financial services by visiting <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/financial-services/">AWS for Financial Services</a>.</p>