<![CDATA[Stories by Pascal Euhus on Medium]]> https://medium.com/@pascal-euhus?source=rss-762da639aba6------2 https://cdn-images-1.medium.com/fit/c/150/150/1*LRru0wVG9DMZpbIDWTvI4g.png Stories by Pascal Euhus on Medium https://medium.com/@pascal-euhus?source=rss-762da639aba6------2 Medium Fri, 13 Sep 2024 20:35:53 GMT <![CDATA[8 Symptoms of Ineffective Cloud Cost Management and How to Fix Them]]> https://faun.pub/8-symptoms-of-ineffective-cloud-cost-management-and-how-to-fix-them-20625314e1a7?source=rss-762da639aba6------2 https://medium.com/p/20625314e1a7 Mon, 29 Jul 2024 15:01:48 GMT 2024-08-14T18:23:46.044Z
Image by DALL-E 3

Several studies, such as the Cloud Spend Optimization Report 2024 by Vertice, find that organizations waste up to a third of their cloud spend. The report also shows that 43% of the participating organizations have not yet implemented a cost management strategy. Implementing a cost management strategy, i.e. establishing FinOps practices, is a vital part of cloud governance, but it is not a trivial task. The FinOps methodology is a combination of systems, best practices, and culture that aims to optimize cloud spend and provide transparency. It is a cross-functional approach that involves finance, engineering, and business teams. The goal is to ensure that the organization gets the most value out of its cloud spending and has the data to make informed business decisions for future investments.
AWS provides a comprehensive set of tools that help you collect and visualize the data that you need to implement cost management effectively.¹

The most crucial part is the adoption of the Cloud Intelligence Dashboards Framework. It’s an open-source framework, lovingly cultivated and maintained by AWS, that gives customers the power to get high-level and granular insight into their cost and usage data.²

To check whether your organization is managing cloud costs effectively, and to see how to fix the gaps, here are eight common symptoms that indicate room for improvement:

1. No cost governance at all

The organization has no idea where its cloud spending is going. There is no cost governance in place, and no one is responsible for managing cloud costs on an ongoing basis. Even if the actual cloud spend is within budget, an organization without cost governance cannot optimize its spend and tends to overspend and waste money. One common reason for the absence of cost governance is the claim that money is not a pressing concern because the business has been running profitably for years. This is a dangerous assumption and comes with a big risk: there will be times when the organization faces financial difficulties and is forced to cut costs, and building a cost governance model takes time and knowledge, both of which become scarce under pressure. You may not put all your effort into cost governance while you have other priorities, but you should at least keep evolving your cost governance model.
A minimum viable cost governance model should include a monthly cost review, checks for cost anomalies, and checks for cost optimization opportunities.

You should be able to answer questions like:

  • What are the top cost drivers?
  • Why did the cost increase or decrease? Take a look at the top 5 movers and bottom 5 movers.
  • If you detect a cost anomaly, investigate it and take action. Put measures in place to prevent it and check again next month.

2. No internal chargeback model

Big organizations often have multiple teams or even business units that use cloud resources, typically in a federated AWS account setup with consolidated billing. Without an internal chargeback model, teams have little to no incentive to optimize cloud spend. Especially when Savings Plans are in place, top spenders get the discounts while everyone else pays the full price. An internal chargeback model is a way to allocate cloud costs to the teams that use the resources and to distribute savings fairly. It helps to create cost awareness and accountability.
To start implementing a chargeback model, you can use AWS cost allocation tags. Tag resources with the team name or business unit and use AWS Cost Explorer to create cost allocation reports. Also gather an overview of common shared costs, such as shared network components, DDoS protection service fees, and discounts that you can distribute to the teams. The chargeback model aims to be a fair and transparent way to distribute costs and savings, regardless of workload size, and helps to unveil the true cost of cloud resources per workload.
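If you define your infrastructure with the AWS CDK, applying such tags consistently is straightforward. Below is a minimal sketch, assuming a hypothetical stack and placeholder tag values; remember that user-defined tags must also be activated as cost allocation tags in the Billing console before they show up in reports.

import { App, Stack, Tags } from "aws-cdk-lib";

const app = new App();
const stack = new Stack(app, "CheckoutServiceStack"); // hypothetical stack name

// Tag every taggable resource in the stack so costs can be allocated per team/business unit.
Tags.of(stack).add("CostCenter", "checkout-team"); // placeholder values
Tags.of(stack).add("BusinessUnit", "e-commerce");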

3. Avoiding commitment

Organizations that are reluctant to commit to a baseline of consumed resources are missing out on significant cost savings. Reserved Instances and Savings Plans are a great way to save money compared to on-demand pricing. Organizations migrating from on-premises were used to committing to a fixed amount of resources for long periods (CAPEX), but strictly chasing the promised land of "you pay only for what you use" (OPEX) is not always the best choice either.
Typically, the answer is a balance between both, but finding the right optimum is tricky and requires quite some knowledge and experience.

As a rule of thumb, your baseline production traffic is ideally covered by Reserved Instances and Savings Plans. Reserved Instances are a good choice for databases (where they are the only commitment option), while compute resources are better covered by Savings Plans, which are more flexible and apply to any EC2, Fargate, or Lambda usage. Constantly monitor your usage and adjust your Reserved Instances and Savings Plans accordingly so you do not waste money.

Savings Plans should ideally cover a constant 90–95% of your total compute usage. If you are above 95%, you risk a situation where your Savings Plans cover more than you actually use; if you are below 90%, you are likely missing out on potential savings.
Since this is a complex topic, organizations with federated AWS accounts should manage Reserved Instances and Savings Plans centrally. This way, you can optimize their usage across all accounts and, together with the aforementioned internal chargeback model, distribute the savings fairly across all teams.

4. No cost awareness during development

Developers are often not aware of the cost implications of their code, and architects rarely treat cost optimization as a first-class concern. Rapid prototypes are built as if they had to handle production-like traffic.

To mitigate this, create a cost-aware culture in your organization. Infrastructure-as-Code makes it easy to bootstrap, scale, and delete environments. Shut down development environments outside business hours when they are not in use, and implement automated account reset/nuke mechanisms for development environments. Use AWS Budgets to set cost limits and alerts, and implement strict remediation rules.
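As a rough sketch of such a guardrail in CDK, the snippet below creates a monthly cost budget with an e-mail alert at 80% of actual spend; the budget amount, names, and address are placeholders.

import { App, Stack } from "aws-cdk-lib";
import { CfnBudget } from "aws-cdk-lib/aws-budgets";

const stack = new Stack(new App(), "GovernanceStack"); // hypothetical stack

// Monthly cost budget that notifies the FinOps mailbox at 80% of actual spend.
new CfnBudget(stack, "DevAccountBudget", {
  budget: {
    budgetName: "dev-account-monthly", // placeholder
    budgetType: "COST",
    timeUnit: "MONTHLY",
    budgetLimit: { amount: 500, unit: "USD" }, // placeholder limit
  },
  notificationsWithSubscribers: [
    {
      notification: {
        notificationType: "ACTUAL",
        comparisonOperator: "GREATER_THAN",
        threshold: 80,
        thresholdType: "PERCENTAGE",
      },
      subscribers: [{ subscriptionType: "EMAIL", address: "finops@example.com" }], // placeholder
    },
  ],
});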

5. Burn money on development environments

Development environments typically do not need to run 24/7; they are only used during business hours. On-prem there is little incentive to shut them down when idle, but in the cloud this directly translates into unnecessary costs. To mitigate this, implement efficient autoscaling for development environments and shut them down outside business hours. The AWS Instance Scheduler³ is a great way to automate this and typically yields more savings than you could achieve purely with Reserved Instances and Savings Plans.
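Instance Scheduler works tag-based: resources are opted in via a tag whose key and value must match your scheduler configuration. A minimal CDK sketch, assuming an existing EC2 instance construct and a hypothetical schedule named "office-hours" defined in the solution's configuration:

import { Tags } from "aws-cdk-lib";
import { Instance } from "aws-cdk-lib/aws-ec2";

declare const devInstance: Instance; // an instance defined elsewhere in your stack

// "Schedule" is the solution's default tag key; the value must match a configured schedule.
Tags.of(devInstance).add("Schedule", "office-hours");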

6. Not continuously improving

The capabilities and services of cloud providers are constantly evolving. New instance types typically offer better performance at a lower price and emerging features change how you can architect solutions.

Organizations that do not continuously revise and improve their cloud solutions are missing further saving opportunities. To exchange Reserved Instances before the term ends, you can sell them on the AWS Reserved Instance Marketplace⁴. Savings Plans offer much more flexibility and let you easily switch instance types, although you may need to adapt them to new coverage requirements over time. AWS Compute Optimizer can help you spot improvement opportunities: it provides right-sizing recommendations and helps you identify underutilized resources.
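For example, once Compute Optimizer is opted in for the account, you can pull its right-sizing recommendations with the AWS CLI and feed them into your monthly cost review:

aws compute-optimizer get-ec2-instance-recommendations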

7. Constantly running on oversized production environments

Coming from the on-prem world, where you had to plan for peak traffic, organizations tend to oversize their production environments. This may make sense in a lift-and-shift migration, but to fully leverage the benefits of the cloud you should right-size your production environments as soon as possible.

Organizations tend to postpone that task out of fear of breaking things, or simply because they do not trust their system or the scaling capabilities of the cloud provider. They would rather run an unoptimized workload than invest in optimization. However, the only way to constantly evolve and adapt is to embrace a working culture that allows teams to fail fast. Services like AWS Compute Optimizer can help you right-size your instances and identify underutilized resources, mitigating the risk of breaking things or optimizing the wrong end.

8. Copying other organizations’ architecture

Even if your workload is similar to that of another organization, or of your competitor, copying their architecture is usually not the best choice.

Context always matters, and every business that is meant to stay needs at least one unique selling point that sets it apart from its competitors. Always evaluate your own requirements and constraints and design your architecture accordingly, and keep reviewing your strategy and architecture so you can adapt to an ever-changing environment. If you copy the success of others, you may also copy their failures, along with optimizations made for their specific use case. Strive to find the best solution for your own use case. This does not mean you cannot learn from others, but keep evaluating whether their solution is really the best fit for you.

Implementing a cost management strategy is a vital part of cloud governance and helps you optimize cloud spend. The eight symptoms above are common in organizations that are not managing cloud costs effectively.

They are a starting point for evaluating your own cost management strategy. Once you understand why you are wasting money on cloud resources, you can take action to fix it.

Thanks for reading, and happy cost optimization!
I am happy to receive your feedback and answer your questions.

[1] Data collection framework

[2] CUDOS demo dashboards

[3] AWS Instance scheduler

[4] Reserved Instances Marketplace



8 Symptoms of Ineffective Cloud Cost Management and How to Fix Them was originally published in FAUN — Developer Community 🐾 on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[How to safely recreate a CDK-baked DynamoDB table using S3 backups]]> https://pascal-euhus.medium.com/how-to-safely-recreate-a-cdk-baked-dynamodb-table-using-s3-backups-27a005a53e81?source=rss-762da639aba6------2 https://medium.com/p/27a005a53e81 Fri, 26 Jul 2024 14:44:25 GMT 2024-07-26T14:44:25.716Z
Image by DALL-E 3

I love working with AWS CDK, but some things get nasty because of the way Cloudformation works and the way it has been designed. One of those things is performing an update on a critical infrastructure component, such as a DynamoDB table, that requires a replacement.

This is a common scenario when you want to change the key schema of a table or want to rename it¹.

Note that this approach requires a write stop on the table for the duration of the migration. If you need a zero-downtime migration, consider using DynamoDB Streams to replicate the data to the new table instead.
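The streaming variant is out of scope here, but as a rough sketch: a Lambda subscribed to the old table's stream could forward every change to the new table, roughly like this (the target table name is a placeholder, and mapping items to a changed key schema is use-case specific):

import { DynamoDBStreamEvent } from "aws-lambda";
import {
  AttributeValue,
  DeleteItemCommand,
  DynamoDBClient,
  PutItemCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});
const TARGET_TABLE = process.env.TARGET_TABLE ?? "MyNewTable"; // placeholder

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    if (record.eventName === "REMOVE" && record.dynamodb?.Keys) {
      // propagate deletes using the key image from the stream record
      await client.send(new DeleteItemCommand({
        TableName: TARGET_TABLE,
        Key: record.dynamodb.Keys as Record<string, AttributeValue>,
      }));
    } else if (record.dynamodb?.NewImage) {
      // INSERT/MODIFY: write the new item image to the target table;
      // if the key schema changed, derive the new key attributes here
      await client.send(new PutItemCommand({
        TableName: TARGET_TABLE,
        Item: record.dynamodb.NewImage as Record<string, AttributeValue>,
      }));
    }
  }
};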

DynamoDB tables integrate well with S3. You can export to S3 and restore data from S3. Instead of performing destructive updates on the production table, you can create a new table with the desired configuration and then restore the data from the S3 backup.

With the power of CDK, you can build the new table with data in parallel to the existing table and then update the references (like Lambda functions) to the new table.

The system under migration is a simple DynamoDB table with a single partition key and no sort key. A Lambda function stores and retrieves data from the table.

The table is defined in the CDK stack like this:

export class MyExample extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // create the table
    const myTable = new Table(this, "MyTable", {
      partitionKey: { name: "Name", type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,
      tableName: "MyTable",
      removalPolicy: RemovalPolicy.RETAIN,
      deletionProtection: true,
      pointInTimeRecovery: true,
      timeToLiveAttribute: "ExpiresAt",
    });

    // create the Lambda function
    const consumer = new NodejsFunction(this, "MyFunction", {
      functionName: "HelloWorld",
      entry: `functions/helloWorld.ts`,
      handler: `handler`,
      runtime: Runtime.NODEJS_20_X,
      architecture: Architecture.ARM_64,
      bundling: {
        minify: true,
      },
      logGroup: new LogGroup(this, `HelloWorldLogGroup`, {
        retention: RetentionDays.ONE_WEEK,
        logGroupName: `/aws/lambda/example/hello-world`,
      }),
      loggingFormat: LoggingFormat.JSON,
      applicationLogLevelV2: ApplicationLogLevel.INFO,
    });

    // grant the Lambda function read/write access to the table
    myTable.grantReadWriteData(consumer);
  }
}

Now, let’s say we want to change the partition key of the table to UUID. We can’t do this with a simple update because it requires a replacement.

The idea is to set up the new table with the desired configuration and then restore the data from the old table into it.

Once confirmed that the new table is working as expected, we can update the references to the new table and delete the old table.

For that, we need to export the data from the old table to S3 and then import the data from S3 to the new table.

You can only export data to S3 if you have enabled point-in-time recovery on the table. This is a non-destructive operation, and you can set it up at any time.
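If the table does not have point-in-time recovery enabled yet, you can switch it on with the AWS CLI (or set pointInTimeRecovery: true in CDK, as in the example above):

aws dynamodb update-continuous-backups --table-name MyTable \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true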

First, we need to export the data from the old table to S3. You can either do this manually in the web console or use the AWS CLI.

Make sure you set up an S3 bucket in advance to store the data.

aws dynamodb export-table-to-point-in-time --table-arn <YOUR_TABLE_ARN> \
--s3-bucket <YOUR_BUCKET_NAME> --export-format DYNAMODB_JSON

After running this command, you will see a new folder in the S3 bucket containing the exported table data.
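The export runs asynchronously. The command returns an export ARN, which you can poll until the status is COMPLETED:

aws dynamodb describe-export --export-arn <YOUR_EXPORT_ARN>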

Bear in mind that, from now on, any write to the old table will not be included in the export and hence will not be restored to the new table.

The S3 bucket we used to store the data in this example is called “MyTableBackupBucket”. We use this bucket name to fetch the data from the bucket in the CDK stack and pass it to the importSource configuration of the new table.

First, we set up the new table with the desired configuration and import the data from the old table. Everything else stays the same.

export class MyExample extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // create a separate table with the new configuration and import the data from the old table
    const myNewTable = new Table(this, "MyNewTable", {
      partitionKey: { name: "UUID", type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,
      tableName: "MyNewTable",
      removalPolicy: RemovalPolicy.RETAIN,
      deletionProtection: true,
      pointInTimeRecovery: true,
      timeToLiveAttribute: "ExpiresAt",
      importSource: {
        bucket: Bucket.fromBucketName(this, "ImportSourceBucket", "MyTableBackupBucket"),
        inputFormat: InputFormat.dynamoDBJson(),
      },
    });

    // create the table
    const myTable = new Table(this, "MyTable", {
      partitionKey: { name: "Name", type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,
      tableName: "MyTable",
      removalPolicy: RemovalPolicy.RETAIN,
      deletionProtection: true,
      pointInTimeRecovery: true,
      timeToLiveAttribute: "ExpiresAt",
    });

    // create the Lambda function
    const consumer = new NodejsFunction(this, "MyFunction", {
      functionName: "HelloWorld",
      entry: `functions/helloWorld.ts`,
      handler: `handler`,
      runtime: Runtime.NODEJS_20_X,
      architecture: Architecture.ARM_64,
      bundling: {
        minify: true,
      },
      logGroup: new LogGroup(this, `HelloWorldLogGroup`, {
        retention: RetentionDays.ONE_WEEK,
        logGroupName: `/aws/lambda/example/hello-world`,
      }),
      loggingFormat: LoggingFormat.JSON,
      applicationLogLevelV2: ApplicationLogLevel.INFO,
    });

    // grant the Lambda function read/write access to the table
    myTable.grantReadWriteData(consumer);
  }
}

After changing our Stack we need to deploy the new change set using

cdk deploy

When the new table is created and the data is imported, you can review the new table configuration.

Once you are confident that the new table is working as expected, you can update the references to the new table and delete the old table.

export class MyExample extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // create a separate table with the new configuration and import the data from the old table
    const myNewTable = new Table(this, "MyNewTable", {
      partitionKey: { name: "UUID", type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,
      tableName: "MyNewTable",
      removalPolicy: RemovalPolicy.RETAIN,
      deletionProtection: true,
      pointInTimeRecovery: true,
      timeToLiveAttribute: "ExpiresAt",
      importSource: {
        bucket: Bucket.fromBucketName(this, "ImportSourceBucket", "MyTableBackupBucket"),
        inputFormat: InputFormat.dynamoDBJson(),
      },
    });

    // create the Lambda function
    const consumer = new NodejsFunction(this, "MyFunction", {
      functionName: "HelloWorld",
      entry: `functions/helloWorld.ts`,
      handler: `handler`,
      runtime: Runtime.NODEJS_20_X,
      architecture: Architecture.ARM_64,
      bundling: {
        minify: true,
      },
      logGroup: new LogGroup(this, `HelloWorldLogGroup`, {
        retention: RetentionDays.ONE_WEEK,
        logGroupName: `/aws/lambda/example/hello-world`,
      }),
      loggingFormat: LoggingFormat.JSON,
      applicationLogLevelV2: ApplicationLogLevel.INFO,
    });

    // grant the Lambda function read/write access to the new table
    myNewTable.grantReadWriteData(consumer);
  }
}

Deploy the changes once again using

cdk deploy

That's it. You have successfully migrated your DynamoDB table. Mind that the example sets the removal policy to RETAIN, which means the table will not be deleted when it is removed from the stack. You can now delete the old table manually via the web console or the AWS CLI. If you specified a more destructive removal policy, the old table may already have been deleted when you deployed the latest change set.

The new table specifies the importSource configuration to import the data from the S3 bucket. Changes to that configuration will not trigger a replacement of the table, even though the importSource configuration is only used during table creation.

Feel free to reach out if you have any questions or feedback.

[1] According to the official docs, it is recommended to let Cloudformation name your resources and put a custom name into resource tags. However, the reality is that people, myself included, tend to name things explicitly because the web console looks better with human-readable names.

]]>
<![CDATA[Implementing an AWS Account Vending Machine]]> https://tech.new-work.se/implementing-an-aws-account-vending-machine-217a01b42b83?source=rss-762da639aba6------2 https://medium.com/p/217a01b42b83 Tue, 13 Dec 2022 13:33:45 GMT 2024-05-27T20:42:50.198Z Automated AWS Account creation on enterprise scale

Abstract

This article demonstrates how to automate the creation of AWS accounts at enterprise scale and what the benefits are. This approach is based on AWS Control Tower.

What Account Vending Machine is about

Control Tower is a service for centrally administering account management, compliance, and security, and simplifying it in part by providing predefined rule sets. In turn, Control Tower is built on various AWS services and orchestrates them. These services are mainly AWS Organizations, Service Catalog and Cloudformation StackSets. At present, Control Tower does not offer its own API for automated account creation. However, the Service Catalog API can be used to create AWS accounts that are automatically managed by Control Tower.

Oftentimes, there are minimum requirements for a (new) AWS account, and these are highly linked to the context of the company. Compliance requirements and the characteristics of security policies in particular vary quite strongly between enterprises. The scope of what is delivered to the (mostly internal) customers also varies, so a fully configured network setup may or may not be part of a basic account.

With Control Tower, AWS Organizations and Cloudformation StackSets, such requirements can be implemented easily.

If you look at Control Tower from the perspective of an (internal) service provider that wants to offer AWS accounts as a service, you quickly run into a few pitfalls. In particular, account creation, which runs on top of the Service Catalog, has to be done sequentially, which makes it difficult to fully automate the process. The official AWS documentation refers to a blog post that addresses this issue.

That solution is based on ad hoc batch provisioning: you collect accounts to be provisioned and then feed an automation with a list that provisions AWS accounts according to the input. However, if you want to provision an AWS account as an on-demand service, where a user orders an account and gets one delivered minutes later, this solution has its weaknesses. The approach of using a DynamoDB table as a lock to guarantee that AWS accounts are only created sequentially, however, sounds promising.

The Account Vending Machine should meet the following criteria:

  • Short turnaround times for AWS account creation
  • Fully automated, following a manual approval process
  • There must be a distinction between personalized playground accounts and staging/production accounts
  • A user should be able to order an Account via web UI

What we already had at New Work

Automated account creation was not a new topic for us; (partial) automation already existed, but it predated Control Tower and used the APIs of AWS Organizations and Cloudformation directly. The solution, a Lambda that ran for about 10 minutes, did its job but lacked features and automation. Furthermore, a single Lambda function with many API calls stacked on top of each other was not particularly error-resistant. Given the amount of maintenance and extension effort we would have had to spend on the existing solution, we decided on a new implementation.

The cornerstones of the Account Vending Machine

The setup should be reproducibly described with Infrastructure-as-Code. HashiCorp's Terraform has been the most common choice for us to date. Since Control Tower was to be used for account management in the future, and since it is heavily based on Cloudformation, we decided against Terraform and instead opted for the Cloud Development Kit (CDK), which transpiles to Cloudformation. This also allowed us to keep our tech stack homogeneous: the business logic would be implemented in Typescript and the infrastructure through the CDK, so we could use the same ecosystem for automated testing, code analysis, and so on, regardless of whether we were writing infrastructure code or business logic.

Typescript was chosen for the business logic primarily because of the broad support for NodeJS in the AWS environment and because of the existing internal know-how.

Since the Account Vending Machine can be seen as a direct extension of Control Tower, the easiest setup is to deploy it in the root (management) account as well. This allows access to the Control Tower native mechanisms and roles needed to manage and initialise accounts. Deploying outside the root account is possible, but requires a lot of configuration to forward events to the member account and still requires mechanisms that allow access to all accounts.

The User Interface

The best system is only as good as its usability and accessibility for the end user. Internally, the necessary approval process for an account order was already implemented and covered via Jira. To keep the effort low and preserve the comfort for the end user, Jira should continue to act as the UI, so there was no change in the process for the customer. After approval, the Account Vending Machine is triggered via a Jira Automation workflow that calls an internal REST API with data from the ticket. Making the whole process ticket-based also offers a simple feedback mechanism at transaction level for the customer: she can always track the status of the order based on the information provided on the ticket, up to the complete delivery.

The architecture

The whole stack is based on serverless functions, which is appropriate for this particular use case because we do not expect the system to run all the time. All components are decoupled and glued together via events. Fig. 1 depicts the components used for the implementation.

Fig. 1 Technical overview

If a customer orders an account via the UI (in our case, she raises a Jira ticket), the following process (Fig. 2) starts. The whole system is running as long as there are open account orders in the database.

Fig. 2 Sequence diagram, account ordering

Account orders are received via a REST API, and a validator checks the input based on JSON Schema. Since a valid input directly places an order in the order database (DynamoDB), we deliberately did not implement the validation logic at the API Gateway level, to have better control over it. A DynamoDB stream with a filter on the new and completed provisioning status fields triggers the Dispatcher. The Dispatcher checks that no order is already in the PROVISIONING status and, if so, creates a message for the Account Creator. The message triggers the Account Creator, which creates a lock for the order in the database. After placing the lock, an API call is made to the Service Catalog, which in turn creates an AWS account and registers it with Control Tower.
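The lock itself can be as simple as a conditional write to DynamoDB. The article does not publish the implementation, but a minimal sketch could look like this (the table and attribute names are hypothetical):

import {
  ConditionalCheckFailedException,
  DynamoDBClient,
  PutItemCommand,
} from "@aws-sdk/client-dynamodb";

const db = new DynamoDBClient({});

// Returns true if the lock was acquired, false if another order is already provisioning.
export const acquireProvisioningLock = async (orderId: string): Promise<boolean> => {
  try {
    await db.send(new PutItemCommand({
      TableName: "account-order-locks", // hypothetical table
      Item: {
        LockId: { S: "provisioning" },
        OrderId: { S: orderId },
      },
      // the write only succeeds if no lock item exists yet
      ConditionExpression: "attribute_not_exists(LockId)",
    }));
    return true;
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return false;
    }
    throw err;
  }
};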

If an account was successfully registered, Control Tower emits a corresponding EventBridge event, which is used as the trigger for the Orchestrator (a Step Function). The actual account creation was deliberately not included in the Orchestrator because, at the time of implementation, waiting for events within a Step Function was not natively possible (the alternative would have been to implement this logic ourselves via Lambda, which seemed less intuitive and transparent than the existing solution). When the Orchestrator has done its work, it sends the result (SUCCESS/FAIL) to an SNS topic, where, among other components, a mailer (SES-based) is registered to notify the customers. Since the initial process relies on a Jira ticket for the account order, a Lambda can hook in at that point, automatically update the corresponding ticket with the relevant information (e.g. accountId and account alias), and close it.

Essentially, all inter-system communication is based on (internal) events and (DynamoDB) streams. This not only ensures a strong decoupling of the individual sub-steps (ordering, creation, and individualization), it also simplifies the provisioning and implementation of monitoring and alerting. It also makes it easier to test and develop the individual sub-functions, since their input is based on standardized formats, which means consumer-driven contracts can be used extensively in development. In this context, a function defines which output the preceding sub-step must provide.

Using an orchestrator for account customization

The third step in the process is the customization of the account to meet the company’s internal requirements. This includes, for example, the enforcement of additional compliance policies, the creation of an account alias that is constrained by a fixed naming convention, and the tagging of accounts and resources for cost allocation and budget management.

The Orchestrator (Fig. 3) is based on AWS Step Functions and therefore provides a state machine for sequential steps, but also the ability to parallelize as much as possible to keep the overall turnaround time to a minimum. Retry mechanisms and error handling work very comfortably with the on-board mechanisms of Step Functions. Furthermore, many AWS services are natively integrated, which often spares boilerplate code in the form of Lambda functions. Some customization, like setting an account alias, is deployed as a Cloudformation custom resource; within the Step Function you need to poll the rollout of such resources to get their actual status.

Fig 3 Orchestrator, that handles account customization

The next steps

The most obvious extension is to support additional account lifecycle events, such as account closures. This is straightforward thanks to the AWS Organizations API, but there are a few things to keep in mind when using Control Tower.

By providing a REST API that creates AWS accounts and integrates with a workflow tool like Jira, there are many opportunities to successively extend it and build an internal platform around the provisioning of an AWS-specific base setup. For example, the API can be extended to offer network configurations, standardized database setups, or other cloud products to customers via the same user interface (Jira). This centralization and standardization greatly improves the visibility of internal service portfolios as well as the standardization of the system landscape. This in turn pays directly towards the goals of a cloud platform, namely increasing the speed of product teams and shortening the time-to-market for new innovations. On the other hand, a standardized basic configuration reduces the effort for maintenance and operation and decreases the extrinsic cognitive load, especially within the operations area.

TL;DR

Automating account creation is not just an issue for medium to large enterprises with a huge number of AWS accounts. Anyone running a multi-account setup will have to address compliance, security, and FinOps issues in order to stay on top of things, and the level of automation of the processes involved plays a critical role in efficiency. This article provided a detailed look at automating the account creation and basic configuration of a landing zone managed by Control Tower. An alternative approach based on AWS CodeBuild is described in this blog article.

If you are interested in working with us on such challenges using AWS at enterprise scale, feel free to reach out and let's have a chat.


Implementing an AWS Account Vending Machine was originally published in New Work Development on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Terraform Functionbeat]]> https://faun.pub/terraform-functionbeat-e481554d729e?source=rss-762da639aba6------2 https://medium.com/p/e481554d729e Wed, 04 May 2022 08:29:06 GMT 2022-05-04T20:47:51.092Z How to ship AWS Lambda logs with Functionbeat in a Terraform context
Photo by Luke Chesser on Unsplash

Anybody thinking about a solution for centralized logging and monitoring of their applications will quickly come across the Elastic Stack. With its sophisticated Kibana UI, the powerful Elasticsearch backend, and its ecosystem of log and metric collectors, the Beats framework, it offers a good all-in-one solution for various application purposes.

In the following, we will look at how logs from AWS Lambdas can easily get into the Elastic Stack. There is a simple off-the-shelf solution for this: Functionbeat. This beat, which is itself an AWS Lambda, comes with an installer and uses Cloudformation to generate the required AWS resources, such as IAM roles and security groups. The biggest disadvantage, however, is when you do not want to stick to Cloudformation but want to integrate Functionbeat into your own IaC stack. Fortunately, the corresponding Terraform module offers a solution: it is a wrapper around the installer and takes care of the Functionbeat configuration and the required AWS resources.

It is assumed that a VPC and at least one subnet already exist where Functionbeat is to be deployed. If not, a complete, more comprehensive Terraform example can be found in the official GitHub repo.

What we are going to deploy

The following concentrates on how to deploy Functionbeat as an AWS Lambda via Terraform, with optional direct attachment of Cloudwatch log groups. As shown in the overview below, Functionbeat supports more inputs than just Cloudwatch Logs. Even though the Functionbeat Terraform module has no built-in support for additional triggers, it can be used as a foundation to attach other Lambda triggers to Functionbeat the standard Terraform way, using the module's output of the actual Functionbeat ARN.

Photo by Elastic

Integrate Functionbeat module in Terraform

  1. Create a security group for Functionbeat
resource "aws_security_group" "functionbeat_securitygroup" {
  name   = "Functionbeat"
  vpc_id = <REDACTED>

  egress {
    from_port   = 443
    protocol    = "tcp"
    to_port     = 443
    description = "HTTPS"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

2. Integrate the Functionbeat module

module "functionbeat" {
  source = "git::ssh://git@github.com:PacoVK/terraform-aws-functionbeat.git"

  application_name     = "crazy-test-application" # (1)
  functionbeat_version = "7.17.1"                 # (2)
  lambda_config = {
    name = "my-kibana-exporter" # (3)

    vpc_config = {
      vpc_id             = data.aws_vpc.vpc.id
      subnet_ids         = data.aws_subnets.private.ids
      security_group_ids = [aws_security_group.functionbeat_securitygroup.id]
    }

    output_elasticsearch = { # (4)
      hosts : ["https://your-endpoint:443"]
      protocol : "https"
      username : "elastic"
      password : "mysupersecret"
    }
  }
}
1) application_name => value added to any log in Kibana as a tag for filtering
2) functionbeat_version => specify which version to deploy
3) name => name of the deployed Functionbeat Lambda
4) output_elasticsearch => any valid Functionbeat YAML config for output.elasticsearch in HCL syntax

To further configure Functionbeat, you can use fb_extra_configuration to pass any valid options as an HCL construct into the module. To keep the transformation from YAML to HCL simple, it is recommended to use an online converter like YAML to HCL.

If you are hosting the Elastic Stack on Elastic Cloud and expect a large amount of logs to ship, I recommend using the private link feature to keep the traffic within AWS. This will save costs because the traffic won't leave the AWS backbone and you won't be charged for egress traffic. Of course, you will be charged for the private link resources.

After everything is in place, you can run the following to deploy Functionbeat:

terraform get && terraform apply

You won't have any logs in Kibana yet, since no subscriptions are defined. You now have several options to set them up.

Pure Terraform: subscribe a Cloudwatch log group

To use the module's built-in Cloudwatch subscription capability, pass the corresponding Cloudwatch log group name to the Functionbeat module via the loggroup_name property.

Integrate Lambdas deployed via Serverless Framework

If you leverage the Serverless Framework for your application Lambdas, this module offers an interface for that. The Functionbeat ARN is written to SSM by default, hence you can make use of the parameter within your Serverless specs. Install the plugin serverless-plugin-log-subscription into your Serverless stack, which makes it a breeze to attach the corresponding Cloudwatch log groups.

  1. Use the Functionbeat module, install the Lambda, and ensure lambda_write_arn_to_ssm is set to true, which is the default.
module "functionbeat" {
  # ...
  lambda_config = {
    name = "my-kibana-log-shipper"
    # ...
  }
}

2. To attach the logs of all Lambdas of your Serverless application, add the following plugin config to your serverless.yml

custom:
  logSubscription:
    enabled: true
    destinationArn: '${ssm:my-kibana-log-shipper_arn}'

Apart from that, the Serverless plugin can also operate on a per-function level (please head over to the official docs).

Conclusion

We saw how to make use of Functionbeat in a Terraform context and how easy it is to integrate Functionbeat with existing Serverless-based applications. Of course, you can use any other infrastructure-as-code tool by consuming the target Lambda ARN exposed via SSM.

Thank you for reading. Feel free to reach out with questions or feedback.

Resources



Terraform Functionbeat was originally published in FAUN — Developer Community 🐾 on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Software Architects and Autonomous Teams]]> https://medium.com/swlh/software-architects-and-autonomous-teams-72e25becf8a9?source=rss-762da639aba6------2 https://medium.com/p/72e25becf8a9 Tue, 23 Feb 2021 19:17:29 GMT 2021-03-09T18:58:34.259Z How Does a Software Architect Fit into Autonomous Teams?
Photo by You X Ventures on Unsplash

Software architects are responsible for the architecture of a system. This includes the overall design, the corresponding components, and their communication.

One popular definition of architecture is “stuff that’s hard to change”. I’d argue that a good architect makes change easier — thus reducing architecture.
Martin Fowler

A software architect should also aim for the maintainability and sustainability of a system, making assumptions about how the system will be used and trying to verify them. In contrast, a developer's view of the system is more fine-grained, focused on implementation details and limited to the component's context, regardless of whether the component is only a small part of the overall system. The outcome is a discrepancy between what is seen as the most important component of a system (the developer's perspective, most likely technical) and what is the most important component in the context of the whole system (addressing both the component's technical and business relevance). Thus, an architect has to make sure to create and distribute a shared global understanding of the system. Apart from the communicational aspects, an architect could define guidelines for teams to prevent uncontrolled technology growth. This is necessary if there are management decisions to cooperate with certain technology partners only (e.g. Oracle), or if the organization is not able to make good architectural decisions, e.g. because of missing experience or maturity. An example of such a guideline could be the decision to use REST over SOAP for communication, or recommendations on choosing a programming language. Within these boundaries, teams should have the possibility to interact and make decisions on their own.

However, the movement from functional teams towards cross-functional and autonomous teams blurs the lines of responsibility and decision-making. How can a software architect facilitate the idea of autonomous teams?

Social aspects of developing software

This article mainly elaborates on how a software architect fits into autonomous teams, where the role of the architect is interpreted as a mediator between teams. It is therefore crucial to understand the role of communication in developing software. Social interactions between individuals and groups have a great impact on the success of software development projects. In 1968, Melvin E. Conway found that an organization is constrained to create systems that are reflections of its communication structure. Hence, any friction in communication between departments, or any overly complex workflow, is likely to be reflected within the software. Communication is a result of the corporate culture, hence changing communication means changing at least parts of the corporate culture. Ignoring this dependency will result in friction and a lack of understanding of strategic and management decisions. In a software development process, you need to ensure that someone is taking care of discovering and removing any negative influence on the social interactions between all involved departments. This article assigns such duties and responsibilities to the architect.

Functional Teams vs. Cross Functional Teams

Functional teams are teams with a focus on one skill set. A functional team of database administrators has huge expertise in managing databases but most likely little knowledge about web frontend technology, whereas a team of frontend specialists cannot administer databases but creates great websites. Communication within such a group of specialists has low barriers, because they are grouped into one team, which enables them to share their specific knowledge easily.

Functional teams are knowledge silos

However, building software usually needs a mix of different skills, e.g. backend, database, frontend, UI, UX, and operations. Functional teams are highly skilled, but the external communication path (between teams) is very heavyweight and often fails due to miscommunication. To tackle this problem, the agile movement came up with the idea of cross-functional teams, where teams are made up of members with different skill sets.

Cross functional teams

This enables the database specialist to work closely with the backend developer. There are two major benefits: on the one hand, specialized knowledge can easily be shared across skill boundaries; on the other hand, decisions that affect the product can be made directly, considering feedback from all involved parties. Both foster the team's autonomy.

Cross Functional Teams vs. Autonomous Teams

Building cross-functional teams doesn't automatically result in autonomous teams. This strictly depends on further key drivers like the team's maturity, social capabilities, the size of its area of responsibility, leadership, and organizational structure.

Having a cross-functional team enables you to address business goals directly to the team. Functional teams, even autonomous ones, are constrained to solving problems within their special field, which means a team of backend developers will build a performant webshop backend but will likely fail at building a nice storefront. Having multiple autonomous teams requires a clear and solid communication path between the different teams. If you cannot provide that, your teams will likely have to deal with a huge communication overhead, which decreases developer experience and causes inefficiency due to friction and frustration. Frustration and friction between teams make the members more likely to reject responsibility for their product.

Indeed the role of a software architect has a broad range of interpretations. Many believe a software architect is somebody who primarily:

  • takes decisions regarding the software architecture as well on some hardware related topics
  • is more related to the C-levels
  • is responsible for the quality of software
  • enforces technical standards, tools and platforms

This strongly contradicts the idea of autonomous teams, because it shifts the responsibility for core decisions away from the teams and undermines the willingness to take over responsibility, which is essential for autonomy. If a team does not agree with a decision, it tends to blame the architect for any related failure.

However, the role of a software architect within a cross-functional team can be implemented in two ways, each with specific characteristics.

Architect as a role in a team

As already mentioned, having multiple autonomous teams eases the communication and collaboration within former skill silos but requires a very good communication path to external teams. The technical solution for that is well-defined APIs for system-to-system communication. The non-technical solution can be achieved with proper team design. To prevent both a lack of functionality and over-engineering of a system's integration (e.g. in the form of an API), the best design evolves from cross-team communication. The team member who feels responsible for the architecture has to make sure the team knows its consumers' requirements. He or she facilitates the external communication and, if necessary, the cooperation with other teams.

Architect as role in a team

The architect as part of the team is not a single, dedicated role. It is more an extension of another role, most likely, but not limited to, one strongly related to development (e.g. backend or frontend developer). Typically, an architect knows how to write software and already has several years of experience; think of a senior developer or tech lead whose main focus is architectural work. The drawback of this design is that it is crucial for the overall system that the architects within the teams have a clear overview across all components and common requirements. Without a clear responsibility, there might only be team members who feel responsible for the architecture on a voluntary basis, and if that fails, the system landscape can become a mess. Bear in mind that even if the architect in this approach is not a dedicated role, there is still the need for a dedicated person within the team owning this attachment. Otherwise, your team members may feel left alone and take decisions they are not capable of taking.

Architect-as-a-service

In this scenario, the architect or architects are not part of the development team. In an organizational diagram they could be a team, but since it can be a single-person role, it is more like a pool of skilled people. The architect-as-a-service can be consumed by teams whenever necessary, e.g. for API design reviews, support with technology decisions, or even contributions to the code base.

Architect-as-a-service

It is important that the time the architect works for or in a team has a defined start and end date. The architect comes as a promoter and enabler, and the goal is to close knowledge and competence gaps within the team and align the team's efforts with the company goals. She aims to make her role within the team obsolete. As the architect is not assigned to a specific team, her responsibility is to keep a clear high-level perspective of the overall system landscape.

The architect-as-a-service offers several advantages over the prior approach:

  • Due to the meta-involvement in all teams, an architect is able to spread success stories across the teams.
  • Having a dedicated instance to identify common requirements (standards). No need to reinvent the wheel multiple times.
  • The communication overhead is not directly within the teams; this primarily applies to social interactions and knowledge sharing across teams. The architect-as-a-service can moderate and foster communication paths where they are vital for the project's success.
  • The architect has a less emotional relationship to a team’s product which facilitates more pragmatic decisions/ advice.
  • Being not a full-time team member, the architect can focus more on specific topics and architectural work than a developer who implements features.

There are several downsides that come along with the architect-as-a-service approach. Due to her temporary engagement within a team, the architect has to focus strongly on enabling the team before she leaves again. During the time of collaboration, the team effectively has more development capacity; however, the architect should not be treated as a full-time team member for getting features in place. It is a fine line between being part of the team in a very special role and being noticed as a stranger, which requires the architect to have strong social and communicative competence. Additionally, the architect-as-a-service always has to consider herself an advisor rather than a decision maker. The ownership and responsibility for the system remain within the team. Enforcing decisions on the team beyond the aforementioned architectural boundaries, and leaving it alone with maintenance and ongoing support, will end up in distrust, frustration, and the loss of the willingness to take over responsibility for the system.

To summarize, the final decision about any changes to the software architecture still belongs to the teams, but they can get support, review, and expertise from outside at any time.

Recap

The software architect role is a role for professionals. It requires strong skills and experience in both management and technical aspects. The architect should not be treated as a single point of knowledge and should not be the only one responsible for a system; rather, the architect is a mentor and enabler for the teams. Depending on your team structure (cross-functional team vs. autonomous team) and skill sets, the architect can be part of a product team or available as a service on demand. Both models come with advantages and disadvantages. Having a dedicated architect enforcing technological decisions contradicts the goal of having autonomous teams and impedes the team from taking over full responsibility. However, in the vast majority of cases, architecture still needs somebody to care about it. This person, the architect, has to focus on communication, on removing friction, and on a coherent understanding of the system's purpose and design across all participants.

To conclude, I want to express my thanks to Eberhard Wolff and my wonderful colleague Niko Huber for spending their time reviewing and improving this article. Your feedback was always an eye-opener.

Thank you for reading. Feel free to reach out with questions or feedback.

References

Twitter — Martin Fowler

Who needs an architect — Martin Fowler

Maximizing Developer Effectiveness — Tim Cochran

How do committees invent? — Melvin E. Conway

(German) Podcast Organisation als Werkzeug zur Umsetzung von Architektur— Eberhard Wolff/ Gerrit Beine


Software Architects and Autonomous Teams was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Integrate Keycloak with HashiCorp Vault]]> https://faun.pub/integrate-keycloak-with-hashicorp-vault-5264a873dd2f?source=rss-762da639aba6------2 https://medium.com/p/5264a873dd2f Thu, 11 Feb 2021 09:01:29 GMT 2021-02-15T13:00:17.840Z A How-To guide using Terraform
Photo by Mika Baumeister on Unsplash

Hashicorp Vault is an open-source tool to manage secrets and secret access. The official definition of a secret in Vault:

A secret is anything that you want to tightly control access to, such as API keys, passwords, certificates, and more. — Vault Documentation

Access to secrets is granted via group memberships and the corresponding policies. Although you can manage users within Vault, in an enterprise context users are often managed centrally. There are several ways to authenticate users to Vault, and this post elaborates on how to integrate the open-source identity provider Keycloak with Vault. You can have a look at the sample project on GitHub.

This is a how-to guide and the samples are written in Terraform. Nevertheless, you can easily follow these steps and set everything up via the corresponding UIs.

Prerequisites to follow with the sample project:

  • Running Vault server
  • Running Keycloak
  • Terraform cli

Configure Keycloak

First, we need to prepare Keycloak so that we can link an OIDC client to Vault. If you have already set up Keycloak with realms and users, you can skip this part.

  1. Create a realm
resource "keycloak_realm" "realm" {
  realm   = "demo-realm"
  enabled = true
}

2. Add users to the realm, for this demo we have to set up Alice and Bob.

resource "keycloak_user" "user_alice" {
  realm_id = keycloak_realm.realm.id
  username = "alice"
  enabled  = true

  email      = "alice@domain.com"
  first_name = "Alice"
  last_name  = "Aliceberg"

  initial_password {
    value     = "alice"
    temporary = false
  }
}

3. Create an OIDC client, you might notice the resource is called keycloak_openid_client. OIDC is OpenID Connect, a standard built on top of the OAuth2.0 authorization framework. As valid_redirect_uris for the client, we define the Vault endpoint (in the sample project it’s localhost).

resource "keycloak_openid_client" "openid_client" {
  realm_id  = keycloak_realm.realm.id
  client_id = "vault"

  name                  = "vault"
  enabled               = true
  standard_flow_enabled = true

  access_type = "CONFIDENTIAL"
  valid_redirect_uris = [
    "http://localhost:8200/*"
  ]

  login_theme = "keycloak"
}

4. Define client roles according to your use case. In this sample, we will have a management and a reader role. You can use composite roles to benefit from permission inheritance for fine-grained access control. In this sample, the management role inherits the permissions of the reader role.

resource "keycloak_role" "management_role" {
  realm_id    = keycloak_realm.realm.id
  client_id   = keycloak_openid_client.openid_client.id
  name        = "management"
  description = "Management role"
  composite_roles = [
    keycloak_role.reader_role.id
  ]
}

resource "keycloak_role" "reader_role" {
  realm_id    = keycloak_realm.realm.id
  client_id   = keycloak_openid_client.openid_client.id
  name        = "reader"
  description = "Reader role"
}

5. To finish the Keycloak configuration, we need to add the claims to the idToken that is issued for Vault access. The claims are added under the claim_name specification. The roles key at the root level is reserved and cannot be used. In this sample we put the claims under resource_access.vault.roles.

resource "keycloak_openid_user_client_role_protocol_mapper" "user_client_role_mapper" {
  realm_id  = keycloak_realm.realm.id
  client_id = keycloak_openid_client.openid_client.id
  name      = "user-client-role-mapper"
  claim_name = format("resource_access.%s.roles",
    keycloak_openid_client.openid_client.client_id)
  multivalued = true
}

6. If you enable Direct Access Grants for the client, you can test your setup and issue an idToken using eg. curl.

curl \
  --data "username=bob&password=bob&grant_type=password&client_id=vault&client_secret=<CLIENT_SECRET>" \
  http://localhost:8080/auth/realms/demo-realm/protocol/openid-connect/token

So far, we are done with the Keycloak configuration. We proceed to enable the OIDC authentication method in Vault and set up the corresponding mapping between the token claims and Vault policies.

Configure Vault

To understand which internal Vault resources we are going to configure, the following is a high-level overview of how these resources relate to each other. We need to set up policies, groups, and entities, as well as a mapping between the Vault-internal structure and the one provided by Keycloak.

Vault configuration
  1. Vault will sign each token that is issued by the secrets engine. Hence we provide a key for our OIDC identity.
resource "vault_identity_oidc_key" "keycloak_provider_key" {
  name      = "keycloak"
  algorithm = "RS256"
}

2. We have to enable the OIDC auth backend for Vault. Setting listing_visibility to unauth makes the auth method appear on the login screen. If you omit the default_role, you need to specify a role each time you log in; for better usability, specify a default role with the least privileges. Notice that oidc_discovery_url is the URL of Keycloak, in this example running on localhost.

resource "vault_jwt_auth_backend" "keycloak" {
  path               = "oidc"
  type               = "oidc"
  default_role       = "default"
  oidc_discovery_url = format("http://localhost:8080/auth/realms/%s",
    keycloak_realm.realm.id)
  oidc_client_id     = keycloak_openid_client.openid_client.client_id
  oidc_client_secret = keycloak_openid_client.openid_client.client_secret

  tune {
    audit_non_hmac_request_keys  = []
    audit_non_hmac_response_keys = []
    default_lease_ttl            = "1h"
    listing_visibility           = "unauth"
    max_lease_ttl                = "1h"
    passthrough_request_headers  = []
    token_type                   = "default-service"
  }
}

3. Define a backend role to be used for authentication and to assign/map permissions to a user. This is where the magic happens. The user_claim is the unique identifier of the user for whom the token is issued. Vault uses this identifier to create entities on the fly when a user logs in via the OIDC method, and each entity can be enriched with metadata from the token via the claim_mappings property. To dynamically assign Vault policies based on grants from Keycloak (claims), we have to tell Vault where the claims are listed in the idToken. In the Keycloak preparation section we set this to resource_access.vault.roles. Vault expects nested attributes in JSON Pointer (slash-separated) syntax, so resource_access.vault.roles becomes /resource_access/vault/roles.

resource "vault_jwt_auth_backend_role" "default" {
backend = vault_jwt_auth_backend.keycloak.path
role_name = "default"
role_type = "oidc"
token_ttl = 3600
token_max_ttl = 3600

bound_audiences = [keycloak_openid_client.openid_client.client_id]
user_claim = "sub"
claim_mappings = {
preferred_username = "username"
email = "email"
}

allowed_redirect_uris = [
"http://localhost:8200/ui/vault/auth/oidc/oidc/callback",
"http://localhost:8250/oidc/callback"
]
groups_claim = format("/resource_access/%s/roles",
keycloak_openid_client.openid_client.client_id)
}

4. Our authentication backend in Vault is ready to use. Now we need to provide some policies and groups so that Vault can actually grant permissions to resources based on the idToken. In this sample we have a management and a reader policy, where only the management policy grants write access to secrets. Note that we configured the management role in Keycloak as a composite role inheriting the reader role, hence we don't have to grant read and list permissions in multiple policies.

data "vault_policy_document" "reader_policy" {
rule {
path = "/secret/*"
capabilities = ["list", "read"]
}
}

resource "vault_policy" "reader_policy" {
name = "reader"
policy = data.vault_policy_document.reader_policy.hcl
}
data "vault_policy_document" "manager_policy" {
rule {
path = "/secret/*"
capabilities = ["create", "update", "delete"]
}
}

resource "vault_policy" "manager_policy" {
name = "management"
policy = data.vault_policy_document.manager_policy.hcl
}

5. Now that we have created policies, we assign them to groups. Additionally, we create an OIDC role and assign it the signing key we created earlier. Vault distinguishes two types of groups, internal and external. External groups act as mappings to groups managed in an external system, in this case Keycloak.

resource "vault_identity_oidc_role" "management_role" {
name = "management"
key = vault_identity_oidc_key.keycloak_provider_key.name
}

resource "vault_identity_group" "management_group" {
name = vault_identity_oidc_role.management_role.name
type = "external"
policies = [
vault_policy.manager_policy.name
]
}

6. The last step is to create an alias for the external group. The alias is the Vault-internal representation of the group identity from the external system; for users managed in a directory such as Active Directory, the alias has to represent the respective role or group delivered in the token claim.

resource "vault_identity_group_alias" "management_group_alias" {
name = "management"
mount_accessor = vault_jwt_auth_backend.keycloak.accessor
canonical_id = vault_identity_group.management_group.id
}
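
Note that the snippets above only wire up the management side; the reader role from Keycloak needs a matching external group and group alias as well. A sketch mirroring the management resources (the resource names here are my own choice and may differ from the sample project):

resource "vault_identity_group" "reader_group" {
  name = "reader"
  type = "external"
  policies = [
    vault_policy.reader_policy.name
  ]
}

resource "vault_identity_group_alias" "reader_group_alias" {
  name           = "reader"
  mount_accessor = vault_jwt_auth_backend.keycloak.accessor
  canonical_id   = vault_identity_group.reader_group.id
}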

Verify

Let's verify our setup. Either browse to your existing Vault UI or use the URL from the demo, http://localhost:8200. Remember that we configured a default role, so there is no need to specify a role at login. If you didn't define a default role, you have to enter an existing one, otherwise you can't proceed.

OIDC is now available as a Login method

If you hit the "Sign in with OIDC Provider" button, a Keycloak pop-up will open and ask you to log in. Log in as either alice or bob. After logging in, you are redirected to Vault, where Alice is only permitted to read secrets while Bob also has write access.
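
If you prefer the CLI over the UI, the same login flow works with the Vault binary; the second redirect URI we allowed above (http://localhost:8250/oidc/callback) is the local callback used by the CLI. A quick sketch, assuming the demo instance on localhost:

export VAULT_ADDR=http://localhost:8200
vault login -method=oidc role=default

This opens a browser window for the Keycloak login and, on success, stores the issued Vault token for subsequent CLI commands.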

Hello Vault!

Recap

Vault is quite powerful, easy to use, and very extensible. We saw how easy it is to integrate any OIDC-capable identity provider such as Keycloak. While the Terraform Keycloak provider is a third-party provider, the Vault provider is officially supported by HashiCorp. You can write your entire Vault setup in HCL and manage your cluster with Terraform. At the time of writing, HashiCorp offers a free public beta of fully managed Vault on the HashiCorp Cloud Platform.

Thank you for reading. You can reach out to me via:

Resources

Vault Documentation

Keycloak Documentation

Terraform Vault Provider

Terraform Keycloak Provider

HashiCorp Cloud Platform



Integrate Keycloak with HashiCorp Vault was originally published in FAUN — Developer Community 🐾 on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Progressive Deployment]]> https://faun.pub/progressive-deployment-7c9dd7ca82ce?source=rss-762da639aba6------2 https://medium.com/p/7c9dd7ca82ce Mon, 25 Jan 2021 15:37:29 GMT 2021-02-25T15:14:11.659Z Progressive Delivery

Beyond Continuous Deployment

Photo by Marc-Olivier Jodoin on Unsplash

Releasing software efficiently is a crucial part of the software delivery chain. The idea of integrating developers' copies of software multiple times a day was originally born with the concept of Extreme Programming (XP). Today many companies already automate the integration of their software with the help of unit and integration test suites. Due to several factors (e.g. system complexity, business criticality, legacy), the release/deployment workflow often remains a manual step. With the adoption of flexible cloud infrastructure and serverless applications, it has become much easier to extend Continuous Integration (CI) to Continuous Deployment (CD).

CI/CD Process

To add resiliency to the deployment process, you can implement a blue/green or canary deployment strategy. Combining QA on a staging system, which prevents a big bang in production, with hot/cold machines encourages deploying more often with fewer failures. In case of an emergency, you can roll back to the prior version.

At first sight this process seems rock-solid, but at second glance it is only reliable if you have sufficient, proper test suites as well as careful and holistic QA. Additionally, some features are hard to test on staging. Staging systems are often only production-like (which means: not like production) to save costs, which in turn leads to poor load-test scenarios, if such tests exist at all. To address these issues, the idea of progressive delivery (PD) introduced two major enhancements: feature flags and gradual rollout.

This strategy is already adopted by big players like Google, Facebook, and Netflix. It embraces modern practices like canary deployments, observability, and A/B testing.

Feature flags

Feature flags, also known as feature toggles, are a pattern for switching a feature on or off without changing code, inherently at the price of bloating the artefact you ship to production. Feature flags are often used for A/B testing: instead of rolling out changes or new features to the entire set of users, you only enable the functionality for a subset of them.

Feature flags

Testing in production is probably the most realistic testing mechanism before going live, because it is production and not merely production-like. From a development perspective, feature flags add another layer of resiliency: in case of an error or performance problem, there is a soft switch to turn off the broken functionality. From a business point of view, you don't harm all of your users; only a small percentage of them are affected. In a highly competitive market, e.g. mobile or e-commerce, it is essential to keep and strengthen users' loyalty. Feature flag management should support granular user targeting. Several tools such as ConfigCat simplify implementing and using feature toggle capabilities.

A great article about feature flags is Feature Toggles (aka Feature Flags) from Martin Fowler.

Gradual rollout

In terms of CD, blue/green deployments and canary deployments are very popular. Whereas a blue/green deployment switches 100% of the traffic to either the blue or the green instances, a canary deployment shifts traffic gradually. Canary deployments are often mentioned in the context of Kubernetes, where they are a commonly used rollout pattern; a minimal sketch of one approach follows below.
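
One common way to do this on plain Kubernetes, without a service mesh, is to run a stable and a canary Deployment behind the same Service and control the traffic split via replica counts. A minimal sketch under that assumption (names, image tags, and the 90/10 split are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1   # roughly 10% of the pods, hence roughly 10% of the traffic
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.1.0
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp   # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 8080

Shifting more traffic to the new version is then a matter of adjusting the replica counts until the canary track takes over completely.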

Blue/Green vs. Canary deployment

Canary deployments in conjunction with feature flags are very powerful and combine the best of both. The idea is to release features into production but keep them inactive. While gradually shifting small portions of production traffic onto the new release, you can monitor that the deployment is still stable. In a second step, you can granularly enable feature flags for selected user segments based on region, email, or any other user attribute. In a rollback scenario, only a small group of users is affected, and instead of reverting the whole deployment you just turn off the broken feature.

Monitoring and observability

Monitoring and observability are essential for the successful adoption of progressive delivery. Whereas continuous deployment focuses on secure and stable releases of feature sets into production, PD treats each feature as a micro deployment with the primary focus on the end user. The overall intention is to acquire new customers and strengthen customer retention through usability and reliability.

In case of impediments, you need to be able to react as soon as possible and ensure that the outage impairs as few users as possible. Depending on your business, an impediment is not limited to an outage: think of a real-time bidding platform, where increased latency can cause the same harm to the user. Well-defined alarms and good monitoring dashboards help you understand the state and usage of your software.

Recap

Progressive delivery not only increases resiliency and decreases downtime caused by deployments; with its customer-centric approach, it also brings development and the end user closer together. Tools like A/B testing no longer belong to (performance) marketing departments only. Even though progressive delivery is an extension of CD, and an existing CD process is therefore a prerequisite, adopting PD is not trivial. Developers have to deal with dormant code in production and take care of its continuous removal, otherwise you'll accumulate technical debt through unused code. Furthermore, observability requires adequate tracing and monitoring with well-defined triggers and alarms, so you can intercept and react in case of a failure. At worst, PD will only add complexity without delivering value to your business.

Thank you for reading. You can reach out to me via:

Resources

XP — Wikipedia

Feature Toggles (aka Feature Flags) — Martin Fowler

ConfigCat



Progressive Deployment was originally published in FAUN — Developer Community 🐾 on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Systems Simplicity vs. Complexity]]> https://medium.com/swlh/systems-simplicity-vs-complexity-8f06102e9a8b?source=rss-762da639aba6------2 https://medium.com/p/8f06102e9a8b Mon, 18 Jan 2021 15:38:16 GMT 2021-01-20T22:11:51.737Z Beware of the microservices, if you don’t need them
Photo by Joshua Sortino on Unsplash

At the very beginning of a new project, or when maintaining existing software, there is a point where you have to decide which kind of system architecture to choose: whether it is suitable for the project's use case to build a monolithic system, self-contained systems, or microservices.

Influenced by the steadily growing serverless hype, many decisions are made in favor of the microservice approach. One reason is that serverless is omnipresent at conferences and in articles; the other is that the microservices approach promises a production-ready MVP in a very short period of time, while additionally solving many problems that can occur when maintaining a monolithic system.

There are many good arguments for microservices; however, there are some pitfalls you should be aware of when choosing this direction. This article gives you an idea, from a broader perspective, of what to keep in mind when choosing microservices.

Inner complexity vs. outer complexity

One of the biggest advantages of microservices is the reduction of a system's complexity. One service is responsible for one particular piece of functionality, and the business-logic complexity within that service is limited to that functionality, which makes contributing to the service's codebase and onboarding new members much easier. Nevertheless, the system's complexity is only shifted to the outer boundaries. Every contributor has to know about outer constraints like the runtime, timeout restrictions, communication with other systems, failure handling, etc. Understanding these outer boundaries is crucial when working within a microservice ecosystem; otherwise you won't get the approach's benefits, such as resiliency.

„When you use microservices you have to work on automated deployment, monitoring, dealing with failure, eventual consistency, and other factors that a distributed system introduces.“ — Martin Fowler

To summarize it: Nothing comes without a price and you’ll always be in the situation to decide which trade off is the most suitable for your project.

Is your organization capable of handling microservices?

Before starting to think about possible architectural styles, one of the most important things to know is whether your organization is generally capable of handling the chosen architecture. You can answer this question with the help of several indicators.

1. Remember Conway’s Law

Conway's law always strikes back when you ignore it. In 1968, Melvin E. Conway observed and published that systems designed by an organization will always reflect its paths of communication.

“Organizations which design systems […] are constrained to produce designs which are copies of the communication structures of these organizations.” — Melvin E. Conway

Hence, if you have several teams crossing each other's boundaries and sharing responsibilities, you won't be able to develop decoupled systems. You might not detect negative effects at the beginning, but as the systems grow you will discover a tremendous communication overhead if you don't manage to define clear boundaries between the teams. In the end this prevents you from achieving faster time to market and autonomous releases. At its worst, it can lead to a blaming culture between your teams, where team members won't take responsibility for the system they develop.

2. Review your current organizational structure

Microservices are suitable for nearly every use case, provided you are aware of both the complexity and Conway's law. However, the microservice architectural patterns are predestined for developing software with multiple teams. Microservices enable autonomous release cycles and a minimum of communication overhead with other teams. If you are the only team, you won't benefit from any of these advantages, because you don't have to communicate much with other teams, or at least you are the only ones deploying software into production. One could argue that microservices give you independent scaling for specific parts of your system. However, you should ask yourself whether the system you develop really needs to scale that far. If not, you probably won't get the benefits of microservices; on the contrary, you might end up with a distributed monolith because of the lack of distributed teams and therefore of (clear) boundaries.

3. Know your team’s capabilities

As already mentioned, the outer complexity of a project increases when choosing microservices. This affects the know-how teams need in order to take full responsibility for the product. They need not only business know-how but also, at least partially, knowledge about runtime specifics (in the case of a serverless runtime: timeout restrictions, ephemeral storage, etc.), networking, and service-to-service communication (e.g. queues, buses, topics). On top of that, every microservice needs to be assigned to a team which then owns the service and can be held responsible. Having umpteen microservices means having umpteen boundaries. If they are not logically assigned to the right teams, you might discover hidden communication complexity across teams, unless you have enough engineering teams that each one is responsible for only a small number of services.

4. About splitting the monolith

A common scenario for introducing microservices is the gradual split of a monolith. At that point I want you to be aware that moving directly from monolith to microservices is risky (ref. MonolithFirst — Martin Fowler (2015)). There are many examples of companies that tried starting over by extracting parts of a domain from an existing monolith directly into microservices; unfortunately, this tends to end in a distributed monolith as more domains are cut out of the existing monolith. Apart from the architectural perspective, you should continuously track whether your current organizational structure is capable of handling more microservices. Especially when you maintain multiple microservices with different frameworks and deployment pipelines, you will sooner or later exceed your teams' cognitive capacity, and they'll no longer be able to maintain their services adequately.

A safer approach could be to move from monolith to modulith and then, if necessary, on to microservices, at the cost of time. If you are starting from scratch, implementing a deployment monolith first could prevent you from running into a distributed monolith. With a clear decision towards a monolith, you can at least start implementing and still have the flexibility to move to microservices later, or to stick with a mix of moduliths and microservices.

Recap

Microservices are still en vogue, reinforced not least by serverless capabilities in public clouds. In the end you have to keep in mind that neither microservices nor serverless is a silver bullet. Nothing comes without a price, and the disadvantages can be harmful to your organization. Building software is not restricted to software departments; it is the organization's culture that influences the architectural capabilities and therefore the systems themselves. It is crucial to be aware of your decision's side effects when changing your system's architecture and/or landscape. Developing a distributed system demands proper organizational evolution.

You can reach out to me via:

References

How do committees invent? — Melvin E. Conway

MonolithFirst — Martin Fowler (2015)


Systems Simplicity vs. Complexity was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[CI/CD for Cloud Run with Terraform]]> https://faun.pub/ci-cd-for-cloud-run-with-terraform-f0a359d7f052?source=rss-762da639aba6------2 https://medium.com/p/f0a359d7f052 Mon, 04 Jan 2021 09:15:35 GMT 2021-01-05T21:59:01.503Z How to automate GCP Cloud Run setup with Terraform Cloud (Part 1/2)
Starting on a greenfield

In this post, we will deploy a web service written in Deno to GCP Cloud Run using Terraform, a cloud-independent infrastructure-as-code framework. This post is split into two parts: the first covers the baseline setup, while the second covers the integration with GitHub and automation with GitHub Actions. You can get the sample project files here.

Requirements

To follow this post you need to have the following setup:

gcloud CLI

Docker

Terraform

The sample project

Setup Google Cloud

GCP is our target platform, hence we need an account. New customers initially get $300 in credits; nevertheless, the resources created here are all covered by the free tier. In this section, we will

✓ create a new project

✓ create a service account for Terraform

✓ enable the proper GCP APIs

Our goal is to have as many resources as possible written in Terraform, since we want our infrastructure as code and integrated into Terraform Cloud. However, in this project's small context we'll keep it simple and create the required service accounts manually via the web console.

  1. Create a project

To get started with GCP you first need to set up a project. If you're familiar with AWS, projects can be compared to accounts under an organization. They are completely isolated from each other but ultimately grouped under your account. When you log in to the GCP Console you will be asked to create a new project, or you can choose an existing one. Once you're done you should end up at the project's dashboard view.

2. Create a service account for Terraform

Navigate in GCP Console to Service Accounts at the IAM & Admin panel to create a new service account.

Create service account via GCP console

As already mentioned, to keep it simple we won't dive deep into IAM topics. This setup is not ready for production use! We grant our service account Owner access to the project, which gives Terraform full access and permission to provision any resource inside this project.

Grant Project Owner to the service account

We'll skip the third part (grant user access to the service account) and hit the Done button to finish account creation. Last but not least, you'll need to create an access key for the service account to use for Terraform's authentication. Select the previously created service account and create a new key. Choose the JSON format and save the data (<Your-project-ID>-hashValue.json); it is only downloadable on creation. Bear in mind this data is highly sensitive because it grants a huge permission set. We need these Google credentials, slightly modified, for the next steps so that we can store them in a variable. To store the credentials we need to remove the newlines from the downloaded JSON file; you can use the following:

jq -c . <Your-project-ID>-hashValue.json

This prints the credentials without newlines to stdout; copy the value for the Terraform section.

3. Enable the proper GCP APIs

You need to ensure that the following APIs are enabled for the target GCP project:
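
For the setup in this post, that typically means at least the Cloud Run API and the Container Registry API (an assumption based on the services used here; your project may need more). You can enable them in the console or with gcloud:

gcloud services enable run.googleapis.com containerregistry.googleapis.com --project <Your-project-ID>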

Setup Terraform Cloud

First, you need to sign up for Terraform Cloud; we only make use of the free plan here. There is a quite helpful compendium from HashiCorp on how to get started with Terraform Cloud. To follow this post, make sure you have a workspace with the following setup:

✓ remote execution mode

✓ apply method to manual

✓ Terraform working directory set to gcp

✓ Setup variables

Terraform workspace configuration

After workspace creation, we have to set up the proper variables for the Terraform runs. Head over to the Variables tab of your freshly created workspace. We will create the Terraform variables defined in variables.tf (see the sketch after the list below).

  • GOOGLE_CREDENTIALS (Terraform authentication to GCP, use the formerly created and formatted Google credentials for Terraform GCP service account)
  • project_id (the ID of our GCP project, created earlier)
  • app_version (version of our Deno-Oak application, set the default to latest)
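
A variables.tf covering these variables could look roughly like this (a sketch; the actual file in the sample project may differ):

variable "GOOGLE_CREDENTIALS" {
  type        = string
  description = "Service account key JSON used to authenticate the Google provider"
}

variable "project_id" {
  type        = string
  description = "ID of the target GCP project"
}

variable "app_version" {
  type        = string
  description = "Image tag of the Deno-Oak application to deploy"
  default     = "latest"
}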

Deploy your setup initially

Since we run our Terraform code on Terraform Cloud infrastructure, you will now probably get an error when issuing terraform plan inside the gcp directory:

Error if you are not authenticated

To fix this, issue the terraform login command. Follow the instructions to create a User API token for the Terraform Cloud organization you created. You can see your user tokens in your user settings and revoke them if necessary. To read more about the different kinds of API tokens, head over to the official docs.

If you get something like this, your API key creation was successful

Let's give it another spin with terraform plan, which now succeeds. Next, let's run terraform apply to create the resources. Again we run into an error.

Error if the image does not exist in our registry

To fix this, we have to build and push the image. GCP has several registries, and you first need to authenticate to be able to write to them. We can make use of the gcloud credential helper for that. GCP offers the following registries; the more you add to the helper, the more delays you might encounter when running docker build commands. It is recommended to add only the registries you need. Choose between:

  • eu.gcr.io
  • us.gcr.io
  • staging-k8s.gcr.io
  • asia.gcr.io
  • gcr.io
  • marketplace.gcr.io

In this sample we only make use of eu.gcr.io, hence we issue the command gcloud auth configure-docker eu.gcr.io.

Navigate into the project's root, where the Dockerfile lives, run docker build -t eu.gcr.io/<your-gcp-project-id>/deno-oak . and then push the image to the registry with docker push eu.gcr.io/<your-gcp-project-id>/deno-oak.

Finally, run terraform apply inside the gcp directory and see Terraform succeed. There is one output:

  • api_url (URL to be used to call our deployed service)
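
For reference, the Cloud Run part of such a Terraform setup could look roughly like the sketch below; the resource names, region, and public-invoker binding are assumptions, not necessarily what the sample project uses:

resource "google_cloud_run_service" "deno_oak" {
  name     = "deno-oak"
  location = "europe-west1"

  template {
    spec {
      containers {
        image = "eu.gcr.io/${var.project_id}/deno-oak:${var.app_version}"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

# Allow unauthenticated invocations so the service URL is publicly reachable
resource "google_cloud_run_service_iam_member" "invoker" {
  service  = google_cloud_run_service.deno_oak.name
  location = google_cloud_run_service.deno_oak.location
  role     = "roles/run.invoker"
  member   = "allUsers"
}

output "api_url" {
  value = google_cloud_run_service.deno_oak.status[0].url
}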

Recap

Terraform Cloud is a very neat solution for working with Terraform plans in a secure, collaborative, and resilient way. You don't have to care about state-file management, where potentially highly sensitive values are stored, and you don't have to care about locking. Furthermore, it lets you easily set up review/apply processes and proper rights management for your infrastructure as code. In a bigger context, you can even Terraform your Terraform Cloud with the official provider.

To be continued

Thanks for your interest and stay tuned: this post has a follow-up where we set up a full CI/CD pipeline with GitHub Actions. Read on here.

You can reach out to me via:

Resources

Terraform Cloud

GCP — Free tier/ Credits

Terraform API Tokens

Deno

gcloud CLI

Docker

Terraform



CI/CD for Cloud Run with Terraform was originally published in FAUN — Developer Community 🐾 on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[CI/CD for Cloud Run with Terraform]]> https://faun.pub/ci-cd-for-cloud-run-with-terraform-4b656eafa645?source=rss-762da639aba6------2 https://medium.com/p/4b656eafa645 Mon, 04 Jan 2021 09:13:14 GMT 2021-01-05T22:01:39.179Z Setup a pipeline with GitHub actions (Part 2/2)
Shipping code continuously

In this post we will create a CI/CD pipeline to deploy a web service written in Deno to Google Cloud Run with Terraform and GitHub Actions. This is the second part of a series; the first part covered the basic setup. If you haven't read it yet, head over to part I. The source of the sample project can be found here.

GitHub Actions

GitHub released its fully integrated CI/CD workflow tool, GitHub Actions, as GA in November 2019. The tool is event-driven and enables you to run a series of commands after an event has happened. Events can be internal (e.g. push, pull request) or external (e.g. triggered from other sources using tokens). The overall configuration is called the workflow file. A workflow is triggered by events and consists of jobs. A job is a group of steps that runs on the same GitHub runner, hence you can easily share data between steps. A step consists of an action or a shell command; these are the smallest portable building blocks of a workflow file. You can define your own action or include one of the numerous predefined actions.

GitHub Actions

Creating a new workflow

To create a new workflow, we add a new file in our project's root under .github/workflows/build.yml. First, we define a name and the events to listen to, in this case a push to the master branch.

name: Release

on:
  push:
    branches:
      - master

For our project, a release can be split into two parts: first building the Docker image with the new release binary, and second deploying it to production. Therefore we will define two jobs, where the second depends on the first one. Let's go!

Building the Docker image

Jobs reside under the jobs key. The first one builds the Docker image. We define a name and which runner environment to use (runs-on). The first step is a source checkout; instead of reinventing the wheel every time we want to check out, we make use of predefined actions wherever possible. For compatibility reasons, it is recommended to pin the version of the action.

jobs:
  build_Docker_image:
    name: Build Docker Image
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

We want to push our brand new release container to our private GCR (Google Container Registry), eu.gcr.io, which we created in part I. Hence the next step is authentication to GCP, using the service account also created in the prior part of this series. You'll need to create an access key for the service account to use for GitHub's authentication: select the previously created GitHub SA service account and create a new key, choose the JSON format, and save the data (<Your-project-ID>-hashValue.json); it is only downloadable on creation. Now copy the content of the downloaded JSON file and create a GitHub secret (Settings tab → Secrets → New repository secret) named GCR_SERVICE_ACCOUNT_KEY. Additionally, we create a second secret, GCP_PROJECT_ID, with the ID of your target GCP project. You are now able to use Google's predefined action to log in, referring to the two secrets. We export the default credentials to share the information with all following steps inside the job.

      - name: Login to GCP
        uses: google-github-actions/setup-gcloud@master
        with:
          service_account_key: ${{ secrets.GCR_SERVICE_ACCOUNT_KEY }}
          project_id: ${{ secrets.GCP_PROJECT_ID }}
          export_default_credentials: true

Only a few steps remain to complete the Docker build job. Docker needs to be configured to use gcloud as a credential helper, which can be achieved by adding the following step.

      - name: Docker use gcloud CLI as credential helper
        run: |
          gcloud auth configure-docker eu.gcr.io -q

Finally, you can decrease your Docker build time using Docker BuildKit capabilities via buildx, and we need to specify our build command. I wrote another post about BuildKit compared to the legacy Docker build.

      - name: Setup BuildX
        uses: docker/setup-buildx-action@v1
        id: buildx
        with:
          install: true
      - name: Image
        run: |-
          docker build \
            -t eu.gcr.io/$GCLOUD_PROJECT/$SERVICE_NAME:$GITHUB_SHA \
            --push .

With that, we have all parts for building and pushing new Docker images of our Deno web service; we use the commit SHA for image versioning. Let's move on to the deployment part, where we integrate Terraform into GitHub Actions.

Deploy a new Docker image to Google Cloud Run

To implement CI/CD for the Terraform Cloud part, we first take a look at how Terraform Cloud works. In part I we already configured a workspace with remote execution. This means our Terraform commands don't run on the GitHub runners themselves but in an isolated remote environment. Therefore, Terraform commands neither have access to the runner's environment variables, nor are dynamically defined variables currently supported (e.g. via the -var flag or .tfvars). Variables are defined inside workspaces; there are both plan-specific Terraform variables and environment variables. We solely use Terraform variables in our project. The advantage is that your plan does not rely on any "hidden" environment configuration; it's all defined in variables.tf.
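
The link between the repository and that workspace lives in the Terraform backend configuration. A sketch of what it could look like (organization and workspace names are placeholders):

terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "your-organization"

    workspaces {
      name = "your-workspace"
    }
  }
}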

Setup the job

We set up another job called deploy and specify a name as well as the runner environment ubuntu-latest. The job depends on the previous Docker build job, which has to succeed before the Terraform execution; this is expressed with the needs keyword. Additionally, all our Terraform files live inside the gcp directory, so we set the default working directory for all following steps to this folder.

  deploy:
    name: Terraform Deploy
    runs-on: ubuntu-latest
    needs: build_Docker_image
    defaults:
      run:
        working-directory: gcp

The job basically consists of a source checkout, terraform fmt, terraform init, terraform plan, and terraform apply. To run against Terraform Cloud we need to authenticate via a token; it's recommended to use team tokens for CI/CD tools. You can generate team tokens in your Terraform Cloud settings under the Teams tab. Copy the value and set up another GitHub secret (TF_API_TOKEN).

    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v1
        with:
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}
      - name: Terraform Format
        id: fmt
        run: terraform fmt -check
      - name: Terraform Init
        id: init
        run: terraform init
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color
      - name: Terraform Apply
        run: terraform apply -auto-approve

Updating remote variables

The release version of the deployed web service is passed in as a variable, which we set up in part I with a default of latest. As already mentioned, remote backends currently do not support dynamically defined variables (e.g. via the -var flag or .tfvars).

Here comes the tricky part: we want to update the variable, but its value is stored inside the Terraform Cloud workspace configuration. Fortunately, HashiCorp provides a Variables API to update the value. All you need is the variable's ID, which is accessible via this API. Follow these steps to get the ID of the variable:

  1. Perform a CLI login to Terraform Cloud via terraform login. This creates and stores a token inside ~/.terraform.d/credentials.tfrc.json
  2. Retrieve the token with cat ~/.terraform.d/credentials.tfrc.json | jq '..|.token?|select(type!="null")' and set an environment variable with export TOKEN=<Your token value>
  3. List all variables inside your workspace via the Variables API: curl -s --header "Authorization: Bearer $TOKEN" --header "Content-Type: application/vnd.api+json" "https://app.terraform.io/api/v2/vars?filter%5Borganization%5D%5Bname%5D=YOUR_ORGANIZATION_NAME&filter%5Bworkspace%5D%5Bname%5D=YOUR_WORKSPACE_NAME" | jq
  4. Search for the variable app_version, which we already set up in part I.
Copy the ID of the variable

5. Set up another GitHub secret TF_APP_VERSION_VAR_ID with the variable ID retrieved in step 4.

Finalize the pipeline

Finally, we add a step that sets the remote Terraform variable via the Variables API. Remember, we use the commit SHA as the Docker image tag, and this becomes the new value. This step has to be added before the terraform plan and terraform apply steps.

      - name: Terraform set deploy version variable
        run: |
          curl \
            --header "Authorization: Bearer ${{ secrets.TF_API_TOKEN }}" \
            --header "Content-Type: application/vnd.api+json" \
            --request PATCH \
            --data '{
              "data": {
                "id": "${{ secrets.TF_APP_VERSION_VAR_ID }}",
                "type": "vars",
                "attributes": {
                  "key": "app_version",
                  "value": "${{ github.sha }}"
                }
              }
            }' \
            https://app.terraform.io/api/v2/vars/${{ secrets.TF_APP_VERSION_VAR_ID }}

Recap

We successfully created a project with full CI/CD integration via GitHub Actions and Terraform. GitHub Actions is very comfortable to use, especially the ability to use "official" actions like the GCP or Terraform ones; you don't have to reinvent the wheel for every pipeline or implement and host your own templates.

Thanks for your interest. You can reach out to me via:

Resources

GitHub actions

Docker Buildkit (buildx)

Terraform Cloud Token



CI/CD for Cloud Run with Terraform was originally published in FAUN — Developer Community 🐾 on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>