# Infrastructure as Code (AWS)
Terraform is used to create the underlying infrastructure that supports Hydra.

## Remote State

When working collaboratively on a Terraform project, it is recommended to use remote state and lock files. Remote storage for the state and locks should be set up before provisioning any infrastructure: an S3 bucket stores the state of the Terraform project and a DynamoDB table stores its locks. To set up this remote backend, complete the following steps.
Prerequisites:

- Read and write access to S3 in your AWS account
1. Configure your AWS CLI.

2. Change directory into `~/hydra/iac/aws/remote_state`.

3. Create a variable definitions file (`.tfvars`) with suitable values, or enter the variable values manually on the command line.

4. Initialize the Terraform project.

   ```bash
   terraform init
   ```

5. Review and authorize the changes to the infrastructure.

   ```bash
   terraform apply
   ```
Input Variables

Name | Description | Type |
---|---|---|
remote_state_bucket_name | The name of the bucket that stores the remote state of a Terraform project | string |
remote_locks_dynamodb_name | The name of the DynamoDB table that stores the remote locks of a Terraform project | string |
remote_locks_read_capacity | The read capacity of the DynamoDB table that stores the remote locks of a Terraform project | number |
remote_locks_write_capacity | The write capacity of the DynamoDB table that stores the remote locks of a Terraform project | number |
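For reference, a `.tfvars` file for this module might look like the following sketch; every value below is a placeholder to adapt to your own account:

```hcl
# terraform.tfvars -- illustrative values only
remote_state_bucket_name    = "hydra-terraform-remote-state"
remote_locks_dynamodb_name  = "hydra-terraform-remote-locks"
remote_locks_read_capacity  = 5
remote_locks_write_capacity = 5
```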
## Batch

AWS Batch is a cloud service for running large-scale batch computing jobs on AWS. Batch relies on compute environments, which contain ECS container instances that run containerized batch jobs, and it can dynamically scale the allocated vCPUs based on demand. Jobs are submitted to job queues, where they reside until they can be scheduled to run in a compute environment.

To support jobs run through Batch in Hydra, this infrastructure setup tracks job metadata from Hydra in an RDS MySQL instance whose credentials are stored in Secrets Manager. The database is initialized from a table setup SQL script, stored as an S3 bucket object, that defines the schemas of the database's tables. Once the database has been created, Terraform invokes a lambda function that sequentially executes the data definition commands in the SQL script against the newly created database.
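As a rough illustration of that last step, Terraform can trigger a one-off invocation through the AWS provider's `aws_lambda_invocation` data source. The sketch below is illustrative only; the resource names and payload are assumptions, not necessarily what this project's `lambda` module uses:

```hcl
# Hypothetical sketch: invoke the DB-initialization function once the
# database and the uploaded SQL script exist. All names are illustrative.
data "aws_lambda_invocation" "initialize_db" {
  function_name = aws_lambda_function.initialize_db.function_name

  input = jsonencode({
    bucket = var.table_setup_script_bucket_name # S3 object holding the SQL script
    key    = var.table_setup_script_bucket_key
  })

  depends_on = [aws_db_instance.batch_backend_store]
}
```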
Prerequisites:

- Read and write access to all of the above-mentioned services on your AWS account
1. Configure your AWS CLI.

2. Change directory into `~/hydra/iac/aws/batch`.

3. In `main.tf`, set the appropriate `bucket`, `region`, and `dynamodb_table` values to the remote backend that was created earlier, and set the `key` value to the path where the state and locks will be stored (see the example backend block after this list).

4. Create a variable definitions file (`.tfvars`) with suitable values, or enter the variable values manually on the command line.

5. Initialize the Terraform project.

   ```bash
   terraform init
   ```

   NOTE: The next two steps can be completed together using `make build_lambda`.

6. Change directory into `~/hydra/iac/aws/batch/modules/lambda/function` and install the lambda's dependencies locally.

   ```bash
   pip3 install pymysql sqlalchemy -t .
   ```

7. Select all folders and files in this directory and compress them into a ZIP file named `batch_lambda.zip`.

   ```bash
   zip -r batch_lambda.zip .
   ```

8. Change directory back into `~/hydra/iac/aws/batch`, then review and authorize the changes to the infrastructure.

   ```bash
   terraform apply
   ```
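For step 3, the backend configuration in `main.tf` might look like the following sketch; every value is a placeholder to replace with the names chosen for your remote backend:

```hcl
terraform {
  backend "s3" {
    bucket         = "hydra-terraform-remote-state" # placeholder: your remote_state_bucket_name
    key            = "batch/terraform.tfstate"      # placeholder: path for this project's state
    region         = "us-east-1"                    # placeholder: your backend's region
    dynamodb_table = "hydra-terraform-remote-locks" # placeholder: your remote_locks_dynamodb_name
  }
}
```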
The `permissions` module is responsible for creating IAM roles with appropriate permissions to attach to the compute environment service, the compute environment instance, and the lambda function.
Input Variables
Name | Description | Type |
---|---|---|
compute_envionment_service_role_name | IAM name of the compute environment service role | string |
compute_envionment_service_iam_policy_arn | IAM policies attached to compute environment service role | list(string) |
compute_envionment_instance_role_name | IAM name of the compute environment instance role | string |
compute_envionment_instance_iam_policy_arn | IAM policies attached to compute environment instance role | list(string) |
lambda_service_role_name | IAM name of the lambda function service role | string |
lambda_service_iam_policy_arn | IAM policies attached to the lambda function service role | list(string) |
Output Values
Name | Description |
---|---|
compute_environment_service_role_arn | ARN of the IAM compute environment service role |
compute_environmnet_instance_profile_arn | ARN of the IAM compute environment instance profile |
lambda_service_role_arn | ARN of the IAM lambda function service role |
The `secrets` module is responsible for randomly generating a username and password and then storing them in Secrets Manager.
Input Variables
Name | Description | Type |
---|---|---|
username_length | Number of characters in randomly generated username | number |
password_length | Number of characters in randomly generated password | number |
username_recovery_window | Number of days before username secret can be deleted | number |
password_recovery_window | Number of days before password secret can be deleted | number |
username_secret_name | Name of the username secret | string |
password_secret_name | Name of the password secret | string |
Output Values
Name | Description |
---|---|
username | Randomly generated username |
username_secret | Secret name of the username of the RDS instance |
password | Randomly generated password |
password_secret | Secret name of the password of the RDS instance |
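As a rough sketch of what such a module typically does, assuming the standard `random` and AWS providers (the resource names here are illustrative, not necessarily the module's own):

```hcl
# Generate a random password and store it in Secrets Manager.
resource "random_password" "password" {
  length  = var.password_length
  special = false
}

resource "aws_secretsmanager_secret" "password" {
  name                    = var.password_secret_name
  recovery_window_in_days = var.password_recovery_window
}

resource "aws_secretsmanager_secret_version" "password" {
  secret_id     = aws_secretsmanager_secret.password.id
  secret_string = random_password.password.result
}
```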
The `networking` module is responsible for creating a database subnet group to be associated with an RDS instance.
Input Variables
Name | Description | Type |
---|---|---|
rds_subnet_group_name | Name of the database subnet group to be created | string |
rds_subnets | The IDs of the subnets to be attached to the RDS subnet group | list(string) |
Output Values
Name | Description |
---|---|
db_subnet_group | Name of the created database subnet group |
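A database subnet group boils down to a single resource; a minimal sketch under the same variable names (the resource name is illustrative):

```hcl
# Group the given subnets so an RDS instance can be placed in them.
resource "aws_db_subnet_group" "rds" {
  name       = var.rds_subnet_group_name
  subnet_ids = var.rds_subnets
}
```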
The `storage` module is responsible for creating an RDS MySQL instance and storing the table setup SQL script as an S3 object.
Input Variables
Name | Description | Type |
---|---|---|
table_setup_script_bucket_name | The name of the S3 bucket that will store the table setup script | string |
table_setup_script_bucket_key | The S3 key under which the table setup script will be stored | string |
table_setup_script_local_path | The local path of the SQL script to be executed in RDS | string |
batch_backend_store_identifier | The identifier of the RDS database to be created | string |
allocated_storage | The allocated storage of the RDS database to be created (in GiB) | string |
storage_type | The storage type of the RDS database to be created | string |
db_engine_version | The engine version of the RDS MySQL database | string |
db_instance_class | The instance class of the RDS database to be created | string |
db_default_name | The name of the default database that is created in RDS | string |
skip_final_snapshot | Whether a final snapshot is created immediately before the database is deleted | bool |
db_username | The admin username of the database | string |
db_password | The admin password of the database | string |
db_subnet_group_name | The name of the database subnet group | string |
vpc_security_groups | The security groups associated with the database instance | list(string) |
publicly_accessible | Whether the RDS instance is made publicly accessible | bool |
Output Values
Name | Description |
---|---|
db_host | The hostname of the created RDS MySQL instance |
table_setup_script_bucket_name | S3 bucket that stores the table setup SQL script |
table_setup_script_bucket_key | S3 key of the object that stores the table setup SQL script |
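For the table setup script specifically, uploading a local SQL file as an S3 object is a one-resource affair; a sketch using the module's variable names (the resource name is illustrative, and newer AWS providers call this resource `aws_s3_object`):

```hcl
# Upload the local SQL script as an object in the given bucket.
resource "aws_s3_bucket_object" "table_setup_script" {
  bucket = var.table_setup_script_bucket_name
  key    = var.table_setup_script_bucket_key
  source = var.table_setup_script_local_path

  # Re-upload whenever the local script changes.
  etag = filemd5(var.table_setup_script_local_path)
}
```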
The `batch` module is responsible for dynamically creating job queues and compute environments in AWS Batch.
Input Variables
Name | Description | Type |
---|---|---|
aws_region | AWS Region | string |
compute_environments | List of maps of compute environments to be created; each map has a 'name' key and an 'instance_type' key (see the example below) | list |
compute_environment_instance_profile_arn | ARN of the instance profile to be used in compute environment | string |
compute_environment_resource_type | Resource type to be used in compute environment: Valid options are 'EC2' or 'SPOT' | string |
compute_environment_max_vcpus | Maximum vCPUs that the compute environment should maintain | number |
compute_environment_min_vcpus | Minimum vCPUs that the compute environment should maintain | number |
compute_environment_security_group_ids | EC2 security groups associated with instances within compute environment | list(string) |
compute_environment_service_role_arn | ARN of the IAM role allowing Batch to call other services | string |
compute_environment_subnets | Subnets that compute resources are launched in | list(string) |
compute_environment_type | The type of the compute environment: Valid options are 'MANAGED' or 'UNMANAGED' | string |
job_queues | List of maps of job queues to be created; each map has a 'name' key and a 'compute_environment' key (see the example below) | list |
job_queue_priority | Priority of the job queue | number |
job_queue_state | The state of the job queue: Valid options are 'ENABLED' or 'DISABLED' | string |
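The shape of the `compute_environments` and `job_queues` lists is easiest to see by example; the names and instance types below are purely illustrative values you might place in a `.tfvars` file:

```hcl
compute_environments = [
  { name = "batch-ce-small", instance_type = "c5.large" },
  { name = "batch-ce-large", instance_type = "c5.4xlarge" },
]

job_queues = [
  { name = "batch-queue-small", compute_environment = "batch-ce-small" },
  { name = "batch-queue-large", compute_environment = "batch-ce-large" },
]
```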
The `lambda` module is responsible for building a lambda function with handler `batch_lambda.initialize_db`, and invoking this function.
Input Variables
Name | Description | Type |
---|---|---|
aws_region | AWS region | string |
lambda_service_role_arn | ARN of the IAM lambda function service role | string |
lambda_function_file_path | File path of the lambda ZIP file that will be executed | string |
lambda_function_timeout | Timeout of the executed lambda function | number |
lambda_function_name | Name of the lambda function that will be created | string |
lambda_security_group_ids | List of security groups to attach lambda function to | list(string) |
lambda_subnets | List of subnets to attach lambda function to | list(string) |
database_hostname | Hostname of the Batch RDS instance | string |
database_username_secret | Secret name of the Batch RDS Username | string |
database_password_secret | Secret name of the Batch RDS Password | string |
database_default_name | Default database name created in RDS instance | string |
table_setup_script_bucket_name | The name of the S3 bucket that will store the table setup script | string |
table_setup_script_bucket_key | The S3 key under which the table setup script is stored | string |
Output Values
Name | Description |
---|---|
lambda_invocation_response | Output of the invoked lambda function initializing the batch database |
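A plausible sketch of the function resource this module builds; the runtime and resource name are assumptions, not confirmed by the module:

```hcl
resource "aws_lambda_function" "initialize_db" {
  function_name = var.lambda_function_name
  filename      = var.lambda_function_file_path # the batch_lambda.zip built earlier
  handler       = "batch_lambda.initialize_db"
  runtime       = "python3.8" # assumption; check the module for the actual runtime
  role          = var.lambda_service_role_arn
  timeout       = var.lambda_function_timeout

  # Run inside the VPC so the function can reach the RDS instance.
  vpc_config {
    subnet_ids         = var.lambda_subnets
    security_group_ids = var.lambda_security_group_ids
  }
}
```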
### `make destroy`

To destroy all of the Batch-associated infrastructure in Hydra created with Terraform, there is a 3-step process involved:

1. Destroy all of the job queues managed by this Terraform project.

   ```bash
   terraform destroy -target=module.batch.aws_batch_job_queue.batch_job_queue
   ```

2. Destroy all of the compute environments managed by this Terraform project.

   ```bash
   terraform destroy -target=module.batch.aws_batch_compute_environment.batch_compute_environment
   ```

3. Destroy all remaining infrastructure managed by this Terraform project.

   ```bash
   terraform destroy
   ```

All of these steps can be run at once using `make destroy`.
### `make build_lambda`

To build the lambda function's deployment package, there is a 2-step process involved:

1. Install the PyPI libraries `pymysql` and `sqlalchemy` locally into the path `~/hydra/iac/aws/batch/modules/lambda/function`.

   ```bash
   pip3 install pymysql sqlalchemy -t ./modules/lambda/function
   ```

2. Change directory into `~/hydra/iac/aws/batch/modules/lambda/function`, create a ZIP file called `batch_lambda.zip` compressing all existing files in this path, and change directory back into `~/hydra/iac/aws/batch`.

   ```bash
   cd ./modules/lambda/function; zip -r batch_lambda.zip .; cd ../../..
   ```

Both steps can be run at once using `make build_lambda`.
## MLflow

MLflow is an open source platform that manages the machine learning lifecycle. In this infrastructure setup, the MLflow tracking server runs in a Docker container deployed on ECS Fargate, with the Docker images stored in ECR. MLflow logs are stored in an RDS MySQL instance whose credentials are stored in Secrets Manager, and MLflow models are stored in an S3 bucket. Autoscaling is set up so that containers are automatically deployed and destroyed based on demand. To set up this infrastructure, complete the following steps.

NOTE: In this infrastructure build, when tracking jobs under a new experiment name, be sure to create the experiment in the MLflow UI before tracking any jobs under it.
Prerequisites:

- Read and write access to all of the above-mentioned services on your AWS account
1. Configure your AWS CLI.

2. Change directory into `~/hydra/iac/aws/mlflow`.

3. In `main.tf`, set the appropriate `bucket`, `region`, and `dynamodb_table` values to the remote backend that was created earlier, and set the `key` value to the path where the state and locks will be stored.

4. Create a variable definitions file (`.tfvars`) with suitable values, or enter the variable values manually on the command line.

5. Initialize the Terraform project.

   ```bash
   terraform init
   ```

6. In `docker_push.sh`, set the region and repository variables to the desired values for your Docker image repository on ECR.

   NOTE: The next three steps can be completed together by running `make start`.

7. Create the Docker image registry on ECR.

   ```bash
   terraform apply -target=module.container_repository.aws_ecr_repository.mlflow_container_repository
   ```

8. Build the Docker image from the local Dockerfile and push it to ECR.

   ```bash
   bash docker_push.sh
   ```

9. Review and authorize the changes to the remaining infrastructure.

   ```bash
   terraform apply
   ```
The `container_repository` module is responsible for creating a Docker image registry using ECR.
Input Variables
Name | Description | Type |
---|---|---|
mlflow_container_repository | Name of the Docker container registry to be created | string |
scan_on_push | Whether to scan Docker images for vulnerabilities on push | bool |
Output Values
Name | Description |
---|---|
container_repository_url | URL of the created Docker container registry |
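This amounts to a single ECR resource; a minimal sketch under the module's variable names (the resource name matches the `-target` address used in the setup steps):

```hcl
# Create the Docker image registry in ECR.
resource "aws_ecr_repository" "mlflow_container_repository" {
  name = var.mlflow_container_repository

  image_scanning_configuration {
    scan_on_push = var.scan_on_push
  }
}
```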
The `permissions` module is responsible for creating an IAM role to complete the necessary ECS tasks and a security group to control inbound and outbound traffic.
Input Variables
Name | Description | Type |
---|---|---|
mlflow_ecs_tasks_role | Name of the IAM role to be created | string |
ecs_task_iam_policy_arn | IAM policies to attach to the created IAM role | list(string) |
mlflow_sg | Name of the security group to be created | string |
vpc_id | The ID of the VPC to be used for the security group | string |
cidr_blocks | List of CIDR blocks to allow ingress access | list(string) |
Output Values
Name | Description |
---|---|
mlflow_sg_id | ID of the created security group |
mlflow_ecs_tasks_role_arn | ARN of the created IAM role that will execute ECS tasks |
The `networking` module is responsible for creating a database subnet group to be associated with an RDS instance.
Input Variables
Name | Description | Type |
---|---|---|
rds_subnet_group_name | Name of the database subnet group to be created | string |
rds_subnets | The IDs of the subnets to be attached to the RDS subnet group | list(string) |
Output Values
Name | Description |
---|---|
db_subnet_group | Name of the created database subnet group |
The `secrets` module is responsible for randomly generating a username and password and then storing them in Secrets Manager.
Input Variables
Name | Description | Type |
---|---|---|
username_length | Number of characters in randomly generated username | number |
password_length | Number of characters in randomly generated password | number |
username_recovery_window | Number of days before username secret can be deleted | number |
password_recovery_window | Number of days before password secret can be deleted | number |
username_secret_name | Name of the username secret | string |
password_secret_name | Name of the password secret | string |
Output Values
Name | Description |
---|---|
username | Randomly generated username |
username_arn | ARN of username secret |
password | Randomly generated password |
password_arn | ARN of password secret |
The `load_balancing` module is responsible for creating an application load balancer and its target group.
Input Variables
Name | Description | Type |
---|---|---|
vpc_id | The ID of the VPC to be used for the load balancer | string |
lb_name | Name of the application load balancer to be created | string |
lb_security_groups | List of the security group IDs to be attached to the load balancer | list(string) |
lb_subnets | List of the subnets to be attached to the load balancer (must be at least two) | list(string) |
lb_target_group | Name of the load balancer target group to be created | string |
Output Values
Name | Description |
---|---|
lb_target_group_arn | ARN of the created Load Balancer target group |
The `storage` module is responsible for creating an RDS MySQL instance and an S3 bucket to serve as the MLflow backend store and artifact store, respectively.
Input Variables
Name | Description | Type |
---|---|---|
mlflow_artifact_store | The name of the S3 bucket to be created | string |
mlflow_backend_store_identifier | The identifier of the RDS database to be created | string |
allocated_storage | The allocated storage of the RDS database to be created (in GiB) | string |
storage_type | The storage type of the RDS database to be created | string |
db_engine_version | The engine version of the RDS MySQL database | string |
db_instance_class | The instance class of the RDS database to be created | string |
db_default_name | The name of the default database that is created in RDS | string |
skip_final_snapshot | Whether a final snapshot is created immediately before the database is deleted | bool |
db_username | The admin username of the database | string |
db_password | The admin password of the database | string |
db_subnet_group_name | The name of the database subnet group | string |
vpc_security_groups | The security groups associated with the database instance | list(string) |
Output Values
Name | Description |
---|---|
db_host | The hostname of the created RDS MySQL instance |
db_name | The name of the default database of the RDS MySQL instance |
s3_bucket | The name of the created S3 bucket |
The `task_deployment` module is responsible for creating an ECS cluster, a Fargate task definition that runs the MLflow tracking service, and an ECS service that runs deployed instances of this task.
Input Variables
Name | Description | Type |
---|---|---|
aws_region | AWS region | string |
mlflow_server_cluster | Name of MLflow server cluster to be created | string |
ecs_service_name | Name of the ECS Fargate service to be created | string |
cloudwatch_log_group | Name of the CloudWatch log group to be created and associated with the service | string |
mlflow_ecs_task_family | Name of the ECS task family to be created | string |
container_name | Name of the container to be run using a Fargate task | string |
s3_bucket_name | Name of the S3 bucket that will be the artifact store | string |
s3_bucket_folder | Name of the folder in the S3 bucket that will be used to store models | string |
db_name | Name of the database that will be used as the backend store | string |
db_host | Hostname of the database that will be used as the backend store | string |
db_port | Port of the database connection | string |
docker_image | URL of the docker image that will be used to run the task in Fargate | string |
task_memory | Total memory to be used by a single Fargate task (in MiB) | number |
task_cpu | Number of CPU units to be used by a single Fargate task | number |
admin_username_arn | ARN of the RDS admin username secret | string |
admin_password_arn | ARN of the RDS admin password secret | string |
task_role_arn | ARN of the IAM task role | string |
execution_role_arn | ARN of the IAM execution role | string |
aws_lb_target_group_arn | ARN of the load balancer target group | string |
ecs_service_subnets | Subnets to be used in the network configuration of the created ECS service | list(string) |
ecs_service_security_groups | Security groups to be attached to the created ECS service | list(string) |
Output Values
Name | Description |
---|---|
ecs_service_name | Name of the created ECS service |
ecs_cluster_name | Name of the created ECS cluster |
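A rough sketch of the Fargate task definition at the heart of this module; the container definition below is abbreviated, and the resource name and secret names are illustrative assumptions:

```hcl
resource "aws_ecs_task_definition" "mlflow" {
  family                   = var.mlflow_ecs_task_family
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  task_role_arn            = var.task_role_arn
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([{
    name  = var.container_name
    image = var.docker_image
    # Inject the RDS credentials from Secrets Manager at container start.
    secrets = [
      { name = "DB_USERNAME", valueFrom = var.admin_username_arn },
      { name = "DB_PASSWORD", valueFrom = var.admin_password_arn },
    ]
  }])
}
```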
The `autoscaling` module is responsible for creating and attaching autoscaling policies based on CPU and memory utilization percentage.

Input Variables

Name | Description | Type |
---|---|---|
server_cluster_name | Name of ECS cluster to create autoscaling policy in | string |
ecs_service_name | Name of ECS service to create autoscaling policy in | string |
min_tasks | Minimum number of running tasks in ECS service | number |
max_tasks | Maximum number of running tasks in ECS service | number |
memory_autoscaling_policy_name | Name of memory autoscaling policy to be created | string |
cpu_autoscaling_policy_name | Name of CPU autoscaling policy to be created | string |
memory_autoscale_in_cooldown | Cooldown time for scale in based on memory metric (in seconds) | number |
memory_autoscale_out_cooldown | Cooldown time for scale out based on memory metric (in seconds) | number |
memory_autoscale_target | Target value of memory utilization percentage in each task | number |
cpu_autoscale_in_cooldown | Cooldown time for scale in based on CPU metric (in seconds) | number |
cpu_autoscale_out_cooldown | Cooldown time for scale out based on CPU metric (in seconds) | number |
cpu_autoscale_target | Target value of CPU utilization percentage in each task | number |
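Target-tracking autoscaling of this kind is typically a pair of Application Auto Scaling resources; a sketch of the CPU policy under the module's variable names (the memory policy is analogous, and the resource names are illustrative):

```hcl
# Register the ECS service's desired count as a scalable target.
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.server_cluster_name}/${var.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = var.min_tasks
  max_capacity       = var.max_tasks
}

# Scale the task count to hold average CPU utilization near the target.
resource "aws_appautoscaling_policy" "cpu" {
  name               = var.cpu_autoscaling_policy_name
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = var.cpu_autoscale_target
    scale_in_cooldown  = var.cpu_autoscale_in_cooldown
    scale_out_cooldown = var.cpu_autoscale_out_cooldown
  }
}
```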
### `make start`

To create an MLflow server running on ECS Fargate from scratch, there is a 3-step process involved:

1. Create the Docker image registry on ECR.

   ```bash
   terraform apply -target=module.container_repository.aws_ecr_repository.mlflow_container_repository
   ```

2. Build the Docker image from the local Dockerfile and push it to ECR.

   ```bash
   bash docker_push.sh
   ```

3. Review and authorize the changes to the remaining infrastructure.

   ```bash
   terraform apply
   ```

All of these steps can be run at once using `make start`.
### `make update_container`

To update the base image that is used to run the MLflow server in ECS, there is a 3-step process involved:

1. Destroy the existing ECS service and its dependent resources.

   ```bash
   terraform destroy -target=module.task_deployment.aws_ecs_service.service
   ```

2. Build the Docker image from the local Dockerfile and push it to ECR.

   ```bash
   bash docker_push.sh
   ```

3. Re-apply and authorize the changes to recreate the service in ECS.

   ```bash
   terraform apply
   ```

All of these steps can be run at once using `make update_container`.