# Infrastructure as Code (AWS)
Terraform is used to create the underlying infrastructure that supports Hydra.

## Remote State

When working collaboratively on a Terraform project, it is recommended to use remote state and lock files. Remote storage for the state and locks should be set up before provisioning any infrastructure: an S3 bucket stores the state of the Terraform project and a DynamoDB table stores its locks. To set up this remote backend, complete the following steps.
Prerequisites:

- Read and write access to S3 in your AWS account
1. Configure your AWS CLI.

2. Change directory into `~/hydra/iac/aws/remote_state`.

3. Create a variable definitions file (`.tfvars`) with suitable values, or enter the variable values manually on the command line.

4. Initialize the Terraform project.

   ```bash
   terraform init
   ```

5. Review and authorize the changes to the infrastructure.

   ```bash
   terraform apply
   ```
Input Variables

Name | Description | Type |
---|---|---|
remote_state_bucket_name | The name of the bucket that stores the remote state of a Terraform project | string |
remote_locks_dynamodb_name | The name of the DynamoDB table that stores the remote locks of a Terraform project | string |
remote_locks_read_capacity | The read capacity of the DynamoDB table that stores the remote locks of a Terraform project | number |
remote_locks_write_capacity | The write capacity of the DynamoDB table that stores the remote locks of a Terraform project | number |
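For reference, a `.tfvars` file for this module might look like the following sketch; every value below is a placeholder to adapt to your own account:

```hcl
# terraform.tfvars -- illustrative values only
remote_state_bucket_name    = "hydra-terraform-remote-state"
remote_locks_dynamodb_name  = "hydra-terraform-remote-locks"
remote_locks_read_capacity  = 5
remote_locks_write_capacity = 5
```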
## Batch

AWS Batch is a cloud service for running large-scale batch computing jobs on AWS. Batch relies on compute environments, which contain ECS container instances that run containerized batch jobs, and it can dynamically scale the allocated vCPUs based on demand. Jobs are submitted to job queues, where they reside until they can be scheduled to run in a compute environment.

To support jobs run through Batch in Hydra, this infrastructure setup tracks job metadata from Hydra in an RDS MySQL instance whose credentials are stored in Secrets Manager. The database is initialized from a table setup SQL script, stored as an S3 bucket object, that defines the schemas of the database's tables. Once the database has been created, Terraform invokes a lambda function that sequentially executes the data definition commands in the SQL script against the newly created database.
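As a rough illustration of that last step, Terraform can trigger a one-off invocation through the AWS provider's `aws_lambda_invocation` data source. The sketch below is illustrative only; the resource names and payload are assumptions, not necessarily what this project's `lambda` module uses:

```hcl
# Hypothetical sketch: invoke the DB-initialization function once the
# database and the uploaded SQL script exist. All names are illustrative.
data "aws_lambda_invocation" "initialize_db" {
  function_name = aws_lambda_function.initialize_db.function_name

  input = jsonencode({
    bucket = var.table_setup_script_bucket_name # S3 object holding the SQL script
    key    = var.table_setup_script_bucket_key
  })

  depends_on = [aws_db_instance.batch_backend_store]
}
```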
Prerequisites:

- Read and write access to all of the above-mentioned services on your AWS account
1. Configure your AWS CLI.

2. Change directory into `~/hydra/iac/aws/batch`.

3. In `main.tf`, set the appropriate `bucket`, `region`, and `dynamodb_table` values to the remote backend that was created earlier, and set the `key` value to the path where the state and locks will be stored (see the example backend block after this list).

4. Create a variable definitions file (`.tfvars`) with suitable values, or enter the variable values manually on the command line.

5. Initialize the Terraform project.

   ```bash
   terraform init
   ```

   NOTE: The next two steps can be completed together using `make build_lambda`.

6. Change directory into `~/hydra/iac/aws/batch/modules/lambda/function` and install the lambda's dependencies locally.

   ```bash
   pip3 install pymysql sqlalchemy -t .
   ```

7. Select all folders and files in this directory and compress them into a ZIP file named `batch_lambda.zip`.

   ```bash
   zip -r batch_lambda.zip .
   ```

8. Change directory back into `~/hydra/iac/aws/batch`, then review and authorize the changes to the infrastructure.

   ```bash
   terraform apply
   ```
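For step 3, the backend configuration in `main.tf` might look like the following sketch; every value is a placeholder to replace with the names chosen for your remote backend:

```hcl
terraform {
  backend "s3" {
    bucket         = "hydra-terraform-remote-state" # placeholder: your remote_state_bucket_name
    key            = "batch/terraform.tfstate"      # placeholder: path for this project's state
    region         = "us-east-1"                    # placeholder: your backend's region
    dynamodb_table = "hydra-terraform-remote-locks" # placeholder: your remote_locks_dynamodb_name
  }
}
```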
The `permissions` module is responsible for creating IAM roles with appropriate permissions to attach to the compute environment service, the compute environment instance, and the lambda function.
Input Variables
Name | Description | Type |
---|---|---|
compute_envionment_service_role_name | IAM name of the compute environment service role | string |
compute_envionment_service_iam_policy_arn | IAM policies attached to compute environment service role | list(string) |
compute_envionment_instance_role_name | IAM name of the compute environment instance role | string |
compute_envionment_instance_iam_policy_arn | IAM policies attached to compute environment instance role | list(string) |
lambda_service_role_name | IAM name of the lambda function service role | string |
lambda_service_iam_policy_arn | IAM policies attached to the lambda function service role | list(string) |
Output Values
Name | Description |
---|---|
compute_environment_service_role_arn | ARN of the IAM compute environment service role |
compute_environmnet_instance_profile_arn | ARN of the IAM compute environment instance profile |
lambda_service_role_arn | ARN of the IAM lambda function service role |
The `secrets` module is responsible for randomly generating a username and password and then storing them in Secrets Manager.
Input Variables
Name | Description | Type |
---|---|---|
username_length | Number of characters in randomly generated username | number |
password_length | Number of characters in randomly generated password | number |
username_recovery_window | Number of days before username secret can be deleted | number |
password_recovery_window | Number of days before password secret can be deleted | number |
username_secret_name | Name of the username secret | string |
password_secret_name | Name of the password secret | string |
Output Values
Name | Description |
---|---|
username | Randomly generated username |
username_secret | Secret name of the username of the RDS instance |
password | Randomly generated password |
password_secret | Secret name of the password of the RDS instance |
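As a rough sketch of what such a module typically does, assuming the standard `random` and AWS providers (the resource names here are illustrative, not necessarily the module's own):

```hcl
# Generate a random password and store it in Secrets Manager.
resource "random_password" "password" {
  length  = var.password_length
  special = false
}

resource "aws_secretsmanager_secret" "password" {
  name                    = var.password_secret_name
  recovery_window_in_days = var.password_recovery_window
}

resource "aws_secretsmanager_secret_version" "password" {
  secret_id     = aws_secretsmanager_secret.password.id
  secret_string = random_password.password.result
}
```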
The `networking` module is responsible for creating a database subnet group to be associated with an RDS instance.
Input Variables
Name | Description | Type |
---|---|---|
rds_subnet_group_name | Name of the database subnet group to be created | string |
rds_subnets | The IDs of the subnets to be attached to the RDS subnet group | list(string) |
Output Values
Name | Description |
---|---|
db_subnet_group | Name of the created database subnet group |
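A database subnet group boils down to a single resource; a minimal sketch under the same variable names (the resource name is illustrative):

```hcl
# Group the given subnets so an RDS instance can be placed in them.
resource "aws_db_subnet_group" "rds" {
  name       = var.rds_subnet_group_name
  subnet_ids = var.rds_subnets
}
```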
The `storage` module is responsible for creating an RDS MySQL instance and storing the table setup SQL script as an S3 object.
Input Variables
Name | Description | Type |
---|---|---|
table_setup_script_bucket_name | The name of the S3 bucket that will store the table setup script | string |
table_setup_script_bucket_key | The S3 key under which the table setup script will be stored | string |
table_setup_script_local_path | The local path of the SQL script to be executed in RDS | string |
batch_backend_store_identifier | The identifier of the RDS database to be created | string |
allocated_storage | The allocated storage of the RDS database to be created (in GiB) | string |
storage_type | The storage type of the RDS database to be created | string |
db_engine_version | The engine version of the RDS MySQL database | string |
db_instance_class | The instance class of the RDS database to be created | string |
db_default_name | The name of the default database that is created in RDS | string |
skip_final_snapshot | Whether a final snapshot is created immediately before the database is deleted | bool |
db_username | The admin username of the database | string |
db_password | The admin password of the database | string |
db_subnet_group_name | The name of the database subnet group | string |
vpc_security_groups | The security groups associated with the database instance | list(string) |
publicly_accessible | Whether the RDS instance is made publicly accessible | bool |
Output Values
Name | Description |
---|---|
db_host | The hostname of the created RDS MySQL instance |
table_setup_script_bucket_name | S3 bucket that stores the table setup SQL script |
table_setup_script_bucket_key | S3 key of the object that stores the table setup SQL script |
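For the table setup script specifically, uploading a local SQL file as an S3 object is a one-resource affair; a sketch using the module's variable names (the resource name is illustrative, and newer AWS providers call this resource `aws_s3_object`):

```hcl
# Upload the local SQL script as an object in the given bucket.
resource "aws_s3_bucket_object" "table_setup_script" {
  bucket = var.table_setup_script_bucket_name
  key    = var.table_setup_script_bucket_key
  source = var.table_setup_script_local_path

  # Re-upload whenever the local script changes.
  etag = filemd5(var.table_setup_script_local_path)
}
```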
The `batch` module is responsible for dynamically creating job queues and compute environments in AWS Batch.
Input Variables
Name | Description | Type |
---|---|---|
aws_region | AWS Region | string |
compute_environments | List of maps of compute environments to be created; each map has a 'name' key and an 'instance_type' key (see the example below) | list |
compute_environment_instance_profile_arn | ARN of the instance profile to be used in compute environment | string |
compute_environment_resource_type | Resource type to be used in compute environment: Valid options are 'EC2' or 'SPOT' | string |
compute_environment_max_vcpus | Maximum vCPUs that the compute environment should maintain | number |
compute_environment_min_vcpus | Minimum vCPUs that the compute environment should maintain | number |
compute_environment_security_group_ids | EC2 security groups associated with instances within compute environment | list(string) |
compute_environment_service_role_arn | ARN of the IAM role allowing Batch to call other services | string |
compute_environment_subnets | Subnets that compute resources are launched in | list(string) |
compute_environment_type | The type of the compute environment: Valid options are 'MANAGED' or 'UNMANAGED' | string |
job_queues | List of maps of job queues to be created; each map has a 'name' key and a 'compute_environment' key (see the example below) | list |
job_queue_priority | Priority of the job queue | number |
job_queue_state | The state of the job queue: Valid options are 'ENABLED' or 'DISABLED' | string |
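The shape of the `compute_environments` and `job_queues` lists is easiest to see by example; the names and instance types below are purely illustrative values you might place in a `.tfvars` file:

```hcl
compute_environments = [
  { name = "batch-ce-small", instance_type = "c5.large" },
  { name = "batch-ce-large", instance_type = "c5.4xlarge" },
]

job_queues = [
  { name = "batch-queue-small", compute_environment = "batch-ce-small" },
  { name = "batch-queue-large", compute_environment = "batch-ce-large" },
]
```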
The `lambda` module is responsible for building a lambda function with handler `batch_lambda.initialize_db`, and invoking this function.
Input Variables
Name | Description | Type |
---|---|---|
aws_region | AWS region | string |
lambda_service_role_arn | ARN of the IAM lambda function service role | string |
lambda_function_file_path | File path of the lambda ZIP file that will be executed | string |
lambda_function_timeout | Timeout of the executed lambda function | number |
lambda_function_name | Name of the lambda function that will be created | string |
lambda_security_group_ids | List of security groups to attach lambda function to | list(string) |
lambda_subnets | List of subnets to attach lambda function to | list(string) |
database_hostname | Hostname of the Batch RDS instance | string |
database_username_secret | Secret name of the Batch RDS Username | string |
database_password_secret | Secret name of the Batch RDS Password | string |
database_default_name | Default database name created in RDS instance | string |
table_setup_script_bucket_name | The name of the S3 bucket that will store the table setup script | string |
table_setup_script_bucket_key | The S3 key under which the table setup script is stored | string |
Output Values
Name | Description |
---|---|
lambda_invocation_response | Output of the invoked lambda function initializing the batch database |
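A plausible sketch of the function resource this module builds; the runtime and resource name are assumptions, not confirmed by the module:

```hcl
resource "aws_lambda_function" "initialize_db" {
  function_name = var.lambda_function_name
  filename      = var.lambda_function_file_path # the batch_lambda.zip built earlier
  handler       = "batch_lambda.initialize_db"
  runtime       = "python3.8" # assumption; check the module for the actual runtime
  role          = var.lambda_service_role_arn
  timeout       = var.lambda_function_timeout

  # Run inside the VPC so the function can reach the RDS instance.
  vpc_config {
    subnet_ids         = var.lambda_subnets
    security_group_ids = var.lambda_security_group_ids
  }
}
```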
### `make destroy`

To destroy all of the Batch-associated infrastructure in Hydra created with Terraform, there is a 3-step process involved:

1. Destroy all of the job queues managed by this Terraform project.

   ```bash
   terraform destroy -target=module.batch.aws_batch_job_queue.batch_job_queue
   ```

2. Destroy all of the compute environments managed by this Terraform project.

   ```bash
   terraform destroy -target=module.batch.aws_batch_compute_environment.batch_compute_environment
   ```

3. Destroy all remaining infrastructure managed by this Terraform project.

   ```bash
   terraform destroy
   ```

All of these steps can be run at once using `make destroy`.
### `make build_lambda`

To build the lambda function's deployment package, there is a 2-step process involved:

1. Install the PyPI libraries `pymysql` and `sqlalchemy` locally into the path `~/hydra/iac/aws/batch/modules/lambda/function`.

   ```bash
   pip3 install pymysql sqlalchemy -t ./modules/lambda/function
   ```

2. Change directory into `~/hydra/iac/aws/batch/modules/lambda/function`, create a ZIP file called `batch_lambda.zip` compressing all existing files in this path, and change directory back into `~/hydra/iac/aws/batch`.

   ```bash
   cd ./modules/lambda/function; zip -r batch_lambda.zip .; cd ../../..
   ```

Both steps can be run at once using `make build_lambda`.
## MLflow

MLflow is an open source platform that manages the machine learning lifecycle. In this infrastructure setup, the MLflow tracking server runs in a Docker container deployed on ECS Fargate, with the Docker images stored in ECR. MLflow logs are stored in an RDS MySQL instance whose credentials are stored in Secrets Manager, and MLflow models are stored in an S3 bucket. Autoscaling is set up so that containers are automatically deployed and destroyed based on demand. To set up this infrastructure, complete the following steps.

NOTE: In this infrastructure build, when tracking jobs under a new experiment name, be sure to create the experiment in the MLflow UI before tracking any jobs under it.
Prerequisites:

- Read and write access to all of the above-mentioned services on your AWS account
1. Configure your AWS CLI.

2. Change directory into `~/hydra/iac/aws/mlflow`.

3. In `main.tf`, set the appropriate `bucket`, `region`, and `dynamodb_table` values to the remote backend that was created earlier, and set the `key` value to the path where the state and locks will be stored.

4. Create a variable definitions file (`.tfvars`) with suitable values, or enter the variable values manually on the command line.

5. Initialize the Terraform project.

   ```bash
   terraform init
   ```

6. In `docker_push.sh`, set the region and repository variables to the desired values for your Docker image repository on ECR.

   NOTE: The next three steps can be completed together by running `make start`.

7. Create the Docker image registry on ECR.

   ```bash
   terraform apply -target=module.container_repository.aws_ecr_repository.mlflow_container_repository
   ```

8. Build the Docker image from the local Dockerfile and push it to ECR.

   ```bash
   bash docker_push.sh
   ```

9. Review and authorize the changes to the remaining infrastructure.

   ```bash
   terraform apply
   ```
The `container_repository` module is responsible for creating a Docker image registry using ECR.
Input Variables
Name | Description | Type |
---|---|---|
mlflow_container_repository | Name of the Docker container registry to be created | string |
scan_on_push | Whether to scan Docker images for vulnerabilities on push | bool |
Output Values
Name | Description |
---|---|
container_repository_url | URL of the created Docker container registry |
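This amounts to a single ECR resource; a minimal sketch under the module's variable names (the resource name matches the `-target` address used in the setup steps):

```hcl
# Create the Docker image registry in ECR.
resource "aws_ecr_repository" "mlflow_container_repository" {
  name = var.mlflow_container_repository

  image_scanning_configuration {
    scan_on_push = var.scan_on_push
  }
}
```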
The `permissions` module is responsible for creating an IAM role to complete the necessary ECS tasks and a security group to control inbound and outbound traffic.
Input Variables
Name | Description | Type |
---|---|---|
mlflow_ecs_tasks_role | Name of the IAM role to be created | string |
ecs_task_iam_policy_arn | IAM policies to attach to the created IAM role | list(string) |
mlflow_sg | Name of the security group to be created | string |
vpc_id | The ID of the VPC to be used for the security group | string |
cidr_blocks | List of CIDR blocks to allow ingress access | list(string) |
Output Values
Name | Description |
---|---|
mlflow_sg_id | ID of the created security group |
mlflow_ecs_tasks_role_arn | ARN of the created IAM role that will execute ECS tasks |
The `networking` module is responsible for creating a database subnet group to be associated with an RDS instance.
Input Variables
Name | Description | Type |
---|---|---|
rds_subnet_group_name | Name of the database subnet group to be created | string |
rds_subnets | The IDs of the subnets to be attached to the RDS subnet group | list(string) |
Output Values
Name | Description |
---|---|
db_subnet_group | Name of the created database subnet group |
The `secrets` module is responsible for randomly generating a username and password and then storing them in Secrets Manager.
Input Variables
Name | Description | Type |
---|---|---|
username_length | Number of characters in randomly generated username | number |
password_length | Number of characters in randomly generated password | number |
username_recovery_window | Number of days before username secret can be deleted | number |
password_recovery_window | Number of days before password secret can be deleted | number |
username_secret_name | Name of the username secret | string |
password_secret_name | Name of the password secret | string |
Output Values
Name | Description |
---|---|
username | Randomly generated username |
username_arn | ARN of username secret |
password | Randomly generated password |
password_arn | ARN of password secret |
The `load_balancing` module is responsible for creating an application load balancer and its target group.
Input Variables
Name | Description | Type |
---|---|---|
vpc_id | The ID of the VPC to be used for the load balancer | string |
lb_name | Name of the application load balancer to be created | string |
lb_security_groups | List of the security group IDs to be attached to the load balancer | list(string) |
lb_subnets | List of the subnets to be attached to the load balancer (must be at least two) | list(string) |
lb_target_group | Name of the load balancer target group to be created | string |
Output Values
Name | Description |
---|---|
lb_target_group_arn | ARN of the created Load Balancer target group |
The `storage` module is responsible for creating an RDS MySQL instance and an S3 bucket to serve as the MLflow backend store and artifact store, respectively.
Input Variables
Name | Description | Type |
---|---|---|
mlflow_artifact_store | The name of the S3 bucket to be created | string |
mlflow_backend_store_identifier | The identifier of the RDS database to be created | string |
allocated_storage | The allocated storage of the RDS database to be created (in GiB) | string |
storage_type | The storage type of the RDS database to be created | string |
db_engine_version | The engine version of the RDS MySQL database | string |
db_instance_class | The instance class of the RDS database to be created | string |
db_default_name | The name of the default database that is created in RDS | string |
skip_final_snapshot | Whether a final snapshot is created immediately before the database is deleted | bool |
db_username | The admin username of the database | string |
db_password | The admin password of the database | string |
db_subnet_group_name | The name of the database subnet group | string |
vpc_security_groups | The security groups associated with the database instance | list(string) |
Output Values
Name | Description |
---|---|
db_host | The hostname of the created RDS MySQL instance |
db_name | The name of the default database of the RDS MySQL instance |
s3_bucket | The name of the created S3 bucket |
The `task_deployment` module is responsible for creating an ECS cluster, a Fargate task definition that runs the MLflow tracking service, and an ECS service that runs deployed instances of this task.
Input Variables
Name | Description | Type |
---|---|---|
aws_region | AWS region | string |
mlflow_server_cluster | Name of MLflow server cluster to be created | string |
ecs_service_name | Name of the ECS Fargate service to be created | string |
cloudwatch_log_group | Name of the CloudWatch log group to be created and associated with the service | string |
mlflow_ecs_task_family | Name of the ECS task family to be created | string |
container_name | Name of the container to be run using a Fargate task | string |
s3_bucket_name | Name of the S3 bucket that will be the artifact store | string |
s3_bucket_folder | Name of the folder in the S3 bucket that will be used to store models | string |
db_name | Name of the database that will be used as the backend store | string |
db_host | Hostname of the database that will be used as the backend store | string |
db_port | Port of the database connection | string |
docker_image | URL of the docker image that will be used to run the task in Fargate | string |
task_memory | Total memory to be used by a single Fargate task (in MiB) | number |
task_cpu | Number of CPU units to be used by a single Fargate task | number |
admin_username_arn | ARN of the RDS admin username secret | string |
admin_password_arn | ARN of the RDS admin password secret | string |
task_role_arn | ARN of the IAM task role | string |
execution_role_arn | ARN of the IAM execution role | string |
aws_lb_target_group_arn | ARN of the load balancer target group | string |
ecs_service_subnets | Subnets to be used in the network configuration of the created ECS service | list(string) |
ecs_service_security_groups | Security groups to be attached to the created ECS service | list(string) |
Output Values
Name | Description |
---|---|
ecs_service_name | Name of the created ECS service |
ecs_cluster_name | Name of the created ECS cluster |
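A rough sketch of the Fargate task definition at the heart of this module; the container definition below is abbreviated, and the resource name and secret names are illustrative assumptions:

```hcl
resource "aws_ecs_task_definition" "mlflow" {
  family                   = var.mlflow_ecs_task_family
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  task_role_arn            = var.task_role_arn
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([{
    name  = var.container_name
    image = var.docker_image
    # Inject the RDS credentials from Secrets Manager at container start.
    secrets = [
      { name = "DB_USERNAME", valueFrom = var.admin_username_arn },
      { name = "DB_PASSWORD", valueFrom = var.admin_password_arn },
    ]
  }])
}
```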
The `autoscaling` module is responsible for creating and attaching autoscaling policies based on CPU and memory utilization percentage.

Input Variables

Name | Description | Type |
---|---|---|
server_cluster_name | Name of ECS cluster to create autoscaling policy in | string |
ecs_service_name | Name of ECS service to create autoscaling policy in | string |
min_tasks | Minimum number of running tasks in ECS service | number |
max_tasks | Maximum number of running tasks in ECS service | number |
memory_autoscaling_policy_name | Name of memory autoscaling policy to be created | string |
cpu_autoscaling_policy_name | Name of CPU autoscaling policy to be created | string |
memory_autoscale_in_cooldown | Cooldown time for scale in based on memory metric (in seconds) | number |
memory_autoscale_out_cooldown | Cooldown time for scale out based on memory metric (in seconds) | number |
memory_autoscale_target | Target value of memory utilization percentage in each task | number |
cpu_autoscale_in_cooldown | Cooldown time for scale in based on CPU metric (in seconds) | number |
cpu_autoscale_out_cooldown | Cooldown time for scale out based on CPU metric (in seconds) | number |
cpu_autoscale_target | Target value of CPU utilization percentage in each task | number |
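Target-tracking autoscaling of this kind is typically a pair of Application Auto Scaling resources; a sketch of the CPU policy under the module's variable names (the memory policy is analogous, and the resource names are illustrative):

```hcl
# Register the ECS service's desired count as a scalable target.
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.server_cluster_name}/${var.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = var.min_tasks
  max_capacity       = var.max_tasks
}

# Scale the task count to hold average CPU utilization near the target.
resource "aws_appautoscaling_policy" "cpu" {
  name               = var.cpu_autoscaling_policy_name
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = var.cpu_autoscale_target
    scale_in_cooldown  = var.cpu_autoscale_in_cooldown
    scale_out_cooldown = var.cpu_autoscale_out_cooldown
  }
}
```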
### `make start`

To create an MLflow server running on ECS Fargate from scratch, there is a 3-step process involved:

1. Create the Docker image registry on ECR.

   ```bash
   terraform apply -target=module.container_repository.aws_ecr_repository.mlflow_container_repository
   ```

2. Build the Docker image from the local Dockerfile and push it to ECR.

   ```bash
   bash docker_push.sh
   ```

3. Review and authorize the changes to the remaining infrastructure.

   ```bash
   terraform apply
   ```

All of these steps can be run at once using `make start`.
### `make update_container`

To update the base image that is used to run the MLflow server in ECS, there is a 3-step process involved:

1. Destroy the existing ECS service and its dependent resources.

   ```bash
   terraform destroy -target=module.task_deployment.aws_ecs_service.service
   ```

2. Build the Docker image from the local Dockerfile and push it to ECR.

   ```bash
   bash docker_push.sh
   ```

3. Re-apply and authorize the changes to recreate the service in ECS.

   ```bash
   terraform apply
   ```

All of these steps can be run at once using `make update_container`.