Requires:

Install gcloud and login:

```bash
gcloud auth application-default login
```
Projects you will need access to:

- `bacalhau-cicd`
  - this is where the GCS bucket with the terraform state lives
- `bacalhau-development`
  - commits to main are deployed here
  - scale tests are run here
  - short-lived clusters
- `bacalhau-staging`
  - long-lived cluster
- `bacalhau-production`
  - long-lived cluster
The `ops/terraform` directory contains the terraform configuration; all of the logic lives in `main.tf`.
```bash
cd ops/terraform
gcloud auth application-default login
terraform init
terraform workspace list
```
Terraform state is managed using workspaces: a GCS bucket called `bacalhau-global-storage`, which lives in the `bacalhau-cicd` project, keeps the tfstate for each workspace. Combined with a `<workspace-name>.tfvars` variables file that controls which google project we deploy to, this lets us manage multiple bacalhau clusters in the same gcloud project.
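For reference, the remote state wiring looks roughly like the sketch below; the `prefix` value is an assumption (only the bucket and the `bacalhau-cicd` project are named above), so check `main.tf` for the real backend block.

```hcl
# Sketch only - the prefix is a placeholder, see main.tf for the real configuration.
terraform {
  backend "gcs" {
    bucket = "bacalhau-global-storage" # lives in the bacalhau-cicd project
    prefix = "terraform/state"         # each workspace gets its own state file under this prefix
  }
}
```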
A bacalhau workspace is a combination of:

- `<workspace-name>.tfvars` - the variables controlling the versions, cluster size, gcloud project and gcloud region (a sketch follows this list)
- `<workspace-name>-secrets.tfvars` - the sensitive API keys that are required (not checked in to source control)
- a terraform workspace named `<workspace-name>` which points at the state file managed in the GCS bucket
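A rough sketch of what a `<workspace-name>.tfvars` file contains; every value below is an illustrative placeholder, not a real setting:

```hcl
# example.tfvars - illustrative placeholders only
bacalhau_version = "v0.3.0"               # the release to deploy (placeholder)
gcp_project      = "bacalhau-development" # which gcloud project to deploy into
region           = "us-central1"          # placeholder region
zone             = "us-central1-a"        # placeholder zone
instance_count   = 3                      # cluster size (placeholder)
```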
Running `bash scripts/connect_workspace.sh <workspace-name>` will connect to the correct gcloud project and zone named in `<workspace-name>.tfvars` and run `terraform workspace select <workspace-name>` so you can begin to work with that cluster.
IMPORTANT: always run `bash scripts/connect_workspace.sh` before running `terraform` commands for a given project.
```bash
bash scripts/connect_workspace.sh production
terraform plan -var-file production.tfvars -var-file production-secrets.tfvars
```
The normal operation is to edit `production.tfvars` and make sure the `bacalhau_version` variable points to the version you'd like to deploy. Optionally, ensure you have appropriate values in `production-secrets.tfvars` (see `secrets.tfvars.example` for a guide) and then:
```bash
# make sure gcloud is connected to the correct project and compute zone for our workspace
bash scripts/connect_workspace.sh production

# apply the latest variables
terraform plan -var-file production.tfvars -var-file production-secrets.tfvars
terraform apply -var-file production.tfvars -var-file production-secrets.tfvars
```
⚠️ Due to some limitations in how GCP provisions GPUs (ask @simonwo for more details 😄), the disk of one of the GPU machines has to be restored from a hand-picked snapshot. This is a temporary solution.
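For context, restoring a disk from a snapshot in terraform looks roughly like the sketch below; the resource and snapshot names are placeholders, the real definition lives in `main.tf`.

```hcl
# Hypothetical sketch - names are placeholders, see main.tf for the real resource.
resource "google_compute_disk" "gpu_node_disk" {
  name = "bacalhau-gpu-disk-${terraform.workspace}"
  zone = var.zone
  # Seed the disk contents from the hand-picked snapshot instead of a blank image.
  snapshot = "bacalhau-gpu-snapshot" # placeholder snapshot name
}
```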
To start a new long-lived cluster, we need to first stand up the first node, get its libp2p id, and then re-apply the cluster:
```bash
# make sure you are logged into the google user that has access to our gcloud projects
gcloud auth application-default login

# the name of the cluster (and workspace)
export WORKSPACE=apples
cp staging.tfvars $WORKSPACE.tfvars

# edit variables
# * gcp_project = bacalhau-development
# * region = XXX
# * zone = XXX
vi $WORKSPACE.tfvars

# create a new workspace state file for this cluster
terraform workspace new $WORKSPACE

# make sure gcloud is connected to the correct project and compute zone for our workspace
bash scripts/connect_workspace.sh $WORKSPACE

# get the first node up and running
terraform apply \
  -var-file $WORKSPACE.tfvars \
  -var-file $WORKSPACE-secrets.tfvars \
  -var="bacalhau_connect_node0=" \
  -var="instance_count=1"

# wait a bit of time so the bacalhau server is up and running
sleep 10
gcloud compute ssh bacalhau-vm-$WORKSPACE-0 -- sudo systemctl status bacalhau

# now we need to get the libp2p id of the first node
gcloud compute ssh bacalhau-vm-$WORKSPACE-0 -- journalctl -u bacalhau | grep "peer id is" | awk -F': ' '{print $2}'

# copy this id and paste it into the variables file
# edit variables
# * bacalhau_connect_node0 = <id copied from SSH command above>
vi $WORKSPACE.tfvars

# now we re-apply the terraform command
terraform apply \
  -var-file $WORKSPACE.tfvars \
  -var-file $WORKSPACE-secrets.tfvars
```
There is `prevent_destroy = true` on long-lived clusters. This is controlled by the `protect_resources = true` variable.
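For reference, `prevent_destroy` is a terraform lifecycle setting that has to be a literal value (it cannot be driven directly by a variable), which is why removing protected resources means editing `main.tf` by hand. On a protected resource it looks roughly like this sketch (the resource shape here is an assumption):

```hcl
# Sketch only - the real protected resources live in main.tf.
resource "google_compute_address" "ipv4" {
  count = var.instance_count # assumption: one address per node
  name  = "bacalhau-ipv4-address-${count.index}"

  lifecycle {
    # prevent_destroy must be a literal true/false - this is why deleting a
    # long-lived cluster requires editing main.tf rather than flipping a variable.
    prevent_destroy = true
  }
}
```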
The only way to delete a long-lived cluster (because you've thought hard about it and have decided it is actually what you want to do) is to edit the `main.tf` file and set `prevent_destroy = false` on the IP address and the disk before doing:
```bash
terraform destroy -var-file $WORKSPACE.tfvars
```
IMPORTANT: remember to reset `prevent_destroy = true` in `main.tf` (please don't commit it with `prevent_destroy = false`).
Once you have deleted a cluster - don't forget to:

```bash
terraform workspace delete $WORKSPACE
rm -f $WORKSPACE.tfvars
```
This is for scale tests or short-lived tests on a live network.

We set `bacalhau_unsafe_cluster=true` so nodes automatically connect to each other (it uploads a fixed, unsafe private key from this repo so we know the libp2p id of node0).

We set `protect_resources=false` so we can easily delete the cluster when we are done.
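In the workspace tfvars this amounts to something like the snippet below (illustrative only; `shortlived_example.tfvars` is the real starting point):

```hcl
# Illustrative snippet - copy shortlived_example.tfvars rather than writing this by hand.
bacalhau_unsafe_cluster = true  # nodes auto-connect using the fixed, unsafe key from this repo
protect_resources       = false # no prevent_destroy, so terraform destroy just works
```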
```bash
export WORKSPACE=oranges
cp shortlived_example.tfvars $WORKSPACE.tfvars

# edit variables
# * gcp_project = bacalhau-development
# * region = XXX
# * zone = XXX
vi $WORKSPACE.tfvars

# create a new workspace state file for this cluster
terraform workspace new $WORKSPACE

# make sure gcloud is connected to the correct project and compute zone for our workspace
bash scripts/connect_workspace.sh $WORKSPACE

# get the first node up and running
terraform apply \
  -var-file $WORKSPACE.tfvars

sleep 10
gcloud compute ssh bacalhau-vm-$WORKSPACE-0 -- sudo systemctl status bacalhau
```
When you are done with the short-lived cluster, destroy it and clean up the workspace:

```bash
export WORKSPACE=oranges
bash scripts/connect_workspace.sh $WORKSPACE
terraform destroy \
  -var-file $WORKSPACE.tfvars
terraform workspace select development
terraform workspace delete $WORKSPACE
rm $WORKSPACE.tfvars
```
To see the logs from a node's startup script:
```bash
export WORKSPACE=apples
bash scripts/connect_workspace.sh $WORKSPACE
gcloud compute ssh bacalhau-vm-$WORKSPACE-0 -- sudo journalctl -u google-startup-scripts.service
```
In some resources, the `name` property is calculated like this:

```hcl
name = terraform.workspace == "production" ?
  "bacalhau-ipv4-address-${count.index}" :
  "bacalhau-ipv4-address-${terraform.workspace}-${count.index}"
```
This is a backwards-compatible mode that preserves the production disks and IP addresses by avoiding renaming them.
The disks and IP addresses are in one of two modes:

- protected
- unprotected

To control which type is used, you set the `protect_resources` variable to true when creating a cluster.
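One plausible shape for how `main.tf` switches between the two modes (a sketch of the pattern with placeholder names, not the literal code) is to define each resource twice and gate the copies on the variable:

```hcl
# Sketch of the pattern only - see main.tf for the real definitions.
resource "google_compute_disk" "node_protected" {
  count = var.protect_resources ? var.instance_count : 0
  name  = "bacalhau-disk-${terraform.workspace}-${count.index}"
  zone  = var.zone

  lifecycle {
    prevent_destroy = true # hard-coded, as in the sketch earlier
  }
}

resource "google_compute_disk" "node_unprotected" {
  count = var.protect_resources ? 0 : var.instance_count
  name  = "bacalhau-disk-${terraform.workspace}-${count.index}"
  zone  = var.zone
}
```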
With long-lived clusters, we use the `auto_subnets = true` setting, which means a bunch of subnetworks will be auto-created for the deployment network.

For short-lived clusters, we set this to false and create a single manual subnetwork. This is so we don't use up all of our network quota making subnets that we don't actually use.
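Roughly, this corresponds to something like the sketch below, under the assumption that the network's auto-create mode is driven by `auto_subnets`; the names and CIDR range are placeholders and `main.tf` is the source of truth.

```hcl
# Sketch only - see main.tf for the real network wiring.
resource "google_compute_network" "vpc" {
  name                    = "bacalhau-network-${terraform.workspace}"
  auto_create_subnetworks = var.auto_subnets # true: GCP creates a subnetwork per region
}

# Single hand-made subnetwork, only created when auto_subnets is disabled.
resource "google_compute_subnetwork" "manual" {
  count         = var.auto_subnets ? 0 : 1
  name          = "bacalhau-subnetwork-${terraform.workspace}"
  region        = var.region
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.0.0/16" # placeholder range
}
```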
Sometimes it's useful to upload content directly to nodes in a terraform-managed cluster. There is a script to help do that:

```bash
bash scripts/upload_cid.sh production ~/path/to/local/content
```
To inspect the aggregated logs in Grafana Cloud, access this dashboard (requires credentials!).

Alternatively, you can ssh into the hosts in the `bacalhau-production` project and inspect the logs with `journalctl -u bacalhau -f`.