ci: temp machines for scheduled killing experiment (digital-asset#4386)
* ci: temp machines for scheduled killing experiment

Based on our discussions last week, I am exploring ways to move us to
permanent machines instead of preemptible ones. This should drastically
reduce the number of "cancelled" jobs.

The end goal is to have:

1. An instance group (per OS) that defines the actual CI nodes; this
would be pretty much the same as the existing ones, but with
`preemptible` set to false.
2. A separate machine that, on a cron (say at 4AM UTC), destroys all the
CI nodes.

The hope is that the group managers, which are set to maintain 10 nodes,
will then recreate the "missing" nodes using their normal starting
procedure.
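
The schedule in item 2 could be sketched as an `/etc/crontab` entry, as below. This is only an illustration under stated assumptions: the helper names `cron_entry` and `install_cron_entry` are hypothetical, the machine's clock is assumed to be on UTC, and the script path is a placeholder. The second helper appends the entry only when it is missing, which matters if the setup step runs more than once.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and paths are illustrative, not taken from the PR.

# Print an /etc/crontab line that runs SCRIPT as root every day at HOUR:00.
# Assumes the machine's clock is set to UTC.
cron_entry() {
  local hour="$1" script="$2"
  printf '0 %s * * * root %s\n' "$hour" "$script"
}

# Append ENTRY to FILE only if that exact line is not already present,
# so re-running the setup does not create duplicate entries.
install_cron_entry() {
  local file="$1" entry="$2"
  grep -qxF -- "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
}

# Example (not executed here):
#   install_cron_entry /etc/crontab "$(cron_entry 4 /root/periodic-kill.sh)"
```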

However, there are a lot of unknowns I would like to explore, and I need
a playground for that. This is where this PR comes in. As it stands, it
creates one "killer" machine and a temporary group manager. I will use
these to experiment with the GCP API in various ways without interfering
with the real CI nodes.

This experimentation will likely require multiple `terraform apply` runs
with different versions of the associated files, as well as
connecting to the machines and running various commands directly from
them. I will ensure all of that only affects the new machines created as
part of this PR, and therefore believe we do not need to go through a
separate round of approval for each change.

Once I have finished experimenting, I will create a new PR to clean up
the temporary resources created with this one and hopefully set up a
more permanent solution.

CHANGELOG_BEGIN
CHANGELOG_END

* add missing zone for killer instance

* add compute scope to killer

* authorize Terraform to shut down killer to update it

* change in plans: use a service account instead

* .

* add compute.instances.list permission

* add compute.instances.delete permission

* add cron script

* obligatory round of extra escaping

* fix PATH issue & crontab format

* smaller machine & less frequent reboots
garyverhaegen-da authored Feb 7, 2020
1 parent 98ab189 commit 1681922
Showing 2 changed files with 133 additions and 0 deletions.
56 changes: 56 additions & 0 deletions infra/TEMP_KILLABLE_vsts_agent_linux.tf
@@ -0,0 +1,56 @@
# Copyright (c) 2020 The DAML Authors. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

resource "google_compute_region_instance_group_manager" "temp-killable" {
  provider           = "google-beta"
  name               = "temp-killable"
  base_instance_name = "temp-killable"
  region             = "${local.region}"
  target_size        = 3

  version {
    name              = "temp-killable"
    instance_template = "${google_compute_instance_template.temp-killable.self_link}"
  }

  update_policy {
    type            = "PROACTIVE"
    minimal_action  = "REPLACE"
    max_surge_fixed = 3
    min_ready_sec   = 60
  }
}

resource "google_compute_instance_template" "temp-killable" {
  name_prefix  = "killable-"
  machine_type = "n1-standard-1"
  labels       = "${local.labels}"

  disk {
    disk_size_gb = 20
    disk_type    = "pd-ssd"
    source_image = "ubuntu-os-cloud/ubuntu-1604-lts"
  }

  lifecycle {
    create_before_destroy = true
  }

  network_interface {
    network = "default"

    // Ephemeral IP to get access to the Internet
    access_config {}
  }

  service_account {
    email  = "log-writer@da-dev-gcp-daml-language.iam.gserviceaccount.com"
    scopes = ["cloud-platform"]
  }

  scheduling {
    automatic_restart   = false
    on_host_maintenance = "TERMINATE"
    preemptible         = true
  }
}
77 changes: 77 additions & 0 deletions infra/periodic_killer.tf
@@ -0,0 +1,77 @@
# Copyright (c) 2020 The DAML Authors. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# This file defines a machine meant to destroy/recreate all our CI nodes every
# night.

resource "google_service_account" "periodic-killer" {
  account_id = "periodic-killer"
}

resource "google_project_iam_member" "periodic-killer" {
  # should reference google_project_iam_custom_role.periodic-killer.id or
  # something, but for whatever reason that's not exposed.
  role   = "projects/da-dev-gcp-daml-language/roles/killCiNodesEveryNight"
  member = "serviceAccount:${google_service_account.periodic-killer.email}"
}

resource "google_project_iam_custom_role" "periodic-killer" {
  role_id = "killCiNodesEveryNight"
  title   = "Permissions to list & kill CI nodes every night"
  permissions = [
    "compute.instances.delete",
    "compute.instances.list",
    "compute.zones.list",
  ]
}

resource "google_compute_instance" "periodic-killer" {
  name         = "periodic-killer"
  machine_type = "f1-micro"
  zone         = "us-east4-a"

  boot_disk {
    initialize_params {
      image = "ubuntu-1804-lts"
    }
  }

  network_interface {
    network = "default"

    // Ephemeral IP to get access to the Internet
    access_config {}
  }

  service_account {
    email  = "${google_service_account.periodic-killer.email}"
    scopes = ["cloud-platform"]
  }
  allow_stopping_for_update = true

  metadata_startup_script = <<STARTUP
set -euxo pipefail
apt-get update
apt-get install -y curl jq
cat <<CRON > /root/periodic-kill.sh
#!/usr/bin/env bash
set -euo pipefail
PREFIX=temp-killable
MACHINES=\$(/snap/bin/gcloud compute instances list --format=json | jq -c '.[] | select(.name | startswith("'\$PREFIX'")) | [.name, .zone]')
for m in \$MACHINES; do
  /snap/bin/gcloud -q compute instances delete \$(echo \$m | jq -r '.[0]') --zone=\$(echo \$m | jq -r '.[1]')
done
CRON
chmod +x /root/periodic-kill.sh
cat <<CRONTAB >> /etc/crontab
*/30 * * * * root /root/periodic-kill.sh
CRONTAB
STARTUP
}
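
When experimenting with a script like the one embedded above, a dry run that prints the delete commands instead of executing them may be useful. The sketch below is not part of this commit: the function names `list_killable` and `print_kill_plan` are hypothetical, and it uses gcloud's own `--filter` and `--format` flags in place of `jq`, assuming `gcloud` is on the `PATH` and authenticated.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the delete commands rather than running them.
# Function names are illustrative and not taken from the commit.

# Emit one "name<TAB>zone" line per instance whose name starts with PREFIX.
list_killable() {
  local prefix="$1"
  gcloud compute instances list \
    --filter="name ~ ^${prefix}" \
    --format="value(name,zone.basename())"
}

# Print one gcloud delete command per matching instance.
print_kill_plan() {
  local prefix="$1" name zone
  list_killable "$prefix" | while read -r name zone; do
    [ -n "$name" ] || continue
    printf 'gcloud -q compute instances delete %s --zone=%s\n' "$name" "$zone"
  done
}

# Example (not executed here):
#   print_kill_plan temp-killable
```

Reviewing the printed plan before piping it to a shell gives a chance to confirm that only machines with the intended prefix are selected.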
