es cluster tweaks (#10853)
On Sept 8 our ES cluster became unresponsive. I tried connecting to the
machines.

One machine had an ES Docker container that claimed to have started 7
weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5
weeks. I assume GCP had decided to restart it for some reason. The init
script had failed on missing a TTY, hence the addition of the
`DEBIAN_FRONTEND` env var.

Two machines each had a Docker container that had stopped that same day,
6h and 2h respectively before I started investigating. It wasn't
immediately clear what had caused the containers to stop.

On all three of these machines, I was able to manually restart the
containers and they were able to re-form a cluster, though the state of
the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts.
Assuming it might help, I decided to try to restart the machines. As GCP
does not allow restarting individual machines when they're part of a
managed instance group, I tried clicking the "rolling restart" button
on the GCP console, which seemed like it would restart the machines. I
carefully selected "restart" (and not "replace"), started the process,
and watched GCP proceed to immediately replace all five machines, losing
all data in the process.

I then started a new cluster and used bigger (and more) machines to
reingest all of the data, and then fell back to the existing
configuration for the "steady" state. I'll try to keep a better eye on
the state of the cluster from now on. In particular, we should not have
a node down for 5 weeks without noticing.
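
A minimal sketch of the kind of check that could catch a missing node, assuming the output of Elasticsearch's `GET /_cluster/health` endpoint is piped in. The function name `check_health`, the expected node count, and the host in the usage comment are illustrative assumptions, not part of this change.

```shell
#!/usr/bin/env bash
# Sketch only: check_health and the expected node count are hypothetical
# names/values chosen for illustration; they are not part of this repo.

# Reads the JSON body of GET /_cluster/health on stdin.
# Prints a one-line verdict; returns 0 only when the cluster is green
# and reports the expected number of nodes, 1 otherwise.
check_health() {
  local expected_nodes="$1"
  local body status nodes
  body="$(cat)"
  # Pull out the "status" and "number_of_nodes" fields without needing jq.
  status="$(printf '%s' "$body" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p')"
  nodes="$(printf '%s' "$body" | sed -n 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p')"
  if [ "$status" = "green" ] && [ "$nodes" = "$expected_nodes" ]; then
    echo "OK status=$status nodes=$nodes"
    return 0
  fi
  echo "ALERT status=${status:-unknown} nodes=${nodes:-unknown} expected=$expected_nodes"
  return 1
}

# Example invocation (hypothetical host and node count):
#   curl -fsS "http://localhost:9200/_cluster/health" | check_health 5
```

Run from cron or a monitoring agent, a non-zero exit could be wired to whatever alerting is already in place; the parsing is deliberately dependency-free and would need hardening for real use.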

I'll also try to find some time to look into backing up the Kibana
configuration, as that's the one thing we can't just reingest at the
moment.

CHANGELOG_BEGIN
CHANGELOG_END
garyverhaegen-da authored Sep 13, 2021
1 parent 842c5b1 commit 8c9edd8
Showing 1 changed file with 7 additions and 2 deletions.
9 changes: 7 additions & 2 deletions infra/es_cluster.tf
@@ -22,8 +22,8 @@ locals {
ubuntu_version = "2004",
size = 0,
init = "[]",
- type = "e2-standard-2",
- xmx = "6g",
+ type = "n2-highcpu-16",
+ xmx = "12g",
},
{
suffix = "-init",
@@ -136,6 +136,9 @@ resource "google_compute_instance_template" "es" {
metadata_startup_script = <<STARTUP
#! /bin/bash
set -euo pipefail
+ export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get -y upgrade
### stackdriver
@@ -184,13 +187,15 @@ EOF
docker build -t es .
docker run -d \
+ --restart on-failure \
--name es \
-p 9200:9200 \
-p 9300:9300 \
-e ES_JAVA_OPTS="-Xmx${local.es_clusters[count.index].xmx} -Xms${local.es_clusters[count.index].xmx}" \
es
docker run -d \
+ --restart on-failure \
--name kibana \
-p 5601:5601 \
--link es:elasticsearch \