es cluster tweaks #10853

garyverhaegen-da · 2021-09-11T23:33:21Z

On Sept 8 our ES cluster became unresponsive. I tried connecting to the
machines.

One machine had an ES Docker container that claimed to have started 7
weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5
weeks. I assume GCP had decided to restart it for some reason. The init
script had failed on missing a TTY, hence the addition of the
DEBIAN_FRONTEND env var.

Two machines had a Docker container that had stopped on that day, 6h
and 2h respectively before I started investigating. It wasn't
immediately clear what had caused the containers to stop.

On all three of these machines, I was able to manually restart the
containers and they were able to reform a cluster, though the state of
the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts.
Assuming it might help, I decided to try to restart the machines. As GCP
does not allow restarting individual machines when they're part of a
managed instance group, I tried clicking the "rolling restart" button
on the GCP console, which seemed like it would restart the machines. I
carefully selected "restart" (and not "replace"), started the process,
and watched GCP proceed to immediately replace all five machines, losing
all data in the process.

I then started a new cluster and used bigger (and more) machines to
reingest all of the data, and then fell back to the existing
configuration for the "steady" state. I'll try to keep a better eye on
the state of the cluster from now on. In particular, we should not have
a node down for 5 weeks without noticing.
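
As an illustration of what "keeping a better eye" could look like, here is a minimal sketch that polls the Elasticsearch `_cluster/health` endpoint and reports a missing node or a non-green status. It is not part of this PR. The function names, the base URL and the expected node count are assumptions made for the example; only the endpoint and the `status`, `number_of_nodes` and `unassigned_shards` response fields come from Elasticsearch itself.

```python
import json
import urllib.request


def health_problems(health, expected_nodes):
    """Return a list of problems found in a _cluster/health response.

    `health` is the decoded JSON body; `expected_nodes` is how many nodes
    we believe the cluster should have (an assumption of this sketch).
    """
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes are in the cluster")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} shards are unassigned")
    return problems


def fetch_health(base_url, timeout=10):
    """Fetch and decode the cluster health document from one ES node."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)
```

Run periodically, something like `health_problems(fetch_health("http://localhost:9200"), expected_nodes=5)` would return a non-empty list as soon as a node drops out, rather than five weeks later. The address and node count here are placeholders.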

I'll also try to find some time to look into backing up the Kibana
configuration, as that's the one thing we can't just reingest at the
moment.

CHANGELOG_BEGIN
CHANGELOG_END


garyverhaegen-da · 2021-09-11T23:33:47Z

Current "production" state of the cluster matches this PR (as of the time of opening).

cocreature

Thanks!

stefanobaghino-da

Thanks for the detailed description and the fix! 🙇🏻


As explained in #10853, we recently lost our ES cluster. While I'm not planning on trusting Google's "rolling restart" feature ever again, we can't exclude the possibility of future similar outages (without a significant investment in the cluster, which I don't think we want to do). Losing the cluster is not a huge issue as we can always reingest the data. Worst case we lose visibility for a few days. At least, as far as the bazel logs are concerned. Losing the Kibana data is a lot more annoying, as that is not derived data and thus cannot be reingested. This PR aims to add a backup mechanism for our Kibana configuration. CHANGELOG_BEGIN CHANGELOG_END
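
The backup mechanism itself is not shown in this conversation. One possible approach, sketched below under that caveat, is to call Kibana's saved objects export API (`POST /api/saved_objects/_export`, which requires a `kbn-xsrf` header and returns NDJSON) and write the response to a file. The function names, the Kibana address, the list of object types and the output path are all assumptions made for illustration, not what #10861 actually does.

```python
import json
import urllib.request


def build_export_request(kibana_url, types):
    """Build the POST request for Kibana's saved objects export API."""
    body = json.dumps({"type": list(types), "includeReferencesDeep": True})
    return urllib.request.Request(
        f"{kibana_url}/api/saved_objects/_export",
        data=body.encode("utf-8"),
        method="POST",
        # Kibana rejects API writes that lack the kbn-xsrf header.
        headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
    )


def save_export(kibana_url, types, path, timeout=30):
    """Download the NDJSON export and write it to `path` unchanged."""
    request = build_export_request(kibana_url, types)
    with urllib.request.urlopen(request, timeout=timeout) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

A call such as `save_export("http://localhost:5601", ["dashboard", "visualization", "search", "index-pattern"], "kibana-export.ndjson")` would produce a file that can be re-imported through Kibana's import API or UI. Where that file is then stored is a separate decision.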

@cocreature

This PR has been created by a script, which is not very smart and does not have all the context. Please do double-check that the version prefix is correct before merging.

@cocreature is in charge of this release.

Commit log:
```
38227a8 [Ledger API error codes] ErrorCode enrichments [DPP-591] (#10874)
e7c443a enable json index for all fields that are queried with JSON_EXISTS (#10658)
6c1c02a document complete authorized auth0 setup (#10881)
e4230dc Do not drop generated `submissionId`s in `GrpcCommandService` [KVL-1104] (#10882)
b4750a4 trigger reach auth on internal network (#10844)
7908083 add Auth0 support to create-daml-app (#10673)
b86490c Add @adriaanm-da to the release rotation (#10872)
9e918c3 Update trigger-service docs to use --dar option in the corresponding example (#10877)
49a9556 [docs] Fix minor typo in doc of exerciseByKey in TS. (#10863)
f7c07ea interfaces: scala protobuf encoder (#10878)
be4e064 Ledger API Test Tool: support `--additional` tests [KVL-1100] (#10829)
97e14de [Ledger API error codes] ErrorCode interfaces and generator [DPP-591] (#10836)
6dcdaa4 [DPP-589] Add CLI flag to select minimum enabled TLS version (#10854)
1fc58d9 Navigator customviews highlight and choices button, apply custom theme on the login screen (#10859)
6faddc9 Update Daml Documentation to reflect command deduplication related changes [KVL-1094] (#10852)
7c29eee Cleanup normalize from svalue (#10873)
053f22a Convert SValue to Value, and normalize, in a single code pass. (#10828)
37a1cb2 compatibility-tests - Exclude CommandDeduplicationIT from running for existing 1.17 snapshots (#10866)
dfae9f6 Command deduplication - better support for different deduplication modes in conformance tests [KVL-1099] (#10864)
6f151e2 save kibana exports (#10861)
99f0362 [JSON-API] drop package token doc changes (#10865)
b50bb8e Populate `definite_answer` in `ApiException` [KVL-1004] (#10832)
a471225 LF: Add missing collision check for type synonyms (#10841)
1e1c452 LF: drop ad-hoc FrontStack builders (#10839)
8f5b4fa interfaces: protobuf encoder haskell side (#10850)
63f6678 ParticipantPruningIT divulgence test fixes to avoid flakiness on canton (#10860)
8a9d19a Command deduplication - KV conformance test for usage of max deduplication duration [KVL-1098] (#10846)
24fff88 LF: Simplify TransactionBuilder (#10753)
9a4c9df Implement LF desugaring of interface definitions (#10834)
2aaf601 interfaces: protobuf decoder haskell side (#10849)
6dc769b interfaces: lf typechecker implementation (#10843)
d9178d2 Clarify version usage in test tool exclusion docs (#10858)
c113954 Clarify docs for test tool exclusions (#10855)
8c9edd8 es cluster tweaks (#10853)
842c5b1 Drop early access notice from profiler docs (#10856)
7c47aca Improvements to wording in ledger-api protobuf docs (#10851)
cff0358 ledger-api: Remove unimplemented fields [KVL-1094] (#10822)
dcec6ea kvutils: Populate `definite_answer` in rejections [KVL-1004] (#10801)
1c4f173 Command deduplication - kvutils - Always use max deduplication duration as deduplication period [KVL-1098] (#10824)
567fe43 tweak trigger-service docs (#10845)
fb5ab5d setvar doesn't like new lines in assignment, refactor (#10842)
7225c04 [docs] Replace AdoptOpenJDK suggestion by Adoptium (#10837)
6a9c8a6 release 1.17.0-snapshot.20210910.7786.0.976ca400 (#10838)
6ed2124 LF: clean up useless version tests. (#10833)
85f6f36 Modify the name of the secrets-url CLI flag to tls-secrets-url [DPP-604] (#10840)
d809fd9 [JSON-API] surrogate template id cache (#10806)
```

Changelog:
```
- The Trigger Service can now accept separate `--auth-internal` and `--auth-external` CLI arguments, where `--auth-internal` is the address used by the Trigger Service to reach the Auth Middleware directly, and `--auth-external` is the address the Trigger Service uses in generated URLs sent back to the client. The `--auth` option remains and keeps working as before, setting both internal and external addresses to the same given value.
- The `create-daml-app` template now includes support for a third authentication scheme (in addition to the existing "dev mode" and Daml Hub support): Auth0.
Sandbox: Add CLI flag to select minimum enabled TLS version for ledger API server.
- [Navigator] The currently selected custom view is now highlighted on the sidebar
kvutils - committer side deduplication always uses max_deduplication_duration + min_skew as a deduplication period for all the requests.
Modify the name of the secrets-url CLI flag to tls-secrets-url.
```

CHANGELOG_BEGIN
CHANGELOG_END

@akrmn

Manual release process.

@akrmn is in charge of this release.

Commit log:
```
b5648c0 Make `CommandTracker` distinguish submissions of the same command using `submissionId` [KVL-1104] (#10868)
b4328b3 ledger-api-test-tool - Add conformance test for parallel command deduplication using CommandSubmissionService [KVL-1099] (#10869)
0c32e3b Fix Parallel Indexer initialization issue [DPP-542] (#10889)
b3e4975 Chore slow migration error removal (#10886)
e4cce53 Create a new grpc exception for each duplicate result [KVL-1099] (#10887)
a939594 Sandbox on H2 - performance improvements for the append-only schema [DPP-600] (#10888)
9a1a101 Increase timeout for heavy tests in ParticipantPruningIT (#10894)
9093c6c Improve wording for the active contracts service description (#10880)
c12f546 Document #10780 (#10781)
5814f6a update NOTICES file (#10893)
38227a8 [Ledger API error codes] ErrorCode enrichments [DPP-591] (#10874)
e7c443a enable json index for all fields that are queried with JSON_EXISTS (#10658)
6c1c02a document complete authorized auth0 setup (#10881)
e4230dc Do not drop generated `submissionId`s in `GrpcCommandService` [KVL-1104] (#10882)
b4750a4 trigger reach auth on internal network (#10844)
7908083 add Auth0 support to create-daml-app (#10673)
b86490c Add @adriaanm-da to the release rotation (#10872)
9e918c3 Update trigger-service docs to use --dar option in the corresponding example (#10877)
49a9556 [docs] Fix minor typo in doc of exerciseByKey in TS. (#10863)
f7c07ea interfaces: scala protobuf encoder (#10878)
be4e064 Ledger API Test Tool: support `--additional` tests [KVL-1100] (#10829)
97e14de [Ledger API error codes] ErrorCode interfaces and generator [DPP-591] (#10836)
6dcdaa4 [DPP-589] Add CLI flag to select minimum enabled TLS version (#10854)
1fc58d9 Navigator customviews highlight and choices button, apply custom theme on the login screen (#10859)
6faddc9 Update Daml Documentation to reflect command deduplication related changes [KVL-1094] (#10852)
7c29eee Cleanup normalize from svalue (#10873)
053f22a Convert SValue to Value, and normalize, in a single code pass. (#10828)
37a1cb2 compatibility-tests - Exclude CommandDeduplicationIT from running for existing 1.17 snapshots (#10866)
dfae9f6 Command deduplication - better support for different deduplication modes in conformance tests [KVL-1099] (#10864)
6f151e2 save kibana exports (#10861)
99f0362 [JSON-API] drop package token doc changes (#10865)
b50bb8e Populate `definite_answer` in `ApiException` [KVL-1004] (#10832)
a471225 LF: Add missing collision check for type synonyms (#10841)
1e1c452 LF: drop ad-hoc FrontStack builders (#10839)
8f5b4fa interfaces: protobuf encoder haskell side (#10850)
63f6678 ParticipantPruningIT divulgence test fixes to avoid flakiness on canton (#10860)
8a9d19a Command deduplication - KV conformance test for usage of max deduplication duration [KVL-1098] (#10846)
24fff88 LF: Simplify TransactionBuilder (#10753)
9a4c9df Implement LF desugaring of interface definitions (#10834)
2aaf601 interfaces: protobuf decoder haskell side (#10849)
6dc769b interfaces: lf typechecker implementation (#10843)
d9178d2 Clarify version usage in test tool exclusion docs (#10858)
c113954 Clarify docs for test tool exclusions (#10855)
8c9edd8 es cluster tweaks (#10853)
842c5b1 Drop early access notice from profiler docs (#10856)
7c47aca Improvements to wording in ledger-api protobuf docs (#10851)
cff0358 ledger-api: Remove unimplemented fields [KVL-1094] (#10822)
dcec6ea kvutils: Populate `definite_answer` in rejections [KVL-1004] (#10801)
1c4f173 Command deduplication - kvutils - Always use max deduplication duration as deduplication period [KVL-1098] (#10824)
567fe43 tweak trigger-service docs (#10845)
fb5ab5d setvar doesn't like new lines in assignment, refactor (#10842)
7225c04 [docs] Replace AdoptOpenJDK suggestion by Adoptium (#10837)
6a9c8a6 release 1.17.0-snapshot.20210910.7786.0.976ca400 (#10838)
6ed2124 LF: clean up useless version tests. (#10833)
85f6f36 Modify the name of the secrets-url CLI flag to tls-secrets-url [DPP-604] (#10840)
d809fd9 [JSON-API] surrogate template id cache (#10806)
```

Changelog:
```
- [Sandbox] - Added a CLI parameter for configuring the number of connections in the database connection pool used for serving ledger API requests
[Docs] Improved description of the purpose and usage of the active contracts service
[Docs/JSON API] documented 256B limitation of Oracle query store
- The Trigger Service can now accept separate `--auth-internal` and `--auth-external` CLI arguments, where `--auth-internal` is the address used by the Trigger Service to reach the Auth Middleware directly, and `--auth-external` is the address the Trigger Service uses in generated URLs sent back to the client. The `--auth` option remains and keeps working as before, setting both internal and external addresses to the same given value.
- The `create-daml-app` template now includes support for a third authentication scheme (in addition to the existing "dev mode" and Daml Hub support): Auth0.
Sandbox: Add CLI flag to select minimum enabled TLS version for ledger API server.
- [Navigator] The currently selected custom view is now highlighted on the sidebar
kvutils - committer side deduplication always uses max_deduplication_duration + min_skew as a deduplication period for all the requests.
Modify the name of the secrets-url CLI flag to tls-secrets-url.
```

changelog_begin
changelog_end

garyverhaegen-da added the Standard-Change label Sep 11, 2021

garyverhaegen-da requested review from cocreature, aherrmann-da and stefanobaghino-da September 11, 2021 23:33

cocreature approved these changes Sep 13, 2021


stefanobaghino-da approved these changes Sep 13, 2021


garyverhaegen-da merged commit 8c9edd8 into main Sep 13, 2021

garyverhaegen-da deleted the es-reset branch September 13, 2021 09:12

garyverhaegen-da mentioned this pull request Sep 13, 2021

save kibana exports #10861

Merged
