es cluster tweaks #10853
Merged
On Sept 8 our ES cluster became unresponsive. I tried connecting to the machines.

One machine had an ES Docker container that claimed to have started 7 weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5 weeks. I assume GCP had decided to restart it for some reason. The init script had failed on a missing TTY, hence the addition of the `DEBIAN_FRONTEND` env var.

Two machines had a Docker container that had stopped on that day, 6h and 2h respectively before I started investigating. It wasn't immediately clear what had caused the containers to stop.

On all three of these machines, I was able to manually restart the containers and they were able to re-form a cluster, though the state of the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts. Assuming it might help, I decided to try to restart the machines. As GCP does not allow restarting individual machines when they're part of a managed instance group, I tried clicking the "rolling restart" button on the GCP console, which seemed like it would restart the machines. I carefully selected "restart" (and not "replace"), started the process, and watched GCP proceed to immediately replace all five machines, losing all data in the process.

I then started a new cluster and used bigger (and more) machines to reingest all of the data, and then fell back to the existing configuration for the "steady" state. I'll try to keep a better eye on the state of the cluster from now on. In particular, we should not have a node down for 5 weeks without noticing.

I'll also try to find some time to look into backing up the Kibana configuration, as that's the one thing we can't just reingest at the moment.

CHANGELOG_BEGIN
CHANGELOG_END
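To make the "node down for 5 weeks without noticing" point concrete, here is a minimal sketch of a periodic check against Elasticsearch's `_cluster/health` endpoint. It is not part of this PR: the function names, the `expected_nodes` parameter, and the idea of running it on a schedule are assumptions for illustration. The `status`, `number_of_nodes`, and `unassigned_shards` fields are the ones the health API reports.

```python
import json
import urllib.request


def fetch_cluster_health(base_url, timeout=10.0):
    """Fetch <base_url>/_cluster/health and return the parsed JSON body.

    base_url is a hypothetical address such as "http://localhost:9200".
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def health_problems(health, expected_nodes):
    """Return human-readable problems found in a _cluster/health response."""
    problems = []

    # "green" means all primary and replica shards are allocated.
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")

    # A node that silently dropped out shows up as a lower node count.
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes are in the cluster")

    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shards")

    return problems


# Example with a made-up response resembling the state described above:
# three nodes back up, two unreachable, cluster red.
sample = {"status": "red", "number_of_nodes": 3, "unassigned_shards": 12}
for problem in health_problems(sample, expected_nodes=5):
    print(problem)
```

Keeping the evaluation separate from the HTTP call means the rules can be exercised on canned responses. A real check would pass the result of `fetch_cluster_health` to `health_problems` and raise an alert whenever the list is non-empty.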
garyverhaegen-da requested review from cocreature, aherrmann-da and stefanobaghino-da on September 11, 2021 23:33
Current "production" state of the cluster matches this PR (as of the time of opening).
cocreature
approved these changes
Sep 13, 2021
Thanks!
stefanobaghino-da
approved these changes
Sep 13, 2021
Thanks for the detailed description and the fix! 🙇🏻
garyverhaegen-da
added a commit
that referenced
this pull request
Sep 13, 2021
As explained in #10853, we recently lost our ES cluster. While I'm not planning on trusting Google's "rolling restart" feature ever again, we can't exclude the possibility of future similar outages (without a significant investment in the cluster, which I don't think we want to do). Losing the cluster is not a huge issue as we can always reingest the data. Worst case we lose visibility for a few days, at least as far as the Bazel logs are concerned. Losing the Kibana data is a lot more annoying, as that is not derived data and thus cannot be reingested. This PR aims to add a backup mechanism for our Kibana configuration. CHANGELOG_BEGIN CHANGELOG_END
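For readers wondering what a Kibana configuration backup can look like, the following is a minimal sketch built on Kibana's saved objects export endpoint (`POST /api/saved_objects/_export`, which returns NDJSON). It is an illustration under assumptions, not the mechanism implemented in the referenced commit: the function names, the URL, the list of object types, and the destination path are all hypothetical.

```python
import json
import urllib.request


def build_export_request(kibana_url, types):
    """Build the POST request for Kibana's saved objects export API."""
    body = json.dumps({"type": list(types)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{kibana_url}/api/saved_objects/_export",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Kibana rejects API writes that lack this header.
            "kbn-xsrf": "true",
        },
    )


def export_saved_objects(kibana_url, types, dest_path, timeout=60.0):
    """Download the export (NDJSON) and write it to dest_path."""
    request = build_export_request(kibana_url, types)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = resp.read()
    with open(dest_path, "wb") as out:
        out.write(payload)


# Hypothetical usage; the URL and file name are placeholders:
# export_saved_objects(
#     "http://localhost:5601",
#     ["dashboard", "visualization", "index-pattern", "search"],
#     "kibana-export.ndjson",
# )
```

The resulting file can be loaded back through the matching import endpoint or the Kibana UI, which is what makes it usable as a backup once it is stored somewhere outside the cluster.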
azure-pipelines bot
pushed a commit
that referenced
this pull request
Sep 15, 2021
This PR has been created by a script, which is not very smart and does not have all the context. Please do double-check that the version prefix is correct before merging.

@cocreature is in charge of this release.

Commit log:
```
38227a8 [Ledger API error codes] ErrorCode enrichments [DPP-591] (#10874)
e7c443a enable json index for all fields that are queried with JSON_EXISTS (#10658)
6c1c02a document complete authorized auth0 setup (#10881)
e4230dc Do not drop generated `submissionId`s in `GrpcCommandService` [KVL-1104] (#10882)
b4750a4 trigger reach auth on internal network (#10844)
7908083 add Auth0 support to create-daml-app (#10673)
b86490c Add @adriaanm-da to the release rotation (#10872)
9e918c3 Update trigger-service docs to use --dar option in the corresponding example (#10877)
49a9556 [docs] Fix minor typo in doc of exerciseByKey in TS. (#10863)
f7c07ea interfaces: scala protobuf encoder (#10878)
be4e064 Ledger API Test Tool: support `--additional` tests [KVL-1100] (#10829)
97e14de [Ledger API error codes] ErrorCode interfaces and generator [DPP-591] (#10836)
6dcdaa4 [DPP-589] Add CLI flag to select minimum enabled TLS version (#10854)
1fc58d9 Navigator customviews highlight and choices button, apply custom theme on the login screen (#10859)
6faddc9 Update Daml Documentation to reflect command deduplication related changes [KVL-1094] (#10852)
7c29eee Cleanup normalize from svalue (#10873)
053f22a Convert SValue to Value, and normalize, in a single code pass. (#10828)
37a1cb2 compatibility-tests - Exclude CommandDeduplicationIT from running for existing 1.17 snapshots (#10866)
dfae9f6 Command deduplication - better support for different deduplication modes in conformance tests [KVL-1099] (#10864)
6f151e2 save kibana exports (#10861)
99f0362 [JSON-API] drop package token doc changes (#10865)
b50bb8e Populate `definite_answer` in `ApiException` [KVL-1004] (#10832)
a471225 LF: Add missing collision check for type synonyms (#10841)
1e1c452 LF: drop ad-hoc FrontStack builders (#10839)
8f5b4fa interfaces: protobuf encoder haskell side (#10850)
63f6678 ParticipantPruningIT divulgence test fixes to avoid flakiness on canton (#10860)
8a9d19a Command deduplication - KV conformance test for usage of max deduplication duration [KVL-1098] (#10846)
24fff88 LF: Simplify TransactionBuilder (#10753)
9a4c9df Implement LF desugaring of interface definitions (#10834)
2aaf601 interfaces: protobuf decoder haskell side (#10849)
6dc769b interfaces: lf typechecker implementation (#10843)
d9178d2 Clarify version usage in test tool exclusion docs (#10858)
c113954 Clarify docs for test tool exclusions (#10855)
8c9edd8 es cluster tweaks (#10853)
842c5b1 Drop early access notice from profiler docs (#10856)
7c47aca Improvements to wording in ledger-api protobuf docs (#10851)
cff0358 ledger-api: Remove unimplemented fields [KVL-1094] (#10822)
dcec6ea kvutils: Populate `definite_answer` in rejections [KVL-1004] (#10801)
1c4f173 Command deduplication - kvutils - Always use max deduplication duration as deduplication period [KVL-1098] (#10824)
567fe43 tweak trigger-service docs (#10845)
fb5ab5d setvar doesn't like new lines in assignment, refactor (#10842)
7225c04 [docs] Replace AdoptOpenJDK suggestion by Adoptium (#10837)
6a9c8a6 release 1.17.0-snapshot.20210910.7786.0.976ca400 (#10838)
6ed2124 LF: clean up useless version tests. (#10833)
85f6f36 Modify the name of the secrets-url CLI flag to tls-secrets-url [DPP-604] (#10840)
d809fd9 [JSON-API] surrogate template id cache (#10806)
```

Changelog:
```
- The Trigger Service can now accept separate `--auth-internal` and `--auth-external` CLI arguments, where `--auth-internal` is the address used by the Trigger Service to reach the Auth Middleware directly, and `--auth-external` is the address the Trigger Service uses in generated URLs sent back to the client. The `--auth` option remains and keeps working as before, setting both internal and external addresses to the same given value.
- The `create-daml-app` template now includes support for a third authentication scheme (in addition to the existing "dev mode" and Daml Hub support): Auth0.
Sandbox: Add CLI flag to select minimum enabled TLS version for ledger API server.
- [Navigator] The currently selected custom view is now highlighted on the sidebar
kvutils - committer side deduplication always uses max_deduplication_duration + min_skew as a deduplication period for all the requests.
Modify the name of the secrets-url CLI flag to tls-secrets-url.
```

CHANGELOG_BEGIN
CHANGELOG_END
akrmn
added a commit
that referenced
this pull request
Sep 15, 2021
Manual release process.

@akrmn is in charge of this release.

Commit log:
```
b5648c0 Make `CommandTracker` distinguish submissions of the same command using `submissionId` [KVL-1104] (#10868)
b4328b3 ledger-api-test-tool - Add conformance test for parallel command deduplication using CommandSubmissionService [KVL-1099] (#10869)
0c32e3b Fix Parallel Indexer initialization issue [DPP-542] (#10889)
b3e4975 Chore slow migration error removal (#10886)
e4cce53 Create a new grpc exception for each duplicate result [KVL-1099] (#10887)
a939594 Sandbox on H2 - performance improvements for the append-only schema [DPP-600] (#10888)
9a1a101 Increase timeout for heavy tests in ParticipantPruningIT (#10894)
9093c6c Improve wording for the active contracts service description (#10880)
c12f546 Document #10780 (#10781)
5814f6a update NOTICES file (#10893)
38227a8 [Ledger API error codes] ErrorCode enrichments [DPP-591] (#10874)
e7c443a enable json index for all fields that are queried with JSON_EXISTS (#10658)
6c1c02a document complete authorized auth0 setup (#10881)
e4230dc Do not drop generated `submissionId`s in `GrpcCommandService` [KVL-1104] (#10882)
b4750a4 trigger reach auth on internal network (#10844)
7908083 add Auth0 support to create-daml-app (#10673)
b86490c Add @adriaanm-da to the release rotation (#10872)
9e918c3 Update trigger-service docs to use --dar option in the corresponding example (#10877)
49a9556 [docs] Fix minor typo in doc of exerciseByKey in TS. (#10863)
f7c07ea interfaces: scala protobuf encoder (#10878)
be4e064 Ledger API Test Tool: support `--additional` tests [KVL-1100] (#10829)
97e14de [Ledger API error codes] ErrorCode interfaces and generator [DPP-591] (#10836)
6dcdaa4 [DPP-589] Add CLI flag to select minimum enabled TLS version (#10854)
1fc58d9 Navigator customviews highlight and choices button, apply custom theme on the login screen (#10859)
6faddc9 Update Daml Documentation to reflect command deduplication related changes [KVL-1094] (#10852)
7c29eee Cleanup normalize from svalue (#10873)
053f22a Convert SValue to Value, and normalize, in a single code pass. (#10828)
37a1cb2 compatibility-tests - Exclude CommandDeduplicationIT from running for existing 1.17 snapshots (#10866)
dfae9f6 Command deduplication - better support for different deduplication modes in conformance tests [KVL-1099] (#10864)
6f151e2 save kibana exports (#10861)
99f0362 [JSON-API] drop package token doc changes (#10865)
b50bb8e Populate `definite_answer` in `ApiException` [KVL-1004] (#10832)
a471225 LF: Add missing collision check for type synonyms (#10841)
1e1c452 LF: drop ad-hoc FrontStack builders (#10839)
8f5b4fa interfaces: protobuf encoder haskell side (#10850)
63f6678 ParticipantPruningIT divulgence test fixes to avoid flakiness on canton (#10860)
8a9d19a Command deduplication - KV conformance test for usage of max deduplication duration [KVL-1098] (#10846)
24fff88 LF: Simplify TransactionBuilder (#10753)
9a4c9df Implement LF desugaring of interface definitions (#10834)
2aaf601 interfaces: protobuf decoder haskell side (#10849)
6dc769b interfaces: lf typechecker implementation (#10843)
d9178d2 Clarify version usage in test tool exclusion docs (#10858)
c113954 Clarify docs for test tool exclusions (#10855)
8c9edd8 es cluster tweaks (#10853)
842c5b1 Drop early access notice from profiler docs (#10856)
7c47aca Improvements to wording in ledger-api protobuf docs (#10851)
cff0358 ledger-api: Remove unimplemented fields [KVL-1094] (#10822)
dcec6ea kvutils: Populate `definite_answer` in rejections [KVL-1004] (#10801)
1c4f173 Command deduplication - kvutils - Always use max deduplication duration as deduplication period [KVL-1098] (#10824)
567fe43 tweak trigger-service docs (#10845)
fb5ab5d setvar doesn't like new lines in assignment, refactor (#10842)
7225c04 [docs] Replace AdoptOpenJDK suggestion by Adoptium (#10837)
6a9c8a6 release 1.17.0-snapshot.20210910.7786.0.976ca400 (#10838)
6ed2124 LF: clean up useless version tests. (#10833)
85f6f36 Modify the name of the secrets-url CLI flag to tls-secrets-url [DPP-604] (#10840)
d809fd9 [JSON-API] surrogate template id cache (#10806)
```

Changelog:
```
- [Sandbox] - Added a CLI parameter for configuring the number of connections in the database connection pool used for serving ledger API requests
[Docs] Improved description of the purpose and usage of the active contracts service
[Docs/JSON API] documented 256B limitation of Oracle query store
- The Trigger Service can now accept separate `--auth-internal` and `--auth-external` CLI arguments, where `--auth-internal` is the address used by the Trigger Service to reach the Auth Middleware directly, and `--auth-external` is the address the Trigger Service uses in generated URLs sent back to the client. The `--auth` option remains and keeps working as before, setting both internal and external addresses to the same given value.
- The `create-daml-app` template now includes support for a third authentication scheme (in addition to the existing "dev mode" and Daml Hub support): Auth0.
Sandbox: Add CLI flag to select minimum enabled TLS version for ledger API server.
- [Navigator] The currently selected custom view is now highlighted on the sidebar
kvutils - committer side deduplication always uses max_deduplication_duration + min_skew as a deduplication period for all the requests.
Modify the name of the secrets-url CLI flag to tls-secrets-url.
```

changelog_begin
changelog_end