es cluster tweaks #10853

Merged
merged 1 commit into from
Sep 13, 2021
garyverhaegen-da
Contributor

On Sept 8 our ES cluster became unresponsive. I tried connecting to the
machines.

One machine had an ES Docker container that claimed to have started 7
weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5
weeks. I assume GCP had decided to restart it for some reason. The init
script had failed on missing a TTY, hence the addition of the
`DEBIAN_FRONTEND` env var.

Two machines had a Docker container that had stopped on that day, resp.
6h and 2h before I started investigating. It wasn't immediately clear
what had caused the containers to stop.

On all three of these machines, I was able to manually restart the
containers and they were able to re-form a cluster, though the state of
the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts.
Assuming it might help, I decided to try to restart the machines. As GCP
does not allow restarting individual machines when they're part of a
managed instance group, I tried clicking the "rolling restart" button
on the GCP console, which seemed like it would restart the machines. I
carefully selected "restart" (and not "replace"), started the process,
and watched GCP proceed to immediately replace all five machines, losing
all data in the process.

I then started a new cluster and used bigger (and more) machines to
reingest all of the data, and then fell back to the existing
configuration for the "steady" state. I'll try to keep a better eye on
the state of the cluster from now on. In particular, we should not have
a node down for 5 weeks without noticing.
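
One way to notice a missing node sooner is to poll Elasticsearch's `_cluster/health` endpoint on a schedule. The following is a minimal Python sketch and not what this PR implements: the base URL, the `EXPECTED_NODES` value of 5 (taken from the five machines mentioned above), and the helper names `fetch_health` and `find_problems` are assumptions made for illustration.

```python
import json
import urllib.request

# Assumption: the cluster is expected to run five nodes, as in the incident above.
EXPECTED_NODES = 5


def fetch_health(base_url, timeout=10):
    """Fetch the JSON document returned by GET <base_url>/_cluster/health."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def find_problems(health, expected_nodes=EXPECTED_NODES):
    """Return human-readable problems found in a cluster health document."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes have joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        problems.append(f"{unassigned} unassigned shards")
    return problems
```

Calling `find_problems(fetch_health("http://localhost:9200"))` from a scheduled job and alerting on a non-empty result would flag a stopped node or a red cluster; the URL here is a placeholder.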

I'll also try to find some time to look into backing up the Kibana
configuration, as that's the one thing we can't just reingest at the
moment.
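
For reference, Kibana exposes a saved objects export API (`POST /api/saved_objects/_export`) that returns dashboards, visualizations and similar objects as NDJSON. The sketch below only shows how such a request could be built in Python. The Kibana URL, the list of object types and the helpers `build_export_request` and `save_export` are hypothetical, and this is not necessarily how any follow-up backup is implemented.

```python
import json
import urllib.request

# Assumption: object types worth backing up; adjust to what the instance uses.
DEFAULT_TYPES = ["index-pattern", "search", "visualization", "dashboard"]


def build_export_request(kibana_url, types=DEFAULT_TYPES):
    """Build (but do not send) a saved objects export request."""
    body = json.dumps({"type": list(types), "includeReferencesDeep": True}).encode("utf-8")
    return urllib.request.Request(
        f"{kibana_url.rstrip('/')}/api/saved_objects/_export",
        data=body,
        method="POST",
        # Kibana rejects API writes that lack the kbn-xsrf header.
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
    )


def save_export(kibana_url, out_path, timeout=60):
    """Send the export request and write the NDJSON response to out_path."""
    with urllib.request.urlopen(build_export_request(kibana_url), timeout=timeout) as resp:
        with open(out_path, "wb") as out:
            out.write(resp.read())
```

The resulting file could then be copied to durable storage outside the cluster, so that a full replacement of the machines does not lose it.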

CHANGELOG_BEGIN
CHANGELOG_END

@garyverhaegen-da
Contributor Author

Current "production" state of the cluster matches this PR (as of the time of opening).

Contributor

@cocreature cocreature left a comment

Thanks!

Contributor

@stefanobaghino-da stefanobaghino-da left a comment

Thanks for the detailed description and the fix! 🙇🏻

@garyverhaegen-da garyverhaegen-da merged commit 8c9edd8 into main Sep 13, 2021
@garyverhaegen-da garyverhaegen-da deleted the es-reset branch September 13, 2021 09:12
garyverhaegen-da added a commit that referenced this pull request Sep 13, 2021
As explained in #10853, we recently lost our ES cluster. While I'm not
planning on trusting Google's "rolling restart" feature ever again, we
can't exclude the possibility of future similar outages (without a
significant investment in the cluster, which I don't think we want to
do).

Losing the cluster is not a huge issue as we can always reingest the
data. Worst case we lose visibility for a few days. At least, as far as
the bazel logs are concerned.

Losing the Kibana data is a lot more annoying, as that is not derived
data and thus cannot be reingested. This PR aims to add a backup
mechanism for our Kibana configuration.

CHANGELOG_BEGIN
CHANGELOG_END
azure-pipelines bot pushed a commit that referenced this pull request Sep 15, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@cocreature is in charge of this release.

Commit log:
```
38227a8 [Ledger API error codes] ErrorCode enrichments [DPP-591] (#10874)
e7c443a enable json index for all fields that are queried with JSON_EXISTS (#10658)
6c1c02a document complete authorized auth0 setup (#10881)
e4230dc Do not drop generated `submissionId`s in `GrpcCommandService` [KVL-1104] (#10882)
b4750a4 trigger reach auth on internal network (#10844)
7908083 add Auth0 support to create-daml-app (#10673)
b86490c Add @adriaanm-da to the release rotation (#10872)
9e918c3 Update trigger-service docs to use --dar option in the corresponding example (#10877)
49a9556 [docs] Fix minor typo in doc of exerciseByKey in TS. (#10863)
f7c07ea interfaces: scala protobuf encoder (#10878)
be4e064 Ledger API Test Tool: support `--additional` tests [KVL-1100] (#10829)
97e14de [Ledger API error codes] ErrorCode interfaces and generator [DPP-591] (#10836)
6dcdaa4 [DPP-589] Add CLI flag to select minimum enabled TLS version (#10854)
1fc58d9 Navigator customviews highlight and choices button, apply custom theme on the login screen (#10859)
6faddc9 Update Daml Documentation to reflect command deduplication related changes [KVL-1094] (#10852)
7c29eee Cleanup normalize from svalue (#10873)
053f22a Convert SValue to Value, and normalize, in a single code pass. (#10828)
37a1cb2 compatibility-tests - Exclude CommandDeduplicationIT from running for existing 1.17 snapshots (#10866)
dfae9f6 Command deduplication - better support for different deduplication modes in conformance tests [KVL-1099] (#10864)
6f151e2 save kibana exports (#10861)
99f0362 [JSON-API] drop package token doc changes (#10865)
b50bb8e Populate `definite_answer` in `ApiException` [KVL-1004] (#10832)
a471225 LF: Add missing collision check for type synonyms (#10841)
1e1c452 LF: drop ad-hoc FrontStack builders (#10839)
8f5b4fa interfaces: protobuf encoder haskell side (#10850)
63f6678 ParticipantPruningIT divulgence test fixes to avoid flakiness on canton (#10860)
8a9d19a Command deduplication - KV conformance test for usage of max deduplication duration [KVL-1098] (#10846)
24fff88 LF: Simplify TransactionBuilder (#10753)
9a4c9df Implement LF desugaring of interface definitions (#10834)
2aaf601 interfaces: protobuf decoder haskell side (#10849)
6dc769b interfaces: lf typechecker implementation (#10843)
d9178d2 Clarify version usage in test tool exclusion docs (#10858)
c113954 Clarify docs for test tool exclusions (#10855)
8c9edd8 es cluster tweaks (#10853)
842c5b1 Drop early access notice from profiler docs (#10856)
7c47aca Improvements to wording in ledger-api protobuf docs (#10851)
cff0358 ledger-api: Remove unimplemented fields [KVL-1094] (#10822)
dcec6ea kvutils: Populate `definite_answer` in rejections [KVL-1004] (#10801)
1c4f173 Command deduplication - kvutils - Always use max deduplication duration as deduplication period [KVL-1098] (#10824)
567fe43 tweak trigger-service docs (#10845)
fb5ab5d setvar doesn't like new lines in assignment, refactor (#10842)
7225c04 [docs] Replace AdoptOpenJDK suggestion by Adoptium (#10837)
6a9c8a6 release 1.17.0-snapshot.20210910.7786.0.976ca400 (#10838)
6ed2124 LF: clean up useless version tests. (#10833)
85f6f36 Modify the name of the secrets-url CLI flag to tls-secrets-url [DPP-604] (#10840)
d809fd9 [JSON-API] surrogate template id cache (#10806)
```
Changelog:
```
- The Trigger Service can now accept separate `--auth-internal` and
  `--auth-external` CLI arguments, where `--auth-internal` is the
  address used by the Trigger Service to reach the Auth Middleware
  directly, and `--auth-external` is the address the Trigger Service uses
  in generated URLs sent back to the client. The `--auth` option remains
  and keeps working as before, setting both internal and external
  addresses to the same given value.
- The `create-daml-app` template now includes support for a third
  authentication scheme (in addition to the existing "dev mode" and Daml
  Hub support): Auth0.
- Sandbox: Add CLI flag to select minimum enabled TLS version for ledger API server.
- [Navigator] The currently selected custom view is now highlighted on the sidebar
- kvutils - committer side deduplication always uses max_deduplication_duration + min_skew as a deduplication period for all the requests.
- Modify the name of the secrets-url CLI flag to tls-secrets-url.
```

CHANGELOG_BEGIN
CHANGELOG_END
akrmn added a commit that referenced this pull request Sep 15, 2021
Manual release process. @akrmn is in charge of this release.

Commit log:
```
b5648c0 Make `CommandTracker` distinguish submissions of the same command using `submissionId` [KVL-1104] (#10868)
b4328b3 ledger-api-test-tool - Add conformance test for parallel command deduplication using CommandSubmissionService [KVL-1099] (#10869)
0c32e3b Fix Parallel Indexer initialization issue [DPP-542] (#10889)
b3e4975 Chore slow migration error removal (#10886)
e4cce53 Create a new grpc exception for each duplicate result [KVL-1099] (#10887)
a939594 Sandbox on H2 - performance improvements for the append-only schema [DPP-600] (#10888)
9a1a101 Increase timeout for heavy tests in ParticipantPruningIT (#10894)
9093c6c Improve wording for the active contracts service description (#10880)
c12f546 Document #10780 (#10781)
5814f6a update NOTICES file (#10893)
38227a8 [Ledger API error codes] ErrorCode enrichments [DPP-591] (#10874)
e7c443a enable json index for all fields that are queried with JSON_EXISTS (#10658)
6c1c02a document complete authorized auth0 setup (#10881)
e4230dc Do not drop generated `submissionId`s in `GrpcCommandService` [KVL-1104] (#10882)
b4750a4 trigger reach auth on internal network (#10844)
7908083 add Auth0 support to create-daml-app (#10673)
b86490c Add @adriaanm-da to the release rotation (#10872)
9e918c3 Update trigger-service docs to use --dar option in the corresponding example (#10877)
49a9556 [docs] Fix minor typo in doc of exerciseByKey in TS. (#10863)
f7c07ea interfaces: scala protobuf encoder (#10878)
be4e064 Ledger API Test Tool: support `--additional` tests [KVL-1100] (#10829)
97e14de [Ledger API error codes] ErrorCode interfaces and generator [DPP-591] (#10836)
6dcdaa4 [DPP-589] Add CLI flag to select minimum enabled TLS version (#10854)
1fc58d9 Navigator customviews highlight and choices button, apply custom theme on the login screen (#10859)
6faddc9 Update Daml Documentation to reflect command deduplication related changes [KVL-1094] (#10852)
7c29eee Cleanup normalize from svalue (#10873)
053f22a Convert SValue to Value, and normalize, in a single code pass. (#10828)
37a1cb2 compatibility-tests - Exclude CommandDeduplicationIT from running for existing 1.17 snapshots (#10866)
dfae9f6 Command deduplication - better support for different deduplication modes in conformance tests [KVL-1099] (#10864)
6f151e2 save kibana exports (#10861)
99f0362 [JSON-API] drop package token doc changes (#10865)
b50bb8e Populate `definite_answer` in `ApiException` [KVL-1004] (#10832)
a471225 LF: Add missing collision check for type synonyms (#10841)
1e1c452 LF: drop ad-hoc FrontStack builders (#10839)
8f5b4fa interfaces: protobuf encoder haskell side (#10850)
63f6678 ParticipantPruningIT divulgence test fixes to avoid flakiness on canton (#10860)
8a9d19a Command deduplication - KV conformance test for usage of max deduplication duration [KVL-1098] (#10846)
24fff88 LF: Simplify TransactionBuilder (#10753)
9a4c9df Implement LF desugaring of interface definitions (#10834)
2aaf601 interfaces: protobuf decoder haskell side (#10849)
6dc769b interfaces: lf typechecker implementation (#10843)
d9178d2 Clarify version usage in test tool exclusion docs (#10858)
c113954 Clarify docs for test tool exclusions (#10855)
8c9edd8 es cluster tweaks (#10853)
842c5b1 Drop early access notice from profiler docs (#10856)
7c47aca Improvements to wording in ledger-api protobuf docs (#10851)
cff0358 ledger-api: Remove unimplemented fields [KVL-1094] (#10822)
dcec6ea kvutils: Populate `definite_answer` in rejections [KVL-1004] (#10801)
1c4f173 Command deduplication - kvutils - Always use max deduplication duration as deduplication period [KVL-1098] (#10824)
567fe43 tweak trigger-service docs (#10845)
fb5ab5d setvar doesn't like new lines in assignment, refactor (#10842)
7225c04 [docs] Replace AdoptOpenJDK suggestion by Adoptium (#10837)
6a9c8a6 release 1.17.0-snapshot.20210910.7786.0.976ca400 (#10838)
6ed2124 LF: clean up useless version tests. (#10833)
85f6f36 Modify the name of the secrets-url CLI flag to tls-secrets-url [DPP-604] (#10840)
d809fd9 [JSON-API] surrogate template id cache (#10806)
```
Changelog:
```

- [Sandbox] - Added a CLI parameter for configuring the number of connections in the database connection pool used for serving ledger API requests
- [Docs] Improved description of the purpose and usage of the active contracts service
- [Docs/JSON API] documented 256B limitation of Oracle query store
- The Trigger Service can now accept separate `--auth-internal` and
  `--auth-external` CLI arguments, where `--auth-internal` is the
  address used by the Trigger Service to reach the Auth Middleware
  directly, and `--auth-external` is the address the Trigger Service uses
  in generated URLs sent back to the client. The `--auth` option remains
  and keeps working as before, setting both internal and external
  addresses to the same given value.
- The `create-daml-app` template now includes support for a third
  authentication scheme (in addition to the existing "dev mode" and Daml
  Hub support): Auth0.
- Sandbox: Add CLI flag to select minimum enabled TLS version for ledger API server.
- [Navigator] The currently selected custom view is now highlighted on the sidebar
- kvutils - committer side deduplication always uses max_deduplication_duration + min_skew as a deduplication period for all the requests.
- Modify the name of the secrets-url CLI flag to tls-secrets-url.
```

changelog_begin
changelog_end