Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes participant to do retries on startup as waiting for DB #10759

Merged
merged 1 commit into from
Sep 2, 2021

Conversation

nmarton-da
Copy link
Contributor

CHANGELOG_BEGIN
CHANGELOG_END

Pull Request Checklist

  • Read and understand the contribution guidelines
  • Include appropriate tests
  • Set a descriptive title and thorough description
  • Add a reference to the issue this PR will solve, if appropriate
  • Include changelog additions in one or more commit message bodies between the CHANGELOG_BEGIN and CHANGELOG_END tags
  • Normal production system change, include purpose of change in description

NOTE: CI is not automatically run on non-members pull-requests for security
reasons. The reviewer will have to comment with /AzurePipelines run to
trigger the build.

@nmarton-da nmarton-da requested a review from a team as a code owner September 2, 2021 18:05
@kamil-da
Copy link
Contributor

kamil-da commented Sep 2, 2021

Is this related? #10758

@nmarton-da
Copy link
Contributor Author

Is this related? #10758

indeed. see #10758 (comment)

@nmarton-da nmarton-da force-pushed the no-task-fix-participant-retry-on-init branch from 5a568be to fc6579d Compare September 2, 2021 18:17
@@ -31,6 +31,7 @@ private[platform] class FlywayMigrations(
.configure()
.locations((locations(enableAppendOnlySchema, dbType) ++ additionalMigrationPaths): _*)
.dataSource(dataSource)
.connectRetries(600)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deferring waiting for connectivity to Flyway.
this would do exponential backoff waiting 1 2 4 8 16 ... seconds between retries
Flyway is called almost every case at startup: exception sandboxnext reset triggered restart

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flyway says it always waits 1sec between retries. Waiting for 10min before declaring there is no connectivity to the database seems a bit long, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline:

@@ -257,7 +257,7 @@ private[backend] object PostgresStorageBackend
val pgSimpleDataSource = new PGSimpleDataSource()
pgSimpleDataSource.setUrl(jdbcUrl)

Using.resource(pgSimpleDataSource.getConnection())(checkCompatibility)
Try(Using.resource(pgSimpleDataSource.getConnection())(checkCompatibility))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a consequence this won't do much on the first datasource initialisation if noconnectivity, but it will later as ds-s are created for dbdispatchers (at that time we already waited once for connectivity)

@@ -257,7 +257,7 @@ private[backend] object PostgresStorageBackend
val pgSimpleDataSource = new PGSimpleDataSource()
pgSimpleDataSource.setUrl(jdbcUrl)

Using.resource(pgSimpleDataSource.getConnection())(checkCompatibility)
Try(Using.resource(pgSimpleDataSource.getConnection())(checkCompatibility))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This silently drops any errors, wouldn't it be better to use com.daml.timer.RetryStrategy.exponentialBackoff or similar to retry at least a few times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is a good idea to retry here, see explanation #10758 (comment)
And yes this will silently drop errors. But still it will run multiple times as we instantiate other datasources later, when connectivity should be already there. Not ideal I know.
The way I see it either we do something quickly if this is pushing (like this) and then we do the proper thing, or we just do the proper thing.

@rautenrieth-da rautenrieth-da self-requested a review September 2, 2021 21:10
Copy link
Contributor

@rautenrieth-da rautenrieth-da left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think silently dropping errors during data source creation is a good idea, but this is a quick fix to unblock others.

A proper fix should be added soon after #10705 lands.

@nmarton-da nmarton-da force-pushed the no-task-fix-participant-retry-on-init branch from fc6579d to a65772c Compare September 2, 2021 21:42
@mergify mergify bot merged commit e45d852 into main Sep 2, 2021
@mergify mergify bot deleted the no-task-fix-participant-retry-on-init branch September 2, 2021 23:12
azure-pipelines bot pushed a commit that referenced this pull request Sep 8, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@akrmn is in charge of this release.

Commit log:
```
35a853f Async logging for http-json-perf tests and  publish logs as a pipeline artifact (#10803)
a349217 get oracle perf test output (#10800)
5f448f2 Cleanup DataSource initialization (#10794)
0e250d0 Fix parallel ingestion initialization (#10797)
b8bd5e6 interface PoC: protobuf definitions (#10796)
a1da025 FreePort draw from outside ephemeral port range (#10774)
945a86c Fix a race in a test of the RecoveringIndexer (#10792)
4093bbd fix macOS Bazel cache (#10795)
ed99fe5 ParticipantPruningIT added events for canton safe-pruning and multi-node synchronization (#10775)
5b28aef Deprecate kvutils rejections generated only by participant state v1 [KVL-1090] (#10791)
116d6a5 DPP-577 Use BIGINT instead of TIMESTAMP (#10761)
cdf4bf1 Check compatibility of timestamps (#10793)
098a08c Add Moisés to release rotation (#10789)
7ae199f Follow-up to h2 storage backend user/password in url: Review feedback switch to regex and move to pure function (#10541)
a5999ab remove dead code (#10765)
4b7391b LedgerApiTestTool prints skipped (excluded) tests (#10726)
a0f0fb3 [DPP-418] Enable outstanding test case for private key decryption (#10778)
d750666 Do not drop details when converting between gRPC `Status` classes [KVL-1084] (#10745)
5ec9a23 Fix link to the gRPC health checks in the driver for PostgreSQL docs (#10779)
8538fe9 HA protected ledger initialization for append only schema [DPP-551] (#10705)
23b0fe7 Fix exclusions for command service tests (#10777)
c4e0a75 LF: drop ad-hoc ImmArray builders (#10763)
303ba90 participant-state: Re-enable integration test for command deduplication [KVL-1089] (#10751)
793253c bump cleanup threshold (#10771)
4d67684 Fix flaky DBLock tests (#10770)
7515871 Fix exclusions for command dedup in compat tests (#10767)
50fecfb Wrap missing label names in quotes (#10749)
b1c6e87 fix claims check in auth middleware (#10768)
5595d55 [JSON-API] Use the token from incoming requests to update the package list (#10602)
dbafea0 Add unit tests for DBLockStorageBackend [DPP-495] (#10746)
e5a6d70 Added buffer size metrics for getTransactions/getTransactionTrees (#10744)
f76c868 update NOTICES file (#10762)
bb908d0 Create a link to party management service in main Ledger API page. (#10138)
e45d852 Fixes participant to do retries on startup as waiting for DB connectivity (#10759)
f68a129 Report some oracle_json_perf numbers on slack  (#10754)
7270ee3 Handle dynamic port in auth middleware client (trigger service) (#10755)
ff1308e [Docs] Add info on logs on Kubernetes & metrics in the ops section (#10525)
d267390 Update buf image [kvl-1049] (#10752)
d2180cf Update exclusion for command deduplication to include full version range (#10750)
41881ba Command dedup: migrate kvutils to use v2 services [KVL-1049] (#10679)
4525b8c [JSON-API] vanilla oracle_perf ci job (#10688)
16df8a5 [Short] Remove unused code (#10719)
b28afcf [DPP-438] Update docs on metrics that no longer use <party_name> in their name (#10728)
0662025 Clarify what the `buf` image is and how it should be used (#10741)
9071a05 sandbox-classic: Reintroduce SqlLedger tests for the mutable index. (#10722)
f576cdf fix flaky test in RecoveringIndexer (#9619)
f6a75b4 ledger-api-common: Do not mock final classes. (#10733)
e9c8af5 Bump ghc-lib to include dropped parsing code for generic templates (#10735)
862a290 rotate release duty after 1.17.0-snapshot.20210831.7702.0.f058c2f1 (#10731)
12cb92b update compat versions for 1.17.0-snapshot.20210831.7702.0.f058c2f1 (#10737)
73c0de8 [DPP-578] Temporarily disable flaky test - Mockito plugin issue (#10734)
963bcb1 release 1.17.0-snapshot.20210831.7702.0.f058c2f1 (#10730)
c11323d LF: Refactor engine test reinterpret (#10724)
a995fa3 [DPP-574] Update docs about TLS: encrypted private key (#10727)
66970b7 Not sharing the absolute Deadline object. (#10713)
c6c304b Improve script error on invalid script identifier format (#10702)
b748fd6 Support TLS in Daml Helper --json-api requests (#10709)
```
Changelog:
```
kvutils - protobuf rejections generated by the participant state v1 API are deprecated (Inconsistent, Disputed, ResourcesExhausted, PartyNotKnownOnLedger)
[Docs] Fixed link to the gRPC health checks in the driver for PostgreSQL docs

participant-state: Emit completions (CommandRejected) for duplicate commands when using pre-execution
- Fix a bug in the auth middleware where insufficient credentials could
  still give access to list of running triggers.
- [JSON API] The cli option `--access-token-file` is now deprecated. It
    is not needed anymore and you can safely remove it. Reason is that
    the operations which prior required a token at startup are now done
    on demand using the token of the incoming request.
- [Docs] Information was added in the `Operating Daml` section on how to aggregate logs on Kubernetes in conjuction with Daml services & what options exists for exporting metrics from daml services (not Kubernetes specific)

participant-state:
- Migrate to use only v2 Read/Write Services. This includes the use of new models for rejections/updates/submission results.
- Calculate deduplicate until for committer side deduplication based on the submitter given deduplication period
   -  if using the deprecated deduplicateUntil then just set the given timestamp
   -  if using duration then calculate the new deduplicateUntil by using the formula (submissionTime + deduplicationDuration + minSkew)
  -   if using offset deduplication then calculate deduplicateUntil by using the formula (submissionTime + maxDeduplicationDuration + minSkew)
  -  if the deduplication period is not set then we don't set deduplicateUntil
- Emit completions with status `ALREADY_EXISTS` for duplicate commands

- [Java Bindings] DamlLedgerClient.Builder allows to set a timeout for command using `withTimeout`.

- [Daml Assistant] The `daml ledger` commands now accepts `--tls` in
  combination with `--json-api` to access a JSON API behind a TLS
  reverse proxy.

```

CHANGELOG_BEGIN
CHANGELOG_END
moisesackerman-da pushed a commit that referenced this pull request Sep 8, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@akrmn is in charge of this release.

Commit log:
```
35a853f Async logging for http-json-perf tests and  publish logs as a pipeline artifact (#10803)
a349217 get oracle perf test output (#10800)
5f448f2 Cleanup DataSource initialization (#10794)
0e250d0 Fix parallel ingestion initialization (#10797)
b8bd5e6 interface PoC: protobuf definitions (#10796)
a1da025 FreePort draw from outside ephemeral port range (#10774)
945a86c Fix a race in a test of the RecoveringIndexer (#10792)
4093bbd fix macOS Bazel cache (#10795)
ed99fe5 ParticipantPruningIT added events for canton safe-pruning and multi-node synchronization (#10775)
5b28aef Deprecate kvutils rejections generated only by participant state v1 [KVL-1090] (#10791)
116d6a5 DPP-577 Use BIGINT instead of TIMESTAMP (#10761)
cdf4bf1 Check compatibility of timestamps (#10793)
098a08c Add Moisés to release rotation (#10789)
7ae199f Follow-up to h2 storage backend user/password in url: Review feedback switch to regex and move to pure function (#10541)
a5999ab remove dead code (#10765)
4b7391b LedgerApiTestTool prints skipped (excluded) tests (#10726)
a0f0fb3 [DPP-418] Enable outstanding test case for private key decryption (#10778)
d750666 Do not drop details when converting between gRPC `Status` classes [KVL-1084] (#10745)
5ec9a23 Fix link to the gRPC health checks in the driver for PostgreSQL docs (#10779)
8538fe9 HA protected ledger initialization for append only schema [DPP-551] (#10705)
23b0fe7 Fix exclusions for command service tests (#10777)
c4e0a75 LF: drop ad-hoc ImmArray builders (#10763)
303ba90 participant-state: Re-enable integration test for command deduplication [KVL-1089] (#10751)
793253c bump cleanup threshold (#10771)
4d67684 Fix flaky DBLock tests (#10770)
7515871 Fix exclusions for command dedup in compat tests (#10767)
50fecfb Wrap missing label names in quotes (#10749)
b1c6e87 fix claims check in auth middleware (#10768)
5595d55 [JSON-API] Use the token from incoming requests to update the package list (#10602)
dbafea0 Add unit tests for DBLockStorageBackend [DPP-495] (#10746)
e5a6d70 Added buffer size metrics for getTransactions/getTransactionTrees (#10744)
f76c868 update NOTICES file (#10762)
bb908d0 Create a link to party management service in main Ledger API page. (#10138)
e45d852 Fixes participant to do retries on startup as waiting for DB connectivity (#10759)
f68a129 Report some oracle_json_perf numbers on slack  (#10754)
7270ee3 Handle dynamic port in auth middleware client (trigger service) (#10755)
ff1308e [Docs] Add info on logs on Kubernetes & metrics in the ops section (#10525)
d267390 Update buf image [kvl-1049] (#10752)
d2180cf Update exclusion for command deduplication to include full version range (#10750)
41881ba Command dedup: migrate kvutils to use v2 services [KVL-1049] (#10679)
4525b8c [JSON-API] vanilla oracle_perf ci job (#10688)
16df8a5 [Short] Remove unused code (#10719)
b28afcf [DPP-438] Update docs on metrics that no longer use <party_name> in their name (#10728)
0662025 Clarify what the `buf` image is and how it should be used (#10741)
9071a05 sandbox-classic: Reintroduce SqlLedger tests for the mutable index. (#10722)
f576cdf fix flaky test in RecoveringIndexer (#9619)
f6a75b4 ledger-api-common: Do not mock final classes. (#10733)
e9c8af5 Bump ghc-lib to include dropped parsing code for generic templates (#10735)
862a290 rotate release duty after 1.17.0-snapshot.20210831.7702.0.f058c2f1 (#10731)
12cb92b update compat versions for 1.17.0-snapshot.20210831.7702.0.f058c2f1 (#10737)
73c0de8 [DPP-578] Temporarily disable flaky test - Mockito plugin issue (#10734)
963bcb1 release 1.17.0-snapshot.20210831.7702.0.f058c2f1 (#10730)
c11323d LF: Refactor engine test reinterpret (#10724)
a995fa3 [DPP-574] Update docs about TLS: encrypted private key (#10727)
66970b7 Not sharing the absolute Deadline object. (#10713)
c6c304b Improve script error on invalid script identifier format (#10702)
b748fd6 Support TLS in Daml Helper --json-api requests (#10709)
```
Changelog:
```
kvutils - protobuf rejections generated by the participant state v1 API are deprecated (Inconsistent, Disputed, ResourcesExhausted, PartyNotKnownOnLedger)
[Docs] Fixed link to the gRPC health checks in the driver for PostgreSQL docs

participant-state: Emit completions (CommandRejected) for duplicate commands when using pre-execution
- Fix a bug in the auth middleware where insufficient credentials could
  still give access to list of running triggers.
- [JSON API] The cli option `--access-token-file` is now deprecated. It
    is not needed anymore and you can safely remove it. Reason is that
    the operations which prior required a token at startup are now done
    on demand using the token of the incoming request.
- [Docs] Information was added in the `Operating Daml` section on how to aggregate logs on Kubernetes in conjuction with Daml services & what options exists for exporting metrics from daml services (not Kubernetes specific)

participant-state:
- Migrate to use only v2 Read/Write Services. This includes the use of new models for rejections/updates/submission results.
- Calculate deduplicate until for committer side deduplication based on the submitter given deduplication period
   -  if using the deprecated deduplicateUntil then just set the given timestamp
   -  if using duration then calculate the new deduplicateUntil by using the formula (submissionTime + deduplicationDuration + minSkew)
  -   if using offset deduplication then calculate deduplicateUntil by using the formula (submissionTime + maxDeduplicationDuration + minSkew)
  -  if the deduplication period is not set then we don't set deduplicateUntil
- Emit completions with status `ALREADY_EXISTS` for duplicate commands

- [Java Bindings] DamlLedgerClient.Builder allows to set a timeout for command using `withTimeout`.

- [Daml Assistant] The `daml ledger` commands now accepts `--tls` in
  combination with `--json-api` to access a JSON API behind a TLS
  reverse proxy.

```

CHANGELOG_BEGIN
CHANGELOG_END

Co-authored-by: Azure Pipelines DAML Build <support@digitalasset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants