[indexer] table info service backup support #14373

Merged
merged 7 commits into main from larry-table-info-service
Sep 5, 2024
Conversation

larry-aptos
Contributor

@larry-aptos larry-aptos commented Aug 21, 2024

Description

In this PR, we introduce backup support for the index_indexer_async_v2_db database, i.e., the async table info DB.

Design

There are two major loops:

  1. Async indexing loop: fetch transactions and index table info; if backup is enabled, take a snapshot at the end of each epoch.
  2. Backup loop: if backup is enabled, upload the snapshot to GCS.

Example Config

indexer_table_info:
    parser_task_count: 10
    parser_batch_size: 100
    table_info_service_mode:
        Backup:
            example-gcs-bucket-name

Breaking Change

This change is a breaking change to the configuration. It has not been released yet, and the only impact is on our internal deployment; we need to update the configuration before the next deployment.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Tested locally; looks good.

GCS: (screenshots of the uploaded backups omitted)

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation


trunk-io bot commented Aug 21, 2024

⏱️ 2h 23m total CI duration on this PR
Job Cumulative Duration Recent Runs
general-lints 19m 🟩🟩🟩🟩🟩 (+6 more)
rust-cargo-deny 19m 🟩🟩🟩🟩🟩 (+6 more)
check-dynamic-deps 11m 🟩🟩🟩🟩🟩 (+6 more)
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-move-tests 8m 🟩
rust-move-tests 8m 🟩
rust-move-tests 8m 🟩
rust-move-tests 8m 🟩
rust-move-tests 6m
rust-move-tests 5m
semgrep/ci 5m 🟩🟩🟩🟩🟩 (+6 more)
rust-move-tests 3m
file_change_determinator 2m 🟩🟩🟩🟩🟩 (+6 more)
file_change_determinator 2m 🟩🟩🟩🟩🟩 (+6 more)
permission-check 32s 🟩🟩🟩🟩🟩 (+6 more)
permission-check 31s 🟩🟩🟩🟩🟩 (+6 more)
permission-check 29s 🟩🟩🟩🟩🟩 (+6 more)
permission-check 28s 🟩🟩🟩🟩🟩 (+6 more)


@larry-aptos larry-aptos force-pushed the larry-table-info-service branch from 8daf2f6 to d663816 Compare August 21, 2024 22:11
@larry-aptos larry-aptos changed the title Larry table info service [indexer] table info service backup support Aug 23, 2024
@larry-aptos larry-aptos marked this pull request as ready for review August 23, 2024 23:24
@larry-aptos larry-aptos requested review from jillxuu, grao1991 and a team and removed request for gregnazario and JoshLind August 23, 2024 23:24
@larry-aptos larry-aptos force-pushed the larry-table-info-service branch from ab39429 to 84f816c Compare August 24, 2024 00:04
@grao1991 grao1991 requested a review from msmouse August 24, 2024 00:10
@larry-aptos larry-aptos requested a review from grao1991 August 26, 2024 08:20
@larry-aptos
Contributor Author

larry-aptos commented Aug 28, 2024

I am working on:

  1. Splitting the snapshot and backup logic apart.
  2. Making backup-related configuration optional, for readability.

@larry-aptos larry-aptos requested a review from grao1991 August 28, 2024 22:53
@larry-aptos larry-aptos force-pushed the larry-table-info-service branch 2 times, most recently from f2aa2fc to 5823faa Compare August 28, 2024 23:02
@larry-aptos larry-aptos force-pushed the larry-table-info-service branch from 5823faa to ecf8017 Compare August 28, 2024 23:51
let backup_restore_operator = backup_restore_operator.clone();
let context = self.context.clone();
let _task = tokio::spawn(async move {
let backup_restore_operator = backup_restore_operator.clone();
Contributor

you don't need to clone so many times.

Contributor Author

done.

/// Backup service mode with GCS bucket name.
Backup(String),
/// Restore service mode with GCS bucket name.
Restore(String),
Contributor

how do you plan to use Restore mode?

Contributor Author

@larry-aptos larry-aptos Aug 30, 2024

That's a good question.

Restore mode is for bootstrapping a new node. It can help when a node lags or has issues and the rest of the network has already pruned the data.

IndexingOnly - no backup or restore; async DB only.

@larry-aptos larry-aptos requested a review from grao1991 August 30, 2024 04:07
let file_name = path.file_name().unwrap().to_string_lossy();
if path.is_dir()
&& file_name.starts_with(&target_snapshot_directory_prefix)
&& !file_name.ends_with(".tmp")
Contributor Author

Is it safe to assume that a snapshot is done if ".tmp" doesn't show up in the filename? cc @grao1991

Contributor

I don't know? @msmouse any idea?

Contributor

What? You mean rocksdb adds a ".tmp" suffix while taking a checkpoint? That's great design (moving/renaming a folder can be assumed atomic) :trollface:

If rocksdb doesn't do it, we should do it manually in your use case.

Contributor Author

Yeah, then I'll use the ".tmp" suffix for now to determine that the snapshot is done.

}
}

backup_restore_operator
Contributor

I'm not sure whether you need to do something here to handle the case where multiple jobs upload the backup at the same time. Can you double-check?

Contributor Author

Do you mean multiple backup services running? No; my thought is to check it on the deployment side. If we do want to check within the program (at runtime?), we can use either distributed locking or some coarser method, like checking metadata versioning. wdyt?

Contributor

I feel we want multiple jobs to be able to back things up, to ensure fair reliability. I don't have a preference on how they cooperate; doing it on either the client side or the server side is OK.
We can leave it as a TODO for a future PR.

Contributor

(I'd just add a random suffix to the file name so we never write to the same remote file, and we can observe the conflicting behavior better.)

Contributor Author

Thanks! I added a TODO as a follow-up to check whether concurrent jobs are running. One concurrent scenario for now is multiple full nodes running with backup enabled.

@larry-aptos larry-aptos enabled auto-merge (squash) September 5, 2024 18:59


Contributor

github-actions bot commented Sep 5, 2024

✅ Forge suite realistic_env_max_load success on 0f8bfabef29261ba93dcd265aa39426297c771ae

two traffics test: inner traffic : committed: 12250.94 txn/s, latency: 3242.28 ms, (p50: 3000 ms, p70: 3300, p90: 3600 ms, p99: 7500 ms), latency samples: 4658120
two traffics test : committed: 99.98 txn/s, latency: 2658.72 ms, (p50: 2500 ms, p70: 2600, p90: 3000 ms, p99: 10600 ms), latency samples: 1700
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.245, avg: 0.221", "QsPosToProposal: max: 0.623, avg: 0.471", "ConsensusProposalToOrdered: max: 0.329, avg: 0.320", "ConsensusOrderedToCommit: max: 0.533, avg: 0.490", "ConsensusProposalToCommit: max: 0.856, avg: 0.810"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.30s no progress at version 1622193 (avg 0.23s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 7.55s no progress at version 1622191 (avg 7.54s) [limit 15].
Test Ok

Contributor

github-actions bot commented Sep 5, 2024

✅ Forge suite framework_upgrade success on d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> 0f8bfabef29261ba93dcd265aa39426297c771ae

Compatibility test results for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> 0f8bfabef29261ba93dcd265aa39426297c771ae (PR)
Upgrade the nodes to version: 0f8bfabef29261ba93dcd265aa39426297c771ae
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1241.04 txn/s, submitted: 1243.59 txn/s, failed submission: 2.55 txn/s, expired: 2.55 txn/s, latency: 2790.75 ms, (p50: 2300 ms, p70: 2800, p90: 5200 ms, p99: 6800 ms), latency samples: 107080
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1071.79 txn/s, submitted: 1074.32 txn/s, failed submission: 2.53 txn/s, expired: 2.53 txn/s, latency: 2971.99 ms, (p50: 2400 ms, p70: 3000, p90: 5600 ms, p99: 7400 ms), latency samples: 93300
5. check swarm health
Compatibility test for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> 0f8bfabef29261ba93dcd265aa39426297c771ae passed
Upgrade the remaining nodes to version: 0f8bfabef29261ba93dcd265aa39426297c771ae
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1216.82 txn/s, submitted: 1220.18 txn/s, failed submission: 3.36 txn/s, expired: 3.36 txn/s, latency: 2657.92 ms, (p50: 2400 ms, p70: 2700, p90: 4200 ms, p99: 7200 ms), latency samples: 101380
Test Ok

Contributor

github-actions bot commented Sep 5, 2024

✅ Forge suite compat success on d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> 0f8bfabef29261ba93dcd265aa39426297c771ae

Compatibility test results for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> 0f8bfabef29261ba93dcd265aa39426297c771ae (PR)
1. Check liveness of validators at old version: d1bf834728a0cf166d993f4728dfca54f3086fb0
compatibility::simple-validator-upgrade::liveness-check : committed: 9892.99 txn/s, latency: 3232.83 ms, (p50: 2300 ms, p70: 2500, p90: 3300 ms, p99: 29300 ms), latency samples: 399720
2. Upgrading first Validator to new version: 0f8bfabef29261ba93dcd265aa39426297c771ae
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6850.21 txn/s, latency: 4137.06 ms, (p50: 4700 ms, p70: 5000, p90: 5100 ms, p99: 5300 ms), latency samples: 127440
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6581.09 txn/s, latency: 4664.68 ms, (p50: 4700 ms, p70: 4900, p90: 5200 ms, p99: 7200 ms), latency samples: 246800
3. Upgrading rest of first batch to new version: 0f8bfabef29261ba93dcd265aa39426297c771ae
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7164.07 txn/s, latency: 3960.53 ms, (p50: 4400 ms, p70: 4600, p90: 4700 ms, p99: 4900 ms), latency samples: 138820
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6216.83 txn/s, latency: 4778.85 ms, (p50: 4800 ms, p70: 5100, p90: 5400 ms, p99: 6900 ms), latency samples: 236720
4. upgrading second batch to new version: 0f8bfabef29261ba93dcd265aa39426297c771ae
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 9123.52 txn/s, latency: 2972.32 ms, (p50: 2900 ms, p70: 3200, p90: 4600 ms, p99: 5800 ms), latency samples: 166120
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9522.18 txn/s, latency: 3359.11 ms, (p50: 2800 ms, p70: 3100, p90: 7500 ms, p99: 9100 ms), latency samples: 317480
5. check swarm health
Compatibility test for d1bf834728a0cf166d993f4728dfca54f3086fb0 ==> 0f8bfabef29261ba93dcd265aa39426297c771ae passed
Test Ok

@larry-aptos larry-aptos merged commit 78c815b into main Sep 5, 2024
46 checks passed
@larry-aptos larry-aptos deleted the larry-table-info-service branch September 5, 2024 21:50