-
Notifications
You must be signed in to change notification settings - Fork 36.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: add stress tests for initialization #23289
Conversation
d4d2bed
to
7e70350
Compare
7e70350
to
1888c8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strong concept ACK. I tried doing this manually for a few releases in the past, but a functional test simplifies that a lot.
Strong Concept ACK Functional testing, unit testing, fuzz testing and now stress testing: great to see! Fault injection testing and mutation testing next in line? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested ACK 1888c8d on Ubuntu 20.04
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK 1888c8d
dd1f1b5
to
71ba24a
Compare
Code Review ACK 71ba24a This is great, and is very helpful for testing initialization! Thank you James! Naive ideaSince GDB has a Python API, perhaps someone can use that for more granular stress testing (shouldn't be in this PR since it seems like quite a large undertaking) |
Concept ACK, very cool idea. I also encountered the intermittent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
1768b27
to
c870ee0
Compare
On the one hand, holding off on merging test coverage because there may be a bug in master seems wrong. On the other hand, spurious test failures in CI for a hard-to-track-down bug will be annoying. For context and neither here nor there: this bug seems pretty minor - under very particular circumstances (SIGTERM during just the right init point) it may cause a user to have to reindex. But that said, we should certainly have a better understanding of why it's happening. |
If txindex is causing issues here, maybe submit it in a follow-up pull request? |
c870ee0
to
ff2fede
Compare
To allow easier access to the node's datadir and debug logs.
ff2fede
to
b32f5a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK. Nice idea!
b32f5a5
to
e1220b7
Compare
I see different things locally: For me, the failure happens in roughly 1-2% of the runs in the |
e1220b7
to
9c055d2
Compare
Concept ACK |
9c055d2
to
d5f0dd1
Compare
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
d5f0dd1
to
d9803f7
Compare
Okay, I've addressed a bunch of feedback and disabled txindex for the time being. I think this is a good first pass on this test provided all the non-deterministic failures have been removed. @mzumsande/@dongcarl if you'd like to PR the indexing change and fix you're welcome to, otherwise I can take care of it shortly. |
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
I opened #23365 to fix the intermittent issue. |
Concept ACK |
Anything left here? I think as things stand this is a stable test that adds some good coverage. AFAIK there's no outstanding feedback aside from a few nits and marginal suggestions. |
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
Code review ACK d9803f7 |
d9803f7 test: add stress tests for initialization (James O'Beirne) 23f8561 test: add node.chain_path and node.debug_log_path (James O'Beirne) Pull request description: In the course of coming up with a test plan for bitcoin#23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework... so here's a new test. This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by - interrupting init with SIGTERM after certain log statements, - interrupting init at random points, and - starting init with some essential data missing (block files, block indices, etc.) to test init error paths. As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors). --- Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after `loadblk` thread start: ``` 2021-10-15T21:14:51.295000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start' 36 │ 2021-10-15T21:14:51.296000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up 37 │ 2021-10-15T21:14:51.493000Z TestFramework (INFO): terminating node after 110 log lines seen 38 │ 2021-10-15T21:14:51.625000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start' 39 │ 2021-10-15T21:14:51.625000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up ------> [[ FAILURE HERE ]] 2021-10-15T21:15:21.626000Z TestFramework (WARNING): missed line {bail_line}; bailing now after {num_lines} lines ``` and then fails to start up afterwards. Combined logs showing `Error: txindex best block of the index goes beyond pruned data`, when the node under test is not pruned: ``` node0 2021-10-15T21:16:51.848439Z [shutoff] [validationinterface.cpp:244] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=1014bc4ff4917602ae53d10e9dfe230af4b7d52a6cdaa8a47798b9c288180907 node0 2021-10-15T21:16:51.848954Z [shutoff] [init.cpp:302] [Shutdown] Shutdown: done test 2021-10-15T21:16:51.882000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File "/home/james/src/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main self.run_test() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 87, in run_test check_clean_start() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 60, in check_clean_start node.wait_for_rpc_connection() File "/home/james/src/bitcoin/test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection raise FailedToStartError(self._node_msg( test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization test 2021-10-15T21:16:51.882000Z TestFramework (DEBUG): Closing down network thread test 2021-10-15T21:16:51.933000Z TestFramework (INFO): Stopping nodes test 2021-10-15T21:16:51.933000Z TestFramework.node0 (DEBUG): Stopping node node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) ``` ACKs for top commit: laanwj: Code review ACK d9803f7 Tree-SHA512: 4d80dc399daf199a6222e81e47d12d830dc7af07355eddbb7f52479a676a645b8d3d45093ff54a9295f01a163b2f4fe0e038e83fc269969e03d4cfda69eaf111
I'm surprised this was merged, given that there were no ACKs on the latest branch, and there were still unresolved approach/implementation issues and comments, eg: |
This is also the cause of the recent sporadic failures in our CI. See #23646. |
9600ea0 test: Add edge case of pruning up to index height (Martin Zumsande) 698c524 index: Fix backwards search for bestblock (Martin Zumsande) Pull request description: This PR attempts to fix an intermittent Init issue encountered during the stress testing of #23289, which relates to the pruning-compatible filter reconstruction logic introduced in #15946. The problem would occur when the node starts with `-txindex=1` but `ThreadSync` is interrupted after it sets `m_best_block_index` to Genesis, and before it gets do any further work. In that case, during the next restart of the node, an Init error would be thrown because `BaseIndex::Init()` tries to backtrack from the tip to the last block which has been successfully indexed (here: Genesis), but the backtracking logic didn't work properly in this case: The loop `while (block_to_test && block->pprev && (block->pprev->nStatus & BLOCK_HAVE_DATA))` checks if a predecessor exists **before** performing the check `block_to_test == block` and then possbily setting `prune_violation = false` If `block_to_test` and `block` are the Genesis block this check will not be reached because `block->pprev` does not exist. To reproduce this bug on regtest: 1) start a node with a fresh datadir using `-txindex=1` (or any other index) 2) stop and restart without any index 3) mine a block 3) stop and restart again with the index enabled ->InitError `Error: txindex best block of the index goes beyond pruned data. (...)` Fix this by requiring that we have the data for the block of the current iteration `block` (instead of requiring it for the predecessor `block->pprev`) That way, the check for `block_to_test == block` is also reached when `block_to_test` is the Genesis block. No longer requiring the data of `block->pprev` also means that we can now prune up to `m_best_block_index` height without requiring a reindex (one block more than before). I added this edge case to `feature_blockfilterindex_prune.py`, the new version should fail on master. ACKs for top commit: ryanofsky: Partial code review ACK 9600ea0 for the code change, not the test changes. (Test changes are indirect and little over my head.) It seems obvious that previous code `prune_violation = true, while (block->pprev)` would incorrectly detect a prune violation at the genesis block, and the fix here make sense and looks correct. Tree-SHA512: c717f372cee8fd49718b1b18bfe237aa6ba3ff4468588c10e1272d7a2ef3981d10af4e57de51dec295e2ca72d441bc6c2812f7990011a94d7f818775e3ff1a38
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
9600ea0 test: Add edge case of pruning up to index height (Martin Zumsande) 698c524 index: Fix backwards search for bestblock (Martin Zumsande) Pull request description: This PR attempts to fix an intermittent Init issue encountered during the stress testing of bitcoin#23289, which relates to the pruning-compatible filter reconstruction logic introduced in bitcoin#15946. The problem would occur when the node starts with `-txindex=1` but `ThreadSync` is interrupted after it sets `m_best_block_index` to Genesis, and before it gets do any further work. In that case, during the next restart of the node, an Init error would be thrown because `BaseIndex::Init()` tries to backtrack from the tip to the last block which has been successfully indexed (here: Genesis), but the backtracking logic didn't work properly in this case: The loop `while (block_to_test && block->pprev && (block->pprev->nStatus & BLOCK_HAVE_DATA))` checks if a predecessor exists **before** performing the check `block_to_test == block` and then possbily setting `prune_violation = false` If `block_to_test` and `block` are the Genesis block this check will not be reached because `block->pprev` does not exist. To reproduce this bug on regtest: 1) start a node with a fresh datadir using `-txindex=1` (or any other index) 2) stop and restart without any index 3) mine a block 3) stop and restart again with the index enabled ->InitError `Error: txindex best block of the index goes beyond pruned data. (...)` Fix this by requiring that we have the data for the block of the current iteration `block` (instead of requiring it for the predecessor `block->pprev`) That way, the check for `block_to_test == block` is also reached when `block_to_test` is the Genesis block. No longer requiring the data of `block->pprev` also means that we can now prune up to `m_best_block_index` height without requiring a reindex (one block more than before). I added this edge case to `feature_blockfilterindex_prune.py`, the new version should fail on master. ACKs for top commit: ryanofsky: Partial code review ACK 9600ea0 for the code change, not the test changes. (Test changes are indirect and little over my head.) It seems obvious that previous code `prune_violation = true, while (block->pprev)` would incorrectly detect a prune violation at the genesis block, and the fix here make sense and looks correct. Tree-SHA512: c717f372cee8fd49718b1b18bfe237aa6ba3ff4468588c10e1272d7a2ef3981d10af4e57de51dec295e2ca72d441bc6c2812f7990011a94d7f818775e3ff1a38
a14d537 test: add stress tests for initialization (James O'Beirne) 3f80ebc test: add node.chain_path and node.debug_log_path (James O'Beirne) Pull request description: In the course of coming up with a test plan for #23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework... so here's a new test. This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by - interrupting init with SIGTERM after certain log statements, - interrupting init at random points, and - starting init with some essential data missing (block files, block indices, etc.) to test init error paths. As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors). --- Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after `loadblk` thread start: ``` 2021-10-15T21:14:51.295000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start' 36 │ 2021-10-15T21:14:51.296000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up 37 │ 2021-10-15T21:14:51.493000Z TestFramework (INFO): terminating node after 110 log lines seen 38 │ 2021-10-15T21:14:51.625000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start' 39 │ 2021-10-15T21:14:51.625000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up ------> [[ FAILURE HERE ]] 2021-10-15T21:15:21.626000Z TestFramework (WARNING): missed line {bail_line}; bailing now after {num_lines} lines ``` and then fails to start up afterwards. Combined logs showing `Error: txindex best block of the index goes beyond pruned data`, when the node under test is not pruned: ``` node0 2021-10-15T21:16:51.848439Z [shutoff] [validationinterface.cpp:244] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=1014bc4ff4917602ae53d10e9dfe230af4b7d52a6cdaa8a47798b9c288180907 node0 2021-10-15T21:16:51.848954Z [shutoff] [init.cpp:302] [Shutdown] Shutdown: done test 2021-10-15T21:16:51.882000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File "/home/james/src/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main self.run_test() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 87, in run_test check_clean_start() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 60, in check_clean_start node.wait_for_rpc_connection() File "/home/james/src/bitcoin/test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection raise FailedToStartError(self._node_msg( test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization test 2021-10-15T21:16:51.882000Z TestFramework (DEBUG): Closing down network thread test 2021-10-15T21:16:51.933000Z TestFramework (INFO): Stopping nodes test 2021-10-15T21:16:51.933000Z TestFramework.node0 (DEBUG): Stopping node node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) ``` ACKs for top commit: laanwj: Code review ACK a14d537 Tree-SHA512: 4d80dc399daf199a6222e81e47d12d830dc7af07355eddbb7f52479a676a645b8d3d45093ff54a9295f01a163b2f4fe0e038e83fc269969e03d4cfda69eaf111
8904f17 test: enable txindex in feature_init (James O'Beirne) 93db6d8 test: feature_init: retain debug.log and improve detection (James O'Beirne) 24fcf6e test: feature_init: tweak all .ldb files of a certain type (James O'Beirne) a8ffbc0 test: add TestNode.wait_for_debug_log (James O'Beirne) a2fb62b test: introduce TestNode.debug_log_bytes (James O'Beirne) Pull request description: Hopefully fixes #23646. This makes a few changes to feature_init (along the lines of the discussion here: bitcoin/bitcoin#23289 (comment)). The changes are detailed in the specific commits, but a summary might be: - more robust log-line detection (at the expense of maybe being slower) - retain all debug.log content - perturb .ldb files in a more complete way Top commit has no ACKs. Tree-SHA512: 9c8900998fdb96d3cc92bd30c4eeda2a47e9787e6432546d458c17f24eb9e2b6b3eb573db225172888ca71d7a62260f497f83827597b982b9e35c5452cb6e514
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
Summary: ``` This change bangs init around like the Fonz hitting a jukebox. ``` Backport of [[bitcoin/bitcoin#23289 | core#23289]]. Test Plan: ninja check-functional Reviewers: #bitcoin_abc, PiRK, sdulfari Reviewed By: #bitcoin_abc, PiRK, sdulfari Subscribers: PiRK Differential Revision: https://reviews.bitcoinabc.org/D12598
In the course of coming up with a test plan for #23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework... so here's a new test.
This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by
As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors).
Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after
loadblk
thread start:and then fails to start up afterwards. Combined logs showing
Error: txindex best block of the index goes beyond pruned data
, when the node under test is not pruned: