Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index: Verify the block filter hash when reading the filter from disk. #24832

Merged

Conversation

kcalvinalvin
Copy link
Contributor

This PR picks up the abandoned #19280

BlockFilterIndex was depending on GolombRiceDecode() during the filter decode to sanity check that the filter wasn't corrupt. However, we can check for corruption by ensuring that the encoded blockfilter's hash matches up with the one stored in the index database.

Benchmarks that were added in #19280 showed that checking the hash is much faster.

The benchmarks were changed to nanobench and the relevant benchmarks were like below, showing a clear win for the hash check method.

|             ns/elem |              elem/s |    err% |        ins/elem |       bra/elem |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
|              531.40 |        1,881,819.43 |    0.3% |        3,527.01 |         411.00 |    0.2% |      0.01 | `DecodeCheckedGCSFilter`
|          258,220.50 |            3,872.66 |    0.1% |    2,990,092.00 |     586,706.00 |    1.7% |      0.01 | `DecodeGCSFilter`
|           13,036.77 |           76,706.09 |    0.3% |       64,238.24 |         513.04 |    0.2% |      0.01 | `BlockFilterGetHash`

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

A few suggestions regarding filter_checked

diff --git a/src/blockfilter.cpp b/src/blockfilter.cpp
index 59f34bc54e..0ab89fdbdb 100644
--- a/src/blockfilter.cpp
+++ b/src/blockfilter.cpp
@@ -47,7 +47,7 @@ GCSFilter::GCSFilter(const Params& params)
     : m_params(params), m_N(0), m_F(0), m_encoded{0}
 {}
 
-GCSFilter::GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, const bool filter_checked)
+GCSFilter::GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked)
     : m_params(params), m_encoded(std::move(encoded_filter))
 {
     SpanReader stream{GCS_SER_TYPE, GCS_SER_VERSION, m_encoded};
@@ -221,7 +221,7 @@ static GCSFilter::ElementSet BasicFilterElements(const CBlock& block,
 }
 
 BlockFilter::BlockFilter(BlockFilterType filter_type, const uint256& block_hash,
-                         std::vector<unsigned char> filter, const bool filter_checked)
+                         std::vector<unsigned char> filter, bool filter_checked)
     : m_filter_type(filter_type), m_block_hash(block_hash)
 {
     GCSFilter::Params params;
diff --git a/src/blockfilter.h b/src/blockfilter.h
index af054dbf34..5de5679834 100644
--- a/src/blockfilter.h
+++ b/src/blockfilter.h
@@ -59,7 +59,7 @@ public:
     explicit GCSFilter(const Params& params = Params());
 
     /** Reconstructs an already-created filter from an encoding. */
-    GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, const bool filter_checked=false);
+    GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked = false);
 
     /** Builds a new filter from the params and set of elements. */
     GCSFilter(const Params& params, const ElementSet& elements);
@@ -122,7 +122,7 @@ public:
 
     //! Reconstruct a BlockFilter from parts.
     BlockFilter(BlockFilterType filter_type, const uint256& block_hash,
-                std::vector<unsigned char> filter, const bool filter_checked=false);
+                std::vector<unsigned char> filter, bool filter_checked = false);
 
     //! Construct a new BlockFilter of the specified type from a block.
     BlockFilter(BlockFilterType filter_type, const CBlock& block, const CBlockUndo& block_undo);
diff --git a/src/index/blockfilterindex.cpp b/src/index/blockfilterindex.cpp
index bad766ebe1..0a4728f571 100644
--- a/src/index/blockfilterindex.cpp
+++ b/src/index/blockfilterindex.cpp
@@ -159,7 +159,7 @@ bool BlockFilterIndex::ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& f
         uint256 result;
         CHash256().Write(encoded_filter).Finalize(result);
         if (result != hash) return error("Checksum mismatch in filter decode.");
-        filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter), true);
+        filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter), /*filter_checked=*/true);
     }

// Verify that the encoded filter contains exactly N elements. If it has too much or too little
// data, a std::ios_base::failure exception will be raised.
BitStreamReader<SpanReader> bitreader{stream};
BitStreamReader<SpanReader> bitreader(stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch from 294a845 to 2e878de Compare April 12, 2022 12:24
@kcalvinalvin
Copy link
Contributor Author

A few suggestions regarding filter_checked

I've applied all the suggested changes :)

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach ACK, a few iterative further comments on second look.

@@ -31,7 +31,7 @@ class BlockFilterIndex final : public BaseIndex
FlatFilePos m_next_filter_pos;
std::unique_ptr<FlatFileSeq> m_filter_fileseq;

bool ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter) const;
bool ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter, const uint256& hash) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to order the ReadFilterFromDisk() arguments with in-params first, then the out-param. There are only four lines to change of the ones touched in this pull.

Suggested change
bool ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter, const uint256& hash) const;
bool ReadFilterFromDisk(const FlatFilePos& pos, const uint256& hash, BlockFilter& filter) const;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

});
}

static void BlockFilterGetHash(benchmark::Bench& bench)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be handy to name all of these benchmarks with a common prefix to group them together in the output and that is easy to filter for, i.e. ./src/bench/bench_bitcoin -filter=EvictionProtection*.*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the GCSFIlter benchmarks are now changed to have the prefix GCSFilter.

GCSFilter::Element element(32);
element[0] = static_cast<unsigned char>(i);
element[1] = static_cast<unsigned char>(i >> 8);
elements.insert(std::move(element));
Copy link
Member

@jonatack jonatack Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: it looks like there is a fair amount of duplicate boilerplate in each of these benchmarks that you could extract out to a common setup helper in a separate commit, if you like. See EvictionProtectionCommon() in src/bench/peer_eviction.cpp for an example.

Copy link
Contributor Author

@kcalvinalvin kcalvinalvin Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most of the duplicate code is just generating the elements, how's separating that out to a function like so:

static const GCSFilter::ElementSet GenerateGCSTestElements()
{
    GCSFilter::ElementSet elements;
    for (int i = 0; i < 10000; ++i) {
        GCSFilter::Element element(32);
        element[0] = static_cast<unsigned char>(i);
        element[1] = static_cast<unsigned char>(i >> 8);
        elements.insert(std::move(element));
    }

    return elements;
}

And having each of the benchmarks call that function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied the above change I suggested above.

@@ -59,7 +59,7 @@ class GCSFilter
explicit GCSFilter(const Params& params = Params());

/** Reconstructs an already-created filter from an encoding. */
GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter);
GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked=false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang format for each of these defaults

Suggested change
GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked=false);
GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked = false);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed as the default value was removed

auto encoded = filter.GetEncoded();

bench.unit("elem").run([&] {
GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, true);
Copy link
Member

@jonatack jonatack Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, true);
GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, /*filter_checked=*/true);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Copy link
Contributor

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

As far as I can see, the sanity check that this PR makes optional is now only invoked from BlockFilter::Unserialize(), which is used only in test code because as a full node, bitcoin core doesn't have the use case of accepting unchecked serialized blockfilters over p2p, only send self-generated ones to peers.
So I wonder whether it might make sense to just remove the old sanity check instead of making it optional?

@jonatack
Copy link
Member

jonatack commented Apr 13, 2022

@mzumsande good point, and I think it also highlights that it may alternatively be better to drop the default false value for filter_checked here as follows to make that more explicit. Then (here or in a follow-up), if BlockFilter::Unserialize() is removed or moved to the test code, the sanity check here could be removed.

drop default false filter_checked value

diff --git a/src/bench/gcs_filter.cpp b/src/bench/gcs_filter.cpp
index 89fa07f602..56f3cff0c1 100644
--- a/src/bench/gcs_filter.cpp
+++ b/src/bench/gcs_filter.cpp
@@ -52,7 +52,7 @@ static void DecodeGCSFilter(benchmark::Bench& bench)
     auto encoded = filter.GetEncoded();
 
     bench.unit("elem").run([&] {
-        GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded);
+        GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, /*filter_checked=*/false);
     });
 }
 
@@ -66,7 +66,7 @@ static void BlockFilterGetHash(benchmark::Bench& bench)
         elements.insert(std::move(element));
     }
     GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, elements);
-    BlockFilter block_filter(BlockFilterType::BASIC, {}, filter.GetEncoded());
+    BlockFilter block_filter(BlockFilterType::BASIC, {}, filter.GetEncoded(), /*filter_checked=*/false);
 
     bench.unit("elem").run([&] {
         block_filter.GetHash();
@@ -86,7 +86,7 @@ static void DecodeCheckedGCSFilter(benchmark::Bench& bench)
     auto encoded = filter.GetEncoded();
 
     bench.unit("elem").run([&] {
-        GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, true);
+        GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, /*filter_checked=*/true);
     });
 }
 BENCHMARK(BlockFilterGetHash);
diff --git a/src/blockfilter.h b/src/blockfilter.h
index 21d6a295a2..f823354989 100644
--- a/src/blockfilter.h
+++ b/src/blockfilter.h
@@ -59,7 +59,7 @@ public:
     explicit GCSFilter(const Params& params = Params());
 
     /** Reconstructs an already-created filter from an encoding. */
-    GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked=false);
+    GCSFilter(const Params& params, std::vector<unsigned char> encoded_filter, bool filter_checked);
 
     /** Builds a new filter from the params and set of elements. */
     GCSFilter(const Params& params, const ElementSet& elements);
@@ -122,7 +122,7 @@ public:
 
     //! Reconstruct a BlockFilter from parts.
     BlockFilter(BlockFilterType filter_type, const uint256& block_hash,
-                std::vector<unsigned char> filter, bool filter_checked=false);
+                std::vector<unsigned char> filter, bool filter_checked);
 
     //! Construct a new BlockFilter of the specified type from a block.
     BlockFilter(BlockFilterType filter_type, const CBlock& block, const CBlockUndo& block_undo);
@@ -164,7 +164,7 @@ public:
         if (!BuildParams(params)) {
             throw std::ios_base::failure("unknown filter_type");
         }
-        m_filter = GCSFilter(params, std::move(encoded_filter));
+        m_filter = GCSFilter(params, std::move(encoded_filter), /*filter_checked=*/false);
     }
 };

@kcalvinalvin
Copy link
Contributor Author

Sure I can just get rid of the option. I'll move BlockFilter::Unserialize() to the test files in a follow-up.

Quick side-question: my initial thought was that it'd be helpful to leave an option in there in case anyone is using the code as a library. In general, does Bitcoin Core development think about potential use cases of the code being used as a library somewhere else?

@jonatack
Copy link
Member

@kcalvinalvin maybe decided case-by-case; I don't recall a recent specific example of leave-it-in-for-reuse-as-a-library being invoked as a rationale for not moving code used only for tests out to the test code.

Verified that removing BlockFilter::Unserialize() compiles with the following test-only deletions:

code that depends on BlockFilter::Unserialize()

diff --git a/src/blockfilter.h b/src/blockfilter.h
index 21d6a295a2..764395c370 100644
--- a/src/blockfilter.h
+++ b/src/blockfilter.h
@@ -148,24 +148,6 @@ public:
           << m_block_hash
           << m_filter.GetEncoded();
     }
-
-    template <typename Stream>
-    void Unserialize(Stream& s) {
-        std::vector<unsigned char> encoded_filter;
-        uint8_t filter_type;
-
-        s >> filter_type
-          >> m_block_hash
-          >> encoded_filter;
-
-        m_filter_type = static_cast<BlockFilterType>(filter_type);
-
-        GCSFilter::Params params;
-        if (!BuildParams(params)) {
-            throw std::ios_base::failure("unknown filter_type");
-        }
-        m_filter = GCSFilter(params, std::move(encoded_filter));
-    }
 };
 
 #endif // BITCOIN_BLOCKFILTER_H
diff --git a/src/test/blockfilter_tests.cpp b/src/test/blockfilter_tests.cpp
index 8eb4dbc592..b99d9a32c3 100644
--- a/src/test/blockfilter_tests.cpp
+++ b/src/test/blockfilter_tests.cpp
@@ -106,23 +106,6 @@ BOOST_AUTO_TEST_CASE(blockfilter_basic_test)
     for (const CScript& script : excluded_scripts) {
         BOOST_CHECK(!filter.Match(GCSFilter::Element(script.begin(), script.end())));
     }
-
-    // Test serialization/unserialization.
-    BlockFilter block_filter2;
-
-    CDataStream stream(SER_NETWORK, PROTOCOL_VERSION);
-    stream << block_filter;
-    stream >> block_filter2;
-
-    BOOST_CHECK_EQUAL(block_filter.GetFilterType(), block_filter2.GetFilterType());
-    BOOST_CHECK_EQUAL(block_filter.GetBlockHash(), block_filter2.GetBlockHash());
-    BOOST_CHECK(block_filter.GetEncodedFilter() == block_filter2.GetEncodedFilter());
-
-    BlockFilter default_ctor_block_filter_1;
-    BlockFilter default_ctor_block_filter_2;
-    BOOST_CHECK_EQUAL(default_ctor_block_filter_1.GetFilterType(), default_ctor_block_filter_2.GetFilterType());
-    BOOST_CHECK_EQUAL(default_ctor_block_filter_1.GetBlockHash(), default_ctor_block_filter_2.GetBlockHash());
-    BOOST_CHECK(default_ctor_block_filter_1.GetEncodedFilter() == default_ctor_block_filter_2.GetEncodedFilter());
 }
 
 BOOST_AUTO_TEST_CASE(blockfilters_json_test)
diff --git a/src/test/fuzz/deserialize.cpp b/src/test/fuzz/deserialize.cpp
index ed6f172a2a..3793921e86 100644
--- a/src/test/fuzz/deserialize.cpp
+++ b/src/test/fuzz/deserialize.cpp
@@ -95,12 +95,6 @@ void DeserializeFromFuzzingInput(FuzzBufferType buffer, T& obj, const std::optio
             throw invalid_fuzzing_input_exception();
         }
     }
-    try {
-        ds >> obj;
-    } catch (const std::ios_base::failure&) {
-        throw invalid_fuzzing_input_exception();
-    }
-    assert(buffer.empty() || !Serialize(obj).empty());
 }
 
 template <typename T>
diff --git a/src/test/fuzz/util.h b/src/test/fuzz/util.h
index 6c91844633..e8e8e57394 100644
--- a/src/test/fuzz/util.h
+++ b/src/test/fuzz/util.h
@@ -145,11 +145,6 @@ template <typename T>
     const std::vector<uint8_t> buffer = ConsumeRandomLengthByteVector(fuzzed_data_provider, max_length);
     CDataStream ds{buffer, SER_NETWORK, INIT_PROTO_VERSION};
     T obj;
-    try {
-        ds >> obj;
-    } catch (const std::ios_base::failure&) {
-        return std::nullopt;
-    }
     return obj;
 }

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch 3 times, most recently from fe72ea6 to 3e6d98e Compare April 14, 2022 08:28
@kcalvinalvin
Copy link
Contributor Author

Addressed all of the changes requested except for removing BlockFilter::Unserialize() and getting rid of the option to sanity check. I'll do those in a follow up.

@jonatack
Copy link
Member

jonatack commented Apr 14, 2022

ACK 3e6d98e

modulo:

  • ordering the gcs_filter benchmark methods in the same order as the benchmarks in that file
  • maybe naming the BlockFilterGetHash bench as GCSBlockFilterGetHash so that ./src/bench/bench_bitcoin -filter=GCS*.* can invoke it with the others in that file

Here is a patch on top of your last push, if helpful, that removes BlockFilter::Unserialize: https://github.com/jonatack/bitcoin/commits/remove-BlockFilter-Unserialize. Though it may be more prudent to move it to the test code and leave the tests in place.

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch from 3e6d98e to 0b976a6 Compare April 19, 2022 03:21
@jonatack
Copy link
Member

jonatack commented Apr 19, 2022

ACK 0b976a6

Sanity check:

Running ./src/bench/bench_bitcoin -filter=GCS*.* on this branch (debug build and not tuned for benchmarking)

|             ns/elem |              elem/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          179,394.67 |            5,574.30 |    0.5% |      0.01 | `GCSBlockFilterGetHash`
|            3,376.66 |          296,150.75 |    2.1% |      0.38 | `GCSFilterConstruct`
|        5,163,985.00 |              193.65 |    2.5% |      0.06 | `GCSFilterDecode`
|            1,926.40 |          519,102.50 |    1.9% |      0.01 | `GCSFilterDecodeChecked`
|          577,446.00 |            1,731.76 |    1.0% |      0.01 | `GCSFilterMatch`
and with the operative codebase changes reverted (diff)

diff --git a/src/blockfilter.cpp b/src/blockfilter.cpp
index 08e2e6f72e..9d76d2b60e 100644
--- a/src/blockfilter.cpp
+++ b/src/blockfilter.cpp
@@ -59,7 +59,7 @@ GCSFilter::GCSFilter(const Params& params, std::vector<unsigned char> encoded_fi
     }
     m_F = static_cast<uint64_t>(m_N) * static_cast<uint64_t>(m_params.m_M);
 
-    if (filter_checked) return;
+    // if (filter_checked) return;
 
     // Verify that the encoded filter contains exactly N elements. If it has too much or too little
     // data, a std::ios_base::failure exception will be raised.
diff --git a/src/index/blockfilterindex.cpp b/src/index/blockfilterindex.cpp
index 1471c29a85..f83b883c8f 100644
--- a/src/index/blockfilterindex.cpp
+++ b/src/index/blockfilterindex.cpp
@@ -156,9 +156,6 @@ bool BlockFilterIndex::ReadFilterFromDisk(const FlatFilePos& pos, const uint256&
     std::vector<uint8_t> encoded_filter;
     try {
         filein >> block_hash >> encoded_filter;
-        uint256 result;
-        CHash256().Write(encoded_filter).Finalize(result);
-        if (result != hash) return error("Checksum mismatch in filter decode.");
         filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter), /*filter_checked=*/true);

...GCSFilterDecodeChecked is far slower and equivalent to GCSFilterDecode as expected

|             ns/elem |              elem/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          165,774.17 |            6,032.30 |    0.5% |      0.01 | `GCSBlockFilterGetHash`
|            3,775.71 |          264,850.84 |    0.5% |      0.42 | `GCSFilterConstruct`
|        4,960,766.00 |              201.58 |    0.9% |      0.05 | `GCSFilterDecode`
|        4,968,437.00 |              201.27 |    1.4% |      0.06 | `GCSFilterDecodeChecked`
|          584,923.00 |            1,709.63 |    3.7% |      0.01 | `GCSFilterMatch`

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch from 58382e2 to d492499 Compare April 20, 2022 06:46
@kcalvinalvin
Copy link
Contributor Author

kcalvinalvin commented Apr 20, 2022

ACK 0b976a6

Thank you for the speedy review!

I pushed 04008e2 on top of 0b976a6 which does mostly what you did in https://github.com/jonatack/bitcoin/commits/remove-BlockFilter-Unserialize but it keeps the serialize/unserialize test in src/test/blockfilter_tests.cpp with the function

template <typename Stream>
static BlockFilter UnserializeBlockFilter(Stream& s) {
    std::vector<unsigned char> encoded_filter;
    uint8_t filter_type_uint8;
    uint256 block_hash;

    s >> filter_type_uint8
      >> block_hash
      >> encoded_filter;

    BlockFilterType filter_type = static_cast<BlockFilterType>(filter_type_uint8);
    BlockFilter block_filter(filter_type, block_hash, std::move(encoded_filter), /*filter_checked=*/false);
    return block_filter;
}

I think this is a good tradeoff between keeping the serialization in the unit tests but still keeping the maintenance overhead low.

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch from d492499 to 04008e2 Compare April 20, 2022 07:23
@kcalvinalvin
Copy link
Contributor Author

Looks like you are removing the fuzz test as well?

@MarcoFalke Ah my first instinct was to get rid of the entire test as ConsumeDeserializable would no longer be usable. I'll push a change with the fuzz tests kept in.

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch 2 times, most recently from 88cdc2c to 447174d Compare April 21, 2022 09:00
@jonatack
Copy link
Member

jonatack commented Apr 21, 2022

@kcalvinalvin if you move BlockFilter::Unserialize() to the test code in this pull, it will be a smaller change as the first commit rather than last one, WDYT? Edit: maybe a first commit that adds the check in ReadFilterFromDisk() and rename the commit, then the one that moves Unserialize() to the test code.

@kcalvinalvin
Copy link
Contributor Author

@kcalvinalvin if you move BlockFilter::Unserialize() to the test code in this pull, it will be a smaller change as the first commit rather than last one, WDYT? Edit: maybe a first commit that adds the check in ReadFilterFromDisk() and rename the commit, then the one that moves Unserialize() to the test code.

I'm ok with that if it'd lower the cost of reviewing. However, I'm thinking if the cost of reviewing would be even lower if I have commit 447174d be in a separate PR. I'm sorta thinking maybe getting rid of Unserialize() and replacing GolombRiceDecode() check are two separate things. Let me know what you'd think would lower the cost of reviewing :)

@jonatack
Copy link
Member

@kcalvinalvin SGTM.

@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch from 447174d to 0b976a6 Compare April 22, 2022 16:03
@kcalvinalvin
Copy link
Contributor Author

Reverted remove the last commit. Latest commit is now back to 0b976a6. Will make a follow up PR after this one.

pstratem and others added 2 commits May 22, 2022 14:00
This benchmark allows us to compare the differences between doing the
sanity check for corruption via GolombRiceDecode() vs checking the hash
of the encoded block filter.
Element count used in the GCSFilter benchmarks are increased to 100,000
from 10,000. Testing the benchmarks with different element counts showed
that a filter with 100,000 elements resulted in the same ns/op. This
this a desirable thing to have as it allows us to reason about how long
a single filter element takes to process, letting us easily calculate
how long a filter with N elements (where N > 100,000) would take to
process.

GCSFilterConstruct benchmark is now called without batch. This makes
intra-bench results more intuitive as all benchmarks are in ns/op
instead of a custom unit. There are no downsides to this change as
testing showed that there is no observable difference in error rates
in the benchmarks when calling without batch.
@kcalvinalvin kcalvinalvin force-pushed the 2020-06-14-blockfilterindex-checksums branch from b32e965 to e734228 Compare May 22, 2022 05:28
@kcalvinalvin
Copy link
Contributor Author

kcalvinalvin commented May 22, 2022

4 things changed in the latest push:

  1. Renamed the GCSFilterGetHash benchmark back to GCSBlockFilterGetHash.
  2. Renamed GCSFilterDecodeChecked to GCSFilterDecodeSkipCheck as filter_checked argument was renamed to skip_decode_check. Changing the name of the benchmark as well seemed fitting.
  3. Added a commit where GenerateGCSTestElements generates 100,000 elements instead of 10,000. The reasoning was based off of my findings in testing the benchmarks. I've included an explanation in the code as a comment and in the commit message.
  4. In the same added commit, I've changed the GCSFilterConstruct to be called without batch and the custom unit. This was done as it makes comparing intra-bench results more intuitive. Calling with batch didn't result in lower error rates either so there were no tradeoffs made.

Copy link
Contributor

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review ACK e734228

I'm not sure if it really matters a lot how big we choose the number of elements used in the test, I just think that if we pick custom measure other than "ops", the benchmark should be linear in that - so I like that everything uses ns/op now.
In any case, the important feature of this PR is the improvement of the filter verification.

@theStack
Copy link
Contributor

theStack commented Jul 4, 2022

Concept ACK

Copy link
Contributor

@theStack theStack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-review ACK e734228

FWIW, this are the benchmark results on my machine:

$ ./src/bench/bench_bitcoin -filter=GCS.*
ns/op op/s err% total benchmark
719,504.00 1,389.85 2.4% 0.01 GCSBlockFilterGetHash
28,431,064.00 35.17 4.0% 0.31 GCSFilterConstruct
3,641,177.00 274.64 3.4% 0.04 GCSFilterDecode
13,994.04 71,458.99 1.5% 0.01 GCSFilterDecodeSkipCheck
424,500.50 2,355.71 2.2% 0.01 GCSFilterMatch

@fanquake fanquake requested a review from ryanofsky July 6, 2022 12:09
Copy link
Contributor

@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK e734228

Non-controversial and significant performance improvement when loading block filters from disk, which can be a frequent process depending on peers or RPC users. I've left some style/readability suggestions but none of them are blocking for me.

If BlockFilter::Unserialize() is removed in a follow up, further code cleanup is possible by removing the skip_decode_check parameter and the check branch.

$ ./src/bench/bench_bitcoin -filter=GCS.*
yields

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          766,333.00 |            1,304.92 |    0.8% |      0.01 | `GCSBlockFilterGetHash`
|       60,891,167.00 |               16.42 |    0.3% |      0.67 | `GCSFilterConstruct`
|       15,802,125.00 |               63.28 |    0.3% |      0.17 | `GCSFilterDecode`
|        1,697,875.00 |              588.97 |    0.2% |      0.02 | `GCSFilterDecodeSkipCheck`
|        1,687,000.00 |              592.77 |    0.1% |      0.02 | `GCSFilterMatch`

Comment on lines +154 to 163
// Check that the hash of the encoded_filter matches the one stored in the db.
uint256 block_hash;
std::vector<uint8_t> encoded_filter;
try {
filein >> block_hash >> encoded_filter;
filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter));
uint256 result;
CHash256().Write(encoded_filter).Finalize(result);
if (result != hash) return error("Checksum mismatch in filter decode.");
filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter), /*skip_decode_check=*/true);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider moving the hash checking into a separate function for readability and maintainability?

@@ -143,18 +144,22 @@ bool BlockFilterIndex::CommitInternal(CDBBatch& batch)
return BaseIndex::CommitInternal(batch);
}

bool BlockFilterIndex::ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter) const
bool BlockFilterIndex::ReadFilterFromDisk(const FlatFilePos& pos, const uint256& hash, BlockFilter& filter) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since this function also has a block_hash variable, maybe filter_hash would be a clearer disambiguation?

Suggested change
bool BlockFilterIndex::ReadFilterFromDisk(const FlatFilePos& pos, const uint256& hash, BlockFilter& filter) const
bool BlockFilterIndex::ReadFilterFromDisk(const FlatFilePos& pos, const uint256& filter_hash, BlockFilter& filter) const

uint256 block_hash;
std::vector<uint8_t> encoded_filter;
try {
filein >> block_hash >> encoded_filter;
filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter));
uint256 result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calculated_hash might be more intuitive?

Suggested change
uint256 result;
uint256 calculated_hash;

uint256 block_hash;
std::vector<uint8_t> encoded_filter;
try {
filein >> block_hash >> encoded_filter;
filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter));
uint256 result;
CHash256().Write(encoded_filter).Finalize(result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think {} list initialization is generally preferred? Here, and in some other places in the PR.

Suggested change
CHash256().Write(encoded_filter).Finalize(result);
CHash256{}.Write(encoded_filter).Finalize(result);

filter = BlockFilter(GetFilterType(), block_hash, std::move(encoded_filter));
uint256 result;
CHash256().Write(encoded_filter).Finalize(result);
if (result != hash) return error("Checksum mismatch in filter decode.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to include pos in the error message? Also, not sure if "filter decode" is the most accurate description? What about e.g. "Checksum mismatch for filter loaded from <pos>"?

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK e734228, with caveat that I mostly paid attention to the main code, not the changes to the benchmark. Only changes since last review were changes to the benchmark code.

@fanquake
Copy link
Member

fanquake commented Jul 7, 2022

@stickies-v maybe you'd like to follow up here?

@fanquake fanquake merged commit 5abbc9a into bitcoin:master Jul 7, 2022
@jonatack
Copy link
Member

jonatack commented Jul 7, 2022

If someone follows up, this comment and patch might be useful: #24832 (comment)

@kcalvinalvin
Copy link
Contributor Author

I have working code here: https://github.com/kcalvinalvin/bitcoin/commits/2020-06-14-blockfilterindex-checksums-2. I'll make a follow up PR

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jul 8, 2022
… the filter from disk.

e734228 Update GCSFilter benchmarks (Calvin Kim)
aee9a81 Add GCSFilterDecodeSkipCheck benchmark (Patrick Strateman)
299023c Add GCSFilterDecode and GCSBlockFilterGetHash benchmarks. (Patrick Strateman)
b0a53d5 Make sanity check in GCSFilter constructor optional (Patrick Strateman)

Pull request description:

  This PR picks up the abandoned bitcoin#19280

  BlockFilterIndex was depending on `GolombRiceDecode()` during the filter decode to sanity check that the filter wasn't corrupt. However, we can check for corruption by ensuring that the encoded blockfilter's hash matches up with the one stored in the index database.

  Benchmarks that were added in bitcoin#19280 showed that checking the hash is much faster.

  The benchmarks were changed to nanobench and the relevant benchmarks were like below, showing a clear win for the hash check method.

  ```
  |             ns/elem |              elem/s |    err% |        ins/elem |       bra/elem |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
  |              531.40 |        1,881,819.43 |    0.3% |        3,527.01 |         411.00 |    0.2% |      0.01 | `DecodeCheckedGCSFilter`
  |          258,220.50 |            3,872.66 |    0.1% |    2,990,092.00 |     586,706.00 |    1.7% |      0.01 | `DecodeGCSFilter`
  |           13,036.77 |           76,706.09 |    0.3% |       64,238.24 |         513.04 |    0.2% |      0.01 | `BlockFilterGetHash`
  ```

ACKs for top commit:
  mzumsande:
    Code Review ACK e734228
  theStack:
    Code-review ACK e734228
  stickies-v:
    ACK e734228
  ryanofsky:
    Code review ACK e734228, with caveat that I mostly paid attention to the main code, not the changes to the benchmark. Only changes since last review were changes to the benchmark code.

Tree-SHA512: 02b86eab7b554e1a57a15b17a4d6d71faa91b556c637b0da29f0c9ee76597a110be8e3b4d0c158d4cab04af0623de18b764837be0ec2a72afcfe1ad9c78a83c6
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Jul 19, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Jul 27, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
@jonatack
Copy link
Member

Post-merge re-ACK e734228, thanks for updating.

kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Aug 5, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Aug 5, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Aug 5, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Aug 30, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Aug 31, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
kcalvinalvin added a commit to kcalvinalvin/bitcoin that referenced this pull request Aug 31, 2022
BlockFilter::Unserialize is only used in the test code. Moving it to
the tests allows for the removal of the optional sanity check in the
GCSFilter and BlockFilter constructor, resulting in simpler code.

This is a follow up to bitcoin#24832.
@bitcoin bitcoin locked and limited conversation to collaborators Jul 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants