speed up IBD by not storing unspent coins in UTXO memory cache #7
Description
Background
The UTXO database, aka the chainstate, is stored on disk and partially in a memory cache whose size is determined by `-dbcache`. The size of this cache significantly affects IBD time. As we validate each block, we loop through its transactions and verify that each input spends an existing UTXO. Even if we're not checking signatures (AssumeValid), we verify that the input refers to an unspent output (to prevent double-spends). This is much faster if the UTXO is in the memory cache. We also add each transaction's outputs to the memory cache so that a subsequent transaction can spend them. This cache is implemented as an unordered map whose key is an outpoint (txid and output index) and whose data is the scriptPubKey (what's needed to validate an input that wants to spend this output). When the cache becomes full, we flush it to disk; you can see those flushing cliffs in graphs.
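To make the data structure concrete, here is a minimal sketch of such a cache in C++. It's a simplified model of the description above, not Bitcoin Core's actual `CCoinsViewCache`; the type names, fields, and toy hash function are illustrative only.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

// An outpoint identifies one output of one transaction.
struct Outpoint {
    std::array<uint8_t, 32> txid;  // hash of the creating transaction
    uint32_t index{0};             // which output of that transaction
    bool operator==(const Outpoint& o) const { return txid == o.txid && index == o.index; }
};

// Toy hash for the unordered map; the real cache uses a salted SipHash.
struct OutpointHasher {
    size_t operator()(const Outpoint& o) const {
        size_t h = o.index;
        for (uint8_t b : o.txid) h = h * 131 + b;
        return h;
    }
};

// One cached coin: what's needed to validate an input that spends it.
struct CacheEntry {
    std::vector<uint8_t> script_pubkey;
    int64_t amount{0};
    bool dirty{false};  // differs from the on-disk chainstate; must be flushed eventually
    bool fresh{false};  // created since the last flush; never written to disk yet
};

using UtxoCache = std::unordered_map<Outpoint, CacheEntry, OutpointHasher>;
```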
We don't flush a "dirty" in-memory UTXO entry to disk right away, because a near-term transaction may spend this UTXO -- possibly even within the same block. In that case, we never write this UTXO to disk! That's a nice savings, and I think it explains the downward arcs of those graphs before each flush: UTXOs are continuously added to the cache at about the same rate, but as the cache becomes populated, it's more likely that inputs spend UTXOs that are already in the memory cache, which allows those UTXOs to be dropped (and never written to disk). So the net rate of addition to the cache slows down. If the cache were large enough to hold the entire UTXO set, its size would gradually level off, but it's unusual for node runners to configure a cache that large.
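A rough sketch of that savings, reusing the types from the snippet above (again simplified -- the real cache tracks spentness explicitly, rather than via an emptied entry):

```cpp
// Spending a coin that is still "fresh" (created since the last flush) just
// erases the cache entry; the disk never sees that coin at all.
void SpendCoin(UtxoCache& cache, const Outpoint& out)
{
    auto it = cache.find(out);
    if (it == cache.end()) {
        // Not cached: the coin would be fetched from the on-disk chainstate
        // and then marked spent (omitted here).
        return;
    }
    if (it->second.fresh) {
        cache.erase(it);  // created and spent between flushes: no disk write, ever
    } else {
        it->second.script_pubkey.clear();  // stand-in for "spent" in this sketch
        it->second.dirty = true;           // at flush time, delete it from disk
    }
}
```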
As an aside, it seems like an improvement would be to retain these items in the memory cache after we flush them to disk, instead of clearing the cache. There may be a reason I'm not aware of why this can't be done (or can't be done easily).
The problem
There are many coins, such as Satoshi's early coins, that to this day have never been spent. They will eventually get flushed to disk -- but they hold onto memory until that happens, leaving less memory available for other UTXOs that will be spent before being flushed. Keeping Satoshi's coins in memory while they wait for a flush isn't a very efficient use of that memory.
One interesting property of IBD is that we know the (likely) future. Every node will go through the same sequence of coin creation and spending (in the database) until reaching the tip.
A solution
So the idea is: we add a static, hard-coded bloom filter to the source code, probably updated for each major release (similar to the AssumeValid block hash). This bloom filter contains outpoints referring to UTXOs that remain unspent (at least up to the time the filter is constructed), such as Satoshi's coins. When we add an entry to the memory cache, if the outpoint is (probably) in the bloom filter, then we want to flush it to disk -- and free its cache entry -- soon. The flushing seems to be done in batches, so we may need to bundle up a bunch of these entries and flush them at the same time (I haven't figured out how to do this yet). This frees up cache space, making it more likely that an input we're validating refers to a UTXO that is still in the memory cache (and thus doesn't need to be read from disk). It makes the cache more effective -- storing items that are more likely to be needed -- which has the same effect as having a larger cache.
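Here is a hypothetical sketch of what the check could look like at the point where new coins enter the cache. Everything below is invented for illustration -- `LongLivedFilter`, `FlushEntriesToDisk`, and `kEarlyFlushBatch` don't exist in Bitcoin Core -- and the real integration point, batch size, and flush mechanism are exactly the open questions mentioned above.

```cpp
#include <utility>
#include <vector>

// Hypothetical types and helpers, invented for this sketch:
struct LongLivedFilter {
    // Static, hard-coded bloom filter of outpoints believed to stay unspent.
    bool ProbablyContains(const Outpoint& out) const;  // may return false positives
};
void FlushEntriesToDisk(std::vector<std::pair<Outpoint, CacheEntry>>& batch);  // batched write to chainstate

constexpr size_t kEarlyFlushBatch = 10000;  // arbitrary batch size for the sketch

void AddCoin(UtxoCache& cache, const LongLivedFilter& filter,
             std::vector<std::pair<Outpoint, CacheEntry>>& early_flush,
             const Outpoint& out, CacheEntry entry)
{
    entry.dirty = true;
    entry.fresh = true;
    if (filter.ProbablyContains(out)) {
        // (Probably) never spent, e.g. Satoshi-era coins: queue it for an early
        // batched flush instead of letting it occupy cache space until the next
        // full flush.
        early_flush.emplace_back(out, std::move(entry));
        if (early_flush.size() >= kEarlyFlushBatch) {
            FlushEntriesToDisk(early_flush);
            early_flush.clear();
        }
    } else {
        cache.emplace(out, std::move(entry));
    }
}
```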
A bloom filter can have false positives, so we may flush some cache entries that do get spent later, and spending those entries would then be slower than it is today (they would need to be read back from disk). But if we size the filter correctly, this should be rare. We could instead invert the sense of the filter so that it contains entries that will be spent (and flush entries that are not in the filter). That way, a false positive would cause us to keep an entry in the cache that would be better to flush -- which is no worse than today's behavior. I'm not sure which way around the filter should be implemented, but for now, let's assume it contains unspent UTXOs.
A refinement to this idea would be to add UTXOs to the bloom filter that are long-lived -- that is, not spent "soon," for some definition of soon. Unfortunately, that definition would ideally depend on the node's `-dbcache` setting, which we can't predict when we're creating the static bloom filter. We could ship a few filters and use the appropriate one based on cache size, but that may be getting too complicated; choosing "soon" based on the default cache size may be close enough. We could also avoid activating this optimization (doing these "early" flushes) until the cache becomes something like 90% full, so that a node configured with a very large cache wouldn't see worse performance from this change.
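As a sketch of that guard, with a hypothetical helper and threshold constant (the 90% figure is just the example from the paragraph above):

```cpp
// Only do early flushes once the cache is nearly full, so nodes configured
// with a very large -dbcache see no behavior change from this optimization.
constexpr double kActivationFraction = 0.90;  // example threshold, not tuned

bool EarlyFlushActive(size_t cache_bytes_used, size_t dbcache_bytes)
{
    return cache_bytes_used >= static_cast<size_t>(kActivationFraction * dbcache_bytes);
}
```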
Another refinement (thanks to @theStack) would be to inspect the scriptPubKey to see if the output is likely or provably unspendable (for example, dust, public keys not on the curve, or time locks).
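A rough sketch of such a heuristic: the first two checks mirror what Bitcoin Core's `CScript::IsUnspendable()` already covers (OP_RETURN data carriers and oversized scripts), the dust threshold is an illustrative guess, and a pubkey-on-the-curve check is omitted because it would need secp256k1.

```cpp
#include <cstdint>
#include <vector>

// Returns true if the output is provably unspendable or unlikely to ever be spent.
bool LikelyNeverSpent(const std::vector<uint8_t>& script_pubkey, int64_t amount)
{
    constexpr uint8_t OP_RETURN = 0x6a;
    constexpr size_t MAX_SCRIPT_SIZE = 10000;
    constexpr int64_t DUST_SATOSHIS = 546;  // rough dust threshold, in satoshis

    if (!script_pubkey.empty() && script_pubkey[0] == OP_RETURN) return true;  // data carrier output
    if (script_pubkey.size() > MAX_SCRIPT_SIZE) return true;                   // too large to ever spend
    if (amount < DUST_SATOSHIS) return true;                                   // dust: rarely worth spending
    return false;
}
```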
Note that this idea doesn't require any additional trust in the devs; it's just a performance improvement in the most likely case.