Description
This is a tracking issue for the recurring finalization stalling incidents.
So far, here are the pointers we have.
Commits that are affecting the finalization directly:
- Fix longest chain finalization target lookup paritytech/substrate#13289
  Should not be the cause though, as the issue is with grandpa's inability to follow a reorg, which is caused by pruned blocks. Or rather, it could be the cause, but it is not really where the bug we'd like to fix is.
- Fix block pruning after long re-org paritytech/substrate#13323 - could this be the reason? No, this one works as intended.
- grandpa: don't error if best block and finality target are inconsistent paritytech/substrate#13364 - this looks like it directly touches what's causing the bug, but it seems more like a fix for the aftermath of the one above.
Notable historical commits that might've affected this incident:
- Epoch-Changes tree pruning was lagging by one epoch paritytech/substrate#12567
- babe: allow skipping over empty epochs paritytech/substrate#11727 (unlikely though)
- Remove `uncles` related code paritytech/substrate#13216 (actually looks somewhat interesting)
- Notification-based block pinning paritytech/substrate#13157 (maybe pinning the block that is used as the finality base is a good idea)
Probably irrelevant:
- rpc: Use the blocks pinning API for chainHead methods paritytech/substrate#13233
- rpc/chainHead: Fix pruned blocks events from forks paritytech/substrate#13379
- Reduce consensus spam paritytech/substrate#1658
- BABE: Fix aux data cleaning paritytech/substrate#11263
- grandpa: cleanup stale entries in set id session mapping paritytech/substrate#13237
- `BlockId` removal: `runtime-api` refactor paritytech/substrate#13255
Related things from the future:
- Change forks pruning algorithm. paritytech/polkadot-sdk#3962 - new fork pruning algorithm - might work better, but not sure
- Finalization hangs in 1.13 paritytech/polkadot-sdk#4903 - irrelevant for our issue, as this issue is with the new fork pruning algorithm mentioned above being too slow
Our current thinking is that the issue is caused by broken block pruning (which matters when an unfortunate best block selection happens). (UPD: the blocks are not really pruned, we can still request them from the API; this means it is a selection issue.) Grandpa tries to finalize a block that is pruned and no longer part of the best chain; upon receiving the precommits, it fails to recognize them because the block can't be resolved (it's pruned). This is quite odd, since the nodes operate with `--blocks-pruning archive`, so it must mean they haven't seen the block in question, as a block is never pruned once seen. In reality, though, we see the reorg happening from the block in question to a new one, so the old block is definitely known.
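
To make the distinction between "the block is unknown (pruned or never seen)" and "the block is known but no longer on the best chain" concrete, here is a minimal toy model in Rust. It is only a sketch for reasoning about the hypothesis above: `Header`, `TargetStatus`, `ancestor_at`, `classify_target` and `example_tree` are hypothetical names invented for illustration, not Substrate or grandpa APIs, and the hashes are plain integers.

```rust
use std::collections::HashMap;

/// Toy block hash (hypothetical; real chains use 32-byte hashes).
type Hash = u32;

/// Minimal header: only the parent link and the block height.
#[derive(Clone, Copy, Debug)]
struct Header {
    parent: Hash,
    number: u64,
}

/// How a finality vote target relates to the node's current best chain.
#[derive(Debug, PartialEq, Eq)]
enum TargetStatus {
    /// The target is an ancestor of (or equal to) the best block.
    OnBestChain,
    /// The header is known, but it sits on a fork the best chain abandoned.
    KnownButOffBestChain,
    /// The header cannot be resolved at all (pruned or never seen).
    Unknown,
}

/// Walk parent links from `from` down to height `number` and return the hash found there.
fn ancestor_at(headers: &HashMap<Hash, Header>, from: Hash, number: u64) -> Option<Hash> {
    let mut current = from;
    loop {
        let header = headers.get(&current)?;
        if header.number == number {
            return Some(current);
        }
        if header.number < number {
            // We are already below the requested height, so there is nothing to find.
            return None;
        }
        current = header.parent;
    }
}

/// Classify `target` relative to the chain ending at `best`.
fn classify_target(headers: &HashMap<Hash, Header>, best: Hash, target: Hash) -> TargetStatus {
    let target_header = match headers.get(&target) {
        Some(header) => header,
        None => return TargetStatus::Unknown,
    };
    match ancestor_at(headers, best, target_header.number) {
        Some(hash) if hash == target => TargetStatus::OnBestChain,
        _ => TargetStatus::KnownButOffBestChain,
    }
}

/// Example tree with two forks after block 1:
///   0 -> 1 -> 20 -> 30          (old best chain)
///        1 -> 21 -> 31 -> 41    (best chain after the reorg)
fn example_tree() -> HashMap<Hash, Header> {
    let mut headers = HashMap::new();
    headers.insert(0, Header { parent: 0, number: 0 });
    headers.insert(1, Header { parent: 0, number: 1 });
    headers.insert(20, Header { parent: 1, number: 2 });
    headers.insert(30, Header { parent: 20, number: 3 });
    headers.insert(21, Header { parent: 1, number: 2 });
    headers.insert(31, Header { parent: 21, number: 3 });
    headers.insert(41, Header { parent: 31, number: 4 });
    headers
}
```

In this model, a vote for block 30 is `OnBestChain` while 30 is the best block, and becomes `KnownButOffBestChain` once the best block moves to 41. It would only be `Unknown` if the header were missing entirely. The UPD above (blocks can still be requested from the API) suggests the second case rather than the third.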
We are currently on substrate 0.9.40.
Mainnet encountered this at least three times lately: