Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: split out /Transaction/AssembleBlock metrics #4795

Merged
merged 6 commits into from
Nov 17, 2022

Conversation

iansuvak
Copy link
Contributor

@iansuvak iansuvak commented Nov 15, 2022

Summary

This splits out reasons for removing transactions from the pool during AssembleBlock with more granularity. The main reason is to provide visibility into the number of expired transactions across the network.

Questions for discussion

  • This removes info log from drops caused by taken lease since they are fairly common but leaves them in for MinTxnFeeError. We could remove them, leave them or potentially downsample them if we think that having some logs would be useful
  • There might be value in separately tracking TxnDeadErrors that have expired and had the LastValid set to FirstValid + MaxTxnLife. This can be implemented by adding a bool field indicating this to the TxnDeadError struct but accessing the MaxTxnLife would require passing down the ConsensusParams all the way down from recomputeBlockEvaluator and it might be messy if we don't think it's worth it

Test Plan

Existing tests should pass

algorandskiy
algorandskiy previously approved these changes Nov 15, 2022
Copy link
Contributor

@algorandskiy algorandskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me

@iansuvak
Copy link
Contributor Author

Additionally after releasing this we should probably remove /Transaction/ProcessBlock since it's mostly unused now and all info is available in /Transaction/AssembleBlock

@iansuvak iansuvak marked this pull request as ready for review November 15, 2022 20:59
@codecov
Copy link

codecov bot commented Nov 15, 2022

Codecov Report

Merging #4795 (3560a26) into master (23890a8) will increase coverage by 0.00%.
The diff coverage is 37.50%.

@@           Coverage Diff           @@
##           master    #4795   +/-   ##
=======================================
  Coverage   54.68%   54.69%           
=======================================
  Files         414      414           
  Lines       53550    53560   +10     
=======================================
+ Hits        29286    29292    +6     
- Misses      21836    21843    +7     
+ Partials     2428     2425    -3     
Impacted Files Coverage Δ
ledger/ledgercore/error.go 0.00% <0.00%> (ø)
data/pools/transactionPool.go 50.00% <20.00%> (-0.70%) ⬇️
logging/telemetryspec/metric.go 81.03% <100.00%> (+1.40%) ⬆️
ledger/blockqueue.go 85.63% <0.00%> (-2.88%) ⬇️
catchup/peerSelector.go 98.95% <0.00%> (-1.05%) ⬇️
cmd/tealdbg/debugger.go 72.69% <0.00%> (-0.81%) ⬇️
catchup/service.go 68.88% <0.00%> (-0.50%) ⬇️
network/wsPeer.go 68.93% <0.00%> (ø)
network/wsNetwork.go 65.52% <0.00%> (+0.18%) ⬆️
ledger/tracker.go 74.89% <0.00%> (+0.85%) ⬆️
... and 3 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@iansuvak
Copy link
Contributor Author

I added a separate check for max life transaction but happy to remove it. One thing that would definitely make this counter unnecessary is if majority of transactions coming from all sources had LastValid shorter than MaxTxnLife

algonautshant
algonautshant previously approved these changes Nov 16, 2022
Copy link
Contributor

@algonautshant algonautshant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. Is it possible to consolidate some of the metrics, since two sets of metrics are collected for the same conditions.

The code is getting crowded with metrics. At least it is important to add comments explaining the distinctions, so that if cleanup is needed, it can be done safely and easily.

data/pools/transactionPool.go Outdated Show resolved Hide resolved
data/pools/transactionPool.go Show resolved Hide resolved
case *ledgercore.TransactionInLedgerError:
asmStats.CommittedCount++
stats.RemovedInvalidCount++
case transactions.TxnDeadError:
asmStats.InvalidCount++
if proto.MaxTxnLife == uint64(err.LastValid-err.FirstValid) {
asmStats.ExpiredMaxLifeCount++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be misleading, since we don't know if err.LastValid-err.FirstValid is a small enough to make an impact, say less than 10, or if it is proto.MaxTxnLife-1.
I think it will be better to check if it is less than 50, than compare to proto.MaxTxnLife.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is odd, I just didn't want to introduce an arbitrary magic number since any number is really arbitrary. Perhaps more useful would be an average transaction lifespan? Could implement it as running average by keeping running sum of lifetimes of expired txns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running average is too complex, and I don't suggest that. Also, it is not very helpful.

The "magic" number is the one that makes sense for the collected metrics.
I think, we are interested in learning if the txn life is 5 or 900. But we don't care if it is 900 or 990. I think anything more that 20 will fall into a long-term bucket. So, 20 is not a magic number, but helps characterize a certain behavior of transactions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change to 20. We can see what stats this comes out to and further tune when removing old stats.

algorandskiy
algorandskiy previously approved these changes Nov 16, 2022
algonautshant
algonautshant previously approved these changes Nov 16, 2022
@iansuvak iansuvak dismissed stale reviews from algonautshant and algorandskiy via 57e08ac November 17, 2022 17:13
@@ -55,7 +55,7 @@ func MakeLeaseInLedgerError(txid transactions.Txid, lease Txlease) *LeaseInLedge
func (lile *LeaseInLedgerError) Error() string {
// format the lease as address.
addr := basics.Address(lile.lease.Lease)
return fmt.Sprintf("transaction %v using an overlapping lease %s", lile.txid, addr.String())
return fmt.Sprintf("transaction %v using an overlapping lease (%s, %s)", lile.txid, lile.lease.Sender.String(), addr.String())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very confusing :) maybe rename addr to leaseData

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? the "lease" in actual string to "(sender, lease)"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but addr := basics.Address(lile.lease.Lease) this is not address but the lease value. I'm referencing to addr.String() that prints lease data, not addr as one can think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, I think

algorandskiy
algorandskiy previously approved these changes Nov 17, 2022
@algorandskiy algorandskiy merged commit 06e61a9 into algorand:master Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants