
catchup: Add HEAD request to catchpoint service start #5393

Merged
merged 19 commits into from
Jun 24, 2023

Conversation

Contributor

@Eric-Warehime Eric-Warehime commented May 17, 2023

Summary

Adds a HEAD request to the ledger endpoint on catchpoint service start. It will cycle through 2 peers to check whether the requested ledger is available, and return an error if it is not.

Addresses #3637

Test Plan

Unit tests added. Also added an e2e test which exercises this logic.
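
A minimal, standalone sketch of the start-time check described in the summary: issue a HEAD request for the ledger at the catchpoint round against a few peers and fail fast if none can serve it. The URL path, peer handling, and retry count here are illustrative assumptions, not the actual implementation in catchup/ledgerFetcher.go and catchup/catchpointService.go.

package catchupsketch

import (
	"fmt"
	"net/http"
)

func checkLedgerAvailable(peers []string, round uint64, retries int) error {
	var lastErr error = fmt.Errorf("no peers tried")
	for i := 0; i < retries && i < len(peers); i++ {
		url := fmt.Sprintf("http://%s/v1/ledger/%d", peers[i], round) // hypothetical path
		resp, err := http.Head(url)
		if err != nil {
			lastErr = err
			continue
		}
		resp.Body.Close()
		switch resp.StatusCode {
		case http.StatusOK:
			return nil // at least one peer can serve the requested ledger
		case http.StatusNotFound: // peer has no ledger for that round
			lastErr = fmt.Errorf("peer %s: no ledger for round %d", peers[i], round)
		default:
			lastErr = fmt.Errorf("peer %s: unexpected status %d", peers[i], resp.StatusCode)
		}
	}
	return fmt.Errorf("catchpoint ledger unavailable from peers: %w", lastErr)
}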


codecov bot commented May 17, 2023

Codecov Report

Merging #5393 (bdcc654) into master (6b712a1) will increase coverage by 0.13%.
The diff coverage is 43.58%.

@@            Coverage Diff             @@
##           master    #5393      +/-   ##
==========================================
+ Coverage   55.46%   55.60%   +0.13%     
==========================================
  Files         447      447              
  Lines       63290    63349      +59     
==========================================
+ Hits        35103    35224     +121     
+ Misses      25807    25751      -56     
+ Partials     2380     2374       -6     
Impacted Files Coverage Δ
catchup/catchpointService.go 8.03% <0.00%> (-0.36%) ⬇️
daemon/algod/api/server/v2/handlers.go 0.82% <0.00%> (-0.01%) ⬇️
node/error.go 0.00% <0.00%> (ø)
node/follower_node.go 34.41% <0.00%> (-0.49%) ⬇️
node/node.go 4.02% <0.00%> (-0.02%) ⬇️
catchup/ledgerFetcher.go 53.09% <90.62%> (+12.67%) ⬆️
rpcs/ledgerService.go 54.54% <100.00%> (+54.54%) ⬆️

... and 14 files with indirect coverage changes


@Eric-Warehime Eric-Warehime marked this pull request as ready for review May 26, 2023 00:08
winder previously approved these changes May 26, 2023
Contributor

@winder winder left a comment

Looks great. Just had some minor suggestions.

Comment on lines 164 to 167
	err := cs.checkLedgerDownload()
	if err != nil {
		return err
	}
Contributor

Add some extra information about what we're doing as a result of the error. An error in checkLedgerDownload could also be helpful.

Suggested change
-	err := cs.checkLedgerDownload()
-	if err != nil {
-		return err
-	}
+	err := cs.checkLedgerDownload()
+	if err != nil {
+		return fmt.Errorf("Start(): aborting catchup: %s", err)
+	}

			return nil
		}
	}
	return err
Contributor

Suggested change
-	return err
+	return fmt.Errorf("checkLedgerDownload(): catchpoint '%s' unavailable from peers: %s", cs.stats.CatchpointLabel, err)

	switch response.StatusCode {
	case http.StatusOK:
		return nil
	case http.StatusNotFound: // server could not find a block with that round numbers.
Contributor

Suggested change
-	case http.StatusNotFound: // server could not find a block with that round numbers.
+	case http.StatusNotFound: // server could not find a block with that round number.

node/node.go Outdated
Comment on lines 1186 to 1188
	if err != nil {
		node.log.Warnf(err.Error())
		return MakeCatchpointNoPeersFoundError(catchpoint)
Contributor

nit: this error seems too specific for this level of the code; there could be other errors now (e.g. a bad catchpoint label) or new conditions added in the future. I'd suggest creating the error type inside the catchup service or making the error type more generic, such as UnableToStartCatchup.

Comment on lines +198 to +202
	response.Header().Set("Content-Type", LedgerResponseContentType)
	if request.Method == http.MethodHead {
		response.WriteHeader(http.StatusOK)
		return
	}
Contributor

This ServeHTTP function is wild... I'm starting to see what you were talking about.

Contributor

Also, nice. This is a great solution, I didn't realize you were implementing across the client and the server with this PR.

Contributor

It just occurred to me, there is a period of time where new clients will be making HEAD requests against relays that may not have updated to this code which handles them.

What happens in that case?

Contributor

If there is a known error for a missing HEAD verb, that could spot the difference between old and new relays. Alternatively, a feature flag or a two-phased release is possible. A feature flag is more controversial, I imagine.

Contributor Author

I've been running catchup locally using my branch (as the client) against betanet which obviously has no peers running the new server code.

What happens is that it just returns 200 along with the entire payload, which just gets ignored and then fetched again during the actual catchup.

So this will cause double the data transfer until the ledgerService is running the new code, which seems OK to me.

rpcs/ledgerService_test.go (resolved comment threads)
@@ -189,6 +195,11 @@ func (ls *LedgerService) ServeHTTP(response http.ResponseWriter, request *http.Request) {
 		}
 	}
 	defer cs.Close()
+	response.Header().Set("Content-Type", LedgerResponseContentType)
+	if request.Method == http.MethodHead {
Contributor

clever
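
For reference, a standalone sketch of the server-side pattern in the diff above: do the same validation a GET would do, then return before writing the body when the method is HEAD. The handler shape and content type below are placeholders, not the actual rpcs/ledgerService.go code.

package ledgersketch

import "net/http"

func serveLedger(response http.ResponseWriter, request *http.Request) {
	// ... locate the requested catchpoint ledger here; write 404 and return if it is missing ...

	response.Header().Set("Content-Type", "application/octet-stream") // placeholder for LedgerResponseContentType
	if request.Method == http.MethodHead {
		// HEAD: the caller only wants to know the ledger exists; skip the body.
		response.WriteHeader(http.StatusOK)
		return
	}
	// GET: stream the ledger contents as before, e.g. io.Copy(response, ledgerReader).
}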

tzaffi previously approved these changes May 31, 2023
Contributor

@tzaffi tzaffi left a comment

LGTM

@Eric-Warehime Eric-Warehime dismissed stale reviews from tzaffi and winder via efe59dd May 31, 2023 18:46
	require.Equal(t, fmt.Errorf("could not parse a host from url"), err)

	// headLedger 404 response
	httpServerResponse = http.StatusNotFound
Contributor

I don't think you need this anymore?

Contributor

or the other two

Contributor Author

This is the response that the mux handler func writes on behalf of the peer. These tests ensure the behavior of the headLedger function is correct based on various possible peer responses.

Contributor

Yes, but in the error construction you are not using this httpServerResponse variable; you are using the http package directly.

@bbroder-algo
Contributor

great tests.

Comment on lines 354 to 355
	node.log.Warnf(err.Error())
	return MakeStartCatchpointError(catchpoint)
Contributor

Include the error too?

Suggested change
-	node.log.Warnf(err.Error())
-	return MakeStartCatchpointError(catchpoint)
+	node.log.Warnf(err.Error())
+	return MakeStartCatchpointError(catchpoint, err)

Contributor Author

Added in most recent commit.
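
A sketch of one way such a wrapped error could look; the actual MakeStartCatchpointError in node/error.go may be shaped differently.

package nodeerrsketch

import "fmt"

// StartCatchpointError reports that catchpoint catchup could not be started
// for a given label, keeping the underlying cause for errors.Is / errors.As.
type StartCatchpointError struct {
	Catchpoint string
	Err        error
}

func (e *StartCatchpointError) Error() string {
	return fmt.Sprintf("unable to start catchpoint catchup for '%s': %v", e.Catchpoint, e.Err)
}

func (e *StartCatchpointError) Unwrap() error { return e.Err }

// MakeStartCatchpointError mirrors the constructor shape used in the suggested change.
func MakeStartCatchpointError(catchpoint string, err error) error {
	return &StartCatchpointError{Catchpoint: catchpoint, Err: err}
}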

@Eric-Warehime
Contributor Author

Going to close and re-open since updates to the branch aren't being reflected on the PR for some reason?

winder previously approved these changes Jun 9, 2023

	// checkLedgerDownloadRetries is the number of times the catchpoint service will attempt to HEAD request the
	// ledger from peers when `Start`ing catchpoint catchup
	checkLedgerDownloadRetries = 10
Contributor

There is a config argument CatchupLedgerDownloadRetryAttempts which defaults to 50. Shouldn't this be the same value as that, rather than a new const?

Contributor

I still think we should use the configuration value CatchupLedgerDownloadRetryAttempts instead of adding a new hard-coded arbitrary constant.
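
A tiny sketch of wiring the pre-check to that existing config value instead of a new constant; the config struct and the per-peer check function below are stand-ins, not go-algorand's actual types.

package retriessketch

type localConfig struct {
	CatchupLedgerDownloadRetryAttempts uint64 // defaults to 50 per the comment above
}

// preCheck retries the per-peer HEAD check up to the configured number of attempts.
func preCheck(cfg localConfig, tryNextPeer func() error) error {
	var err error
	for i := uint64(0); i < cfg.CatchupLedgerDownloadRetryAttempts; i++ {
		if err = tryNextPeer(); err == nil {
			return nil
		}
	}
	return err
}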

Contributor

@algorandskiy algorandskiy left a comment

I do not understand how adding HEAD to fetch the catchpoint is any better than the existing loop in processStageLedgerDownload that tries to download catchpoint files. How does it help with the initial issue? Say there is no catchpoint; then both processStageLedgerDownload and HEAD advance to the next peer.
Please clarify.
In my opinion, if processStageLedgerDownload does not work it needs to be fixed rather than writing a new preprocessing workaround.

catchup/catchpointService.go (resolved comment thread)
@@ -156,11 +160,16 @@ func MakeNewCatchpointCatchupService(catchpoint string, node CatchpointCatchupNo
 }
 
 // Start starts the catchpoint catchup service ( continue in the process )
-func (cs *CatchpointCatchupService) Start(ctx context.Context) {
+func (cs *CatchpointCatchupService) Start(ctx context.Context) error {
+	err := cs.checkLedgerDownload()
Contributor

What if the stage is "catchpoint downloaded" and the node was restarted? In that case there is no need to go to the network at all.

Contributor Author

That's true. I think in the case of a resumed catchpoint service we should just succeed since we assume it was previously successfully started.

Contributor

So let's say someone runs Start() via the REST API. They get a 200 but nothing visible happens yet (it's busy downloading, validating, etc.). They run Start again, perhaps putting in a different label. They get a 200 again. They do it over and over, and they will keep getting 200s because of this check, even though the cs stage has advanced and their call to Start() and the resulting HEAD request has no impact on what algod is doing.

Contributor Author

The handlers invoke StartCatchup via the node which checks that the service isn't already running. So you'd only get a 200 the first time.

Contributor

Oh, the node calls Start(), not the handler! OK, I see. But the node doesn't know about the check, and neither does the handler... OK.

@@ -156,11 +160,16 @@ func MakeNewCatchpointCatchupService(catchpoint string, node CatchpointCatchupNo
 }
 
 // Start starts the catchpoint catchup service ( continue in the process )
-func (cs *CatchpointCatchupService) Start(ctx context.Context) {
+func (cs *CatchpointCatchupService) Start(ctx context.Context) error {
Contributor

@algorandskiy this function returning an error is the primary functional change. Today, if this function fails the catchup fails silently. By returning an error the API can return the error and show it to the user.
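
A sketch of how a REST handler could surface the returned error to the user, assuming a hypothetical catchupStarter interface rather than the actual daemon/algod/api handler code.

package handlersketch

import "net/http"

type catchupStarter interface {
	StartCatchup(catchpoint string) error
}

func startCatchupHandler(w http.ResponseWriter, node catchupStarter, catchpoint string) {
	if err := node.StartCatchup(catchpoint); err != nil {
		// Previously the failure was silent; now the caller sees why catchup could not start.
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}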

Contributor

It only solves the case where an invalid catchpoint is supplied or the node is not connected to a network. Probably fine. But again, it should only be checked if there is no pending catchpoint in progress, i.e. the service is started fresh, i.e. the check should go into MakeNewCatchpointCatchupService.

Contributor

@cce cce Jun 14, 2023

Agreed, it is a false positive to return 200 for these cases just because you did a HEAD check. However, the REST API handler code has access to the status, so it may be possible to prevent this at the handler level? Not sure if that covers all cases or not.

@Eric-Warehime
Contributor Author

@algorandskiy

I do not understand how adding HEAD to fetch catchpoint is any better than the existing loop in processStageLedgerDownload that tries to download catchpoint files. How does it ever help to the initial issue?

The issue this PR is trying to solve is returning an error when catchpoint catchup starts instead of optimistically succeeding. Currently you'll get a 200 response even if you give it an invalid, but properly formed, catchpoint label.

The head request isn't meant to be better than the ledger download, it's supposed to be nearly identical except it doesn't fetch the ledger contents. It's meant to be a quick check to see if the provided catchpoint would advance past the first step in the catchpoint process.

Do you think the better decision would be to just block the start catchpoint catchup request on the processLedgerDownload stage? That way we could ensure that the ledger download happens before launching the rest of the process in a goroutine and returning success?

@algorandskiy
Contributor

It looks like MakeNewCatchpointCatchupService is the place where the caller decided we want a new catchpoint download (state=ledger.CatchpointCatchupStateInactive, no db access to get prev state), so it appears to be the right place to pre-check availability.

	if err != nil {
		return fmt.Errorf("aborting catchup Start(): %s", err)
	// Only check catchpoint ledger validity if we're starting new
	if cs.stage == ledger.CatchpointCatchupStateInactive {
Contributor

still... why not in MakeService?

Contributor Author

It feels like bad practice to start making network requests during construction. We really care about whether or not we'll be able to fetch the ledger when the service is about to start.

It's not a big deal to me--I'm ok moving it to MakeService if needed.

Contributor

This is a valid point, so yes, Start fits better.
The last concern I have is about the following scenarios:

  1. At the moment there are 100+ relays and we only try 10.
  2. What if, for some reason, the requested catchpoint is found in Start but not found when the download + new peer selector is invoked?
     I think it would be nice to save a peer that definitely has this catchpoint and use that peer as the first one in the download loop (see the sketch below).
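
A sketch of the "remember the good peer" idea from item 2; the types and names below are illustrative, not go-algorand's actual peer selector.

package peersketch

type peerPreference struct {
	knownGoodPeer string // set by the HEAD pre-check; empty if none was found
}

// orderPeers puts the peer that already answered the HEAD check at the front,
// keeping the rest of the selection order unchanged.
func (pp *peerPreference) orderPeers(peers []string) []string {
	if pp.knownGoodPeer == "" {
		return peers
	}
	ordered := []string{pp.knownGoodPeer}
	for _, p := range peers {
		if p != pp.knownGoodPeer {
			ordered = append(ordered, p)
		}
	}
	return ordered
}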

@winder winder requested a review from algorandskiy June 23, 2023 18:36
Contributor

@winder winder left a comment

This is improving the catchup API. Changes look good, I'd like to merge it in.

Contributor

@algorandskiy algorandskiy left a comment

I still think this could be done better but probably OK for now
