
catchup: Add HEAD request to catchpoint service start #5393

Merged
merged 19 commits into from
Jun 24, 2023

Conversation

Contributor

@Eric-Warehime Eric-Warehime commented May 17, 2023

Summary

Adds a HEAD request to the ledger endpoint on catchpoint service start. It will cycle through 2 peers to check whether the requested ledger is available, and return an error if it is not.

Addresses #3637

Test Plan

Unit tests added. Also added an e2e test which exercises this logic.
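
A minimal, standalone sketch of the start-time check described in the summary: issue a HEAD request for the ledger at the catchpoint round against a few peers and fail fast if none can serve it. The URL path, peer handling, and retry count here are illustrative assumptions, not the actual implementation in catchup/ledgerFetcher.go and catchup/catchpointService.go.

package catchupsketch

import (
	"fmt"
	"net/http"
)

func checkLedgerAvailable(peers []string, round uint64, retries int) error {
	var lastErr error = fmt.Errorf("no peers tried")
	for i := 0; i < retries && i < len(peers); i++ {
		url := fmt.Sprintf("http://%s/v1/ledger/%d", peers[i], round) // hypothetical path
		resp, err := http.Head(url)
		if err != nil {
			lastErr = err
			continue
		}
		resp.Body.Close()
		switch resp.StatusCode {
		case http.StatusOK:
			return nil // at least one peer can serve the requested ledger
		case http.StatusNotFound: // peer has no ledger for that round
			lastErr = fmt.Errorf("peer %s: no ledger for round %d", peers[i], round)
		default:
			lastErr = fmt.Errorf("peer %s: unexpected status %d", peers[i], resp.StatusCode)
		}
	}
	return fmt.Errorf("catchpoint ledger unavailable from peers: %w", lastErr)
}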


codecov bot commented May 17, 2023

Codecov Report

Merging #5393 (bdcc654) into master (6b712a1) will increase coverage by 0.13%.
The diff coverage is 43.58%.

@@            Coverage Diff             @@
##           master    #5393      +/-   ##
==========================================
+ Coverage   55.46%   55.60%   +0.13%     
==========================================
  Files         447      447              
  Lines       63290    63349      +59     
==========================================
+ Hits        35103    35224     +121     
+ Misses      25807    25751      -56     
+ Partials     2380     2374       -6     
Impacted Files Coverage Δ
catchup/catchpointService.go 8.03% <0.00%> (-0.36%) ⬇️
daemon/algod/api/server/v2/handlers.go 0.82% <0.00%> (-0.01%) ⬇️
node/error.go 0.00% <0.00%> (ø)
node/follower_node.go 34.41% <0.00%> (-0.49%) ⬇️
node/node.go 4.02% <0.00%> (-0.02%) ⬇️
catchup/ledgerFetcher.go 53.09% <90.62%> (+12.67%) ⬆️
rpcs/ledgerService.go 54.54% <100.00%> (+54.54%) ⬆️

... and 14 files with indirect coverage changes


@Eric-Warehime Eric-Warehime marked this pull request as ready for review May 26, 2023 00:08
winder previously approved these changes May 26, 2023
Contributor

@winder winder left a comment

Looks great. Just had some minor suggestions.

Comment on lines 164 to 167
	err := cs.checkLedgerDownload()
	if err != nil {
		return err
	}
Contributor

Add some extra information about what we're doing as a result of the error. An error in checkLedgerDownload could also be helpful.

Suggested change
-	err := cs.checkLedgerDownload()
-	if err != nil {
-		return err
-	}
+	err := cs.checkLedgerDownload()
+	if err != nil {
+		return fmt.Errorf("Start(): aborting catchup: %s", err)
+	}

			return nil
		}
	}
	return err
Contributor

Suggested change
-	return err
+	return fmt.Errorf("checkLedgerDownload(): catchpoint '%s' unavailable from peers: %s", cs.stats.CatchpointLabel, err)

	switch response.StatusCode {
	case http.StatusOK:
		return nil
	case http.StatusNotFound: // server could not find a block with that round numbers.
Contributor

Suggested change
-	case http.StatusNotFound: // server could not find a block with that round numbers.
+	case http.StatusNotFound: // server could not find a block with that round number.

node/node.go Outdated
Comment on lines 1186 to 1188
	if err != nil {
		node.log.Warnf(err.Error())
		return MakeCatchpointNoPeersFoundError(catchpoint)
Contributor

nit: this error seems too specific for this level of the code; there could be other errors now (e.g. a bad catchpoint label) or new conditions added in the future. I'd suggest creating the error type inside the catchup service or making the error type more generic, such as UnableToStartCatchup.

Comment on lines +198 to +202
	response.Header().Set("Content-Type", LedgerResponseContentType)
	if request.Method == http.MethodHead {
		response.WriteHeader(http.StatusOK)
		return
	}
Contributor

This ServeHTTP function is wild... I'm starting to see what you were talking about.

Contributor

Also, nice. This is a great solution, I didn't realize you were implementing across the client and the server with this PR.

Contributor

It just occurred to me, there is a period of time where new clients will be making HEAD requests against relays that may not have updated to this code which handles them.

What happens in that case?

Contributor

If there is a known error for a missing HEAD verb, that could spot the difference between old and new relays. Alternatively, a feature flag or a two-phased release is possible. A feature flag is more controversial, I imagine.

Contributor Author

I've been running catchup locally using my branch (as the client) against betanet which obviously has no peers running the new server code.

What happens is that it just returns 200 along with the entire payload, which just gets ignored and then fetched again during the actual catchup.

So this will cause double the data transfer until the ledgerService is running the new code, which seems OK to me.

rpcs/ledgerService_test.go (resolved comment threads)
@@ -189,6 +195,11 @@ func (ls *LedgerService) ServeHTTP(response http.ResponseWriter, request *http.Request) {
 		}
 	}
 	defer cs.Close()
+	response.Header().Set("Content-Type", LedgerResponseContentType)
+	if request.Method == http.MethodHead {
Contributor

clever
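
For reference, a standalone sketch of the server-side pattern in the diff above: do the same validation a GET would do, then return before writing the body when the method is HEAD. The handler shape and content type below are placeholders, not the actual rpcs/ledgerService.go code.

package ledgersketch

import "net/http"

func serveLedger(response http.ResponseWriter, request *http.Request) {
	// ... locate the requested catchpoint ledger here; write 404 and return if it is missing ...

	response.Header().Set("Content-Type", "application/octet-stream") // placeholder for LedgerResponseContentType
	if request.Method == http.MethodHead {
		// HEAD: the caller only wants to know the ledger exists; skip the body.
		response.WriteHeader(http.StatusOK)
		return
	}
	// GET: stream the ledger contents as before, e.g. io.Copy(response, ledgerReader).
}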

tzaffi previously approved these changes May 31, 2023
Contributor

@tzaffi tzaffi left a comment

LGTM

@Eric-Warehime Eric-Warehime dismissed stale reviews from tzaffi and winder via efe59dd May 31, 2023 18:46
	require.Equal(t, fmt.Errorf("could not parse a host from url"), err)

	// headLedger 404 response
	httpServerResponse = http.StatusNotFound
Contributor

I don't think you need this anymore?

Contributor

or the other two

Contributor Author

This is the response that the mux handler func writes on behalf of the peer. These tests ensure the behavior of the headLedger function is correct based on various possible peer responses.

Contributor

Yes, but in the error construction you are not using this httpServerResponse variable; you are using the http package directly.

@bbroder-algo
Contributor

great tests.

Comment on lines 354 to 355
	node.log.Warnf(err.Error())
	return MakeStartCatchpointError(catchpoint)
Contributor

Include the error too?

Suggested change
-	node.log.Warnf(err.Error())
-	return MakeStartCatchpointError(catchpoint)
+	node.log.Warnf(err.Error())
+	return MakeStartCatchpointError(catchpoint, err)

Contributor Author

Added in most recent commit.
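
A sketch of one way such a wrapped error could look; the actual MakeStartCatchpointError in node/error.go may be shaped differently.

package nodeerrsketch

import "fmt"

// StartCatchpointError reports that catchpoint catchup could not be started
// for a given label, keeping the underlying cause for errors.Is / errors.As.
type StartCatchpointError struct {
	Catchpoint string
	Err        error
}

func (e *StartCatchpointError) Error() string {
	return fmt.Sprintf("unable to start catchpoint catchup for '%s': %v", e.Catchpoint, e.Err)
}

func (e *StartCatchpointError) Unwrap() error { return e.Err }

// MakeStartCatchpointError mirrors the constructor shape used in the suggested change.
func MakeStartCatchpointError(catchpoint string, err error) error {
	return &StartCatchpointError{Catchpoint: catchpoint, Err: err}
}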

@Eric-Warehime
Contributor Author

Going to close and re-open since updates to the branch aren't being reflected on the PR for some reason?

winder previously approved these changes Jun 9, 2023

	// checkLedgerDownloadRetries is the number of times the catchpoint service will attempt to HEAD request the
	// ledger from peers when `Start`ing catchpoint catchup
	checkLedgerDownloadRetries = 10
Contributor

There is a config argument CatchupLedgerDownloadRetryAttempts which defaults to 50. Shouldn't this be the same value as that, rather than a new const?

Contributor

I still think we should use the configuration value CatchupLedgerDownloadRetryAttempts instead of adding a new hard-coded arbitrary constant.
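
A tiny sketch of wiring the pre-check to that existing config value instead of a new constant; the config struct and the per-peer check function below are stand-ins, not go-algorand's actual types.

package retriessketch

type localConfig struct {
	CatchupLedgerDownloadRetryAttempts uint64 // defaults to 50 per the comment above
}

// preCheck retries the per-peer HEAD check up to the configured number of attempts.
func preCheck(cfg localConfig, tryNextPeer func() error) error {
	var err error
	for i := uint64(0); i < cfg.CatchupLedgerDownloadRetryAttempts; i++ {
		if err = tryNextPeer(); err == nil {
			return nil
		}
	}
	return err
}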

Contributor

@algorandskiy algorandskiy left a comment

I do not understand how adding HEAD to fetch the catchpoint is any better than the existing loop in processStageLedgerDownload that tries to download catchpoint files. How does it help with the initial issue? Say there is no catchpoint; then both processStageLedgerDownload and HEAD advance to the next peer.
Please clarify.
In my opinion, if processStageLedgerDownload does not work it needs to be fixed rather than writing a new preprocessing workaround.

catchup/catchpointService.go (resolved comment thread)
@@ -156,11 +160,16 @@ func MakeNewCatchpointCatchupService(catchpoint string, node CatchpointCatchupNo
 }
 
 // Start starts the catchpoint catchup service ( continue in the process )
-func (cs *CatchpointCatchupService) Start(ctx context.Context) {
+func (cs *CatchpointCatchupService) Start(ctx context.Context) error {
+	err := cs.checkLedgerDownload()
Contributor

What if the stage is "catchpoint downloaded" and the node was restarted? In that case there is no need to go to the network at all.

Contributor Author

That's true. I think in the case of a resumed catchpoint service we should just succeed since we assume it was previously successfully started.

Contributor

So let's say someone runs Start() via the REST API. They get a 200 but nothing visible happens yet (it's busy downloading, validating, etc.). They run Start again, perhaps putting in a different label. They get a 200 again. They do it over and over, and they will keep getting 200s because of this check, even though the cs stage has advanced and their call to Start() and the resulting HEAD request has no impact on what algod is doing.

Contributor Author

The handlers invoke StartCatchup via the node which checks that the service isn't already running. So you'd only get a 200 the first time.

Contributor

Oh, the node calls Start(), not the handler! OK, I see. But the node doesn't know about the check, and neither does the handler... OK.

@@ -156,11 +160,16 @@ func MakeNewCatchpointCatchupService(catchpoint string, node CatchpointCatchupNo
 }
 
 // Start starts the catchpoint catchup service ( continue in the process )
-func (cs *CatchpointCatchupService) Start(ctx context.Context) {
+func (cs *CatchpointCatchupService) Start(ctx context.Context) error {
Contributor

@algorandskiy this function returning an error is the primary functional change. Today, if this function fails the catchup fails silently. By returning an error the API can return the error and show it to the user.
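
A sketch of how a REST handler could surface the returned error to the user, assuming a hypothetical catchupStarter interface rather than the actual daemon/algod/api handler code.

package handlersketch

import "net/http"

type catchupStarter interface {
	StartCatchup(catchpoint string) error
}

func startCatchupHandler(w http.ResponseWriter, node catchupStarter, catchpoint string) {
	if err := node.StartCatchup(catchpoint); err != nil {
		// Previously the failure was silent; now the caller sees why catchup could not start.
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}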

Contributor

It only solves the case where an invalid catchpoint is supplied or the node is not connected to a network. Probably fine. But again, it should only be checked if there is no pending catchpoint in progress, i.e. the service is started fresh, i.e. the check should go into MakeNewCatchpointCatchupService.

Contributor

@cce cce Jun 14, 2023

Agreed, it is a false positive to return 200 for these cases just because you did a HEAD check. However, the REST API handler code has access to the status, so it may be possible to prevent this at the handler level? Not sure if that covers all cases or not.

@Eric-Warehime
Contributor Author

@algorandskiy

I do not understand how adding HEAD to fetch catchpoint is any better than the existing loop in processStageLedgerDownload that tries to download catchpoint files. How does it ever help to the initial issue?

The issue this PR is trying to solve is returning an error when catchpoint catchup starts instead of optimistically succeeding. Currently you'll get a 200 response even if you give it an invalid, but properly formed, catchpoint label.

The head request isn't meant to be better than the ledger download, it's supposed to be nearly identical except it doesn't fetch the ledger contents. It's meant to be a quick check to see if the provided catchpoint would advance past the first step in the catchpoint process.

Do you think the better decision would be to just block the start catchpoint catchup request on the processLedgerDownload stage? That way we could ensure that the ledger download happens before launching the rest of the process in a goroutine and returning success?

@algorandskiy
Contributor

It looks like MakeNewCatchpointCatchupService is the place where the caller decided we want a new catchpoint download (state=ledger.CatchpointCatchupStateInactive, no db access to get prev state), so it appears to be the right place to pre-check availability.

	if err != nil {
		return fmt.Errorf("aborting catchup Start(): %s", err)
	// Only check catchpoint ledger validity if we're starting new
	if cs.stage == ledger.CatchpointCatchupStateInactive {
Contributor

still... why not in MakeService?

Contributor Author

It feels like bad practice to start making network requests during construction. We really care about whether or not we'll be able to fetch the ledger when the service is about to start.

It's not a big deal to me--I'm ok moving it to MakeService if needed.

Contributor

This is a valid point, so yes, Start fits better.
The last concern I have is about the following scenarios:

  1. At the moment there are 100+ relays and we only try 10.
  2. What if, for some reason, the requested catchpoint is found in Start but not found when the download + new peer selector is invoked?
     I think it would be nice to save a peer that definitely has this catchpoint and use that peer as the first one in the download loop (see the sketch below).
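
A sketch of the "remember the good peer" idea from item 2; the types and names below are illustrative, not go-algorand's actual peer selector.

package peersketch

type peerPreference struct {
	knownGoodPeer string // set by the HEAD pre-check; empty if none was found
}

// orderPeers puts the peer that already answered the HEAD check at the front,
// keeping the rest of the selection order unchanged.
func (pp *peerPreference) orderPeers(peers []string) []string {
	if pp.knownGoodPeer == "" {
		return peers
	}
	ordered := []string{pp.knownGoodPeer}
	for _, p := range peers {
		if p != pp.knownGoodPeer {
			ordered = append(ordered, p)
		}
	}
	return ordered
}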

@winder winder requested a review from algorandskiy June 23, 2023 18:36
Contributor

@winder winder left a comment

This is improving the catchup API. Changes look good, I'd like to merge it in.

Contributor

@algorandskiy algorandskiy left a comment

I still think this could be done better but probably OK for now
