catchup: Add HEAD request to catchpoint service start #5393
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5393 +/- ##
==========================================
+ Coverage 55.46% 55.60% +0.13%
==========================================
Files 447 447
Lines 63290 63349 +59
==========================================
+ Hits 35103 35224 +121
+ Misses 25807 25751 -56
+ Partials 2380 2374 -6
... and 14 files with indirect coverage changes
Looks great. Just had some minor suggestions.
catchup/catchpointService.go
Outdated
err := cs.checkLedgerDownload()
if err != nil {
	return err
}
Add some extra information about what we're doing as a result of the error. An error in checkLedgerDownload
could also be helpful.
Suggested change:
  err := cs.checkLedgerDownload()
  if err != nil {
-     return err
+     return fmt.Errorf("Start(): aborting catchup: %s", err)
  }
catchup/catchpointService.go
Outdated
return nil
	}
}
return err
Suggested change:
- return err
+ return fmt.Errorf("checkLedgerDownload(): catchpoint '%s' unavailable from peers: %s", cs.stats.CatchpointLabel, err)
catchup/ledgerFetcher.go
Outdated
switch response.StatusCode {
case http.StatusOK:
	return nil
case http.StatusNotFound: // server could not find a block with that round numbers.
Suggested change:
- case http.StatusNotFound: // server could not find a block with that round numbers.
+ case http.StatusNotFound: // server could not find a block with that round number.
node/node.go
Outdated
if err != nil {
	node.log.Warnf(err.Error())
	return MakeCatchpointNoPeersFoundError(catchpoint)
nit: this error seems too specific for this level of the code; there could be other errors now (e.g. a bad catchpoint label) or new conditions added in the future. I'd suggest creating the error type inside the catchup service, or making the error type more generic, such as UnableToStartCatchup.
response.Header().Set("Content-Type", LedgerResponseContentType)
if request.Method == http.MethodHead {
	response.WriteHeader(http.StatusOK)
	return
}
This ServeHTTP function is wild... I'm starting to see what you were talking about.
Also, nice. This is a great solution, I didn't realize you were implementing across the client and the server with this PR.
It just occurred to me: there is a period of time where new clients will be making HEAD requests against relays that have not yet been updated to the code which handles them.
What happens in that case?
If there is a known error for a missing HEAD verb, that could spot the difference between old and new relays. Alternatively, a feature flag or a two-phased release is possible. A feature flag is more controversial, I imagine.
I've been running catchup locally using my branch (as the client) against betanet which obviously has no peers running the new server code.
What happens is that it just returns 200 along with the entire payload, which gets ignored and then fetched again during the actual catchup.
So this will cause double the data transfer until the ledgerService is running the new code, which seems OK to me.
@@ -189,6 +195,11 @@ func (ls *LedgerService) ServeHTTP(response http.ResponseWriter, request *http.R
	}
}
defer cs.Close()
response.Header().Set("Content-Type", LedgerResponseContentType)
if request.Method == http.MethodHead {
clever
LGTM
require.Equal(t, fmt.Errorf("could not parse a host from url"), err)

// headLedger 404 response
httpServerResponse = http.StatusNotFound
I don't think you need this anymore?
or the other two
This is the response that the mux handler func writes on behalf of the peer. These tests ensure the behavior of the headLedger function is correct based on various possible peer responses.
Yes, but in the error construction you are not using this httpServerResponse variable; you are using the http package directly.
Great tests.
node/follower_node.go
Outdated
node.log.Warnf(err.Error())
return MakeStartCatchpointError(catchpoint)
Include the error too?
Suggested change:
  node.log.Warnf(err.Error())
- return MakeStartCatchpointError(catchpoint)
+ return MakeStartCatchpointError(catchpoint, err)
Added in most recent commit.
Going to close and re-open since updates to the branch aren't being reflected on the PR for some reason.
// checkLedgerDownloadRetries is the number of times the catchpoint service will attempt to HEAD request the
// ledger from peers when `Start`ing catchpoint catchup
checkLedgerDownloadRetries = 10
There is a config argument CatchupLedgerDownloadRetryAttempts which defaults to 50. Shouldn't this be the same value as that, rather than a new const?
I still think we should use the configuration value CatchupLedgerDownloadRetryAttempts instead of adding a new hard-coded arbitrary constant.
I do not understand how adding HEAD to fetch the catchpoint is any better than the existing loop in processStageLedgerDownload that tries to download catchpoint files. How does it ever help with the initial issue? Say there is no catchpoint; then both processStageLedgerDownload and HEAD advance to the next peer. Please clarify.
In my opinion, if processStageLedgerDownload does not work it needs to be fixed rather than writing a new preprocessing workaround.
catchup/catchpointService.go
Outdated
@@ -156,11 +160,16 @@ func MakeNewCatchpointCatchupService(catchpoint string, node CatchpointCatchupNo
}

// Start starts the catchpoint catchup service ( continue in the process )
-func (cs *CatchpointCatchupService) Start(ctx context.Context) {
+func (cs *CatchpointCatchupService) Start(ctx context.Context) error {
	err := cs.checkLedgerDownload()
What if the stage is "catchpoint downloaded" and the node was restarted? In that case there's no need to go out to the network at all.
That's true. I think in the case of a resumed catchpoint service we should just succeed since we assume it was previously successfully started.
So let's say someone runs Start() via the REST API. They get a 200, but nothing visible happens yet (it's busy downloading, validating, etc.). They run Start again, perhaps putting in a different label. They get a 200 again. They do it over and over; they will keep getting 200s because of this check, even though the cs stage has advanced and their call to Start() and the resulting HEAD request have no impact on what algod is doing.
The handlers invoke StartCatchup via the node which checks that the service isn't already running. So you'd only get a 200 the first time.
Oh, the node calls Start(), not the handler! OK, I see. But the node doesn't know about the check, and neither does the handler. OK.
@@ -156,11 +160,16 @@ func MakeNewCatchpointCatchupService(catchpoint string, node CatchpointCatchupNo
}

// Start starts the catchpoint catchup service ( continue in the process )
-func (cs *CatchpointCatchupService) Start(ctx context.Context) {
+func (cs *CatchpointCatchupService) Start(ctx context.Context) error {
@algorandskiy this function returning an error is the primary functional change. Today, if this function fails the catchup fails silently. By returning an error the API can return the error and show it to the user.
It only solves the case where an invalid catchpoint is supplied or the node is not connected to a network. Probably fine. But again, it should only be checked if there is no pending catchpoint in progress, i.e. the service is started fresh, i.e. the check should go into MakeNewCatchpointCatchupService.
Agreed, it is a false positive to return 200 for these cases just because you did a HEAD check. However, the REST API handler code has access to status, so it may be possible to prevent this at the handler level? Not sure whether that covers all cases.
The issue this PR is trying to solve is returning an error when catchpoint catchup starts instead of optimistically succeeding. Currently you'll get a 200 response even if you give it an invalid, but properly formed, catchpoint label. The HEAD request isn't meant to be better than the ledger download; it's supposed to be nearly identical except it doesn't fetch the ledger contents. It's meant to be a quick check to see if the provided catchpoint would advance past the first step in the catchpoint process.
Do you think the better decision would be to just block the start catchpoint catchup request on the processLedgerDownload stage? That way we could ensure that the ledger download happens before launching the rest of the process in a goroutine and returning success?
It looks like
if err != nil {
	return fmt.Errorf("aborting catchup Start(): %s", err)
// Only check catchpoint ledger validity if we're starting new
if cs.stage == ledger.CatchpointCatchupStateInactive {
still... why not in MakeService?
It feels like bad practice to start making network requests during construction. We really care about whether or not we'll be able to fetch the ledger when the service is about to start.
It's not a big deal to me--I'm ok moving it to MakeService if needed.
This is a valid point, so yes, Start fits better.
The last concern I have is about the following scenarios:
- at the moment there are 100+ relays and we only try 10
- what if, for some reason, the catchpoint requested is found in Start but not found when the download with the new peer selector is invoked?
I think it would be nice to save a peer that 100% has this catchpoint and use that peer as the first one in the download loop.
This is improving the catchup API. Changes look good, I'd like to merge it in.
I still think this could be done better but probably OK for now
Summary
Adds a HEAD request to the ledger endpoint on catchpoint service start. It will cycle through 2 peers to see if the requested ledger is available, and returns an error if it is not.
Addresses #3637
Test Plan
Unit tests added. Also added an e2e test which exercises this logic.