network:ws block byte limiter #5472
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #5472    +/-   ##
==========================================
+ Coverage   55.78%   55.81%   +0.02%
==========================================
  Files         446      446
  Lines       63253    63282      +29
==========================================
+ Hits        35288    35322      +34
+ Misses      25593    25584       -9
- Partials     2372     2376       +4
... and 7 files with indirect coverage changes
config/localTemplate.go
Outdated
@@ -524,6 +524,9 @@ type Local struct {
 	// BlockServiceHTTPMemCap is the memory capacity in bytes which is allowed for the block service to use for HTTP block requests.
 	// When it exceeds this capacity, it redirects the block requests to a different node
 	BlockServiceHTTPMemCap uint64 `version[28]:"500000000"`
+
+	// BlockServiceWSMemCap is the memory capacity in bytes which is allowed for the block service to use for websocket block requests.
+	BlockServiceWSMemCap int64 `version[28]:"500000000"`
I think we can make a single parameter and use it for both HTTP and WS. We have too many config params as it is...
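A minimal sketch of what that consolidation might look like. `BlockServiceMemCap` is a hypothetical name, and the struct below is a stand-in for illustration, not the real config/localTemplate.go:

```go
package config

// Hypothetical consolidation of the two knobs into one; BlockServiceMemCap
// is an illustrative name, not a field in this PR.
type Local struct {
	// BlockServiceMemCap is the memory capacity in bytes which is allowed for
	// the block service to use for HTTP and websocket block requests combined.
	// When it exceeds this capacity, HTTP requests are redirected to a
	// different node and websocket requests are dropped.
	BlockServiceMemCap uint64 `version[28]:"500000000"`
}
```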
network/wsNetwork.go
Outdated
@@ -142,6 +142,8 @@ var networkPrioBatchesPPWithoutCompression = metrics.MakeCounter(metrics.MetricN
 var networkPrioPPCompressedSize = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_prio_pp_compressed_size_total", Description: "cumulative size of all compressed PP"})
 var networkPrioPPNonCompressedSize = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_prio_pp_non_compressed_size_total", Description: "cumulative size of all non-compressed PP"})
+
+var networkCatchupMessagesDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_ue_messages_dropped", Description: "number of (UE) block catchup request messages dropped due to being at byte limit"})
If you want to add metrics, maybe we need the same one for HTTP?
Right now it's unused, since my change removed its use. Should I remove it, or add both?
I think we want both of these metrics.
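A sketch of keeping both counters, per this thread. The WS counter matches the one in the diff above; the HTTP counter's name and description are illustrative assumptions:

```go
package rpcs

import "github.com/algorand/go-algorand/util/metrics"

// WS counter as in this PR's diff.
var wsBlockMessagesDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_network_ue_messages_dropped",
	Description: "number of (UE) block catchup request messages dropped due to being at byte limit"})

// Hypothetical HTTP counterpart, bumped where the HTTP path redirects.
var httpBlockMessagesDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_http_block_requests_dropped",
	Description: "number of HTTP block requests redirected due to being at byte limit"})
```

At each rejection site the matching counter would then be bumped with `Inc(nil)`.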
network/wsNetwork.go
Outdated
@@ -470,6 +472,13 @@ type WebsocketNetwork struct {
 
 	// resolveSRVRecords is a function that resolves SRV records for a given service, protocol and name
 	resolveSRVRecords func(service string, protocol string, name string, fallbackDNSResolverAddress string, secure bool) (addrs []string, err error)
+
I think this is too much complication in the network layer.
I have a prototype suggestion here: iansuvak#4
Keep the accounting similar to #5428 and use a callback as in iansuvak#4
	// If we are over-capacity, we will not process the request
	// respond to sender with error message
	memUsed := atomic.LoadUint64(&bs.memoryUsed)
	if memUsed > bs.memoryCap {
Do we want to just not return a response here? Or at least not give the peer memory information?
Great question.
We need to do something similar to what we do for the HTTP request. There we got it for free, using the protocol's retry message.
Here, we need to implement on the requester side how to react to this error.
In this case, the requester should request the block from another peer. It cannot wait and request again, since these requests are time sensitive.
This is what it already does after a 4-second timeout even if you return nothing, so not returning a response should be the same as returning an error response.
For WS, 4 seconds is a very long time if we want to address this properly.
The request will come from the agreement service, so it needs the block in milliseconds, not seconds :-)
If there is a systemic cause for not receiving the proposed block (i.e., network disruption), then for agreement to move forward it will need the blocks serviced as fast as possible. 4 seconds might be okay, and we get a 20-second round, but this may get longer very quickly, and we can easily do better.
Again, this is a product question: should we add a little more protocol implementation to address this rare situation, or leave it as is?
Returning an error in that case should still be good enough, but 4-second timeouts do happen somewhat regularly right now.
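A sketch of the error-response option being discussed here, assuming Respond keeps its pre-PR Topics signature. MakeTopic and ErrorKey follow network/topics.go; the helper name and error text are illustrative:

```go
package rpcs

import (
	"context"

	"github.com/algorand/go-algorand/network"
)

// rejectIfOverCapacity replies with an error topic when the byte budget is
// exhausted, so the requester can fail over to another peer immediately
// instead of waiting out its timeout.
func rejectIfOverCapacity(ctx context.Context, target network.UnicastPeer,
	reqMsg network.IncomingMessage, memUsed, memCap uint64) bool {
	if memUsed <= memCap {
		return false // within budget; serve the request normally
	}
	errTopic := network.MakeTopic(network.ErrorKey, []byte("block service over capacity"))
	// best effort: if this response cannot be sent, the requester still
	// falls back to its own timeout
	_ = target.Respond(ctx, reqMsg, network.Topics{errTopic})
	return true
}
```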
network/wsPeer.go
Outdated
@@ -410,7 +411,7 @@ func (wp *wsPeer) Respond(ctx context.Context, reqMsg IncomingMessage, responseT
 	}
 
 	select {
-	case wp.sendBufferBulk <- sendMessages{msgs: msg}:
+	case wp.sendBufferBulk <- sendMessages{msgs: msg, callback: reqMsg.Callback}:
If this select case is not picked, you may leak the counter decrement.
For now, this likely won't matter, since ctx is canceled only when the service is shutting down.
Even if that is the case, we are still vulnerable to future changes.
Better to handle the case here and make the behavior robust irrespective of why, or by whom, the channel is closed or the context canceled.
Agreed, thanks!
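A sketch of the leak-free enqueue being agreed on here. The types are minimal stand-ins so the snippet compiles on its own; the point is that whichever select branch wins, the accounting callback is consumed exactly once:

```go
package network

import (
	"context"
	"errors"
)

// sendMessages mirrors this PR's struct: a message batch plus an optional
// release callback used for byte accounting.
type sendMessages struct {
	msgs     [][]byte
	callback func()
}

var errPeerClosing = errors.New("peer closing")

// enqueueWithRelease hands the message to the write loop, or, if the peer is
// closing or the context is done, invokes the callback itself so the byte
// counter cannot leak no matter why the send was abandoned.
func enqueueWithRelease(ctx context.Context, sendBufferBulk chan<- sendMessages,
	closing <-chan struct{}, msg sendMessages) error {
	select {
	case sendBufferBulk <- msg:
		return nil // the write loop now owns the callback
	case <-closing:
		if msg.callback != nil {
			msg.callback()
		}
		return errPeerClosing
	case <-ctx.Done():
		if msg.callback != nil {
			msg.callback()
		}
		return ctx.Err()
	}
}
```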
network/wsNetwork.go
Outdated
+var networkCatchupMessagesDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_ue_messages_dropped", Description: "number of (UE) block catchup request messages dropped due to being at byte limit"})
Note the lint warning about this var being unused.
Indeed. I was originally using it but removed that use. There's a conversation above asking whether we should record this for WS only, for both WS and HTTP rejections, or for neither.
I guess having a callback for when a response leaves this node is fine, but it only works for topics, since Respond takes IncomingMessage as an argument; it is not a generic implementation, so it should not be part of IncomingMessage.
The Respond handler is used only in blockService (the usage in wsNetwork appears to be unreachable, since the Respond action is not used).
I think the Respond handler should be refactored to accept OutgoingMessage instead of responseTopics Topics, especially since OutgoingMessage has a Topics field, so no functionality will be lost. Adding OnRelease (or OnSent) to OutgoingMessage would make it more generic/usable.
Maybe call it "OnResponseSent" instead of "OnRelease"?
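A sketch of the refactor being suggested, with minimal stand-in types so it compiles on its own. OnRelease/OnResponseSent are the reviewers' proposed names, not merged API:

```go
package network

import "context"

// Minimal stand-ins for the real network package types.
type Topic struct {
	Key  string
	Data []byte
}
type Topics []Topic
type IncomingMessage struct{} // fields elided

// OutgoingMessage carries the response payload in Topics, so switching
// Respond from `responseTopics Topics` to OutgoingMessage loses nothing.
// The optional hook fires once the response has left this node.
type OutgoingMessage struct {
	Topics    Topics
	OnRelease func() // or OnResponseSent, per the naming suggestion above
}

// The refactored handler shape: blockService would set OnRelease to its
// byte-accounting release function.
type UnicastPeer interface {
	Respond(ctx context.Context, reqMsg IncomingMessage, outMsg OutgoingMessage) error
}
```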
Looks good; please fix the reviewdog warning and the failing test.
rpcs/blockService.go
Outdated
	}
	atomic.AddUint64(&bs.wsMemoryUsed, (n))
}
target.Respond(ctx, reqMsg, outMsg)
Maybe log an error here?
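A sketch of that logging, assuming target.Respond still returns an error and bs.log is the block service's logger as elsewhere in blockService.go; the message text is illustrative:

```go
if err := target.Respond(ctx, reqMsg, outMsg); err != nil {
	// don't swallow the failure: the peer never got its block response
	bs.log.Warnf("BlockService: failed to respond to catchup request: %v", err)
}
```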
case <-wp.closing:
	outMsg.OnRelease()
Don't you need to check whether OnRelease is nil or not?
	wp.net.log.Debugf("peer closing %s", wp.conn.RemoteAddr().String())
	return
case <-ctx.Done():
	outMsg.OnRelease()
Similar here.
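A sketch of the nil-guarded cleanup both comments ask for, in the shape of the hunks above; the surrounding write-loop code is from this PR, and only the guards are new:

```go
case <-wp.closing:
	if outMsg.OnRelease != nil { // responses without byte accounting carry no hook
		outMsg.OnRelease()
	}
	wp.net.log.Debugf("peer closing %s", wp.conn.RemoteAddr().String())
	return
case <-ctx.Done():
	if outMsg.OnRelease != nil {
		outMsg.OnRelease()
	}
	return
```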
Summary
This is the websocket counterpart of the HTTP block server PR implemented in #5428.
The goal is to limit the number of bytes' worth of concurrent block requests that we can serve at a given time. It counts the bytes as used from the moment they land in the send queue channel and subtracts them once they are sent or the connection is terminated.
The tricky implementation choice here is where to track the number of bytes. Ideally we would do it in blockService.go, as the HTTP PR does, but the bytes aren't freed until the network package is done with them, and the network doesn't have access to the block service. We would have to change the interface so that messages can communicate back to the block service, perhaps via a channel? All thoughts and opinions are much appreciated.
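A sketch of the accounting lifecycle described above, with illustrative names: bytes are reserved when a response enters the send queue and released exactly once when it is sent or the connection dies:

```go
package rpcs

import (
	"sync"
	"sync/atomic"
)

// byteBudget tracks in-flight response bytes against a fixed cap.
type byteBudget struct {
	used uint64 // accessed atomically
	cap  uint64
}

// tryReserve accounts n bytes and hands back an idempotent release func,
// which the block service would pass to the network layer as the message's
// callback. It returns ok=false when the budget is already exhausted.
func (b *byteBudget) tryReserve(n uint64) (release func(), ok bool) {
	if atomic.LoadUint64(&b.used) > b.cap {
		return nil, false // over capacity: drop or redirect the request
	}
	atomic.AddUint64(&b.used, n)
	var once sync.Once
	return func() {
		once.Do(func() {
			atomic.AddUint64(&b.used, ^(n - 1)) // atomic decrement by n
		})
	}, true
}
```

Like the check in the diff above, this check-then-add allows one in-flight request to overshoot the cap before further requests are rejected.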
Test Plan
I don't have tests written yet, since I wanted to get feedback on the approach first.
I will write new tests focusing on ensuring that the send queue is drained properly in different cases.
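A sketch of one such drain test, reusing the hypothetical enqueueWithRelease helper and sendMessages stand-in from the earlier sketch: close the peer before the send can win, and assert the release callback still fired:

```go
package network

import (
	"context"
	"testing"
)

func TestSendQueueDrainReleasesBytes(t *testing.T) {
	released := false
	msg := sendMessages{callback: func() { released = true }}

	closing := make(chan struct{})
	close(closing) // simulate a peer shutting down before the enqueue wins

	// unbuffered channel with no reader: the send branch can never be chosen
	err := enqueueWithRelease(context.Background(), make(chan sendMessages), closing, msg)
	if err == nil || !released {
		t.Fatalf("expected release on closed peer, err=%v released=%v", err, released)
	}
}
```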