Update network disconnection logic (algorand#668)
In previous releases, gossip network connections were randomly selected and then kept stable.
That approach has its advantages, but it is also susceptible to two major problems: inefficient connections and clique formation.

This PR attempts to address both of these issues. It adds an attribute to each outgoing connection that marks it as a "throttled connection". A throttled connection is a connection whose performance is tested continuously. The performance comparison is always done over the set of outgoing connections.

Once a performance test completes, we disconnect the throttled connection that performed the worst. This lets us continuously try to improve our pool of outgoing connections in favor of the fastest relays.
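
The selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: the `outgoingPeer` type, its fields, and `worstThrottledPeer` are hypothetical names, and "performance" is assumed to be a single average relative message delay where a larger value means a slower peer.

```go
package main

import "fmt"

// outgoingPeer is a hypothetical stand-in for an outgoing connection.
type outgoingPeer struct {
	endpoint     string
	throttled    bool  // true if this connection may be dropped for poor performance
	messageDelay int64 // assumed metric: average relative message delay, larger is slower
}

// worstThrottledPeer returns the index of the throttled peer with the largest
// delay, or -1 when no peer is throttled. Delays are assumed to have been
// measured relative to all outgoing peers, but only throttled peers are
// candidates for disconnection.
func worstThrottledPeer(peers []outgoingPeer) int {
	worst := -1
	for i, p := range peers {
		if !p.throttled {
			continue
		}
		if worst == -1 || p.messageDelay > peers[worst].messageDelay {
			worst = i
		}
	}
	return worst
}

func main() {
	peers := []outgoingPeer{
		{endpoint: "relay-a:4160", throttled: true, messageDelay: 120},
		{endpoint: "relay-b:4160", throttled: false, messageDelay: 900},
		{endpoint: "relay-c:4160", throttled: true, messageDelay: 450},
	}
	if i := worstThrottledPeer(peers); i >= 0 {
		fmt.Println("disconnecting", peers[i].endpoint)
	}
}
```

In this example relay-b is the slowest peer overall, but it is not throttled, so relay-c is the one selected for disconnection.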

Unfortunately, the above would also expedite the formation of cliques. Prior to this PR, our response to clique formation was to restart the node.
In this PR, we have two separate mechanisms to handle cliques:

1. Avoidance: When creating a relay node, we configure only half of its outgoing connections to be throttled connections. That reduces the likelihood of clique formation. For a non-relay, that's not an issue.
2. Fixing an existing clique: This PR adds a dedicated feedback path between the agreement service and the network layer. The network layer exposes a watchdog-style handler that the agreement service invokes. That allows the network library to determine whether it is unable to make progress. When that happens, a random outgoing connection is disconnected.
tsachiherman authored Mar 17, 2020
1 parent 25d0d6e commit 7b17012
Showing 11 changed files with 616 additions and 40 deletions.
3 changes: 3 additions & 0 deletions components/mocks/mockNetwork.go
@@ -96,3 +96,6 @@ func (network *MockNetwork) ClearHandlers() {
// RegisterHTTPHandler - empty implementation
func (network *MockNetwork) RegisterHTTPHandler(path string, handler http.Handler) {
}

// OnNetworkAdvance - empty implementation
func (network *MockNetwork) OnNetworkAdvance() {}
4 changes: 4 additions & 0 deletions config/config.go
@@ -265,6 +265,10 @@ type Local struct {

// EnablePingHandler controls whether the gossip node would respond to ping messages with a pong message.
EnablePingHandler bool

// DisableOutgoingConnectionThrottling disables the connection throttling of the network library, which
// allows the network library to continuously disconnect relays based on their relative (and absolute) performance.
DisableOutgoingConnectionThrottling bool
}

// Filenames of config files within the configdir (e.g. ~/.algorand)
3 changes: 2 additions & 1 deletion config/local_defaults.go
@@ -108,6 +108,7 @@ var defaultLocalV5 = Local{
ConnectionsRateLimitingCount: 60,
ConnectionsRateLimitingWindowSeconds: 1,
DeadlockDetection: 0,
DisableOutgoingConnectionThrottling: false,
DNSBootstrapID: "<network>.algorand.network",
EnableAgreementReporting: false,
EnableAgreementTimeMetrics: false,
@@ -130,6 +131,7 @@ var defaultLocalV5 = Local{
NodeExporterPath: "./node_exporter",
OutgoingMessageFilterBucketCount: 3,
OutgoingMessageFilterBucketSize: 128,
PeerConnectionsUpdateInterval: 3600,
ReconnectTime: 1 * time.Minute, // Was 60ns
ReservedFDs: 256,
RestReadTimeoutSeconds: 15,
@@ -143,7 +145,6 @@ var defaultLocalV5 = Local{
TxSyncIntervalSeconds: 60,
TxSyncTimeoutSeconds: 30,
TxSyncServeResponseSize: 1000000,
PeerConnectionsUpdateInterval: 3600,
// DO NOT MODIFY VALUES - New values may be added carefully - See WARNING at top of file
}

2 changes: 2 additions & 0 deletions installer/config.json.example
@@ -9,6 +9,7 @@
"CatchupParallelBlocks": 16,
"ConnectionsRateLimitingCount": 60,
"ConnectionsRateLimitingWindowSeconds": 1,
"DisableOutgoingConnectionThrottling": false,
"DeadlockDetection": 0,
"DNSBootstrapID": "<network>.algorand.network",
"EnableIncomingMessageFilter": false,
@@ -33,6 +34,7 @@
"NodeExporterPath": "./node_exporter",
"OutgoingMessageFilterBucketCount": 3,
"OutgoingMessageFilterBucketSize": 128,
"PeerConnectionsUpdateInterval": 3600,
"PriorityPeers": {},
"ReconnectTime": 60000000000,
"ReservedFDs": 256,
6 changes: 6 additions & 0 deletions logging/telemetryspec/event.go
@@ -184,6 +184,10 @@ type PeerEventDetails struct {
HostName string
Incoming bool
InstanceName string
// Endpoint is the dialed-to address for an outgoing connection. Not used for incoming connections.
Endpoint string `json:",omitempty"`
// MessageDelay is the average relative message delay. Not used for incoming connections.
MessageDelay int64 `json:",omitempty"`
}

// ConnectPeerFailEvent event
@@ -275,4 +279,6 @@ type PeerConnectionDetails struct {
ConnectionDuration uint
// Endpoint is the dialed-to address for an outgoing connection. Not used for incoming connections.
Endpoint string `json:",omitempty"`
// MessageDelay is the average relative message delay. Not used for incoming connections.
MessageDelay int64 `json:",omitempty"`
}
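
The `omitempty` tags on the new telemetry fields mean that zero values are left out of the encoded JSON, which is how the fields stay absent for incoming connections. The sketch below illustrates this with the standard `encoding/json` package; `peerDetails` is a trimmed, hypothetical copy of the fields shown in the diff, and `encode` is a helper introduced only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// peerDetails is a trimmed, hypothetical copy of the telemetry struct fields.
type peerDetails struct {
	HostName     string
	Incoming     bool
	Endpoint     string `json:",omitempty"`
	MessageDelay int64  `json:",omitempty"`
}

// encode marshals the details to a JSON string.
func encode(d peerDetails) string {
	b, err := json.Marshal(d)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Incoming connection: Endpoint and MessageDelay are zero and get omitted.
	fmt.Println(encode(peerDetails{HostName: "node-1", Incoming: true}))
	// prints {"HostName":"node-1","Incoming":true}

	// Outgoing connection: both fields are set and appear in the output.
	fmt.Println(encode(peerDetails{HostName: "relay-a", Endpoint: "relay-a:4160", MessageDelay: 42}))
	// prints {"HostName":"relay-a","Incoming":false,"Endpoint":"relay-a:4160","MessageDelay":42}
}
```

One consequence of this choice is that an outgoing connection whose delay is exactly 0 is encoded the same way as one with no measurement at all.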