Relayer slows down exponentially in some circumstances #2008

Closed
@romac

Description

Summary of Bug

The relayer sometimes slows down exponentially.

The cause of the slowness is that the height of the events we pull from the subscriptions drifts exponentially from the real latest height, because we are not pulling them from the event monitor stream fast enough.

We are trying to get an event from the stream of events (with try_recv_multiple) every 500ms.

So with two chains we have:

  • call try_recv_multiple and get a NewBlock from chain A
  • wait 500ms
  • call try_recv_multiple and get a NewBlock from chain B
  • wait 500ms
  • call try_recv_multiple and get a NewBlock from chain A
  • etc.

So we get an event per chain roughly once every second.

With three chains:

  • call try_recv_multiple and get a NewBlock from chain A
  • wait 500ms
  • call try_recv_multiple and get a NewBlock from chain B
  • wait 500ms
  • call try_recv_multiple and get a NewBlock from chain C
  • wait 500ms
  • call try_recv_multiple and get a NewBlock from chain A
  • etc.

So we get a NewBlock event per chain only once every 1.5s.

But since the block time for testing is 1s, each chain emits a NewBlock every second while we consume one per chain only every 1.5s, so we end up drifting further and further behind.
I guess that's why we only see this in testing and not in prod, because in prod we pull events often enough relative to the block time that we are always up to date.

The problem gets worse the lower the block time and the higher the number of chains the relayer is connected to.
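
To make the arithmetic above concrete, here is a rough back-of-the-envelope model in Rust. The function names and the model itself are illustrative assumptions, not Hermes code: it assumes every chain emits exactly one NewBlock per block time and that the relayer consumes at most one event per poll. It only counts the backlog of unconsumed events and does not try to model any further effects downstream in the relayer.

```rust
/// Time between two consecutive events pulled for the same chain, assuming
/// the relayer pulls one event per poll and the chains take turns.
fn per_chain_interval_ms(chains: u64, poll_interval_ms: u64) -> u64 {
    chains * poll_interval_ms
}

/// Number of events produced but not yet consumed after `elapsed_ms`,
/// assuming one NewBlock per chain per `block_time_ms` and at most one
/// event consumed per `poll_interval_ms`.
fn backlog_after(
    chains: u64,
    block_time_ms: u64,
    poll_interval_ms: u64,
    elapsed_ms: u64,
) -> u64 {
    let produced = chains * (elapsed_ms / block_time_ms);
    let consumed_at_most = elapsed_ms / poll_interval_ms;
    // A backlog cannot be negative: if we poll fast enough, it stays at 0.
    produced.saturating_sub(consumed_at_most)
}
```

Under these assumptions, `backlog_after(3, 1000, 500, 60_000)` is 60 (three chains, 1s blocks, one minute of running), while `backlog_after(2, 1000, 500, 60_000)` is 0, which matches the observation that two chains keep up and three do not.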

To fix this, we should use a blocking recv_multiple on the subscriptions stream so that we get the events as fast as they are emitted, which solves the drift.
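
As a minimal sketch of the blocking approach, the snippet below uses `std::sync::mpsc` as a stand-in for the subscription streams. This is an assumption made for illustration: the `NewBlock` struct and `consume_blocking` function are hypothetical, and the actual `recv_multiple` in Hermes may be built differently. The point is only that a blocking `recv` wakes up as soon as any chain emits an event, with no fixed sleep between events.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a chain event; not a Hermes type.
#[derive(Debug, Clone, PartialEq)]
struct NewBlock {
    chain: &'static str,
    height: u64,
}

/// Spawns one producer thread per chain, all feeding a single channel,
/// and consumes events with a blocking `recv` instead of poll-then-sleep.
fn consume_blocking(
    chains: &[&'static str],
    blocks_per_chain: u64,
    block_time: Duration,
) -> Vec<NewBlock> {
    let (tx, rx) = mpsc::channel();
    let mut producers = Vec::new();

    for &chain in chains {
        let tx = tx.clone();
        producers.push(thread::spawn(move || {
            for height in 1..=blocks_per_chain {
                thread::sleep(block_time); // simulate the block time
                tx.send(NewBlock { chain, height }).expect("receiver is alive");
            }
        }));
    }
    // Drop the original sender so `recv` returns Err once all producers finish.
    drop(tx);

    let mut received = Vec::new();
    // Blocks until the next event arrives, from whichever chain emits first.
    while let Ok(event) = rx.recv() {
        received.push(event);
    }

    for producer in producers {
        producer.join().expect("producer thread panicked");
    }
    received
}
```

Calling `consume_blocking(&["A", "B", "C"], 5, Duration::from_millis(10))` returns all 15 events, with the heights of each chain in order, no matter how many chains are added, because the consumer is paced by the producers rather than by a timer.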

Version

v0.13.0-rc0

Steps to Reproduce

  1. Spawn 3 chains with a block time of 1s
  2. Create a channel between 2 chains
  3. Start Hermes
  4. Wait a few minutes
  5. Do an ft-transfer
  6. See that the relayer only processes the transfer after a long time
  7. Wait more
  8. Do another ft-transfer
  9. It takes even longer until the relayer processes the transfer

Acceptance Criteria

The relayer does not exhibit this issue anymore.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

Metadata

Labels

  • A: bug (Admin: something isn't working)
  • I: logic (Internal: related to the relaying logic)
  • O: performance (Objective: cause to improve performance)
