fix(connlib): discard timer once it fired #7288
Merged
Within `connlib`, we have many nested state machines. Many of them have internal timers, expressed as timestamps, with which they indicate when they'd like to be "woken" to perform time-related processing. For example, the `Allocation` state machine would indicate, with a timestamp 5 minutes from the time an allocation is created, that it needs to be woken again in order to send the refresh message to the relay.

When we reset our network connections, we pretty much discard all state within `connlib` and, together with that, all of these timers. Thus the `poll_timeout` function would return `None`, indicating that our state machines are not waiting for anything.

Within the event loop, the outermost state machine, i.e. `ClientState`, is paired with an `Io` component that actually implements the timer by scheduling a wake-up at the earliest point requested by any of the state machines.

In order to not fire the same timer multiple times in a row, we already intended to reset the timer once it fired. It turns out that this never worked and the timer still lingered around.
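The interplay described above can be illustrated with a minimal, self-contained sketch. This is not the actual `connlib` code: the `Timer` type, its `update` and `poll` methods, `add_allocation`, and the struct fields are hypothetical stand-ins, and the real `Io` component presumably wraps an async timer rather than a plain `Option<Instant>`. The sketch only shows the intended behaviour: the outer state machine reports the earliest wake-up of its children, and the timer discards its deadline once it has fired.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a nested state machine such as `Allocation`.
struct Allocation {
    /// When the next refresh message is due.
    refresh_at: Instant,
}

/// Hypothetical stand-in for the outermost state machine.
struct ClientState {
    allocations: Vec<Allocation>,
}

impl ClientState {
    fn add_allocation(&mut self, now: Instant) {
        // The 5-minute refresh interval is taken from the description above.
        let refresh_at = now + Duration::from_secs(5 * 60);
        self.allocations.push(Allocation { refresh_at });
    }

    /// Earliest wake-up requested by any child, or `None` if nothing is pending.
    fn poll_timeout(&self) -> Option<Instant> {
        self.allocations.iter().map(|a| a.refresh_at).min()
    }

    /// Discards all state and, with it, all timers.
    fn reset(&mut self) {
        self.allocations.clear();
    }
}

/// Hypothetical stand-in for the timer half of the `Io` component.
struct Timer {
    deadline: Option<Instant>,
}

impl Timer {
    /// Re-arms the timer. A `None` leaves the current deadline untouched,
    /// mirroring the behaviour described in this PR.
    fn update(&mut self, next: Option<Instant>) {
        if let Some(deadline) = next {
            self.deadline = Some(deadline);
        }
    }

    /// Returns `true` if the timer is due and discards the deadline,
    /// so the same timer cannot fire a second time.
    fn poll(&mut self, now: Instant) -> bool {
        match self.deadline {
            Some(deadline) if deadline <= now => {
                self.deadline = None;
                true
            }
            _ => false,
        }
    }
}
```

In this sketch, if `poll` left `self.deadline` in place, then after a `reset` (where `poll_timeout` returns `None` and `update` changes nothing) every later call to `poll` would report the timer as due again, which corresponds to the lingering timer described above.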
When we call `reset`, `poll_timeout`, which feeds this timer, returns `None`, and the timer doesn't get updated until `poll_timeout` finally returns `Some` with an `Instant` again. Because the previous timer didn't get cleared when it fired, this caused `connlib` to busy-loop and prevented some(?) other parts of it from progressing, resulting in us never being able to reconnect to the portal. Yet, because the event loop itself was still operating, we could still resolve DNS queries and such.

Resolves: #7254.