Issue: Redis Cluster connect timeout not applied on first connection to node from cluster nodes that is down #2515

inakisoriamrf · 2024-06-27T08:17:17Z

We are working with a Redis cluster consisting of multiple shards, with 3 nodes per shard. When a node unexpectedly restarts (due to a hardware reset, for example), the first get call to the affected node throws an exception, as expected. On catching this exception we reconnect to the cluster; at that point, however, the CLUSTER NODES output does not yet report the affected node as failed.

The new cluster connection works until it tries to retrieve an element from the affected shard and opens a connection to the impaired node. At that point the script freezes and remains unresponsive until the affected server starts responding to ICMP again.

We suspect that the timeout settings are not applied during the initial connection attempt to a node listed in the CLUSTER NODES output.

Expected Behaviour

If a server listed in the CLUSTER NODES output is unreachable, the default timeout settings should apply, preventing the script from hanging.

Actual Behaviour

If a server listed in the CLUSTER NODES output is unreachable, the script hangs indefinitely when attempting to retrieve data from this node.

netstat shows the connection to the affected node stuck in SYN_SENT:

netstat -na | grep 6379
tcp        0      1 a.a.a.a:43860      b.b.b.b:6379      SYN_SENT
tcp        0    152 a.a.a.a:51812      b.b.b.b:6379      FIN_WAIT1

Environment

OS: Ubuntu
Redis: Any version
PHP: 8.1 and 7.4
phpredis: Latest (6.0.2)

Steps to Reproduce

Set up a Redis cluster with multiple shards, each shard having 3 nodes.
Trigger an unexpected hard restart of one of the nodes.
Attempt a get call to the affected node and observe the thrown exception.
Reconnect to the cluster before the cluster marks the node as failed; initially, node failure is not reported.
Attempt to retrieve an element from the affected shard.
Observe the script freezing until the affected node starts responding to ICMP (the Redis service itself does not need to be running).
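
The steps above can be sketched as a small script. This is a minimal sketch under stated assumptions, not a verified reproduction: the REDIS_SEEDS environment variable, the key name, and the 1.5-second timeouts are chosen purely for illustration, and timeCall and connectCluster are hypothetical helpers. The constructor arguments follow the phpredis RedisCluster signature (name, seeds, connect timeout, read timeout).

```php
<?php
// Sketch only: REDIS_SEEDS, the key name and the 1.5 s timeouts are
// illustrative assumptions, not values taken from the report.

/** Run $fn and return [result or caught Throwable, elapsed seconds]. */
function timeCall(callable $fn): array
{
    $start = microtime(true);
    try {
        $result = $fn();
    } catch (Throwable $e) {
        $result = $e;
    }
    return [$result, microtime(true) - $start];
}

/** Open a cluster connection with explicit connect and read timeouts. */
function connectCluster(array $seeds, float $timeout, float $readTimeout): RedisCluster
{
    $cluster = new RedisCluster(null, $seeds, $timeout, $readTimeout);
    // Same failover option as in the report: reads may go to slaves.
    $cluster->setOption(
        RedisCluster::OPT_SLAVE_FAILOVER,
        RedisCluster::FAILOVER_DISTRIBUTE_SLAVES
    );
    return $cluster;
}

// Runs only when seeds are supplied, e.g. REDIS_SEEDS="10.0.0.1:6379,10.0.0.2:6379".
$seedList = getenv('REDIS_SEEDS');
if ($seedList !== false && $seedList !== '') {
    $seeds = explode(',', $seedList);
    $key = 'key-on-affected-shard'; // hypothetical key that maps to the affected shard

    // Attempt 1 stands for the call that throws; attempt 2 reconnects and retries.
    for ($attempt = 1; $attempt <= 2; $attempt++) {
        [$result, $elapsed] = timeCall(
            fn () => connectCluster($seeds, 1.5, 1.5)->get($key)
        );
        $status = $result instanceof Throwable ? get_class($result) : 'ok';
        printf("attempt %d: %s after %.2f s\n", $attempt, $status, $elapsed);
    }
}
```

If the timeouts were honoured, each attempt would be expected to return or throw within roughly the configured limit. An elapsed time far beyond that on the second attempt would match the hang described here.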

Checklist

There is no similar issue reported by other users.
Issue is not resolved in the develop branch.

Update / additional notes

We are using

$obj_cluster->setOption(
    RedisCluster::OPT_SLAVE_FAILOVER, RedisCluster::FAILOVER_DISTRIBUTE_SLAVES
);

and the failing node is a slave.

michael-grunder self-assigned this Jul 1, 2024