Issue: Redis Cluster connect timeout not applied on first connection to node from cluster nodes that is down #2515

Open
2 tasks done
inakisoriamrf opened this issue Jun 27, 2024 · 0 comments

We are working with a Redis cluster consisting of multiple shards, with 3 nodes per shard. When a node restarts unexpectedly (for example, due to a hardware reset), the first get call to the affected node throws an exception, as expected. On catching this exception we reconnect to the cluster; at that point, however, the cluster nodes do not yet report the affected node as failed.

The new cluster connection works until it tries to retrieve an element from the affected shard and opens a connection to the impaired node. At this point the script freezes and remains unresponsive until the affected server starts responding to ICMP.

We suspect that the timeout settings are not being applied correctly during the initial connection attempt to a node retrieved from cluster nodes.

Expected Behaviour

If a server retrieved from cluster nodes is unreachable, the default timeout settings should apply, preventing the script from hanging.

Actual Behaviour

If a server retrieved from cluster nodes is unreachable, the script hangs indefinitely when attempting to retrieve data from this node.

netstat shows the connection attempt to the affected node stuck in SYN_SENT:

netstat -na | grep 6379
tcp        0      1 a.a.a.a:43860      b.b.b.b:6379      SYN_SENT
tcp        0    152 a.a.a.a:51812      b.b.b.b:6379      FIN_WAIT1
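
One way to separate host-level behaviour from phpredis behaviour is to time a plain TCP connect to the affected node with an explicit timeout. The following is a minimal sketch, not part of phpredis: `probeNode` is a hypothetical helper name, the 2-second timeout is an illustrative value, and the node address is expected on the command line.

```php
<?php
// Hypothetical helper (not part of phpredis): attempts a plain TCP connect
// with an explicit timeout and reports how long the attempt took.
function probeNode(string $host, int $port, float $timeout): array
{
    $errno = 0;
    $errstr = '';
    $start = microtime(true);
    // The 4th argument of stream_socket_client() is the connect timeout in seconds.
    $sock = @stream_socket_client("tcp://{$host}:{$port}", $errno, $errstr, $timeout);
    $elapsed = microtime(true) - $start;
    if ($sock !== false) {
        fclose($sock);
    }
    return ['connected' => $sock !== false, 'elapsed' => $elapsed, 'error' => $errstr];
}

// Usage: php probe.php <node-address> [port]
if (isset($argv[1])) {
    $port = isset($argv[2]) ? (int) $argv[2] : 6379;
    $result = probeNode($argv[1], $port, 2.0); // 2.0 s is an assumed value
    printf(
        "connected=%s elapsed=%.2fs error=%s\n",
        $result['connected'] ? 'yes' : 'no',
        $result['elapsed'],
        $result['error']
    );
}
```

If this probe gives up after roughly the configured timeout while the RedisCluster call keeps hanging, that would be consistent with the suspicion that the timeout is not applied on the phpredis side.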

Environment

  • OS: Ubuntu
  • Redis: Any version
  • PHP: 8.1 and 7.4
  • phpredis: Latest (6.0.2)

Steps to Reproduce

  1. Set up a Redis cluster with multiple shards, each shard having 3 nodes.
  2. Trigger an unexpected hard restart of one of the nodes.
  3. Attempt a get call to the affected node and observe the thrown exception.
  4. Reconnect to the cluster before the cluster marks the node as failed; initially, node failure is not reported.
  5. Attempt to retrieve an element from the affected shard.
  6. Observe the script freezing until the affected node starts responding to ICMP (the Redis service itself does not need to be running).
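
Steps 3 to 5 can be sketched as a small script. This is an illustration under assumptions, not the script from our environment: the helper names `connectCluster` and `timedGet` are hypothetical, the key and seed addresses are supplied on the command line, and explicit 1.5-second connect and read timeouts are passed so that any hang is clearly longer than the configured value.

```php
<?php
// Hypothetical helper: opens a cluster connection with explicit timeouts.
// The 1.5 s values are assumptions chosen for illustration.
function connectCluster(array $seeds): RedisCluster
{
    // Arguments: name, seeds, connect timeout (s), read timeout (s).
    $cluster = new RedisCluster(null, $seeds, 1.5, 1.5);
    $cluster->setOption(
        RedisCluster::OPT_SLAVE_FAILOVER,
        RedisCluster::FAILOVER_DISTRIBUTE_SLAVES
    );
    return $cluster;
}

// Hypothetical helper: runs one get and records how long it took.
function timedGet(RedisCluster $cluster, string $key): array
{
    $start = microtime(true);
    $error = null;
    try {
        $cluster->get($key);
    } catch (RedisClusterException $e) {
        $error = $e->getMessage();
    }
    return ['elapsed' => microtime(true) - $start, 'error' => $error];
}

// Usage: php repro.php <key-on-affected-shard> <seed-host:port> [<seed-host:port> ...]
if (extension_loaded('redis') && isset($argv) && count($argv) >= 3) {
    $key = $argv[1];
    $seeds = array_slice($argv, 2);

    $cluster = connectCluster($seeds);
    $first = timedGet($cluster, $key);          // step 3: expected to fail
    if ($first['error'] !== null) {
        $cluster = connectCluster($seeds);      // step 4: reconnect
    }
    $second = timedGet($cluster, $key);         // step 5: this is where the hang is seen
    printf("first get:  %.2fs (%s)\n", $first['elapsed'], $first['error'] ?? 'ok');
    printf("second get: %.2fs (%s)\n", $second['elapsed'], $second['error'] ?? 'ok');
}
```

Because reads are distributed across replicas with this failover option, the second get may need to be repeated before it reaches the failing node.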

Checklist

  • There is no similar issue reported by other users.
  • Issue is not resolved in the develop branch.

Update / additional notes

We are using the following failover option:

$obj_cluster->setOption(
    RedisCluster::OPT_SLAVE_FAILOVER, RedisCluster::FAILOVER_DISTRIBUTE_SLAVES
);

The failing node is a slave.

@michael-grunder michael-grunder self-assigned this Jul 1, 2024