
geo-replication: fix for secondary node fail-over #3959

Merged

Conversation

sanjurakonde
Member

@sanjurakonde sanjurakonde commented Jan 18, 2023

Problem: When a geo-replication session is set up, all the gsyncd slave
processes come up on the host that was used to create the geo-rep
session. When this primary slave node goes down, all the bricks go
into a Faulty state.

Cause: When the monitor process tries to connect to the remote secondary
node, it always uses remote_addr as the hostname. This variable
holds the hostname of the node that was used to create the geo-rep
session, so the gsyncd slave processes always come up on that
primary slave node. When this node goes down, the monitor process cannot
bring up the gsyncd slave processes, and the bricks go into a Faulty state.

Fix: Instead of remote_addr, use resource_remote, which holds the
hostname of a randomly picked remote node. This way, when a geo-rep
session is created and started, the gsyncd slave processes are
distributed across the secondary cluster. If the node that was used to
create the session goes down, the monitor process brings up the gsyncd
slave process on a randomly picked remote node (from the nodes that are
up at the moment), and the bricks do not go into a Faulty state.
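The selection change described in the Fix can be sketched roughly as follows. This is an illustrative Python sketch, not the actual gsyncd code: the function and parameter names (pick_remote_host, is_node_up) are hypothetical, standing in for the remote_addr/resource_remote behavior the commit message describes.

```python
import random

def pick_remote_host(session_host, secondary_nodes, is_node_up):
    """Return a host on which to spawn the gsyncd slave process.

    Old behavior (remote_addr): always return session_host, so every
    slave process lands on the node used to create the session, and
    that node becomes a single point of failure.

    New behavior (resource_remote): pick randomly among the secondary
    nodes that are currently up, so slave processes are distributed
    across the secondary cluster and fail-over is possible.
    """
    candidates = [node for node in secondary_nodes if is_node_up(node)]
    if not candidates:
        # No reachable secondary node: the bricks would go Faulty.
        raise RuntimeError("no secondary node is reachable")
    return random.choice(candidates)
```

For example, if the session was created against sec1 and sec1 is down, the monitor would now pick one of the remaining live nodes (sec2 or sec3) instead of failing.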

fixes:#3956

Signed-off-by: Sanju Rakonde sanju.rakonde@phonepe.com

@sanjurakonde
Member Author

/recheck smoke

@black-dragon74
Member

/recheck smoke

@sanjurakonde
Member Author

/run regression

Contributor

@Shwetha-Acharya Shwetha-Acharya left a comment


LGTM

@Shwetha-Acharya Shwetha-Acharya merged commit af95d11 into gluster:devel Jan 30, 2023
amarts pushed a commit to kadalu/glusterfs that referenced this pull request Mar 20, 2023
* geo-replication: fix for secondary node fail-over

sanjurakonde added a commit to sanjurakonde/glusterfs that referenced this pull request Sep 25, 2024