
geo-replication: fix for secondary node fail-over #3959

Merged

Conversation

sanjurakonde
Member

@sanjurakonde sanjurakonde commented Jan 18, 2023

Problem: When a geo-replication session is set up, all the gsyncd slave
processes come up on the host that was used to create the geo-rep
session. When this primary slave node goes down, all the bricks go
into a Faulty state.

Cause: When the monitor process tries to connect to the remote secondary
node, it always uses remote_addr as the hostname. This variable
holds the hostname of the node that was used to create the geo-rep
session, so the gsyncd slave processes always come up on that
primary slave node. When this node goes down, the monitor process cannot
bring up the gsyncd slave processes, and the bricks go into a Faulty state.

Fix: Instead of remote_addr, use resource_remote, which holds the
hostname of a randomly picked remote node. This way, when a geo-rep
session is created and started, the gsyncd slave processes are
distributed across the secondary cluster. If the node that was used to
create the session goes down, the monitor process brings up the gsyncd
slave process on a randomly picked remote node (from the nodes that are
up at the moment), and the bricks do not go into a Faulty state.
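The selection change described in the Fix can be sketched roughly as follows. This is an illustrative Python sketch, not the actual gsyncd code: the function and parameter names (pick_remote_host, is_node_up) are hypothetical, standing in for the remote_addr/resource_remote behavior the commit message describes.

```python
import random

def pick_remote_host(session_host, secondary_nodes, is_node_up):
    """Return a host on which to spawn the gsyncd slave process.

    Old behavior (remote_addr): always return session_host, so every
    slave process lands on the node used to create the session, and
    that node becomes a single point of failure.

    New behavior (resource_remote): pick randomly among the secondary
    nodes that are currently up, so slave processes are distributed
    across the secondary cluster and fail-over is possible.
    """
    candidates = [node for node in secondary_nodes if is_node_up(node)]
    if not candidates:
        # No reachable secondary node: the bricks would go Faulty.
        raise RuntimeError("no secondary node is reachable")
    return random.choice(candidates)
```

For example, if the session was created against sec1 and sec1 is down, the monitor would now pick one of the remaining live nodes (sec2 or sec3) instead of failing.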

fixes:#3956

Signed-off-by: Sanju Rakonde sanju.rakonde@phonepe.com

@sanjurakonde
Member Author

/recheck smoke

@black-dragon74
Member

/recheck smoke

@sanjurakonde
Member Author

/run regression

Contributor

@Shwetha-Acharya Shwetha-Acharya left a comment


LGTM

@Shwetha-Acharya Shwetha-Acharya merged commit af95d11 into gluster:devel Jan 30, 2023
amarts pushed a commit to kadalu/glusterfs that referenced this pull request Mar 20, 2023
* geo-replication: fix for secondary node fail-over

sanjurakonde added a commit to sanjurakonde/glusterfs that referenced this pull request Sep 25, 2024