geo-replication: fix for secondary node fail-over #3959
Merged
Shwetha-Acharya merged 3 commits into gluster:devel from sanjurakonde:geo-rep-slave-node-fail-over on Jan 30, 2023
Conversation
Problem: When a geo-replication session is set up, all the gsyncd slave processes come up on the host that was used to create the geo-rep session. When this primary slave node goes down, all the bricks go into the faulty state.

Cause: When the monitor process connects to the remote secondary node, it always uses remote_addr as the hostname. This variable holds the hostname of the node used to create the geo-rep session, so the gsyncd slave processes always come up on that primary slave node. When this node goes down, the monitor process cannot bring up the gsyncd slave processes and the bricks go into the faulty state.

Fix: Instead of remote_addr, use resource_remote, which holds the hostname of a randomly picked remote node. This way, once the geo-rep session is created and started, the gsyncd slave processes are distributed across the secondary cluster. If the node used to create the session goes down, the monitor process brings up the gsyncd slave process on a randomly picked remote node (from the nodes that are up at the moment), and the bricks do not go into the faulty state.

fixes: gluster#3956
Signed-off-by: Sanju Rakonde <sanju.rakonde@phonepe.com>
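The effect of the change is easiest to see as node selection. The sketch below is illustrative only and is not the actual gsyncd code: the helper name `pick_secondary_node`, the SSH-port probe, and the None fallback are assumptions; the only part taken from this PR is the idea of replacing the fixed session-creation host (remote_addr) with a randomly picked, currently reachable secondary node (resource_remote).

```python
# Illustrative sketch only -- not the actual gsyncd implementation.
import random
import socket


def pick_secondary_node(secondary_nodes, port=22, timeout=5):
    """Return a randomly chosen, currently reachable secondary node.

    secondary_nodes: hostnames of all nodes in the secondary cluster
    (hypothetical parameter; gsyncd derives this from the session config).
    Returns None when no node answers, so the caller can mark the brick
    faulty and retry later.
    """
    candidates = list(secondary_nodes)
    random.shuffle(candidates)
    for host in candidates:
        try:
            # Probe the SSH port, since gsyncd reaches the secondary over SSH.
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue
    return None


# Before this PR the monitor effectively always used the session-creation
# host (remote_addr); after it, a randomly picked node (resource_remote):
#
#   remote_host = remote_addr                                # old: fixed primary slave node
#   remote_host = pick_secondary_node(all_secondary_nodes)   # new idea, sketched here
```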
…sterfs into geo-rep-slave-node-fail-over
/recheck smoke
/recheck smoke
/run regression
Shwetha-Acharya approved these changes on Jan 24, 2023
LGTM
aravindavk approved these changes on Jan 27, 2023
amarts pushed a commit to kadalu/glusterfs that referenced this pull request on Mar 20, 2023
geo-replication: fix for secondary node fail-over

fixes: gluster#3956
Signed-off-by: Sanju Rakonde <sanju.rakonde@phonepe.com>
sanjurakonde added a commit to sanjurakonde/glusterfs that referenced this pull request on Sep 25, 2024
…)" This reverts commit af95d11.