In RocketMQ DLedger, when the disk that stores data or index files fails, the broker group can end up in a state with no Master.
The sequence of events was as follows:
# Process
Initial state: term=2, n2 is the Master node
2023-04-13 02:04:53, disk failure, term=3, n0 becomes Master
2023-04-13 02:04:53 INFO QuorumAckChecker-n0 - [n0][LEADER] term=3 ledgerBegin=133702137569 ledgerEnd=137547196422 committed=137547196422 watermarks={3:{"n0":137547196422,"n1":137547196422,"n2":-1}}
2023-04-13 02:04:56 WARN DLedgerServer-ScheduledExecutor - preferredLeaderId = n2 is not online
2023-04-13 02:05:08, term=3, n0 detected that n2 is online, handing over the Master role.
2023-04-13 02:05:08 INFO DLedgerServer-ScheduledExecutor - preferredLeaderId = n2, which has the smallest fall behind index = 12 and is decided to be transferee.
2023-04-13 02:05:08 INFO DLedgerServer-ScheduledExecutor - handleLeadershipTransfer: LeadershipTransferRequest{transferId='null', transfereeId='n2', takeLeadershipLedgerIndex=0, group='null', remoteId='null', localId='null', code=200, leaderId='null', term=3}
2023-04-13 02:05:08, term=4, n2 becomes Master
2023-04-13 02:05:08 INFO StateMaintainer - [n2] [PARSE_VOTE_RESULT] cost=1 term=4 memberNum=3 allNum=2 acceptedNum=2 notReadyTermNum=0 biggerLedgerNum=0 alreadyHasLeader=false maxTerm=4 result=PASSED
2023-04-13 02:05:08 INFO StateMaintainer - [n2] [VOTE_RESULT] has been elected to be the leader in term 4
2023-04-13 02:05:08 INFO StateMaintainer - TakeLeadershipTask finished. request=LeadershipTransferRequest{transferId='n0', transfereeId='n2', takeLeadershipLedgerIndex=137547318699, group='c4cloudsrv-miot-rocketmq-raft8', remoteId='n2', localId='n0', code=200, leaderId='n0', term=3}, response=LeadershipTransferResponse{group='null', remoteId='null', localId='null', code=200, leaderId='null', term=4}, term=4, role=LEADER
2023-04-13 02:05:08 INFO StateMaintainer - [n2] [ChangeRoleToLeader] from term: 4 and currTerm: 4
2023-04-13 02:05:27, term=4, propagating the Master role from DLedger to the Broker failed
2023-04-13 02:05:08 INFO DLegerRoleChangeHandler_1 - Begin handling broker role change term=4 role=LEADER currStoreRole=SLAVE
2023-04-13 02:05:27 INFO DLegerRoleChangeHandler_1 - [MONITOR]Failed handling broker role change term=4 role=LEADER currStoreRole=SLAVE cost=19334
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
In the final state, DLedger had elected n2 as the Master, but the Broker group had no Master, and no new election has been initiated since then.
# Question
Why did n0 and n1 not initiate an election after n2 became Master in term 4?
If the Master stops sending heartbeats to a follower, the follower triggers an election; but as long as heartbeats keep arriving normally, the follower will not initiate one.
The memberState object lock is used to detect disk failures. The lock is held while a message is being written. If the disk fails, the lock is not released in time, the heartbeat thread cannot acquire it, and the disk failure is detected that way. In other words, writing messages is what triggers disk failure detection. If the client no longer writes messages, the heartbeat thread can always acquire the lock, so it keeps sending heartbeats.
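The dependency on write traffic can be illustrated with a toy model. This is only a sketch under stated assumptions, not DLedger's actual code: `LockBasedDetectionDemo`, `heartbeatSucceeds`, and the 200 ms timeout are hypothetical names and values, and a `ReentrantLock` stands in for the memberState object lock so that the heartbeat side can use a timed acquire.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy model (hypothetical, not DLedger code): the heartbeat side only
// notices a disk problem when a writer is stuck while holding the lock.
public class LockBasedDetectionDemo {

    /**
     * Returns true if the simulated heartbeat thread can acquire the state
     * lock within 200 ms, i.e. if it would keep sending heartbeats.
     */
    public static boolean heartbeatSucceeds(boolean writerStuckOnDisk) {
        ReentrantLock stateLock = new ReentrantLock();   // stands in for the memberState lock
        CountDownLatch lockHeld = new CountDownLatch(1); // signals that the writer owns the lock
        CountDownLatch release = new CountDownLatch(1);  // lets the stuck writer finish

        if (writerStuckOnDisk) {
            Thread writer = new Thread(() -> {
                stateLock.lock();
                try {
                    lockHeld.countDown();
                    release.await(); // simulates an append blocked on a failed disk
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    stateLock.unlock();
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        try {
            if (writerStuckOnDisk) {
                lockHeld.await(); // make sure the writer really holds the lock first
            }
            boolean acquired = stateLock.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                stateLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // unblock the simulated writer
        }
    }

    public static void main(String[] args) {
        System.out.println(heartbeatSucceeds(false)); // no writes: prints true
        System.out.println(heartbeatSucceeds(true));  // stuck write: prints false
    }
}
```

With no writer, the timed acquire always succeeds, which matches the behaviour described above: a node with a failed disk keeps looking healthy to its followers as long as nothing is written.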
# TODO
If no data is written, a node with a faulty disk can still become the Master. Therefore, I think it is necessary to add a task that periodically checks whether the disks are available, to avoid this situation.
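A minimal sketch of such a periodic check follows, using only the JDK. It assumes the check can be expressed as a small write, fsync, and read-back in the store directory. `DiskHealthProbe`, `isWritable`, `schedule`, and the `onFailure` callback are hypothetical names for illustration; how a failure would be wired into DLedger (for example, giving up the Master role) is not shown here and is left open.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of DLedger or RocketMQ.
public class DiskHealthProbe {

    /** Returns true if a small write, fsync, and read-back succeed inside dir. */
    public static boolean isWritable(Path dir) {
        Path probe = null;
        try {
            probe = Files.createTempFile(dir, "disk-probe-", ".tmp");
            byte[] payload = {0x44, 0x4C, 0x45, 0x44};
            try (FileChannel ch = FileChannel.open(probe,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                ch.write(ByteBuffer.wrap(payload));
                ch.force(true); // ask the OS to flush data and metadata to the device
                ByteBuffer readBack = ByteBuffer.allocate(payload.length);
                ch.read(readBack, 0); // positional read from the start of the file
                return Arrays.equals(payload, readBack.array());
            }
        } catch (IOException | RuntimeException e) {
            return false; // any I/O problem counts as an unhealthy disk
        } finally {
            if (probe != null) {
                try {
                    Files.deleteIfExists(probe);
                } catch (IOException ignored) {
                    // best-effort cleanup of the probe file
                }
            }
        }
    }

    /** Runs the probe every periodSeconds and calls onFailure whenever it fails. */
    public static ScheduledExecutorService schedule(Path dir, long periodSeconds,
                                                    Runnable onFailure) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "disk-health-probe");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleWithFixedDelay(() -> {
            if (!isWritable(dir)) {
                onFailure.run(); // hypothetical hook, e.g. step down from Master
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

One caveat with this approach: on a badly failed disk the probe itself may block rather than throw, so a real implementation would probably also need a timeout around the probe, or a watchdog that treats a probe that never returns as a failure.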