Default settings lead to infinite loop when a server is removed #362
Comments
Sorry for my late reply, I started investigating this issue right away but forgot to leave an acknowledgement here.

Yes, the total timeout is a fix for avoiding the infinite loop (though it does default to 0), but it does not fix the underlying issue of a broken node being continually used to service requests. As for getting a new node assigned on each iteration of the loop, I looked deeper into the code to see what could be making it pick the same node. I found that even though the code to get a new node is running, a few factors keep it from making progress.

First, on the initial failure (and subsequent ones) the code ends up executing the `continue` in the block I posted above:

```go
cmd.conn, err = ifc.getConnection(policy)
if err != nil {
	isClientTimeout = true

	if err == types.ErrConnectionPoolEmpty {
		// if the connection pool is empty, we still haven't tried
		// the transaction to increase the iteration count.
		iterations--
	}
	logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
	continue
}
```

Note that `isClientTimeout` will be set to true. After the first failure, `notFirstIteration` will be set and `ifc.prepareRetry` will execute:

```go
if notFirstIteration {
	aerr, ok := err.(types.AerospikeError)
	if !ifc.prepareRetry(ifc, isClientTimeout || (ok && aerr.ResultCode() != types.SERVER_NOT_AVAILABLE)) {
		...
```

Looking at the implementation of the command type I am running (operate), I can see it ends up here, via `operateCommand.prepareRetry` → `cmd.partition.PrepareRetryWrite(isClientTimeout bool)`:

```go
func (ptn *Partition) PrepareRetryWrite(isClientTimeout bool) {
	if !isClientTimeout {
		ptn.sequence++
	}
}
```

However, since `isClientTimeout == true`, the partition sequence number will NOT be incremented. Now we get to the code where the node is set (command.go 2098), `cmd.node, err = ifc.getNode(ifc)`, which for the operate command ends up here:

```go
func (cmd *operateCommand) getNode(ifc command) (*Node, error) {
	if cmd.hasWrite {
		return cmd.partition.GetNodeWrite(cmd.cluster)
	}

	// this may be affected by Rackaware
	return cmd.partition.GetNodeRead(cmd.cluster)
}
```

Both `GetNodeWrite`/`GetNodeRead` have similar execution paths for the default REPLICA_POLICY of SEQUENCE (the default value returned by `policy.go::NewPolicy()`, and what I have been using):

```go
func (ptn *Partition) getSequenceNode(cluster *Cluster) (*Node, error) {
	replicas := ptn.partitions.Replicas

	for range replicas {
		index := ptn.sequence % len(replicas)
		node := replicas[index][ptn.PartitionId]

		if node != nil && node.IsActive() {
			return node, nil
		}
		ptn.sequence++
	}
	nodeArray := cluster.GetNodes()
	return nil, newInvalidNodeError(len(nodeArray), ptn)
}
```

For some reason, `node.IsActive()` always passes here and in other places in command.go. As you can see, when that happens, and since the sequence number was not changed because the failure was classified as a "client timeout", it returns the exact same node associated with the previous failed execution.
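To make the effect concrete, here is a minimal, self-contained sketch (not the client's actual code, just a toy model of the selection above): as long as the sequence counter is never advanced, `sequence % len(replicas)` resolves to the same index, so every retry lands on the same node.

```go
package main

import "fmt"

func main() {
	// Toy stand-ins for the replica list and the partition's sequence counter.
	replicas := []string{"node-A", "node-B", "node-C"}
	sequence := 4 // whatever value the partition was left with

	for attempt := 1; attempt <= 3; attempt++ {
		// Mirrors getSequenceNode: the chosen index only changes if sequence changes.
		index := sequence % len(replicas)
		fmt.Printf("attempt %d -> %s\n", attempt, replicas[index])

		// Mirrors PrepareRetryWrite: the increment is skipped when the failure
		// was classified as a client timeout, so the index never moves.
		isClientTimeout := true
		if !isClientTimeout {
			sequence++
		}
	}
	// Prints node-B three times in a row.
}
```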
It then eventually gets back to this block, where everything starts all over:

```go
cmd.conn, err = ifc.getConnection(policy)
if err != nil {
	isClientTimeout = true

	if err == types.ErrConnectionPoolEmpty {
		// if the connection pool is empty, we still haven't tried
		// the transaction to increase the iteration count.
		iterations--
	}
	logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
	continue
}
```

While setting the timeout works to avoid an infinite loop, it does not allow the driver to recover. The bad nodes stick around, getting stuck in a similar error path, until I create a new client and swap my wrapper over to it. If there is a bug here, it seems more likely to be related to the code that tends the nodes. I will try to get you some more details, or perhaps figure out how to get into this state with the debugger attached.
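For reference, here is a rough sketch of the kind of wrapper described above: recreate the client when an operation fails with a network-level error. This is an illustration only, not part of the library — the wrapper type, its `do` helper, and the `isNetworkError` classification are made up, and the error check shown is just the pool-exhaustion error from the block above.

```go
package wrapper

import (
	"sync"

	aerospike "github.com/aerospike/aerospike-client-go" // v4 line of the client
	"github.com/aerospike/aerospike-client-go/types"
)

// resilientClient is a hypothetical wrapper that swaps in a fresh
// aerospike.Client whenever an operation fails with a network error.
type resilientClient struct {
	mu     sync.Mutex
	client *aerospike.Client
	hosts  []*aerospike.Host
	policy *aerospike.ClientPolicy
}

// isNetworkError is a placeholder classification; real code would also match
// timeouts and connection errors, not just an exhausted pool.
func isNetworkError(err error) bool {
	return err == types.ErrConnectionPoolEmpty
}

// do runs op against the current client; on a network error it closes the
// old client, creates a new one, and retries once.
func (r *resilientClient) do(op func(c *aerospike.Client) error) error {
	r.mu.Lock()
	c := r.client
	r.mu.Unlock()

	err := op(c)
	if err == nil || !isNetworkError(err) {
		return err
	}

	r.mu.Lock()
	defer r.mu.Unlock()
	if r.client == c { // nobody else has replaced it yet
		fresh, cerr := aerospike.NewClientWithPolicyAndHost(r.policy, r.hosts...)
		if cerr != nil {
			return err // keep the original failure if we cannot reconnect
		}
		c.Close()
		r.client = fresh
	}
	return op(r.client)
}
```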
Sorry for the very late reply, I didn't get any notifications from the repo for a while. Thank you for your analysis of the issue. From what I see, we have an issue with identifying client-side vs server-side errors. It is not as trivial as it seems; I'll have to give it a bit of thought as to how best to handle it. I've pushed this issue to the top of my list and will release a fix before the week is out.
Any updates?
I have observed this issue in driver versions v4.5.2 and v5.5.0; the code I am listing and investigating is from v4.5.2.

To reproduce this behavior, I deployed Aerospike 5.5 Community Edition via the Helm charts.

I deployed a cluster of 6 replicas into GKE. I then ran a set of clients that continually create and get records (using the Operate method for creates). If I then remove any of the 6 replicas by restarting its pod, the CPU on most of the service pods using the Go client spikes from roughly 5% up to 200%, and it does not recover even after waiting more than 15 minutes.
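For context, a minimal sketch of the kind of load generator described here. The seed host, namespace, set, and bin names are placeholders; only the Operate/Get pattern reflects the reported workload.

```go
package main

import (
	"fmt"
	"log"

	aerospike "github.com/aerospike/aerospike-client-go" // v4 line of the client
)

func main() {
	// Placeholder seed host for the in-cluster Aerospike service.
	client, err := aerospike.NewClient("aerospike.aerospike.svc.cluster.local", 3000)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	wp := aerospike.NewWritePolicy(0, 0)

	for i := 0; ; i++ {
		key, err := aerospike.NewKey("test", "demo", fmt.Sprintf("key-%d", i%10000))
		if err != nil {
			log.Fatal(err)
		}

		// Create/update the record via Operate, as in the reported workload.
		if _, err := client.Operate(wp, key, aerospike.PutOp(aerospike.NewBin("value", i))); err != nil {
			log.Printf("operate failed: %v", err)
		}

		// Read it back with the default policy.
		if _, err := client.Get(nil, key); err != nil {
			log.Printf("get failed: %v", err)
		}
	}
}
```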
I ran the profiler and discovered that every pod with runaway CPU was spending most of its time executing the `command.executeAt` method, and I tracked it to this block (command.go line 2114):
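(The block at that line appears to be the same `getConnection` error-handling path quoted in full in the analysis above; reproduced here for reference.)

```go
cmd.conn, err = ifc.getConnection(policy)
if err != nil {
	isClientTimeout = true

	if err == types.ErrConnectionPoolEmpty {
		// if the connection pool is empty, we still haven't tried
		// the transaction to increase the iteration count.
		iterations--
	}
	logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
	continue
}
```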
The node does get replaced by Kubernetes after a few seconds, but the specific connection the client command holds (the Node from the Cluster struct) never gets updated.

This block of code will execute forever, always failing at `cmd.conn, err = ifc.getConnection(policy)`, because the actual node connection is bound to the command when it was created. It can sit there forever, incrementing iterations, using the bad connection, and decrementing the count again.

Setting a value for TotalTimeout (which defaults to 0) resolves the infinite loop, but there are still cases where, after one of the nodes becomes unhealthy, the tend routine never forgets the old node and its connection pool stays exhausted (even though a new StatefulSet pod has been created and joined the cluster to replace it). To work around this, I added a wrapper layer around this entire client library that recreates a new client and closes the old one in case of any network errors. If there is some way to avoid this, I would much prefer that!
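As a stopgap, bounding the retry loop looks roughly like this. It is a sketch against the v4-era policy fields (assuming the aerospike and time imports from the sketches above); the timeout and retry values are arbitrary examples, not recommendations.

```go
// boundedOperate shows policy settings that keep a single command from
// retrying forever; client and key come from wherever the caller set them up.
func boundedOperate(client *aerospike.Client, key *aerospike.Key) error {
	wp := aerospike.NewWritePolicy(0, 0)
	wp.TotalTimeout = 5 * time.Second         // overall budget; defaults to 0, i.e. unbounded
	wp.SocketTimeout = 500 * time.Millisecond // per-attempt network timeout
	wp.MaxRetries = 2                         // cap the number of retries as well

	_, err := client.Operate(wp, key, aerospike.PutOp(aerospike.NewBin("value", 42)))
	return err
}
```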
I see people have reported similar behavior with the defaults in past issues. If this is truly the expected behavior with the default settings, perhaps add a note in the best practices/performance section of the docs mentioning it, as it could help people avoid surprises!
I am also curious why it keeps retrying on the exact same connection; wouldn't it be a lot more resilient if it randomly chose a new connection each time? As far as I can tell, the selection of the node happens much earlier, as soon as the command is created.
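On the node-selection question: which node a command binds to is driven by the policy's replica policy (the SEQUENCE default discussed above). The client also exposes other replica policies, including a random one; a minimal sketch, assuming the RANDOM constant exists in the client version in use:

```go
func randomReplicaPolicy() *aerospike.BasePolicy {
	// Assumption: aerospike.RANDOM is available as a ReplicaPolicy constant in
	// this client version; SEQUENCE is the default returned by NewPolicy().
	p := aerospike.NewPolicy()
	p.ReplicaPolicy = aerospike.RANDOM
	return p
}
```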