3.4.18 etcd data inconsistency #14211
Comments
Please provide the info below,
|
Thanks for the comment! Unfortunately the point-in-time db state on each member was lost. The corruption check on the new clusters will raise an alarm and record the inconsistency if this issue happens again. |
Is it easy to reproduce this issue in your environment (dev or production)?
Did you just create a new cluster, or replace one problematic member with a new one? |
It happened in a production cluster created on demand. I am not even sure where to start reproducing the error.
The oncall operator replaced the problematic member with a new one and everything recovered. |
Thanks for the feedback. If just one member has the issue, in other words the majority is still working well, then most likely the db file of the problematic member was corrupted. Usually you will see a panic when you try to execute a boltDB command in this case, something like 13406. Has the oncall operator in your company tried any boltDB command and confirmed this? |
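For reference, below is a minimal Go sketch of running bbolt's consistency check against a copy of a member's db file, similar in spirit to what the bbolt CLI does; the path is an assumption based on etcd's default data-dir layout (`<data-dir>/member/snap/db`), so adjust it for your deployment.

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Assumed location of the member's boltDB file; adjust for your data dir.
	const dbPath = "/var/lib/etcd/member/snap/db"

	// Open read-only so the check does not mutate the (possibly corrupted) file.
	db, err := bolt.Open(dbPath, 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("open %s: %v", dbPath, err)
	}
	defer db.Close()

	// Tx.Check walks the page structure and reports inconsistencies; a badly
	// corrupted file may instead panic, as described above.
	if err := db.View(func(tx *bolt.Tx) error {
		for cerr := range tx.Check() {
			fmt.Println("consistency error:", cerr)
		}
		return nil
	}); err != nil {
		log.Fatalf("check failed: %v", err)
	}
}
```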
Yeah, I downloaded the snapshots taken from the impacted node before and after the panic and ran some bbolt commands against them. I'd like to keep this issue open until a second occurrence with the WAL logs present. |
@chaochn47 do you still have data in your metric system? Can you post other metrics? Seems like something happened on 06/09.
It's interesting that the revision divergence happened on 06/09 but the size divergence on 06/19. Was the cluster under heavy load or a different load pattern? |
@lavacat thanks for taking a look! We do have data in the metrics and logging system.
There was a leadership change around 5 minutes before this event. However, we started to see 3 more similar issues happening on other production clusters. They do not correlate with leader changes.
No heartbeat failures and no snapshot apply in progress.
Compaction is driven by
I think the cluster is under light load. The mvcc range, txn, and put rates are low and the db size is under 25 MiB. |
I suspected that the write txn failed to write back its buffer to the read txn buffer, while a new concurrentReadTx created at the same time copied the stale read txn buffer. But from the code's perspective, this does not seem to be an issue. The only other possibility is that a disk write failed and the mmap'd data was somehow not updated, but a disk write failure should have been surfaced to etcd at the same time. We are enabling the corruption check, alarming faster, retaining more WAL files, building better tooling to investigate, etc., so hopefully next time there will be more information. |
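To make the suspected race above concrete, here is a simplified, hypothetical sketch of the buffer hand-off being described. The names (readBuffer, write buffer, concurrent read txn) echo etcd's backend terminology, but the code is illustrative only, not the actual etcd implementation.

```go
package sketch

import "sync"

// txBuffer is a stand-in for etcd's in-memory key buffer (illustrative only).
type txBuffer struct {
	kv map[string]string
}

func (b *txBuffer) copy() *txBuffer {
	c := &txBuffer{kv: make(map[string]string, len(b.kv))}
	for k, v := range b.kv {
		c.kv[k] = v
	}
	return c
}

type backend struct {
	mu         sync.RWMutex // guards readBuffer
	readBuffer *txBuffer    // shared buffer that concurrent read txns copy
}

// commitWrite models the end of a write txn: the write buffer is merged back
// into the read buffer under the lock. If this hand-off and the buffer copy
// below were not serialized, a read txn created at the same moment could see
// a stale buffer -- the race suspected (and ruled out by code inspection) above.
func (b *backend) commitWrite(writeBuffer *txBuffer) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for k, v := range writeBuffer.kv {
		b.readBuffer.kv[k] = v
	}
}

// newConcurrentReadTx models creating a concurrent read txn: it takes a copy
// of the read buffer so reads do not block writers.
func (b *backend) newConcurrentReadTx() *txBuffer {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.readBuffer.copy()
}
```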
It seems the names "revision divergency" and "db size total divergency" caused confusion. Please
|
Thanks, updated. FYI,
We’re not allowed to share customer data. |
|
Providing a case we observed on Nov 1, 2023.
What happened?
A 3-member etcd cluster returned inconsistent data. etcd version: 3.4.18. A panic occurred during operation, and it seems the data was inconsistent after etcd restarted; no errors were detected at startup. (I don't think it's related to #266.)
Anything else we need to know?
|
Do you have any suggestions on the direction of investigation? I haven't been able to reproduce it yet. @ahrtr @serathius |
I am working on reproducing the bbolt data inconsistency. Will update if I find out something. @Tachone |
@Tachone Would you mind sharing what |
I think there is some important information to help reproduce it:
|
@Tachone Thanks!
The machine wasn't restarted, is that correct? Is there any useful kernel message in
After the bbolt panic, can etcd still restart and serve requests?
For the bad etcd node, after the leader replicated the log, can it catch up with the other two nodes? |
|
I have seen a couple of corruptions, but nothing following a single pattern that would suggest an application issue. I would attribute them mostly to memory stamping (a page is written under the wrong page id). Still, investigating them would be useful to propose additional debug information and safeguards that would give us more insight into the problem. My suggestion would be to separate the investigation and handle each report as an independent issue as long as there is no strong suggestion that they have exactly the same root cause. |
Discussed during the sig-etcd triage meeting: @ahrtr is this issue still valid? There is a suggestion above that we handle each report of this data corruption as its own issue, as the root cause may not be the same. |
What happened?
A 3-member etcd cluster returned inconsistent data.
'bad' Node
kube-controller-manager continuously fails to update its lock due to partial data in etcd.
What did you expect to happen?
The error
range failed to find revision pair
should never happen, because a read transaction will either read from the write buffer or from boltdb. We reported this case as a reference.
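As an illustration of why the error should be impossible, here is a hypothetical sketch of the read path just described: a revision lookup consults the in-memory buffer first and falls back to boltdb, so a committed revision should always be found in one of the two. The names below are illustrative and are not etcd's actual code.

```go
package sketch

import "errors"

var errRevisionNotFound = errors.New("range failed to find revision pair")

// revision identifies a key version; a stand-in for etcd's mvcc revision.
type revision struct{ main, sub int64 }

type keyValue struct{ value string }

// rangeBackend models the two places a read transaction can find data:
// the in-memory buffer of recent writes, and boltdb itself.
type rangeBackend struct {
	buffer map[revision]keyValue // recent writes kept in memory
	boltdb map[revision]keyValue // data already committed to the db file
}

// getRevision mirrors the read path described above: the buffer is consulted
// first, then boltdb. For a revision the key index says exists, one of the two
// lookups must succeed; hitting the error below therefore means the buffer and
// the db file have diverged, which is the inconsistency reported in this issue.
func (b *rangeBackend) getRevision(rev revision) (keyValue, error) {
	if kv, ok := b.buffer[rev]; ok {
		return kv, nil
	}
	if kv, ok := b.boltdb[rev]; ok {
		return kv, nil
	}
	return keyValue{}, errRevisionNotFound
}
```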
How can we reproduce it (as minimally and precisely as possible)?
Unfortunately, I don't have any clue to reproduce it.
Anything else we need to know?
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response