Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more gofailpoints to validate critical parts of the code #14735

Open
4 of 8 tasks
Tracked by #14045
serathius opened this issue Nov 11, 2022 · 12 comments
Open
4 of 8 tasks
Tracked by #14045

Add more gofailpoints to validate critical parts of the code #14735

serathius opened this issue Nov 11, 2022 · 12 comments
Labels
help wanted priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. stage/tracked type/feature

Comments

@serathius
Copy link
Member

serathius commented Nov 11, 2022

What would you like to be added?

Followup from #14733. Thanks to gofailpoint in defrag code we were able to find an issue with data inconsistency.

Based on this we should increase coverage of other processes similar to defrag. TODO:

  • Create a list of processes to add gofailpoints
  • Add the gofailpoint into the code. Example:
    // gofail: var defragBeforeCopy struct{}
  • Add code to linearizability tests to trigger the failpoints. Example
    DefragBeforeCopyPanic Failpoint = goFailpoint{"backend/defragBeforeCopy", "panic", triggerDefrag}
    DefragBeforeRenamePanic Failpoint = goFailpoint{"backend/defragBeforeRename", "panic", triggerDefrag}
    BeforeCommitPanic Failpoint = goFailpoint{"backend/beforeCommit", "panic", nil}
    AfterCommitPanic Failpoint = goFailpoint{"backend/afterCommit", "panic", nil}
    RaftBeforeSavePanic Failpoint = goFailpoint{"etcdserver/raftBeforeSave", "panic", nil}
    RaftAfterSavePanic Failpoint = goFailpoint{"etcdserver/raftAfterSave", "panic", nil}
    RandomFailpoint Failpoint = randomFailpoint{[]Failpoint{
    KillFailpoint, BeforeCommitPanic, AfterCommitPanic, RaftBeforeSavePanic,
    RaftAfterSavePanic, DefragBeforeCopyPanic, DefragBeforeRenamePanic,
    }}

Let's start from creating the list:

Feel free to suggest other processes that require a gofailpoint. cc @ahrtr @ptabor

Why is this needed?

This should allow us to better detect issues with data inconsistency like #14733

@serathius serathius added type/feature help wanted priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 11, 2022
@ahrtr
Copy link
Member

ahrtr commented Nov 11, 2022

I will provide a list next week.

@ahrtr
Copy link
Member

ahrtr commented Nov 14, 2022

Supplement:

  • downgrade
  • Moveleader
  • backend (I will add some failpoints)

@serathius
Copy link
Member Author

Would love to also get some suggestions from @tbg @pavelkalinnikov. Raft library already has couple of failpoints, still it might be worth to expand them, especially with #14627.

@lavacat
Copy link

lavacat commented Jan 3, 2023

@serathius I've checked ConsistenIndex and I think the most interesting use case for failpoint is UnsafeSave, but it's indirectly covered by PreCommitHook failpoints.
Do you have something else in mind for ConsistenIndex? If not, I think it can be crossed out.

@lavacat
Copy link

lavacat commented Jan 3, 2023

@ahrtr Checked Moveleader and it seams most of the work is happening inside raft lib. I think it will be useful to add MoveLeader call to DefaultTraffic. But I don't see a good place inside etcd repo for a new failpoint. Do you have something else in mind for MoveLeader failpoint?

@serathius
Copy link
Member Author

@serathius I've checked ConsistenIndex and I think the most interesting use case for failpoint is UnsafeSave, but it's indirectly covered by PreCommitHook failpoints. Do you have something else in mind for ConsistenIndex? If not, I think it can be crossed out.

No, list is to ensure we give some thought to it. Thanks for confirming.

@serathius
Copy link
Member Author

I think we should also cover watch with failpoints. First step would be to reproduce #15248 by adding a failpoint after sending a progress notify. cc @ahrtr

@ahrtr
Copy link
Member

ahrtr commented Feb 9, 2023

@ahrtr Checked Moveleader and it seams most of the work is happening inside raft lib. I think it will be useful to add MoveLeader call to DefaultTraffic. But I don't see a good place inside etcd repo for a new failpoint. Do you have something else in mind for MoveLeader failpoint?

Sorry for the late response. We can use newLeader to check whether the leader changed, and add failpoint accordingly.

I think we should also cover watch with failpoints. First step would be to reproduce #15248 by adding a failpoint after sending a progress notify. cc @ahrtr

This seems to be a good suggestion. But I still do not think it can reliably reproduce the duplicated event issue. Submitted a PR #15272.

@stale
Copy link

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 21, 2023
@mounilKshah
Copy link

If no one else is working on this issue, is it fine if I work on it @serathius ?

@serathius
Copy link
Member Author

This issue requires some research into best places to put failpoints. This might require knowledge from maintainers, that might not be available and not able to provide timely responses.

@mounilKshah Could you look into #17817 ? It's very similar issue, but about backporting existing failpoints, so there is no need for maintainers help and should allow you to dig into the problem.

@mounilKshah
Copy link

Yeah sure, thanks for the quick response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. stage/tracked type/feature
Development

No branches or pull requests

4 participants