
Specifying a revision for a range request in a transaction may cause data inconsistency #18667

Open
@ahrtr

Description

What happened?

Specifying a revision for a range request in a transaction may cause data inconsistency. The client may get different values for the same key from different endpoints.

How can we reproduce it?

  • Step 1: start a brand new 3 member cluster
  • Step 2: Execute for i in {1..20}; do ./etcdctl put k$i v$i; done
  • Step 3: Execute ./etcdctl compact 21
  • Step 4: Execute ./etcdctl txn --interactive
compares:
value("k1") = "v1"

success requests (get, put, del):
put k2 foo
get k1 --rev=10

failure requests (get, put, del):

The client will get an **etcdserver: mvcc: required revision has been compacted** error.

  • Step 5: Execute ./etcdctl get k2 against different endpoints; you will get different values:
$ ./etcdctl --endpoints=127.0.0.1:2379 get k2
k2
v2
$ ./etcdctl --endpoints=127.0.0.1:22379 get k2
k2
foo
$ ./etcdctl --endpoints=127.0.0.1:32379 get k2
k2
foo

Root cause

The root cause is that the etcd server removes range requests from the TXN on members the client isn't connected to, so those members skip the validation that the connected member performs.

needResult := s.w.IsRegistered(id)
if needResult || !noSideEffect(&raftReq) {
    if !needResult && raftReq.Txn != nil {
        removeNeedlessRangeReqs(raftReq.Txn)
    }

For example, if the client connects to member 1, etcdserver removes the range request (get k1 --rev=10 in the example above) from the TXN on members 2 and 3. As a result, member 1 fails to apply the TXN because checkRange returns an error, while members 2 and 3 apply it successfully because the range request was removed. The members end up with different data.

func checkRange(rv mvcc.ReadView, req *pb.RangeRequest) error {
    switch {
    case req.Revision == 0:
        return nil
    case req.Revision > rv.Rev():
        return mvcc.ErrFutureRev
    case req.Revision < rv.FirstRev():
        return mvcc.ErrCompacted
    }
    return nil
}
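
The divergence can be modelled without a real cluster. The sketch below is a toy simulation, not etcd code: the `op` and `member` types, `stripRangeOps`, `applyTxn`, and `reproduce` are hypothetical stand-ins for the TXN ops, a member's store, `removeNeedlessRangeReqs`, and the apply path. It assumes that validation runs over all remaining ops before any write, and that the compare (value("k1") = "v1") succeeds on every member, so only the success branch is modelled.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errCompacted = errors.New("mvcc: required revision has been compacted")
	errFutureRev = errors.New("mvcc: required revision is a future revision")
)

// op is a simplified TXN operation: either a put or a range (get).
type op struct {
	isRange  bool
	key, val string
	rev      int64 // requested revision for a range; 0 means "latest"
}

// member is a toy model of one etcd member's store.
type member struct {
	kv       map[string]string
	rev      int64 // current revision
	firstRev int64 // compacted revision
}

// newMember mirrors steps 1-3: 20 puts on a fresh store, then compact 21.
func newMember() *member {
	m := &member{kv: map[string]string{}, rev: 1}
	for i := 1; i <= 20; i++ {
		m.kv[fmt.Sprintf("k%d", i)] = fmt.Sprintf("v%d", i)
		m.rev++
	}
	m.firstRev = 21
	return m
}

// checkRange follows the same three cases as the checkRange shown above.
func (m *member) checkRange(rev int64) error {
	switch {
	case rev == 0:
		return nil
	case rev > m.rev:
		return errFutureRev
	case rev < m.firstRev:
		return errCompacted
	}
	return nil
}

// stripRangeOps returns a copy of ops without the range requests.
func stripRangeOps(ops []op) []op {
	kept := make([]op, 0, len(ops))
	for _, o := range ops {
		if !o.isRange {
			kept = append(kept, o)
		}
	}
	return kept
}

// applyTxn models the buggy behaviour: members that don't need the result
// drop range ops before validation, so they never see the compaction error.
func (m *member) applyTxn(ops []op, needResult bool) error {
	if !needResult {
		ops = stripRangeOps(ops)
	}
	for _, o := range ops {
		if o.isRange {
			if err := m.checkRange(o.rev); err != nil {
				return err // nothing is written when validation fails
			}
		}
	}
	wrote := false
	for _, o := range ops {
		if !o.isRange {
			m.kv[o.key] = o.val
			wrote = true
		}
	}
	if wrote {
		m.rev++
	}
	return nil
}

// reproduce applies "put k2 foo; get k1 --rev=10" to three members,
// with the client connected to the first one only.
func reproduce() []*member {
	members := []*member{newMember(), newMember(), newMember()}
	txn := []op{
		{key: "k2", val: "foo"},
		{isRange: true, key: "k1", rev: 10},
	}
	for i, m := range members {
		err := m.applyTxn(txn, i == 0)
		fmt.Printf("member %d: err=%v k2=%s\n", i+1, err, m.kv["k2"])
	}
	return members
}

func main() { reproduce() }
```

In this model, member 1 reports the compaction error and keeps k2 = v2, while members 2 and 3 report no error and store k2 = foo, which matches the output in step 5.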

Solution

The simplest solution is to not remove range requests from the TXN on any member. The side effect is that members the client isn't connected to will execute unnecessary range operations.

To avoid that side effect, members the client isn't connected to don't execute the range requests; instead they only validate them, so that every member performs the same validation and reaches the same result.

Workaround

If only one member is inconsistent, just replace it. See this guide. After removing the member, delete its data.

If all members are inconsistent, things get trickier. You'll need to pick one member as the source of truth, force creating a single-member cluster, and then re-add other members (don’t forget to clear their data first).

Impact

All versions (including 3.4.x, 3.5.x and main) are affected.

  • I searched the Kubernetes repo and confirmed that Kubernetes doesn't specify a revision for range requests in a TXN, so Kubernetes isn't affected.
  • For non-Kubernetes usage, it's uncommon for a TXN to be used this way in a real-world product, but I'm not entirely sure it never is.
