auth token invalid after watch reconnects #11954

Closed
@jschwinger233

Description

Reproduction Procedure

  1. prepare an etcd cluster with 3 nodes and the corresponding etcdctl
/go/src/github.com/coreos/etcd # etcd --version
etcd Version: 3.4.9
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.14.2
Go OS/Arch: linux/amd64

Though I don't think the configuration matters for this issue, here is the config of one node:

name: etcd2
data-dir: /data/zz_6129484611666145821
listen-peer-urls: http://0.0.0.0:23800
listen-client-urls: http://0.0.0.0:23790
initial-advertise-peer-urls: http://10.213.20.39:23800
advertise-client-urls: http://10.213.20.39:23790
initial-cluster: etcd0=http://10.213.20.40:23800,etcd1=http://10.213.20.38:23800,etcd2=http://10.213.20.39:23800
initial-cluster-token: zz
initial-cluster-state: new
auto-compaction-retention: "1"
quota-backend-bytes: -1

and the etcd processes are started simply with

etcd --config-file /etc/etcd.conf

no environment variables, no other command-line arguments

  2. enable auth
etcdctl user add root
# then type root password, mine is 'root'
etcdctl --user root:root auth enable
  3. watch
etcdctl --user root:root watch / --prefix

we can put a key to confirm that the watch is working at this point

etcdctl --user root:root put /a b
  4. wait for the token to be deleted

Wait 5 min until the token is deleted by simpleTokenKeeper, then kill the etcd processes one by one, restarting each immediately after it is killed.

Note: do NOT kill the next process until the cluster has recovered to a healthy state.

  5. watch fails

You will then see that the watch has terminated with a permission denied error.

Analysis

The issue is caused by simpleTokenKeeper. Here is the timeline:

  1. etcdctl dials grpc and fetches an auth token, say TOKEN-A
  2. etcdctl dials grpc with TOKEN-A and runs watch / --prefix as expected
  3. after 5 min, simpleTokenKeeper deletes TOKEN-A
  4. the watch keeps working even though TOKEN-A has been deleted, because the token is only checked upon grpc invocation
  5. killing the etcd process terminates the connection, and the etcdv3 client re-invokes grpc Watch with the same token TOKEN-A
  6. authStore.AuthInfoFromCtx returns ErrInvalidAuthToken because TOKEN-A no longer exists
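
To make the timeline concrete, here is a minimal Python model of it. This is a toy sketch, not etcd code: the class names, the `PermissionError`, and the integer timestamps are all made up for illustration, and it assumes the token TTL is never renewed while the stream is open. The only thing it tries to show is that the token is checked when the watch is opened, not while events are being delivered.

```python
class SimpleTokenKeeper:
    """Toy token store with a fixed TTL (hypothetical model, not etcd's code)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.expiry = {}  # token -> absolute expiry time in seconds

    def issue(self, token, now):
        self.expiry[token] = now + self.ttl

    def sweep(self, now):
        # Delete every token whose TTL has elapsed.
        for token in [t for t, exp in self.expiry.items() if exp <= now]:
            del self.expiry[token]

    def is_valid(self, token, now):
        self.sweep(now)
        return token in self.expiry


class WatchStream:
    """An established stream: events are delivered without any token check."""

    def __init__(self):
        self.events = []

    def deliver(self, event):
        self.events.append(event)


class Server:
    def __init__(self, keeper):
        self.keeper = keeper

    def open_watch(self, token, now):
        # The token is validated only here, when the RPC is invoked.
        if not self.keeper.is_valid(token, now):
            raise PermissionError("invalid auth token")
        return WatchStream()


def run_timeline():
    keeper = SimpleTokenKeeper(ttl_seconds=300)
    server = Server(keeper)
    keeper.issue("TOKEN-A", now=0)                 # step 1: fetch token
    stream = server.open_watch("TOKEN-A", now=1)   # step 2: watch starts
    keeper.sweep(now=301)                          # step 3: token deleted
    stream.deliver("PUT /a b")                     # step 4: stream still works
    try:
        server.open_watch("TOKEN-A", now=302)      # step 5: reconnect, same token
    except PermissionError as err:                 # step 6: token is gone
        return stream.events, str(err)
    return stream.events, None
```

Calling `run_timeline()` returns the one event delivered after the token was deleted, together with the error raised on reconnect.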

Impact

The experiment was conducted using v3.4.9. The good part in this version is that the client raises a permission denied error and terminates the watch.

However, in our live cluster the etcd server is v3.4.3 and the etcdv3 client is v3.3.8, and there is no error, no log, no output and no termination. Everything looks fine, but the watch has failed silently, which is bad.

Sometimes we have little control over the client version: for example, calico-felix v3.4 is bound to clientv3 v3.3.8, and upgrading Calico in a live environment is delicate.

Improvement

In my opinion there are 2 ways to improve:

  1. Improve the watch keepalive mechanism. Correct me if I'm wrong, but we already have watch control messages that the server sends to the client periodically; if the watch client replied with a keepalive response or something similar, we could invoke simpleTokenKeeper.resetSimpleToken to renew the TTL.
  2. Improve the client side. The etcdv3 Watcher is merely an interface that returns <-chan WatchResponse; if the client encapsulated re-fetching the token and re-dialing the grpc connection after receiving a "transport is closing" WatchResponse, code using the Watcher from outside the interface would not be affected.
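
The client-side option could look roughly like the Python sketch below. It is only an illustration of the retry pattern, not the clientv3 API: `authenticate` and `open_watch` are hypothetical callbacks supplied by the caller, a broken transport is modelled as `ConnectionError`, and a rejected token as `PermissionError`. It also ignores resuming from the last seen revision, which a real client would need in order to avoid missing events.

```python
def resilient_watch(authenticate, open_watch, max_retries=5):
    """Yield watch events, hiding reconnects and token refreshes from the caller.

    authenticate() -> token and open_watch(token) -> iterable of events are
    hypothetical callbacks; neither is part of any real etcd client.
    """
    token = authenticate()
    retries = 0
    while retries <= max_retries:
        try:
            for event in open_watch(token):
                retries = 0  # progress resets the retry budget
                yield event
            return  # the stream ended cleanly
        except ConnectionError:
            # Transport closed: re-open the watch with the same token first.
            retries += 1
        except PermissionError:
            # Token no longer valid: fetch a fresh one before re-opening.
            retries += 1
            token = authenticate()
    raise RuntimeError("watch could not be re-established")
```

The caller simply iterates over `resilient_watch(...)` and never sees the expired token; it only gets an error once the retry budget is exhausted.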

Related issues

I presume the following issues describe exactly the same problem:
#11121
#11381

looking forward to your kind feedback
