Description
Reproduction Procedure
- prepare an etcd cluster with 3 nodes and the respective etcdctl

  ```
  /go/src/github.com/coreos/etcd # etcd --version
  etcd Version: 3.4.9
  Git SHA: Not provided (use ./build instead of go build)
  Go Version: go1.14.2
  Go OS/Arch: linux/amd64
  ```
  though I don't think the configuration matters for this issue, let me present one of them:

  ```yaml
  name: etcd2
  data-dir: /data/zz_6129484611666145821
  listen-peer-urls: http://0.0.0.0:23800
  listen-client-urls: http://0.0.0.0:23790
  initial-advertise-peer-urls: http://10.213.20.39:23800
  advertise-client-urls: http://10.213.20.39:23790
  initial-cluster: etcd0=http://10.213.20.40:23800,etcd1=http://10.213.20.38:23800,etcd2=http://10.213.20.39:23800
  initial-cluster-token: zz
  initial-cluster-state: new
  auto-compaction-retention: "1"
  quota-backend-bytes: -1
  ```
  and the etcd processes are started simply by

  ```
  etcd --config-file /etc/etcd.conf
  ```

  with no environment variables and no extra command-line arguments
- enable auth

  ```
  etcdctl user add root
  # then type the root password; mine is 'root'
  etcdctl --user root:root auth enable
  ```
- watch

  ```
  etcdctl --user root:root watch / --prefix
  ```

  we can put a key to ensure the watch is working right now (note that with auth enabled, the put also needs credentials):

  ```
  etcdctl --user root:root put /a b
  ```
- wait for the token to be deleted

  wait 5 minutes until the token is deleted by `simpleTokenKeeper`, then kill the etcd processes one by one, restarting each immediately after killing it. Note: do NOT kill the next process until the cluster has recovered to healthy.
- watch fails

  you'll then see that the watch is down, with the output `permission denied`
Analysis
The issue is caused by `simpleTokenKeeper`. Here is the timeline:
- etcdctl dials gRPC and fetches an auth token, let's say `TOKEN-A`
- etcdctl dials gRPC with `TOKEN-A` and runs `watch / --prefix` as expected
- after 5 minutes, `simpleTokenKeeper` deletes `TOKEN-A`
- the watch continues working even though `TOKEN-A` has been deleted, because the token is only checked upon gRPC invocation
- killing the etcd process terminates the connection, and the etcdv3 client re-invokes the gRPC Watch with the same token `TOKEN-A`
- `authStore.AuthInfoFromCtx` returns `ErrInvalidAuthToken` because `TOKEN-A` no longer exists
Impact
The experiment was conducted using v3.4.9; the good part of this version is that the client raises a `permission denied` error and terminates the watch.
However, in our live cluster the etcd server is v3.4.3 and the etcdv3 client is v3.3.8, and there is no error, no log, no output, no termination: everything looks fine, but the watch has failed silently. This is bad.
Sometimes we can barely control the client version; for example, calico-felix v3.4 is bound to clientv3 v3.3.8, and upgrading the Calico version in a live environment is delicate.
Improvement
In my opinion there are 2 ways to improve:
- Improve the watch keepalive mechanism. Correct me if I'm wrong, but we already have watch control messages that are sent periodically from server to client; if the watch client could reply with a keepalive response or something similar, the server could invoke `simpleTokenKeeper.resetSimpleToken` to renew the token's TTL.
- Improve the client side. The etcdv3 `Watcher` is merely an interface that returns `<-chan WatchResponse`; if we could encapsulate re-fetching the token and re-dialing the gRPC connection after receiving a `transport is closing` `WatchResponse`, the Watcher clients outside the interface would not be affected.
Related issues
I presume the following issues describe exactly the same problem:
#11121
#11381
Looking forward to your kind feedback.