Description
Reproduction Procedure
- prepare an etcd cluster with 3 nodes and the respective etcdctl

  ```
  /go/src/github.com/coreos/etcd # etcd --version
  etcd Version: 3.4.9
  Git SHA: Not provided (use ./build instead of go build)
  Go Version: go1.14.2
  Go OS/Arch: linux/amd64
  ```
  though I don't think the configuration matters for this issue, let me present one of them:

  ```yaml
  name: etcd2
  data-dir: /data/zz_6129484611666145821
  listen-peer-urls: http://0.0.0.0:23800
  listen-client-urls: http://0.0.0.0:23790
  initial-advertise-peer-urls: http://10.213.20.39:23800
  advertise-client-urls: http://10.213.20.39:23790
  initial-cluster: etcd0=http://10.213.20.40:23800,etcd1=http://10.213.20.38:23800,etcd2=http://10.213.20.39:23800
  initial-cluster-token: zz
  initial-cluster-state: new
  auto-compaction-retention: "1"
  quota-backend-bytes: -1
  ```
  and the etcd processes are started simply by

  ```
  etcd --config-file /etc/etcd.conf
  ```

  with no environment variables and no extra command-line arguments
- enable auth

  ```
  etcdctl user add root
  # then type the root password; mine is 'root'
  etcdctl --user root:root auth enable
  ```
- watch

  ```
  etcdctl --user root:root watch / --prefix
  ```

  we can put a key to ensure the watch is working right now (note that with auth enabled, the put also needs credentials):

  ```
  etcdctl --user root:root put /a b
  ```
- wait for the token to be deleted

  wait 5 minutes until the token is deleted by `simpleTokenKeeper`, then kill the etcd processes one by one, restarting each immediately after killing it. Note: do NOT kill the next process until the cluster has recovered to healthy.
- watch fails

  you'll then see that the watch is down, with the output `permission denied`
Analysis
The issue is caused by `simpleTokenKeeper`. Here is the timeline:
- etcdctl dials gRPC and fetches an auth token, let's say `TOKEN-A`
- etcdctl dials gRPC with `TOKEN-A` and runs `watch / --prefix` as expected
- after 5 minutes, `simpleTokenKeeper` deletes `TOKEN-A`
- the watch continues working even though `TOKEN-A` has been deleted, because the token is only checked upon gRPC invocation
- killing the etcd process terminates the connection, and the etcdv3 client re-invokes the gRPC Watch with the same token `TOKEN-A`
- `authStore.AuthInfoFromCtx` returns `ErrInvalidAuthToken` because `TOKEN-A` no longer exists
Impact
The experiment was conducted using v3.4.9; the good part of this version is that the client raises a `permission denied` error and terminates the watch.
However, in our live cluster the etcd server is v3.4.3 and the etcdv3 client is v3.3.8, and there is no error, no log, no output, no termination: everything looks fine, but the watch has failed silently. This is bad.
Sometimes we can barely control the client version; for example, calico-felix v3.4 is bound to clientv3 v3.3.8, and upgrading the Calico version in a live environment is delicate.
Improvement
In my opinion there are 2 ways to improve:
- Improve the watch keepalive mechanism. Correct me if I'm wrong, but we already have watch control messages that are sent periodically from server to client; if the watch client could reply with a keepalive response or something similar, the server could invoke `simpleTokenKeeper.resetSimpleToken` to renew the token's TTL.
- Improve the client side. The etcdv3 `Watcher` is merely an interface that returns `<-chan WatchResponse`; if we could encapsulate re-fetching the token and re-dialing the gRPC connection after receiving a `transport is closing` `WatchResponse`, the Watcher clients outside the interface would not be affected.
Related issues
I presume the following issues describe exactly the same problem:
#11121
#11381
Looking forward to your kind feedback.