
Unit flake in pkg/storage: TestFiltering #24194

Closed

ncdc opened this issue Apr 13, 2016 · 22 comments

ncdc commented Apr 13, 2016

--- FAIL: TestFiltering (0.97s)
    cacher_test.go:167: (called from line 342)
    cacher_test.go:168: Expected: DELETED, got: ADDED
    cacher_test.go:167: (called from line 343)
    cacher_test.go:168: Expected: ADDED, got: DELETED
    cacher_test.go:171: (called from line 343)
    cacher_test.go:172: Expected (ADDED): &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"filter":"foo"}, Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc8203ae650), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"", SecurityContext:(*api.PodSecurityContext)(0xc82089ec80), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}, got: &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"5", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc8206ffed0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"", SecurityContext:(*api.PodSecurityContext)(0xc8206c4c40), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}
    cacher_test.go:167: (called from line 344)
    cacher_test.go:168: Expected: MODIFIED, got: ADDED
    cacher_test.go:171: (called from line 344)
    cacher_test.go:172: Expected (MODIFIED): &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"filter":"foo"}, Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc8203ae660), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"fakeNode", SecurityContext:(*api.PodSecurityContext)(0xc82089ed00), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}, got: &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"6", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"filter":"foo"}, Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc820386e28), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"", SecurityContext:(*api.PodSecurityContext)(0xc8201893c0), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}
    cacher_test.go:167: (called from line 345)
    cacher_test.go:168: Expected: DELETED, got: MODIFIED
FAIL
FAIL    k8s.io/kubernetes/pkg/storage   12.878s

Running the test in an endless loop, I wasn't able to reproduce immediately, but I was able to reproduce as soon as I started running stress -c 4 in a separate terminal.

@ncdc ncdc added the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 13, 2016

ncdc commented Apr 13, 2016

cc @wojtek-t

@wojtek-t wojtek-t self-assigned this Apr 13, 2016
wojtek-t commented:

Will take a look


ncdc commented Apr 13, 2016

cc @liggitt


ncdc commented Apr 13, 2016

I'm guessing that #24008 has something to do with this.

wojtek-t commented:

@ncdc - I have a different hypothesis, let me verify

wojtek-t commented:

Yup - the reason is different.

The problem is that if the watch is already initiated (but its resourceVersion is still in the future relative to the cacher's current state), it doesn't check whether the event it has just received is fresh enough to be sent to the watcher.

For reproducing this problem, see:
https://github.com/kubernetes/kubernetes/compare/master...wojtek-t:fix_cacher?expand=1

The test that I added there should be failing, but it's passing.

I will fix this.


liggitt commented Apr 13, 2016

@wojtek-t your example shouldn't get the added event unless it occurred past resourceVersion 1000, right?


liggitt commented Apr 13, 2016

oh, misread... should be failing, but is passing :)

wojtek-t commented:

@liggitt - exactly.
I'm afraid I won't be able to finish the fix today. I can do it tomorrow morning (or feel free to pick it up - my test already serves as a reproduction to check whether it is fixed :P)

wojtek-t commented:

To clarify - the fix is to pass the resourceVersion together with events to the cacheWatcher (as part of watchCacheEvent), and in the filtering function check whether the RV is greater than the initial one.
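
[Editor's sketch] A minimal, self-contained illustration of that idea. watchCacheEvent and cacheWatcher are the names used in this thread, but the field names, the process() shape, and the stand-in types below are assumptions for illustration, not the actual Kubernetes patch:

    package main

    import "fmt"

    type eventType string

    const (
        added    eventType = "ADDED"
        modified eventType = "MODIFIED"
    )

    // watchCacheEvent carries the resourceVersion already parsed as a number,
    // so the watcher never has to re-parse it from the object's metadata.
    type watchCacheEvent struct {
        Type            eventType
        Object          string // stand-in for the real runtime.Object
        ResourceVersion uint64
    }

    // cacheWatcher remembers the resourceVersion its Watch call started from.
    type cacheWatcher struct {
        initialResourceVersion uint64
        result                 chan watchCacheEvent
    }

    // process forwards an event only if it is strictly newer than the
    // watcher's starting resourceVersion (">" rather than ">=", per the
    // discussion further down in this thread).
    func (w *cacheWatcher) process(ev watchCacheEvent) {
        if ev.ResourceVersion <= w.initialResourceVersion {
            return // stale: already reflected in the state the client started from
        }
        w.result <- ev
    }

    func main() {
        w := &cacheWatcher{initialResourceVersion: 5, result: make(chan watchCacheEvent, 2)}
        w.process(watchCacheEvent{added, "pod/foo", 5})    // dropped: not newer than 5
        w.process(watchCacheEvent{modified, "pod/foo", 6}) // delivered
        close(w.result)
        for ev := range w.result {
            fmt.Printf("%s %s rv=%d\n", ev.Type, ev.Object, ev.ResourceVersion)
        }
    }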


liggitt commented Apr 13, 2016

right


ncdc commented Apr 13, 2016

Does something like this look like it's going in the right direction? https://gist.github.com/47d197657af47d43a5717a2de2b68743

wojtek-t commented:

@ncdc - I wanted to avoid the number parsing you're doing in the process() method, and instead add a new "ResourceVersion" field to watchCacheEvent, where we already have it parsed.
On the other hand, maybe this kind of optimization doesn't make much sense. WDYT?


ncdc commented Apr 13, 2016

@wojtek-t I've now read through more of the cache code and I think it's best to avoid parsing the resource version twice. I'll update accordingly.


ncdc commented Apr 13, 2016

When comparing the event's resourceVersion with the value passed to Watch, do we pass the event through if its resourceVersion is > starting value, or >= starting value?


ncdc commented Apr 13, 2016

Also, @wojtek-t, the unit test you added above passes now but should fail once the corrected code is in place... is there an easy way to add a "future" test that runs quickly and won't flake? I feel like checking for the absence of an event could be problematic.


liggitt commented Apr 13, 2016

do we pass the event through if its resourceVersion is > starting value, or >= starting value?

>

wojtek-t commented:

Also, @wojtek-t, the unit test you added above passes now but should fail once the corrected code is in place... is there an easy way to add a "future" test that runs quickly and won't flake? I feel like checking for the absence of an event could be problematic.

Yeah - that's problematic. What we can do is wait e.g. 1s for it and pass if nothing arrives. It's far from ideal, but probably better than nothing.


ncdc commented Apr 13, 2016

I'm really hesitant to add something that waits 1s. That sounds really dangerous (flaky). What about sending data to etcd with various resource versions and making sure you only get the ones newer than the initial resource version?


liggitt commented Apr 13, 2016

watch a few resourceVersions into the future, then add enough objects to reach your watch window, and ensure you didn't get the ones prior to that


ncdc commented Apr 13, 2016

Yeah that's what I was thinking


liggitt commented Apr 13, 2016

  1. start etcd
  2. add an object
  3. watch at object.resourceVersion + 10
  4. add 10 objects
  5. ensure you only get one event
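
[Editor's sketch] That recipe as a Go test. All helpers here - startTestCacher, createTestPod, watchPodsFrom, rvOf - are hypothetical stand-ins for the fixtures in cacher_test.go (imports are elided too), so this shows only the shape of the test, not working code:

    func TestWatchHonorsStartingResourceVersion(t *testing.T) {
        c := startTestCacher(t) // hypothetical: etcd-backed cacher fixture
        defer c.Stop()

        // Steps 2-3: add one object, then watch 10 versions into the future.
        startRV := rvOf(t, createTestPod(t, c, "pod-0")) + 10
        w := watchPodsFrom(t, c, startRV) // hypothetical: opens the watch at startRV
        defer w.Stop()

        // Step 4: keep adding objects until one lands strictly past startRV.
        // The recipe says "add 10 objects"; looping on the returned
        // resourceVersion avoids assuming each write bumps it by exactly one.
        lastRV := uint64(0)
        for i := 1; lastRV <= startRV; i++ {
            lastRV = rvOf(t, createTestPod(t, c, fmt.Sprintf("pod-%d", i)))
        }

        // Step 5: exactly one event should arrive, and it must be strictly
        // newer than startRV.
        select {
        case ev := <-w.ResultChan():
            if got := rvOf(t, ev.Object); got <= startRV {
                t.Errorf("got event at resourceVersion %d, want > %d", got, startRV)
            }
        case <-time.After(10 * time.Second):
            t.Fatal("timed out waiting for the expected event")
        }
        // A real test would need a short grace period here to catch late
        // extra events without reintroducing flakiness.
        select {
        case extra := <-w.ResultChan():
            t.Errorf("unexpected extra event: %#v", extra)
        default:
        }
    }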

@ncdc ncdc removed the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 14, 2016
k8s-github-robot pushed a commit that referenced this issue Apr 14, 2016
Automatic merge from submit-queue

Honor starting resourceVersion in watch cache

Compare the requested resourceVersion to each event's resourceVersion to ensure events that occurred
in the past are not sent to the client.

Fixes #24194 

cc @liggitt @wojtek-t