
Unit flake in pkg/storage: TestFiltering #24194

Closed

ncdc opened this issue Apr 13, 2016 · 22 comments

ncdc commented Apr 13, 2016

--- FAIL: TestFiltering (0.97s)
    cacher_test.go:167: (called from line 342)
    cacher_test.go:168: Expected: DELETED, got: ADDED
    cacher_test.go:167: (called from line 343)
    cacher_test.go:168: Expected: ADDED, got: DELETED
    cacher_test.go:171: (called from line 343)
    cacher_test.go:172: Expected (ADDED): &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"filter":"foo"}, Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc8203ae650), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"", SecurityContext:(*api.PodSecurityContext)(0xc82089ec80), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}, got: &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"5", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc8206ffed0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"", SecurityContext:(*api.PodSecurityContext)(0xc8206c4c40), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}
    cacher_test.go:167: (called from line 344)
    cacher_test.go:168: Expected: MODIFIED, got: ADDED
    cacher_test.go:171: (called from line 344)
    cacher_test.go:172: Expected (MODIFIED): &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"filter":"foo"}, Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc8203ae660), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"fakeNode", SecurityContext:(*api.PodSecurityContext)(0xc82089ed00), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}, got: &api.Pod{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"foo", GenerateName:"", Namespace:"ns", SelfLink:"", UID:"", ResourceVersion:"6", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"filter":"foo"}, Annotations:map[string]string(nil)}, Spec:api.PodSpec{Volumes:[]api.Volume(nil), Containers:[]api.Container(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc820386e28), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", NodeName:"", SecurityContext:(*api.PodSecurityContext)(0xc8201893c0), ImagePullSecrets:[]api.LocalObjectReference(nil)}, Status:api.PodStatus{Phase:"", Conditions:[]api.PodCondition(nil), Message:"", Reason:"", HostIP:"", PodIP:"", StartTime:(*unversioned.Time)(nil), ContainerStatuses:[]api.ContainerStatus(nil)}}
    cacher_test.go:167: (called from line 345)
    cacher_test.go:168: Expected: DELETED, got: MODIFIED
FAIL
FAIL    k8s.io/kubernetes/pkg/storage   12.878s

Running the test in an endless loop, I wasn't able to reproduce immediately, but I was able to reproduce as soon as I started running stress -c 4 in a separate terminal.

@ncdc ncdc added the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 13, 2016

ncdc commented Apr 13, 2016

cc @wojtek-t

@wojtek-t wojtek-t self-assigned this Apr 13, 2016
wojtek-t commented:

Will take a look


ncdc commented Apr 13, 2016

cc @liggitt


ncdc commented Apr 13, 2016

I'm guessing that #24008 has something to do with this.

wojtek-t commented:

@ncdc - I have a different hypothesis, let me verify

wojtek-t commented:

Yup - the reason is different.

The problem is that if the watch is already initiated (but its resourceVersion is still in the future relative to the cacher's current state), it doesn't check whether the event it has just received is fresh enough to be sent to the watcher.

For reproducing this problem, see:
https://github.com/kubernetes/kubernetes/compare/master...wojtek-t:fix_cacher?expand=1

The test that I added there should be failing, but it's passing.

I will fix this.


liggitt commented Apr 13, 2016

@wojtek-t your example shouldn't get the added event unless it occurred past resourceVersion 1000, right?


liggitt commented Apr 13, 2016

oh, misread... should be failing, but is passing :)

wojtek-t commented:

@liggitt - exactly.
I'm afraid I won't be able to finish the fix today. I can do it tomorrow morning (or feel free to pick it up - my test already serves as a reproduction to check whether it is fixed :P)

wojtek-t commented:

To clarify - the fix is to pass the resourceVersion together with events to the cacheWatcher (as part of watchCacheEvent), and in the filtering function check whether the RV is greater than the initial one.
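
[Editor's sketch] A minimal, self-contained illustration of that idea. watchCacheEvent and cacheWatcher are the names used in this thread, but the field names, the process() shape, and the stand-in types below are assumptions for illustration, not the actual Kubernetes patch:

    package main

    import "fmt"

    type eventType string

    const (
        added    eventType = "ADDED"
        modified eventType = "MODIFIED"
    )

    // watchCacheEvent carries the resourceVersion already parsed as a number,
    // so the watcher never has to re-parse it from the object's metadata.
    type watchCacheEvent struct {
        Type            eventType
        Object          string // stand-in for the real runtime.Object
        ResourceVersion uint64
    }

    // cacheWatcher remembers the resourceVersion its Watch call started from.
    type cacheWatcher struct {
        initialResourceVersion uint64
        result                 chan watchCacheEvent
    }

    // process forwards an event only if it is strictly newer than the
    // watcher's starting resourceVersion (">" rather than ">=", per the
    // discussion further down in this thread).
    func (w *cacheWatcher) process(ev watchCacheEvent) {
        if ev.ResourceVersion <= w.initialResourceVersion {
            return // stale: already reflected in the state the client started from
        }
        w.result <- ev
    }

    func main() {
        w := &cacheWatcher{initialResourceVersion: 5, result: make(chan watchCacheEvent, 2)}
        w.process(watchCacheEvent{added, "pod/foo", 5})    // dropped: not newer than 5
        w.process(watchCacheEvent{modified, "pod/foo", 6}) // delivered
        close(w.result)
        for ev := range w.result {
            fmt.Printf("%s %s rv=%d\n", ev.Type, ev.Object, ev.ResourceVersion)
        }
    }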


liggitt commented Apr 13, 2016

right


ncdc commented Apr 13, 2016

Does something like this look like it's going in the right direction? https://gist.github.com/47d197657af47d43a5717a2de2b68743

wojtek-t commented:

@ncdc - I wanted to avoid the number parsing you're doing in the process() method, and instead add a new "ResourceVersion" field to watchCacheEvent, where we already have it parsed.
On the other hand, maybe this kind of optimization doesn't make much sense. WDYT?


ncdc commented Apr 13, 2016

@wojtek-t I've now read through more of the cache code and I think it's best to avoid parsing the resource version twice. I'll update accordingly.


ncdc commented Apr 13, 2016

When comparing the event's resourceVersion with the value passed to Watch, do we pass the event through if its resourceVersion is > starting value, or >= starting value?


ncdc commented Apr 13, 2016

Also, @wojtek-t, the unit test you added above passes now but should fail once the corrected code is in place... is there an easy way to add a "future" test that runs quickly and won't flake? I feel like checking for the absence of an event could be problematic.


liggitt commented Apr 13, 2016

do we pass the event through if its resourceVersion is > starting value, or >= starting value?

>

wojtek-t commented:

Also, @wojtek-t, the unit test you added above passes now but should fail once the corrected code is in place... is there an easy way to add a "future" test that runs quickly and won't flake? I feel like checking for the absence of an event could be problematic.

Yeah - that's problematic. What we can do is wait e.g. 1s for it and pass if nothing arrives. It's far from ideal, but probably better than nothing.


ncdc commented Apr 13, 2016

I'm really hesitant to add something that waits 1s. That sounds really dangerous (flaky). What about sending data to etcd with various resource versions and making sure you only get the ones newer than the initial resource version?


liggitt commented Apr 13, 2016

watch a few resourceVersions into the future, then add enough objects to reach your watch window, and ensure you didn't get the ones prior to that


ncdc commented Apr 13, 2016

Yeah that's what I was thinking


liggitt commented Apr 13, 2016

  1. start etcd
  2. add an object
  3. watch at object.resourceVersion + 10
  4. add 10 objects
  5. ensure you only get one event
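
[Editor's sketch] That recipe as a Go test. All helpers here - startTestCacher, createTestPod, watchPodsFrom, rvOf - are hypothetical stand-ins for the fixtures in cacher_test.go (imports are elided too), so this shows only the shape of the test, not working code:

    func TestWatchHonorsStartingResourceVersion(t *testing.T) {
        c := startTestCacher(t) // hypothetical: etcd-backed cacher fixture
        defer c.Stop()

        // Steps 2-3: add one object, then watch 10 versions into the future.
        startRV := rvOf(t, createTestPod(t, c, "pod-0")) + 10
        w := watchPodsFrom(t, c, startRV) // hypothetical: opens the watch at startRV
        defer w.Stop()

        // Step 4: keep adding objects until one lands strictly past startRV.
        // The recipe says "add 10 objects"; looping on the returned
        // resourceVersion avoids assuming each write bumps it by exactly one.
        lastRV := uint64(0)
        for i := 1; lastRV <= startRV; i++ {
            lastRV = rvOf(t, createTestPod(t, c, fmt.Sprintf("pod-%d", i)))
        }

        // Step 5: exactly one event should arrive, and it must be strictly
        // newer than startRV.
        select {
        case ev := <-w.ResultChan():
            if got := rvOf(t, ev.Object); got <= startRV {
                t.Errorf("got event at resourceVersion %d, want > %d", got, startRV)
            }
        case <-time.After(10 * time.Second):
            t.Fatal("timed out waiting for the expected event")
        }
        // A real test would need a short grace period here to catch late
        // extra events without reintroducing flakiness.
        select {
        case extra := <-w.ResultChan():
            t.Errorf("unexpected extra event: %#v", extra)
        default:
        }
    }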

@ncdc ncdc removed the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 14, 2016
k8s-github-robot pushed a commit that referenced this issue Apr 14, 2016
Automatic merge from submit-queue

Honor starting resourceVersion in watch cache

Compare the requested resourceVersion to each event's resourceVersion to ensure events that occurred
in the past are not sent to the client.

Fixes #24194 

cc @liggitt @wojtek-t