Add checking of events after all pods started to verify no failures in #6638
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
	}.AsSelector(),
)
expectNoError(err)
last = current
This is a really inefficient way of getting all the events you want. I recommend waiting for #6546 to land and using the NewInformer() thing it adds to collect events you care about.
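A rough sketch of what informer-based collection could look like, written with today's client-go names (cache.NewInformer, NewListWatchFromClient) rather than whatever exact API #6546 adds; the namespace scoping, channel buffering, and handler wiring are assumptions for illustration, not this PR's code:

```go
// Sketch only: stream events via a single watch instead of repeated lists.
package e2e

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// collectEvents delivers every Event seen in namespace on the returned channel
// until stop is closed, at which point the channel is closed as well.
func collectEvents(c kubernetes.Interface, namespace string, stop <-chan struct{}) <-chan *v1.Event {
	out := make(chan *v1.Event, 100)
	lw := cache.NewListWatchFromClient(
		c.CoreV1().RESTClient(), "events", namespace, fields.Everything())
	_, controller := cache.NewInformer(lw, &v1.Event{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if e, ok := obj.(*v1.Event); ok {
				out <- e // hand each event to the test's checking loop
			}
		},
	})
	go func() {
		controller.Run(stop) // blocks until stop is closed
		close(out)
	}()
	return out
}
```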
How close is #6546 to landing? Can we accept this as is and log another enhancement to convert to NewInformer?
#6546 should hopefully merge today. I am concerned that this method of getting events is going to be hard enough on apiserver that it will actually cause a performance problem. I'm also not sure about this stopping condition-- is it a guarantee that no events are generated in steady state? That may currently be the case but I don't think it's guaranteed...
In my testing, the only problems came about when a pod was continually restarted, as that causes a never-ending spew of events. Even in cases where there was an error, the test went okay. I dislike the notion of the test hanging because of a pod continually restarting, but at the moment I don't see a way to avoid it. Either the test provides a false positive by shutting off the event collection too soon and potentially missing an issue, or it hangs. The latter seemed better because it would force an investigation, but neither is desirable imo.
I could put in an insanely large timeout, like 20 minutes or something. Thoughts @lavalamp ?
Sorry for delay-- #6546 has merged!
I recommend using that to collect events, run your assertion on every event you see, and stop the controller after the replication controller has shut down. I don't know that it's worth it to wait for every last event-- there's not really a firm bound on how long that could take.
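Continuing the sketch above (same assumed package and imports), roughly what that flow could look like; checkEvent is a hypothetical stand-in for the test's per-event assertion:

```go
// Sketch of the flow described here: run the assertion on every event as it
// arrives, and stop the informer once the replication controller is gone.
func watchAndCheck(c kubernetes.Interface, ns string, checkEvent func(*v1.Event)) (stopWatching func()) {
	stop := make(chan struct{})
	events := collectEvents(c, ns, stop) // from the sketch above
	go func() {
		for e := range events { // ends when collectEvents closes the channel
			checkEvent(e)
		}
	}()
	// The caller invokes the returned func after deleting the RC; there is no
	// firm bound on when the very last event would show up, so we don't wait.
	return func() { close(stop) }
}
```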
@lavalamp I'm not sure I see how using 6546 will help ensure that we inspect all the events that are logged. I originally only cared about the start events, but there's no reason not to care about the shutdown events as well. If I move the inspection of the events to after the rc is deleted, I should be able to wait for all the events to be logged. As long as the system is properly ensuring pods are shut down, the event stream should stop within a reasonable time after all the pods are stopped. This should bound the test and prevent it from running forever.
I'm not sure how to do that with 6546 though. I don't see any way to use that change and know that events have all been generated and it's ok to stop the controller.
You could use exactly the same logic-- stop if you haven't gotten an event in 10 seconds. It's just that the current code does a lot of lists which are unfortunately quite heavyweight.
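If you did want that same "no event for N seconds" rule on top of a watch, a small quiescence helper along these lines would do it; illustrative only, in the same assumed package as the earlier sketches (plus a time import):

```go
// waitForQuiet drains events until none has arrived for `quiet`, then returns
// everything collected. The producer closing the channel also ends the wait.
func waitForQuiet(events <-chan *v1.Event, quiet time.Duration) []*v1.Event {
	var seen []*v1.Event
	timer := time.NewTimer(quiet)
	defer timer.Stop()
	for {
		select {
		case e, ok := <-events:
			if !ok {
				return seen // producer closed the channel
			}
			seen = append(seen, e)
			if !timer.Stop() {
				<-timer.C // drain a timer that already fired
			}
			timer.Reset(quiet)
		case <-timer.C:
			return seen // no event for `quiet`; treat the stream as settled
		}
	}
}
```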
If the concern is the efficiency of the test, then I would rather get the change in now and enhance it later. Ensuring the condition is tested is more important than ensuring it is done in the most efficient way imo. I couldn't say that any other e2e test is performing its functions in the most efficient way possible.
My concern has been bounding the event collection loop, and I've done that in a change I can push.
Force-pushed from dfdd36e to 020ba6a. Commit: "…if not all are logged 10 minutes after all pods are started" (kubernetes#6637).
cc - @wojtek-t @fgrzadkowski
@lavalamp Any issues with the latest changes?
timeout := 10 * time.Minute
for start := time.Now(); last < current && time.Since(start) < timeout; time.Sleep(10 * time.Second) {
	last = current
	current = len(events)
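For readers skimming the excerpt above, a stand-alone version of that bounded polling idea might look like the following; countEvents and the parameter values are illustrative stand-ins, not the PR's exact code:

```go
// waitForEventsToSettle polls the shared event count every pollInterval and
// stops once no new events arrived between polls, or once timeout elapses.
// It reports whether the stream settled before the deadline.
func waitForEventsToSettle(countEvents func() int, pollInterval, timeout time.Duration) bool {
	last := -1
	current := countEvents()
	start := time.Now()
	for last < current && time.Since(start) < timeout {
		time.Sleep(pollInterval) // give the system a chance to emit more events
		last = current
		current = countEvents()
	}
	return last >= current // true if no new events arrived in the last interval
}
```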
I think it's a pretty harmless race, but you have the potential for concurrent reads and writes to events. Please send a follow-up PR that adds a lock.
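A sketch of the kind of follow-up being asked for: keep the shared slice behind a mutex so the collector goroutine (writer) and the polling loop (reader) cannot race. The type and method names are illustrative, and sync would need to be imported in the assumed package:

```go
// eventLog guards the shared event slice so the collector goroutine (writer)
// and the polling loop (reader) never touch it concurrently.
type eventLog struct {
	mu     sync.Mutex
	events []*v1.Event
}

func (l *eventLog) add(e *v1.Event) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
}

func (l *eventLog) len() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.events)
}
```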
LGTM, but please send a follow-up.
Add checking of events after all pods started to verify no failures in density test #6637