Flaky: DisruptionController evictions: enough pods, replicaSet, percentage => should allow an eviction #45706
Note that the test "DisruptionController should update PodDisruptionBudget status" also started failing at the same time.
@fejta So I'm digging backwards from the original gubernator link and some data doesn't make any sense to me; when you have a moment, could you poke me on IRC?
This doesn't appear to be the test at all. Am I missing something?
Note that failures from PRs will be noisy, since they may also contain people's unmerged, possibly bad changes.
So looking at the non-PR CI runs, I'm seeing a lot of: "May 11 02:13:53.334: All nodes should be ready after test, Not ready nodes: gke-bootstrap-e2e-default-pool-2e2dbc5c-dtrk", which to me seems like an infra flake, but if you are seeing something that I'm not, feel free to point it out to me.
Do you have a gubernator link? Are you focusing on the AWS jobs (put aws into the job textbox)?
Ahh, I was not, thx!
PDB is now owned by @kubernetes/sig-apps-misc. cc @mml
(Which isn't to say that @timothysc shouldn't investigate -- I'm sure everyone will be very happy if he does :) )
Just saw this. I will take a look.
Well this is weird. I downloaded the controller manager log from a failed run.
ah ha!
But... why? Have we changed how
It would be awesome to have a little pre-req on e2e tests that says "this test is meaningless unless controller X is running; check if it is and fail immediately if not." To do that we need the running controllers exported as Prometheus metrics, and a hook that lets us query Prometheus metrics from kube-controller-manager.
DisruptionController refuses to start unless
To help with debugging kubernetes#45706
Dupe of #46871?
Not sure if a dup. #46871 is a flake though - i.e. sometimes the controllers are skipped, sometimes they are not. My knee-jerk reaction was some sort of race between API registration and controller startup.
I'm not clear whether this is a flake or not? If it's a flake, I don't think it's about whether flags are enabled (unless we have deeper problems ;-) )
What is doubly weird is that when I look at the k-c-m logs for the most recent failures for that cluster of failures, the kops jobs have the "skipping" error, but the gce ones do not, e.g.
W0602 03:57:59.767202 5 controllermanager.go:449] Skipping "disruption"
Maybe there are 2 different things happening here.
#46871 is closely related, and I'm working in this area. @justinsb, I think it makes most sense for you to either take both of these issues, or assign both to me. Your call. :-)
To help with debugging kubernetes#45706
Should help debugging. Specifically for kubernetes#45706
Automatic merge from submit-queue Add version and flag info to apiserver and CM logs. Should help debugging. Specifically for #45706
Closing, please reopen if these recur.
Failure clusters: https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&test=should%20allow%20an%20eviction
Most likely cluster causing flakiness:
/kind flake
/priority important-soon
@kubernetes/sig-scheduling-test-failures
/assign @timothysc @davidopp
Sig leads, please
/assign
someone to investigate and /unassign yourself. If this failure is more appropriate for another sig, please
/sig better-sig
and /sig-remove scheduling