Flaky: DisruptionController evictions: enough pods, replicaSet, percentage => should allow an eviction #45706

fejta · 2017-05-12T07:23:03Z

Failure clusters: https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&test=should%20allow%20an%20eviction

Most likely cluster causing flakiness:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/dns.go:363
Expected error:
    <*errors.errorString | 0xc420179330>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/dns.go:247

/kind flake
/priority important-soon

@kubernetes/sig-scheduling-test-failures
/assign @timothysc @davidopp

Sig leads, please /assign someone to investigate and /unassign yourself.
If this failure is more appropriate for another sig please /sig better-sig and /sig-remove scheduling


fejta · 2017-05-12T07:25:20Z

Note that the test "DisruptionController should update PodDisruptionBudget status" also started failing at the same time.

timothysc · 2017-05-12T20:12:21Z

@fejta So I'm digging backwards from the original gubernator link and some of the data doesn't make any sense to me. When you have a moment, could you poke me on IRC?

Expected error:
    <*json.UnsupportedTypeError | 0xc420794dd0>: {
        Type: {
            size: 0x8,
            ptrdata: 0x8,
            hash: 2634747826,
            tflag: 2,
            align: 8,
            fieldAlign: 8,
            kind: 51,
            alg: {hash: nil, equal: nil},
            gcdata: 1,
            str: 2454528,
            ptrToThis: 0,
        },
    }
    json: unsupported type: func(*tls.CertificateRequestInfo) (*tls.Certificate, error)
not to have occurred

This doesn't appear to be the test at all. Am I missing something?

fejta · 2017-05-12T20:47:47Z

Note that failures from PRs will be noisy, since they also may contain people's possibly bad changes that aren't merged.

timothysc · 2017-05-12T21:23:22Z

Looking only at the CI runs (no PR runs), I'm seeing a lot of:

"May 11 02:13:53.334: All nodes should be ready after test, Not ready nodes: ", gke-bootstrap-e2e-default-pool-2e2dbc5c-dtrk"

which to me seems like an infra flake, but if you are seeing something that I'm not, feel free to point it out to me.

fejta · 2017-05-12T21:33:34Z

Do you have a gubernator link? Are you focusing on the AWS jobs (put aws into the job textbox)?

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=aws&test=should%20allow%20an%20eviction
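The filtered link above can also be assembled programmatically. This is a minimal sketch that assumes only the query parameters visible in the two triage links in this thread (`ci`, `pr`, `job`, `test`); the helper name `triage_url` is made up for illustration and is not part of gubernator.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the links in this thread.
TRIAGE_BASE = "https://storage.googleapis.com/k8s-gubernator/triage/index.html"


def triage_url(test, job=None, ci=None, pr=None):
    """Build a triage link; triage_url is a hypothetical helper, not part of any tool."""
    params = {}
    if ci is not None:
        params["ci"] = int(ci)
    if pr is not None:
        params["pr"] = int(pr)
    if job:
        params["job"] = job
    params["test"] = test
    # quote_via=quote encodes spaces as %20, matching the links in the thread.
    return TRIAGE_BASE + "?" + urlencode(params, quote_via=quote)


print(triage_url("should allow an eviction", job="aws"))
```

Calling `triage_url("should allow an eviction", ci=0, pr=1)` reproduces the query string of the link in the opening comment.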

timothysc · 2017-05-12T21:42:37Z

Ahh I was not, thx!

davidopp · 2017-05-12T22:05:21Z

PDB is now owned by @kubernetes/sig-apps-misc

cc/ @mml

davidopp · 2017-05-12T22:05:45Z

(Which isn't to say that @timothysc shouldn't investigate -- I'm sure everyone will be very happy if he does :) )

mml · 2017-05-18T17:32:53Z

Just saw this. I will take a look.

mml · 2017-05-18T17:47:21Z

Well this is weird. I downloaded the controller manager log from a failed run.

% grep disruption.go ~/Downloads/kube-controller-manager.log
zsh: exit 1     grep disruption.go ~/Downloads/kube-controller-manager.log
% wc -l !$
wc -l ~/Downloads/kube-controller-manager.log
6931 /usr/local/google/home/mml/Downloads/kube-controller-manager.log

ah ha!

I0516 20:54:54.679359       5 controllermanager.go:437] Starting "disruption"
W0516 20:54:54.679370       5 controllermanager.go:444] Skipping "disruption"

But... why? Have we changed how kube-controller-manager is invoked on AWS?
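Grepping for `disruption.go` finds nothing in this case, because the only evidence is the pair of `controllermanager.go` lines. A small sketch of a scan that looks for those messages instead; it assumes the log format matches the two lines quoted above, and the function name `skipped_controllers` is illustrative.

```python
import re

# Matches the two controllermanager.go messages quoted above, e.g.
#   W0516 20:54:54.679370       5 controllermanager.go:444] Skipping "disruption"
LINE_RE = re.compile(r'controllermanager\.go:\d+\] (Starting|Skipping) "([^"]+)"')


def skipped_controllers(log_lines):
    """Return the sorted names of controllers that were logged as skipped."""
    skipped = set()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match and match.group(1) == "Skipping":
            skipped.add(match.group(2))
    return sorted(skipped)


sample = [
    'I0516 20:54:54.679359       5 controllermanager.go:437] Starting "disruption"',
    'W0516 20:54:54.679370       5 controllermanager.go:444] Skipping "disruption"',
]
print(skipped_controllers(sample))
```

To run it against a downloaded log, pass an open file object, since the function only iterates over lines.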

mml · 2017-05-18T17:58:50Z

It would be awesome to have a little prerequisite on e2e tests that says "this test is meaningless unless controller X is running; check whether it is and fail immediately if not." To do that, we need to have the running controllers exported as Prometheus metrics, and a hook that lets us query Prometheus metrics from kube-controller-manager.
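A rough sketch of what the test-side half of such a check could look like. It assumes a gauge named `controller_running` with a `name` label; that metric is purely hypothetical, since the comment above says nothing like it is exported yet, and the function names are illustrative.

```python
import re

# HYPOTHETICAL metric: the thread notes that nothing like this is exported yet.
# Assumed exposition line: controller_running{name="disruption"} 1
METRIC_RE = re.compile(r'^controller_running\{name="([^"]+)"\}\s+(\S+)\s*$')


def running_controllers(metrics_text):
    """Parse Prometheus text-format output into the set of running controller names."""
    running = set()
    for line in metrics_text.splitlines():
        match = METRIC_RE.match(line.strip())
        if match and float(match.group(2)) == 1.0:
            running.add(match.group(1))
    return running


def require_controller(metrics_text, name):
    """Fail immediately, before the test body runs, if the controller is not running."""
    if name not in running_controllers(metrics_text):
        raise RuntimeError(f"prerequisite not met: controller {name!r} is not running")


# Made-up scrape output for illustration.
scrape = 'controller_running{name="disruption"} 0\ncontroller_running{name="replicaset"} 1\n'
require_controller(scrape, "replicaset")  # passes silently
```

With this sample scrape, `require_controller(scrape, "disruption")` raises right away, which is the "fail immediately" behavior described above, rather than timing out later in the test.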

mml · 2017-05-18T18:04:33Z

DisruptionController refuses to start unless poddisruptionbudgets in policy/v1beta1 (exactly) is available. This is exactly the pattern we discourage here, but the root cause is probably that we aren't passing the right flags to kube-apiserver to turn on the policy group.
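The exact-match gate described here can be pictured as a lookup against API discovery data. This is only an illustration of the idea, not the kube-controller-manager code: the dict shape, the function name `resource_available`, and the sample data are all assumptions.

```python
# Illustrative stand-in for API discovery data: groupVersion -> resource names.
# The dict shape and the function name are assumptions, not the controller-manager code.
def resource_available(discovered, group_version, resource):
    """True only if this exact groupVersion serves the resource."""
    return resource in discovered.get(group_version, set())


# Made-up sample data: one apiserver with the policy group enabled, one without.
policy_enabled = {"policy/v1beta1": {"poddisruptionbudgets"}}
policy_missing = {"apps/v1beta1": {"deployments"}}

for name, discovered in [("policy enabled", policy_enabled), ("policy missing", policy_missing)]:
    ok = resource_available(discovered, "policy/v1beta1", "poddisruptionbudgets")
    print(name, "->", 'Starting "disruption"' if ok else 'Skipping "disruption"')
```

If the apiserver is not started with the policy group turned on, the lookup fails and the controller is skipped, which would match the "Skipping" warning seen in the log.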


0xmichalis · 2017-06-02T19:17:51Z

Dupe of #46871?

justinsb · 2017-06-02T19:33:30Z

Not sure if a dup. #46871 is a flake though - i.e. sometimes the controllers are skipped, sometimes they are not. My knee-jerk reaction was some sort of race between API registration and controller startup.

I'm not clear whether this is a flake or not? If it's a flake, I don't think it's about whether flags are enabled (unless we have deeper problems ;-) )

What is doubly weird is that when I look at the kube-controller-manager logs for the most recent failures in that cluster of failures, the kops jobs have the "skipping" warning, but the GCE ones do not.

e.g.

W0602 03:57:59.767202 5 controllermanager.go:449] Skipping "disruption"

Maybe there are 2 different things happening here.
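To make the race hypothesis concrete: if the controller manager checks for an API once, and registration finishes slightly later, the controller is skipped on some runs and not on others. The sketch below simulates that with a fake availability check. It is a toy model of the hunch in this comment, not a description of how kube-controller-manager behaves, and all names in it are hypothetical.

```python
import time


def wait_for_resource(is_available, timeout_s=2.0, interval_s=0.01):
    """Poll is_available() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_available():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def make_late_registration(ready_after_calls):
    """Fake availability check that starts succeeding on call number ready_after_calls."""
    calls = {"n": 0}

    def is_available():
        calls["n"] += 1
        return calls["n"] >= ready_after_calls

    return is_available


# One-shot check: the controller would be skipped if it asks too early.
print(make_late_registration(3)())                   # False
# Polling check: the same late registration is tolerated.
print(wait_for_resource(make_late_registration(3)))  # True
```

A one-shot check turns a timing difference into a skipped controller, while a bounded poll only gives up when the API never appears, which would be the missing-flag case rather than the race.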

mml · 2017-06-05T18:39:00Z

#46871 is closely related, and I'm working in this area. @justinsb, I think it makes most sense for you to either take both of these issues, or assign both to me. Your call. :-)


foxish · 2017-06-16T22:51:48Z

This has not failed in the last 3 days. The issue appears to have been at least partly #47135 and #46871. I think this can be closed now.

foxish · 2017-06-16T22:59:00Z

Closing, please reopen if these recur.

k8s-ci-robot assigned davidopp and timothysc May 12, 2017

k8s-ci-robot added the kind/flake, priority/important-soon, and sig/scheduling labels May 12, 2017

k8s-ci-robot added the sig/apps label May 12, 2017

deads2k mentioned this issue May 15, 2017

apiextesions-server integration tests: test multiple registration #45732

Merged

mml mentioned this issue May 18, 2017

Add version and flag info to apiserver and CM logs. #46047

Merged

mml assigned mml and unassigned timothysc and davidopp May 18, 2017

mml mentioned this issue May 18, 2017

Log more verbose when DisruptionController doesn't start. #46050

Merged

mml added a commit to mml/kubernetes that referenced this issue May 18, 2017

Log more verbose when DisruptionController doesn't start.

00e6de0

To help with debugging kubernetes#45706

mml mentioned this issue May 22, 2017

Log all flag values at startup. kubernetes/apiserver#11

Closed

Q-Lee mentioned this issue Jun 2, 2017

Enable PodSecurityPolicy in gce #46064

Closed

mrIncompetent pushed a commit to kubermatic/kubernetes that referenced this issue Jun 6, 2017

Log more verbose when DisruptionController doesn't start.

b5ff89f

To help with debugging kubernetes#45706

wongma7 mentioned this issue Jun 12, 2017

Don't provision for PVCs with AccessModes unsupported by plugin #47274

Merged

mml added a commit to mml/kubernetes that referenced this issue Jun 15, 2017

Add version and flag info to apiserver and CM logs.

f6bcac3

Should help debugging. Specifically for kubernetes#45706

k8s-github-robot pushed a commit that referenced this issue Jun 16, 2017

Merge pull request #46047 from mml/master-debug-logs

1ecb40c

Automatic merge from submit-queue Add version and flag info to apiserver and CM logs. Should help debugging. Specifically for #45706

foxish closed this as completed Jun 16, 2017
