Flaky: DisruptionController evictions: enough pods, replicaSet, percentage => should allow an eviction #45706

Closed
fejta opened this issue May 12, 2017 · 17 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@fejta
Contributor

fejta commented May 12, 2017

Failure clusters: https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&test=should%20allow%20an%20eviction

Most likely cluster causing flakiness:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/dns.go:363
Expected error:
    <*errors.errorString | 0xc420179330>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/dns.go:247

/kind flake
/priority important-soon

@kubernetes/sig-scheduling-test-failures
/assign @timothysc @davidopp

Sig leads, please /assign someone to investigate and /unassign yourself.
If this failure is more appropriate for another sig please /sig better-sig and /sig-remove scheduling

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels May 12, 2017
@fejta
Contributor Author

fejta commented May 12, 2017

Note that the test "DisruptionController should update PodDisruptionBudget status" also started failing at the same time.

@timothysc
Member

timothysc commented May 12, 2017

@fejta I'm digging backwards from the original gubernator link, and some of the data doesn't make sense to me. When you have a moment, could you ping me on IRC?

Expected error:
    <*json.UnsupportedTypeError | 0xc420794dd0>: {
        Type: {
            size: 0x8,
            ptrdata: 0x8,
            hash: 2634747826,
            tflag: 2,
            align: 8,
            fieldAlign: 8,
            kind: 51,
            alg: {hash: nil, equal: nil},
            gcdata: 1,
            str: 2454528,
            ptrToThis: 0,
        },
    }
    json: unsupported type: func(*tls.CertificateRequestInfo) (*tls.Certificate, error)
not to have occurred

This doesn't appear to come from the test at all. Am I missing something?

@fejta
Contributor Author

fejta commented May 12, 2017

Note that failures from PRs will be noisy, since they may also include people's unmerged, possibly bad changes.

@timothysc
Member

Looking at just the CI runs (excluding PRs), I'm seeing a lot of:

"May 11 02:13:53.334: All nodes should be ready after test, Not ready nodes: ", gke-bootstrap-e2e-default-pool-2e2dbc5c-dtrk"

which to me seems like an infra flake, but if you are seeing something that I'm not, feel free to point it out to me.

@fejta
Contributor Author

fejta commented May 12, 2017

Do you have a gubernator link? Are you focusing on the AWS jobs (put aws into the job textbox)?

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=aws&test=should%20allow%20an%20eviction

@timothysc
Member

Ahh I was not, thx!

@davidopp
Member

PDB is now owned by @kubernetes/sig-apps-misc

cc/ @mml

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label May 12, 2017
@davidopp
Member

(Which isn't to say that @timothysc shouldn't investigate -- I'm sure everyone will be very happy if he does :) )

@mml
Contributor

mml commented May 18, 2017

Just saw this. I will take a look.

@mml
Contributor

mml commented May 18, 2017

Well this is weird. I downloaded the controller manager log from a failed run.

% grep disruption.go ~/Downloads/kube-controller-manager.log
zsh: exit 1     grep disruption.go ~/Downloads/kube-controller-manager.log
% wc -l !$
wc -l ~/Downloads/kube-controller-manager.log
6931 /usr/local/google/home/mml/Downloads/kube-controller-manager.log

ah ha!

I0516 20:54:54.679359       5 controllermanager.go:437] Starting "disruption"
W0516 20:54:54.679370       5 controllermanager.go:444] Skipping "disruption"

But... why? Have we changed how kube-controller-manager is invoked on AWS?
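
The manual grep above can be sketched as a small Go helper that lists every controller the log reports as skipped. This is only an illustration: the function name `skippedControllers` is hypothetical, and the pattern assumes the warning keeps the `Skipping "<name>"` form seen in the two log lines quoted above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// skippedRe matches warnings of the form: Skipping "disruption"
// (assumed format, taken from the log lines quoted in this thread).
var skippedRe = regexp.MustCompile(`Skipping "([^"]+)"`)

// skippedControllers is a hypothetical helper that returns the names of
// all controllers reported as skipped in the given log text.
func skippedControllers(log string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if m := skippedRe.FindStringSubmatch(sc.Text()); m != nil {
			names = append(names, m[1])
		}
	}
	return names
}

func main() {
	log := `I0516 20:54:54.679359       5 controllermanager.go:437] Starting "disruption"
W0516 20:54:54.679370       5 controllermanager.go:444] Skipping "disruption"`
	fmt.Println(skippedControllers(log))
}
```

Matching on the "Skipping" warning rather than on a source file name such as disruption.go avoids the empty grep result shown above, since a skipped controller never logs from its own file.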

@mml
Contributor

mml commented May 18, 2017

It would be awesome to have a little prerequisite check on e2e tests that says "this test is meaningless unless controller X is running; check whether it is and fail immediately if not." To do that, we need to have the running controllers exported as Prometheus metrics, and a hook that lets us query Prometheus metrics from kube-controller-manager.
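
A minimal sketch of what such a prerequisite check might look like, assuming the set of running controllers has already been obtained somehow. The `running` map and the `requireControllers` helper are hypothetical; how the set would actually be populated (for example from exported metrics) is exactly the missing piece described above.

```go
package main

import "fmt"

// requireControllers is a hypothetical prerequisite check. It returns an
// error naming the first required controller that is absent from running,
// so a test can fail immediately instead of timing out later.
func requireControllers(running map[string]bool, required ...string) error {
	for _, name := range required {
		if !running[name] {
			return fmt.Errorf("prerequisite not met: controller %q is not running", name)
		}
	}
	return nil
}

func main() {
	// Assumed input: in a real setup this set would come from
	// kube-controller-manager, not from a literal.
	running := map[string]bool{"replicaset": true}
	if err := requireControllers(running, "disruption"); err != nil {
		fmt.Println(err)
	}
}
```

Failing fast this way would turn a generic "timed out waiting for the condition" into a message that names the missing controller.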

@mml
Contributor

mml commented May 18, 2017

DisruptionController refuses to start unless poddisruptionbudgets in policy/v1beta1 (exactly) is available. This is exactly the pattern we discourage here, but the root cause here is probably that we aren't passing the right flags to kube-apiserver to turn on the policy group.
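
The gating behavior described above can be illustrated with a simplified sketch. This is not the actual kube-controller-manager code: the `groupVersionResource` type and `shouldStartDisruption` function are stand-ins invented here to show the "start only if this exact resource is available" pattern.

```go
package main

import "fmt"

// groupVersionResource is a simplified stand-in for an API resource
// identifier (hypothetical type for this sketch).
type groupVersionResource struct {
	Group, Version, Resource string
}

// shouldStartDisruption reports whether the exact resource
// policy/v1beta1 poddisruptionbudgets is in the available set.
func shouldStartDisruption(available map[groupVersionResource]bool) bool {
	return available[groupVersionResource{
		Group:    "policy",
		Version:  "v1beta1",
		Resource: "poddisruptionbudgets",
	}]
}

func main() {
	// If the apiserver does not serve the policy group, the set is empty
	// and the controller is skipped.
	available := map[groupVersionResource]bool{}
	if !shouldStartDisruption(available) {
		fmt.Println(`Skipping "disruption"`)
	}
}
```

Because the match is exact, any other version of the resource would not satisfy the check in this sketch.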

@0xmichalis
Contributor

Dupe of #46871?

@justinsb
Member

justinsb commented Jun 2, 2017

Not sure if a dup. #46871 is a flake though - i.e. sometimes the controllers are skipped, sometimes they are not. My knee-jerk reaction was some sort of race between API registration and controller startup.

I'm not clear whether this is a flake or not. If it's a flake, I don't think it's about whether flags are enabled (unless we have deeper problems ;-) )

What is doubly weird is that when I look at the k-c-m logs for the most recent failures in that cluster of failures, the kops jobs have the "skipping" error, but the gce ones do not.

e.g.

W0602 03:57:59.767202 5 controllermanager.go:449] Skipping "disruption"

Maybe there are 2 different things happening here.

@mml
Contributor

mml commented Jun 5, 2017 via email

mrIncompetent pushed a commit to kubermatic/kubernetes that referenced this issue Jun 6, 2017
mml added a commit to mml/kubernetes that referenced this issue Jun 15, 2017
k8s-github-robot pushed a commit that referenced this issue Jun 16, 2017
Automatic merge from submit-queue

Add version and flag info to apiserver and CM logs.

Should help debugging.
Specifically for #45706
@foxish
Contributor

foxish commented Jun 16, 2017

This has not failed in the last 3 days. The issue appears to have been at least partly #47135 and #46871. I think this can be closed now.

@foxish
Contributor

foxish commented Jun 16, 2017

Closing, please reopen if these recur.

@foxish foxish closed this as completed Jun 16, 2017
8 participants