Ubernetes e2e flakes #26762
@madhusudancs I saw similar symptoms today, but it turns out that I'd reached the quota on my project. Do you have sufficient quota? (Specifically, I had the default quota of 24 cores, and the federation requires more than that.) I trivially increased my quota with an increase request, which got automatically approved within a few seconds.
@quinton-hoole I have a higher quota limit in the region where I am spinning up a federation. Btw, this is a single-cluster federation.
For the record, my three clusters came up fine, but then I got the following. Will debug some more in the morning:
PS: It looks like the images are not where they should be:
I'm pretty sure that this is operator error on my part. I ran this:
and after all three clusters came up cleanly, I got this...
So (stupidly), rather than tearing the whole lot down and starting over, I ran this:
and got this:
I'm busy tearing it all down and starting from scratch, with the correct env vars.
Please don't close this issue until we are reasonably confident that this isn't flaky. If it works once, that doesn't mean it is not a flake. Also, this is an umbrella issue.
@quinton-hoole Did you also run
My apologies. I had no intention of closing this issue. I must have pressed the wrong button :-(
And thanks - my federation is now up and healthy:

```
Waiting for federation-apiserver to be running...(phase= Pending)
kubectl get pods --namespace federation-e2e
```
And the e2e test for cluster registration passed!

```
[It] should allow creation of cluster api objects ...
Ran 1 of 312 Specs in 98.316 seconds
Ginkgo ran 1 suite in 1m38.613311042s
```
But my cluster controller is spewing errors, unable to list clusters. @nikhiljindal @mfanjie @madhusudancs @jianhuiz

```
E0603 16:11:08.140689 1 reflector.go:216] k8s.io/kubernetes/federation/pkg/federation-controller/cluster/clustercontroller.go:111: Failed to list *federation.Cluster: Get federation-apiserver:443?resourceVersion=0: unsupported protocol scheme "federation-apiserver"
```
Looks like we need to replace
I haven't tried it. @quinton-hoole can you try and see if that fixes the problem?
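The `unsupported protocol scheme` failure above is standard Go net/http behavior when the server URL lacks a scheme prefix, which suggests the fix is simply prepending `https://`. A minimal sketch (the endpoint path is copied from the log; everything else is illustrative):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Without a scheme, Go parses "federation-apiserver" as the URL scheme
	// itself (hyphens are legal in schemes), so the request fails before
	// any connection attempt with:
	//   unsupported protocol scheme "federation-apiserver"
	_, err := http.Get("federation-apiserver:443/apis/federation/v1alpha1/clusters")
	fmt.Println(err)

	// With the https:// prefix the URL is well-formed; the TLS trust
	// problems discussed later in this thread are a separate issue.
	_, err = http.Get("https://federation-apiserver:443/apis/federation/v1alpha1/clusters")
	fmt.Println(err)
}
```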
Random sidenote given that this is an e2e umbrella. A while back I noticed the Ubernetes Lite e2es weren't collecting logs from all nodes: https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-ubernetes-lite/4407/artifacts/
Thanks @bprashanth. Good catch. That should be pretty easy to fix.
Thanks @nikhiljindal, that improved things, but the cluster controller is still unable to connect to the API server (see below). I'm going to head into the office now; I'll continue debugging shortly:
The service exists, although those ports look weird:
Oh, never mind, the ports are fine. It was just my kubectl output that got mushed in my terminal window.
And occasionally, this error:

```
E0603 19:09:26.005217 1 reflector.go:216] k8s.io/kubernetes/federation/pkg/federation-controller/service/servicecontroller.go:232: Failed to list *federation.Cluster: Get https://federation-apiserver:443/apis/federation/v1alpha1/clusters?resourceVersion=0: x509: failed to load system roots and no roots provided
```
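That x509 error means the client ended up with no CA pool at all: it failed to load the system roots and was given none explicitly. A minimal sketch of supplying roots by hand, assuming a hypothetical CA bundle path:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Load a CA bundle explicitly; this path is a placeholder, not the
	// actual location used by the federation control plane.
	caPEM, err := os.ReadFile("/etc/federation/ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		panic("no certificates found in CA bundle")
	}

	// With RootCAs set, the client no longer depends on system roots, so
	// "failed to load system roots and no roots provided" goes away.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}
	resp, err := client.Get("https://federation-apiserver:443/apis/federation/v1alpha1/clusters")
	fmt.Println(resp, err)
}
```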
OK, the problem seems to be as simple as this (from the API server logs):

```
I0603 19:10:34.136146 1 logs.go:41] http: TLS handshake error from 10.240.0.13:61288: tls: client offered an unsupported, maximum protocol version of 301
```

(The 301 is hex 0x0301, i.e. the client is offering at most TLS 1.0.)
It seems that we need the correct certificates installed on the controller manager. In the meantime I tried setting up unsecured, non-HTTPS access to the API server, but that has other problems. Apparently the generic apiserver only listens on localhost for unsecured access:

```
I0603 20:35:02.441500 1 genericapiserver.go:690] Serving securely on 0.0.0.0:443
```

So, not surprisingly, the controller manager can't connect:

```
E0603 20:43:05.903765 1 reflector.go:216] k8s.io/kubernetes/federation/pkg/federation-controller/service/servicecontroller.go:231: Failed to list *api.Service: Get http://federation-apiserver:8080/api/v1/services?resourceVersion=0: dial tcp 10.0.114.72:8080: getsockopt: connection refused
```
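As an aside, a tiny sketch of why a localhost-only listener produces exactly that symptom (the ports are illustrative, not the apiserver's actual flags):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Reachable from anywhere that can route to this host.
	secure, err := net.Listen("tcp", "0.0.0.0:8443")
	fmt.Println(secure, err)

	// Reachable only via 127.0.0.1 on this host: a controller manager
	// running in another pod gets "connection refused" on this port.
	insecure, err := net.Listen("tcp", "127.0.0.1:8080")
	fmt.Println(insecure, err)
}
```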
Edited: I have a fix for the unsecured connection in #26694, but it's still under discussion.
Automatic merge from submit-queue

federation: Updating federation-controller-manager to use secret to get federation-apiserver's kubeconfig

Fixing the credentials problem: #26762 (comment). The admin will create a secret with the name "federation-apiserver-secret" in the k8s cluster hosting the federation control plane. This secret will contain the kubeconfig to access the federation-apiserver. federation-controller-manager will use this secret to contact the federation-apiserver. This flow is the same as the one used by all federation controllers to contact the k8s apiservers that are part of the federation.

cc @kubernetes/sig-cluster-federation @lavalamp @erictune @colhom
I'm making fairly good progress doing manual e2e testing, but seeing some curious log messages and behaviour. I'll note them down here in case anyone else is seeing similar errors. Of note below:
@nikhiljindal @madhusudancs @mml @mfanjie FYI ^^
Actually the failure to create the service seems unrelated to the failure to create the DNS record, as it recurs independently every time around the sync loop:
I'm having success with some DNS names but not others: three of the lookups resolve correctly, but one does not. Still digging into that, but I note that having all of these be a single test makes it really hard to debug what failed, or even to notice that three of the DNS entries passed.
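One way to make that debuggable would be to give each DNS name its own spec, so the Ginkgo report shows exactly which lookup failed. A sketch with hypothetical names and a placeholder helper (the real test's entries were elided above):

```go
package e2e

import (
	. "github.com/onsi/ginkgo"
)

// federatedDNSNames is illustrative; the actual names under test were
// elided in the comment above.
var federatedDNSNames = []string{
	"myservice.mynamespace",
	"myservice.mynamespace.svc.cluster.local",
	"myservice.mynamespace.myfederation.svc.cluster.local",
}

// resolveInPod is a placeholder for the suite's nslookup-in-a-pod helper.
func resolveInPod(name string) {
	// elided: exec nslookup inside a test pod and fail the spec on error
}

var _ = Describe("federated service DNS", func() {
	for _, name := range federatedDNSNames {
		name := name // capture the loop variable for the closure
		It("should resolve "+name, func() {
			// Each name now passes or fails as its own spec.
			resolveInPod(name)
		})
	}
})
```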
Sent #27504 to dump federation apiserver and controller manager logs on failure, which should help in debugging federation failures on Jenkins. I am right now looking at the federation-apiserver test. It's failing on Jenkins but works fine locally.
@quinton-hoole FYI, this log indicates the service controller is trying to create a new service on the k8s cluster federation-e2e-gce-us-central1-f, and the repeated logs might be caused by creation failures.
Automatic merge from submit-queue

fix nslookup invocation

The old way with 'sh -c' was not correct. For #26762
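For reference, the classic way an `sh -c` invocation goes wrong (illustrative command slices, not necessarily the exact bug fixed here): sh treats only the argument immediately after `-c` as the script, and any further arguments become positional parameters, so the lookup target is silently dropped.

```go
package main

import "fmt"

func main() {
	// With sh -c, only the argument immediately after -c is the script;
	// "myservice.mynamespace" becomes $0 and is never passed to nslookup,
	// so this runs a bare `nslookup`.
	wrong := []string{"sh", "-c", "nslookup", "myservice.mynamespace"}

	// Embedding the target in the -c script runs the intended command.
	right := []string{"sh", "-c", "nslookup myservice.mynamespace"}

	fmt.Println(wrong, right)
}
```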
Automatic merge from submit-queue

Clear ClusterIP in the local service before comparison. For #26762
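Presumably the problem is that the cluster allocates a ClusterIP after the service is created, so a naive deep-equality check between the desired and stored specs never matches. A minimal sketch with stand-in types (not the actual Kubernetes structs):

```go
package main

import (
	"fmt"
	"reflect"
)

// ServiceSpec is a minimal stand-in for the relevant fields of the real
// api.ServiceSpec; the controller compares full Kubernetes objects.
type ServiceSpec struct {
	ClusterIP string
	Ports     []int32
}

// specsEqual blanks the cluster-allocated ClusterIP on its copy of the
// stored spec before comparing, which is the idea behind the fix above.
func specsEqual(desired, stored ServiceSpec) bool {
	stored.ClusterIP = "" // ignore the IP the cluster assigned on creation
	return reflect.DeepEqual(desired, stored)
}

func main() {
	desired := ServiceSpec{Ports: []int32{80}}
	stored := ServiceSpec{ClusterIP: "10.0.114.72", Ports: []int32{80}}
	fmt.Println(specsEqual(desired, stored)) // true: only the allocated IP differed
}
```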
Automatic merge from submit-queue

Re-work the DNS part of the test for #26762

- Dropped the regex test and just test for nslookup exiting 0.
- Moved more setup into BeforeEach and used a nested Context for the non-local case.
- Poll inside the container using a bash loop.
- Aim for less console noise unless something goes wrong.
- Commented out the tests trying to verify that a DNS name is absent.
Automatic merge from submit-queue

federation: reverse the order of creating controller manager and secret since controller requires secret

Ref #26762. federation-controller-manager fails if the secret is not there when it comes up: https://github.com/kubernetes/kubernetes/blob/970104df3199eeb30710d1067da28f952ae36403/federation/cmd/federation-controller-manager/app/controllermanager.go#L82. Updating the bring-up scripts to first create the secrets and then create the deployments.

@kubernetes/sig-cluster-federation @mml
This looks like a bug. Namespace being dropped? I'll send in a fix shortly.
We should probably wrap the underlying
Automatic merge from submit-queue

federation: Updating KubeDNS to try finding a local service first for federation query

Ref #26762. Updating KubeDNS to try to find a local service first for a federation query. Without this change, KubeDNS always returns the DNS hostname, even if a local service exists. I have updated the code to first remove the federation name from the path if it exists, so that the default search for a local service happens. If we don't find a local service, then we try to find the DNS hostname. I will appreciate a thorough review since this is my first change to KubeDNS. #25727 was the original PR that added federation support to KubeDNS.

cc @kubernetes/sig-cluster-federation @quinton-hoole @madhusudancs @bprashanth @mml
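A sketch of the lookup order that description implies (illustrative names and zone, not the actual KubeDNS code): strip the federation label from the query, try the local service lookup first, and fall back to the federation-wide hostname only when no local service exists.

```go
package main

import (
	"fmt"
	"strings"
)

// localServices stands in for KubeDNS's normal service records.
var localServices = map[string]string{
	"myservice.mynamespace.svc.cluster.local": "10.0.0.42",
}

// resolveFederated handles a query such as
// myservice.mynamespace.myfederation.svc.cluster.local: it drops the
// federation label (the third label) and tries a local lookup first,
// falling back to the federation-wide hostname only if that fails.
func resolveFederated(query, federation string) string {
	labels := strings.Split(query, ".")
	if len(labels) > 2 && labels[2] == federation {
		local := strings.Join(append(labels[:2:2], labels[3:]...), ".")
		if ip, ok := localServices[local]; ok {
			return ip // a local service shadows the federation record
		}
	}
	// Fall back to the cross-cluster DNS hostname (illustrative zone).
	return "CNAME " + labels[0] + "." + labels[1] + "." + federation + ".example.com"
}

func main() {
	fmt.Println(resolveFederated("myservice.mynamespace.myfederation.svc.cluster.local", "myfederation"))
	// Prints 10.0.0.42 because a matching local service exists.
}
```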
I think these problems are all resolved now, and this issue has served its purpose. Closing.
I am opening this master/umbrella issue to track Ubernetes e2e test flakes. I think this issue is sufficient right now. But as we make progress we might have to fork this into sub-issues.
This issue is not just about the e2e tests themselves but also includes infrastructure related flakes such as bringing up the clusters, bringing up the federation control plane, etc.
Here is one flake that I am seeing from yesterday:
cc @colhom @nikhiljindal @kubernetes/sig-cluster-federation @jianhuiz @mfanjie