
Promote Services shell out of flaky. Demote Elasticsearch #10989

Merged: 1 commit merged into kubernetes:master from promote_demote on Jul 10, 2015

Conversation

@wojtek-t (Member) commented Jul 9, 2015

Elasticsearch is unfortunately still flaky.

The shell services test was fixed by #10978.

cc @davidopp @satnam6502 @quinton-hoole

@wojtek-t wojtek-t added this to the v1.0 milestone Jul 9, 2015
@k8s-bot commented Jul 9, 2015

GCE e2e build/test failed for commit 71bcfac64ce9329d1af1a4a0b264066840311840.

@wojtek-t (Member, Author) commented Jul 9, 2015

These are unrelated e2e test failures:

Kubernetes e2e suite.Kubectl client Update Demo should scale a replication controller
Kubernetes e2e suite.Probing container with readiness probe that fails should never be ready and never restart
Kubernetes e2e suite.Pods should contain environment variables for services
Kubernetes e2e suite.Pods should *not* be restarted with a docker exec "cat /tmp/health" liveness probe

I will look into them independently; for now I'm kicking the bot to rerun the tests.

@wojtek-t (Member, Author) commented Jul 9, 2015

@k8s-bot please test

@wojtek-t (Member, Author) commented Jul 9, 2015

@k8s-bot test this please

@k8s-bot commented Jul 9, 2015

GCE e2e build/test failed for commit 71bcfac64ce9329d1af1a4a0b264066840311840.

@satnam6502 (Contributor)

I don't think the Elasticsearch test should be demoted. It is doing its job. Demote the test that makes nodes unhealthy -- a situation which is then detected and flagged by Elasticsearch.

@satnam6502 (Contributor)

One of the failures I saw yesterday was due to a node that I think was made unhealthy by a previous test. The Elasticsearch test makes sure there are at least two healthy nodes before proceeding and declares failure if there are not. I reported this to @quinton-hoole, and to me it seems wrong to demote this test when its environment was probably sabotaged by another test.

The last failure I saw was due to something that I think is an ongoing problem with our system: every so often a request made via a service cannot be mapped to a pod to serve it, even when pods were detected and ready earlier in the test (as confirmed by the log below). I think this is exposing a bug -- a bug our customers will face in a similar situation. Rather than demoting the test, let us understand whether this is a genuine issue, and if so file a bug for it and fix it. If this is behaviour as expected (which in my opinion would be very undesirable for our customers), then let us document it clearly, state how such failures should be handled, and implement that process here.

Let's not demote this test because -- as always -- it's pointing out things we need to think about.

Cluster level logging using Elasticsearch 
  should check that logs from pods on all nodes are ingested into Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
[BeforeEach] Cluster level logging using Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/framework.go:47
STEP: Creating a kubernetes client
>>> testContext.KubeConfig: /var/lib/jenkins/jobs/kubernetes-e2e-gce/workspace/.kube/config
STEP: Building a namespace api object
INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-es-logging-ktu1s
INFO: Get service account default in ns e2e-tests-es-logging-ktu1s failed, ignoring for 5s: serviceaccounts "default" not found
INFO: Service account default in ns e2e-tests-es-logging-ktu1s with secrets found. (5.007973585s)
STEP: Waiting for a default service account to be provisioned in namespace
INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-es-logging-ktu1s
INFO: Service account default in ns e2e-tests-es-logging-ktu1s with secrets found. (2.750763ms)
[BeforeEach] Cluster level logging using Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:42
[It] should check that logs from pods on all nodes are ingested into Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
STEP: Checking the Elasticsearch service exists.
STEP: Checking to make sure the Elasticsearch pods are running
INFO: Waiting up to 5m0s for pod elasticsearch-logging-v1-95un8                          status to be running
INFO: Found pod 'elasticsearch-logging-v1-95un8' on node 'e2e-gce-minion-eu5s'
INFO: Waiting up to 5m0s for pod elasticsearch-logging-v1-mxzyv                          status to be running
INFO: Found pod 'elasticsearch-logging-v1-mxzyv' on node 'e2e-gce-minion-armt'
STEP: Checking to make sure we are talking to an Elasticsearch service.
INFO: Bad JSON: Error: 'dial tcp 10.245.0.6:9200: connection refused'
Trying to reach: 'http://10.245.0.6:9200/'
INFO: After 6.417267ms failed to convert Elasticsearch JSON response Error: 'dial tcp 10.245.0.6:9200: connection refused'
Trying to reach: 'http://10.245.0.6:9200/' to map[string]interface{}: failed to unmarshal Elasticsearch response: invalid character 'E' looking for beginning of value
STEP: Checking health of Elasticsearch service.
INFO: Bad JSON: Error: 'dial tcp 10.245.0.6:9200: connection refused'
Trying to reach: 'http://10.245.0.6:9200/_cluster/health?health=pretty'
[AfterEach] Cluster level logging using Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/framework.go:48
STEP: Collecting events from namespace "e2e-tests-es-logging-ktu1s".
INFO: Node e2e-gce-minion-armt condition 1/1: type: Ready, status: True
INFO: Node e2e-gce-minion-eu5s condition 1/1: type: Ready, status: True
STEP: Destroying namespace "e2e-tests-es-logging-ktu1s" for this suite.

• Failure [15.319 seconds]
Cluster level logging using Elasticsearch
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:47
  should check that logs from pods on all nodes are ingested into Elasticsearch [It]
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46

  Expected error:
      <*errors.errorString | 0xc208136e20>: {
          s: "failed to unmarshal Elasticsearch response: invalid character 'E' looking for beginning of value",
      }
      failed to unmarshal Elasticsearch response: invalid character 'E' looking for beginning of value
  not to have occurred

  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:163
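
The "invalid character 'E'" unmarshal failure above follows directly from the connection error: the proxy returned its error text rather than JSON, so decoding stops at the leading 'E' of "Error:". A minimal stand-alone reproduction in Go:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // The body handed to the JSON decoder is the proxy's error text,
        // not an Elasticsearch JSON document.
        body := []byte("Error: 'dial tcp 10.245.0.6:9200: connection refused'")
        var v map[string]interface{}
        err := json.Unmarshal(body, &v)
        fmt.Println(err) // invalid character 'E' looking for beginning of value
    }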

@wojtek-t (Member, Author) commented Jul 9, 2015

> The last failure I saw was due to something that I think is an ongoing problem with our system: every so often a request made via a service cannot be mapped to a pod to serve it, even when pods were detected and ready earlier in the test (as confirmed by the log below). I think this is exposing a bug -- a bug our customers will face in a similar situation. Rather than demoting the test, let us understand whether this is a genuine issue, and if so file a bug for it and fix it. If this is behaviour as expected (which in my opinion would be very undesirable for our customers), then let us document it clearly, state how such failures should be handled, and implement that process here.

Not sure if I understood correctly - so please correct me if I misunderstood something.
Assume that we have a few pods running and then create a service. If you send a request to the service immediately after creating it, the service's endpoints may not have been propagated to all kube-proxy instances yet. Thus, I think it is by design that immediately after creating a service (even if pods are already running), it is not yet accessible. If that's the case, we should fix the test.
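
A minimal sketch of what such a test-side fix might look like, using only the Go standard library; the service URL, grace period, and helper name here are hypothetical, not the actual e2e helper:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // waitForService polls a freshly created service until it answers,
    // allowing time for its endpoints to propagate to all kube-proxy instances.
    func waitForService(url string, graceTime time.Duration) error {
        var lastErr error
        for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
            resp, err := http.Get(url)
            if err != nil {
                lastErr = err // endpoints may not have propagated yet; retry
                continue
            }
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
            lastErr = fmt.Errorf("unexpected status: %d", resp.StatusCode)
        }
        return fmt.Errorf("service not reachable after %v: %v", graceTime, lastErr)
    }

    func main() {
        // Hypothetical address; the real test goes through the apiserver proxy.
        if err := waitForService("http://10.245.0.6:9200/", 2*time.Minute); err != nil {
            fmt.Println(err)
        }
    }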

@satnam6502 (Contributor)

As I said before, this connection error occurs after there have already been successful GET requests earlier in the test. Checking the cluster health via Elasticsearch is done after other successful requests.

    // Now assume we really are talking to an Elasticsearch instance.
    // Check the cluster health.
    By("Checking health of Elasticsearch service.")
    var body []byte
    for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
        body, err = f.Client.Get().
            Namespace(api.NamespaceSystem).
            Prefix("proxy").
            Resource("services").
            Name("elasticsearch-logging").
            Suffix("_cluster/health").
            Param("health", "pretty").
            DoRaw()
        if err == nil {
            break
        }
    }
    Expect(err).NotTo(HaveOccurred())

@satnam6502 (Contributor)

I did find a bad use of err which somehow got introduced and which masks a failure: #11001
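
For illustration only, a minimal sketch of the kind of err reuse that can mask an earlier failure in Go; the helper names are hypothetical, not the actual code fixed by #11001:

    package main

    import "fmt"

    func fetch() ([]byte, error) { return nil, fmt.Errorf("request failed") }

    func parse(b []byte) (string, error) { return "ok", nil }

    func main() {
        body, err := fetch()       // err is non-nil here
        result, err := parse(body) // err is reassigned to nil, masking the failure
        if err != nil {
            fmt.Println("failure detected:", err)
            return
        }
        fmt.Println("proceeding with:", result) // runs even though fetch failed
    }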

@satnam6502 (Contributor)

I will dig into this further to make sure I am not going mad.

@satnam6502 (Contributor)

@wojtek-t this is the first attempt to speak to the application endpoint. It happens after the pods are determined to be ready and is retried to give enough time for the application inside the pod to come to life. Surely this is enough? Or am I wrong?

    By("Checking to make sure the Elasticsearch pods are running")
    label := labels.SelectorFromSet(labels.Set(map[string]string{esKey: esValue}))
    pods, err := f.Client.Pods(api.NamespaceSystem).List(label, fields.Everything())
    Expect(err).NotTo(HaveOccurred())
    for _, pod := range pods.Items {
        err = waitForPodRunningInNamespace(f.Client, pod.Name, api.NamespaceSystem)
        Expect(err).NotTo(HaveOccurred())
    }

    By("Checking to make sure we are talking to an Elasticsearch service.")
    // Perform a few checks to make sure this looks like an Elasticsearch cluster.
    var statusCode float64
    var esResponse map[string]interface{}
    err = nil
    var body []byte
    for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
        // Query against the root URL for Elasticsearch.
        body, err = f.Client.Get().
            Namespace(api.NamespaceSystem).
            Prefix("proxy").
            Resource("services").
            Name("elasticsearch-logging").
            DoRaw()
        if err != nil {
            Logf("After %v proxy call to elasticsearch-loigging failed: %v", time.Since(start), err)
            continue
        }
        esResponse, err = bodyToJSON(body)
        if err != nil {
            Logf("After %v failed to convert Elasticsearch JSON response %v to map[string]interface{}: %v", time.Since(start), string(body), err)
            continue
        }
        statusIntf, ok := esResponse["status"]
        if !ok {
            Logf("After %v Elasticsearch response has no status field: %v", time.Since(start), esResponse)
            continue
        }
        statusCode, ok = statusIntf.(float64)
        if !ok {
            // Assume this is a string returning Failure. Retry.
            Logf("After %v expected status to be a float64 but got %v of type %T", time.Since(start), statusIntf, statusIntf)
            continue
        }
        break
    }
    Expect(err).NotTo(HaveOccurred())
    if int(statusCode) != 200 {
        Failf("Elasticsearch cluster has a bad status: %v", statusCode)
    }

@wojtek-t (Member, Author)

OK - I changed the PR so that it doesn't demote the Elasticsearch test - we should investigate it independently (on the other hand, I haven't seen the failure since #11001 was merged).

PTAL

@davidopp (Member)

LGTM

@davidopp added the ok-to-merge and lgtm labels Jul 10, 2015
@satnam6502 (Contributor)

Thank you @wojtek-t -- I am still trying to get to the bottom of this -- although my vacation to Europe in the next two weeks might get in the way.

@k8s-bot commented Jul 10, 2015

GCE e2e build/test passed for commit dc711ee.

wojtek-t added a commit that referenced this pull request Jul 10, 2015: Promote Services shell out of flaky. Demote Elasticsearch
@wojtek-t wojtek-t merged commit ccc4cfc into kubernetes:master Jul 10, 2015
@ghost commented Jul 10, 2015

Hi @wojtek-t

This PR seems to have completely broken e2e-gce-parallel. Could you add "Services.*shell" to GCE_PARALLEL_FLAKY_TEST_REGEX until it's parallel-friendly?

@wojtek-t wojtek-t deleted the promote_demote branch July 16, 2015 09:51