
Promote Services shell out of flaky. Demote Elasticsearch #10989

Merged: 1 commit merged into kubernetes:master from promote_demote on Jul 10, 2015

Conversation

@wojtek-t (Member) commented Jul 9, 2015

Elasticsearch is unfortunately still flaky.

The shell services test was fixed by #10978.

cc @davidopp @satnam6502 @quinton-hoole

@wojtek-t wojtek-t added this to the v1.0 milestone Jul 9, 2015
@k8s-bot commented Jul 9, 2015

GCE e2e build/test failed for commit 71bcfac64ce9329d1af1a4a0b264066840311840.

@wojtek-t (Member, Author) commented Jul 9, 2015

These are unrelated e2e test failures:

Kubernetes e2e suite.Kubectl client Update Demo should scale a replication controller
Kubernetes e2e suite.Probing container with readiness probe that fails should never be ready and never restart
Kubernetes e2e suite.Pods should contain environment variables for services
Kubernetes e2e suite.Pods should *not* be restarted with a docker exec "cat /tmp/health" liveness probe

I will look into them independently; for now I'm kicking the bot to rerun the tests.

@wojtek-t (Member, Author) commented Jul 9, 2015

@k8s-bot please test

@wojtek-t (Member, Author) commented Jul 9, 2015

@k8s-bot test this please

@k8s-bot commented Jul 9, 2015

GCE e2e build/test failed for commit 71bcfac64ce9329d1af1a4a0b264066840311840.

@satnam6502 (Contributor)

I don't think the Elasticsearch test should be demoted. It is doing its job. Demote the test that makes nodes unhealthy -- a situation which is then detected and flagged by Elasticsearch.

@satnam6502 (Contributor)

One of the failures I saw yesterday was due to a node that I think was made unhealthy by a previous test. The Elasticsearch test makes sure there are at least two healthy nodes before proceeding and declares failure if there are not. I reported this to @quinton-hoole, and to me it seems wrong to demote this test when its environment was probably sabotaged by another test.

The last failure I saw was due to something that I think is an ongoing problem with our system: every so often a request made via a service cannot be mapped to a pod to serve it, even when pods were detected and ready earlier in the test (as confirmed by the log below). I think this is exposing a bug -- a bug our customers will face in a similar situation. Rather than demoting the test, let us understand whether this is a genuine issue, and if so file a bug for it and fix it. If this is behaviour as expected (which in my opinion would be very undesirable for our customers), then let us document it clearly, state how such failures should be handled, and implement that process here.

Let's not demote this test because -- as always -- it's pointing out things we need to think about.

Cluster level logging using Elasticsearch 
  should check that logs from pods on all nodes are ingested into Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
[BeforeEach] Cluster level logging using Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/framework.go:47
STEP: Creating a kubernetes client
>>> testContext.KubeConfig: /var/lib/jenkins/jobs/kubernetes-e2e-gce/workspace/.kube/config
STEP: Building a namespace api object
INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-es-logging-ktu1s
INFO: Get service account default in ns e2e-tests-es-logging-ktu1s failed, ignoring for 5s: serviceaccounts "default" not found
INFO: Service account default in ns e2e-tests-es-logging-ktu1s with secrets found. (5.007973585s)
STEP: Waiting for a default service account to be provisioned in namespace
INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-es-logging-ktu1s
INFO: Service account default in ns e2e-tests-es-logging-ktu1s with secrets found. (2.750763ms)
[BeforeEach] Cluster level logging using Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:42
[It] should check that logs from pods on all nodes are ingested into Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
STEP: Checking the Elasticsearch service exists.
STEP: Checking to make sure the Elasticsearch pods are running
INFO: Waiting up to 5m0s for pod elasticsearch-logging-v1-95un8                          status to be running
INFO: Found pod 'elasticsearch-logging-v1-95un8' on node 'e2e-gce-minion-eu5s'
INFO: Waiting up to 5m0s for pod elasticsearch-logging-v1-mxzyv                          status to be running
INFO: Found pod 'elasticsearch-logging-v1-mxzyv' on node 'e2e-gce-minion-armt'
STEP: Checking to make sure we are talking to an Elasticsearch service.
INFO: Bad JSON: Error: 'dial tcp 10.245.0.6:9200: connection refused'
Trying to reach: 'http://10.245.0.6:9200/'
INFO: After 6.417267ms failed to convert Elasticsearch JSON response Error: 'dial tcp 10.245.0.6:9200: connection refused'
Trying to reach: 'http://10.245.0.6:9200/' to map[string]interface{}: failed to unmarshal Elasticsearch response: invalid character 'E' looking for beginning of value
STEP: Checking health of Elasticsearch service.
INFO: Bad JSON: Error: 'dial tcp 10.245.0.6:9200: connection refused'
Trying to reach: 'http://10.245.0.6:9200/_cluster/health?health=pretty'
[AfterEach] Cluster level logging using Elasticsearch
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/framework.go:48
STEP: Collecting events from namespace "e2e-tests-es-logging-ktu1s".
INFO: Node e2e-gce-minion-armt condition 1/1: type: Ready, status: True
INFO: Node e2e-gce-minion-eu5s condition 1/1: type: Ready, status: True
STEP: Destroying namespace "e2e-tests-es-logging-ktu1s" for this suite.

• Failure [15.319 seconds]
Cluster level logging using Elasticsearch
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:47
  should check that logs from pods on all nodes are ingested into Elasticsearch [It]
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46

  Expected error:
      <*errors.errorString | 0xc208136e20>: {
          s: "failed to unmarshal Elasticsearch response: invalid character 'E' looking for beginning of value",
      }
      failed to unmarshal Elasticsearch response: invalid character 'E' looking for beginning of value
  not to have occurred

  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:163
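
The "invalid character 'E'" unmarshal failure above follows directly from the connection error: the proxy returned its error text rather than JSON, so decoding stops at the leading 'E' of "Error:". A minimal stand-alone reproduction in Go:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // The body handed to the JSON decoder is the proxy's error text,
        // not an Elasticsearch JSON document.
        body := []byte("Error: 'dial tcp 10.245.0.6:9200: connection refused'")
        var v map[string]interface{}
        err := json.Unmarshal(body, &v)
        fmt.Println(err) // invalid character 'E' looking for beginning of value
    }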

@wojtek-t (Member, Author) commented Jul 9, 2015

> The last failure I saw was due to something that I think is an ongoing problem with our system: every so often a request made via a service cannot be mapped to a pod to serve it, even when pods were detected and ready earlier in the test (as confirmed by the log below). I think this is exposing a bug -- a bug our customers will face in a similar situation. Rather than demoting the test, let us understand whether this is a genuine issue, and if so file a bug for it and fix it. If this is behaviour as expected (which in my opinion would be very undesirable for our customers), then let us document it clearly, state how such failures should be handled, and implement that process here.

Not sure if I understood correctly - so please correct me if I misunderstood something.
Assume that we have a few pods running and then create a service. If you send a request to the service immediately after creating it, the service's endpoints may not have been propagated to all kube-proxy instances yet. Thus, I think it is by design that immediately after creating a service (even if pods are already running), it is not yet accessible. If that's the case, we should fix the test.
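
A minimal sketch of what such a test-side fix might look like, using only the Go standard library; the service URL, grace period, and helper name here are hypothetical, not the actual e2e helper:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // waitForService polls a freshly created service until it answers,
    // allowing time for its endpoints to propagate to all kube-proxy instances.
    func waitForService(url string, graceTime time.Duration) error {
        var lastErr error
        for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
            resp, err := http.Get(url)
            if err != nil {
                lastErr = err // endpoints may not have propagated yet; retry
                continue
            }
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
            lastErr = fmt.Errorf("unexpected status: %d", resp.StatusCode)
        }
        return fmt.Errorf("service not reachable after %v: %v", graceTime, lastErr)
    }

    func main() {
        // Hypothetical address; the real test goes through the apiserver proxy.
        if err := waitForService("http://10.245.0.6:9200/", 2*time.Minute); err != nil {
            fmt.Println(err)
        }
    }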

@satnam6502 (Contributor)

As I said before, this connection error occurs after there have already been successful GET requests earlier in the test. Checking the cluster health via Elasticsearch is done after other successful requests.

    // Now assume we really are talking to an Elasticsearch instance.
    // Check the cluster health.
    By("Checking health of Elasticsearch service.")
    var body []byte
    for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
        body, err = f.Client.Get().
            Namespace(api.NamespaceSystem).
            Prefix("proxy").
            Resource("services").
            Name("elasticsearch-logging").
            Suffix("_cluster/health").
            Param("health", "pretty").
            DoRaw()
        if err == nil {
            break
        }
    }
    Expect(err).NotTo(HaveOccurred())

@satnam6502 (Contributor)

I did find a bad use of err which somehow got introduced and which masks a failure: #11001
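
For illustration only, a minimal sketch of the kind of err reuse that can mask an earlier failure in Go; the helper names are hypothetical, not the actual code fixed by #11001:

    package main

    import "fmt"

    func fetch() ([]byte, error) { return nil, fmt.Errorf("request failed") }

    func parse(b []byte) (string, error) { return "ok", nil }

    func main() {
        body, err := fetch()       // err is non-nil here
        result, err := parse(body) // err is reassigned to nil, masking the failure
        if err != nil {
            fmt.Println("failure detected:", err)
            return
        }
        fmt.Println("proceeding with:", result) // runs even though fetch failed
    }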

@satnam6502 (Contributor)

I will dig into this further to make sure I am not going mad.

@satnam6502 (Contributor)

@wojtek-t this is the first attempt to speak to the application endpoint. It happens after the pods are determined to be ready and is retried to give enough time for the application inside the pod to come to life. Surely this is enough? Or am I wrong?

    By("Checking to make sure the Elasticsearch pods are running")
    label := labels.SelectorFromSet(labels.Set(map[string]string{esKey: esValue}))
    pods, err := f.Client.Pods(api.NamespaceSystem).List(label, fields.Everything())
    Expect(err).NotTo(HaveOccurred())
    for _, pod := range pods.Items {
        err = waitForPodRunningInNamespace(f.Client, pod.Name, api.NamespaceSystem)
        Expect(err).NotTo(HaveOccurred())
    }

    By("Checking to make sure we are talking to an Elasticsearch service.")
    // Perform a few checks to make sure this looks like an Elasticsearch cluster.
    var statusCode float64
    var esResponse map[string]interface{}
    err = nil
    var body []byte
    for start := time.Now(); time.Since(start) < graceTime; time.Sleep(5 * time.Second) {
        // Query against the root URL for Elasticsearch.
        body, err = f.Client.Get().
            Namespace(api.NamespaceSystem).
            Prefix("proxy").
            Resource("services").
            Name("elasticsearch-logging").
            DoRaw()
        if err != nil {
            Logf("After %v proxy call to elasticsearch-loigging failed: %v", time.Since(start), err)
            continue
        }
        esResponse, err = bodyToJSON(body)
        if err != nil {
            Logf("After %v failed to convert Elasticsearch JSON response %v to map[string]interface{}: %v", time.Since(start), string(body), err)
            continue
        }
        statusIntf, ok := esResponse["status"]
        if !ok {
            Logf("After %v Elasticsearch response has no status field: %v", time.Since(start), esResponse)
            continue
        }
        statusCode, ok = statusIntf.(float64)
        if !ok {
            // Assume this is a string returning Failure. Retry.
            Logf("After %v expected status to be a float64 but got %v of type %T", time.Since(start), statusIntf, statusIntf)
            continue
        }
        break
    }
    Expect(err).NotTo(HaveOccurred())
    if int(statusCode) != 200 {
        Failf("Elasticsearch cluster has a bad status: %v", statusCode)
    }

@wojtek-t (Member, Author)

OK - I changed the PR so that it doesn't demote the Elasticsearch test - we should investigate it independently (on the other hand, I haven't seen the failure since #11001 was merged).

PTAL

@davidopp (Member)

LGTM

@davidopp added the ok-to-merge and lgtm labels Jul 10, 2015
@satnam6502 (Contributor)

Thank you @wojtek-t -- I am still trying to get to the bottom of this -- although my vacation to Europe in the next two weeks might get in the way.

@k8s-bot commented Jul 10, 2015

GCE e2e build/test passed for commit dc711ee.

wojtek-t added a commit that referenced this pull request Jul 10, 2015: Promote Services shell out of flaky. Demote Elasticsearch
@wojtek-t wojtek-t merged commit ccc4cfc into kubernetes:master Jul 10, 2015
@ghost commented Jul 10, 2015

Hi @wojtek-t

This PR seems to have completely broken e2e-gce-parallel. Could you add "Services.*shell" to GCE_PARALLEL_FLAKY_TEST_REGEX until it's parallel-friendly?

@wojtek-t wojtek-t deleted the promote_demote branch July 16, 2015 09:51