
Doesn't detect failed replication controller/deployment configuration #33

Closed
livelace opened this issue Apr 7, 2016 · 80 comments

@livelace

livelace commented Apr 7, 2016

  1. Create an example DC with:

https://paste.fedoraproject.org/350947/60028840/

  2. Deploy the DC, oc get -o yaml rc:

https://paste.fedoraproject.org/350950/60028895/

  3. Status of the pod, oc get -o yaml pod:

https://paste.fedoraproject.org/350952/28958146/

==>

As a result, we think that the RC is running, but the pod inside the RC is not running.
How can we detect that the DC is not running and avoid starting the next build steps?

@livelace
Author

livelace commented Apr 7, 2016

In other words, why don't we subscribe to pod status events, or wait for pod creation to complete? Can we check the pod status through "Verify OpenShift Deployment"?

@gabemontero gabemontero self-assigned this Apr 7, 2016
@gabemontero
Contributor

@livelace the "Verify OpenShift Deployment" step currently stops after seeing the RC go to Complete, but after seeing you scenario, I realize it could do better.

I'll start looking into including a monitor of the deploy pod status into that step's logic (perhaps the other deploy related steps as well - we'll review).

@bparees - FYI

@bparees
Contributor

bparees commented Apr 7, 2016

@livelace perhaps you could use the http check step to confirm the pod is running? or a readiness check in your DC that confirms the pod came up (which will block the deployment completion).
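
For context, the kind of readiness check being suggested is a probe on the DC's container, roughly like the sketch below (container name, image, path, and port are placeholders, not anything from this issue); as the following comments note, this particular variant only helps if the service speaks HTTP:

containers:
  - name: "myservice"                # placeholder container name
    image: "myservice:latest"        # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz               # placeholder health endpoint
        port: 8080
      initialDelaySeconds: 10

Until the probe succeeds, the pod is not reported as ready, which is what blocks the deployment from being considered complete.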

@livelace
Author

livelace commented Apr 7, 2016

@bparees My service is not HTTP-capable; I had thought about this.

My case:

  1. First build step - start service1.
  2. Second build step - start service2.
  3. I want to start a third build step, which depends on 1 and 2. I run into problems:

a) I don't know whether service1 and service2 are up and running with all hooks completed, so I can't fail the Jenkins job; as far as Jenkins can tell, everything is all right.

b) I can't scale the deployments down to zero at the proper time, because I don't know whether all the tasks inside the pods have completed.

I can't properly manage the tasks, because I don't know their states.

@gabemontero
Contributor

Not to overly distract from this thread, but I should have deployer pod state verification working either later today or tomorrow.


@livelace
Author

livelace commented Apr 7, 2016

@gabemontero That would be great!

@bparees
Contributor

bparees commented Apr 7, 2016

@gabemontero deployer pod state, or just pod state?

@gabemontero
Contributor

@bparees I'll look for both to a degree. Testing shows the deployer pod is pruned minimally if successful. So I'll first see if we have a deployer pod in a non-complete state. If a deployer pod no longer exists, I'll confirm that a running pod exists for the correct generation of the deployment.


@bparees
Contributor

bparees commented Apr 7, 2016

The replication controller (deployment) ought to reflect the state of the deployer pod, so I don't see the value in looking at the deployer pod.

@gabemontero
Contributor

I have not seen that yet, at least in what I was previously examining from the output provided and my reproduction with the evil postStart hook, but I'll double-check when I get back to the office. The deployment phase still said Complete.


@gabemontero
Contributor

Yep, at least with the latest level from upstream origin, @bparees is correct wrt the RC being sufficient. Adding the same lifecycle: postStart sabotage, the RC ends up in the Failed state per the deployment.phase annotation on the RC. I think my earlier repro did not go far enough or something. I could have sworn I saw it go to Complete, but I now consistently see it go to Failed after several runs.

So we are at one of two spots, @livelace:

  1. you could try adding a "Verify OpenShift Deployment" step and hopefully you will see the same results
  2. if your output at https://paste.fedoraproject.org/350950/60028895/ was in fact captured after the Pod failed, then I suspect your version of OpenShift is far enough behind the latest that you are seeing a difference in deployment behavior (certainly that component has evolved some this last release cycle). If that is the case, it may simply be a matter of when you can upgrade.

@livelace
Author

livelace commented Apr 8, 2016

Not working:

  1. [root@openshift-master1 ~]# oc version
    oc v1.1.6
    kubernetes v1.2.0-36-g4a3f9c5
  2. Jenkins console output (verbose mode); the job includes the verification step and completed without any errors:

https://paste.fedoraproject.org/351461/91294146/

  3. RC status:

https://paste.fedoraproject.org/351462/46009139/

[root@openshift-master1 ~]# oc get rc
NAME                                        DESIRED   CURRENT   AGE
testing-11.0-drweb-netcheck-nossl-peer1-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer1-2   1         1         16h
testing-11.0-drweb-netcheck-nossl-peer2-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer2-2   0         0         16h
testing-11.0-drweb-netcheck-nossl-peer3-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer3-2   0         0         16h

  4. Pod status:

https://paste.fedoraproject.org/351463/46009150/
http://prntscr.com/apk3ey

@livelace
Author

livelace commented Apr 8, 2016

NAME                                              READY     STATUS             RESTARTS   AGE
testing-11.0-drweb-netcheck-nossl-peer1-2-6zkg7   0/1       CrashLoopBackOff   14         1h

@livelace
Author

livelace commented Apr 8, 2016

"Verify whether the pods are up" in settings will be enough :)

@gabemontero
Contributor

@livelace I'll see if I can pull a v1.1.6 version of OpenShift and reproduce what you are seeing, but at the moment it appears that we are falling into category 2) from my earlier comment. If that does prove to be true, then rather than adding the new step, we'll want you to try the existing step against v1.2.0 when it becomes available (that is the "latest version" I was testing against).

@gabemontero
Contributor

@livelace - one additional request while I try to reproduce at a lower level of code: when you reproduce, is the equivalent of the testing-11.0-drweb-netcheck-nossl-peer1-2-deploy pod from your last repro staying around long enough for you to dump its contents to JSON/YAML? If so, can you provide that as well (assuming you'll need to reproduce again to do so)?

thanks

@gabemontero
Contributor

OK, I went to the same level as @livelace and could not reproduce. One additional question did occur to me ... do you create a successful deployment, then scale it down, edit the DC to introduce

lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - exit 1

and then scale to 1 and verify the deployment?

@livelace
Author

livelace commented Apr 8, 2016

@gabemontero Hello.

No, the DC has the hook from the beginning.

@livelace
Author

livelace commented Apr 8, 2016

After creating "DC" has zero count.

@livelace
Author

livelace commented Apr 8, 2016

Creation progress - https://paste.fedoraproject.org/351916/14601346/

@livelace
Author

livelace commented Apr 8, 2016

After the error occurs, I can scale the DC down and repeat everything again.

@livelace
Author

livelace commented Apr 8, 2016

I can modify the script that runs inside the hook to exit 0, and everything is fine with the DC (without any modification of the configuration).

I can also modify the script (to exit 0) during an attempt to set up the DC, and the DC will work fine.

PS. This is possible because I use a dedicated script that contains "exit 1".

@gabemontero
Contributor

Hey @livelace - not sure what you mean by "creation progress". I just see another Pod yaml for a Pod created by a replication controller.



@livelace
Author

livelace commented Apr 8, 2016

"Creation progress" - scale DC to 1.

@gabemontero
Contributor

Thanks for the additional details. I have a couple of thoughts on reworking my repro attempts. I'll report back when I have something tangible.


@livelace
Author

livelace commented Apr 8, 2016

I think I can grant access to my test environment within an hour.

@livelace
Author

livelace commented Apr 8, 2016

@gabemontero Can you connect to my environment over SSH?

@gabemontero
Contributor

OK, I've reproduced it. I did:

  1. before starting a deployment, added your lifecycle/postStart hook with exit 0
  2. deployed, then scaled back down to 0
  3. edited the DC, changing the lifecycle/postStart hook to exit 1
  4. scaled to 1 ... the Pod fails, but the next generation of the RC says it completed successfully.

Note, if I start with the lifecycle/postStart hook exiting with 1 and initial replicas of 1, then the RC is marked as failed. This is basically what my recent repro attempts did. And now that I understand what is going on, I'm pretty positive that my very first repro attempt, where I saw the RC in the Complete state, was when I edited a previously used DC to add the lifecycle/postStart exit 1 check. So good for me that I was not imagining things originally :-).

Now, what to do about this. It is not a given we want to address this with a new plugin step.

  1. this could be a deployment bug that needs to get addressed, with the RC reflecting the state of the pod
  2. the nuance of updating a DC which has been deployed one way, scaled down, edited, and redeployed could be "against current design" or some such.
  3. certainly the lifecycle/postStart-induced failure is merely a means of producing an unexpected container startup failure, but there may be nuances to using it to tank the container, where a container dying on startup "naturally" would have different characteristics.

@bparees: thoughts? ... and I thought about tagging our friends in platform mgmt now, but decided on getting a sanity check from you before officially pulling that trigger.

@0xmichalis

All the links are dead, unfortunately. Why is a readiness probe not working for your case, @livelace? There are different kinds of probes: if you cannot use HTTP, you can run shell commands (exec) or try to open a TCP connection.
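
For reference, a minimal sketch of the two non-HTTP probe variants mentioned here; the container names, images, port, and check script are placeholders rather than anything from the original report:

containers:
  - name: "service1"                 # placeholder container name
    image: "service1:latest"
    readinessProbe:
      exec:
        command: ["/bin/sh", "-c", "/opt/check-service1-ready.sh"]   # hypothetical readiness script
      initialDelaySeconds: 10
      timeoutSeconds: 5
  - name: "service2"                 # placeholder container name
    image: "service2:latest"
    readinessProbe:
      tcpSocket:
        port: 5432                   # placeholder port
      initialDelaySeconds: 10
      timeoutSeconds: 5

A pod is only reported as ready once its probe succeeds, which is what lets the deployment process notice a pod that never comes up.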

@livelace
Author

@Kargakis No, readiness and liveness probes don't work in this case. I repeat @bparees's statement:

"the fundamental issue in my mind is that the replication controller is reporting an active count of 1 despite the fact that the only pod that exists is in a FAILED state."

@0xmichalis

Actually, we want lifecycle hooks to be processed in the context of the deployment status. If a hook ends with an error, the deployment should be considered a failure. The pod's main process may be working well and passing all its liveness/readiness checks, but if the hook returned an error the deployment status should be "failed", because the hook is an important part of the pod and its readiness.

Sorry for jumping from one thread to the other, but upstream deployments have no hooks yet. If your problem is that you want a failed hook to fail your deployment, you can specify FailurePolicy == Abort

@0xmichalis

@livelace any news here? Did hooks work for you?

@livelace
Author

@Kargakis I'm sad :(

apiVersion: v1
kind: List
items:

- apiVersion: "v1"
  kind: "DeploymentConfig"
  metadata:
    name: "test"
  spec:
    template:
      metadata:
        labels:
          name: "test"
      spec:
        containers:
          - 
            name: "nginx"
            image: "nginx:latest"

            lifecycle:
              postStart:
                exec:
                  command: [ "exit", "1" ]
    replicas: 1
    selector:
      name: "test"

    strategy:
      type: "Rolling"
      rollingParams:
        pre:
          failurePolicy: "Abort"
          execNewPod:
            containerName: "nginx"
            command: [ "true" ]

[root@openshift-master1 ~]# oc get pods
NAME              READY     STATUS             RESTARTS   AGE
test-1-deploy     1/1       Running            0          8m
test-1-hook-pre   0/1       Completed          0          5m
test-1-oo71c      0/1       CrashLoopBackOff   3          6m

[root@openshift-master1 ~]# oc get dc
NAME      REVISION   REPLICAS   TRIGGERED BY
test      1          1          config

@livelace
Author

livelace commented May 27, 2016

If the pod inside is broken, we should mark the DC as failed and the replica count should be 0, because an existing replica is not the same thing as a working replica.

@0xmichalis

@livelace the timeout for deployments is 10m. If you had waited 2 more minutes, you would have seen the deployment marked as failed and eventually scaled down to zero :)

[vagrant@localhost sample-app]$ oc get pods
NAME              READY     STATUS             RESTARTS   AGE
test-1-deploy     1/1       Running            0          10m
test-1-hook-pre   0/1       Completed          0          10m
test-1-utjds      0/1       CrashLoopBackOff   6          9m
[vagrant@localhost sample-app]$ oc get pods
NAME              READY     STATUS      RESTARTS   AGE
test-1-deploy     0/1       Error       0          10m
test-1-hook-pre   0/1       Completed   0          10m
[vagrant@localhost sample-app]$ oc status
In project test on server https://10.0.2.15:8443

dc/test deploys docker.io/library/nginx:latest 
  deployment #1 failed 10 minutes ago

1 warning identified, use 'oc status -v' to see details.

Note that your use of the deployment hook didn't do anything useful; the hook itself completed fine. I think in your case you would want a post hook with the Abort policy and a script that makes sure your application pod is up and running. Readiness probes can also help.
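
For illustration, a rough sketch of the kind of post hook being suggested, added to the strategy of the DC posted earlier; the wait script is hypothetical and not something from this thread:

strategy:
  type: "Rolling"
  rollingParams:
    post:
      failurePolicy: "Abort"
      execNewPod:
        containerName: "nginx"
        # Hypothetical script: poll until the application pod reports ready, exit non-zero otherwise.
        command: ["/bin/sh", "-c", "/opt/wait-for-app-ready.sh"]

If that script exits non-zero, the Abort policy treats the deployment as failed instead of letting it complete.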

Regarding the timeout, we will make it configurable eventually.

@livelace
Author

@Kargakis

The deployment hook is useful when we need to be sure that a new version of the software works as expected, and to go back to a working deployment if it fails. But we have a different situation. Our deployment should be launched, and if the container hook (which does a bunch of things inside; it is dynamic configuration) returns an error, the deployment should be marked as failed.

The deployment hook doesn't work for us, because it only runs during a deployment. We need to handle the case where a container hook causes an error during scaling. We need a failure policy for container hooks.

@0xmichalis

@smarterclayton @ironcladlou, @livelace wants container hooks to be taken into account for deployments. Thoughts?

@0xmichalis

@livelace actually can you try to run a deployment with a container postStart hook that fails and see if it works for you after 10 minutes?

@livelace
Author

@Kargakis

dc/test deploys docker.io/library/nginx:latest 
  deployment #1 failed 11 minutes ago
Every 1,0s: oc get pods | grep ^test-        Tue May 31 20:16:58 2016

test-1-deploy                                      0/1       Error       0          13m
[root@openshift-master1 ~]# oc get rc
NAME      DESIRED   CURRENT   AGE
test-1    0         0         17m

@bparees @gabemontero Can we detect and wait for this behavior during "Scale Deployment" in Jenkins?

@livelace
Author

"Verify whether the specified number of replicas are up" already exist.

@bparees
Contributor

bparees commented Jun 1, 2016

@livelace detect and wait for what exactly? I still haven't seen a satisfactory answer for why the replication controller is reporting N current pods when those pods are in a failed state.

@livelace
Author

livelace commented Jun 1, 2016

@bparees @Kargakis OK. I think there are at least two options:

  1. Report the replica count at an early stage (immediately after a container hook returns an error).
  2. When we do "Scale Deployment", wait and check that the deployment is not in a failed state.
  3. ?

@0xmichalis

I still haven't seen a satisfactory answer for why the replication controller is reporting N current pods when those pods are in a failed state.

Because that's not the job of a replication controller but of a deployment. The RC/RS will always report what it has created, but it cannot know whether those pods are running.

Report the replica count at an early stage (immediately after a container hook returns an error).
When we do "Scale Deployment", wait and check that the deployment is not in a failed state.

You shouldn't deploy zero replicas and scale after the fact. That's why we use deployments in the first place, otherwise we would still use replication controllers. Deployments ensure that your pods are able to run. Replication controllers cannot do that by design.

@livelace
Author

livelace commented Jun 1, 2016

Deployments ensure that your pods are able to run. Replication controllers cannot do that by design.

Our deployments can always run and they work without any problem. The pods work fine. But we use container hooks, which launch integration tests with other services in different pods.

Creating deployment configurations for every possible combination of our software is impossible. And we can't keep the pods online all the time.

@0xmichalis

But we use container hooks, which launch integration tests with other services in different pods.

Would it make sense to group all those containers together? Or are those other services independent components of your system?

And we can't keep the pods online all the time.

Do you really need to scale down to zero here or could you just stop directing traffic to those pods?

@livelace
Author

livelace commented Jun 1, 2016

Would it make sense to group all those containers together? Or are those other services independent components of your system?

Yes. For example: three containers, each of which needs specific settings, all communicating with each other, plus "external" services (not in the same pod) that are also connected to the others.

Do you really need to scale down to zero here or could you just stop directing traffic to those pods?

We need:

step 1: start/scale DC to N
step 2: wait for the tests to complete
step 3: stop/scale DC to zero

@0xmichalis

0xmichalis commented Jun 1, 2016

We need:

step 1: start/scale DC to N
step 2: wait for the tests to complete
step 3: stop/scale DC to zero

Try setting dc.spec.test=true

You should deploy it every time you need it to run (oc deploy NAME --latest).
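
For illustration, roughly how that would look on the DC used earlier in this thread (only the fields relevant to test deployments are shown):

apiVersion: "v1"
kind: "DeploymentConfig"
metadata:
  name: "test"
spec:
  test: true          # keep zero replicas except while a deployment is running
  replicas: 1         # replica count used only for the duration of each deployment
  # ... selector, template, and strategy as in the earlier DC

Each oc deploy test --latest then scales the DC up, runs the deployment, and scales it back down when it finishes.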

@livelace
Author

livelace commented Jun 1, 2016

@Kargakis Thanks, but:

[root@openshift-master1 11.0]# oc explain dc.spec.test
FIELD: test

DESCRIPTION:
     Test ensures that this deployment config will have zero replicas except
     while a deployment is running. This allows the deployment config to be used
     as a continuous deployment test - triggering on images, running the
     deployment, and then succeeding or failing. Post strategy hooks and After
     actions can be used to integrate successful deployment with an action.

We need:

  1. Scale up DC1 (contains some services, with settings/tests applied through a container hook).
  2. Scale up DC2 (contains some services, with settings/tests applied through a container hook).
  3. Scale up DC3, which communicates with DC1 and DC2 (contains some services, with settings/tests applied through a container hook).
  4. Wait for the results.
  5. Scale DC1/DC2/DC3 to zero.

And what we actually get:

  1. Deployment of DC1 is triggered. The DC1 deployment finishes and is powered down.
  2. Deployment of DC2 is triggered. The DC2 deployment finishes and is powered down.
  3. DC3 can't communicate with DC1/DC2, because they were already stopped.

@0xmichalis

There are a couple of things you can do in such a scenario. One is that you can set up post hooks in DC1/DC2 to wait for DC3 to complete. DC3 comes up, runs its tests, and completes; then DC1 and DC2 complete. All are scaled down automatically because they have dc.spec.test=true.

You can also play around with custom deployments:
http://lists.openshift.redhat.com/openshift-archives/dev/2016-May/msg00037.html

@0xmichalis

You could also set up a pre hook for DC3 that starts DC1 and DC2 and waits for them to be running, so you would only need to run DC3.

$ oc deploy dc3 --latest
---> DC3 prehook starts DC1 and DC2 and waits for them to be running...
---> DC1 and DC2 are running
---> Prehook exits
---> DC1 and DC2 should wait on their posthooks at this point
---> DC3 is running
---> DC3 completes, is scaled down because it has dc.spec.test=true
---> Posthooks for DC1 and DC2 exit, they complete and are scaled down because dc.spec.test=true

Complicated but it could work.
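
To make that concrete, a rough sketch of the DC3 pre hook described above; the container name and script are hypothetical, and the script would need API access (for example a service account allowed to trigger and watch DC1/DC2 deployments), which is an assumption beyond what this thread shows:

# On DC3 (which also has spec.test: true):
strategy:
  type: "Rolling"
  rollingParams:
    pre:
      failurePolicy: "Abort"
      execNewPod:
        containerName: "dc3-service"        # placeholder container name
        # Hypothetical script: trigger DC1/DC2 deployments and block until their pods are running.
        command: ["/bin/sh", "-c", "/opt/start-and-wait-dc1-dc2.sh"]

DC1 and DC2 would then carry the matching post hooks that wait for DC3 to complete, as described above.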

@livelace
Author

livelace commented Jun 1, 2016

@Kargakis Thanks, I will try it later.

PS. Our path is a thorny one :)

@0xmichalis

The thing is that by scaling deployment configs up/down to zero instead of actually deploying them, you lose all the benefits you get from using deployments. Replication controllers by design cannot detect failures, and I don't think that will ever change, especially in light of ... having deployments :)

@0xmichalis

DC3 also sounds a lot like a Job:
https://docs.openshift.org/latest/dev_guide/jobs.html
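
For comparison, a minimal sketch of what the DC3 test run might look like as a Job; the image, command, and names are placeholders, and the exact apiVersion available for Jobs depends on the cluster version in use:

apiVersion: batch/v1              # older clusters may expose Jobs as extensions/v1beta1
kind: Job
metadata:
  name: dc3-tests
spec:
  template:
    metadata:
      name: dc3-tests
    spec:
      containers:
        - name: tests
          image: "tests:latest"                                          # placeholder test image
          command: ["/bin/sh", "-c", "/opt/run-integration-tests.sh"]    # hypothetical test script
      restartPolicy: Never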

@livelace
Author

livelace commented Jun 1, 2016

The thing is that by scaling deployment configs up/down to zero instead of actually deploying them, you lose all the benefits you get from using deployments.

Actually, the deployments themselves work fine, because they are triggered by "image change", and that's OK.

DC3 is the easiest part of the configuration. I had already looked at Jobs.

@livelace
Author

livelace commented Jun 1, 2016

The main challenge is using multiple deployments (with services inside) together with each other. At the moment, after scaling, we just check a file flag which indicates whether the tests inside the deployment have run (with Jenkins' help). But we want a more mature/correct mechanism.

@gabemontero
Contributor

My interpretation is that the discussions here have circled back to openshift/origin#8507

We also got clarification from @Kargakis back with #33 (comment) on why the RC was reporting what it was reporting.

And based on the discussions noted with #33 (comment), we still don't want the plugin to start interrogating Pod state.

Of course, this is still an evolving area. If changes occur around the multiple deployment orchestration or what the RC reports wrt Pod state, we can look into associated changes in the plugin.

But with the above preamble, I'm going to go ahead and close this one out. Please continue discussions either in openshift/origin#8507 or new issues to be opened against origin or k8s if the discussion broadens.

Thanks.
