
Doesn't detect failed replication controller/deployment configuration #33

Closed
livelace opened this issue Apr 7, 2016 · 80 comments

@livelace

livelace commented Apr 7, 2016

  1. Create an example DC with:

https://paste.fedoraproject.org/350947/60028840/

  2. Deploy the DC, oc get -o yaml rc:

https://paste.fedoraproject.org/350950/60028895/

  3. Status of the pod, oc get -o yaml pod:

https://paste.fedoraproject.org/350952/28958146/

==>

As a result, we think that the RC is running, but the pod inside the RC is not running.
How can we detect that the DC is not running and avoid starting the next build steps?

@livelace
Author

livelace commented Apr 7, 2016

In other words, why don't we subscribe to pod status events, or wait for pod creation to complete? Can we check the pod status through "Verify OpenShift Deployment"?

@gabemontero gabemontero self-assigned this Apr 7, 2016
@gabemontero
Contributor

@livelace the "Verify OpenShift Deployment" step currently stops after seeing the RC go to Complete, but after seeing you scenario, I realize it could do better.

I'll start looking into including a monitor of the deploy pod status into that step's logic (perhaps the other deploy related steps as well - we'll review).

@bparees - FYI

@bparees
Contributor

bparees commented Apr 7, 2016

@livelace perhaps you could use the http check step to confirm the pod is running? or a readiness check in your DC that confirms the pod came up (which will block the deployment completion).
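
For context, the kind of readiness check being suggested is a probe on the DC's container, roughly like the sketch below (container name, image, path, and port are placeholders, not anything from this issue); as the following comments note, this particular variant only helps if the service speaks HTTP:

containers:
  - name: "myservice"                # placeholder container name
    image: "myservice:latest"        # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz               # placeholder health endpoint
        port: 8080
      initialDelaySeconds: 10

Until the probe succeeds, the pod is not reported as ready, which is what blocks the deployment from being considered complete.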

@livelace
Author

livelace commented Apr 7, 2016

@bparees My service is not HTTP-capable; I had thought about this.

My case:

  1. First build step - start service1.
  2. Second build step - start service2.
  3. I want to start a third build step, which depends on 1 and 2. I run into problems:

a) I don't know whether service1 and service2 are up and running with all hooks completed, so I can't fail the Jenkins job; as far as Jenkins can tell, everything is all right.

b) I can't scale the deployments down to zero at the proper time, because I don't know whether all the tasks inside the pods have completed.

I can't properly manage the tasks, because I don't know their states.

@gabemontero
Contributor

Not to overly distract from this thread, but I should have deployer pod state verification working either later today or tomorrow.


@livelace
Author

livelace commented Apr 7, 2016

@gabemontero That would be great!

@bparees
Contributor

bparees commented Apr 7, 2016

@gabemontero deployer pod state, or just pod state?

@gabemontero
Contributor

@bparees I'll look for both to a degree. Testing shows the deployer pod is pruned minimally if successful. So I'll first see if we have a deployer pod in a non-complete state. If a deployer pod no longer exists, I'll confirm that a running pod exists for the correct generation of the deployment.


@bparees
Contributor

bparees commented Apr 7, 2016

The replication controller (deployment) ought to reflect the state of the deployer pod, so I don't see the value in looking at the deployer pod.

@gabemontero
Contributor

I have not seen that yet, at least in what I was previously examining from the output provided and my reproduction with the evil postStart hook, but I'll double-check when I get back to the office. The deployment phase still said Complete.


@gabemontero
Contributor

Yep, at least with the latest level from upstream origin, @bparees is correct wrt the RC being sufficient. Adding the same lifecycle: postStart sabotage, the RC ends up in the Failed state per the deployment.phase annotation on the RC. I think my earlier repro did not go far enough or something. I could have sworn I saw it go to Complete, but I now consistently see it go to Failed after several runs.

So we are at one of two spots, @livelace:

  1. you could try adding a "Verify OpenShift Deployment" step and hopefully you will see the same results
  2. if your output at https://paste.fedoraproject.org/350950/60028895/ was in fact captured after the Pod failed, then I suspect your version of OpenShift is far enough behind the latest that you are seeing a difference in deployment behavior (certainly that component has evolved some this last release cycle). If that is the case, it may simply be a matter of when you can upgrade.

@livelace
Author

livelace commented Apr 8, 2016

Not working:

  1. [root@openshift-master1 ~]# oc version
    oc v1.1.6
    kubernetes v1.2.0-36-g4a3f9c5
  2. Jenkins console output (verbose mode); the job includes the verification step and completed without any errors:

https://paste.fedoraproject.org/351461/91294146/

  3. RC status:

https://paste.fedoraproject.org/351462/46009139/

[root@openshift-master1 ~]# oc get rc
NAME                                        DESIRED   CURRENT   AGE
testing-11.0-drweb-netcheck-nossl-peer1-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer1-2   1         1         16h
testing-11.0-drweb-netcheck-nossl-peer2-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer2-2   0         0         16h
testing-11.0-drweb-netcheck-nossl-peer3-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer3-2   0         0         16h

  4. Pod status:

https://paste.fedoraproject.org/351463/46009150/
http://prntscr.com/apk3ey

@livelace
Author

livelace commented Apr 8, 2016

NAME                                              READY     STATUS             RESTARTS   AGE
testing-11.0-drweb-netcheck-nossl-peer1-2-6zkg7   0/1       CrashLoopBackOff   14         1h

@livelace
Author

livelace commented Apr 8, 2016

"Verify whether the pods are up" in settings will be enough :)

@gabemontero
Contributor

@livelace I'll see if I can pull a v1.1.6 version of OpenShift and reproduce what you are seeing, but at the moment it appears that we are falling into category 2) from my earlier comment. If that does prove to be true, then rather than adding the new step, we'll want you to try the existing step against v1.2.0 when it becomes available (that is the "latest version" I was testing against).

@gabemontero
Contributor

@livelace - one additional request while I try to reproduce at a lower level of code: when you reproduce, is the equivalent of the testing-11.0-drweb-netcheck-nossl-peer1-2-deploy pod from your last repro staying around long enough for you to dump its contents to JSON/YAML? If so, can you provide that as well (assuming you'll need to reproduce again to do so)?

thanks

@gabemontero
Contributor

OK, I went to the same level as @livelace and could not reproduce. One additional question did occur to me ... do you create a successful deployment, then scale it down, edit the DC to introduce

lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - exit 1

and then scale to 1 and verify the deployment?

@livelace
Author

livelace commented Apr 8, 2016

@gabemontero Hello.

No, the DC has the hook from the beginning.

@livelace
Author

livelace commented Apr 8, 2016

After creating "DC" has zero count.

@livelace
Author

livelace commented Apr 8, 2016

Creation progress - https://paste.fedoraproject.org/351916/14601346/

@livelace
Author

livelace commented Apr 8, 2016

After the error occurs, I can scale the DC down and repeat everything again.

@livelace
Author

livelace commented Apr 8, 2016

I can modify the script that runs inside the hook to exit 0, and everything is fine with the DC (without any modification of the configuration).

I can also modify the script (to exit 0) during an attempt to set up the DC, and the DC will work fine.

PS. This is possible because I use a dedicated script that contains "exit 1".

@gabemontero
Contributor

Hey @livelace - not sure what you mean by "creation progress". I just see another Pod yaml for a Pod created by a replication controller.



@livelace
Author

livelace commented Apr 8, 2016

"Creation progress" - scale DC to 1.

@gabemontero
Contributor

Thanks for the additional details. I have a couple of thoughts on reworking my repro attempts. I'll report back when I have something tangible.


@livelace
Author

livelace commented Apr 8, 2016

I think I can grant access to my test environment within an hour.

@livelace
Author

livelace commented Apr 8, 2016

@gabemontero Can you connect to my environment over SSH?

@gabemontero
Contributor

OK, I've reproduced it. I did:

  1. before starting a deployment, added your lifecycle/postStart hook with exit 0
  2. deployed, then scaled back down to 0
  3. edited the DC, changing the lifecycle/postStart hook to exit 1
  4. scaled to 1 ... the Pod fails, but the next generation of the RC says it completed successfully.

Note, if I start with the lifecycle/postStart hook exiting with 1 and initial replicas of 1, then the RC is marked as failed. This is basically what my recent repro attempts did. And now that I understand what is going on, I'm pretty positive that my very first repro attempt, where I saw the RC in the Complete state, was when I edited a previously used DC to add the lifecycle/postStart exit 1 check. So good for me that I was not imagining things originally :-).

Now, what to do about this. It is not a given we want to address this with a new plugin step.

  1. this could be a deployment bug that needs to get addressed, with the RC reflecting the state of the pod
  2. the nuance of updating a DC which has been deployed one way, scaled down, edited, and redeployed could be "against current design" or some such.
  3. certainly the lifecycle/postStart-induced failure is merely a means of producing an unexpected container startup failure, but there may be nuances to using it to tank the container, where a container dying on startup "naturally" would have different characteristics.

@bparees: thoughts? ... and I thought about tagging our friends in platform mgmt now, but decided on getting a sanity check from you before officially pulling that trigger.

@0xmichalis

All the links are dead, unfortunately. Why is a readiness probe not working for your case, @livelace? There are different kinds of probes: if you cannot use HTTP, you can run shell commands (exec) or try to open a TCP connection.
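
For reference, a minimal sketch of the two non-HTTP probe variants mentioned here; the container names, images, port, and check script are placeholders rather than anything from the original report:

containers:
  - name: "service1"                 # placeholder container name
    image: "service1:latest"
    readinessProbe:
      exec:
        command: ["/bin/sh", "-c", "/opt/check-service1-ready.sh"]   # hypothetical readiness script
      initialDelaySeconds: 10
      timeoutSeconds: 5
  - name: "service2"                 # placeholder container name
    image: "service2:latest"
    readinessProbe:
      tcpSocket:
        port: 5432                   # placeholder port
      initialDelaySeconds: 10
      timeoutSeconds: 5

A pod is only reported as ready once its probe succeeds, which is what lets the deployment process notice a pod that never comes up.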

@livelace
Author

@Kargakis No, readiness and liveness probes don't work in this case. I repeat @bparees's statement:

"the fundamental issue in my mind is that the replication controller is reporting an active count of 1 despite the fact that the only pod that exists is in a FAILED state."

@0xmichalis

Actually, we want lifecycle hooks to be processed in the context of the deployment status. If a hook ends with an error, the deployment should be considered a failure. The pod's main process may be working well and passing all its liveness/readiness checks, but if the hook returned an error the deployment status should be "failed", because the hook is an important part of the pod and its readiness.

Sorry for jumping from one thread to the other, but upstream deployments have no hooks yet. If your problem is that you want a failed hook to fail your deployment, you can specify FailurePolicy == Abort

@0xmichalis

@livelace any news here? Did hooks work for you?

@livelace
Author

@Kargakis I'm sad :(

apiVersion: v1
kind: List
items:

- apiVersion: "v1"
  kind: "DeploymentConfig"
  metadata:
    name: "test"
  spec:
    template:
      metadata:
        labels:
          name: "test"
      spec:
        containers:
          - 
            name: "nginx"
            image: "nginx:latest"

            lifecycle:
              postStart:
                exec:
                  command: [ "exit", "1" ]
    replicas: 1
    selector:
      name: "test"

    strategy:
      type: "Rolling"
      rollingParams:
        pre:
          failurePolicy: "Abort"
          execNewPod:
            containerName: "nginx"
            command: [ "true" ]

[root@openshift-master1 ~]# oc get pods
NAME              READY     STATUS             RESTARTS   AGE
test-1-deploy     1/1       Running            0          8m
test-1-hook-pre   0/1       Completed          0          5m
test-1-oo71c      0/1       CrashLoopBackOff   3          6m

[root@openshift-master1 ~]# oc get dc
NAME      REVISION   REPLICAS   TRIGGERED BY
test      1          1          config

@livelace
Author

livelace commented May 27, 2016

If the pod inside is broken, we should mark the DC as failed and the replica count should be 0, because an existing replica is not the same thing as a working replica.

@0xmichalis

@livelace the timeout for deployments is 10m. If you had waited 2 more minutes, you would have seen the deployment marked as failed and eventually scaled down to zero :)

[vagrant@localhost sample-app]$ oc get pods
NAME              READY     STATUS             RESTARTS   AGE
test-1-deploy     1/1       Running            0          10m
test-1-hook-pre   0/1       Completed          0          10m
test-1-utjds      0/1       CrashLoopBackOff   6          9m
[vagrant@localhost sample-app]$ oc get pods
NAME              READY     STATUS      RESTARTS   AGE
test-1-deploy     0/1       Error       0          10m
test-1-hook-pre   0/1       Completed   0          10m
[vagrant@localhost sample-app]$ oc status
In project test on server https://10.0.2.15:8443

dc/test deploys docker.io/library/nginx:latest 
  deployment #1 failed 10 minutes ago

1 warning identified, use 'oc status -v' to see details.

Note that your use of the deployment hook didn't do anything useful; the hook itself completed fine. I think in your case you would want a post hook with the Abort policy and a script that makes sure your application pod is up and running. Readiness probes can also help.
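
For illustration, a rough sketch of the kind of post hook being suggested, added to the strategy of the DC posted earlier; the wait script is hypothetical and not something from this thread:

strategy:
  type: "Rolling"
  rollingParams:
    post:
      failurePolicy: "Abort"
      execNewPod:
        containerName: "nginx"
        # Hypothetical script: poll until the application pod reports ready, exit non-zero otherwise.
        command: ["/bin/sh", "-c", "/opt/wait-for-app-ready.sh"]

If that script exits non-zero, the Abort policy treats the deployment as failed instead of letting it complete.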

Regarding the timeout, we will make it configurable eventually.

@livelace
Author

@Kargakis

The deployment hook is useful when we need to be sure that a new version of the software works as expected, and to go back to a working deployment if it fails. But we have a different situation. Our deployment should be launched, and if the container hook (which does a bunch of things inside; it is dynamic configuration) returns an error, the deployment should be marked as failed.

The deployment hook doesn't work for us, because it only runs during a deployment. We need to handle the case where a container hook causes an error during scaling. We need a failure policy for container hooks.

@0xmichalis

@smarterclayton @ironcladlou, @livelace wants container hooks to be taken into account for deployments. Thoughts?

@0xmichalis

@livelace actually can you try to run a deployment with a container postStart hook that fails and see if it works for you after 10 minutes?

@livelace
Author

@Kargakis

dc/test deploys docker.io/library/nginx:latest 
  deployment #1 failed 11 minutes ago
Every 1,0s: oc get pods | grep ^test-        Tue May 31 20:16:58 2016

test-1-deploy                                      0/1       Error       0          13m
[root@openshift-master1 ~]# oc get rc
NAME      DESIRED   CURRENT   AGE
test-1    0         0         17m

@bparees @gabemontero Can we detect and wait for this behavior during "Scale Deployment" in Jenkins?

@livelace
Author

"Verify whether the specified number of replicas are up" already exist.

@bparees
Contributor

bparees commented Jun 1, 2016

@livelace detect and wait for what exactly? I still haven't seen a satisfactory answer for why the replication controller is reporting N current pods when those pods are in a failed state.

@livelace
Author

livelace commented Jun 1, 2016

@bparees @Kargakis OK. I think there are at least two options:

  1. Report the replica count at an early stage (immediately after a container hook returns an error).
  2. When we do "Scale Deployment", wait and check that the deployment is not in a failed state.
  3. ?

@0xmichalis

I still haven't seen a satisfactory answer for why the replication controller is reporting N current pods when those pods are in a failed state.

Because that's not the job of a replication controller but of a deployment. The RC/RS will always report what it has created, but it cannot know whether those pods are running.

Report the replica count at an early stage (immediately after a container hook returns an error).
When we do "Scale Deployment", wait and check that the deployment is not in a failed state.

You shouldn't deploy zero replicas and scale after the fact. That's why we use deployments in the first place, otherwise we would still use replication controllers. Deployments ensure that your pods are able to run. Replication controllers cannot do that by design.

@livelace
Author

livelace commented Jun 1, 2016

Deployments ensure that your pods are able to run. Replication controllers cannot do that by design.

Our deployments can always run and they work without any problem. The pods work fine. But we use container hooks, which launch integration tests with other services in different pods.

Creating deployment configurations for every possible combination of our software is impossible. And we can't keep the pods online all the time.

@0xmichalis

But we use container hooks, which launch integration tests with other services in different pods.

Would it make sense to group all those containers together? Or are those other services independent components of your system?

And we can't keep the pods online all the time.

Do you really need to scale down to zero here or could you just stop directing traffic to those pods?

@livelace
Author

livelace commented Jun 1, 2016

Would it make sense to group all those containers together? Or are those other services independent components of your system?

Yes. For example: three containers, each of which needs specific settings, all communicating with each other, plus "external" services (not in the same pod) that are also connected to the others.

Do you really need to scale down to zero here or could you just stop directing traffic to those pods?

We need:

step 1: start/scale DC to N
step 2: wait for the tests to complete
step 3: stop/scale DC to zero

@0xmichalis

0xmichalis commented Jun 1, 2016

We need:

step 1: start/scale DC to N
step 2: wait for the tests to complete
step 3: stop/scale DC to zero

Try setting dc.spec.test=true

You should deploy it every time you need it to run (oc deploy NAME --latest).
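
For illustration, roughly how that would look on the DC used earlier in this thread (only the fields relevant to test deployments are shown):

apiVersion: "v1"
kind: "DeploymentConfig"
metadata:
  name: "test"
spec:
  test: true          # keep zero replicas except while a deployment is running
  replicas: 1         # replica count used only for the duration of each deployment
  # ... selector, template, and strategy as in the earlier DC

Each oc deploy test --latest then scales the DC up, runs the deployment, and scales it back down when it finishes.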

@livelace
Author

livelace commented Jun 1, 2016

@Kargakis Thanks, but:

[root@openshift-master1 11.0]# oc explain dc.spec.test
FIELD: test

DESCRIPTION:
     Test ensures that this deployment config will have zero replicas except
     while a deployment is running. This allows the deployment config to be used
     as a continuous deployment test - triggering on images, running the
     deployment, and then succeeding or failing. Post strategy hooks and After
     actions can be used to integrate successful deployment with an action.

We need:

  1. Scale up DC1 (contains some services, with settings/tests applied through a container hook).
  2. Scale up DC2 (contains some services, with settings/tests applied through a container hook).
  3. Scale up DC3, which communicates with DC1 and DC2 (contains some services, with settings/tests applied through a container hook).
  4. Wait for the results.
  5. Scale DC1/DC2/DC3 to zero.

And what we actually get:

  1. Deployment of DC1 is triggered. The DC1 deployment finishes and is powered down.
  2. Deployment of DC2 is triggered. The DC2 deployment finishes and is powered down.
  3. DC3 can't communicate with DC1/DC2, because they were already stopped.

@0xmichalis

There are a couple of things you can do in such a scenario. One is that you can set up post hooks in DC1/DC2 to wait for DC3 to complete. DC3 comes up, runs its tests, and completes; then DC1 and DC2 complete. All are scaled down automatically because they have dc.spec.test=true.

You can also play around with custom deployments:
http://lists.openshift.redhat.com/openshift-archives/dev/2016-May/msg00037.html

@0xmichalis

You could also set up a pre hook for DC3 that starts DC1 and DC2 and waits for them to be running, so you would only need to run DC3.

$ oc deploy dc3 --latest
---> DC3 prehook starts DC1 and DC2 and waits for them to be running...
---> DC1 and DC2 are running
---> Prehook exits
---> DC1 and DC2 should wait on their posthooks at this point
---> DC3 is running
---> DC3 completes, is scaled down because it has dc.spec.test=true
---> Posthooks for DC1 and DC2 exit, they complete and are scaled down because dc.spec.test=true

Complicated but it could work.
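
To make that concrete, a rough sketch of the DC3 pre hook described above; the container name and script are hypothetical, and the script would need API access (for example a service account allowed to trigger and watch DC1/DC2 deployments), which is an assumption beyond what this thread shows:

# On DC3 (which also has spec.test: true):
strategy:
  type: "Rolling"
  rollingParams:
    pre:
      failurePolicy: "Abort"
      execNewPod:
        containerName: "dc3-service"        # placeholder container name
        # Hypothetical script: trigger DC1/DC2 deployments and block until their pods are running.
        command: ["/bin/sh", "-c", "/opt/start-and-wait-dc1-dc2.sh"]

DC1 and DC2 would then carry the matching post hooks that wait for DC3 to complete, as described above.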

@livelace
Author

livelace commented Jun 1, 2016

@Kargakis Thanks, I will try it later.

PS. Our path is a thorny one :)

@0xmichalis

The thing is that by scaling deployment configs up/down to zero instead of actually deploying them, you lose all the benefits you get from using deployments. Replication controllers by design cannot detect failures, and I don't think that will ever change, especially in light of ... having deployments :)

@0xmichalis

DC3 also sounds a lot like a Job:
https://docs.openshift.org/latest/dev_guide/jobs.html
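
For comparison, a minimal sketch of what the DC3 test run might look like as a Job; the image, command, and names are placeholders, and the exact apiVersion available for Jobs depends on the cluster version in use:

apiVersion: batch/v1              # older clusters may expose Jobs as extensions/v1beta1
kind: Job
metadata:
  name: dc3-tests
spec:
  template:
    metadata:
      name: dc3-tests
    spec:
      containers:
        - name: tests
          image: "tests:latest"                                          # placeholder test image
          command: ["/bin/sh", "-c", "/opt/run-integration-tests.sh"]    # hypothetical test script
      restartPolicy: Never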

@livelace
Author

livelace commented Jun 1, 2016

The thing is that by scaling deployment configs up/down to zero instead of actually deploying them, you lose all the benefits you get from using deployments.

Actually, the deployments themselves work fine, because they are triggered by "image change", and that's OK.

DC3 is the easiest part of the configuration. I had already looked at Jobs.

@livelace
Author

livelace commented Jun 1, 2016

The main challenge is using multiple deployments (with services inside) together with each other. At the moment, after scaling, we just check a file flag which indicates whether the tests inside the deployment have run (with Jenkins' help). But we want a more mature/correct mechanism.

@gabemontero
Contributor

My interpretation is that the discussions here have circled back to openshift/origin#8507

We also got clarification from @Kargakis back with #33 (comment) on why the RC was reporting what it was reporting.

And based on the discussions noted with #33 (comment), we still don't want the plugin to start interrogating Pod state.

Of course, this is still an evolving area. If changes occur around the multiple deployment orchestration or what the RC reports wrt Pod state, we can look into associated changes in the plugin.

But with the above preamble, I'm going to go ahead and close this one out. Please continue discussions either in openshift/origin#8507 or new issues to be opened against origin or k8s if the discussion broadens.

Thanks.
