Doesn't detect failed replication controller/deployment configuration #33
Comments
In other words, why don't we subscribe to pod status events, or wait for pod creation to complete? Can we check the pod status through "Verify OpenShift Deployment"? |
@livelace the "Verify OpenShift Deployment" step currently stops after seeing the RC go to Complete, but after seeing you scenario, I realize it could do better. I'll start looking into including a monitor of the deploy pod status into that step's logic (perhaps the other deploy related steps as well - we'll review). @bparees - FYI |
@livelace perhaps you could use the http check step to confirm the pod is running? or a readiness check in your DC that confirms the pod came up (which will block the deployment completion). |
@bparees My service is not HTTP-capable; I thought about this. My case:
a) I don't know whether service1 and service2 are up and running and all hooks have completed, so I can't stop the Jenkins tasks, because as far as I can tell everything is fine. b) I can't scale the deployments down to zero at the proper time, because I don't know whether all tasks inside the pods have completed. I can't manage the tasks properly, because I don't know their states. |
Not to overly distract from this thread, but I should have deployer pod ... |
@gabemontero It will be great! |
@gabemontero deployer pod state, or just pod state? |
@bparees I'll look for both to a degree. Testing shows the deployer pod is ... |
the replication controller (deployment) ought to reflect the state of the deployer pod, so i don't see the value in looking at the deployer pod. |
I have not seen that yet, at least in what I was previously examining from the ... |
Yep, at least with the latest level from upstream origin, @bparees is correct wrt the RC being sufficient. So we are at one of two spots, @livelace: |
Not working:
https://paste.fedoraproject.org/351461/91294146/
https://paste.fedoraproject.org/351462/46009139/
https://paste.fedoraproject.org/351463/46009150/
[root@openshift-master1 ~]# oc get rc
NAME READY STATUS RESTARTS AGE |
"Verify whether the pods are up" in settings will be enough :) |
@livelace I'll see if I can pull a v1.1.6 version of OpenShift and reproduce what you are seeing, but at the moment, it appears that we are falling into category 2) from my earlier comment. If that does prove to be true, then rather than adding the new step, we'll want you to try the existing step against v1.2.0 when it becomes available (that is the "latest version" I was testing against). |
@livelace - one additional request while I try to reproduce at a lower level of code - when you reproduce, is the equivalent of the testing-11.0-drweb-netcheck-nossl-peer1-2-deploy pod from your last repro staying around long enough for you to dump its contents to JSON/YAML? If so, can you provide that as well (assuming you'll need to reproduce again to do so). Thanks. |
ok, I went to the same level as @livelace and could not reproduce. One additional question did occur to me ... do you create a successful deployment, then scale it down, edit the DC to introduce the failing hook, and scale back up? |
@gabemontero Hello. No, the DC has had the hook from the beginning. |
After creating "DC" has zero count. |
Creation progress - https://paste.fedoraproject.org/351916/14601346/ |
After the error occurs, I can scale the DC down and repeat everything again. |
I can modify the script that runs inside the hook to exit 0, and everything is fine with the DC (without any modification of the configuration). I can also modify the script to exit 0 while the DC is being brought up, and the DC will work fine. PS: This is possible because I use a dedicated script that contains "exit 1". |
Hey @livelace - not sure what you mean by "creation progress". I just see ... |
"Creation progress" - scale DC to 1. |
Thanks for the additional details. I have a couple of thoughts on ... |
Within the hour I can grant access to my test environment, I think. |
@gabemontero Can you connect over SSH to my environment? |
OK, I've reproduced it. I did:
Note, if I start with the lifecycle/postStart hook exiting with 1 and initial replicas of 1, then the RC is marked as failed. This is basically what my recent repro attempts did. And now that I understand what is going on, I'm pretty positive that my very first repro attempt, where I saw the RC in Complete state, was when I edited a previously used DC to add the lifecycle/postStart exit 1 check. So good for me that I was not imagining things originally :-). Now, what to do about this. It is not a given that we want to address this with a new plugin step.
@bparees: thoughts? ... and I thought about tagging our friends in platform mgmt now, but decided on getting a sanity check from you before officially pulling that trigger. |
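For anyone reproducing this later: the trigger is a container lifecycle postStart hook that always fails. A minimal sketch of the relevant pod-template fragment, assuming a placeholder container name and image rather than the ones from the actual repro:

```yaml
spec:
  template:
    spec:
      containers:
      - name: service1                                # placeholder container name
        image: registry.example.com/service1:latest   # placeholder image
        lifecycle:
          postStart:
            exec:
              # Always exits non-zero, standing in for a hook script that ends with "exit 1"
              command: ["/bin/sh", "-c", "exit 1"]
```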
All the links are dead, unfortunately. Why is a readiness probe not working for your case, @livelace? There are different kinds of probes; if you cannot use HTTP, you can run shell commands (exec) or try to open a TCP connection. |
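As a minimal sketch of a non-HTTP probe (the check script, port, and container name are placeholders, not from this issue):

```yaml
spec:
  template:
    spec:
      containers:
      - name: service1                                # placeholder container name
        image: registry.example.com/service1:latest   # placeholder image
        readinessProbe:
          exec:
            # Placeholder script: exit 0 when the service is ready, non-zero otherwise
            command: ["/bin/sh", "-c", "/opt/healthcheck.sh"]
          initialDelaySeconds: 10
          periodSeconds: 5
        # Or, if the service listens on a port, a TCP probe instead:
        # readinessProbe:
        #   tcpSocket:
        #     port: 5432
```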
Sorry for jumping from one thread to the other, but upstream deployments have no hooks yet. If your problem is that you want a failed hook to fail your deployment, you can specify FailurePolicy == Abort. |
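The failure policy is set on the DC strategy's lifecycle hook. A minimal sketch using a pre hook (a post hook, as suggested later in this thread, takes the same fields; the container name and script path are placeholders):

```yaml
spec:
  strategy:
    type: Recreate
    recreateParams:
      pre:
        failurePolicy: Abort        # a failed hook aborts the deployment
        execNewPod:
          containerName: service1                                    # placeholder container name
          command: ["/bin/sh", "-c", "/opt/run-integration-tests.sh"]  # placeholder script
```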
@livelace any news here? Did hooks work for you? |
@Kargakis I'm sad :(
[root@openshift-master1 ~]# oc get pods
[root@openshift-master1 ~]# oc get dc |
If the pod inside is broken, the DC should be marked as failed and the replica count should be 0, because an existing replica is not the same as a working replica. |
@livelace the timeout for deployments is 10 minutes. If you had waited 2 more minutes you would have seen the deployment marked as failed and eventually scaled down to zero :)
Note that your use of the deployment hook didn't do anything, and the deployment hook was complete. I think in your case you would want a post hook with the Abort policy and a script that makes sure your application pod is up and running. Readiness probes can also help. Regarding the timeout, we will make it configurable eventually. |
The deployment hook is useful when we need to be sure that a new version of the software works as expected, and to roll back to a working deployment if it fails. But we have a different situation. Our deployment should be launched, and if the container hook (which does a bunch of things inside; it is dynamic configuration) returns an error, the deployment should be marked as failed. The deployment hook doesn't work for us, because it only runs during a deployment, while we need failures to be caught when a container hook raises an error during scaling. We need a failure policy for container hooks. |
@smarterclayton @ironcladlou, @livelace wants container hooks to be taken into account for deployments. Thoughts? |
@livelace actually can you try to run a deployment with a container postStart hook that fails and see if it works for you after 10 minutes? |
@bparees @gabemontero Can we detect and wait for this behavior during "Scale Deployment" in Jenkins? |
"Verify whether the specified number of replicas are up" already exist. |
@livelace detect and wait for what exactly? I still haven't seen a satisfactory answer for why the replication controller is reporting N current pods when those pods are in a failed state. |
Because that's not the job of a replication controller but of a deployment. The rc/rs will always report what it has created, but it cannot know whether those pods are running.
You shouldn't deploy zero replicas and scale after the fact. That's why we use deployments in the first place, otherwise we would still use replication controllers. Deployments ensure that your pods are able to run. Replication controllers cannot do that by design. |
Our deployments can always run and work without any problem; the pods work perfectly. But we use container hooks, which launch integration tests against other services in different pods. Creating deployment configurations for every possible combination of our software is impossible, and we can't keep the pods online all the time. |
Would it make sense to group all those containers together? Or are those other services independent components of your system?
Do you really need to scale down to zero here or could you just stop directing traffic to those pods? |
Yes, it is. For example: three containers, each needing specific settings and all communicating with each other, plus "external" services (not in the same pod) that are also connected to the others.
We need: step 1: start/scale the DC to N ... |
Try setting dc.spec.test=true. You should deploy it every time you need it to run (...). |
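A minimal sketch of what a test DC could look like (the name, labels, and image are placeholders):

```yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: dc3                    # placeholder name
spec:
  test: true                   # replicas stay at 0 except while a deployment is running
  replicas: 1
  selector:
    app: dc3
  template:
    metadata:
      labels:
        app: dc3
    spec:
      containers:
      - name: tests
        image: registry.example.com/tests:latest   # placeholder image
```

With test: true, each deployment scales the pods up for the run and back down to zero when the deployment finishes, so nothing is left running between test runs.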
@Kargakis Thanks, but:
DESCRIPTION: We need: ...
And what we have, steps: ... |
There are a couple of things you can do in such a scenario. One is that you can set up post hooks in DC1/DC2 to wait for DC3 to complete. DC3 comes up, runs its tests, completes; then DC1 and DC2 complete. All are scaled down automatically because they have dc.spec.test=true. You can also play around with custom deployments: ... |
You could also set up a pre hook for DC3 that starts DC1 and DC2 and waits for them to be running, so you would only need to run DC3; see the sketch after this comment.
Complicated, but it could work. |
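As a rough illustration of that second option, DC3's strategy could carry a pre hook whose script brings up DC1/DC2 and waits for them. The wait script itself is hypothetical; it would need API access from inside the hook pod (for example a service account plus oc or curl in the image):

```yaml
# In DC3 (sketch only)
spec:
  strategy:
    type: Recreate
    recreateParams:
      pre:
        failurePolicy: Abort            # abort DC3's deployment if DC1/DC2 never come up
        execNewPod:
          containerName: tests                                           # placeholder container name
          command: ["/bin/sh", "-c", "/opt/start-and-wait-for-dc1-dc2.sh"]  # hypothetical wait script
```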
@Kargakis Thanks, I will try it later. PS. Our path is a thorny one :) |
The thing is that by scaling deployment configs up/down to zero instead of actually deploying them, you lose all the benefits you get from using deployments. Replication controllers by design cannot detect failures, and I don't think that will ever change, especially in light of ... having deployments :) |
DC3 also sounds a lot like a Job: |
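If DC3 is really a run-to-completion test runner, a Job could look roughly like this (name, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dc3-tests                        # placeholder name
spec:
  template:
    spec:
      containers:
      - name: tests
        image: registry.example.com/tests:latest         # placeholder image
        command: ["/bin/sh", "-c", "/opt/run-tests.sh"]   # placeholder test entrypoint
      restartPolicy: Never               # run once to completion; do not restart the container
```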
Actually, deployments work fine for us, because they are triggered by "image change" and that's OK. DC3 is the easiest part of the configuration. I have looked at "Jobs" before. |
The main challenge is using multiple deployments (with the services inside them) together. At the moment, after scaling, we just check a file flag that indicates whether the tests inside the deployment have executed (with Jenkins' help). But we want a more mature/correct mechanism. |
My interpretation is that the discussions here have circled back to openshift/origin#8507. We also got clarification from @Kargakis in #33 (comment) on why the RC was reporting what it was reporting, and based on the discussions noted in #33 (comment), we still don't want the plugin to start interrogating pod state. Of course, this is still an evolving area. If changes occur around multiple-deployment orchestration or around what the RC reports wrt pod state, we can look into associated changes in the plugin. But with the above preamble, I'm going to go ahead and close this one out. Please continue discussions either in openshift/origin#8507 or in new issues opened against origin or k8s if the discussion broadens. Thanks. |
https://paste.fedoraproject.org/350947/60028840/
https://paste.fedoraproject.org/350950/60028895/
https://paste.fedoraproject.org/350952/28958146/
As a result, we think the RC is running, but the pod inside the RC is not running.
How can we detect that the DC is not actually running, so that the next build steps are not started?