cluster up fails with "getsockopt: connection refused ()" #20617
Comments
I also faced the same problem. Have you solved it? Thanks. |
@AIKiller unfortunately I didn't 😢. The only lead I have is that this issue might be connected to the fact that the machine is behind a corporate proxy. I cannot, however, connect the affected machines outside of the current network, so I can't verify. It sounds crazy, but that is the only attribute which all the affected machines share; some more info at: minishift/minishift#2675. If your machine is behind the proxy and you can connect it directly and verify, that would help. |
Alerting the team that owns oc cluster up, but nested virtualization and corporate proxies just sound like a recipe for problems @openshift/sig-master |
We see the same problem periodically with cluster up on a CentOS VM, although it succeeds 70% of the time or more. We are running this in a virtualized lab environment using Ravello, so each instance is identical. Also, no proxies or firewalls. I had attributed it to our lab environment sometimes running slow, but I have no evidence of that. If I can help with log files or anything else, let me know what you need. Thanks! |
After adding RAM and CPU to a CentOS 7 VM (now 5 GB RAM, 4 CPUs, running in VirtualBox on Ubuntu 18.04) that I am trying to get openshift working on, I get past the original issue (getsockopt: connection refused); but now get: ... Watching with 'top', there is a 'hyperkube' process that gets very busy, but not for long periods. Is this maybe performance related? |
I'm seeing this as well on a fresh CentOS install after an initial successful install, when the next day I found the server down and could not restart it.
Are we not supposed to be running okd on a VM? I was hoping to use one well-provisioned corporate VM for all my containers. |
I'm seeing this problem when I use Minishift. My system is a standalone CentOS box on my home network. The version of Minishift is 1.24.0, which I pulled down 2 days ago. It appears to be running OpenShift 3.10.0. Is there a workaround for this issue? |
Any solution for this issue? I'm getting the same error. |
@khun83 Unfortunately I do not know of any solution yet. It might be caused by slowness of the network or computer, which could lead to cluster-up giving up after a while and throwing a timeout error. One thing which could help would be to have all the images loaded in a caching proxy, so time on pulls is saved. Another option would be to get into the codebase of cluster-up, increase the timeouts, build oc, try with it and verify that the ... I will try those steps this/next week, but if you have more time than me, then you can try on your setup and inform us. ping @bill0425 as you might be interested in the above ^ |
cc @deads2k |
This issue is also reproducible with OKD v3.11.0. It affects Minishift users and also any QE efforts which depend on |
Looking at the thread here, it is hard to tell if a) some nodes are just slow to start and 5 minute timeout is too little (i.e. 6 min would do it), or b) if there is some race condition that actually prevents the cluster from loading (i.e. that a 30 minute timeout would not do it). Timeouts are tricky to get right for all scenarios. @agajdosi I like the configurable timeout. It might be easy to set via ENV to check, rather than plumb through via Looks like running with verbose logs would give a little more info as well, although if the last error after 5 minutes is still 'connection refused' that wouldn't add more info. https://github.com/openshift/origin/blob/master/pkg/oc/clusterup/run_self_hosted.go#L231 |
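To make the two hypotheses above easier to tell apart, here is a minimal Python sketch of a wait loop whose timeout comes from the environment. It is not the cluster-up code (which is Go). The variable name CLUSTER_UP_TIMEOUT_SECONDS and both function names are hypothetical, and the 5 minute default only mirrors the value reported in this thread.

```python
import os
import time

# Default mirrors the 5 minute wait mentioned in this thread (an assumption,
# not read from the cluster-up source).
DEFAULT_TIMEOUT_SECONDS = 5 * 60


def timeout_from_env(name="CLUSTER_UP_TIMEOUT_SECONDS",
                     default=DEFAULT_TIMEOUT_SECONDS, environ=None):
    """Read a timeout in seconds from a hypothetical environment variable.

    Falls back to `default` when the variable is unset, empty,
    not a number, or not positive.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    if not raw:
        return float(default)
    try:
        value = float(raw)
    except ValueError:
        return float(default)
    return value if value > 0 else float(default)


def wait_until_healthy(probe, timeout, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or `timeout` seconds have passed.

    Returns (healthy, attempts, last_error). Errors such as
    "connection refused" (an OSError) are recorded and retried.
    """
    deadline = clock() + timeout
    attempts = 0
    last_error = None
    while True:
        attempts += 1
        try:
            if probe():
                return True, attempts, None
            last_error = "probe returned False"
        except OSError as exc:
            last_error = str(exc)
        if clock() >= deadline:
            return False, attempts, last_error
        sleep(interval)
```

With something like this, case a) would show up as success after raising the variable to, say, 360 or 1800 seconds, while case b) would still end with the same last error no matter how long the loop runs.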
What would help you folks diagnose this problem? If you let people know what you need and any changes that need to occur, I'm sure someone on the thread would be willing to help. Just give folks directions so we can help you. /Bill |
Hi, For me, it seems that using Edit: |
@arnaud-deprez thanks. Should we look into setting that value to |
I see this problem too when running minishift. The only solution so far is to run minishift stop and delete the folders ~/.kube and ~/.minishift, but that is rather time-consuming. I tried to enable image caching, but without luck. I'm using Hyper-V on Windows 10. |
In fact, it is the livenessProbe of the apiserver pod that is failing (it times out after 32 seconds, as shown by the message: Get https://192.168.42.18:8443/healthz?timeout=32s).
livenessProbe:
  initialDelaySeconds: 90
  httpGet:
    scheme: HTTPS
    port: 8443
    path: healthz
Note that the update may get overridden the next time minishift starts. Still, the workaround keeps you from getting stuck. |
@lovoni Good find. Do you know where those template live in code? |
Had this issue. It only occurred when I attempted to upgrade the version of OpenShift on an existing profile, e.g. I had a profile with 3.10 and attempted to start it with 3.11.
After that the profile was unusable. |
@lovoni Thanks for the workaround. I kept eye in the location minishift version - v1.26.1+Win10+VirtualBox |
@openshift/sig-master @mfojtik This issue started to affect more machines when we started to use OKD 3.11.0. And as there has been no progress on this issue since August, the only answer for all the users of Minishift or CDK who face this issue is none other than "yeah, throw that laptop away and try another one", which is terrible. It would be really great if you could find somebody who could take a look at this, as it is becoming a really painful issue for us. |
I would like to confirm the same issue as described above.
My case: VM machines: windows7/10 + rhel7 - 8cpu 16 GB ram, CDK 3.7.0-alpha-1.1 (oc v3.11.16). Please, take a look at this issue, thanks. |
Was facing this issue with oc cluster up. What I did to resolve it was:
After that oc cluster up worked fine. It seems there are cases when oc cluster up (I've observed this with Minishift as well) does not start properly when you have run a different version in the past. |
@odockal Can you verify you have the latest version of |
I don't think this is necessarily related to the version. I led a lab of over 100 people all starting up an oc cluster around the same time using exactly the same version, and fewer than 10% had this issue. None of us has ever been able to duplicate this consistently, so it is very difficult to say whether or not one solution or another fixed the problem. One thing that always fixed the problem was time. Downloading the latest oc client might have worked because it added time between attempts. |
I have that problem (win 10, virtualbox 5.2.20, minishift 1.27) as well - my problem is: I had that problem with 1.26. Only solution so far was to go back to 1.23 with openshift version 3.9.0. |
@nstielau I can tell what I am using:
How can I find the most recent version? Build from source? |
I spent some time debugging. It seems the root problem is around here: [+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok Can you please help? |
I am also facing the same issue with OKD 3.11.
the |
I recently came across this issue, having just started to use Minishift. Until a more stable fix is implemented upstream, I thought I'd leave the workaround I'm using in case it helps anyone in the meantime. Once the Minishift VM is available (after "Starting Minishift VM ...." completes) but before "Starting OpenShift cluster ...", execute the following one-liner: -
This removes apiserver.yaml if it already exists, waits for it to be recreated, then adds the initialDelaySeconds configuration so the timeout issue isn't hit. I'm using this on Windows 7/VirtualBox, but there is no reason it shouldn't work on any affected platform. |
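The one-liner itself did not survive in this thread, so the following is only a hedged Python sketch of its last step: adding initialDelaySeconds under livenessProbe in the text of a manifest. The function name add_initial_delay is hypothetical, the manifest layout is assumed to be plain block-style YAML, and the 90 second value is the one suggested earlier in the thread. Removing the file and waiting for it to be recreated are not covered.

```python
import re


def add_initial_delay(manifest_text, seconds=90):
    """Insert `initialDelaySeconds` under each `livenessProbe:` key that lacks one.

    Works on plain text so it needs no YAML library. Assumes block-style YAML
    where `livenessProbe:` sits on its own line (an assumption about the file).
    """
    lines = manifest_text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        match = re.match(r"^(\s*)livenessProbe:\s*$", line)
        if not match:
            continue
        indent = match.group(1)
        # Children are the following lines indented deeper than `livenessProbe:`.
        children = []
        for nxt in lines[i + 1:]:
            if nxt.strip() and len(nxt) - len(nxt.lstrip()) <= len(indent):
                break
            children.append(nxt)
        # Leave probes alone if they already carry the setting.
        if any(c.strip().startswith("initialDelaySeconds:") for c in children):
            continue
        # Reuse the indentation of the first child, or default to two spaces.
        child_indent = indent + "  "
        for c in children:
            if c.strip():
                child_indent = c[: len(c) - len(c.lstrip())]
                break
        out.append(f"{child_indent}initialDelaySeconds: {seconds}")
    trailing = "\n" if manifest_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

The function is idempotent, so running it again on an already patched manifest changes nothing. That matters here because the file may be regenerated and patched more than once.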
This problem manifests a little differently in 3.11: even with a 15 minute timeout (which BTW is hardcoded to 5 minutes) it still fails.
journalctl logs show that it appears to be an access problem to the API server:
@knobunc could you give me a hand diagnosing this problem? |
After some testing I found that iptables rules can interfere with the [1] https://github.com/imcsk8/origin-tools/blob/master/run-oc-cluster-up.sh |
@imcsk8 Thank you for investigating it. Unfortunately I use cluster-up via Minishift, and the issue sometimes happens and sometimes does not, even though the OS image on which it starts is the same every time. So I am not sure whether the problem really lies in |
iptables rules on Minishift/CDK images:
|
Just to mention: this issue is now a blocker for CDK 3.8.0 on Windows 10 (https://issues.jboss.org/browse/CDK-389). The suggested fix through iptables does not work. |
I had done everything according to the described procedures, including setting up the firewall zone as described here: https://github.com/openshift/origin/blob/release-3.11/docs/cluster_up_down.md. I was still getting this API server error: Get https://XXX.XXX.XXX.XXX:8443/healthz?timeout=32s: dial tcp XXX.XXX.XXX.XXX:8443: getsockopt: connection refused () |
I found that this was because there was no connectivity to the internet, although I do not know why it would fail, or give such an error while failing. It probably tries to pull something from Docker Hub and fails. |
Same issue in a full OKD installation, 3.11.156-1. In my case, the etcd members can't connect to each other due to connection refused, but I ran the same playbook several times and all was good. |
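Since several reports in this thread hinge on whether port 8443 is actually refusing connections or is just slow to answer, here is a small Python sketch that separates those cases at the TCP level. The function name classify_connection is hypothetical, and it only checks that something is listening. It does not call /healthz or verify TLS.

```python
import socket


def classify_connection(host, port, timeout=3.0):
    """Try a TCP connect and report what happened.

    Returns "open" if something accepted the connection, "refused" if the
    host answered but nothing is listening on the port, "timeout" if no
    answer arrived in time, or "error: ..." for any other socket error.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For example, classify_connection("192.168.42.18", 8443) could be run against the address from the error message. A "refused" result would suggest the API server process is not listening yet, rather than a firewall silently dropping packets, which more typically shows up as "timeout".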
I am facing a problem with oc cluster up when I use it in nested virtualization environments, for example: a RHEL7 VM in which I run a CentOS VM on which I deploy the cluster. Deployment sometimes goes well, however in 90% of cases it fails with getsockopt: connection refused (). It is also reproducible with v3.9.0, however there the error looks a little bit different.
Version
v3.11.0
v3.10.0
v3.9.0
Steps To Reproduce
Current Result
v3.10.0:
v3.9:
Expected Result
Cluster should be up and running.
Additional Information
Minishift issue: minishift/minishift#2675
[try to run $ oc adm diagnostics (or oadm diagnostics) command if possible]
[if you are reporting issue related to builds, provide build logs with BUILD_LOGLEVEL=5]
[consider attaching output of the $ oc get all -o json -n <namespace> command to the issue]
[visit https://docs.openshift.org/latest/welcome/index.html]