Docker failed to start on 1500/2000 machines on flannel cluster #26062
Comments
Failed Nodes have the following log every 30 seconds:
I guess some of the recent changes made the flannel daemon unhappy.
I tested it with a 1000-Node cluster - there, too, 441 Nodes are able to come up correctly (the same number as in the 2k-Node cluster). It seems that it's a configuration issue.
cc @bprashanth
Logs from the flannel daemon on unhealthy Kubelets:
The nodes won't come up till flannel on the master is up; this sounds like #25969. This will prevent a node from entering Ready, because it doesn't have a subnet.
It's complaining that it can't find the network.json. This network config is https://github.com/kubernetes/kubernetes/blob/master/cluster/saltbase/salt/flannel-server/network.json - suggest debugging along those lines; the logs should show https://github.com/kubernetes/contrib/blob/master/flannel-server/main.go#L74
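A quick way to check whether that config ever made it into etcd is to query the flannel etcd's v2 keys API directly; a minimal sketch, assuming the default client port and the /kubernetes.io prefix that shows up in the errors in this thread:

    # Sketch only - the port (4001) and key prefix are assumptions; point this at
    # whatever etcd the flannel server is configured to use.
    # Flannel's etcd v2 layout keeps the network config at <prefix>/config.
    curl -s "http://127.0.0.1:4001/v2/keys/kubernetes.io?recursive=true"
    curl -s "http://127.0.0.1:4001/v2/keys/kubernetes.io/network/config"
    # An empty listing or a "Key not found" response means network.json was never
    # written into etcd by the flannel-server-helper, or was lost on a restart.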
Flannel on the master of the 1000-Node cluster is up, so this is not #25969.
Hmm, flannel master is up? Then what's with: E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]
I wish I knew ;)
This seems like a message from etcd. Is some data missing in etcd? @xiang90 - does this error look to you like a response from etcd?
No, it's the missing network config: we supply network.json as a file through salt.
The flannel master setup consists of:
If that last bit is not running, the nodes won't get a subnet.
I was starting a 1000-node cluster and got an error loop:
The flannel server is running:
(A restart happened for no apparent reason 5 minutes after it started - but it's been stable since then.) I don't think it's the lack of the file altogether, as >400 Nodes are working perfectly fine with logs like:
Non-working Nodes have logs:
The fun part is that I can't see any entries in etcd:
One more data point - logs from flannel-server-helper:
That seems to be consistent with the config file that is on the machine. etcd, though, seems to be empty (as @gmarek wrote above).
OK - I think we understand the reason. As @gmarek observed, etcd was restarted after some time, and after that we lost all the data from it. The problem is that the config for etcd is broken (we are NOT persisting data) - there are problems with volumes. I will send out a fix for it soon.
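One way to double-check that kind of misconfiguration is to see whether the etcd data directory is actually backed by a host volume; a rough sketch, assuming the flannel etcd runs as a docker container on the master:

    # Find the flannel etcd container and inspect its mounts.
    docker ps | grep etcd
    docker inspect --format '{{json .Mounts}}' <etcd-container-id>
    # If the directory etcd was started with (--data-dir) does not show up in the
    # mount list, every container restart comes back with an empty keyspace -
    # exactly the "disappearing data" symptom described above.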
We still don't know why etcd restarted after 5 minutes, but at least we know how to fix this problem.
OK - it seems that my PR fixes the problem of "disappearing data". We still have the problem of crash-looping etcd.
Is flannel using a different etcd, or the same etcd as the apiserver? etcd crashlooping probably means it's OOMing.
A different etcd (#26062 (comment)); @gmarek said there was no sign of OOM.
@gmarek Do you have any logs of the crash loop that I can look at? Also, have you tried to monitor the memory usage of etcd?
All the etcd logs look like the ones I mentioned in #26062 (comment). I didn't look into memory usage, but I inspected the kernel messages file, and there were no OOMs.
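For reference, the kind of check described above looks roughly like this (a sketch; log locations vary by distro/image):

    # Look for OOM-killer activity around the time of the etcd restarts.
    dmesg | grep -i -E 'oom|killed process'
    grep -i -E 'oom|killed process' /var/log/kern.log 2>/dev/null
    # And a rough view of current etcd memory usage:
    ps -o pid,rss,vsz,cmd -C etcd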
@gmarek #26062 (comment) is not logs from etcd, I think. They are from flannel or somewhere else.
I'm reopening this, because we still haven't fixed it (we only fixed the disappearing data from the flannel etcd). @xiang90 - yeah, those logs aren't from etcd. The logs from etcd don't give us anything (there is nothing suspicious there). I don't understand what is happening. @lavalamp - there are no signs of OOMs, no events, nothing in kernel logs, etc. I'm not sure what we should look at now.
I have absolutely no understanding of the flannel stack, and if this isn't a problem with the apiserver's etcd, then I don't think I'll be of much use. @bprashanth seems to know something about this?
It sounded like there was an etcd restart that we can't explain; debugging that should be just like any other pod restart.
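A sketch of that kind of debugging on the master, assuming the flannel etcd runs as a docker container under the kubelet:

    # The exited container usually sticks around; its last lines often explain the restart.
    docker ps -a | grep etcd
    docker logs --tail 100 <exited-etcd-container-id>
    # Any recorded restart reasons / events:
    kubectl get events --all-namespaces | grep -i etcd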
I'm kicking this out of CSI/API machinery; I don't understand why it was labeled that way.
@davidopp - is explaining this a blocker for the 2k-Node effort?
@gmarek it seems worth investigating...
It's worth investigating, but do we want to block 1.3 on this?
No.
[FLAKE-PING] @bprashanth This flaky-test issue would love to have more attention...
I don't think this is worth investigating for 1.4 either, since networking setup is going to be subsumed by the CNI plugin + networking DaemonSet work from cluster lifecycle. Closing.
The reason is a crashloop with the error:

If I'm not mistaken, cbr0 should have been created by the Kubelet.

@kubernetes/sig-node @dchen1107
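A quick check on one of the failed machines (a sketch; the bridge name comes from the error above):

    # Does the cbr0 bridge exist, and does it have an address from the node's subnet?
    ip link show cbr0
    ip addr show cbr0
    # Kubelet only sets up cbr0 once it knows the node's subnet, so a missing
    # bridge here usually points back at the flannel/subnet assignment problem.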