
Docker failed to start on 1500/2000 machines on flannel cluster #26062

Closed
gmarek opened this issue May 23, 2016 · 34 comments

gmarek commented May 23, 2016

The reason is a crash loop with the following error:

time="2016-05-23T10:44:11.175272386Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually"

If I'm not mistaken, cbr0 should have been created by the Kubelet.
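
(For reference only: a rough manual equivalent of what the Kubelet is supposed to do when it is configured to manage cbr0 - the 10.244.1.1/24 address below is a made-up placeholder, not this cluster's Pod CIDR:)

# Illustrative sketch only - normally the Kubelet creates and addresses the bridge itself
$ sudo ip link add name cbr0 type bridge
$ sudo ip addr add 10.244.1.1/24 dev cbr0    # placeholder for the node's Pod CIDR
$ sudo ip link set cbr0 up
# Docker must then be started with --bridge=cbr0 so it uses this bridge instead of docker0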

@kubernetes/sig-node @dchen1107

@gmarek gmarek added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels May 23, 2016
@gmarek gmarek changed the title Docker failed to start on 1500/2000 machines Docker failed to start on 1500/2000 machines on flannel cluster May 23, 2016

gmarek commented May 23, 2016

Failed Nodes have the following log every 30 seconds:

I0523 11:38:52.826812    3341 kubelet.go:2921] Flannel server handshake failed Waiting for subnet file /var/run/flannel/subnet.env

I guess some of the recent changes made the flannel daemon unhappy.
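
(For context: /var/run/flannel/subnet.env is written by flannel once it obtains a subnet lease; on a healthy node it contains roughly the following - the values here are placeholders:)

$ cat /var/run/flannel/subnet.env
FLANNEL_NETWORK=172.16.0.0/12
FLANNEL_SUBNET=172.16.42.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false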


gmarek commented May 23, 2016

I tested it with a 1000-Node cluster - again, 441 Nodes are able to come up correctly (the same number as in the case of the 2k-Node cluster). It seems that it's a configuration issue.


gmarek commented May 23, 2016

cc @bprashanth


gmarek commented May 23, 2016

Logs from flannel daemon on unhealthy Kubelets:

E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]

@bprashanth

The nodes won't come up until flannel on the master is up; this sounds like #25969.

This will prevent a node from becoming Ready, because it doesn't have a subnet:

Flannel server handshake failed Waiting for subnet file /var/run/flannel/subnet.env

It's complaining that it can't find the network.json:

E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]

This network config is: https://github.com/kubernetes/kubernetes/blob/master/cluster/saltbase/salt/flannel-server/network.json
It should be inserted by: https://github.com/kubernetes/contrib/tree/master/flannel-server#usage

I suggest debugging along those lines; the logs should show https://github.com/kubernetes/contrib/blob/master/flannel-server/main.go#L74
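
(A quick hedged check along those lines - assuming the flannel etcd serves clients on port 4003, as it does elsewhere in this issue - is to ask etcd for the key directly:)

$ curl -L http://127.0.0.1:4003/v2/keys/kubernetes.io/network/config
# A healthy master should return the network.json contents here;
# "Key not found" matches the error the node-side daemons are reporting.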


gmarek commented May 23, 2016

Flannel on the master of the 1000-Node cluster is up, so this is not #25969.

@bprashanth

Hmm, the flannel master is up? Then what's with:

E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]


gmarek commented May 23, 2016

I wish I knew;)

@wojtek-t

This seems like a message from etcd. Is some data missing in etcd?

@xiang90 - does this error look to you like a response from etcd?


bprashanth commented May 23, 2016

No, it's the missing network config from #26062 (comment):

  • We supply network.json as a file through Salt.
  • A Go helper reads this file and writes it to etcd.
  • This is the network config for the cluster; apparently it isn't written, so I suspect something somewhere fatalf'd?


bprashanth commented May 23, 2016

The flannel master setup consists of:

  • flannel etcd (also serves liveness on /health)
  • flannel helper that writes network.json to etcd
  • actual flannel running in server mode, using flannel-etcd, with the network config from /kubernetes.io/network, and listening for incoming connections on 10253

If that last bit is not running, the nodes won't get a subnet.
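
(A minimal sketch of how each piece could be checked, assuming the flannel etcd serves clients on 4003 and the server listens on 10253:)

# 1. flannel etcd liveness
$ curl http://127.0.0.1:4003/health
# 2. network config written by the helper
$ curl -L http://127.0.0.1:4003/v2/keys/kubernetes.io/network/config
# 3. flannel in server mode listening for nodes
$ ss -lntp | grep 10253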


gmarek commented May 24, 2016

I was starting a 1000-Node cluster and got an error loop:

Waiting for 1001 ready nodes. 403 ready nodes, 1001 registered. Retrying

The flannel server is running:

flannel-server-e2e-test-gmarek-master               3/3       Running            2          14m

(A restart happened for no apparent reason 5 minutes after it started - but it has been stable since then.)

I don't think the file is missing altogether, as >400 Nodes are working perfectly fine with logs like:

I0524 08:14:28.299453 03201 main.go:275] Installing signal handlers
I0524 08:14:28.303210 03201 main.go:130] Determining IP address of default interface
I0524 08:14:28.303356 03201 main.go:188] Using 10.240.0.120 as external interface
I0524 08:14:28.303372 03201 main.go:189] Using 10.240.0.120 as external endpoint
I0524 08:14:28.620456 03201 vxlan.go:153] Watching for L3 misses
I0524 08:14:28.620481 03201 vxlan.go:159] Watching for new subnet leases
I0524 08:14:28.689942 03201 vxlan.go:273] Handling initial subnet events
I0524 08:14:28.689965 03201 device.go:159] calling GetL2List() dev.link.Index: 4
I0524 08:14:28.690303 03201 device.go:164] calling NeighAdd: 10.240.0.149, 7e:7b:27:84:fa:39

Non-working Nodes have logs like:

I0524 08:19:52.633195 03211 main.go:275] Installing signal handlers
I0524 08:19:52.635254 03211 main.go:130] Determining IP address of default interface
I0524 08:19:52.635419 03211 main.go:188] Using 10.240.2.149 as external interface
I0524 08:19:52.635435 03211 main.go:189] Using 10.240.2.149 as external endpoint
E0524 08:19:52.640951 03211 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]
E0524 08:19:53.642102 03211 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]

The fun part is that I can't see any entries in etcd:

$ curl -L localhost:4003/v2/keys
{"action":"get","node":{"dir":true}}

@wojtek-t

One more data point - logs from the flannel-server-helper:

wojtekt@e2e-test-gmarek-master:~$ sudo docker logs fa51a1e80e88
2016/05/24 08:12:48 Created network {
    "Network": "172.16.0.0/12",
    "SubnetLen": 24,
    "Backend": {
        "Type": "vxlan",
        "VNI": 1
     }
}

@wojtek-t

This seems to be consistent with the config file that is on the machine.

Though, etcd seems to be empty (as @gmarek wrote above).

@wojtek-t

OK - I think we understand the reason.

As @gmarek observed, etcd was restarted after some time, and after that we lost all the data from it.

The problem is that the config for etcd is broken (we are NOT persisting data); there are problems with the volumes.

I will send out a fix for it soon.
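
(For illustration, the gist of such a fix is putting etcd's --data-dir on the host so it survives container restarts; this is a stand-alone sketch with placeholder paths and image tag, not the actual manifest change:)

# Sketch only - in the real setup this is a hostPath volume in the flannel etcd pod manifest
$ docker run -d \
    -v /var/etcd/flannel-data:/var/etcd/data \
    gcr.io/google_containers/etcd:2.2.1 \
    /usr/local/bin/etcd \
    --data-dir=/var/etcd/data \
    --listen-client-urls=http://0.0.0.0:4003 \
    --advertise-client-urls=http://127.0.0.1:4003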


gmarek commented May 24, 2016

We still don't know why etcd restarted after 5 minutes, but at least we know how to fix this problem.

@wojtek-t

OK - it seems that my PR fixes the problem of "disappearing data".

We still have the problem of crash-looping etcd.

@dchen1107

cc/ @xiang90 @lavalamp for crash-looping etcd.

@dchen1107 dchen1107 added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 24, 2016
@lavalamp

Is flannel using a different etcd, or the same etcd as the apiserver?

etcd crash-looping probably means it's OOMing.

@bprashanth

A different etcd (#26062 (comment)); @gmarek said there was no sign of an OOM.


xiang90 commented May 24, 2016

@gmarek Do you have any logs of the crash loop that I can look at? Also, have you tried monitoring the memory usage of etcd?


gmarek commented May 25, 2016

All the etcd logs look like the ones I mentioned in #26062 (comment).

I didn't look into memory usage, but I inspected the kernel messages file, and there were no OOMs.
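
(For completeness, the kind of check one would run to rule the OOM killer out - log file locations vary by distro:)

$ dmesg | grep -iE 'out of memory|oom-killer'
$ grep -i 'killed process' /var/log/kern.log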


xiang90 commented May 25, 2016

@gmarek I think #26062 (comment) is not logs from etcd. They are from flannel or somewhere else.

@wojtek-t

I'm reopening this, because we still haven't fixed it (we only fixed the data disappearing from the flannel etcd).

@xiang90 - yeah, those logs aren't from etcd. The logs from etcd don't give us anything (there is nothing suspicious there).

I don't understand what is happening. @lavalamp - there are no signs of OOMs, no events, nothing in the kernel logs, etc. I'm not sure what we should look at now.

@wojtek-t wojtek-t reopened this May 27, 2016
@lavalamp lavalamp assigned bprashanth and unassigned lavalamp Jun 3, 2016

lavalamp commented Jun 3, 2016

I have absolutely no understanding of the flannel stack, and if this isn't a problem with apiserver's etcd, then I don't think I'll be of much use. @bprashanth seems to know something about this?

@bprashanth

It sounded like there was an etcd restart that we can't explain; debugging that should be just like debugging any other pod restart.

@lavalamp lavalamp removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 6, 2016

lavalamp commented Jun 6, 2016

I'm kicking this out of CSI/API machinery; I don't understand why it was labeled that way.


gmarek commented Jun 7, 2016

@davidopp - is explaining this a blocker for the 2k-Node effort?

@davidopp

@gmarek it seems worth investigating...


gmarek commented Jun 13, 2016

It's worth investigating, but do we want to block 1.3 on this?

@davidopp

No.

@gmarek gmarek modified the milestones: v1.4, v1.3 Jun 13, 2016
@k8s-github-robot

[FLAKE-PING] @bprashanth

This flaky-test issue would love to have more attention...

1 similar comment

@bprashanth

I don't think this is worth investigating for 1.4 either, since networking setup is going to be subsumed by the CNI plugin + networking DaemonSet work from cluster lifecycle. Closing.
