
Docker failed to start on 1500/2000 machines on flannel cluster #26062

Closed
gmarek opened this issue May 23, 2016 · 34 comments

gmarek commented May 23, 2016

The reason is a crash loop with the following error:

time="2016-05-23T10:44:11.175272386Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually"

If I'm not mistaken, cbr0 should have been created by the Kubelet.
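
(For reference only: a rough manual equivalent of what the Kubelet is supposed to do when it is configured to manage cbr0 - the 10.244.1.1/24 address below is a made-up placeholder, not this cluster's Pod CIDR:)

# Illustrative sketch only - normally the Kubelet creates and addresses the bridge itself
$ sudo ip link add name cbr0 type bridge
$ sudo ip addr add 10.244.1.1/24 dev cbr0    # placeholder for the node's Pod CIDR
$ sudo ip link set cbr0 up
# Docker must then be started with --bridge=cbr0 so it uses this bridge instead of docker0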

@kubernetes/sig-node @dchen1107

@gmarek gmarek added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels May 23, 2016
@gmarek gmarek changed the title Docker failed to start on 1500/2000 machines Docker failed to start on 1500/2000 machines on flannel cluster May 23, 2016

gmarek commented May 23, 2016

Failed Nodes have the following log every 30 seconds:

I0523 11:38:52.826812    3341 kubelet.go:2921] Flannel server handshake failed Waiting for subnet file /var/run/flannel/subnet.env

I guess some of the recent changes made the flannel daemon unhappy.
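
(For context: /var/run/flannel/subnet.env is written by flannel once it obtains a subnet lease; on a healthy node it contains roughly the following - the values here are placeholders:)

$ cat /var/run/flannel/subnet.env
FLANNEL_NETWORK=172.16.0.0/12
FLANNEL_SUBNET=172.16.42.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false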


gmarek commented May 23, 2016

I tested it with a 1000-Node cluster - again, 441 Nodes are able to come up correctly (the same number as in the case of the 2k-Node cluster). It seems that it's a configuration issue.


gmarek commented May 23, 2016

cc @bprashanth


gmarek commented May 23, 2016

Logs from flannel daemon on unhealthy Kubelets:

E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]

@bprashanth

The nodes won't come up until flannel on the master is up; this sounds like #25969.

This will prevent a node from becoming Ready, because it doesn't have a subnet:

Flannel server handshake failed Waiting for subnet file /var/run/flannel/subnet.env

It's complaining that it can't find the network.json:

E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]

This network config is: https://github.com/kubernetes/kubernetes/blob/master/cluster/saltbase/salt/flannel-server/network.json
It should be inserted by: https://github.com/kubernetes/contrib/tree/master/flannel-server#usage

I suggest debugging along those lines; the logs should show https://github.com/kubernetes/contrib/blob/master/flannel-server/main.go#L74
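
(A quick hedged check along those lines - assuming the flannel etcd serves clients on port 4003, as it does elsewhere in this issue - is to ask etcd for the key directly:)

$ curl -L http://127.0.0.1:4003/v2/keys/kubernetes.io/network/config
# A healthy master should return the network.json contents here;
# "Key not found" matches the error the node-side daemons are reporting.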


gmarek commented May 23, 2016

Flannel on the master of the 1000-Node cluster is up, so this is not #25969.

@bprashanth

Hmm, the flannel master is up? Then what's with:

E0523 13:58:11.611454 03203 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]


gmarek commented May 23, 2016

I wish I knew;)

@wojtek-t

This seems like a message from etcd. Is some data missing in etcd?

@xiang90 - does this error look to you like a response from etcd?


bprashanth commented May 23, 2016

No, it's the missing network config from #26062 (comment):

  • We supply network.json as a file through Salt.
  • A Go helper reads this file and writes it to etcd.
  • This is the network config for the cluster; apparently it isn't written, so I suspect something somewhere fatalf'd?


bprashanth commented May 23, 2016

The flannel master setup consists of:

  • flannel etcd (also serves liveness on /health)
  • flannel helper that writes network.json to etcd
  • actual flannel running in server mode, using flannel-etcd, with the network config from /kubernetes.io/network, and listening for incoming connections on 10253

If that last bit is not running, the nodes won't get a subnet.
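
(A minimal sketch of how each piece could be checked, assuming the flannel etcd serves clients on 4003 and the server listens on 10253:)

# 1. flannel etcd liveness
$ curl http://127.0.0.1:4003/health
# 2. network config written by the helper
$ curl -L http://127.0.0.1:4003/v2/keys/kubernetes.io/network/config
# 3. flannel in server mode listening for nodes
$ ss -lntp | grep 10253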


gmarek commented May 24, 2016

I was starting a 1000-Node cluster and got an error loop:

Waiting for 1001 ready nodes. 403 ready nodes, 1001 registered. Retrying

The flannel server is running:

flannel-server-e2e-test-gmarek-master               3/3       Running            2          14m

(A restart happened for no apparent reason 5 minutes after it started - but it has been stable since then.)

I don't think the file is missing altogether, as >400 Nodes are working perfectly fine with logs like:

I0524 08:14:28.299453 03201 main.go:275] Installing signal handlers
I0524 08:14:28.303210 03201 main.go:130] Determining IP address of default interface
I0524 08:14:28.303356 03201 main.go:188] Using 10.240.0.120 as external interface
I0524 08:14:28.303372 03201 main.go:189] Using 10.240.0.120 as external endpoint
I0524 08:14:28.620456 03201 vxlan.go:153] Watching for L3 misses
I0524 08:14:28.620481 03201 vxlan.go:159] Watching for new subnet leases
I0524 08:14:28.689942 03201 vxlan.go:273] Handling initial subnet events
I0524 08:14:28.689965 03201 device.go:159] calling GetL2List() dev.link.Index: 4
I0524 08:14:28.690303 03201 device.go:164] calling NeighAdd: 10.240.0.149, 7e:7b:27:84:fa:39

Non-working Nodes have logs like:

I0524 08:19:52.633195 03211 main.go:275] Installing signal handlers
I0524 08:19:52.635254 03211 main.go:130] Determining IP address of default interface
I0524 08:19:52.635419 03211 main.go:188] Using 10.240.2.149 as external interface
I0524 08:19:52.635435 03211 main.go:189] Using 10.240.2.149 as external endpoint
E0524 08:19:52.640951 03211 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]
E0524 08:19:53.642102 03211 network.go:53] Failed to retrieve network config: 500 Internal Server Error: 100: Key not found (/kubernetes.io) [3]

The fun part is that I can't see any entries in etcd:

$ curl -L localhost:4003/v2/keys
{"action":"get","node":{"dir":true}}

@wojtek-t

One more data point - logs from the flannel-server-helper:

wojtekt@e2e-test-gmarek-master:~$ sudo docker logs fa51a1e80e88
2016/05/24 08:12:48 Created network {
    "Network": "172.16.0.0/12",
    "SubnetLen": 24,
    "Backend": {
        "Type": "vxlan",
        "VNI": 1
     }
}

@wojtek-t

This seems to be consistent with the config file that is on the machine.

Though, etcd seems to be empty (as @gmarek wrote above).

@wojtek-t

OK - I think we understand the reason.

As @gmarek observed, etcd was restarted after some time, and after that we lost all the data from it.

The problem is that the config for etcd is broken (we are NOT persisting data); there are problems with the volumes.

I will send out a fix for it soon.
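
(For illustration, the gist of such a fix is putting etcd's --data-dir on the host so it survives container restarts; this is a stand-alone sketch with placeholder paths and image tag, not the actual manifest change:)

# Sketch only - in the real setup this is a hostPath volume in the flannel etcd pod manifest
$ docker run -d \
    -v /var/etcd/flannel-data:/var/etcd/data \
    gcr.io/google_containers/etcd:2.2.1 \
    /usr/local/bin/etcd \
    --data-dir=/var/etcd/data \
    --listen-client-urls=http://0.0.0.0:4003 \
    --advertise-client-urls=http://127.0.0.1:4003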


gmarek commented May 24, 2016

We still don't know why etcd restarted after 5 minutes, but at least we know how to fix this problem.

@wojtek-t

OK - it seems that my PR fixes the problem of "disappearing data".

We still have the problem of crash-looping etcd.

@dchen1107

cc/ @xiang90 @lavalamp for crash-looping etcd.

@dchen1107 dchen1107 added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 24, 2016
@lavalamp

Is flannel using a different etcd, or the same etcd as the apiserver?

etcd crash-looping probably means it's OOMing.

@bprashanth

A different etcd (#26062 (comment)); @gmarek said there was no sign of an OOM.


xiang90 commented May 24, 2016

@gmarek Do you have any logs of the crash loop that I can look at? Also, have you tried monitoring the memory usage of etcd?


gmarek commented May 25, 2016

All the etcd logs look like the ones I mentioned in #26062 (comment).

I didn't look into memory usage, but I inspected the kernel messages file, and there were no OOMs.
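
(For completeness, the kind of check one would run to rule the OOM killer out - log file locations vary by distro:)

$ dmesg | grep -iE 'out of memory|oom-killer'
$ grep -i 'killed process' /var/log/kern.log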


xiang90 commented May 25, 2016

@gmarek I think #26062 (comment) is not logs from etcd. They are from flannel or somewhere else.

@wojtek-t

I'm reopening this, because we still haven't fixed it (we only fixed the data disappearing from the flannel etcd).

@xiang90 - yeah, those logs aren't from etcd. The logs from etcd don't give us anything (there is nothing suspicious there).

I don't understand what is happening. @lavalamp - there are no signs of OOMs, no events, nothing in the kernel logs, etc. I'm not sure what we should look at now.

@wojtek-t wojtek-t reopened this May 27, 2016
@lavalamp lavalamp assigned bprashanth and unassigned lavalamp Jun 3, 2016

lavalamp commented Jun 3, 2016

I have absolutely no understanding of the flannel stack, and if this isn't a problem with apiserver's etcd, then I don't think I'll be of much use. @bprashanth seems to know something about this?

@bprashanth

It sounded like there was an etcd restart that we can't explain; debugging that should be just like debugging any other pod restart.

@lavalamp lavalamp removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 6, 2016

lavalamp commented Jun 6, 2016

I'm kicking this out of CSI/API machinery; I don't understand why it was labeled that way.


gmarek commented Jun 7, 2016

@davidopp - is explaining this a blocker for the 2k-Node effort?

@davidopp

@gmarek it seems worth investigating...


gmarek commented Jun 13, 2016

It's worth investigating, but do we want to block 1.3 on this?

@davidopp

No.

@gmarek gmarek modified the milestones: v1.4, v1.3 Jun 13, 2016
@k8s-github-robot

[FLAKE-PING] @bprashanth

This flaky-test issue would love to have more attention...

1 similar comment

@bprashanth

I don't think this is worth investigating for 1.4 either, since networking setup is going to be subsumed by the CNI plugin + networking DaemonSet work from cluster lifecycle. Closing.
