Closed
Description
I observed this on AWS, but I see no reason why it would not also happen on GCE. If GCE has a workaround or fix, I would like to use the same one on AWS.
Here's what happened: I was upgrading from 1.1 to 1.2, which requires launching new nodes. One of my new nodes did not come up. I think (but am not sure) that it had been assigned the same IP address as one of the old nodes. That node could not ping the master. The master's ARP table held an incorrect entry for the node in question: the MAC address did not match the one shown by ifconfig on the node. I flushed the node's ARP entry on the master; ping resumed immediately, and the node came up soon thereafter.
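For anyone who wants to check for the same condition, here is a minimal sketch of the comparison I did by hand. It assumes the master has iproute2 and that the table is read with `ip neigh show`; the function names, addresses, and MACs below are made up for illustration and are not from my cluster. It only builds the delete command and does not run it.

```python
# Sample text in the format printed by `ip neigh show` (values are invented).
SAMPLE = """\
10.0.0.5 dev eth0 lladdr 0a:bb:bb:bb:bb:02 STALE
10.0.0.6 dev eth0 lladdr 0a:cc:cc:cc:cc:03 REACHABLE
10.0.0.7 dev eth0 FAILED
"""


def parse_neigh(output):
    """Map each IP to (device, MAC) from `ip neigh show` output."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Entries in FAILED or INCOMPLETE state have no lladdr field; skip them.
        if "dev" not in fields or "lladdr" not in fields:
            continue
        ip = fields[0]
        dev = fields[fields.index("dev") + 1]
        mac = fields[fields.index("lladdr") + 1].lower()
        table[ip] = (dev, mac)
    return table


def stale_entry_fix(output, node_ip, node_mac):
    """Return the command that deletes a mismatched ARP entry, else None.

    node_mac is the MAC the node itself reports (for example via ifconfig).
    """
    entry = parse_neigh(output).get(node_ip)
    if entry is None or entry[1] == node_mac.lower():
        return None  # no cached entry, or the cached MAC is already correct
    dev, _cached_mac = entry
    return ["ip", "neigh", "del", node_ip, "dev", dev]


# The node reports a different MAC than the master has cached for 10.0.0.5.
print(stale_entry_fix(SAMPLE, "10.0.0.5", "0A:AA:AA:AA:AA:01"))
```

On a real master, the output of `ip neigh show` would be passed in place of `SAMPLE`, and the returned command would be run as root. Deleting only the mismatched entry keeps the rest of the table intact, which seemed safer than flushing everything.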
I believe this is the problem, and it might only affect newer kernels: https://news.ycombinator.com/item?id=8732151