Running etcd 2.0.5 in a pod on master #4442
Conversation
command: [
    "/usr/local/bin/etcd",
    "--addr",
    "{{etcd_server}}:4001",
I continue to assert that this is brain damage, but it's fine for now. When we go to multi-master we should re-evaluate.
Can we make sure the existing configuration stored in etcd and backed up by the PD won't be destroyed? Otherwise, we have to send out an announcement about this. I did a quick test with this change, but found that all previously scheduled pods are gone. From docker inspect, I did see the bind mount for that volume: /mnt/master-pd/var/etcd:/var/etcd.
From the kubelet logs, I found the kubelet kills it as an unwanted container.
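(For orientation, here is a minimal sketch of what a static etcd pod manifest with that PD-backed volume could look like. It is written against the generic pod-spec shape for readability rather than copied from this PR, and the image reference is a hypothetical placeholder; only the /mnt/master-pd/var/etcd to /var/etcd binding comes from the discussion above.)
apiVersion: v1
kind: Pod
metadata:
  name: etcd-server
spec:
  containers:
  - name: etcd-container
    image: example.com/coreos/etcd:v2.0.5    # hypothetical image reference
    command:
    - /usr/local/bin/etcd
    - --addr
    - "{{etcd_server}}:4001"
    volumeMounts:
    - name: varetcd
      mountPath: /var/etcd                   # matches the binding seen in docker inspect
  volumes:
  - name: varetcd
    hostPath:
      path: /mnt/master-pd/var/etcd          # lives on the master PD, so etcd data should survive pod restarts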
Can we upgrade to 2.0.3 while we're at it?
Another issue I forgot to mention is that you have to kill the previous etcd process, otherwise the etcd docker container won't run either, due to port conflicts. I couldn't see any place where you kill the previous process. For the etcd data loss issue above, we suspect the etcd PD got formatted somewhere; we need to figure out where. @brendandburns Actually, we built the etcd image from 2.0.3+ with some of the latest fixes from development: etcd-io/etcd#2253
@ArtfulCoder Here is the script where we mount the master PD: kubernetes/cluster/gce/templates/format-and-mount-pd.sh
Can we verify this on a vagrant provider before merging, to make sure nothing crazy happens there and that the cluster starts as expected?
@zmerlynn: Do we need to worry about the upgrade path yet? Do we do live pushes of new Kubernetes versions for GKE customers who are using Kubernetes?
The CoreOS folks just built a new docker image based on v2.0.3 for us yesterday. Please update your manifest file to use the latest image they built:
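(The image reference itself is not quoted in this thread; as a sketch, picking up the new build is just a matter of bumping the image field on the etcd container entry in the manifest, with a hypothetical name shown here:)
containers:
- name: etcd-container
  image: example.com/coreos/etcd:v2.0.3    # hypothetical; substitute whatever image the CoreOS folks published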
Force-pushed from ba72961 to 7c80207
@dchen1107 The updated PR incorporates the following changes based on the feedback.
Known issues with this PR:
I have some time tomorrow to test this on vagrant.
@ArtfulCoder I patched my local client and tried to figure out why the data was lost when converting etcd to a docker container. Here are a couple of issues:
Now etcd is running as a container.
For all the steps I performed so far, I verified that the data volume backing etcd is used and the data is not lost. I suspect the data loss we saw earlier might be due to a version mismatch, with the etcd server discarding those records? We need @Xiang99 to confirm this.
I was able to fix the upgrade path issue. At this point, the only change that needs to be verified is that my PR works for vagrant. The commit where the data dir change was made is here:
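(The linked commit is not quoted here, but the data dir change presumably amounts to pointing etcd at a directory under the PD-backed volume mount instead of its default working directory. A hedged sketch of the command args, with the exact path being an assumption:)
command:
- /usr/local/bin/etcd
- --addr
- "{{etcd_server}}:4001"
- --data-dir
- /var/etcd/data    # assumed path; must sit under the /var/etcd volumeMount so it lands on the master PD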
Haven't read the PR yet. Have you fixed the issue of killing the previous etcd process and the monit process yet?
@ArtfulCoder @dchen1107 The etcd data dir cannot contain "garbage". etcd does check that and will return an error if so.
I have to update the push script to kill the etcd running on the host. I will update the PR soon.
With flannel depending on etcd for IP address reservation, how is the etcd pod IP assigned?
The master is not participating in the overlay at this point. Etcd is
With or without flannel, we are going to have a chicken-and-egg issue if we try to get rid of host port mapping for "static" pods like the etcd server here. This PR just puts etcd into a pod, like what we did for cAdvisor, and a separate issue #4128 is filed to handle master node configuration, including networking (cbr0 vs flannel). Again, I don't think the flannel issue or the chicken-and-egg issue is a blocker for this one. Or am I missing something here?
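(On the host port mapping point: as a sketch, a "static" pod like the etcd server stays reachable without any overlay either by running on the host network or by declaring an explicit hostPort for its container port. Field names follow the generic pod-spec shape and are illustrative only:)
containers:
- name: etcd-container
  ports:
  - containerPort: 4001
    hostPort: 4001    # publishes etcd on the master's own address, independent of any overlay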
To recap IRL discussions, assume we run etcd in a pod. If the master doesn't run flannel: can it see and reach pods that are on the overlay? If the master does run flannel: well, where does flannel get its etcd from?
cc/ @kelseyhightower
One proposal for solving this chicken-and-egg issue is introducing static configuration to flannel, so that when flannel starts up, it won't query etcd for the configuration; once the etcd server is up as a docker container, flannel will sync its "static" subnets with etcd. According to @kelseyhightower, this is a very small patch to flannel, and we can verify it this week.
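(For reference, flannel's network configuration normally lives in etcd under the /coreos.com/network/config key as a small JSON document. A static configuration would presumably carry the same information locally until the etcd container is up; the exact mechanism is whatever flannel-io/flannel#131 introduces, and the values below are illustrative only:)
{
  "Network": "10.244.0.0/16",
  "SubnetLen": 24,
  "Backend": { "Type": "udp" }
}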
@errordeveloper what about #4483? Is weave also affected?
@dchen1107 @thockin I've put together a PR that would add static configuration and subnet leases to flannel: flannel-io/flannel#131. I've added some initial docs as well:
Force-pushed from 2a71d5e to aec4eff
Is this PR active? If not, please close it according to https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/devel/pull-requests.md
This is active. We have left it unmerged because there was a discussion about a breakage day, and we planned to merge this along with the other breaking PRs. We just heard from @thockin yesterday that the decision is no breakage day, so we are going to move on with backward compatibility support.
Force-pushed from aec4eff to 9c466f0
Patched the latest commit and did an upgrade test:
I manually changed the kubelet to allow privileged containers and changed the manifest with privileged=true, but still ran into the same issue. In this case, the previously existing pods are obviously missing; kubectl get pods only shows the static pods. @ArtfulCoder is looking into it.
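(For what it's worth, a sketch of the manifest side of that experiment, assuming the privileged flag of that era sat directly on the container entry; exact field placement may differ by API version, and the kubelet must also be started with privileged containers allowed:)
containers:
- name: etcd-container
  privileged: true    # only honored if the kubelet is started with privileged containers allowed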
Force-pushed from 6eb7b8f to d03ff57
Force-pushed from d03ff57 to 7750f35
With the latest patch, I verified it and it works fine. LGTM
Running etcd 2.0.5 in a pod on master
I believe this change broke head. (e2e-gce 3921 onward stopped being able to turn up clusters)
BTW, sorry for leaving so little detail before the revert, but I didn't do too much prodding. Clusters were getting here and hanging:
I haven't tried privately applying this PR to see what the issue was, but after reverting, we're back to green.
Dawn and I had manually applied the PR on GCE and it seemed fine.
You'll be happy to know that my first attempt to re-apply this PR came back with a successful cluster. :P
(But I saw it spin at least once privately as well, so Jenkins isn't crazy.)
@ArtfulCoder: I can't explain this, but I even went all the way through
The bug might be in how I derive the address etcd will use and the address that spider thinks etcd is on. I should have etcd on localhost and then have the apiserver looking at localhost for etcd.
Spider = apiserver :)
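(In other words, a sketch of the intended wiring, assuming the apiserver's --etcd_servers flag of that era. With both processes on the master, pointing everything at loopback sidesteps the address-derivation problem:)
# etcd container listens on loopback
command: ["/usr/local/bin/etcd", "--addr", "127.0.0.1:4001"]
# apiserver container points at the same loopback address
command: ["/usr/local/bin/kube-apiserver", "--etcd_servers=http://127.0.0.1:4001"]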
@zmerlynn @ArtfulCoder Let's re-introduce this PR on Monday with another round of full testing. Have a good weekend.
Sounds good.
#6200 should fix the transient errors that we were seeing.
Running etcd 2.0.1 in a pod on master