
Multiple versions of addons running after upgrade. #37641

Closed
krousey opened this issue Nov 29, 2016 · 14 comments


krousey commented Nov 29, 2016

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): None


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

$ kubectl version                                                                          
Client Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.0-beta.2.2+f64c9f2d999ceb", GitCommit:"f64c9f2d999ceb157d5672e9bba6639a4c456f6e", GitTreeState:"clean", BuildDate:"2016-11-29T15:21:56Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.0-beta.2.2+f64c9f2d999ceb", GitCommit:"f64c9f2d999ceb157d5672e9bba6639a4c456f6e", GitTreeState:"clean", BuildDate:"2016-11-29T15:13:51Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GKE

What happened: Upgrades to version 1.5 (from any previous version) change existing addons from ReplicationControllers to Deployments without deleting the old ReplicationControllers. This leads to multiple versions of the addons running at the same time. There also seem to be multiple heapster Deployments.

$ kubectl get rc --namespace=kube-system                                                   
NAME                          DESIRED   CURRENT   READY     AGE
kube-dns-v17.1                2         2         2         1h
kubernetes-dashboard-v1.1.1   1         1         1         1h
l7-default-backend-v1.0       1         1         1         1h

$ kubectl get deployment --namespace=kube-system                                           
NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
heapster-v1.1.0        1         1         1            1           1h
heapster-v1.2.0        1         1         1            1           1h
kube-dns               1         1         1            1           1h
kubernetes-dashboard   1         1         1            1           1h
l7-default-backend     1         1         1            1           1h

$ kubectl get pods --namespace=kube-system                                                 
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-91ebbcc7-f3wt   1/1       Unknown   0          1h
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-91ebbcc7-mgst   1/1       Running   0          1h
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-91ebbcc7-zhm0   1/1       Running   0          1h
heapster-v1.1.0-2096339923-39key                                   2/2       Running   0          1h
heapster-v1.2.0-2168613315-1bcy3                                   2/2       Running   0          1h
kube-dns-4101612645-78hx6                                          4/4       Running   0          1h
kube-dns-v17.1-3pyz0                                               3/3       Running   0          1h
kube-dns-v17.1-zaskz                                               3/3       Running   0          1h
kube-proxy-gke-jenkins-e2e-default-pool-91ebbcc7-f3wt              1/1       Unknown   0          1h
kube-proxy-gke-jenkins-e2e-default-pool-91ebbcc7-mgst              1/1       Running   0          1h
kube-proxy-gke-jenkins-e2e-default-pool-91ebbcc7-zhm0              1/1       Running   0          1h
kubernetes-dashboard-3697774758-n808h                              1/1       Running   0          1h
kubernetes-dashboard-v1.1.1-ljtj5                                  1/1       Running   0          1h
l7-default-backend-2234341178-vo5z1                                1/1       Running   0          1h
l7-default-backend-v1.0-qe9yo                                      1/1       Running   0          1h

I found this as a counting error in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke-container_vm-1.3-container_vm-1.5-upgrade-cluster/337. At first I thought the test was just counting incorrectly, and I attempted to fix that in #36924. That fix is still valid and an improvement, but the underlying problem of multiple versions of addons running at the same time is the real concern.

We need a mechanism to delete the old ReplicationControllers/Deployments after an upgrade.

@krousey krousey added area/upgrade priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-blocker labels Nov 29, 2016
@krousey krousey added this to the v1.5 milestone Nov 29, 2016

krousey commented Nov 29, 2016

We could roll back #36008 to avoid the RC -> Deployment issue. To solve the two heapster Deployments, we would have to either keep the version=v1.1.0 label in the v1.2.0 Deployment or find a label combination that doesn't cause duplicate Deployments to be created.

We could address this in GKE with a post-upgrade cleanup script, and document manual correction steps in the release notes.
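
For illustration only, a manual cleanup on an already-upgraded cluster would look roughly like this, using the resource names from the output above (a sketch of the idea, not a vetted procedure):

# Once the Deployment-based replacements are confirmed healthy, remove the
# superseded RC-based addons and the old heapster Deployment.
$ kubectl delete rc kube-dns-v17.1 kubernetes-dashboard-v1.1.1 l7-default-backend-v1.0 --namespace=kube-system
$ kubectl delete deployment heapster-v1.1.0 --namespace=kube-system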

I marked this as P0 because my counting fix won't work in 1.3, the tests are still failing, and there's a bigger issue at play. I would be OK with a short-term workaround to avoid this in 1.5 and a proper fix in 1.6.


krousey commented Nov 29, 2016

cc @saad-ali


krousey commented Nov 29, 2016

cc @roberthbailey

@krousey krousey added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. team/cluster labels Nov 29, 2016
@saad-ali

CC @MrHohn @bowei


MrHohn commented Nov 29, 2016

We did add a mechanism to the Add-on Manager to delete the old ReplicationControllers/Deployments after an upgrade.

This https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/addon-manager/kube-addons.sh#L191-L197 is for pruning the old ReplicationControllers. For the old heapster Deployment, since the old and new Deployments have different names, kubectl apply --prune should be able to prune the old one as well.
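
For illustration, the pruning mechanism is ordinary kubectl apply --prune: the addon manifests are applied as a set, and labeled objects that no longer appear in that set are deleted. The directory path and label selector below are assumptions for the sketch, not the exact arguments the script passes:

# Apply all addon manifests and prune labeled objects that are no longer declared.
$ kubectl apply -f /etc/kubernetes/addons \
    --prune -l kubernetes.io/cluster-service=true \
    --namespace=kube-system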

Is there any way to retrieve the Add-on Manager's log from the GKE master?


MrHohn commented Nov 29, 2016

cc @mikedanese


MrHohn commented Nov 29, 2016

The old-resource pruning does have a one-minute delay, which is there to support zero downtime for kube-dns.

But that does not seem to be the case here.


MrHohn commented Nov 29, 2016

Sorry, one mistake above: if the name of the heapster Deployment changed, the current Addon Manager will not prune the old one. This could be fixed by adding one more resource type in the same place (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/addon-manager/kube-addons.sh#L191-L197).
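
For illustration, the quick fix amounts to whitelisting the Deployment type for pruning alongside ReplicationControllers, roughly along these lines (the flag values are assumptions based on the 1.5 API groups, not the actual kube-addons.sh change):

# Explicitly whitelist both resource kinds so a renamed Deployment
# (heapster-v1.1.0 -> heapster-v1.2.0) is pruned together with old RCs.
$ kubectl apply -f /etc/kubernetes/addons \
    --prune -l kubernetes.io/cluster-service=true \
    --prune-whitelist=core/v1/ReplicationController \
    --prune-whitelist=extensions/v1beta1/Deployment \
    --namespace=kube-system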

I'm taking a look at why the old RCs were not pruned.

@mikedanese mikedanese assigned mikedanese and MrHohn and unassigned mikedanese Nov 29, 2016
@mikedanese

@MrHohn yup, that's definitely it. Ping me on a PR and I can give you a quick review.

@mikedanese

We also need to merge #37139 to get the --prune-whitelist in.


bowei commented Nov 29, 2016

I referenced the wrong issue; ignore the above ^^^^^^


MrHohn commented Nov 29, 2016

Yeah, but I think #37139 may not fix this issue since Addon Manager v6.0-alpha should be able to prune the old RCs in theory.

I'm working on a repro on my own cluster (upgrading from 1.3 -> 1.5). I'm also checking the GCE 1.4 -> 1.5 upgrade tests here, but the Addon Manager's log there looks normal.

@mikedanese

> Yeah, but I think #37139 may not fix this issue since Addon Manager v6.0-alpha should be able to prune the old RCs in theory.

What's currently deployed doesn't have the prune whitelist, and there are no RCs in the addons folder anymore, so RCs aren't considered for pruning. I think we need both?


MrHohn commented Nov 29, 2016

> there are no RCs in the addons folder anymore

You are right. I used to think there was still one ReplicationController in the addons folder --- elasticsearch-logging-v1 --- but it turns out that one is not enabled on GKE.

If this is the case, #37139 combined with the quick fix for Deployments should do the job.

Will send that PR very soon.

k8s-github-robot pushed a commit that referenced this issue Nov 30, 2016
Automatic merge from submit-queue

Fixes Addon Manager's pruning issue for old Deployments

Fixes #37641.

Attaches the `last-applied` annotations to the existing Deployments for pruning.

Below images are built and pushed:
- gcr.io/google-containers/kube-addon-manager:v6.1
- gcr.io/google-containers/kube-addon-manager-amd64:v6.1
- gcr.io/google-containers/kube-addon-manager-arm:v6.1
- gcr.io/google-containers/kube-addon-manager-arm64:v6.1
- gcr.io/google-containers/kube-addon-manager-ppc64le:v6.1

@mikedanese 

cc @saad-ali @krousey
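
For context on the fix above: kubectl's pruning skips objects that lack the kubectl.kubernetes.io/last-applied-configuration annotation, which is why the annotation has to be attached to the pre-existing Deployments. An illustrative way to check whether an addon Deployment already carries it (not part of the PR):

# Print the Deployment and look for the annotation that pruning relies on.
$ kubectl get deployment heapster-v1.1.0 --namespace=kube-system -o yaml | grep last-applied-configuration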