Cluster dns stops functioning after 1.6 -> 1.5 downgrade #43668
Comments
Following up, it looks like cluster DNS is generally down, not just for services; I can't wget from google.com either.
@pwittrock working to recreate now, can you post the DNS logs if they exist?
Output from a Pod trying to use cluster DNS:
Then change the `/etc/resolv.conf` entry from …
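(For context, a minimal sketch of how one might run this kind of check from inside a pod; the busybox image, pod name, and the 10.0.0.10 resolver address are assumptions, not taken from the original output.)

```sh
# Start a throwaway pod and exercise cluster DNS plus external resolution.
kubectl run dns-test -i --tty --image=busybox --restart=Never -- sh
# Inside the pod:
#   cat /etc/resolv.conf          # typically "nameserver 10.0.0.10", the kube-dns ClusterIP
#   nslookup kubernetes.default   # resolves via cluster DNS
#   wget -qO- http://google.com   # external resolution + egress
kubectl delete pod dns-test       # clean up afterwards
```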
So the issue appears to be with the service routing not actually having been updated.
Hypothesis: when you downgrade the master, the new etcd2 resource versions are lower, and kube-proxy will ignore changes until the resource version "catches up". I wonder if there's anything in the kube-proxy logs, and I wonder if a restart of kube-proxy on the node will fix it.
I tried kicking all of the kube-proxies on each of the nodes - no luck.
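(For anyone following along, a sketch of what "kicking" kube-proxy on a GCE node can look like, assuming it runs as a kubelet static pod as in the standard kube-up setup; this is not necessarily the exact procedure used above.)

```sh
# Kill the kube-proxy container; the kubelet restarts it from the static manifest
# under /etc/kubernetes/manifests/ (paths assumed for a GCI kube-up node).
sudo docker ps | grep kube-proxy
sudo docker kill <kube-proxy-container-id>
tail -n 50 /var/log/kube-proxy.log   # then check whether it reconnects and re-syncs rules
```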
That is interesting, as it possibly points to something to do with kube-proxy rather than with the API. Does the GCE node upgrade procedure build new instances, or is it a "hot push"? Another hypothesis: maybe kube-proxy no longer has permissions on the API?
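(A rough way to test that permissions hypothesis; the kubeconfig path is assumed to be the one the GCE kube-up scripts write, and MASTER_IP is a placeholder, so treat this as a sketch rather than what was actually run here.)

```sh
# Extract kube-proxy's bearer token from its kubeconfig and make an authenticated API call.
MASTER_IP=<your-master-ip>
TOKEN=$(sudo grep 'token:' /var/lib/kube-proxy/kubeconfig | awk '{print $2}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${MASTER_IP}/api/v1/namespaces/default/endpoints" | head
# A 401/403 response would point at credentials/permissions rather than kube-proxy itself.
```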
Have been trying to reproduce this - ran into similar DNS issues with the 1.5.5 cluster even before upgrade on a few different attempts. Going to start from scratch and try again.
Recreated the issue on 2 clusters. Seemed to work for a few moments on both before stopping. I also was having serious DNS problems on the 1.5 cluster, but they seemed to be fixed by increasing the replicas on the DNS autoscaler deployment.

Cluster 1

Cluster 2
Looks like the permissions theory is the winner:
/var/log/kube-apiserver.log
Looks like the API server is still trying to use beta RBAC for some reason.
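(For anyone trying to reproduce, a sketch of pulling that log off a GCE master; the instance name and zone are placeholders.)

```sh
# Grep the apiserver log on the master for authorization denials.
gcloud compute ssh <master-instance> --zone <zone> \
  --command='sudo grep -iE "forbidden|unauthorized" /var/log/kube-apiserver.log | tail -n 20'
```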
@ethernetdan Good catch.
@ethernetdan @calebamiles I am surprised that this was not already caught during a downgrade test of services. How did we perform downgrade testing for this release (did we at all?)?
@ethernetdan Was the version or group of rbac changed in 1.6? I wonder if kube-proxy uses the discovery service to figure out which version of rbac to use, and is caching the discovery version from the 1.6 server. After downgrading, try deleting the nodes and let the node pool recreate them. Check if they are healthy afterward. Also check the FIRST version of rbac listed by the discovery service in 1.5 vs 1.6. Is it alpha in 1.5 and beta in 1.6?
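(A quick way to compare what discovery advertises, as a sketch run against each cluster version:)

```sh
# List the group/versions the apiserver advertises and look at the rbac entries;
# compare the output from a 1.5 apiserver against a 1.6 one.
kubectl api-versions | grep rbac
```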
I'm confused by this as well; this should have been caught in our large-cluster manual testing as well as in automated tests. @wojtek-t @krousey any ideas why we are just seeing this now?
Didn't mean to close.
Interesting thought but it didn't seem to help. Maybe it could be cached somehow on the API server?
RBAC did move to beta in 1.6; I'm wondering if this is a side effect of making ABAC the default again. @liggitt @bgrant0607 @ericchiang any thoughts?
No, API clients don't inspect RBAC versions to determine how they should authorize… that's purely used server side.
A 1.5 apiserver would only serve alpha RBAC (it doesn't have the beta code), and like all alpha versions, does not enable the API by default. I'd expect no RBAC versions listed in discovery for a 1.5 apiserver. 1.5 kube-up installations don't use RBAC authz; their only authorizer has always been ABAC.
The recent change to leave ABAC enabled by default in 1.6 has no effect on a downgrade to 1.5, since it only changed 1.6 install scripts.
Here are the API server logs.
@liggitt identified that the 1.5 upgrade script is not modifying the existing `known_tokens.csv` file.
Also seeing this in the downgrade job logs: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-1.6-1.5-downgrade-cluster/183/artifacts/bootstrap-e2e-master/kube-apiserver.log

It seems the service test doesn't get run after the master downgrade for some reason, so it wasn't caught.
Is it correct that this is an issue in 1.5 then and would be fixed with a patch release? We should figure out the scope of what downgrade tests are not being run and get them running. |
That's definitely an option; other options would be a migration script in 1.6, or keeping backwards compatibility with the hardcoded username.
Agreed, I want to get to the bottom of why we didn't catch this earlier.
It is an issue in 1.5, and should be fixed with a patch release. 1.5 hard-codes authorization rules, but if the `known_tokens.csv` file already exists, it does not ensure that file works with those authorization rules. This issue was fixed in 1.6, so that whether you are upgrading to 1.6 or downgrading to 1.6, the configured tokens identify users that have the correct permissions.
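(To illustrate the mismatch: a hypothetical look at the tokens file on a GCE master. The path assumed here is where the kube-up scripts normally keep it, and the user names mentioned in the comments are illustrative, not copied from an affected cluster.)

```sh
# Inspect the user column of the static token file (format: token,user,uid[,"group1,group2"])
# and compare it against the user names 1.5's hard-coded authorization rules expect.
sudo cut -d, -f2 /srv/kubernetes/known_tokens.csv
# e.g. a 1.5-written file would list the legacy kube_proxy user, while a file written by 1.6
# can identify the component under a different user name that 1.5's rules do not authorize.
```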
In 1.5, the scheduler and controller manager used the insecure port (which means they bypass authentication and authorization entirely, and weren't affected by tokens identifying permissionless users when downgrading from 1.6).
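(Illustrative flags rather than a quote from the 1.5 manifests; other flags are omitted.)

```sh
# Both components talk to the apiserver over the local insecure port (8080 by default),
# so no token from known_tokens.csv is involved for them.
kube-scheduler --master=127.0.0.1:8080
kube-controller-manager --master=127.0.0.1:8080
```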
That means the only affected component was kube-proxy. If a downgrade to 1.5 ran in an environment that was missing the …
Automatic merge from submit-queue

kube-up: ensure tokens file is correct on upgrades/downgrades

Fixes #43668

1.5 [hard-codes authorization rules](https://github.com/kubernetes/kubernetes/blob/release-1.5/cluster/gce/gci/configure-helper.sh#L915-L920), but if the `known_tokens.csv` file already exists, it [does not ensure it works with those authorization rules](https://github.com/kubernetes/kubernetes/blob/release-1.5/cluster/gce/gci/configure-helper.sh#L264).

```release-note
kube-up (with gce/gci and gce/coreos providers) now ensures the authentication token file contains correct tokens for the control plane components, even if the file already exists (ensures upgrades and downgrades work successfully)
```

This issue was fixed in 1.6 for the gce and coreos providers. This PR picks those fixes for the control plane elements from these commits:

* 968b0b3
* d94bb26
Fixed in #43676. Successfully upgraded v1.5.5 -> ci/latest-1.6 -> ci/latest-1.5 and ensured kube-proxy and kube-dns were functioning correctly after the upgrade and downgrade, and that the tokens file contained the correct control plane users after each step.
This has merged into …
cc @bowei
I have been able to reproduce this twice.
Steps to reproduce:

1. Download the `v1.6.0-rc.1` release tar, extract, and cd into the directory
2. `./cluster/gce/upgrade.sh -M v1.6.0-rc.1` to upgrade the master
3. `./cluster/gce/upgrade.sh -N -o v1.6.0-rc.1` to upgrade the nodes
4. `./cluster/gce/upgrade.sh -N -o v1.5.5` to downgrade the nodes
5. `./cluster/gce/upgrade.sh -M v1.5.5` to downgrade the master