
nodes go missing if cloudprovider instances api returns an error #41850

Closed
colemickens opened this issue Feb 22, 2017 · 23 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@colemickens
Contributor

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Kubernetes version (use kubectl version):

Environment:

  • Cloud provider or hardware configuration: azure
  • OS (e.g. from /etc/os-release): xenial
  • Kernel (e.g. uname -a): latest xenial lts kernel
  • Install tools:
  • Others: ACS-Engine cluster

What happened:
Two of my three nodes went NotReady and have not returned.

What you expected to happen:
For them not to have gone NotReady in the first place, or to have transitioned back to Ready afterward.

Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------  -----------------                       ------------------                      ------                          -------
  OutOfDisk             False   Tue, 21 Feb 2017 16:10:32 -0800         Mon, 20 Feb 2017 11:21:26 -0800         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        False   Tue, 21 Feb 2017 16:10:32 -0800         Fri, 10 Feb 2017 17:45:04 -0800         KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False   Tue, 21 Feb 2017 16:10:32 -0800         Fri, 10 Feb 2017 17:45:04 -0800         KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 False   Tue, 21 Feb 2017 16:10:32 -0800         Mon, 20 Feb 2017 11:21:26 -0800         KubeletNotReady                 Kubelet failed to get node info: failed to get external ID from cloud provider: compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=504 -- Original Error: autorest/azure: Service returned an error. Status=504 Code="GatewayTimeout" Message="The gateway did not receive a response from 'Microsoft.Compute' within the specified time period."
  NetworkUnavailable    False   Fri, 10 Feb 2017 17:45:23 -0800         Fri, 10 Feb 2017 17:45:23 -0800         RouteCreated                    RouteController created a route

My guess is that it got an error back from Azure and then got "stuck". This is similar to how the kubelet gets stuck on boot-up if the cloud provider's instances API returns a 401 error: in that case, the kubelet grinds to a halt and doesn't start the static pods. That's why I'm suspicious of a bug here.
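To make that concrete, here is a minimal, hypothetical sketch (not the kubelet's actual code; the function and variable names are made up) of the distinction that seems to matter: a transient API failure like the 504 GatewayTimeout in the node condition should just be retried on the next sync, and only a genuine "instance not found" should ever make a node go missing.

```go
// Hypothetical sketch, not the kubelet's real code: all names below are made up.
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound stands in for the sentinel error a cloud provider returns
// when a VM genuinely no longer exists.
var errInstanceNotFound = errors.New("instance not found")

// getExternalID stands in for the cloud provider instances call; here it fails
// the way the node condition above shows, with a transient 504.
func getExternalID(nodeName string) (string, error) {
	return "", fmt.Errorf("compute.VirtualMachinesClient#Get: StatusCode=504 GatewayTimeout")
}

// syncNodeInfo classifies the error instead of treating every failure as fatal.
func syncNodeInfo(nodeName string) error {
	id, err := getExternalID(nodeName)
	switch {
	case errors.Is(err, errInstanceNotFound):
		// Only this case should make the node "go missing".
		return fmt.Errorf("node %s no longer exists in the cloud provider", nodeName)
	case err != nil:
		// Transient API failure: report it and retry on the next sync loop,
		// rather than leaving the node stuck in NotReady.
		return fmt.Errorf("transient cloud provider error for %s, retrying next sync: %v", nodeName, err)
	}
	fmt.Printf("node %s has external ID %s\n", nodeName, id)
	return nil
}

func main() {
	fmt.Println(syncNodeInfo("k8s-agentpool1-55822348-14"))
}
```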

How to reproduce it (as minimally and precisely as possible):
Unclear at this moment.

Anything else we need to know:
n/a

@colemickens
Contributor Author

cc: @justinsb Have you ever seen anything like this with AWS?

@justinsb
Member

We've seen cases where instances were incorrectly missing from the list-instances call, but I don't recall having seen this particular one.

I think, though, that ExternalID is now called in two places: on kubelet start-up, but also if the node goes NotReady, because it is checking that the node still exists.

Based on the timing, I think the nodes ran out of disk? But then I don't know why they didn't transition back...

@colemickens
Contributor Author

Hm, okay. I'm not sure why you zeroed in on disk space, though. These nodes were idle; I don't think they'd have run out of disk. I was more keyed into the fact that the kubelet went KubeletNotReady on Monday 20 Feb 2017 and seems never to have transitioned back to Ready, yet the external-ID error was a very intermittent issue. It's as if it got an error back and decided to stop one of its sync loops?

@justinsb
Member

justinsb commented Feb 22, 2017

Just that the disk space transitioned at around the same time:

OutOfDisk False Tue, 21 Feb 2017 16:10:32 -0800 Mon, 20 Feb 2017 11:21:26 -0800

But there's something here - check out #41916 (probably not related to disk space)

@willmore

@colemickens what version are you running?

@colemickens
Contributor Author

Gah, I rushed the bug report. It was either 1.5.2 or 1.5.3.

@seanknox

seanknox commented May 19, 2017

Might be seeing the same behavior on 1.6.2. It's easy to reproduce on Azure, at least. In this example, the kubelet failed to get node info because the Azure subscription in use is being rate-limited by the Azure API. @colemickens I can follow up with you on how to repro if you're curious.

Conditions:
  Type			Status	LastHeartbeatTime			LastTransitionTime			Reason				Message
  ----			------	-----------------			------------------			------				-------
  OutOfDisk 		False 	Fri, 19 May 2017 12:29:16 -0700 	Fri, 19 May 2017 12:14:30 -0700 	KubeletHasSufficientDisk 	kubelet has sufficient disk space available
  MemoryPressure 	False 	Fri, 19 May 2017 12:29:16 -0700 	Fri, 19 May 2017 12:14:30 -0700 	KubeletHasSufficientMemory 	kubelet has sufficient memory available
  DiskPressure 		False 	Fri, 19 May 2017 12:29:16 -0700 	Fri, 19 May 2017 12:14:30 -0700 	KubeletHasNoDiskPressure 	kubelet has no disk pressure
  Ready 		False 	Fri, 19 May 2017 12:29:16 -0700 	Fri, 19 May 2017 12:14:30 -0700 	KubeletNotReady 		Kubelet failed to get node info: failed to get instance ID from cloud provider: compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription."
  NetworkUnavailable 	True 	Fri, 19 May 2017 12:29:23 -0700 	Fri, 19 May 2017 12:29:23 -0700 	NoRouteCreated 			RouteController failed to create a route
Addresses:		10.240.0.186,k8s-agentpool1-55822348-14

@k8s-github-robot

@colemickens There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here.

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 31, 2017
@spiffxp
Member

spiffxp commented Jun 26, 2017

/sig azure
since this is easy to repro on Azure. This may make more sense under an eventual sig-cloud or some other SIG; please triage as appropriate.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 26, 2017
@rushabhnagda11

@seanknox I'm seeing a similar issue on acs-engine. Can you tell me how to reproduce?

@feiskyer
Member

feiskyer commented Nov 1, 2017

I also hit a similar problem, but it's not exactly the same as this one. When I configured a wrong node name, the kubelet couldn't get an external ID from Azure, so it refused to start.

@colemickens Could you also provide related kubelet logs?

@colemickens
Contributor Author

I'm not really running in Azure, or any cloud these days, so it will not be easy for me to provide these.

Anyone with a cluster can emulate this and check. Add an iptables rule blocking calls to the Azure control plane. Wait a few minutes and then remove the block. If there is a bug, kubelet will have stopped reporting status and you should start to see Nodes go unhealthy after some period of time. If there is no bug, then kubelet should recover from the momentary failure and move on.
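If it helps anyone running that experiment, below is a small, hypothetical client-go sketch (the program, kubeconfig path, and polling interval are my own additions, not anything from the kubelet) that polls every node's Ready condition so you can watch whether nodes recover after the block is removed; a plain `kubectl get nodes -w` works just as well.

```go
// Hypothetical helper, not part of Kubernetes: polls the Ready condition of
// every node so you can see whether nodes recover once the block is lifted.
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Println("list nodes:", err)
		} else {
			for _, n := range nodes.Items {
				for _, c := range n.Status.Conditions {
					if c.Type == v1.NodeReady {
						fmt.Printf("%s Ready=%s reason=%s\n", n.Name, c.Status, c.Reason)
					}
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```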

@feiskyer If there's not an open bug for that, could you please open one? @brendandburns and I have both observed and discussed that before. Even if the CP is throwing out errors, kubelet should arguably still finish initializing and run static pods (especially with the work going on to pull the CP mostly out of the kubelet initialization path).

@jdumars
Member

jdumars commented Dec 26, 2017

Bringing to SIG-Azure meeting to discuss.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 26, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 25, 2018
@djsly
Contributor

djsly commented Apr 25, 2018

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 25, 2018
@djsly
Contributor

djsly commented Apr 25, 2018

We are seeing the same behavior.

Yesterday, we lost 10 nodes. The only info I have right now from the kubelet is these log lines (with log level set to 4):

[root@salt ~]# salt '*' cmd.run 'journalctl -u kubelet --since "2018-04-24 11:00" --until "2018-04-24 13:30" --no-pager | grep azure_'
salt.<domain_name>:
km1.<domain_name>:
    Apr 24 11:00:11 km1.<domain_name> kubelet[33940]: E0424 11:00:11.167639   33940 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km1), err=instance not found
    Apr 24 11:01:31 km1.<domain_name> kubelet[33940]: E0424 11:01:31.188795   33940 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km1), err=instance not found
    Apr 24 11:02:51 km1.<domain_name> kubelet[33940]: E0424 11:02:51.227603   33940 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km1), err=instance not found
    Apr 24 11:04:14 km1.<domain_name> kubelet[33940]: E0424 11:04:14.905923   33940 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km1), err=instance not found
    Apr 24 11:05:54 km1.<domain_name> kubelet[33940]: E0424 11:05:54.747346   33940 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km1), err=instance not found
    Apr 24 11:07:09 km1.<domain_name> kubelet[33940]: E0424 11:07:09.773288   33940 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km1), err=instance not found
km2.<domain_name>:
    Apr 24 12:15:23 km2.<domain_name> kubelet[18261]: E0424 12:15:23.950655   18261 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km2), err=instance not found
    Apr 24 12:16:40 km2.<domain_name> kubelet[18261]: E0424 12:16:40.775011   18261 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km2), err=instance not found
km0.<domain_name>:
    Apr 24 12:15:23 km0.<domain_name> kubelet[103571]: E0424 12:15:23.951996  103571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km0), err=instance not found
    Apr 24 12:16:40 km0.<domain_name> kubelet[103571]: E0424 12:16:40.776483  103571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(km0), err=instance not found
kn-default-18.<domain_name>:
kn-default-15.<domain_name>:
kn-default-16.<domain_name>:
    Apr 24 11:19:09 kn-default-16.<domain_name> kubelet[61446]: E0424 11:19:09.864880   61446 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-16), err=instance not found
    Apr 24 11:20:25 kn-default-16.<domain_name> kubelet[61446]: E0424 11:20:25.703237   61446 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-16), err=instance not found
    Apr 24 11:21:46 kn-default-16.<domain_name> kubelet[61446]: E0424 11:21:46.458516   61446 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-16), err=instance not found
    Apr 24 12:15:52 kn-default-16.<domain_name> kubelet[61446]: E0424 12:15:52.080704   61446 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-16), err=instance not found
    Apr 24 12:16:52 kn-default-16.<domain_name> kubelet[61446]: E0424 12:16:52.101113   61446 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-16), err=instance not found
kn-default-17.<domain_name>:
    Apr 24 11:19:33 kn-default-17.<domain_name> kubelet[6783]: E0424 11:19:33.986632    6783 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-17), err=instance not found
    Apr 24 11:20:34 kn-default-17.<domain_name> kubelet[6783]: E0424 11:20:34.147306    6783 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-17), err=instance not found
    Apr 24 11:21:46 kn-default-17.<domain_name> kubelet[6783]: E0424 11:21:46.458938    6783 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-17), err=instance not found
    Apr 24 12:15:52 kn-default-17.<domain_name> kubelet[6783]: E0424 12:15:52.942454    6783 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-17), err=instance not found
    Apr 24 12:16:52 kn-default-17.<domain_name> kubelet[6783]: E0424 12:16:52.961847    6783 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-17), err=instance not found
kn-default-6.<domain_name>:
    Apr 24 11:19:18 kn-default-6.<domain_name> kubelet[109913]: E0424 11:19:18.524452  109913 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-6), err=instance not found
    Apr 24 11:20:25 kn-default-6.<domain_name> kubelet[109913]: E0424 11:20:25.703676  109913 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-6), err=instance not found
    Apr 24 11:21:46 kn-default-6.<domain_name> kubelet[109913]: E0424 11:21:46.459183  109913 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-6), err=instance not found
    Apr 24 12:15:51 kn-default-6.<domain_name> kubelet[109913]: E0424 12:15:51.857600  109913 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-6), err=instance not found
    Apr 24 12:16:51 kn-default-6.<domain_name> kubelet[109913]: E0424 12:16:51.879543  109913 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-6), err=instance not found
kn-edge-1.<domain_name>:
    Apr 24 11:19:15 kn-edge-1.<domain_name> kubelet[66388]: E0424 11:19:15.586761   66388 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-1), err=instance not found
    Apr 24 11:20:25 kn-edge-1.<domain_name> kubelet[66388]: E0424 11:20:25.703685   66388 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-1), err=instance not found
    Apr 24 11:21:46 kn-edge-1.<domain_name> kubelet[66388]: E0424 11:21:46.459178   66388 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-1), err=instance not found
    Apr 24 12:15:53 kn-edge-1.<domain_name> kubelet[66388]: E0424 12:15:53.581371   66388 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-1), err=instance not found
    Apr 24 12:16:53 kn-edge-1.<domain_name> kubelet[66388]: E0424 12:16:53.608408   66388 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-1), err=instance not found
kn-default-1.<domain_name>:
kn-default-2.<domain_name>:
    Apr 24 11:19:15 kn-default-2.<domain_name> kubelet[70992]: E0424 11:19:15.748836   70992 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-2), err=instance not found
    Apr 24 11:20:25 kn-default-2.<domain_name> kubelet[70992]: E0424 11:20:25.703218   70992 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-2), err=instance not found
    Apr 24 11:21:46 kn-default-2.<domain_name> kubelet[70992]: E0424 11:21:46.458673   70992 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-2), err=instance not found
    Apr 24 12:15:52 kn-default-2.<domain_name> kubelet[70992]: E0424 12:15:52.136942   70992 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-2), err=instance not found
    Apr 24 12:16:52 kn-default-2.<domain_name> kubelet[70992]: E0424 12:16:52.162735   70992 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-2), err=instance not found
kn-edge-0.<domain_name>:
kn-default-5.<domain_name>:
    Apr 24 11:19:36 kn-default-5.<domain_name> kubelet[121802]: E0424 11:19:36.145891  121802 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-5), err=instance not found
    Apr 24 11:20:36 kn-default-5.<domain_name> kubelet[121802]: E0424 11:20:36.369044  121802 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-5), err=instance not found
    Apr 24 11:21:46 kn-default-5.<domain_name> kubelet[121802]: E0424 11:21:46.457854  121802 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-5), err=instance not found
    Apr 24 12:15:52 kn-default-5.<domain_name> kubelet[121802]: E0424 12:15:52.163100  121802 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-5), err=instance not found
    Apr 24 12:16:52 kn-default-5.<domain_name> kubelet[121802]: E0424 12:16:52.185957  121802 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-5), err=instance not found
kn-edge-2.<domain_name>:
    Apr 24 12:15:28 kn-edge-2.<domain_name> kubelet[18748]: E0424 12:15:28.984355   18748 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-2), err=instance not found
    Apr 24 12:16:40 kn-edge-2.<domain_name> kubelet[18748]: E0424 12:16:40.775609   18748 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-edge-2), err=instance not found
kn-default-14.<domain_name>:
    Apr 24 12:15:23 kn-default-14.<domain_name> kubelet[78946]: E0424 12:15:23.950478   78946 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-14), err=instance not found
    Apr 24 12:16:40 kn-default-14.<domain_name> kubelet[78946]: E0424 12:16:40.775342   78946 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-14), err=instance not found
kn-default-4.<domain_name>:
    Apr 24 11:19:56 kn-default-4.<domain_name> kubelet[108186]: E0424 11:19:56.764218  108186 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-4), err=instance not found
    Apr 24 11:20:56 kn-default-4.<domain_name> kubelet[108186]: E0424 11:20:56.929916  108186 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-4), err=instance not found
    Apr 24 11:21:56 kn-default-4.<domain_name> kubelet[108186]: E0424 11:21:56.952303  108186 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-4), err=instance not found
    Apr 24 12:16:02 kn-default-4.<domain_name> kubelet[108186]: E0424 12:16:02.472815  108186 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-4), err=instance not found
    Apr 24 12:17:02 kn-default-4.<domain_name> kubelet[108186]: E0424 12:17:02.496710  108186 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-4), err=instance not found
kn-default-7.<domain_name>:
    Apr 24 11:19:28 kn-default-7.<domain_name> kubelet[104146]: E0424 11:19:28.605550  104146 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-7), err=instance not found
    Apr 24 11:20:28 kn-default-7.<domain_name> kubelet[104146]: E0424 11:20:28.750911  104146 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-7), err=instance not found
    Apr 24 11:21:46 kn-default-7.<domain_name> kubelet[104146]: E0424 11:21:46.459029  104146 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-7), err=instance not found
    Apr 24 12:15:53 kn-default-7.<domain_name> kubelet[104146]: E0424 12:15:53.226854  104146 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-7), err=instance not found
    Apr 24 12:16:53 kn-default-7.<domain_name> kubelet[104146]: E0424 12:16:53.248764  104146 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-7), err=instance not found
kn-default-9.<domain_name>:
    Apr 24 11:19:44 kn-default-9.<domain_name> kubelet[53270]: E0424 11:19:44.860856   53270 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-9), err=instance not found
    Apr 24 11:20:45 kn-default-9.<domain_name> kubelet[53270]: E0424 11:20:45.000457   53270 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-9), err=instance not found
    Apr 24 11:21:46 kn-default-9.<domain_name> kubelet[53270]: E0424 11:21:46.458796   53270 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-9), err=instance not found
    Apr 24 12:15:51 kn-default-9.<domain_name> kubelet[53270]: E0424 12:15:51.922491   53270 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-9), err=instance not found
    Apr 24 12:16:51 kn-default-9.<domain_name> kubelet[53270]: E0424 12:16:51.945088   53270 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-9), err=instance not found
    Apr 24 13:21:09 kn-default-9.<domain_name> kubelet[53270]: E0424 13:21:09.277706   53270 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-9), err=instance not found
kn-default-10.<domain_name>:
    Apr 24 11:19:57 kn-default-10.<domain_name> kubelet[123571]: E0424 11:19:57.489475  123571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-10), err=instance not found
    Apr 24 11:20:57 kn-default-10.<domain_name> kubelet[123571]: E0424 11:20:57.632815  123571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-10), err=instance not found
    Apr 24 11:21:57 kn-default-10.<domain_name> kubelet[123571]: E0424 11:21:57.655359  123571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-10), err=instance not found
    Apr 24 12:16:05 kn-default-10.<domain_name> kubelet[123571]: E0424 12:16:05.878499  123571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-10), err=instance not found
    Apr 24 12:17:05 kn-default-10.<domain_name> kubelet[123571]: E0424 12:17:05.900784  123571 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-10), err=instance not found
kn-default-3.<domain_name>:
    Apr 24 11:19:23 kn-default-3.<domain_name> kubelet[98169]: E0424 11:19:23.360182   98169 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-3), err=instance not found
    Apr 24 11:20:25 kn-default-3.<domain_name> kubelet[98169]: E0424 11:20:25.703786   98169 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-3), err=instance not found
    Apr 24 11:21:46 kn-default-3.<domain_name> kubelet[98169]: E0424 11:21:46.459267   98169 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-3), err=instance not found
    Apr 24 12:15:52 kn-default-3.<domain_name> kubelet[98169]: E0424 12:15:52.745287   98169 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-3), err=instance not found
    Apr 24 12:16:52 kn-default-3.<domain_name> kubelet[98169]: E0424 12:16:52.767132   98169 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-3), err=instance not found
kn-default-8.<domain_name>:
    Apr 24 12:15:29 kn-default-8.<domain_name> kubelet[92108]: E0424 12:15:29.961648   92108 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-8), err=instance not found
    Apr 24 12:16:40 kn-default-8.<domain_name> kubelet[92108]: E0424 12:16:40.773896   92108 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-8), err=instance not found
kn-default-12.<domain_name>:
    Apr 24 12:15:45 kn-default-12.<domain_name> kubelet[78243]: E0424 12:15:45.859928   78243 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-12), err=instance not found
    Apr 24 12:16:45 kn-default-12.<domain_name> kubelet[78243]: E0424 12:16:45.883724   78243 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-12), err=instance not found
kn-default-11.<domain_name>:
    Apr 24 11:19:24 kn-default-11.<domain_name> kubelet[51342]: E0424 11:19:24.472289   51342 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-11), err=instance not found
    Apr 24 11:20:25 kn-default-11.<domain_name> kubelet[51342]: E0424 11:20:25.701654   51342 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-11), err=instance not found
    Apr 24 11:21:46 kn-default-11.<domain_name> kubelet[51342]: E0424 11:21:46.457011   51342 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-11), err=instance not found
    Apr 24 12:15:52 kn-default-11.<domain_name> kubelet[51342]: E0424 12:15:52.841093   51342 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-11), err=instance not found
    Apr 24 12:16:52 kn-default-11.<domain_name> kubelet[51342]: E0424 12:16:52.865072   51342 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-11), err=instance not found
kn-default-13.<domain_name>:
    Apr 24 12:15:25 kn-default-13.<domain_name> kubelet[36846]: E0424 12:15:25.383678   36846 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-13), err=instance not found
    Apr 24 12:16:40 kn-default-13.<domain_name> kubelet[36846]: E0424 12:16:40.775030   36846 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-13), err=instance not found
kn-default-19.<domain_name>:
    Apr 24 11:19:14 kn-default-19.<domain_name> kubelet[11615]: E0424 11:19:14.015957   11615 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-19), err=instance not found
    Apr 24 12:15:23 kn-default-19.<domain_name> kubelet[11615]: E0424 12:15:23.950778   11615 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-19), err=instance not found
    Apr 24 12:16:40 kn-default-19.<domain_name> kubelet[11615]: E0424 12:16:40.775880   11615 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-19), err=instance not found
kn-default-0.<domain_name>:
    Apr 24 11:19:30 kn-default-0.<domain_name> kubelet[121631]: E0424 11:19:30.332399  121631 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-0), err=instance not found
    Apr 24 11:20:30 kn-default-0.<domain_name> kubelet[121631]: E0424 11:20:30.494012  121631 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-0), err=instance not found
    Apr 24 11:21:46 kn-default-0.<domain_name> kubelet[121631]: E0424 11:21:46.456901  121631 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-0), err=instance not found
    Apr 24 12:15:52 kn-default-0.<domain_name> kubelet[121631]: E0424 12:15:52.124184  121631 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-0), err=instance not found
    Apr 24 12:16:52 kn-default-0.<domain_name> kubelet[121631]: E0424 12:16:52.164652  121631 azure_instances.go:34] error: az.NodeAddresses, az.getIPForMachine(kn-default-0), err=instance not found

I'm now adding extra logging to kubelet since the azure autorest error is getting swallowed...

This will help identify the root cause, but it doesn't explain why, after an hour, the nodes are still not back to Ready and the controller-manager keeps marking them all as NotReady. If we restart the kubelet, all statuses come back to normal.
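As a made-up illustration of the swallowing (this is not the real azure_instances.go code, just a sketch of the pattern): if every lookup failure gets mapped to a bare "instance not found", the autorest status code never reaches the journal, whereas wrapping or logging the original error keeps the cause visible.

```go
// Made-up illustration of the error-swallowing pattern; not the real
// azure_instances.go code.
package main

import (
	"errors"
	"fmt"
)

var errInstanceNotFound = errors.New("instance not found")

// lookupIP stands in for az.getIPForMachine; in practice the underlying call
// can fail with a transient autorest error (429, 504, ...) rather than a
// genuinely missing VM.
func lookupIP(vmName string) (string, error) {
	return "", fmt.Errorf("autorest/azure: Status=429 OperationNotAllowed")
}

// getIPSwallowed drops the underlying cause, which matches what the journal
// above ends up showing: only "err=instance not found".
func getIPSwallowed(vmName string) (string, error) {
	if _, err := lookupIP(vmName); err != nil {
		return "", errInstanceNotFound
	}
	return "", nil
}

// getIPWrapped preserves the cause so the 429/504 would show up in the logs.
func getIPWrapped(vmName string) (string, error) {
	if _, err := lookupIP(vmName); err != nil {
		return "", fmt.Errorf("instance not found for %s: %w", vmName, err)
	}
	return "", nil
}

func main() {
	_, err := getIPSwallowed("kn-default-9")
	fmt.Println("swallowed:", err)
	_, err = getIPWrapped("kn-default-9")
	fmt.Println("wrapped:  ", err)
}
```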

@brendandburns
Contributor

@djsly what version of k8s is this? There was a pretty serious bug in 1.7 and earlier that could cause this:

#57484

Can you confirm that you have that PR in your k8s version?

@djsly
Contributor

djsly commented Jun 13, 2018

@brendandburns thanks for the reply. After we upgraded to 1.7.13, the issue where the kubelet disappeared forever was resolved. We are still affected by the kubelet being marked NotReady, and we were able to pinpoint it to management.azure.com becoming unavailable across multiple subscription IDs for ~5 minutes.

This is causing pod evictions across 15-50% of our nodes once a day. I have engaged Azure Support, since I do not think this is related to upstream at this point.

Regarding this issue, I would say that the NotReady status sticking forever is indeed fixed.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
