nodes go missing if cloudprovider instances api returns an error #41850
Comments
cc: @justinsb Have you ever seen anything like this with AWS?
We've seen some cases where instances were incorrectly not returned by the list-instances call, but I don't recall seeing this particular one. I think, though, that ExternalID is now called in two places: on kubelet start-up, but also if the node goes NotReady, because it checks that the node still exists. Based on the timing, I think the nodes ran out of disk? But then I don't know why they didn't transition back...
Hm, okay. I'm not sure why you identified disk space, though. These nodes were idle; I don't think they'd have run out of disk. I was more keyed into the fact that the kubelet went KubeletNotReady on Monday 20 Feb 2017 and then seems never to have transitioned back to Healthy... yet the error there from ExternalID was a very intermittent issue. As if it got an error back and decided to stop one of its sync loops?
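(For anyone trying to confirm the same symptom, the Ready condition and its last transition time can be inspected directly. This is a generic diagnostic sketch, not something from the original thread; the node name is a placeholder.)

```sh
# List nodes, then look at the Conditions block on an affected node.
kubectl get nodes
kubectl describe node <node-name> | grep -A 8 'Conditions:'

# Or pull just the Ready condition's reason and last transition time:
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason} {.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
```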
Just that the disk space transitioned at around the same time:
But there's something here - check out #41916 (probably not related to disk space)
@colemickens what version are you running?
Gah, I rushed the bug report. It was either 1.5.2 or 1.5.3.
Might be seeing the same behavior on 1.6.2. It's easy to reproduce on Azure, at least. In this example, the kubelet failed to get node info because the Azure subscription in use was being rate limited by the Azure API. @colemickens I can follow up with you on how to repro if you're curious.
@colemickens There are no sig labels on this issue. Please add a sig label by:
/sig azure
@seanknox I'm seeing a similar issue on acs-engine. Can you tell me how to reproduce?
I also ran into a similar problem, though it's not exactly the same as this one. When I configured a wrong node name, the kubelet couldn't get an external ID from Azure, so it refused to start. @colemickens Could you also provide the related kubelet logs?
I'm not really running in Azure, or any cloud, these days, so it won't be easy for me to provide these. Anyone with a cluster can emulate this and check: add an iptables rule blocking calls to the Azure control plane, wait a few minutes, and then remove the block (sketch below). If there is a bug, the kubelet will have stopped reporting status and you should start to see nodes go unhealthy after some period of time. If there is no bug, then the kubelet should recover from the momentary failure and move on. @feiskyer If there's not an open bug for that, can you please open one? @brendandburns and I have both observed and discussed that before. Even if the CP is throwing errors, the kubelet should arguably still finish initializing and run static pods (especially with the work going on to pull the CP mostly out of the kubelet initialization path).
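A minimal sketch of that repro, assuming an Azure node with iptables and dig available (the endpoint resolution and rule details vary by environment; this is illustrative, not a tested procedure):

```sh
# Resolve the Azure management endpoint once so the same addresses can be
# unblocked later (the resolved set can change between lookups).
ips=$(dig +short management.azure.com | grep -E '^[0-9.]+$')

# Block outbound HTTPS to those addresses to simulate a control-plane outage.
for ip in $ips; do sudo iptables -I OUTPUT -d "$ip" -p tcp --dport 443 -j REJECT; done

# Wait a few minutes, watching node status and the kubelet log, then restore.
for ip in $ips; do sudo iptables -D OUTPUT -d "$ip" -p tcp --dport 443 -j REJECT; done
```

If the bug is present, the affected node should stay NotReady even after the rules are removed; if not, the kubelet should resume posting status shortly after connectivity returns.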
Bringing to SIG-Azure meeting to discuss. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
We are seeing the same behavior. Yesterday, we lost 10 nodes. The only info I have right now from kubelet is this line (with log level set to 4)
I'm now adding extra logging to kubelet since the Azure autorest error is getting swallowed... This will help me identify the root cause, but it doesn't explain why, after 1 hour, the nodes are still not back to Healthy and the controller-manager marks them all as NotReady.
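(As a stopgap before patching the kubelet, grepping its log at higher verbosity for Azure/autorest errors can at least show what, if anything, is being surfaced. The unit name below is an assumption for systemd-based deployments.)

```sh
# Assumes the kubelet runs as a systemd unit named "kubelet"; adjust otherwise.
sudo journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE 'azure|autorest|cloudprovider'
```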
@brendanburns thanks for the reply. After we upgraded to 1.7.13, the issue where the kubelet disappears forever was resolved. We are still affected by the kubelet being marked as NotReady, and we were able to pinpoint it to management.azure.com becoming unavailable across multiple subscription IDs for ~5 minutes. This is causing PodEviction across 15-50% of our nodes once a day. I have engaged Azure Support since I do not think this is related to upstream at this point. Regarding this issue, I would say that the NotReady status sticking forever is indeed fixed.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Kubernetes version (use kubectl version):

Environment:
Kernel (e.g. uname -a): latest xenial LTS kernel

What happened:
Two of my three nodes went NotReady and have not returned.

What you expected to happen:
Them not to have gone NotReady, or for them to have transitioned back out.

My guess is that the kubelet got an error back from Azure and then got "stuck". This is similar to how the kubelet gets stuck on boot-up if the cloudprovider instances call returns a 401 error... In that case, the kubelet grinds to a halt and doesn't start up the static pods. That's why I'm suspicious of a bug here.
How to reproduce it (as minimally and precisely as possible):
Unclear at this moment.
Anything else we need to know:
n/a