e2e flake: Kubelet failed in scalability test #21866
Comments
@dchen1107 - can you please take a look or delegate?
More complete info:
To save time for other people, here are the relevant logs:
The symptom seems consistent with #20096 (comment)
This is #21085 (comment)
More specifically, you should see:
In the kernel log for #20096. I think this is something @dchen1107 just wanted to document in the release notes. Our decision to do so should hinge on how frequently it happens; if it's a daily occurrence, we should treat it more seriously IMO.
The differences are:
Thanks for the clarification.
Absence of evidence is not evidence of absence. Dawn observed this yesterday on a cluster in the wild. I can say with some confidence that it isn't that kernel bug; a hung task could of course have many causes, including another kernel bug or a buggy disk device driver (less likely with a GCE VM, of course). Hung tasks are probably exacerbated by our sending of kill signals. We can mitigate this by increasing the hung task timeout, but there's really no way to recover without a reboot, which we don't want to do automatically (if we did, we could easily just push out #20947). I do believe there's a corresponding (non-hairpin) docker bug for this.
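For reference, the "hung task timeout" here is presumably the kernel's `kernel.hung_task_timeout_secs` sysctl, the threshold khungtaskd uses before logging the "task blocked for more than N seconds" warning. A minimal sketch of bumping it, assuming a Linux node with that sysctl exposed under /proc/sys (the new value of 300s is arbitrary for illustration):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hungTaskTimeout is the sysctl khungtaskd consults to decide when a task
// stuck in uninterruptible sleep is reported as hung (120s by default on
// most distros).
const hungTaskTimeout = "/proc/sys/kernel/hung_task_timeout_secs"

func main() {
	cur, err := os.ReadFile(hungTaskTimeout)
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	fmt.Printf("current hung task timeout: %ss\n", strings.TrimSpace(string(cur)))

	// Raising the timeout (requires root) only quiets the kernel warnings;
	// it does not unwedge the blocked task, which is why a reboot is still
	// the only real recovery, as noted above.
	if err := os.WriteFile(hungTaskTimeout, []byte("300\n"), 0644); err != nil {
		fmt.Println("write (are you root?):", err)
	}
}
```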
Clarification: We didn't try killing docker. Our health check
@bprashanth is right. This is a different issue, moby/moby#9605, present since the 1.5 release. It is a relatively rare case compared to the unregister_netdevice issue. @yujuhong, the reason the kubelet is killed is that a docker API request hangs, which makes syncLoop take too long and fail the kubelet health check here. If we switch the health check back to docker ps from docker version, you will see both docker and the kubelet restart. The only way to deal with this issue is rebooting the node.
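To illustrate the docker ps vs. docker version distinction being discussed (this is not the kubelet's actual health-check code, just a sketch assuming the daemon listens on the default /var/run/docker.sock socket): `/version` only touches the daemon itself, while `/containers/json` (the API behind `docker ps`) also exercises the container-listing path, so it is more likely to stall when a container is wedged by the kernel deadlock.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// dockerClient returns an http.Client that talks to the local Docker daemon
// over its unix socket and gives up after the given timeout.
func dockerClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
}

// checkDocker issues one API call and reports whether it returned within the
// deadline; a hung daemon surfaces as a timeout error instead of blocking
// the caller forever.
func checkDocker(path string, timeout time.Duration) error {
	resp, err := dockerClient(timeout).Get("http://docker" + path)
	if err != nil {
		return fmt.Errorf("docker health check %s failed: %v", path, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("docker health check %s: unexpected status %s", path, resp.Status)
	}
	return nil
}

func main() {
	// "/version" checks only daemon liveness; "/containers/json" is what
	// `docker ps` calls and tends to hang first in the scenario above.
	for _, path := range []string{"/version", "/containers/json"} {
		if err := checkDocker(path, 10*time.Second); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("docker %s responded in time\n", path)
	}
}
```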
I knew that and was just telling @bprashanth that docker wasn't being killed during that time, because he wrote "Hung tasks are probably exacerbated by our sending of kill signals" :)
This is not a flake; docker is totally hung due to a deadlock in the kernel:
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.175460] Call Trace:
We cannot do much except document it and tell the user to reboot the node, as in previous releases, since we don't have a way to gracefully drain a node yet.
Hmm - since we can't do anything about it and it's documented, can we close this one? @dchen1107?
Closing this one. After adding the workaround for hairpin mode, the docker process hang issue is rare. I am closing this issue, and we are going to document the docker issue as a known release issue.
One of our scalability test runs failed with:
I took a quick look into it and it seems that something strange happened to Kubelet.
See the end of this file:
https://00e9e64bacf1da6c6807cb09861e2f84245c7b68119c8eddd5-apidata.googleusercontent.com/download/storage/v1_internal/b/kubernetes-jenkins/o/logs%2Fkubernetes-e2e-gce-scalability%2F4805%2Fartifacts%2F104.154.18.94:22-kubelet.log?qk=AD5uMEuGCECRLVur5CciL5wz4yH6zwU4Hpby0bxZyymy7GwmBVd3OokDf1mxIgV7ufBpaDXbvvKChJ5qfzehFhjPed4upFRssuBNJHSTO6BWDIIUFvwo9uwf1qHWJ_ZKNXiSV4s6wkJBsQEZij6jCq17FPvUfEp-2O5C2TWrAzjaig80ZvsWwJFigkjNgEfACn63STUAqAIg5pk32xqydV6Y0E0y7OAruTblsA-WijF8BwDEXlgfkvF9ZIOuO00honN7BmlUPGB-6Z_WstGfZZpxIivnk5D4nLsQkuwOqp7uLpiEXY25-GKHY3Lt6Yl5XoBbbs5WE7gnbo1kTqmOaiaQHOG7bN4MHb_vYbdRsgzl1M-ZfP155VVEeEFJtGYZrcGcwwWHCli7MmsGjA3FOUwM8Scd-Xl1CdF2jLGVYotjp_ObbdZ-7ReroIK706omRyV0QeccWAYLfPerVS8JJytotb-YupFNaufDr1flKgqKsCl2ta-0TQYjWV8vosWcyrifSaGq27yY6y3oMsBv33eOx2ITW1cZDrzCvW3pzqFKJlHsnWf_PumJ1YkfTlXBlBMj8XfdF6at1Igcha0BxjyUaDVhKYaBaYZalJrFPZHS_VRRNCuk3q_IKshqj8RB_7cogFKfssEdEfI_yYL3Y8k85cvK6kVpZkZiVEuQa3Uwwgf6A7yEI-vQLujpcgnRisvdYJ_6Zu3UR-wepcnaXv13QN1Sd6D3RBE5ih4l-xw0fIdKCAUVZc46k-PBSitofoW_tqZgn4AskTd6Z__W1v2OlC8-7PfXNLWFvBuwYihFRo_l38GdqPFRoSBgo0HjZ9tZgxH7xsv_DsZw-tlGbZcvxoKbbK7tSDUyt_takszDwwaJzpD67eSM0IuJxcknOVmWqHty9T3EkbpYtX_JBTQV6eUL-7u1W7jWhvCwYNsluqxM0uPSYvg
For completeness - the failed run is this one:
https://storage.googleapis.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/4805/build-log.txt