e2e flake: Namespace cannot be deleted in time #26290
Comments
Is this team/node or team/cluster, or something else?
Another instance is here: BTW, I think we have another issue open for this, but I can't find it now.
This is control plane. @liggitt? @derekwaynecarr? One of you guys want this?
So actually the failure from kubemark seems to be a different issue - "event etcd" is not working there, so we weren't able to remove events at all (we weren't even able to list them).
@wojtek-t - I don't think we can safely delete a namespace unless all stores reply back. This is almost always an issue on the kubelet side. I can look next week, but would not object to help from others.
Is this the same as #23514?
I am trying to think of what more comprehensive information would make these much easier to debug. I will look to add a PR tomorrow morning that dumps out more information about pods when deletion hangs. The majority of the time the issue is on the kubelet side, but dumping the state of the pods as seen in the API server would help to show whether pods ever started, whether images were ever pulled, etc.
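A minimal sketch of that kind of diagnostic dump, assuming a recent client-go API (the List signature has changed across releases) and an already-constructed kubernetes.Interface client; the function name dumpPodState is hypothetical:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpPodState lists the pods still present in a namespace and prints what the
// API server knows about them: phase, deletion timestamp, and conditions.
func dumpPodState(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing pods in %q: %w", namespace, err)
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s phase=%s deletionTimestamp=%v\n",
			p.Name, p.Status.Phase, p.DeletionTimestamp)
		for _, c := range p.Status.Conditions {
			fmt.Printf("  condition %s=%s reason=%q message=%q\n",
				c.Type, c.Status, c.Reason, c.Message)
		}
	}
	return nil
}
```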
@wojtek-t - we can turn on event etcd in kubemark - it's implemented but I think it's not enabled by default. @caesarxuchao - do you think it's possible that it's related to ownerRef?
If it's not enabled by default, why would we even require it in that particular run? This sounds strange...
In kubemark, I have no idea what the default config is for gce-serial.
No one is setting ownerRef or OrphanDependents, so this shouldn't be caused by the GC. I'm assigned to other "couldn't delete ns" flakes on GKE, which don't have master logs. This GCE flake is a good sample, so I'll take a look tomorrow. I took a glance at the API server's log; it hasn't received any deletion requests for the remaining pods, which is weird.
@caesarxuchao thanks for your comments! Can we reassign this issue to you? (If so, please just reassign it yourself.)
@caesarxuchao - if the pod is already terminating, the issue is that the kubelet needs to send the final delete request (not the namespace controller).
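For illustration, that final delete is just a pod DELETE with a zero grace period, so the object is actually removed rather than left in Terminating. A hedged sketch using current client-go signatures (older releases did not take a context); deleteNow is a hypothetical helper:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteNow sends the kind of final delete the kubelet issues once a pod's
// containers and volumes are cleaned up: grace period zero, so the API server
// removes the object instead of leaving it in Terminating.
func deleteNow(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```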
@derekwaynecarr thanks for the hint.
The failed test:
kubelet log:
kubelet saw the pod at ... The more serious problem is that kubelet continued trying to mount the volume (seemingly in a very tight loop) and couldn't retrieve the secret. It never stopped trying to mount.
That is expected. The kubelet volume manager runs an asynchronous loop that attempts to mount volumes for pods if they are not mounted (retrying on failure). It does not block any kubelet functionality except the particular pod's startup goroutine, and even that eventually times out as well. But since it results in API calls, there should be backoff logic: I opened #27492 to track that.
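A sketch of the kind of backoff that retry loop could use, built on the wait helpers in k8s.io/apimachinery; tryMount is a hypothetical stand-in for the real mount attempt, and the durations are illustrative:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// mountWithBackoff retries a mount attempt with exponential backoff instead of
// a tight loop, so repeated failures stop hammering the API server.
func mountWithBackoff(tryMount func() error) error {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // delay before the first retry
		Factor:   2.0,                    // double the delay each attempt
		Jitter:   0.1,
		Steps:    6, // give up after a handful of attempts
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := tryMount(); err != nil {
			return false, nil // treat as retryable: back off and try again
		}
		return true, nil // mounted successfully
	})
}
```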
@saad-ali so it's the volume mount retry loop that prevented kubelet from deleting the pod?
The syncPod method that is used to set up a pod calls the volume manager to make sure that all the volumes for the pod are attached/mounted--
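That wait can be pictured as a bounded poll: check whether the pod's volumes are mounted and give up after a timeout. A rough sketch, where volumesMounted is a hypothetical predicate standing in for the volume manager's real wait call:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForVolumes polls until volumesMounted reports true or the timeout
// expires, mirroring the shape of the wait that syncPod performs before
// starting the pod's containers.
func waitForVolumes(volumesMounted func() bool, timeout time.Duration) error {
	return wait.PollImmediate(300*time.Millisecond, timeout, func() (bool, error) {
		return volumesMounted(), nil
	})
}
```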
@saad-ali Does that wait get cancelled if the pod is deleted? We need to get this fixed asap or I'm gonna have to roll back your big change :/
@saad-ali I think the volume manager can stop trying to mount once the ...
+1 on having this cancel on pod deletion, or agree on roll-back.
To be fair, kubelet doesn't have the ability to cancel a Docker operation midway either. If the worker is blocked on pulling an image, kubelet would wait until the pull succeeds before deleting the pod. We should ideally cancel the sync iteration on a deletion request, but this is not the case now. The difference here for attachment/mounting is that ...
(2) has already been addressed by #27491, which lowered the timeout to 2 min.
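The distinction being discussed, cancelling when the pod is deleted versus only bounding the wait with a timeout, can be sketched with a plain Go select. The mounted and podDeleted channels are hypothetical signals, and the 2-minute bound echoes the timeout #27491 introduced:

```go
package main

import (
	"errors"
	"time"
)

// waitMountedOrCancelled waits for the volumes to come up, but gives up early
// if the pod is deleted or the overall timeout expires.
func waitMountedOrCancelled(mounted, podDeleted <-chan struct{}) error {
	select {
	case <-mounted:
		return nil // volumes ready; continue starting the pod
	case <-podDeleted:
		return errors.New("pod deleted; abandoning the volume wait")
	case <-time.After(2 * time.Minute):
		return errors.New("timed out waiting for volumes to mount")
	}
}
```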
Serial tests take a long time to reflect changes, since they run so slowly (3.5 hours or so). PR #27491, which lowered the timeout for waitForAttach, was merged at 10:57 PM last night. There were two runs on gce-serial that failed overnight after that because of build issues (one was manually stopped, the other hit the build timeout), then a 7:39 AM run that was green. Same story on the GKE side. So far lowering the timeout seems to have fixed this test. Will continue to monitor subsequent serial runs.
Closing this in favor of the automated issue #27502
https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-serial/1356/
The pods were not deleted in time, causing cascading failures in the following tests.