Unblock process deletion in worker runtime #8479
base: master
Conversation
This reverts commit cee8ff8. Signed-off-by: Rui Yang <ruiya@vmware.com>
Why is the process not exiting in the first place? I would expect aborting a build to cause the process to exit. I would feel a lot better if the process exits on its own instead of us having to kill the container directly.
@@ -48,14 +48,14 @@ func (p *Process) Wait() (int, error) {
		return 0, fmt.Errorf("proc closeio: %w", err)
	}

	p.process.IO().Wait()

	_, err = p.process.Delete(context.Background())
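For context, here is a minimal, self-contained sketch of the wait / IO-wait / delete ordering these lines are part of, assuming containerd's Go client. It is an illustration of the general pattern only, not the actual Concourse `Process.Wait` implementation; the function and package names are made up.

```go
// Sketch only: generic containerd wait/IO-wait/delete ordering.
package runtimesketch

import (
	"context"

	"github.com/containerd/containerd"
)

func waitAndDelete(ctx context.Context, proc containerd.Process) (int, error) {
	statusC, err := proc.Wait(ctx) // returns a channel; does not block yet
	if err != nil {
		return 0, err
	}

	status := <-statusC // blocks until the process actually exits

	// Blocks until the stdio copy goroutines finish, i.e. until every holder
	// of the stdio pipes has closed them. This is the call under discussion.
	proc.IO().Wait()

	// Delete releases any remaining OS resources reserved by the process.
	if _, err := proc.Delete(ctx); err != nil {
		return 0, err
	}

	code, _, err := status.Result()
	return int(code), err
}
```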
@chenbh do you mean why should we delete it here? My knowledge of Linux programming is limited here, to be honest. I'm not sure how it is supposed to work.
My understanding of what happens is:

1. User clicks the abort button
2. Web node handles the request and the step context gets cancelled
3. Cancelling the context invokes Stop(false) on the worker, which will try to gracefully kill the process
4. The killer will first try to gracefully stop the process (by sending a SIGINT/SIGTERM), but if that doesn't work it sends a SIGKILL (a minimal sketch of this two-step kill follows this list)
5. Even if the resource ignores the SIGINT/SIGTERM and refuses to self-terminate, it should have been killed by the SIGKILL
6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because, you know, the process is killed and the IO is closed)
7. We can then p.process.Delete() to free up any remaining OS resources reserved by the process
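A minimal sketch of the graceful-then-forceful kill described in step 4, assuming containerd's Go client; the helper name and grace period are illustrative, not Concourse's actual stopper code.

```go
// Sketch of a two-step kill: SIGTERM first, escalate to SIGKILL on timeout.
package killsketch

import (
	"context"
	"syscall"
	"time"

	"github.com/containerd/containerd"
)

func gracefullyKill(ctx context.Context, proc containerd.Process, grace time.Duration) error {
	statusC, err := proc.Wait(ctx)
	if err != nil {
		return err
	}

	if err := proc.Kill(ctx, syscall.SIGTERM); err != nil {
		return err
	}

	select {
	case <-statusC:
		return nil // the process honoured SIGTERM
	case <-time.After(grace):
		// The resource ignored SIGTERM; SIGKILL cannot be caught or ignored.
		if err := proc.Kill(ctx, syscall.SIGKILL); err != nil {
			return err
		}
		<-statusC
		return nil
	}
}
```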
But from #8462, it looks like the resource is misbehaving by ignoring the SIGINT/SIGTERM, yet for some reason the SIGKILL wasn't issued, which allowed the process to continue executing and jam up the worker.
The pool resource hasn't been changed for a long time. The change comes from the big refactor in Concourse v7.5.

> 6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because, you know, the process is killed and the IO is closed)

The problem is in that last part. During my investigation, I cancelled the build while monitoring the processes in the container.
Before cancelling,
PID USER TIME COMMAND
1 root 0:00 /tmp/gdn-init
7 root 0:00 {out} /bin/sh /opt/resource/out /tmp/build/put
19 root 0:00 ssh-agent
32 root 0:00 /opt/go/out /tmp/build/put
49 root 0:00 bash
60 root 0:00 ps aux
After cancelling, my SSH session got kicked out and I had to SSH back in to see:
PID USER TIME COMMAND
1 root 0:00 /tmp/gdn-init
19 root 0:00 ssh-agent
32 root 0:00 /opt/go/out /tmp/build/put
156 root 0:00 bash
167 root 0:00 ps aux
worker logs after cancelling:
{"timestamp":"2022-07-27T16:44:44.198530100Z","level":"debug","source":"worker","message":"worker.garden.garden-server.stop.stopping","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
...
{"timestamp":"2022-07-27T16:44:54.213367800Z","level":"info","source":"worker","message":"worker.garden.garden-server.stop.stopped","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
{"timestamp":"2022-07-27T16:44:54.215373100Z","level":"debug","source":"worker","message":"worker.garden.connection.closed","data":{"local_addr":{"IP":"127.0.0.1","Port":7777,"Zone":""},"remote_addr":{"IP":"127.0.0.1","Port":37256,"Zone":""},"session":"1.5"}}
{"timestamp":"2022-07-27T16:44:54.226306800Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.exited","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","id":"4ecd375f-6844-4f99-6128-6199b7f02bb3","session":"1.4.20","status":137}}
You can see the process was actually killed (v7.4 behaves the same), and the worker logs show that Garden stopped the container, but the IO was still running and outputting those `...` lines.
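A plausible explanation for this, shown with a small standalone Go program (my illustration of the mechanism, not Concourse or containerd code): a pipe reader only sees EOF when every process holding the write end has exited, so a backgrounded grandchild that inherited the container process's stdout keeps the runtime's copy loop, and therefore anything like IO().Wait(), blocked even after the direct child is gone.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		log.Fatal(err)
	}

	// "sh" exits almost immediately, but the backgrounded sleep inherits the
	// write end of the pipe as its stdout and keeps it open.
	cmd := exec.Command("sh", "-c", "echo started; sleep 30 &")
	cmd.Stdout = w
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	_ = cmd.Wait() // the direct child is gone at this point
	_ = w.Close()  // drop the parent's copy of the write end

	done := make(chan struct{})
	go func() {
		io.Copy(os.Stdout, r) // stand-in for the worker's stdio copy loop
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("copy finished (all writers closed)")
	case <-time.After(2 * time.Second):
		fmt.Println("copy still blocked: the grandchild holds the pipe open")
	}
}
```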
@chenbh I have attached a pipeline for reproducible steps and acceptance.
@xtremerui Re-running the PR pipeline for this PR. I think I've seen this happen in my own clusters as well; I had one build that got stuck in a "running" state similar to this. I'd like to try getting this in for the next release.
Changes proposed by this PR
This reverts commit cee8ff8.
closes #8462
Notes to reviewer
The root cause of the stuck build is that p.process.IO().Wait() will block on stdin/stdout, and in the pool resource's case stdout is occupied by the loop that acquires a pool lock.

It worked in v7.4.4; then we had the big refactor of atc/worker in v7.5. I'm not exactly sure which change caused it to stop working as before. I'm guessing it could also be some underlying change in containerd around how it closes the process IO pipes.
Anyway, cee8ff8 was meant to fix a race condition where IO was closed too early due to process deletion, which is a rare edge case IMO. Since the current problem could happen to any resource type that blocks stdout (and if that happens there is no way to finish the build), I think it is fair to revert it and see if we spot the race again.
We can merge it to master and deploy it to hush-house to fix the stuck-build issue, and keep an eye on the race condition on both CI and hush-house.
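To make the idea concrete, here is a hedged sketch of the kind of ordering this unblocks: stop letting IO().Wait() gate process deletion unconditionally, so a resource that keeps stdout busy cannot wedge the build forever. This is my assumption about the shape of the behaviour, not a copy of the reverted code; the function name and grace period are illustrative.

```go
// Sketch only: wait for exit, give stdio a bounded chance to drain, then delete.
package runtimesketch

import (
	"context"
	"time"

	"github.com/containerd/containerd"
)

func waitDeleteUnblocked(ctx context.Context, proc containerd.Process) (int, error) {
	statusC, err := proc.Wait(ctx)
	if err != nil {
		return 0, err
	}
	status := <-statusC // the process itself has exited (e.g. via SIGKILL)

	// Let the stdio copiers drain, but do not block deletion forever if
	// something (such as an inherited fd) still holds the pipes open.
	ioDone := make(chan struct{})
	go func() {
		proc.IO().Wait()
		close(ioDone)
	}()
	select {
	case <-ioDone:
	case <-time.After(10 * time.Second): // arbitrary grace period
	}

	if _, err := proc.Delete(ctx); err != nil {
		return 0, err
	}
	code, _, err := status.Result()
	return int(code), err
}
```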
Reproduce
I have created a temporary pool repo; the private_key is just a deploy key specific to that repo, so it is safe to post here.
Set and run the above pipeline in a local Concourse; without this PR, you should see that the get is still running after cancelling the build.
Release Note