Unblock process deletion in worker runtime #8479
base: master
Conversation
This reverts commit cee8ff8. Signed-off-by: Rui Yang <ruiya@vmware.com>
Why is the process not exiting in the first place? I would expect aborting a build to cause the process to exit. I would feel a lot better if the process exits on its own instead of us having to kill the container directly.
@@ -48,14 +48,14 @@ func (p *Process) Wait() (int, error) {
		return 0, fmt.Errorf("proc closeio: %w", err)
	}

	p.process.IO().Wait()

	_, err = p.process.Delete(context.Background())
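For context, here is a minimal, self-contained sketch of the wait / IO-wait / delete ordering these lines are part of, assuming containerd's Go client. It is an illustration of the general pattern only, not the actual Concourse `Process.Wait` implementation; the function and package names are made up.

```go
// Sketch only: generic containerd wait/IO-wait/delete ordering.
package runtimesketch

import (
	"context"

	"github.com/containerd/containerd"
)

func waitAndDelete(ctx context.Context, proc containerd.Process) (int, error) {
	statusC, err := proc.Wait(ctx) // returns a channel; does not block yet
	if err != nil {
		return 0, err
	}

	status := <-statusC // blocks until the process actually exits

	// Blocks until the stdio copy goroutines finish, i.e. until every holder
	// of the stdio pipes has closed them. This is the call under discussion.
	proc.IO().Wait()

	// Delete releases any remaining OS resources reserved by the process.
	if _, err := proc.Delete(ctx); err != nil {
		return 0, err
	}

	code, _, err := status.Result()
	return int(code), err
}
```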
@chenbh do you mean why should we delete it here? My knowledge of Linux programming is limited here, to be honest. I'm not sure how it is supposed to work.
My understanding of what happens is:

1. User clicks the abort button
2. Web node handles the request and the step context gets cancelled
3. Cancelling the context invokes Stop(false) on the worker, which will try to gracefully kill the process
4. The killer will first try to gracefully stop the process (by sending a SIGINT/SIGTERM), but if that doesn't work it sends a SIGKILL (a minimal sketch of this two-step kill follows this list)
5. Even if the resource ignores the SIGINT/SIGTERM and refuses to self-terminate, it should have been killed by the SIGKILL
6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because, you know, the process is killed and the IO is closed)
7. We can then p.process.Delete() to free up any remaining OS resources reserved by the process
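A minimal sketch of the graceful-then-forceful kill described in step 4, assuming containerd's Go client; the helper name and grace period are illustrative, not Concourse's actual stopper code.

```go
// Sketch of a two-step kill: SIGTERM first, escalate to SIGKILL on timeout.
package killsketch

import (
	"context"
	"syscall"
	"time"

	"github.com/containerd/containerd"
)

func gracefullyKill(ctx context.Context, proc containerd.Process, grace time.Duration) error {
	statusC, err := proc.Wait(ctx)
	if err != nil {
		return err
	}

	if err := proc.Kill(ctx, syscall.SIGTERM); err != nil {
		return err
	}

	select {
	case <-statusC:
		return nil // the process honoured SIGTERM
	case <-time.After(grace):
		// The resource ignored SIGTERM; SIGKILL cannot be caught or ignored.
		if err := proc.Kill(ctx, syscall.SIGKILL); err != nil {
			return err
		}
		<-statusC
		return nil
	}
}
```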
But from #8462, it looks like the resource is misbehaving by ignoring the SIGINT/SIGTERM, yet for some reason the SIGKILL wasn't issued, which allowed the process to continue executing and jam up the worker.
The pool resource hasn't been changed for a long time. The change comes from the big refactor in Concourse v7.5.

> 6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because, you know, the process is killed and the IO is closed)

The problem is in that last part. During my investigation, I cancelled the build while monitoring the processes in the container.
Before cancelling,
PID USER TIME COMMAND
1 root 0:00 /tmp/gdn-init
7 root 0:00 {out} /bin/sh /opt/resource/out /tmp/build/put
19 root 0:00 ssh-agent
32 root 0:00 /opt/go/out /tmp/build/put
49 root 0:00 bash
60 root 0:00 ps aux
After cancelling, my SSH session got kicked out and I had to SSH back in to see:
PID USER TIME COMMAND
1 root 0:00 /tmp/gdn-init
19 root 0:00 ssh-agent
32 root 0:00 /opt/go/out /tmp/build/put
156 root 0:00 bash
167 root 0:00 ps aux
worker logs after cancelling:
{"timestamp":"2022-07-27T16:44:44.198530100Z","level":"debug","source":"worker","message":"worker.garden.garden-server.stop.stopping","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
...
{"timestamp":"2022-07-27T16:44:54.213367800Z","level":"info","source":"worker","message":"worker.garden.garden-server.stop.stopped","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
{"timestamp":"2022-07-27T16:44:54.215373100Z","level":"debug","source":"worker","message":"worker.garden.connection.closed","data":{"local_addr":{"IP":"127.0.0.1","Port":7777,"Zone":""},"remote_addr":{"IP":"127.0.0.1","Port":37256,"Zone":""},"session":"1.5"}}
{"timestamp":"2022-07-27T16:44:54.226306800Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.exited","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","id":"4ecd375f-6844-4f99-6128-6199b7f02bb3","session":"1.4.20","status":137}}
You can see the process was actually killed (v7.4 behaves the same), and the worker logs show that Garden stopped the container, but the IO was still running and outputting those `...` lines.
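A plausible explanation for this, shown with a small standalone Go program (my illustration of the mechanism, not Concourse or containerd code): a pipe reader only sees EOF when every process holding the write end has exited, so a backgrounded grandchild that inherited the container process's stdout keeps the runtime's copy loop, and therefore anything like IO().Wait(), blocked even after the direct child is gone.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		log.Fatal(err)
	}

	// "sh" exits almost immediately, but the backgrounded sleep inherits the
	// write end of the pipe as its stdout and keeps it open.
	cmd := exec.Command("sh", "-c", "echo started; sleep 30 &")
	cmd.Stdout = w
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	_ = cmd.Wait() // the direct child is gone at this point
	_ = w.Close()  // drop the parent's copy of the write end

	done := make(chan struct{})
	go func() {
		io.Copy(os.Stdout, r) // stand-in for the worker's stdio copy loop
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("copy finished (all writers closed)")
	case <-time.After(2 * time.Second):
		fmt.Println("copy still blocked: the grandchild holds the pipe open")
	}
}
```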
@chenbh I have attached a pipeline for reproducible steps and acceptance.
@xtremerui Re-running the PR pipeline for this PR. I think I've seen this happen in my own clusters as well; I had one build that got stuck in a "running" state similar to this. I'd like to try getting this in for the next release.
Changes proposed by this PR
This reverts commit cee8ff8.
closes #8462
Notes to reviewer
The root cause of the stuck build is that p.process.IO().Wait() will block on stdin/stdout, and in the pool resource's case stdout is occupied by the loop that acquires a pool lock.

It worked in v7.4.4; then we had the big refactor of atc/worker in v7.5. I'm not exactly sure which change caused it to stop working as before. I'm guessing it could also be some underlying change in containerd around how it closes the process IO pipes.
Anyway, cee8ff8 was meant to fix a race condition where IO was closed too early due to process deletion, which is a rare edge case IMO. Since the current problem could happen to any resource type that blocks stdout (and if that happens there is no way to finish the build), I think it is fair to revert it and see if we spot the race again.
We can merge it to master and deploy it to hush-house to fix the stuck-build issue, and keep an eye on the race condition on both CI and hush-house.
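To make the idea concrete, here is a hedged sketch of the kind of ordering this unblocks: stop letting IO().Wait() gate process deletion unconditionally, so a resource that keeps stdout busy cannot wedge the build forever. This is my assumption about the shape of the behaviour, not a copy of the reverted code; the function name and grace period are illustrative.

```go
// Sketch only: wait for exit, give stdio a bounded chance to drain, then delete.
package runtimesketch

import (
	"context"
	"time"

	"github.com/containerd/containerd"
)

func waitDeleteUnblocked(ctx context.Context, proc containerd.Process) (int, error) {
	statusC, err := proc.Wait(ctx)
	if err != nil {
		return 0, err
	}
	status := <-statusC // the process itself has exited (e.g. via SIGKILL)

	// Let the stdio copiers drain, but do not block deletion forever if
	// something (such as an inherited fd) still holds the pipes open.
	ioDone := make(chan struct{})
	go func() {
		proc.IO().Wait()
		close(ioDone)
	}()
	select {
	case <-ioDone:
	case <-time.After(10 * time.Second): // arbitrary grace period
	}

	if _, err := proc.Delete(ctx); err != nil {
		return 0, err
	}
	code, _, err := status.Result()
	return int(code), err
}
```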
Reproduce
I have created a temporary pool repo; the private_key is just a deploy key specific to that repo, so it is safe to post here.
Set and run the above pipeline in a local Concourse; without this PR, you should see that the get is still running after cancelling the build.
Release Note