Unblock process deletion in worker runtime #8479

Open

xtremerui wants to merge 1 commit into master
Conversation

xtremerui (Contributor) commented Jul 14, 2022

Changes proposed by this PR

This reverts commit cee8ff8.

closes #8462


Notes to reviewer

The root cause of the hang is that p.process.IO().Wait() blocks until stdin/stdout are closed, while in the pool resource case stdout is kept busy by the lock-acquisition retry loop.

It worked in v7.4.4. Then came the big refactor of atc/worker in v7.5. I'm not exactly sure which change caused it to stop working as before; my guess is that some underlying change in how containerd closes the process IO pipes could be involved too.

Anyway, cee8ff8 fixed a race condition where the IO was closed too early due to process deletion, which is a rare edge case IMO. Since the current problem can happen with any resource type that blocks stdout (and if that happens there is no way to finish the build), I think it is fair to revert it and see if we spot the race again.

We can merge this to master and deploy it to hush-house to fix the stuck-build issue, and keep an eye on the race condition on both CI and hush-house.
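To make the blocking concrete, here is a minimal, self-contained sketch of the mechanism. It uses plain os/exec as a stand-in for the containerd runtime, so the command, timings, and variable names are illustrative assumptions rather than the worker's actual code:

// Minimal stand-in for the blocking described above: the "IO wait" only
// returns once the child's stdout pipe reaches EOF, so a resource that keeps
// writing to stdout in its acquire loop keeps the wait blocked.
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"sync"
	"time"
)

func main() {
	// A child that loops "acquiring a lock" and never closes stdout,
	// roughly what a polling resource script looks like from the outside.
	cmd := exec.Command("sh", "-c", `while true; do echo "acquiring lock..."; sleep 1; done`)
	stdout, _ := cmd.StdoutPipe()
	_ = cmd.Start()

	var ioWait sync.WaitGroup
	ioWait.Add(1)
	go func() {
		defer ioWait.Done()
		io.Copy(os.Stdout, stdout) // returns only when the pipe is closed
	}()

	time.Sleep(3 * time.Second) // pretend the build was aborted here

	done := make(chan struct{})
	go func() { ioWait.Wait(); close(done) }()
	select {
	case <-done:
		fmt.Println("IO wait returned")
	case <-time.After(2 * time.Second):
		fmt.Println("IO wait is still blocked: the child still owns stdout")
	}

	// Cleanup: kill the child so the pipe closes and the program can exit.
	_ = cmd.Process.Kill()
	_ = cmd.Wait()
}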

Reproduce

I have created a temp pool repo; the private_key is just a deploy key specific to that repo only, so it is safe to post here.

resources:
- name: mypool
  type: pool
  source:
    uri: git@github.com:xtremerui/testing-pool.git
    branch: main
    pool: aws
    private_key: |
      -----BEGIN OPENSSH PRIVATE KEY-----
      b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
      QyNTUxOQAAACAKe1lrCMqne6UwpRxuW9zTs6c4GV+ikjtti2y9wX5pOQAAAKA45n/NOOZ/
      zQAAAAtzc2gtZWQyNTUxOQAAACAKe1lrCMqne6UwpRxuW9zTs6c4GV+ikjtti2y9wX5pOQ
      AAAEC8eU4mx2Htsm9D+1Nzzk+xaM6j5F8EbbbKi610xDNV7Ap7WWsIyqd7pTClHG5b3NOz
      pzgZX6KSO22LbL3Bfmk5AAAAFnlvdXJfZW1haWxAZXhhbXBsZS5jb20BAgMEBQYH
      -----END OPENSSH PRIVATE KEY-----

jobs:
- name: do-nothing
  plan:
  - put: mypool
    params: {acquire: true}

Set and run the above pipeline on a local Concourse; without this PR, you should see the get still running after cancelling the build.

Release Note

  • Fix a bug where aborting a build could get stuck if a pool resource is waiting for an available lock in a put step

This reverts commit cee8ff8.

Signed-off-by: Rui Yang <ruiya@vmware.com>
@xtremerui xtremerui added the bug label Jul 14, 2022
@xtremerui xtremerui requested a review from a team as a code owner July 14, 2022 16:49
chenbh (Contributor) commented Jul 26, 2022

Why is the process not exiting in the first place? I would expect aborting a build to cause a SIGINT or SIGKILL to be sent, so shouldn't the process exit by itself before we have to resort to deleting it directly?

I would feel a lot better if the process exits on its own instead of us having to kill the container directly.

@@ -48,14 +48,14 @@ func (p *Process) Wait() (int, error) {
		return 0, fmt.Errorf("proc closeio: %w", err)
	}

	p.process.IO().Wait()

	_, err = p.process.Delete(context.Background())
xtremerui (Contributor, Author)

@chenbh do you mean, why should we delete it here at all? My knowledge of Linux programming is limited here, to be honest; I'm not sure how it is supposed to work.

chenbh (Contributor)

My understanding of what happens is:

  1. User clicks the abort button
  2. Web node handles the request and the step context gets cancelled
  3. Cancelling the context invokes Stop(false) on the worker, which will try to gracefully kill the process
  4. The killer first tries to stop the process gracefully (by sending a SIGINT/SIGTERM), and if that doesn't work it sends a SIGKILL
  5. Even if the resource ignores the SIGINT/SIGTERM and refuses to self-terminate, it should then be killed by the SIGKILL
  6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because you know, the process is killed and the IO is closed)
  7. We can then call p.process.Delete() to free up any remaining OS resources reserved by the process

But from #8462, it looks like the resource is misbehaving by ignoring the SIGINT/SIGTERM, yet for some reason the SIGKILL wasn't issued, which allowed the process to continue executing and jam up the worker.
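A rough sketch of the stop-then-kill escalation described in steps 3–5, again with plain os/exec standing in for the worker's actual Stop logic (the function name, grace period, and test command are assumptions for illustration):

package main

import (
	"os/exec"
	"syscall"
	"time"
)

// stop sends a graceful signal first and escalates to SIGKILL if the process
// has not exited within the grace period.
func stop(cmd *exec.Cmd, grace time.Duration) error {
	_ = cmd.Process.Signal(syscall.SIGTERM) // polite request to exit

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err // exited on its own
	case <-time.After(grace):
		_ = cmd.Process.Kill() // SIGKILL cannot be ignored
		return <-done
	}
}

func main() {
	// A child that ignores SIGTERM, like a misbehaving resource script.
	cmd := exec.Command("sh", "-c", `trap "" TERM; while :; do sleep 1; done`)
	_ = cmd.Start()
	_ = stop(cmd, 3*time.Second) // returns once the SIGKILL lands
}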

xtremerui (Contributor, Author) commented Jul 27, 2022

The pool resource hasn't been changed for a long time; the change comes from the big refactor in Concourse v7.5.

6. Because the process has been killed, calling p.process.IO().Wait() eventually returns (because you know, the process is killed and the IO is closed)

The problem is with the last part. During my investigation, I cancelled the build while monitoring the processes in the container.
Before cancelling:

PID   USER     TIME  COMMAND
    1 root      0:00 /tmp/gdn-init
    7 root      0:00 {out} /bin/sh /opt/resource/out /tmp/build/put
   19 root      0:00 ssh-agent
   32 root      0:00 /opt/go/out /tmp/build/put
   49 root      0:00 bash
   60 root      0:00 ps aux

After cancelling, my ssh session got kicked out and I had to ssh back in to see:

PID   USER     TIME  COMMAND
    1 root      0:00 /tmp/gdn-init
   19 root      0:00 ssh-agent
   32 root      0:00 /opt/go/out /tmp/build/put
  156 root      0:00 bash
  167 root      0:00 ps aux

Worker logs after cancelling:

{"timestamp":"2022-07-27T16:44:44.198530100Z","level":"debug","source":"worker","message":"worker.garden.garden-server.stop.stopping","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
...
{"timestamp":"2022-07-27T16:44:54.213367800Z","level":"info","source":"worker","message":"worker.garden.garden-server.stop.stopped","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","session":"1.4.27"}}
{"timestamp":"2022-07-27T16:44:54.215373100Z","level":"debug","source":"worker","message":"worker.garden.connection.closed","data":{"local_addr":{"IP":"127.0.0.1","Port":7777,"Zone":""},"remote_addr":{"IP":"127.0.0.1","Port":37256,"Zone":""},"session":"1.5"}}
{"timestamp":"2022-07-27T16:44:54.226306800Z","level":"info","source":"worker","message":"worker.garden.garden-server.run.exited","data":{"handle":"2faeb592-4850-48e7-6969-680c045582ba","id":"4ecd375f-6844-4f99-6128-6199b7f02bb3","session":"1.4.20","status":137}}

You can see the process was actually killed (v7.4 behaves the same), and the worker logs show that garden stopped the container, but the IO was still running and kept producing output.
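What the ps output above suggests is that the direct child (the out shell, pid 7) died, but /opt/go/out (pid 32) survived and still held the write end of the stdout pipe, so the reader side never saw EOF. A small sketch of that effect, again with os/exec as a stand-in (the shell command is an illustrative assumption, not the resource's real behaviour):

package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

func main() {
	// The shell is the direct child; the backgrounded subshell is a grandchild
	// that inherits the write end of the shared stdout pipe.
	cmd := exec.Command("sh", "-c", `(while :; do echo tick; sleep 1; done) & wait`)
	stdout, _ := cmd.StdoutPipe()
	_ = cmd.Start()

	ioDone := make(chan struct{})
	go func() {
		io.Copy(os.Stdout, stdout) // returns only on EOF, i.e. once every writer has closed the pipe
		close(ioDone)
	}()

	time.Sleep(2 * time.Second)
	_ = cmd.Process.Kill() // kill the shell only; the grandchild keeps running

	select {
	case <-ioDone:
		fmt.Println("IO closed")
	case <-time.After(3 * time.Second):
		fmt.Println("shell is dead but the pipe is still open: the grandchild keeps writing")
	}
	// Note: this sketch leaves the orphaned grandchild behind; real cleanup
	// would have to kill the whole process group.
}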

xtremerui (Contributor, Author)

@chenbh I have attached a pipeline for reproducible steps and acceptance.

@taylorsilva taylorsilva added this to the v7.13.0 milestone Dec 5, 2024
taylorsilva (Member)

@xtremerui Re-running the PR pipeline for this PR. I think I've seen this happen in my own clusters as well: I had one build that got stuck in a "running" state similar to this. I'd like to try getting this in for the next release.

Successfully merging this pull request may close these issues.

Aborting build does nothing when build runs in put step for pool resource