This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

all processes in container receive SIGTERM when sending SIGTERM to rkt process #3512

Open
blalor opened this issue Jan 5, 2017 · 9 comments

Comments

@blalor commented Jan 5, 2017

Environment

rkt Version: 1.21.0
appc Version: 0.8.9
Go Version: go1.7.3
Go OS/Arch: linux/amd64
Features: -TPM +SDJOURNAL
--
Linux 4.9.0-1.el7.elrepo.x86_64 x86_64
--
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
--
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

What did you do?

With the attached file at /tmp/assassin.py:

rkt run \
    --debug \
    --insecure-options=ondisk,image \
    --mount volume=srv,target=/srv/ \
    --volume srv,kind=host,source=/tmp/ docker://python:2.7 \
    -- \
    /srv/assassin.py

In another window ¡ASSUMING THERE ARE NO OTHER CONTAINERS RUNNING!:

pkill --echo --full stage1/rootfs/usr/bin/systemd-nspawn

Wait for the container to exit. Then:

grep -h 'got signal' /tmp/assassin_*.log

What did you expect to see?

2017-01-05 23:19:16,066 pid:5 - got signal SIGTERM

What did you see instead?

2017-01-05 23:15:12,592 pid:10 - got signal SIGTERM
2017-01-05 23:15:12,593 pid:11 - got signal SIGTERM
2017-01-05 23:15:12,593 pid:12 - got signal SIGTERM
2017-01-05 23:15:12,594 pid:13 - got signal SIGTERM
2017-01-05 23:15:12,594 pid:14 - got signal SIGTERM
2017-01-05 23:15:12,595 pid:15 - got signal SIGTERM
2017-01-05 23:15:12,595 pid:16 - got signal SIGTERM
2017-01-05 23:15:12,596 pid:17 - got signal SIGTERM
2017-01-05 23:15:12,596 pid:18 - got signal SIGTERM
2017-01-05 23:15:12,597 pid:19 - got signal SIGTERM
2017-01-05 23:15:12,597 pid:5 - got signal SIGTERM

What just happened here?

assassin.py spawns 10 child processes. The parent and all the children write a log message whenever they receive SIGTERM, but they do not exit. The parent does not automatically propagate signals to its children (inhibited via os.setpgrp()). Therefore, sending SIGTERM to the parent process (or the process that spawned the container) should only result in a single log message being generated by the parent. This is in fact exactly what happens when you run assassin.py in a terminal and send the parent process SIGTERM from another terminal. rkt (or more likely systemd), on the other hand, sends SIGTERM to every single process in the container.
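
The attachment is not reproduced in the issue body, so here is a minimal sketch of what such a script could look like (a hypothetical reconstruction, not the actual assassin.py; the log format and file names are assumptions based on the output above):

#!/usr/bin/env python
# Hypothetical reconstruction of assassin.py: a parent and ten children each
# log SIGTERM to /tmp/assassin_<pid>.log and keep running; the parent never
# forwards the signal to its children.
import logging
import os
import signal
import time

def log_and_wait():
    logging.basicConfig(
        filename="/tmp/assassin_%d.log" % os.getpid(),
        level=logging.INFO,
        format="%(asctime)s pid:%(process)d - %(message)s",
    )
    # Log the signal, but do not exit.
    signal.signal(signal.SIGTERM,
                  lambda signum, frame: logging.info("got signal SIGTERM"))
    while True:
        time.sleep(1)

if __name__ == "__main__":
    for _ in range(10):
        if os.fork() == 0:    # child
            os.setpgrp()      # own process group: signals aimed at the parent's group never reach it
            log_and_wait()    # never returns
    log_and_wait()            # the parent just waits and logs as well

Run in a plain terminal, only the parent logs the signal when you SIGTERM it from another window; under rkt, as shown above, every process logs one.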

This makes it very difficult for a process which spawns children to shut down properly when the container is being shut down, especially when trying to wrangle an application whose source you don't directly control. When combined with #2870, it is impossible to implement any kind of processing after the main application (process, whatever) has exited on command from the rkt runtime.

@squeed (Contributor) commented Jan 9, 2017

This is probably happening because the default systemd KillMode is control-group. I wonder if setting it to mixed in the stage1 unit file would be the correct approach.
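
For reference, the knob in question lives in the [Service] section of the generated app unit in stage1; a sketch only, since the unit rkt currently generates (quoted later in this thread) does not set it:

[Service]
# control-group (the default) delivers the stop signal to every process in the unit's cgroup;
# mixed delivers SIGTERM only to the main process and SIGKILLs whatever is still alive after the stop timeout;
# process touches only the main process.
KillMode=mixed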

@lucab (Member) commented Jan 9, 2017

Possibly. But they would receive SIGTERM anyway: when the whole pod goes down (i.e. systemd-nspawn is killed), systemd-pid1 does a SIGTERM+SIGKILL round of its own. I don't see any easy way out of this, except for keeping the pod running à la rkt-app.

@blalor (Author) commented May 4, 2017

This issue has come up again for me. There's gotta be some kind of solution or workaround. I'm a big boy, I can handle my own signals, thankyouverymuch systemd. I really don't want to use Docker for my current problem. Or worse, not run in a container at all!

@blalor (Author) commented May 4, 2017

There's a generated systemd .service file in the stage1 rootfs that is (or appears to be) specifically for the single application that's been spawned:

[root@elasticsearch-0ed6f9c740fd9491c:/proc/6933/cwd/stage1/rootfs] # cat ./usr/lib64/systemd/system/elasticsearch.service
[Unit]
OnFailure=halt.target
Description=Application=elasticsearch Image=s3.amazonaws.com/example/apps/elasticsearch
DefaultDependencies=false
Wants=reaper-elasticsearch.service
Requires=sysusers.service
After=sysusers.service
Requires=prepare-app@-opt-stage2-elasticsearch-rootfs.service
After=prepare-app@-opt-stage2-elasticsearch-rootfs.service

[Service]
Restart=no
SyslogIdentifier=elasticsearch
StandardInput=null
StandardOutput=journal+console
StandardError=journal+console
TimeoutStartSec=0
ExecStart="/usr/local/bin/launch-elasticsearch.sh"
RootDirectory=/opt/stage2/elasticsearch/rootfs
WorkingDirectory=/
EnvironmentFile=/rkt/env/elasticsearch
User=0
Group=0
NoNewPrivileges=false
MemoryLimit=26000000000
CPUQuota=650%

Why can't that single service have KillMode set to mixed or process without impacting the pod as a whole? I'm sorry, I've never used rkt with multiple apps in a pod, so perhaps I'm missing a detail (or something bigger). I need time to orchestrate the shutdown of the application, but obviously more serious action needs to be taken if the timeout's exceeded.
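
For concreteness, a drop-in along these lines is what I have in mind (the path, timeout value, and whether rkt's stage1 would tolerate it are all assumptions on my part, not something I've tested):

# /usr/lib64/systemd/system/elasticsearch.service.d/10-killmode.conf (illustrative)
[Service]
KillMode=mixed       # SIGTERM goes only to the main process, so it can orchestrate its own shutdown
TimeoutStopSec=600   # anything still running after this is SIGKILLed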

@lucab (Member) commented May 5, 2017

@blalor is this a pod with a single app inside? My understanding is that your root problem comes from the pod also being torn down with this single application.

@blalor (Author) commented May 5, 2017

Yes, it's a single-application pod. I need the application to shut down in a controlled fashion, whereby the process launched by systemd is able to initiate cleanup and then notify child processes to exit. I can't do that if the child processes are told to terminate by systemd (and I'm unable to inhibit the child processes' SIGTERM handling).

This is the second time I've run into this problem, which revolves around managing data for a stateful application. I'm currently attempting to get an Elasticsearch node to remove itself from the cluster by deallocating shards on shutdown. A wrapper script is responsible for starting and stopping the main ES process; when SIGTERM is received, it updates the cluster state to move shards away from the terminating node, waits for the shards to be relocated, and then sends SIGTERM to the main ES process.
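
Roughly, the wrapper does something like the following (a simplified sketch, not the actual launch-elasticsearch.sh; the ES binary path, endpoints, and drain criterion are placeholders):

#!/bin/sh
# Start the real ES process, then drain shards and stop it when SIGTERM arrives.
/usr/share/elasticsearch/bin/elasticsearch &
ES_PID=$!

drain_and_stop() {
    # Ask the cluster to move shards off this node.
    curl -s -XPUT localhost:9200/_cluster/settings \
        -d '{"transient":{"cluster.routing.allocation.exclude._name":"'"$(hostname)"'"}}'
    # Wait until nothing is relocating any more, then stop the real ES process.
    while curl -s localhost:9200/_cluster/health | grep -q '"relocating_shards":[^0]'; do
        sleep 5
    done
    kill -TERM "$ES_PID"
}

trap drain_and_stop TERM
wait "$ES_PID"   # returns early when the trap fires...
wait "$ES_PID"   # ...so wait again for ES itself to exit

None of this works if systemd has already delivered SIGTERM to the ES child directly, which is exactly the problem described in this issue.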

@lucab (Member) commented May 5, 2017

I'm not sure if this works, but you may as well try: instead of stopping the pod, just enter the running stage1 (via nsenter on the systemd-pid1 process) and do a systemctl kill or a plain kill on the parent service. This should let the parent do whatever it needs to handle children and then exit. I understand this is quite dirty, but it is actually how we are planning to implement rkt signal for #1496.
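
Untested, but the shape of it would be something like this (how you locate the stage1 systemd pid and the exact unit name are up to you):

# $STAGE1_SYSTEMD_PID is the host pid of the pod's systemd, i.e. the child of the
# systemd-nspawn process for this pod; the unit name matches the app name.
nsenter --target "$STAGE1_SYSTEMD_PID" --mount --uts --ipc --net --pid \
    systemctl kill --kill-who=main --signal=SIGTERM elasticsearch.service
# --kill-who=main signals only the unit's main process and leaves its children to it.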

@blalor (Author) commented May 5, 2017

I'm working with containers scheduled via Nomad; that workaround isn't a viable production solution. I can test it on Monday against a running pod if you're looking for verification of the final config of the unit, but it's not something I can entertain in a production scenario.

@fabiokung (Contributor) commented:

I started working on custom KillModes on #3732
