-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
seccomp: add default profile #660
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,30 @@ | ||
# Seccomp | ||
|
||
- [Abstract](#abstract) | ||
- [Motivation](#motivation) | ||
- [Constraints and Assumptions](#constraints-and-assumptions) | ||
- [Use Cases](#use-cases) | ||
* [Use Case: Administrator Access Control](#use-case--administrator-access-control) | ||
* [Use Case: Seccomp profiles similar to container runtime defaults](#use-case--seccomp-profiles-similar-to-container-runtime-defaults) | ||
* [Use Case: Applications that link to libseccomp](#use-case--applications-that-link-to-libseccomp) | ||
* [Use Case: Custom profiles](#use-case--custom-profiles) | ||
- [Community Work](#community-work) | ||
* [Docker / OCI](#docker---oci) | ||
* [rkt / appcontainers](#rkt---appcontainers) | ||
* [HyperContainer](#hypercontainer) | ||
* [lxd](#lxd) | ||
* [Other platforms and seccomp-like capabilities](#other-platforms-and-seccomp-like-capabilities) | ||
- [Proposed Design](#proposed-design) | ||
* [Seccomp API Resource?](#seccomp-api-resource-) | ||
* [Pod Security Policy annotation](#pod-security-policy-annotation) | ||
* [Spec](#spec) | ||
* [Default Profile](#default-profile) | ||
+ [Various Syscalls Not Allowed](#various-syscalls-not-allowed) | ||
+ [Default Behavior](#default-behavior) | ||
- [Examples](#examples) | ||
* [Unconfined profile](#unconfined-profile) | ||
* [Custom profile](#custom-profile) | ||
|
||
## Abstract | ||
|
||
A proposal for adding **alpha** support for | ||
|
@@ -28,6 +55,7 @@ This design should: | |
* be container-runtime agnostic | ||
* allow use of custom profiles | ||
* facilitate containerized applications that link directly to libseccomp | ||
* enable a default seccomp profile for containers | ||
|
||
## Use Cases | ||
|
||
|
@@ -40,14 +68,16 @@ This design should: | |
unmediated by Kubernetes | ||
4. As a user, I want to be able to use a custom seccomp profile and use | ||
it with my containers | ||
5. As a user and administrator I want kubernetes to apply a sane default | ||
seccomp profile to containers unless I otherwise specify. | ||
|
||
### Use Case: Administrator access control | ||
### Use Case: Administrator Access Control | ||
|
||
Controlling access to seccomp profiles is a cluster administrator | ||
concern. It should be possible for an administrator to control which users | ||
have access to which profiles. | ||
|
||
The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893) | ||
The [Pod Security Policy](https://github.com/kubernetes/kubernetes/pull/7893) | ||
API extension governs the ability of users to make requests that affect pod | ||
and container security contexts. The proposed design should deal with | ||
required changes to control access to new functionality. | ||
|
@@ -101,9 +131,7 @@ implement a sandbox for user-provided code, such as | |
|
||
## Community Work | ||
|
||
### Container runtime support for seccomp | ||
|
||
#### Docker / opencontainers | ||
### Docker / OCI | ||
|
||
Docker supports the open container initiative's API for | ||
seccomp, which is very close to the libseccomp API. It allows full | ||
|
@@ -112,14 +140,21 @@ specification of seccomp filters, with arguments, operators, and actions. | |
Docker allows the specification of a single seccomp filter. There are | ||
community requests for: | ||
|
||
Issues: | ||
|
||
* [docker/22109](https://github.com/docker/docker/issues/22109): composable | ||
seccomp filters | ||
* [docker/21105](https://github.com/docker/docker/issues/22105): custom | ||
seccomp filters for builds | ||
|
||
#### rkt / appcontainers | ||
Implementation details: | ||
|
||
* [docker/17989](https://github.com/moby/moby/pull/17989): initial | ||
implementation | ||
* [docker/18780](https://github.com/moby/moby/pull/18780): default blacklist | ||
profile | ||
* [docker/18979](https://github.com/moby/moby/pull/18979): default whitelist | ||
profile | ||
|
||
### rkt / appcontainers | ||
|
||
The `rkt` runtime delegates to systemd for seccomp support; there is an open | ||
issue to add support once `appc` supports it. The `appc` project has an open | ||
|
@@ -133,23 +168,24 @@ Issues: | |
* [appc/529](https://github.com/appc/spec/issues/529) | ||
* [rkt/1614](https://github.com/coreos/rkt/issues/1614) | ||
|
||
#### HyperContainer | ||
### HyperContainer | ||
|
||
[HyperContainer](https://hypercontainer.io) does not support seccomp. | ||
|
||
### Other platforms and seccomp-like capabilities | ||
|
||
FreeBSD has a seccomp/capability-like facility called | ||
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4). | ||
|
||
#### lxd | ||
### lxd | ||
|
||
[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile. | ||
|
||
Issues: | ||
|
||
* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp | ||
|
||
### Other platforms and seccomp-like capabilities | ||
|
||
FreeBSD has a seccomp/capability-like facility called | ||
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4). | ||
|
||
|
||
## Proposed Design | ||
|
||
### Seccomp API Resource? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think your proposal alters this text about API resources a bit, and I wonder if this ought to be a new proposal (as we have tended to do in the past when making proposals for new additions to a feature). So, we should probably either add some new text here that continues the narrative about what API resource we're adding, or split your net-new changes into a separate follow-on proposal. |
||
|
@@ -168,8 +204,6 @@ Instead of implementing a new API resource, we propose that pods be able to | |
reference seccomp profiles by name. Since this is an alpha feature, we will | ||
use annotations instead of extending the API with new fields. | ||
|
||
### API changes? | ||
|
||
In the alpha version of this feature we will use annotations to store the | ||
names of seccomp profiles. The keys will be: | ||
|
||
|
@@ -214,6 +248,192 @@ profiles using the key | |
`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this | ||
key should be a comma delimited list. | ||
|
||
### Spec | ||
|
||
We will start from the OCI specification. This API resource will be added to | ||
`settings.k8s.io` as an `alpha` resource. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just put this as a place holder but I have no idea where it belongs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also just added a spec but I am aware it needs more info with interaction with pods/containers and obviously examples but wanted to get early feedback, can add the rest as I get some time |
||
|
||
``` | ||
// Seccomp represents syscall restrictions | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not clear whether this is global or namespaced - seems like namespaced? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wdyt should it be namespaced? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Kinda feels a lot like storage classes which are not namespaced. @pmorie what makes it feel like namespaced? Multi-tenancy? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have no preference on namespaced or un-namespaced :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would vote for non-namespaced, to match PodSecurityPolicy. It seems like something a ClusterAdmin should have control over... defining seccomp policies and then controlling who has permission to use them. If an admin of a namespace has permission to create (and then use) new seccomp policies, they could potentially use known vulnerabilities in a multi-tenant cluster, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible for resources to be both namespaced & unnamespaced? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
no |
||
type Seccomp struct { | ||
unversioned.TypeMeta | ||
ObjectMeta | ||
|
||
// +optional | ||
Spec SeccompSpec | ||
} | ||
|
||
// SeccompSpec represents the spec for syscall restrictions | ||
type SeccompSpec struct { | ||
DefaultAction Action `json:"defaultAction"` | ||
Architectures []Arch `json:"architectures,omitempty"` | ||
Syscalls []Syscall `json:"syscalls,omitempty"` | ||
} | ||
|
||
// Arch used for additional architectures | ||
type Arch string | ||
|
||
// Additional architectures permitted to be used for system calls | ||
// By default only the native architecture of the kernel is permitted | ||
const ( | ||
ArchX86 Arch = "SCMP_ARCH_X86" | ||
ArchX86_64 Arch = "SCMP_ARCH_X86_64" | ||
ArchX32 Arch = "SCMP_ARCH_X32" | ||
ArchARM Arch = "SCMP_ARCH_ARM" | ||
ArchAARCH64 Arch = "SCMP_ARCH_AARCH64" | ||
ArchMIPS Arch = "SCMP_ARCH_MIPS" | ||
ArchMIPS64 Arch = "SCMP_ARCH_MIPS64" | ||
ArchMIPS64N32 Arch = "SCMP_ARCH_MIPS64N32" | ||
ArchMIPSEL Arch = "SCMP_ARCH_MIPSEL" | ||
ArchMIPSEL64 Arch = "SCMP_ARCH_MIPSEL64" | ||
ArchMIPSEL64N32 Arch = "SCMP_ARCH_MIPSEL64N32" | ||
ArchPPC Arch = "SCMP_ARCH_PPC" | ||
ArchPPC64 Arch = "SCMP_ARCH_PPC64" | ||
ArchPPC64LE Arch = "SCMP_ARCH_PPC64LE" | ||
ArchS390 Arch = "SCMP_ARCH_S390" | ||
ArchS390X Arch = "SCMP_ARCH_S390X" | ||
ArchPARISC Arch = "SCMP_ARCH_PARISC" | ||
ArchPARISC64 Arch = "SCMP_ARCH_PARISC64" | ||
) | ||
|
||
// SeccompAction taken upon Seccomp rule match | ||
type SeccompAction string | ||
|
||
// Define actions for Seccomp rules | ||
const ( | ||
ActKill SeccompAction = "SCMP_ACT_KILL" | ||
ActTrap SeccompAction = "SCMP_ACT_TRAP" | ||
ActErrno SeccompAction = "SCMP_ACT_ERRNO" | ||
ActTrace SeccompAction = "SCMP_ACT_TRACE" | ||
ActAllow SeccompAction = "SCMP_ACT_ALLOW" | ||
) | ||
|
||
// SeccompOperator used to match syscall arguments in Seccomp | ||
type SeccompOperator string | ||
|
||
// Define operators for syscall arguments in Seccomp | ||
const ( | ||
OpNotEqual SeccompOperator = "SCMP_CMP_NE" | ||
OpLessThan SeccompOperator = "SCMP_CMP_LT" | ||
OpLessEqual SeccompOperator = "SCMP_CMP_LE" | ||
OpEqualTo SeccompOperator = "SCMP_CMP_EQ" | ||
OpGreaterEqual SeccompOperator = "SCMP_CMP_GE" | ||
OpGreaterThan SeccompOperator = "SCMP_CMP_GT" | ||
OpMaskedEqual SeccompOperator = "SCMP_CMP_MASKED_EQ" | ||
) | ||
|
||
// SeccompArg used for matching specific syscall arguments in Seccomp | ||
type SeccompArg struct { | ||
Index uint `json:"index"` | ||
Value uint64 `json:"value"` | ||
ValueTwo uint64 `json:"valueTwo"` | ||
Op SeccompOperator `json:"op"` | ||
} | ||
|
||
// Syscall is used to match a syscall in Seccomp | ||
type Syscall struct { | ||
Names []string `json:"names"` | ||
Action SeccompAction `json:"action"` | ||
Args []SeccompArg `json:"args,omitempty"` | ||
} | ||
``` | ||
|
||
### Default Profile | ||
|
||
We will create our own default seccomp profile that uses the above spec | ||
for containers and use the set of syscalls from the docker default profile | ||
as the initial base. Having our own will allow us to control and | ||
restrict different syscalls in the future. | ||
|
||
#### Various Syscalls Not Allowed | ||
|
||
Below includes a table of some of the syscalls we will not allow in our | ||
whitelist and why. It does not include all the syscalls but merely some | ||
important ones. Most of this was taken from the | ||
[original pull request](https://github.com/moby/moby/pull/19059) to Docker | ||
for the default profile. | ||
|
||
| Syscall | Description | | ||
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------| | ||
| `acct` | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. | | ||
| `add_key` | Prevent containers from using the kernel keyring, which is not namespaced. | | ||
| `adjtimex` | Similar to `clock_settime` and `settimeofday`, time/date is not namespaced. Also gated by `CAP_SYS_TIME`. | | ||
| `bpf` | Deny loading potentially persistent bpf programs into kernel, already gated by `CAP_SYS_ADMIN`. | | ||
| `clock_adjtime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. | | ||
| `clock_settime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. | | ||
| `clone` | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for CLONE_* flags, except `CLONE_USERNS`. | | ||
| `create_module` | Deny manipulation and functions on kernel modules. Obsolete. Also gated by `CAP_SYS_MODULE`. | | ||
| `delete_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. | | ||
| `finit_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. | | ||
| `get_kernel_syms` | Deny retrieval of exported kernel and module symbols. Obsolete. | | ||
| `get_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. | | ||
| `init_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. | | ||
| `ioperm` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. | | ||
| `iopl` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. | | ||
| `kcmp` | Restrict process inspection capabilities, already blocked by dropping `CAP_PTRACE`. | | ||
| `kexec_file_load` | Sister syscall of `kexec_load` that does the same thing, slightly different arguments. Also gated by `CAP_SYS_BOOT`. | | ||
| `kexec_load` | Deny loading a new kernel for later execution. Also gated by `CAP_SYS_BOOT`. | | ||
| `keyctl` | Prevent containers from using the kernel keyring, which is not namespaced. | | ||
| `lookup_dcookie` | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by `CAP_SYS_ADMIN`. | | ||
| `mbind` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. | | ||
| `mount` | Deny mounting, already gated by `CAP_SYS_ADMIN`. | | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably being dense, but what do you mean as "gated by"? Will the gates allow mount for instance? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. meaning it is also controlled by |
||
| `move_pages` | Syscall that modifies kernel memory and NUMA settings. | | ||
| `name_to_handle_at` | Sister syscall to `open_by_handle_at`. Already gated by `CAP_SYS_NICE`. | | ||
| `nfsservctl` | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. | | ||
| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. | | ||
| `perf_event_open` | Tracing/profiling syscall, which could leak a lot of information on the host. | | ||
| `personality` | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. | | ||
| `pivot_root` | Deny `pivot_root`, should be privileged operation. | | ||
| `process_vm_readv` | Restrict process inspection capabilities, already blocked by dropping `CAP_PTRACE`. | | ||
| `process_vm_writev` | Restrict process inspection capabilities, already blocked by dropping `CAP_PTRACE`. | | ||
| `ptrace` | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping `CAP_PTRACE`. | | ||
| `query_module` | Deny manipulation and functions on kernel modules. Obsolete. | | ||
| `quotactl` | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. | | ||
| `reboot` | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. | | ||
| `request_key` | Prevent containers from using the kernel keyring, which is not namespaced. | | ||
| `set_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. | | ||
| `setns` | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. | | ||
| `settimeofday` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. | ||
| `socket`, `socketcall` | Used to send or receive packets and for other socket operations. All `socket` and `socketcall` calls are blocked except communication domains `AF_UNIX`, `AF_INET`, `AF_INET6`, `AF_NETLINK`, and `AF_PACKET`. | | ||
| `stime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. | | ||
| `swapon` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. | | ||
| `swapoff` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. | | ||
| `sysfs` | Obsolete syscall. | | ||
| `_sysctl` | Obsolete, replaced by /proc/sys. | | ||
| `umount` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. | | ||
| `umount2` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. | | ||
| `unshare` | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. | | ||
| `uselib` | Older syscall related to shared libraries, unused for a long time. | | ||
| `userfaultfd` | Userspace page fault handling, largely needed for process migration. | | ||
| `ustat` | Obsolete syscall. | | ||
| `vm86` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. | | ||
| `vm86old` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. | | ||
|
||
#### Default Behavior | ||
|
||
For `privileged` containers, no default seccomp profile will be used unless | ||
explicitly requested by the user via annotations. | ||
|
||
If `capAdd` is used on a Container, the default profile will be adjusted to | ||
interact accordingly with the capability added. These are documented below in | ||
a table by the cap being added: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some of these capabilities are in the default set (e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh, the defaults would stay the same this was merely a reference. |
||
|
||
| Capability | Syscalls Allowed | | ||
|----------------------|--------------------------------------------------------------| | ||
| `CAP_CHOWN` | chown, chown32, fchown, fchown32, fchownat, lchown, lchown32 | | ||
| `CAP_DAC_READ_SEARCH`| open_by_handle_at | | ||
| `CAP_IPC_LOCK` | mlock, mlock2, mlockall | | ||
| `CAP_SYS_ADMIN` | name_to_handle_at, bpf, clone, fanotify_init, lookup_dcookie, mount, perf_event_open, setdomainname, sethostname, setns, umount, umount2, unshare | | ||
| `CAP_SYS_BOOT` | reboot | | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the only one that I would be raise a concern about. Do we have a usecase for reboot? But maybe I am overthinking it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's been in docker's default and there have been no complaints, do you reboot in a container, where is your cause for concern, just curious There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this reboot the node or the container? And frankly I have never typed reboot inside a container. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's referring to the reboot syscall, which is not for containers because they aren't real :) http://man7.org/linux/man-pages/man2/reboot.2.html |
||
| `CAP_SYS_CHROOT` | chroot | | ||
| `CAP_SYS_MODULE` | delete_module, init_module, finit_module, query_module | | ||
| `CAP_SYS_PACCT` | acct | | ||
| `CAP_SYS_PTRACE` | kcmp, process_vm_readv, process_vm_writev, ptrace | | ||
| `CAP_SYS_RAWIO` | iopl, ioperm | | ||
| `CAP_SYS_TIME` | settimeofday, stime, adjtimex, clock_settime | | ||
| `CAP_SYS_TTY_CONFIG` | vhangup | | ||
|
||
|
||
## Examples | ||
|
||
### Unconfined profile | ||
|
@@ -260,7 +480,3 @@ spec: | |
- name: test-volume | ||
emptyDir: {} | ||
``` | ||
|
||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> | ||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]() | ||
<!-- END MUNGE: GENERATED_ANALYTICS --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good stuff; you may want to split the updates to the content out into a separate PR, since they're non-controversial and can easily be merged without depending on the new content here being accepted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya I was unsure what the protocol was for when things change but can totally split it out :)