forked from kubernetes/kubernetes
Merge pull request kubernetes#33951 from pmorie/selinux-overhaul
Automatic merge from submit-queue

Proposal: SELinux enhancements

TLDR: Try to make SELinux support better by not requiring Kubelet directory to be labeled with an SELinux type usable from the container.

cc @kubernetes/sig-node @yifan-gu
Showing 1 changed file with 247 additions and 0 deletions.
@@ -0,0 +1,247 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

## Abstract

Presents a proposal for enhancing the security of Kubernetes clusters using
SELinux and simplifying the implementation of SELinux support within the
Kubelet by removing the need to label the Kubelet directory with an SELinux
context usable from a container.

## Motivation

The current Kubernetes codebase relies upon the Kubelet directory being
labeled with an SELinux context usable from a container. This means that a
container escaping namespace isolation will be able to use any file within the
Kubelet directory without defeating kernel
[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
In order to limit the attack surface, we should enhance the Kubelet to relabel
any bind-mounts into containers with a usable SELinux context, without
depending on the Kubelet directory's SELinux context.

## Constraints and Assumptions

1. No API changes allowed
2. Behavior must be fully backward compatible
3. No new admission controllers - make incremental improvements without huge
   refactorings

## Use Cases

1. As a cluster operator, I want to avoid having to label the Kubelet
   directory with a label usable from a container, so that I can limit the
   attack surface available to a container escaping its namespace isolation
2. As a user, I want to run a pod without an SELinux context explicitly
   specified and be isolated using MCS (multi-category security) on systems
   where SELinux is enabled, so that the pods on each host are isolated from
   one another
3. As a user, I want to run a pod that uses the host IPC or PID namespace and
   want the system to do the right thing with regard to SELinux, so that no
   unnecessary relabel actions are performed

### Labeling the Kubelet directory

As previously stated, the current codebase relies on the Kubelet directory
being labeled with an SELinux context usable from a container. The Kubelet
uses the SELinux context of this directory to determine what SELinux context
`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
The problem with this is that it opens an attack surface to a container that
escapes its namespace isolation; such a container would be able to use any
file in the Kubelet directory without defeating kernel MAC.

### SELinux when no context is specified

When no SELinux context is specified, Kubernetes should just do the right
thing, where doing the right thing is defined as isolating pods with a
node-unique set of categories. Node-uniqueness means unique among the pods
scheduled onto the node. Long-term, we want to have a cluster-wide allocator
for MCS labels. Node-unique MCS labels are a good middle ground that is
possible without a large new feature.

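To make "node-unique" concrete, here is a minimal sketch of an allocator that hands out MCS category pairs no other pod on the node holds. The `mcsAllocator` type and its methods are hypothetical and exist only for illustration; the proposal itself relies on the container runtime to choose the label. The category range `c0` to `c1023` is assumed from the common targeted policy.

```go
package main

import (
	"errors"
	"fmt"
)

// maxCategories is the number of MCS categories (c0..c1023) assumed here;
// other policies may define a different range.
const maxCategories = 1024

// mcsAllocator is a hypothetical helper that tracks which MCS labels are
// currently held by pods on this node. It is not part of Kubernetes.
type mcsAllocator struct {
	inUse map[string]bool
}

func newMCSAllocator() *mcsAllocator {
	return &mcsAllocator{inUse: map[string]bool{}}
}

// Allocate returns the first unused label of the form "s0:cA,cB" with A < B.
func (a *mcsAllocator) Allocate() (string, error) {
	for c1 := 0; c1 < maxCategories; c1++ {
		for c2 := c1 + 1; c2 < maxCategories; c2++ {
			label := fmt.Sprintf("s0:c%d,c%d", c1, c2)
			if !a.inUse[label] {
				a.inUse[label] = true
				return label, nil
			}
		}
	}
	return "", errors.New("no free MCS labels on this node")
}

// Release returns a label to the pool once its pod is gone.
func (a *mcsAllocator) Release(label string) {
	delete(a.inUse, label)
}

func main() {
	alloc := newMCSAllocator()
	first, _ := alloc.Allocate()
	second, _ := alloc.Allocate()
	fmt.Println(first, second)
}
```

A real runtime would more likely pick categories at random rather than in order; the sequential scan is used here only to keep the sketch deterministic.
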
### SELinux and host IPC and PID namespaces

Containers in pods that use the host IPC or PID namespaces need access to
other processes and IPC mechanisms on the host. Therefore, these containers
should be run with the `spc_t` SELinux type by the container runtime. The
`spc_t` type is an unconfined type that other SELinux domains are allowed to
connect to. In the case where a pod uses one of these host namespaces, it
should be unnecessary to relabel the pod's volumes.

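The rule in this section can be written as a small predicate. This is only a sketch: `podNamespaces` is a hypothetical stand-in for the relevant pod fields, not the Kubernetes API type, and `needsVolumeRelabel` is a name invented for illustration.

```go
package main

import "fmt"

// podNamespaces is a hypothetical stand-in for the pod fields that matter
// here; it is not the Kubernetes API type.
type podNamespaces struct {
	HostIPC bool
	HostPID bool
}

// needsVolumeRelabel reports whether a pod's volumes should be relabeled.
// Pods that use the host IPC or PID namespace run as spc_t, which is
// unconfined, so relabeling is skipped for them.
func needsVolumeRelabel(selinuxEnabled bool, pod podNamespaces) bool {
	if !selinuxEnabled {
		return false
	}
	return !pod.HostIPC && !pod.HostPID
}

func main() {
	fmt.Println(needsVolumeRelabel(true, podNamespaces{}))
	fmt.Println(needsVolumeRelabel(true, podNamespaces{HostPID: true}))
}
```
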
## Analysis

### Libcontainer SELinux library

Docker and rkt both use the libcontainer SELinux library. This library
provides a method, `GetLxcContexts`, that returns unique SELinux contexts for
container processes and the files used by them. `GetLxcContexts` reads the
base SELinux context information from a file at
`/etc/selinux/<policy-name>/contexts/lxc_contexts` and then adds a
process-unique MCS label.

Docker and rkt both leverage this call to determine the 'starting' SELinux
contexts for containers.

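The two steps described above (read the base contexts, then attach an MCS label) can be sketched as follows. This is not the libcontainer implementation: `parseLxcContexts` and `withMCS` are hypothetical helpers, and the sample file content is an assumption about the `key = "value"` layout, with values that vary by policy.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleLxcContexts is illustrative content only; real values depend on
// the installed SELinux policy.
const sampleLxcContexts = `
process = "system_u:system_r:svirt_lxc_net_t:s0"
file = "system_u:object_r:svirt_sandbox_file_t:s0"
`

// parseLxcContexts extracts the base process and file contexts from
// lxc_contexts-style content.
func parseLxcContexts(data string) (process, file string) {
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		value := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		switch key {
		case "process":
			process = value
		case "file":
			file = value
		}
	}
	return process, file
}

// withMCS replaces the level field (the fourth colon-separated field) of a
// context with the given MCS label.
func withMCS(context, mcs string) string {
	fields := strings.SplitN(context, ":", 4)
	if len(fields) < 4 {
		return context
	}
	fields[3] = mcs
	return strings.Join(fields, ":")
}

func main() {
	process, file := parseLxcContexts(sampleLxcContexts)
	fmt.Println(withMCS(process, "s0:c1,c2"))
	fmt.Println(withMCS(file, "s0:c1,c2"))
}
```
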
### Docker

Docker's behavior when no SELinux context is defined for a container is to
give the container a node-unique MCS label.

#### Sharing IPC namespaces

On the Docker runtime, the containers in a Kubernetes pod share the IPC and
PID namespaces of the pod's infra container.

Docker's behavior for containers sharing these namespaces is as follows: if a
container B shares the IPC namespace of another container A, container B is
given the SELinux context of container A. Therefore, for Kubernetes pods
running on Docker, in a vacuum the containers in a pod should have the same
SELinux context.

[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
the seccomp profile is set on a Docker container that shares the IPC namespace
of another container, that container will not receive the other container's
SELinux context.

#### Host IPC and PID namespaces

In the case of a pod that shares the host IPC or PID namespace, this flag is
simply ignored and the container receives the `spc_t` SELinux type. The
`spc_t` type is unconfined, and so no relabeling needs to be done for volumes
for these pods. Currently, however, there is code which relabels volumes into
explicitly specified SELinux contexts for these pods. This code is unnecessary
and should be removed.

#### Relabeling bind-mounts

Docker is capable of relabeling bind-mounts into containers using the `:Z`
bind-mount flag. However, in the current implementation of the Docker runtime
in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
contains an SELinux context. We could easily implement the correct behaviors
by always setting `:Z` on systems where SELinux is enabled.

### rkt

rkt's behavior when no SELinux context is defined for a pod is similar to
Docker's -- an SELinux context with a node-unique MCS label is given to the
containers of a pod.

#### Sharing IPC namespaces

Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
namespace by default.

#### Relabeling bind-mounts

Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
context.

#### Host IPC and PID namespaces

Using the host IPC and PID namespaces is not currently supported by rkt.

## Proposed Changes

### Refactor `pkg/util/selinux`

1. The `selinux` package should provide a method `SELinuxEnabled` that returns
   whether SELinux is enabled, and is built for all platforms (the
   libcontainer SELinux library is only built on Linux)
2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
   and be changed to have the same method names and signatures as the
   libcontainer methods its implementations wrap
3. The `SELinuxRunner` interface should have a new method added called
   `GetLxcContexts`; this should return a **shared** (i.e., without MCS labels)
   SELinux context usable by a container

```go
package selinux

// Note: the libcontainer SELinux package is only built for Linux, so it is
// necessary to have a NOP wrapper which is built for non-Linux platforms to
// allow code that links to this package not to differentiate its own methods
// for Linux and non-Linux platforms.
//
// SELinuxRunner wraps certain libcontainer SELinux calls. For more
// information, see:
//
// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
type SELinuxRunner interface {
	// Setfilecon sets the SELinux context for the given path or returns an
	// error.
	Setfilecon(path, context string) error

	// Getfilecon returns the SELinux context for the given path or returns an
	// error.
	Getfilecon(path string) (string, error)

	// GetLxcContexts returns the process and file SELinux contexts to use for
	// containers.
	GetLxcContexts() (string, string)
}
```

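As a rough illustration of the NOP wrapper mentioned in the comment above, the following sketch shows what a non-Linux build of the package might look like. The `nopRunner` type and `NewSELinuxRunner` constructor are assumed names, not taken from the proposal. In a real tree this file would sit behind a non-Linux build constraint, and the package name would be `selinux` rather than `main`.

```go
package main

import "fmt"

// SELinuxRunner mirrors the interface from the proposal so that this
// sketch is self-contained.
type SELinuxRunner interface {
	Setfilecon(path, context string) error
	Getfilecon(path string) (string, error)
	GetLxcContexts() (string, string)
}

// nopRunner is a hypothetical implementation for platforms without
// SELinux: every call succeeds and returns empty contexts.
type nopRunner struct{}

func (nopRunner) Setfilecon(path, context string) error  { return nil }
func (nopRunner) Getfilecon(path string) (string, error) { return "", nil }
func (nopRunner) GetLxcContexts() (string, string)       { return "", "" }

// SELinuxEnabled always reports false in the non-Linux build.
func SELinuxEnabled() bool { return false }

// NewSELinuxRunner is an assumed constructor name; callers use it without
// knowing which platform they were built for.
func NewSELinuxRunner() SELinuxRunner { return nopRunner{} }

func main() {
	runner := NewSELinuxRunner()
	process, file := runner.GetLxcContexts()
	fmt.Printf("enabled=%v process=%q file=%q\n", SELinuxEnabled(), process, file)
}
```
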
### Kubelet Changes

1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
   be removed
2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
   longer call `relabelVolumes`
3. The `makeHostsMount` method in `kubelet_pods.go` should set the
   `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`

### Changes to `pkg/kubelet/dockertools/`

1. The `makeMountBindings` method should be changed to:
   1. No longer accept the `podHasSELinuxLabel` parameter
   2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount
      has the `SELinuxRelabel` attribute set to `true`
2. The `runContainer` method should be changed to always use the `:Z`
   bind-mount flag on the termination message mount when SELinux is enabled

### Changes to `pkg/kubelet/rkt`

There should not be any required changes for the rkt runtime; we should test to
ensure things work as expected under rkt.

### Changes to volume plugins and infrastructure

1. The `VolumeHost` interface contains a method called `GetRootContext`; this
   is an artifact of the old assumptions about the Kubelet directory's SELinux
   context and can be removed
2. The `empty_dir.go` file should be changed to create an `SELinuxRunner` and
   call its `GetLxcContexts` method to determine the right SELinux context to
   give `tmpfs` mounts

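One way the `tmpfs` step could look is sketched below. It assumes the file context from `GetLxcContexts` is passed to the mount via the SELinux `context=` mount option; the proposal does not say which option is used, and `tmpfsMountOptions` is a hypothetical helper. The context value in `main` is an example only.

```go
package main

import "fmt"

// tmpfsMountOptions is a hypothetical helper that turns the shared file
// context into mount options for a tmpfs-backed EmptyDir. Using the
// "context=" option is an assumption of this sketch.
func tmpfsMountOptions(selinuxEnabled bool, fileContext string) []string {
	if !selinuxEnabled || fileContext == "" {
		return nil
	}
	return []string{"context=\"" + fileContext + "\""}
}

func main() {
	// Example value only; the real context would come from GetLxcContexts.
	opts := tmpfsMountOptions(true, "system_u:object_r:svirt_sandbox_file_t:s0")
	fmt.Println(opts)
}
```
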
### Changes to `pkg/controller/...`

The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
implementations. These should be altered to no longer include `GetRootContext`.
