Pod-level security context proposal #12823
How could I forget @thockin

GCE e2e build/test passed for commit 769ea856800b247d06e7cc0623405e5ddaea2642.

That's what I thought @k8s-bot 👊

GCE e2e build/test passed for commit 67b8fbe168aa677e9c830a7535900f8b59ecc97f.

GCE e2e build/test passed for commit 24be8ad87e50bb4611c725679d01e6978c061241.
```go
Privileged     *bool
SELinuxOptions *SELinuxOptions
RunAsUser      *int64
RunAsNonRoot   bool
```
Why are there two flags for non-root? Can we get away with just one?
Good question -- they are two different facets: `RunAsUser` is the specific UID that is requested; `RunAsNonRoot` is to ensure that, when the UID is delegated to what's declared in the image metadata, the image doesn't want to run as root. If I had included the comments on those fields it would probably be clearer.
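For illustration, a minimal sketch of how the two fields might be commented to disambiguate them (the field names come from the diff above; the comment wording paraphrases the explanation here and is not the proposal's actual text):

```go
type ContainerSecurityContext struct {
	// RunAsUser, when set, is the specific UID the container process runs as.
	// When nil, the UID is delegated to what the image metadata declares.
	RunAsUser *int64

	// RunAsNonRoot does not select a UID. It only asserts that whatever UID
	// ends up being used -- including one taken from the image metadata --
	// must not be root, so an image that wants to run as UID 0 is rejected.
	RunAsNonRoot bool
}
```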
Comments and names might make this clearer.
@thockin @bgrant0607 What's the backward-compat story for this going to be?

We must support the current v1 API for the foreseeable future.
#### Container runtime changes

The docker and rkt `Runtime` implementations should be changed to set the group of each container
cc @dchen1107 @vishh
I like it. As when we introduced SecurityContext, we need to continue to support the current fields (except anything added since 1.0, such as hostIPC). Any duplicated fields need to be set from each other in conversion (it can't be done in defaulting because that causes problems for updates).

ContainerDefaults in the PodSecurityContext is problematic. If we copy it to ContainerSecurityContext of each container, that causes problems for updates -- we'd at least have to do something special. If we don't copy it, then we break v1 API clients expecting to be able to read the securityContext on each container.

@bgrant0607 Agree it is problematic. IMO backward compat is going to be the trickiest aspect of actually coding this. By 'something special', I imagine you mean that for updates, we will have to project the container security contexts into the container defaults specified at the pod level. Accurate?
In addition to updates, we also need to think about reading versions of the resource from etcd that were written prior to the field being added. It's general API policy to explicitly expand all default values so that clients and other code dealing with the API don't all need to reason about implicit default behaviors.

However, if a user originally set just one field that was copied to another field and then tried to update the original field, the two fields would then not match. What we'd really like is spreadsheet-like behavior, where updating any equivalent field updates the others automagically. This doesn't exactly match what we do for defaulting, since that doesn't touch fields that are already set. It is possible to do during conversion, by treating one field in the internal representation as the source and others as mirrors. If/when we eliminate the internal representation we'd need to find another solution.

The other problem is how to tell which field was changed in the case that the two fields are set inconsistently. If PATCH, it should be straightforward. If PUT, then we'd need to diff the resource with the previous version.

With ContainerDefaults, it's even harder. If we copied ContainerDefaults into the ContainerSecurityContext of each container that didn't specify SecurityContext, we probably should not allow ContainerDefaults to be updated, since doing so would have no effect unless the user explicitly nulled the ContainerSecurityContext fields again. In this case, maybe we need to assume that anyone writing PodSecurityContext knows what they're doing (e.g., don't write PodSecurityContext and then just update the deprecated HostNetwork field without updating PodSecurityContext.HostNetwork), and we should just try to make reading the existing container-level securityContext work in a backward-compatible manner.

The other alternative is that we just make this better in the v2 API, and comment which fields are relevant to pod-level security. PodSecurityContext doesn't address host ports or host paths, either. Is this really worth the effort now?
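A minimal sketch of the source-and-mirrors idea described above, done in conversion rather than defaulting. The type and function names here are stand-ins invented for illustration; the real converter would live in the API machinery:

```go
// Stand-ins for the internal and v1 types (illustrative only).
type PodSecurityContext struct {
	HostNetwork, HostPID, HostIPC bool
}

type InternalPodSpec struct {
	SecurityContext *PodSecurityContext
}

type V1PodSpec struct {
	// Legacy top-level fields that clients unaware of PodSecurityContext read.
	HostNetwork, HostPID, HostIPC bool
}

// convertToV1 treats the internal PodSecurityContext as the source of truth
// and writes the deprecated top-level v1 fields as mirrors, so existing v1
// clients keep seeing the values they expect. Doing this in conversion
// rather than defaulting avoids the update problems described above.
func convertToV1(in *InternalPodSpec, out *V1PodSpec) {
	if in.SecurityContext != nil {
		out.HostNetwork = in.SecurityContext.HostNetwork
		out.HostPID = in.SecurityContext.HostPID
		out.HostIPC = in.SecurityContext.HostIPC
	}
}
```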
two additions:

1. The `RunAsGroup` field specifies the GID the container process runs as
2. The `RunWithSupplementalGroups` field specifies additional groups the container process should
What are the use cases for this? Is kubelet expected to verify that all the supplemental groups exist on the system?
@vishh The use-case is to enable sharing volumes via supplemental groups. The supplemental groups don't need to exist in `/etc/group` inside or outside the container -- since they're just IDs, not strings that have to be resolved by looking at the group file. I'm preparing another PR for a volumes proposal that will go in-depth into their use.
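To make the "just IDs" point concrete: a process can be started with supplemental GIDs that appear in no group file at all. A minimal Go sketch (the binary path and the specific IDs are made up):

```go
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/myapp") // hypothetical container entrypoint
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Credential: &syscall.Credential{
			Uid: 1001,
			Gid: 1001,
			// Supplemental groups are applied as raw numeric GIDs; the
			// kernel never consults /etc/group to resolve them.
			Groups: []uint32{2000, 2001},
		},
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```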
Ok. Looking forward to your proposal. Kubelet might end up using gids for disk management internally. To avoid conflicts, we might have to reserve ranges of gids.
@vishh That's basically what I'm going to propose.
Opened #12944 for volume stuff
I don't know that supplemental groups need to be part of the API. If RunAsGroup is specified in the pod's security context then the volumes can be owned by that GID. If it is not specified then the kubelet should generate and use supplemental group IDs in the background to solve the volume permissions problem. This has the added benefit of being able to solve the volume permissions problem without a need for API changes :)
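A sketch of what the generated-GID idea could look like inside the kubelet. The `allocateGID` helper, the GID range, and the volume path are all hypothetical; the actual mechanics would belong to the volumes proposal:

```go
package main

import (
	"fmt"
	"os"
)

// allocateGID stands in for a hypothetical allocator, e.g. one drawing from
// a GID range reserved for the kubelet as discussed above.
func allocateGID() int { return 20000 }

// prepareSharedVolume makes a shared volume group-owned by a generated GID.
// The kubelet would then inject the same GID as a supplemental group for
// every container in the pod, so all of them can write to the volume.
func prepareSharedVolume(volumePath string) (int, error) {
	gid := allocateGID()
	// -1 leaves the file owner unchanged; only the group is set.
	if err := os.Chown(volumePath, -1, gid); err != nil {
		return 0, err
	}
	// The setgid bit makes files created in the volume inherit the group.
	if err := os.Chmod(volumePath, 0770|os.ModeSetgid); err != nil {
		return 0, err
	}
	return gid, nil
}

func main() {
	gid, err := prepareSharedVolume("/var/lib/kubelet/volumes/shared") // hypothetical path
	if err != nil {
		panic(err)
	}
	fmt.Println("volume prepared with GID", gid)
}
```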
Hrm - if there is a use case for common supplemental groups across different pods on shared volumes we'll still need a field.
I did not think of ContainerDefaults as at all the same as API defaults for non-specified values. I saw PodSecuritySpec and ContainerSecuritySpec as a late-binding base and overlay. In that sense, any pod that does not have a ContainerDefaults field (i.e. everything today) will continue to operate as it does today. Adding a ContainerDefaults would change the behavior ONLY of things that were not overridden by the container. Updating the ContainerDefaults on a running pod would probably result in all of the containers being restarted (that's easy; we could maybe be smarter).
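For reference, a minimal sketch of the base-and-overlay resolution described here, using the ContainerSecurityContext fields from the diff. The helper name and the nil-pointer convention for "unset" are assumptions, not the proposal's actual code:

```go
type SELinuxOptions struct{ /* elided */ }

type ContainerSecurityContext struct {
	Privileged     *bool
	SELinuxOptions *SELinuxOptions
	RunAsUser      *int64
	// A plain bool like RunAsNonRoot has no unset state, which is one
	// reason the overridable fields above are pointers.
	RunAsNonRoot bool
}

// effectiveSecurityContext resolves the late-binding overlay: the pod-level
// ContainerDefaults act as a base, and any field the container explicitly
// sets (non-nil) wins. Containers that set nothing inherit the base as-is.
func effectiveSecurityContext(defaults, container *ContainerSecurityContext) *ContainerSecurityContext {
	out := &ContainerSecurityContext{}
	if defaults != nil {
		*out = *defaults
	}
	if container == nil {
		return out
	}
	if container.Privileged != nil {
		out.Privileged = container.Privileged
	}
	if container.SELinuxOptions != nil {
		out.SELinuxOptions = container.SELinuxOptions
	}
	if container.RunAsUser != nil {
		out.RunAsUser = container.RunAsUser
	}
	return out
}
```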
Me too.

The problem with late binding / overlay is that anything looking at container-level SecurityContext would be misled. Let's say, for example, someone developed an admission-control plugin to enforce security policy over SecurityContext. An overlay would be a good way to circumvent enforcement.
Admission control would have to look at the final rendering of the security context.
The point is backwards compatibility with the existing v1 API -- code that's unaware of PodSecurityContext.
Ahh, right. Good catch. Yeah, that's harder, then.

@bgrant0607 @thockin At what point are we planning to introduce v2beta1?
@bgrant0607 @thockin @smarterclayton In the context of a new API version, I feel like this gets easier to support. A v1 podspec would be equivalent to a pod without a pod-level security context in the internal API.

I do think we could accomplish our goals (which for me personally are around sharing volumes between multiple containers in a pod which may be running with different uid/gids) by omitting container defaults (maybe) and just introducing the pod-level security context for the pod-level options (host network ns, host IPC ns, pod-wide supplemental group). I think I could live with that for now (assuming it doesn't appear to be a horrible option after more thought) if the backwards-compatibility issues within the v1 API are seriously problematic. Let me think through fully what we would have to do on backwards compat and see what shakes out of that.
I kind of like just saying we have this new thing that overrides the old thing as a start. Tim convinced me that most of the time the new thing is what people want. In the future, the old thing (per container) can mutate to take into account the new thing.
@bgrant0607 Think you will have a chance to review today?

@thockin Brian told me you were the one to bug about this :-D
1. Attributes that apply to the pod itself
2. Attributes that apply to the containers of the pod

In the internal API, fields of the `PodSpec` controlling the use of the host PID, IPC, and network
might emphasize this is internal API only and why (compat)
I see this below, so nm
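For context, a sketch of the internal-only shape the quoted proposal text implies, assuming the pod-level fields are exactly the host-namespace ones named above (the field set is a reading of the excerpt, not a quote from the proposal):

```go
// Internal API only; the v1 PodSpec keeps its existing top-level
// hostNetwork/hostPID/hostIPC fields for backward compatibility.
type PodSecurityContext struct {
	// Use the host's PID namespace.
	HostPID bool
	// Use the host's IPC namespace.
	HostIPC bool
	// Use the host's network namespace.
	HostNetwork bool
}
```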
LGTM - most of my comments are for when the proposal turns into code. Just nits. My only gripe with this doc is that it is very verbose in explaining what amounts to a relatively small change (after all the compat kerfuffle).

Please remove WIP from commit and PR titles.

+1

Commit message changed. Thanks a lot @thockin @bgrant0607 @smarterclayton

GCE e2e build/test passed for commit 95cd1ff70f8e5dfac3658b4c435017de7d2e6ce3.

@k8s-oncall PTAL

Please rebase to get shippable.yml changes.

@bgrant0607 Rebased

Unit, integration and GCE e2e test build/test passed for commit 0c70062.

Shippable was a flake (#14577)
Pod-level security context proposal
Proposal for differentiating pod-level and container-level security contexts.
@bgrant0607 @erictune @smarterclayton @pweil- @eparis