Support for user namespaces #4572
Can you show us an example of how a user would interact with this on the cli? |
Yes, there is one example in cli.rst:
The file has a UID of 100000 on the host but appears as root-owned in the container. Similarly, the Note that I am using a standard Ubuntu image. Before starting the container, the UIDs and GIDs of its files are translated to their real values on the host (100000 range). The time taken for the translation (a second or two for the above image) can be avoided by using a pre-translated image. |
With this interface, how would I map multiple UIDs? Can I do:
To map all of Docker's UIDs to a given UID? Can I do:
To map 0 to 100000 and 100 to 100100? I'm really looking forward to this feature! I'm glad to see progress! |
You say: """" Is there a way to apply this user mapping at build time? Would it be possible to have a |
The semantics of
The mappings can be sparse. Multiple ranges of UIDs can be mapped with multiple --uidmap options.
The mappings cannot overlap. Obviously, a one-to-many mapping doesn't make sense (one virtual UID cannot be translated into multiple real UIDs at the same time). Many-to-one mappings may be allowed in principle but they are disallowed in current LXC implementation. For example, if real UID 100000 on the host is mapped to virtual UID 0 in the container, real UID 0 cannot also be mapped to virtual UID 0 in the container. This makes any directories in the container image owned by root on the host appear as owned by nobody in the container. |
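To make the no-overlap rule concrete, here is a minimal Python sketch of how a set of mappings could be checked. It assumes the `host:container:size` form seen in this thread's `--uidmap` examples; the names `parse_uidmap` and `validate` are hypothetical and not taken from the patch.

```python
from typing import List, Tuple

# (host_start, container_start, size); the field order is an assumption
# based on the "100000:0:10000" examples in this thread.
Mapping = Tuple[int, int, int]


def parse_uidmap(spec: str) -> Mapping:
    """Parse a 'host:container:size' string into a tuple of ints."""
    host, container, size = (int(part) for part in spec.split(":"))
    if size <= 0:
        raise ValueError(f"size must be positive: {spec!r}")
    return host, container, size


def _overlaps(a_start: int, a_size: int, b_start: int, b_size: int) -> bool:
    """True if the half-open ranges [a, a+size) and [b, b+size) intersect."""
    return a_start < b_start + b_size and b_start < a_start + a_size


def validate(mappings: List[Mapping]) -> None:
    """Raise ValueError if any two mappings overlap on either side."""
    for i, (h1, c1, s1) in enumerate(mappings):
        for h2, c2, s2 in mappings[i + 1:]:
            if _overlaps(h1, s1, h2, s2):
                raise ValueError("host UID ranges overlap")
            if _overlaps(c1, s1, c2, s2):
                raise ValueError("container UID ranges overlap")


# Sparse but disjoint ranges are accepted.
validate([parse_uidmap("100000:0:100"), parse_uidmap("100100:100:50")])
```

With this check, mapping both real UID 100000 and real UID 0 to virtual UID 0 is rejected because the container-side ranges collide.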
The UID translation time can be avoided by using a 'pre-translated' image which is basically produced by committing the container into the (same or new) image and using it the next time. |
"Many-to-one mappings may be allowed in principle but they are disallowed in current LXC implementation." - That's a pity, as there goes my use case... Mapping "any/every user in the container including root to a given user."
Is
Would it make sense to mimic list comprehension with a syntax like:
I might suggest also
It seems really important to me that we be able to map ALL UIDs, as the security flaw with volumes in the current non-mapped model comes from potential UID overlap. Imagine if there is a PaaS which gives its users the ability to run docker containers, and also the ability to ssh in to a special non-privileged shell as a special non-privileged user. Within the ssh session, perhaps that user only has the right to run the docker client, in order to check the status of their container... Now if the docker container was able to create an executable file owned by that non-privileged user in a volume somewhere, you could end up with ssh non-privileged shell breakout.
My use case is a bit different. I'm writing a program called subuser. I want each user on the system to be able to run docker containers which have volumes mounted, in order to access user files. Currently, I create a user in the "subuser container" which just happens to have the same UID as the user that is running that subuser container, which makes permissions match up. But it is terribly ugly. |
Non-one-to-one mappings are tricky in practice. Say virtual UIDs 100 and 200 map to the same real UID, 1000. Chowning a file to either 100 or 200 would cause its UID on the host to be set to 1000. When a container process subsequently calls a
The solution adopted by the kernel is to retain one-to-one mapping between host and container UIDs but assign a range of real UIDs to individual users, which can in turn be mapped to virtual UIDs within the containers created by them. Even though virtual UIDs map to different real UIDs, they can potentially be owned by a single user on the host.
Can you elaborate on what you mean by the following:
It seems really important to me that we be able to map ALL UIDs, as the security flaw with volumes in the current non-mapped model comes from potential UID overlap. Imagine if there is a PaaS which gives its users the ability to run docker containers, and also the ability to ssh in to a special non-privileged shell as a special non-privileged user. Within the ssh session, perhaps that user only has the right to run the docker client, in order to check the status of their container... Now if the docker container was able to create an executable file owned by that non-privileged user in a volume somewhere, you could end up with ssh non-privileged shell breakout. |
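The ambiguity described above can be shown in a few lines of Python. This is only an illustration using the UIDs from the comment (100 and 200 both mapping to 1000); the table and function names are hypothetical.

```python
# Hypothetical many-to-one table: container (virtual) UID -> host (real) UID.
container_to_host = {100: 1000, 200: 1000}


def host_uid_after_chown(container_uid: int) -> int:
    """Forward translation is well defined: each virtual UID has one real UID."""
    return container_to_host[container_uid]


def reverse_candidates(host_uid: int) -> list:
    """Reverse translation: every virtual UID that maps to the given real UID."""
    return sorted(c for c, h in container_to_host.items() if h == host_uid)


# Both chown targets end up as the same on-disk owner...
assert host_uid_after_chown(100) == host_uid_after_chown(200) == 1000
# ...so reporting the owner back to the container has two equally valid answers.
print(reverse_candidates(1000))
```

Because the reverse lookup returns more than one candidate, anything that reports file ownership back into the container would have to pick one arbitrarily, which is why a one-to-one mapping is kept.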
@dineshs-altiscale I see the problem with the many-to-one mapping now. Perhaps it would be possible to make it so that only a single container-side UID was allowed to access a volume at all. The security problem is that you don't want an untrusted docker container that has a volume mounted to be able to create a file which is owned by some arbitrary user on the host. You want to be able to dictate the host-side owner of the files in the volume. If you set a range, then the docker container could still create files owned by users outside that range... |
Even the root in a container can create files only with UIDs explicitly mapped into the container with |
@dineshs-altiscale can root create users with unmapped UIDs? |
No, even root cannot create any artifacts (users, files, processes etc.) with unmapped UIDs. Note that root is always relative in this model where UIDs are hierarchically delegated. In the global host namespace, with the entire UID space mapped in, root can create and administer all users. |
So if I do:
Will I get an error message because the root user cannot exist (it is not mapped in that example, if I understand correctly)? What if I do:
Will useradd return some sort of permission-denied error (since in this case it is not allowed to create any new users, as the only mapped UID is 0)? |
Yes, the patch checks for UID 0 and any UID passed through |
This exposes UID namespace support. A new command line option (--uidmap) maps a set of virtual UIDs to which the application within the container is confined. The application could potentially be the root in the container but unprivileged on the host.
Addresses issue moby#2918
Docker-DCO-1.1-Signed-off-by: Dinesh Subhraveti <dineshs@altiscale.com> (github: dineshs-altiscale)
If -x flag is not set, UIDs of the files in the image are assumed to match the specified UID mappings and no UID translation is performed. Images with UIDs already translated can be produced by simply committing a container created with -x flag:
    $ docker commit $(docker run -d -x --uidmap="100000:0:10000" centos true) centos_uid100000
    $ docker run --uidmap="100000:0:10000" -i -t centos_uid100000 bash
Docker-DCO-1.1-Signed-off-by: Dinesh Subhraveti dineshs@altiscale.com (github: dineshs-altiscale)
--private-uids option is introduced to simplify the use of virtual UID space. A default host UID range is chosen to create the container rather than the user having to specify a mapping. If user specifies mappings using --uidmap, they take precedence. In either case, the semantics of -x remain the same:
    $ docker commit $(docker run -d -x --private-uids centos true) centos_private_uids
    $ docker run --private-uids -i -t centos_private_uids bash
    # cat /proc/self/uid_map
    0 100000 10000
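For readers unfamiliar with the `uid_map` output in the commit message above, the following Python sketch parses that format and translates a container UID to its host UID. Each line of the kernel's `uid_map` file holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_uid_map(text: str) -> List[Tuple[int, int, int]]:
    """Parse uid_map content into (container_start, host_start, size) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"malformed uid_map line: {line!r}")
        container_start, host_start, size = (int(f) for f in fields)
        entries.append((container_start, host_start, size))
    return entries


def to_host(entries: List[Tuple[int, int, int]], container_uid: int) -> Optional[int]:
    """Translate a container UID to a host UID, or None if it is unmapped."""
    for container_start, host_start, size in entries:
        if container_start <= container_uid < container_start + size:
            return host_start + (container_uid - container_start)
    return None


entries = parse_uid_map("0 100000 10000\n")
print(to_host(entries, 0))      # container root
print(to_host(entries, 10000))  # just past the mapped range
```

Inside a running container the same text could be read from `/proc/self/uid_map` and passed to `parse_uid_map`.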
Here is some draft text outlining the usage scenarios. Will update as we go. Creating an image with translated UIDs is simple.
It may take several seconds to translate the UIDs of all files in the image to a default UID range on the host and commit the product as a new image. The image can then be used without
Even though UID appears to be '0', the process really runs as UID 100000 on the host. Similarly the real UID of
Also, the virtual UID provided has to be within the virtual UID range mapped into the container.
An alternate custom UID mapping, rather than the default mapping used by
Since the mapping is different in this case, the base ubuntu image is used with The general syntax of
The mappings can be sparse. Multiple ranges of UIDs can be mapped with multiple --uidmap options. If a real to virtual UID mapping doesn't exist, it would show up as
The mappings are one-to-one and cannot overlap. For example, if real UID 100000 on the host is mapped to virtual UID 0 in the container, real UID 0 cannot also be mapped to virtual UID 0 in the container. This makes any directories in the container image owned by root on the host appear as owned by nobody in the container. If
Any new empty volumes or volumes populated with contents from the image acquire the UID mappings as well.
However UIDs of any volumes imported from other containers or from the host are not translated and would remain inaccessible unless the UIDs of the files belong to the range of UIDs mapped into the container (or the permissions allow).
In this case, since there is no virtual UID mapping for real UID 0, the volume owned by root on the host appears to belong to |
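The lookup behind this behaviour can be sketched in Python. When a host UID has no mapping, the kernel reports the overflow UID instead (65534 by default, read from `/proc/sys/kernel/overflowuid`), which most distributions name `nobody`. The mapping tuple order `(host_start, container_start, size)` follows this thread's `--uidmap` examples, and `host_to_container` is a hypothetical name.

```python
from typing import List, Tuple

# Kernel default for /proc/sys/kernel/overflowuid; a host may configure another value.
OVERFLOW_UID = 65534


def host_to_container(mappings: List[Tuple[int, int, int]], host_uid: int) -> int:
    """Return the UID a container would see for a file owned by host_uid."""
    for host_start, container_start, size in mappings:
        if host_start <= host_uid < host_start + size:
            return container_start + (host_uid - host_start)
    return OVERFLOW_UID  # unmapped owners all collapse to the overflow UID


mappings = [(100000, 0, 10000)]  # as in --uidmap="100000:0:10000"
print(host_to_container(mappings, 100000))  # mapped: seen as container root
print(host_to_container(mappings, 0))       # host root is unmapped here
```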
I wonder, can you add Dockerfile equivalents to the command-line parameters? Something like
I think the text you've written above should go into the examples in cli.rst (and we'll move them around later) |
@SvenDowideit, yes, Dockerfile commands are still to be done. Updated cli.rst with the above examples. |
This seems a bit low-level to me. I mean, we should probably allow manual specification of UID maps for specialized needs, but in general the UID mapping is complex because it is a global resource on the host system that needs to be allocated and maintained. I guess it depends on exactly what kind of use cases one sees for user namespaces, but I think we could end up with a more useful system if the docker daemon did the allocation of UID ranges, remapping, etc. There are many complexities involved here: persistent allocation of host UID ranges, remapping of UIDs for images, volumes shared between containers needing the same UID mappings, how to share images between hosts where the UID ranges are allocated differently (remap the image when pushing to a repository?), and so on. |
UIDs are a rather complex resource. This PR attempts to provide a simple model for common use by using implicit defaults, while retaining flexibility for more advanced scenarios by using an additional option. Virtual UIDs are a new feature to Docker and Linux in general -- I think patterns around more complex use cases will evolve over time. In the meantime, the following simple model could cover at least the most common use cases:
Each host has a range of available UIDs and Docker uses a particular subrange as default when
Container portability is preserved by expressing UIDs as host-independent relative offsets. Dockerfile uses a new UIDMAP instruction which would either specify "default" for the simple case or a mapping of form "host UID relative to default : container UID : size". The host UID field is expressed as a relative offset to the default subrange on that host. When the resulting container is run on a different host, container UIDs are mapped to the available subrange on the target host.
Images in the repository are always stored in their "identity mapping" (UID x maps to UID x). The UID space to which they need to be translated is host dependent and the translation is performed on the target before running the container. The image, once translated with |
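As a rough illustration of the relative-offset idea, the Python sketch below resolves a mapping of the form "host UID relative to default : container UID : size" against a host's default subrange. The function name `resolve` and the base values 100000 and 200000 are assumptions made for the example, not part of the proposal.

```python
from typing import Tuple


def resolve(relative_spec: str, host_default_base: int) -> Tuple[int, int, int]:
    """Turn 'relative_host:container:size' into an absolute (host, container, size)."""
    rel_host, container, size = (int(part) for part in relative_spec.split(":"))
    if rel_host < 0 or container < 0 or size <= 0:
        raise ValueError(f"invalid mapping: {relative_spec!r}")
    return host_default_base + rel_host, container, size


# The same portable spec lands in a different absolute range on each host.
print(resolve("0:0:10000", host_default_base=100000))
print(resolve("0:0:10000", host_default_base=200000))
```

Only the relative form would need to travel with the image or Dockerfile; each host supplies its own base.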
This adds a Dockerfile instruction called PRIVATEUIDS which indicates the range of host UIDs to use rather than just one default range of UIDs (100k-110k) for all containers. It allows the user to specify which contiguous set of 10k UIDs (referred to as a "bank" of UIDs) are mapped into the container. A cohesive group of containers that need to share data through volumes must use the same bank. --private-uids option is modified to take a parameter to specify the bank as well. The behavior of --uidmap and -x options remains the same.
The last commit adds an integer parameter to
The sysadmin of a host dedicates an unused portion of the host UID space to Docker, which is organized into a series of UID banks. The range(s) of host UIDs dedicated to Docker and the number of UIDs per bank could be configurable (but are hardcoded in the current implementation). For example,
The following picture shows host UID range 100K - 500K dedicated to Docker, which is divided into 40 banks of 10K UIDs each. 20K UIDs from banks 1 and 3 are mapped at the default virtual UID 0 in the container.
Dockerfile for the above example would be something like:
Note that there is no Dockerfile equivalent for I'll collect feedback on this and update the documentation. |
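The bank arithmetic in the example above (host UIDs 100K - 500K, 40 banks of 10K each) can be sketched in Python as follows. Two things are assumptions here: banks are numbered from 0, and the selected banks are laid out back to back starting at container UID 0. The constant and function names are hypothetical.

```python
from typing import List, Tuple

DOCKER_UID_START = 100_000  # start of the host range dedicated to Docker
UIDS_PER_BANK = 10_000
NUM_BANKS = 40              # (500K - 100K) / 10K


def bank_range(bank: int) -> Tuple[int, int]:
    """Return the inclusive host UID range of a bank (0-based numbering assumed)."""
    if not 0 <= bank < NUM_BANKS:
        raise ValueError(f"bank must be in 0..{NUM_BANKS - 1}: {bank}")
    start = DOCKER_UID_START + bank * UIDS_PER_BANK
    return start, start + UIDS_PER_BANK - 1


def map_banks(banks: List[int]) -> List[Tuple[int, int, int]]:
    """Build (host_start, container_start, size) mappings, packed from container UID 0."""
    mappings = []
    container_start = 0
    for bank in banks:
        host_start, _ = bank_range(bank)
        mappings.append((host_start, container_start, UIDS_PER_BANK))
        container_start += UIDS_PER_BANK
    return mappings


# Banks 1 and 3 together give the container 20K UIDs starting at virtual UID 0.
for mapping in map_banks([1, 3]):
    print(mapping)
```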
interesting - and then I wonder - can we name the banks - and perhaps get some auto-linkage when using shared volumes (sorry, this is not a request, I'm just thinking aloud) |
@SvenDowideit, I like the idea in general; names add color to otherwise boring integers.
One way is to define an ordered list of unique names per host, each representing an available UID bank. The list has to be ordered to be able to specify ranges. Otherwise, each container would be limited to 5 banks, given that a namespace can have at most 5 UID mappings. Since names are host-dependent in this case, image config must store bank IDs for image portability. But then Dockerfiles themselves won't be portable if they reference host-dependent bank names.
Another possibility is to define a well-known universal mapping between integers and names. (Something like atomic numbers in the periodic table of elements, but there are only ~100 of them.) Then, we'd be able to specify ranges, reference them in Dockerfiles, etc.
I am also debating the value of configurable UIDs-per-bank. Hardcoding it to a small enough value like 1000 would make banks portable (a container that needs 1 bank of 10K UIDs on one host may need 10 banks of 1K UIDs on another) and would also reduce the configuration burden.
Names aside, automatically pulling in UID banks from the container sharing its volumes is simple enough to implement. |
@SvenDowideit, just pushed the change to inherit banks from containers sharing volumes. |
@dineshs-altiscale Yes, that makes sense and is on my todo list. We also need a flag to enable this on docker run so that these mappings are applied. |
As of the latest commit (5975178), all user flags are removed in the interest of simplicity. Containers are always created with default UID mappings that map container root to docker-root on host and all other host users except root one-to-one. So your code would get the mappings to enforce from the generic code rather than from a user flag. Local images are stored with UIDs remapped so that no UID translation is required on container start. Images are reverse translated on push. |
Rather than (or in addition to) a |
This PR has outgrown itself. As discussed at the Plumbers conference, I am going to split this up into simpler bite-size PRs as follows and add references to them here:
@crosbymichael let me know if this looks okay to you and I'll go ahead with the push. |
@dineshs-altiscale sounds good |
If we remap on disk at pull time, won't that break non-namespaced |
I liked my earlier I am happy to bring it back : ) |
Why don't we make the remapping occur during container creation, not on image creation (or pulling)? IMO, it seems that the |
@cyphar perhaps there is concern that it will slow down launch times for standard containers (maybe that suggests that the reverse - ie when running priv it can be mapped back at launch time?) @dineshs-altiscale - can you link here to the PRs you open so we can test/try them individually? |
@michaelneale The problem with that is that some users may want different mappings for the users for each container (such as sharing a volume, where you don't want every container to read the shared data of every other container). That's why I said that it should be a property of the container, not just done at container creation. |
@cyphar to address the core issue of privilege isolation while keeping the usage model simple, we agreed to map just the root user initially. If only the root user is mapped and also always to the same host user, the mappings become static. If the mappings are static, they could be directly applied to the image itself. That's kind of how we arrived here. But then, as folks are pointing out, this scheme won't work if user namespace is not used or if the root in different containers are mapped to different host users (isolating containers from each other is as necessary as isolating containers from the host) or if custom UID mappings are supported in the future. Evidently this is a complex problem and some trade-offs seem to be unavoidable. With that in mind, let me make the following proposal for the initial patch (splitting it up into multiple simple PRs is no problem):
It's a trade-off between transparency and the burden on the user of creating remapped images before first use. The user should know what images are created and in use, how much storage they are consuming, etc. Most backends are poor at efficiently tracking changes to file metadata containing the UIDs. Then it becomes:
|
And what do we do when there is no "docker-root" user on the host? Also, |
@tianon I already asked the first question, and it boiled down to "it should be part of the install". You can't just use a random uid (since any uid might be used somewhere, and doesn't need to be defined in |
@tianon basically all host users, except root, are mapped into the container. That's 2^32 - 1 users. The missing UID in the container is the UID of the host docker-root user. The installer should assign it a UID which is not commonly used, but luckily the range is large. To keep this simple and transparent, the user is in charge of images -- remap before using and unmap before sharing. More sophisticated automation could be done in the next iteration in another PR. |
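One way to picture the scheme described above is the Python sketch below. It builds a set of ranges in which container root maps to the host docker-root UID and every other UID maps to itself, leaving host root unmapped. This is a guess at a layout consistent with the description, not the mappings used in the patch; `default_mappings` and the example UID 2000 are hypothetical.

```python
from typing import List, Tuple

UID_SPACE = 2**32  # size of the 32-bit UID space, matching the thread's count


def default_mappings(docker_root_uid: int) -> List[Tuple[int, int, int]]:
    """Return (host_start, container_start, size) ranges for the described scheme."""
    d = docker_root_uid
    if not 0 < d < UID_SPACE:
        raise ValueError("docker-root UID must be non-zero and within the UID space")
    candidates = [
        (d, 0, 1),                          # container root -> host docker-root
        (1, 1, d - 1),                      # UIDs below docker-root map to themselves
        (d + 1, d + 1, UID_SPACE - d - 1),  # UIDs above docker-root map to themselves
    ]
    # Drop empty ranges, e.g. when docker-root is UID 1.
    return [m for m in candidates if m[2] > 0]


mappings = default_mappings(2000)
for mapping in mappings:
    print(mapping)
# Host root (UID 0) is the only host UID left out; container UID 2000 has no mapping.
print(sum(size for _, _, size in mappings) == UID_SPACE - 1)
```

The sketch stays within three ranges, well under the limit of 5 UID mappings per namespace mentioned earlier in the thread.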
@crosbymichael could you please share your thoughts? Could something like the following be acceptable?
|
@dineshs-altiscale Would all UIDs other than root map to |
@cyphar Only the UID of the host docker-root user would be unavailable in the container and ends up appearing as nobody (or whatever overflowuid is). The rest of the 2^32 - 1 UIDs are mapped. The specific mappings used in the code are:
Only mapping root makes the container quite unusable. Any attempt to use any other UID than 0 causes EINVAL -- no |
This is refactored into 3 PRs:
|
ping @crosbymichael @shykes |
@dineshs-altiscale do you think we can close this PR in favor of the others? |
Yes we could, but I was going to give it a few days to capture any other comments and discussion on the high level approach. |
Could this have label project/security added? |
This exposes UID namespace support. A new command line option (--uidmap) maps a set of virtual UIDs to which the application within the container is confined. The application could potentially be the root in the container but unprivileged on the host.
This is still missing tests, but I wanted to push it anyway to get feedback. Testing requires the latest kernel (kernel.org 3.13 or Fedora 20).
Addresses issue #2918
Docker-DCO-1.1-Signed-off-by: Dinesh Subhraveti dineshs@altiscale.com (github: dineshs-altiscale)