Network namespaces for unprivileged users

A couple of weekends ago I’ve played with libslirp and put together slirp-forwarder. The challenge with network namespaces for unprivileged users is that creating TAP or TUN devices requires privileges in the host network namespace. SliRP sidesteps this by emulating a full TCP/IP stack entirely in user space, so the helper process can forward traffic to the outside world using only normal socket operations, without needing any elevated capability.

SliRP emulates in userspace a TCP/IP stack. It can be used to circumvent the limitation of creating TAP/TUN devices in the host namespace for an unprivileged user. The program could run in the host namespace, receive messages from the network namespace where a TAP device is configured, and forward them to the outside world using unprivileged operations such as opening another connection to the destination host. Privileged operations are still not possible outside of the emulated network, as the helper program doesn’t gain any additional privilege that running as an unprivileged user.

[read more]

Become-root in a user namespace

I’ve cleaned up some C files I was using locally for hacking with user namespaces and uploaded them to a new repository on github: https://github.com/giuseppe/become-root. The tool creates a new user namespace and maps the caller to UID 0 inside it, while also mapping additional UIDs and GIDs from the ranges allocated in /etc/subuid and /etc/subgid. This is the foundation needed for rootless containers, which require a full UID/GID mapping — not just the single-UID mapping that unshare -r provides — to correctly represent file ownership inside container images.

[read more]

Fuse-overlayfs moved to github.com/containers

The fuse-overlayfs project I was working on in the last weeks was moved under the github.com/containers umbrella. fuse-overlayfs is a user-space implementation of the overlay filesystem that can be mounted without root privileges, which is essential for rootless containers. With Linux 4.18 introducing the ability to mount FUSE filesystems inside user namespaces, this makes overlay-based storage finally usable by unprivileged container runtimes such as Podman.

With Linux 4.18 it will be possible to mount a FUSE file system in an user namespace. fuse-overlayfs is an implementation in user space of the overlay file system already present in the Linux kernel, but that can be mounted only by the root user. Union file systems were around for a long time, allowing multiple layers to be stacked on top of each other where usually the last one is the only writeable.
Overlay is an union file system widely used for mounting OCI image. Each OCI image is made up of different layers, each layer can be used by different images. A list of layers, stacked on each other gives the final image that is used by a container. The last level, that is writeable, is specific for the container. This model enables different containers to use the same image that is accessible as read-only from the lower layers of the overlay file system.

[read more]

Current status (and problems) of running Buildah as non root

Having Buildah running in a user namespace opens the possibility of building container images as a non-root user. I’ve done some work to get Buildah running inside a user container, where it can still create and modify container images without any elevated privileges on the host. This is useful for CI environments and shared systems where granting root or setuid access is not acceptable.

There are still some open issues to get it fully working. The biggest open one is that overlayfs cannot be currently used as non root user. There is some work going on, but this will require changes in the kernel and the way extended attributes work for overlay. The alternative is far from ideal and it is to use the vfs storage driver, but it is a good starting point to get things moving and see how far we get. (Another possibility that doesn’t require changes in the kernel would be an OSTree storage for Buildah, but that is a different story).

[read more]

New COPR repository for crun

I made a new COPR repository for crun so that it can be easily tested on Fedora without having to build from source. crun is a lightweight OCI container runtime written in C, intended as a faster and lower-overhead alternative to runC. The COPR repository tracks the upstream development branch, making it straightforward to try out new features and report issues before they land in a distribution package.

https://copr.fedorainfracloud.org/coprs/gscrivano/crun/

To install crun on Fedora, it is enough to:

[read more]

C is a better fit for tools like an OCI runtime

I’ve spent some of the last weeks working on a replacement for runC, the most used/known OCI runtime for running containers. It might not be very well known, but it is a key component for running containers. Every Docker container ultimately runs through runC. The OCI runtime is the thin layer between the container engine and the kernel: it reads a JSON configuration file, creates the necessary namespaces and cgroups, sets up mounts and capabilities, and finally execs the container process. Because it runs for such a short time and its workload is almost entirely syscalls, the implementation language matters for startup latency.

[read more]

OpenShift on system containers

It is still an ongoing work not ready for production, but the upstream version of OpenShift origin has already an experimental support for running OpenShift Origin using system containers. The “latest” Docker image for origin, node and openvswitch, the 3 components we need, are automatically pushed to docker.io, so we can use these for our test. The rhel7/etcd system container image instead is pulled from the Red Hat registry.

This demo is based on these blog posts www.projectatomic.io/blog/2016/12/part1-install-origin-on-f25-atomic-host/ and www.projectatomic.io/blog/2016/12/part2-install-origin-on-f25-atomic-host/ with some differences for the provision of the VMs and obviously running system containers instead of Docker containers.

[read more]

System containers presentation

Here are the slides for the Atomic System Containers talk I gave at Devconf.cz 2017. System containers are a way to run infrastructure services — such as etcd and Flannel — outside of Docker, managed directly by runc and systemd, which removes the circular dependency that arises when a container runtime depends on components that must themselves be running inside containers.

http://scrivano.org/static/system-containers-demo/

If you are interested in the video, it is on YouTube:

[read more]

Facebook detox?

I have been using Facebook for the last years to fill every dead time:waiting for the bus, ads on TV, compiling, etc.  The quality of the information coming from Facebook is inferior to any other social network, at least to my experience (it can be I follow/know the wrong people), though the part of the brain that controls procrastination seems addicted to this lower quality information and the chattering there.  Also, I don’t want to simply delete my Facebook account and move on, most of the people I know are present only there, neither I want to be more “asocial”.

[read more]

Use bubblewrap as an unprivileged user to run systemd images

bubblewrap is a sandboxing tool that allows unprivileged users to run containers. I was recently working on a way to allow unprivileged users to take advantage of bubblewrap to run regular system images that use systemd. To do so, it was necessary to modify bubblewrap to retain a controlled set of Linux capabilities inside the sandbox. Without those capabilities, systemd cannot perform the privilege-separation steps it needs at startup, even when running as UID 0 inside a user namespace.

[read more]