A bug report filed against crun a few days ago exposed a deadlock: under certain seccomp profiles, the runtime would hang indefinitely before the container process ever started. The root cause is a subtle sequencing problem between installing a seccomp filter that intercepts a syscall and then immediately using that same syscall to hand off the resulting listener file descriptor to the userspace handler — the very handler that has not yet received the descriptor it needs to process the interception.

The reporter noticed that the following configuration worked fine with runc but caused crun to hang:

runc (v1.1.4) accepts the following .linux.seccomp configuration (sendmsg is in the SCMP_ACT_NOTIFY list), but crun (v1.5, also tested v0.19) just hangs.

    "seccomp": {
      "defaultAction": "SCMP_ACT_ALLOW",
      "listenerPath": "/tmp/foo.sock",
      "syscalls": [
        {
          "names": [
            "sendmsg"
          ],
          "action": "SCMP_ACT_NOTIFY"
        }
      ]
    }

Seccomp supports user-space notifications: when a filter that uses SCMP_ACT_NOTIFY is installed, the kernel returns a file descriptor to the process that installed it. Instead of silently allowing or denying an intercepted syscall, the kernel suspends the call and reports it on that descriptor, which is subsequently used to receive notifications for every intercepted call.

The OCI runtime does not handle these notifications directly — it passes the file descriptor to a separate listener process. The listenerPath field in the OCI configuration specifies the UNIX socket that will receive the listener fd once crun has obtained it.

The bug was straightforward: crun was using sendmsg(2) to deliver the fd to that socket, but doing so immediately after installing the seccomp filter. Since sendmsg itself was in the intercept list, the call blocked waiting for a listener that had not yet received the fd — a classic deadlock.

What to do?

The problem we need to solve is how to send the file descriptor from a context where sendmsg is not intercepted.

This is easily achieved with a helper process that is created just before the seccomp filter is installed. The helper process is responsible for sending the file descriptor to the specified socket.

According to the issue report, runc has already solved the problem: it uses a pipe to tell the helper process which fd holds the seccomp listener, and the helper process then retrieves the file descriptor with the pidfd_getfd(2) syscall.

Two issues with this approach are:

  • it requires a relatively new kernel feature, pidfd_getfd(2), which is available since Linux 5.6.
  • it still expects write(2) to not be filtered by seccomp.

The first issue can be avoided with a different approach: instead of using pidfd_getfd(2), we can create the helper process with the CLONE_FILES flag. The helper then shares the file descriptor table with its parent, so the listener fd that the parent obtains later is also valid in the helper, under the same number.

We still need to solve the second issue. We can do that with a shared memory region: the helper process polls the region in a loop until it contains the file descriptor number.
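The handshake can be tried in isolation with a small sketch. It is not crun's code: the function name handoff_via_shared_slot is hypothetical, it uses fork(2) with an anonymous shared mapping instead of clone(2) with CLONE_FILES and a memfd, and the child returns the value it saw through its exit status instead of sending a descriptor.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: hand `value` (0..255) to a child process through
   shared memory.  Returns the value the child saw, or -1 on error.  */
int
handoff_via_shared_slot (int value)
{
  atomic_int *slot;
  int status = 0;
  pid_t pid;

  /* A MAP_SHARED mapping stays shared between parent and child after fork.  */
  slot = mmap (NULL, sizeof (atomic_int), PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (slot == MAP_FAILED)
    return -1;

  /* -1 means "nothing published yet".  */
  atomic_store (slot, -1);

  pid = fork ();
  if (pid < 0)
    {
      munmap (slot, sizeof (atomic_int));
      return -1;
    }

  if (pid == 0)
    {
      int seen;

      /* Helper side: poll until the parent publishes a value.  */
      while ((seen = atomic_load (slot)) == -1)
        usleep (1000);
      _exit (seen & 0xff);
    }

  /* Parent side: publishing is a plain memory store, no syscall involved.  */
  atomic_store (slot, value);

  if (waitpid (pid, &status, 0) < 0)
    {
      munmap (slot, sizeof (atomic_int));
      return -1;
    }

  munmap (slot, sizeof (atomic_int));
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```

Calling handoff_via_shared_slot (42) should return 42. The point of the design is that the publishing side never enters the kernel, so no seccomp filter can get in the way of the store.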

Shared memory

The shared memory region is backed by a memfd created as:

      memfd = memfd_create ("seccomp-helper-memfd", O_RDWR);
      if (UNLIKELY (memfd < 0))
        return crun_make_error (err, errno, "memfd_create");

      ret = ftruncate (memfd, sizeof (atomic_int));
      if (UNLIKELY (ret < 0))
        return crun_make_error (err, errno, "ftruncate seccomp memfd");

      ret = libcrun_mmap (&mmap_region, NULL, sizeof (atomic_int),
                          PROT_WRITE | PROT_READ, MAP_SHARED, memfd, 0, err);
      if (UNLIKELY (ret < 0))
        return ret;

The first block creates the memfd file, the second one resizes it to the size of an atomic int and the third one maps it in memory.
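One detail the excerpt does not show is the initial value of the region. ftruncate(2) fills the file with zeros, while the helper loop treats -1 as "not ready yet", so the slot has to hold -1 before the helper starts polling. The sketch below is an assumption about how that can be done, not crun's code: create_fd_slot is a hypothetical name, and memfd_create is invoked through syscall(2) with no flags so that the example also builds with C libraries that lack the wrapper.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create a memfd-backed shared slot that starts
   at -1.  Returns NULL on error.  */
atomic_int *
create_fd_slot (void)
{
  atomic_int *slot;
  int memfd;

  /* Raw syscall equivalent of memfd_create ("fd-slot-demo", 0).  */
  memfd = (int) syscall (SYS_memfd_create, "fd-slot-demo", 0U);
  if (memfd < 0)
    return NULL;

  /* Grow the file from zero bytes to the size of one atomic int.  */
  if (ftruncate (memfd, sizeof (atomic_int)) < 0)
    {
      close (memfd);
      return NULL;
    }

  slot = mmap (NULL, sizeof (atomic_int), PROT_READ | PROT_WRITE,
               MAP_SHARED, memfd, 0);

  /* The mapping remains valid after the descriptor is closed.  */
  close (memfd);
  if (slot == MAP_FAILED)
    return NULL;

  /* The file is zero-filled; mark the slot as empty explicitly.  */
  atomic_store (slot, -1);
  return slot;
}
```

A process created after this call inherits the mapping and reads -1 until the other side stores a real descriptor number.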

Helper process

Now that the two processes can communicate without using any syscall, we can look at the helper process, which just does:

      helper_proc = syscall_clone (CLONE_FILES | SIGCHLD, NULL);
      if (UNLIKELY (helper_proc < 0))
        return crun_make_error (err, errno, "clone seccomp listener helper process");

      if (helper_proc == 0)
        {
          int fd;

          prctl (PR_SET_PDEATHSIG, SIGKILL);
          for (;;)
            {
              fd = *fd_received;
              if (fd == -1)
                {
                  usleep (1000);
                  continue;
                }
              break;
            }
          ret = send_fd_to_socket_with_payload (listener_receiver_fd, fd,
                                                receiver_fd_payload,
                                                receiver_fd_payload_len,
                                                err);
          if (UNLIKELY (ret < 0))
            _exit (crun_error_get_errno (err));
          _exit (0);
        }

The prctl(2) call makes sure that the helper process won’t survive its parent process.

Once the fd is retrieved from the shared memory region, the send_fd_to_socket_with_payload function sends it to the receiver socket using the sendmsg(2) syscall.
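The body of send_fd_to_socket_with_payload is not shown here. As a rough idea of what passing a descriptor over a UNIX socket involves, the following sketch uses SCM_RIGHTS ancillary data. The names send_fd_with_payload and recv_fd_with_payload are hypothetical stand-ins, not crun functions, and the receiver is included only to show the other end of the exchange.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one descriptor, suitably aligned.  */
union fd_ctrl
{
  char buf[CMSG_SPACE (sizeof (int))];
  struct cmsghdr align;
};

/* Hypothetical stand-in: send `fd` and a non-empty payload over the
   UNIX socket `sock`.  Returns 0 on success, -1 on error.  */
int
send_fd_with_payload (int sock, int fd, const char *payload, size_t len)
{
  union fd_ctrl ctrl;
  struct iovec iov = { .iov_base = (void *) payload, .iov_len = len };
  struct msghdr msg;
  struct cmsghdr *cmsg;

  memset (&msg, 0, sizeof (msg));
  memset (&ctrl, 0, sizeof (ctrl));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof (ctrl.buf);

  /* The descriptor travels as SCM_RIGHTS ancillary data.  */
  cmsg = CMSG_FIRSTHDR (&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
  memcpy (CMSG_DATA (cmsg), &fd, sizeof (int));

  return sendmsg (sock, &msg, 0) < 0 ? -1 : 0;
}

/* Hypothetical receiver: read the payload into `payload` and return the
   descriptor that came with it, or -1 on error.  */
int
recv_fd_with_payload (int sock, char *payload, size_t len)
{
  union fd_ctrl ctrl;
  struct iovec iov = { .iov_base = payload, .iov_len = len };
  struct msghdr msg;
  struct cmsghdr *cmsg;
  int fd = -1;

  memset (&msg, 0, sizeof (msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof (ctrl.buf);

  if (recvmsg (sock, &msg, 0) < 0)
    return -1;

  cmsg = CMSG_FIRSTHDR (&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
      || cmsg->cmsg_type != SCM_RIGHTS)
    return -1;

  /* The kernel installed a new descriptor in the receiving process.  */
  memcpy (&fd, CMSG_DATA (cmsg), sizeof (int));
  return fd;
}
```

The descriptor number seen by the receiver generally differs from the sender's, because the kernel allocates a fresh entry in the receiving process that refers to the same open file.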

Main process

The main process, the one that will eventually execve the container program, just does:

  ret = syscall_seccomp (SECCOMP_SET_MODE_FILTER, flags, &seccomp_filter);
  if (UNLIKELY (ret < 0))
    return crun_make_error (err, errno, "seccomp (SECCOMP_SET_MODE_FILTER)");
  if (listener_receiver_fd >= 0)
    {
      atomic_int *fd_to_send = mmap_region->addr;
      int status = 0;

      *fd_to_send = listener_fd = ret;

      ret = waitpid (helper_proc, &status, 0);
      ...
    }

The syscall_seccomp function is a wrapper around the seccomp(2) syscall to install the seccomp filter and retrieve the listener fd.

The *fd_to_send = listener_fd = ret; assignment writes the listener file descriptor to the shared memory, where the helper process picks it up.

Conclusion

With all of this in place, crun accepts a seccomp profile with no limitations on which syscalls can be intercepted with SCMP_ACT_NOTIFY. The notified process, the one that receives the seccomp listener, must still ensure that all syscalls made before the execve(2) syscall are allowed; otherwise the OCI runtime will fail to start the container.