In this blog post, we share a journey that involved an "impossible" customer case, some neat kernel debugging, and a serious bug and fix in runc, the CLI tool that spawns and manages containers for Docker and Kubernetes and is part of the Open Container Initiative.

At Twistlock, we take customer issues with the utmost seriousness because we know our customers depend on our platform to secure their mission-critical cloud environments. A few months ago, we got an urgent customer case stating that Twistlock Defenders were sporadically causing Kubernetes liveness probes to fail. The failure produced the following cryptic error:

false Error: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "process_linux.go:138: adding pid 3101 to cgroups caused \"failed to write 3101 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/docker/1d7b29b96dfd7bc97c6a2d6cbff82b00509cdcc4dbf2ac72ef5dd2bef9db7067/cgroup.procs: invalid argument\"": unknown

A quick Google search for the error turned up multiple occurrences [1,2,3] without any root cause or solution, which is usually indicative of an issue that is hard to solve. To provide some context about the failure, let's first examine how Docker spawns a new process in a container. In short, Docker uses a CLI tool called runc to create new containers and spawn processes inside them. As part of spawning a process in an existing container, runc attaches the process to the container's cgroups (a common kernel isolation mechanism). In our case, this attachment sporadically failed and thereby caused the Kubernetes liveness probe to fail (the Kubernetes exec liveness probe is built on Docker's exec mechanism).
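Conceptually, attaching a process to a cgroup (v1) boils down to writing its PID into the cgroup.procs file of the target cgroup directory. The Go sketch below illustrates only that idea; it is a simplified illustration, not runc's actual code, and the cgroup path is a hypothetical placeholder:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// attachToCgroup sketches the core of cgroup v1 attachment: writing the
// target PID into the cgroup.procs file. This is an illustration, not
// runc's actual implementation.
func attachToCgroup(cgroupPath string, pid int) error {
    procs := filepath.Join(cgroupPath, "cgroup.procs")
    // The kernel validates this write; an EINVAL here is what surfaces as
    // the "invalid argument" in the error above.
    return os.WriteFile(procs, []byte(fmt.Sprintf("%d", pid)), 0644)
}

func main() {
    // Hypothetical container cgroup path, for illustration only.
    path := "/sys/fs/cgroup/cpu,cpuacct/docker/<container-id>"
    if err := attachToCgroup(path, os.Getpid()); err != nil {
        fmt.Fprintln(os.Stderr, "cgroup attach failed:", err)
    }
}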

When handling customer cases caused by deep underlying platform issues like this, we usually take the following steps:

  1. Reproduce the issue in our private environment
  2. Stop the bleeding – unblock the customer with a temporary but solid solution
  3. Perform a root cause analysis and deliver a long term fix

First, we spent a few hours debugging the issue, understanding the environment, and building a reliable reproduction that did not involve Twistlock. Then, to mitigate the issue for the customer, we hacked together a solution that makes runc retry on those sporadic errors; this workaround required no changes on the customer's side and completely eliminated the customer's errors.
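The mitigation idea is conceptually simple: when the kernel transiently rejects the write, wait briefly and try again. The sketch below, with hypothetical names, illustrates the approach only; it is not the actual patch we shipped:

package cgroupattach

import (
    "fmt"
    "os"
    "path/filepath"
    "time"
)

// attachWithRetry retries the cgroup.procs write a few times, sleeping
// briefly between attempts to let the transient kernel condition clear.
// This sketches the mitigation idea; it is not the change that was
// actually applied to runc.
func attachWithRetry(cgroupPath string, pid, attempts int) error {
    procs := filepath.Join(cgroupPath, "cgroup.procs")
    var err error
    for i := 0; i < attempts; i++ {
        if err = os.WriteFile(procs, []byte(fmt.Sprintf("%d", pid)), 0644); err == nil {
            return nil
        }
        time.Sleep(10 * time.Millisecond)
    }
    return err
}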

After we ensured the problem was "solved" from the customer's perspective, we were ready to find the root cause and make sure it was fixed for all runc consumers, which in practice means everyone who uses Docker or Kubernetes.

First, let's briefly go over the repro details (the full details can be found in the GitHub issue we opened). In practice, we discovered that the issue reproduced consistently when we ran a perf tool while spawning a large number of processes inside a container (similar to repetitive Kubernetes liveness probes). For example:

$ perf trace --no-syscalls --event 'sched:*'
$ for i in {1..10000}; do docker exec -ti <mycontainer> ls; done

Armed with a repro and a custom-compiled runc, we went down the rabbit hole, debugging the issue and experimenting with potential fixes. After more hard debugging work we were stuck again: there was no simple way to eliminate the problem in runc without retrying the cgroup attachment flow.

The next step was to debug the kernel flows and diagnose why the cgroup attachment fails. To debug the kernel, we chose the excellent SystemTap platform. Simply put, SystemTap lets you perform live diagnostics of kernel code without recompiling it: you can modify parameters, print functions and structs, and perform complex conditional live debugging. A full SystemTap walkthrough is out of scope for this blog post, but you can watch this great tutorial.

Using our repro and SystemTap, we tracked the issue down to the following condition in cpu_cgroup_can_attach (kernel/sched/core.c#L8311). This check was added in kernel v4.8:

static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
{
        struct task_struct *task;
        struct cgroup_subsys_state *css;
        int ret = 0;

        cgroup_taskset_for_each(task, css, tset) {
#ifdef CONFIG_RT_GROUP_SCHED
                ...
#endif
                /*
                 * Avoid calling sched_move_task() before wake_up_new_task()
                 * has happened. This would lead to problems with PELT, due to
                 * move wanting to detach+attach while we're not attached yet.
                 */
                if (task->state == TASK_NEW)
                        ret = -EINVAL;
                ...
        }

        return ret;
}
In simple terms, this condition means that a process can only be added to a cgroup if none of its threads is in the TASK_NEW state. In our case, whenever the cgroup attachment failed, the rejection was triggered by a newly spawned thread of the runc init process.

Let’s explore a sample failure and the kernel state:
adding pid 18413 to cgroups caused \"failed to write 18413 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/docker/b906c3bcd1d0cbe999b9ba48bb07fc6cedbb05cc5a0fb5b15645c3f5660ef181/cgroup.procs: invalid argument\""

PPID    PID     TID     Description
-       18412   18412   Docker runc init (process that is spawned by runc)
18412   18413   18413   Docker runc init (process that is added to the cgroup)
18413   18414   18413   Thread that causes the failure to add pid 18413 to the cgroup

In practice, this issue happens because part of the runc flow is written in Go, which provides no guarantees about the number of runtime threads that are spawned. In a Go program you can limit the number of threads simultaneously executing Go code with GOMAXPROCS, but you have no control over the number of OS threads the Go runtime itself creates.
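To make this concrete, here is a small standalone snippet (our illustration, not runc code) that pins GOMAXPROCS to 1 and then prints the process's OS thread count from /proc/self/status. On Linux it typically reports several threads, because the runtime spawns its own helper threads regardless of GOMAXPROCS:

package main

import (
    "fmt"
    "os"
    "runtime"
    "strings"
)

func main() {
    // Limit the number of threads that execute Go code simultaneously.
    runtime.GOMAXPROCS(1)

    // Read the kernel's view of this process; the "Threads:" line counts
    // all OS threads, including the ones the Go runtime created on its own.
    status, err := os.ReadFile("/proc/self/status")
    if err != nil {
        panic(err)
    }
    for _, line := range strings.Split(string(status), "\n") {
        if strings.HasPrefix(line, "Threads:") {
            fmt.Println(line) // typically > 1 even with GOMAXPROCS(1)
        }
    }
}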

Luckily, part of the runc init code is written in C, which runs before the Go code starts. Thus, the solution is to ensure that the cgroup attachment happens as part of the C initialization code. Once we analyzed all this data, we opened a bug in runc. A fix for this bug was applied in a PR by crosbymichael, which continued the initial work by cyphar.
The fix is included in runc release 1.0-rc6, and order is restored.

We would like to thank Greenhouse for their patience and help in identifying and debugging the issue, and Michael Crosby and Aleksa Sarai of Docker and SUSE, respectively, for quickly fixing the issue in runc.
