Recently I gave a talk at AppsecUSA on protecting containerized applications using system call profiling. I received many questions and a lot of follow-up interest from users, so I'm writing this blog post to recap the talk and answer some of those questions.

Before we delve into the details, note that the talk recording is available here.

First of all, what is a system call? An application in user space uses system calls to request system-level services, such as opening a file, writing to a file, or sending a message, all of which require the operating system to execute kernel functions on the application's behalf. A typical Linux OS, for example, exposes more than 300 native system calls.

One of the worries with containers is that a malicious container could potentially compromise other containers on the same host via the shared operating system. The way this could be accomplished is through system calls, since system calls are the mechanism by which a user-space application such as a container communicates with the OS.

In my talk, I discussed two different threat models:

  • A malicious container attacks the underlying kernel
  • A malicious container attacks other containers via the kernel

Threat #1: A container attacks the underlying host.

In this scenario, a container attacks the host, posing threats such as denial of service, kernel exploits, or unauthorized API access. For example, the malicious application could pass an unusually long parameter in a system call to attempt a buffer overflow attack. If successful, the application could execute arbitrary code with kernel privileges, thereby inflicting harm on the entire system.

Similarly, the malicious container could exploit a vulnerability in the kernel to get around the boundaries of cgroups or user namespaces and launch a denial-of-service attack against kernel services.

Threat #2: A container attacks other containers via the kernel.

In another scenario, an application container attacks another application container on the same host machine through system calls. The attacking application could be trying to escalate its privileges by, say, accessing another container's private files or data. This is different from threat #1 because it does not necessarily entail the container compromising the operating system itself.

There are, of course, other threats, such as a remote attack on a container, but for the purpose of this discussion I'd like to concentrate on just the two threats above.

We should note that both of these threats entail a system call being used in the attack – often one with an exploitable vulnerability.

Understanding system calls with a Node application

I focused my talk on Node.js applications. Since Node is an interpreted language, it's possible to process Node code statically and understand what it is supposed to do, in terms of the system calls the application intends to execute at runtime.

Let’s consider a very simple Node application.

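The exact snippet from the talk isn't reproduced here, but a minimal Node application along these lines would look roughly like this (illustrative only):

```javascript
// app.js – write a short message to a file, then exit.
const fs = require("fs");

fs.writeFileSync("/tmp/hello.txt", "Hello from a container!\n");
console.log("done");
```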

For this application, we can statically determine that the system calls it will invoke are: syscall::write(), syscall::open(), and syscall::close().

We can make this determination statically, without having to run the application, because a) Node is an interpreted language, and b) container apps are immutable. You can deduce this information using DTrace or strace, or by simply building up a library of mappings between Node APIs and their corresponding system calls.
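As a sketch of that last option, a hand-built mapping from Node APIs to the syscalls they typically issue, plus a small helper that turns statically discovered API usage into a profile, might look something like this (the mapping and names below are illustrative, not Twistlock's actual implementation):

```javascript
// Illustrative mapping from Node APIs to the Linux syscalls they typically issue.
const nodeApiToSyscalls = {
  "fs.writeFileSync": ["open", "write", "close"],
  "fs.readFileSync":  ["open", "fstat", "read", "close"],
  "fs.unlinkSync":    ["unlink"],
  "net.createServer": ["socket", "bind", "listen", "accept4"],
};

// Given the Node APIs found by statically scanning the source,
// derive the set of system calls the application intends to use.
function deriveProfile(apisUsed) {
  const profile = new Set();
  for (const api of apisUsed) {
    (nodeApiToSyscalls[api] || []).forEach((sc) => profile.add(sc));
  }
  return profile;
}

console.log(deriveProfile(["fs.writeFileSync"])); // Set { 'open', 'write', 'close' }
```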

Twistlock built this into our container security platform – we can automatically parse Node and other interpreted-language applications to statically determine the set of system calls an application intends to use. This then becomes a profile for that specific application. At runtime, Twistlock uses this profile to determine whether there is any deviation from it during execution.

In the case of a compromised application – say someone injected a piece of malicious code at runtime – you might see the application perform a different set of system calls, such as spawning a shell, which the profile has not previously stipulated. Once we see such an anomaly or deviation, we can report it, log the event, block it, or take any number of other responses to control or mitigate the malicious behavior.

Note that the system call profile we build is application-specific. We deduce and maintain different profiles for different applications on the same host, and use each application's specific profile to protect it. We might, for instance, determine that App A initiates the system calls eventfd2(), read(), write(), and close(), while App B uses open(), fstat(), close(), etc.

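A minimal sketch of how such per-application whitelists might be enforced at runtime (the data and function names here are purely illustrative, not Twistlock's actual implementation):

```javascript
// Per-application syscall whitelists, deduced ahead of time (illustrative data).
const profiles = {
  "app-a": new Set(["eventfd2", "read", "write", "close"]),
  "app-b": new Set(["open", "fstat", "close"]),
};

// Check one observed syscall against the owning application's profile.
// Returns true if the call is expected, false if it is a deviation.
function checkSyscall(app, syscall) {
  const allowed = profiles[app];
  if (!allowed || !allowed.has(syscall)) {
    console.warn(`DEVIATION: ${app} invoked unexpected syscall ${syscall}`);
    return false; // a real system could alert, log, or block here
  }
  return true;
}

checkSyscall("app-a", "read");   // expected
checkSyscall("app-a", "execve"); // deviation, e.g. injected code spawning a shell
```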

What about seccomp()?

One question we sometimes receive is: what about seccomp()? The seccomp (secure computing mode) facility in Linux offers a way to filter system calls at the OS level. You can load a seccomp profile as part of the operating system to restrict applications on that host from invoking specified system calls. In fact, Docker Engine comes with a default seccomp profile that disables 44 of the more than 300 Linux syscalls.

While the seccomp profile is useful, a default system-wide profile can only offer the lowest common denominator of protection in order to avoid false positives. In other words, the profile can only block system calls that are bad for every single container on the host. It's difficult for such a profile to selectively filter system calls based on the nature of different applications.

Docker Engine does provide the ability to set up a seccomp profile for each individual container. However, such profiles need to be created manually and loaded manually within Docker Engine to take effect. If you need to update a profile – say you have found a new vulnerability that you want to incorporate into the profile – you will need to stop the applications, restart the Docker Engine with the new profiles, and reload the containers.
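For reference, a per-container profile is a JSON document that you pass to Docker when the container is launched. A stripped-down, whitelist-style profile might look roughly like the following (illustrative only – a real profile needs many more allowed syscalls just to let the runtime start, and the exact schema depends on your Docker version):

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["open", "read", "write", "close", "fstat", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

The profile is then applied per container at launch time, for example (image name is hypothetical):

```bash
docker run --security-opt seccomp=app-profile.json my-node-app
```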

With Twistlock, you can simply update the profile within our product, and doing so doesn't impact the host or the execution of the application.

For example, in January 2016 a zero-day vulnerability affecting the Linux kernel was disclosed. CVE-2016-0728 is a vulnerability that pertains to the system call keyctl(), part of the kernel's keyring facility.

keyctl(KEYCTL_JOIN_SESSION_KEYRING, name)

This is a rarely used system call. Any program that accesses it is probably one that implements cryptographic functions; any other application attempting to access it is suspect. A system call whitelisting approach can catch an exploit attempting to invoke keyctl() if that system call is not part of the application's original intent.
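In terms of the hypothetical checkSyscall() sketch from earlier, an exploit probing this vulnerability from an application whose profile never included keyctl would surface immediately:

```javascript
// keyctl() was never part of app-a's statically derived profile,
// so an exploit attempt registers as an anomaly to alert on or block.
checkSyscall("app-a", "keyctl"); // => deviation
```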

What if the application is not written in an interpreted language?

A follow-up question about the above analysis is: what if the application is written in a language other than Node, Java, PHP, or one of the other interpreted languages? Well, even in that case, there is still a lot you can do. Keep in mind that the container universe today is still a limited space: there are particular containers in widespread use, and these are the ones we need to analyze first. Datadog published a report earlier this year detailing the ten most commonly used containers, including Registry, NGINX, Redis, Elasticsearch, Postgres, and others. You can see their report here, and our analysis here.

We could use DTrace, strace, or Sysdig to analyze these containers prior to deployment, in order to deduce the list of system calls that these applications intend to execute.
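For instance (commands illustrative – adjust process names and filters for your environment), strace can summarize every syscall an application makes during a test run, and Sysdig can scope a live capture to a single container:

```bash
# Summarize which syscalls redis-server makes during a test run
# (stop the process when done; strace writes a per-syscall count summary)
strace -f -c -o redis-syscalls.txt redis-server

# Capture syscall names from one running container; dedupe afterwards to form a whitelist
sudo sysdig -p "%evt.type" container.name=redis
```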

We can take this a step further for custom-developed containers. We could use launch-time metadata, static binary analysis, dynamic analysis in a sandbox, and runtime machine learning to determine the intended syscall behavior of a containerized application, and then use that to build a whitelisting profile.

This Appsec talk is just the beginning of a series of technical talks we will give around this topic. At Appsec California, coming up in January, I'll deliver a talk on whitelisting file system behavior for containers.

Again, if you are interested in accessing the entirety of the talk, the recording is available here, thanks to OWASP.
