seccomp-bpf is an extension to seccomp[8] that allows filtering of system calls using
a configurable policy implemented using Berkeley Packet Filter rules. It is used by
OpenSSH and vsftpd as well as the Google Chrome/Chromium web browsers on Chrome OS and
Linux. (In this regard seccomp-bpf achieves similar functionality to the older
systrace—which seems to be no longer supported for Linux).
-- https://en.wikipedia.org/wiki/Seccomp
Right now this is all handled in App::EvalServerAdvanced::Seccomp, with a large set of predefined rules, organized into 'profiles'. Each profile is intended to represent a single kind of action that a program could do, such as open a file for reading, open a file for writing, etc.
I've created a few profiles to start with:
stdio: Allows reading from STDIN and writing to STDOUT/STDERR.
file_open: Allows some file-related system calls: open, openat, close, select, read (on any descriptor), pread64, lseek, fstat, lstat, stat, fcntl, and ioctl with the flags used to detect whether a descriptor is a tty. The flags that may be passed when opening a file are defined in the "open_modes" rules, which are covered later.
file_opendir: Allows opening a directory to get a list of files, and also includes the file_open profile so that the handle can be used. Essentially allows the behavior of /bin/ls or similar programs.
file_tty: Adds O_NOCTTY to the flags that may be passed to open() and similar calls.
file_readonly: Adds O_NONBLOCK, O_EXCL, O_RDONLY, O_NOFOLLOW, and O_CLOEXEC to the flags that may be passed to open() and similar calls.
file_write: Adds O_CREAT, O_WRONLY, O_TRUNC, and O_RDWR to the flags that may be passed to open() and similar calls. Also allows the write, pwrite64, mkdir, and chmod syscalls.
time_calls: Allows the nanosleep, clock_gettime, and clock_getres syscalls. For perl this means allowing time() and similar calls, and sleep(), along with Time::HiRes.
ruby_timer_thread: This one is a special ruby-specific profile. It allows ruby to create a thread that it uses internally, and only allows that thread creation with one specific set of flags: CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID. This prevents it from doing arbitrary fork() calls while still allowing the interpreter to run. It also allows pipe2 to be called to create a communication channel between the two threads.
perl_file_temp: This was added specifically for the behavior of File::Temp, and might get folded into a more generic profile. It allows unlink to be called, and chmod with a mode of 0600.
exec_wrapper: This one is seriously special. It's not a predefined set of rules; the rules are generated at runtime. This is because of a limitation of seccomp: since seccomp can't dereference pointers, there's no way to verify the contents of a string being passed to execve(). Instead we create a whitelist of strings that can be passed to it, and only allow calls to execve that are passed pointers to those strings. This isn't perfectly secure, since someone could overwrite the contents at a later point, but it's safe enough because an attacker can't view the generated BPF to extract the addresses, and the strings themselves should be gone from memory by the time their code runs, preventing them from recreating the original addresses. This requires ASLR in order to be effective at preventing an attacker from deriving the addresses of the strings from previous runs.
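As a rough illustration of how exec_wrapper might build its rules at runtime, the sketch below takes a whitelist of strings, looks up the address of each string's buffer, and emits one execve rule per address. The string_address and exec_rules helpers, the whitelist contents, and the rules => [[arg, op, value]] layout (modeled on the permute_rules form shown later in this post) are all assumptions made for illustration, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical whitelist of programs the sandbox may exec.
my @exec_whitelist = ('/usr/bin/perl', '/usr/bin/ruby');

# Numeric address of a scalar's string buffer. pack 'p' stores a pointer to
# the null-terminated string; 'L!' reads it back as a native unsigned long,
# which is pointer-sized on Linux.
sub string_address {
    return unpack('L!', pack('p', $_[0]));
}

# One rule per whitelisted string: execve is allowed only when its first
# argument (the path pointer) equals the address of that string.
# The rule layout is an assumption, not the module's documented format.
sub exec_rules {
    my ($strings) = @_;
    return map {
        +{ syscall => 'execve', rules => [[0, '==', string_address($_)]] }
    } @$strings;
}

my @rules = exec_rules(\@exec_whitelist);

printf "execve allowed when arg0 == 0x%x (%s)\n",
    $rules[$_]{rules}[0][2], $exec_whitelist[$_]
    for 0 .. $#rules;
```

Under this scheme the rule only matches if the very same buffer is later handed to execve; a copy of the string would live at a different address and be rejected.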
There are also some other profiles, like ruby_timer_thread, specifically for allowing node.js to do similar things to ruby (create a thread, use epoll, etc.).
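To make the thread-only restriction concrete, here is a minimal sketch of the exact-match check that ruby_timer_thread places on clone. The flag values are the ones defined in <linux/sched.h>; the rule layout and the clone_allowed helper are hypothetical, and the argument index assumes x86-64, where the flags are clone()'s first argument.

```perl
use strict;
use warnings;

# Flag values as defined in <linux/sched.h>.
my %CLONE = (
    CLONE_VM             => 0x00000100,
    CLONE_FS             => 0x00000200,
    CLONE_FILES          => 0x00000400,
    CLONE_SIGHAND        => 0x00000800,
    CLONE_THREAD         => 0x00010000,
    CLONE_SYSVSEM        => 0x00040000,
    CLONE_SETTLS         => 0x00080000,
    CLONE_PARENT_SETTID  => 0x00100000,
    CLONE_CHILD_SETTID   => 0x01000000,
    CLONE_CHILD_CLEARTID => 0x00200000,
);

# The one flag combination the profile permits (0x3d0f00).
my @thread_flags = qw(
    CLONE_VM CLONE_FS CLONE_FILES CLONE_SIGHAND CLONE_THREAD
    CLONE_SYSVSEM CLONE_SETTLS CLONE_PARENT_SETTID CLONE_CHILD_CLEARTID
);
my $allowed = 0;
$allowed |= $CLONE{$_} for @thread_flags;

# Hypothetical rule layout: argument 0 of clone must equal $allowed exactly.
my $rule = { syscall => 'clone', rules => [[0, '==', $allowed]] };

# Stand-in for the comparison the BPF filter would perform.
sub clone_allowed {
    my ($flags) = @_;
    return $flags == $rule->{rules}[0][2] ? 1 : 0;
}

# Flags that glibc's fork() typically passes to clone: no CLONE_VM or
# CLONE_THREAD, plus SIGCHLD (17) as the exit signal.
my $fork_flags = $CLONE{CLONE_CHILD_CLEARTID} | $CLONE{CLONE_CHILD_SETTID} | 17;

printf "thread flags 0x%x allowed: %d\n", $allowed,    clone_allowed($allowed);
printf "fork flags   0x%x allowed: %d\n", $fork_flags, clone_allowed($fork_flags);
```

Because the comparison is an exact match rather than a mask, any clone call with a different flag word, including a fork-style one, falls through to the filter's default action.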
The way the rules are defined allows syscalls like open() to not need special handling. Since many syscalls can take flags, it's useful to be able to limit the flags they can take.
{syscall => 'openat', permute_rules => [['2', '==', \'open_modes']]},
Inside A::ESA::Seccomp you can define a syscall like the one above, which takes a set of automatically generated rules from a permutation. In this case it's called 'open_modes'. A profile can add values to (but not remove values from) the permutation rules, and when the whole BPF program gets compiled, all the applicable rules are generated for you. This makes setting up calls like open much simpler, since you don't have to write out all possible modes yourself. This is also an area where I could optimize the whole thing, but I have not done so yet. Seccomp itself supports some bitwise operations that could make this more efficient, but they were not well exposed through Linux::Seccomp when this was originally designed.
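The permutation step can be sketched roughly as follows. This assumes that "permutation" means OR-ing together every subset of the values the enabled profiles contributed, and that each resulting value becomes one exact-match rule; the %profiles table, the helper names, and the rules layout are hypothetical. The last part shows the kind of bitwise check (for example libseccomp's SCMP_CMP_MASKED_EQ comparison) that could replace the expanded list when every value is a single bit.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY O_WRONLY O_RDWR O_CREAT O_TRUNC O_EXCL O_NONBLOCK);

# Hypothetical profile table: each profile adds values to a named set.
my %profiles = (
    file_readonly => { permute => { open_modes => [O_RDONLY, O_NONBLOCK, O_EXCL] } },
    file_write    => { permute => { open_modes => [O_CREAT, O_WRONLY, O_TRUNC, O_RDWR] } },
);

# Gather the distinct values that the enabled profiles add to one set.
sub collect_values {
    my ($set, @enabled) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { @{ $profiles{$_}{permute}{$set} || [] } } @enabled;
}

# OR together every subset of the values and return the distinct results.
sub permute_values {
    my @values = map { $_ + 0 } @_;    # force numeric bitwise ops
    my @out = (0);
    for my $v (@values) {
        my @with_v = map { $_ | $v } @out;
        push @out, @with_v;
    }
    my %seen;
    return sort { $a <=> $b } grep { !$seen{$_}++ } @out;
}

# One exact-match rule per value, on the given argument of the syscall.
sub expand_rule {
    my ($syscall, $arg, @combos) = @_;
    return map { +{ syscall => $syscall, rules => [[$arg, '==', $_]] } } @combos;
}

my @values = collect_values('open_modes', qw(file_readonly file_write));
my @combos = permute_values(@values);
my @rules  = expand_rule('openat', 2, @combos);

# Masked alternative: accept any flag word with no bits outside the mask.
my $mask = 0;
$mask |= $_ for @values;
sub mask_allows {
    my ($flags) = @_;
    return ($flags & ~$mask) == 0 ? 1 : 0;
}

printf "%d values expand to %d exact-match rules\n", scalar @values, scalar @rules;
```

On Linux x86-64 these seven values (one of which, O_RDONLY, is zero) expand to 64 rules, while the masked form accepts the same set with a single comparison. That gap grows quickly as profiles add more flags, which is the optimization opportunity mentioned above.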
In the second part of this blog I'll document the proposed configuration scheme using YAML 1.2 and the perl modules located in the sandbox root.
Tags: evalserver seccomp
]]>