
Commit cb60e3e6 authored by Linus Torvalds
Pull security subsystem updates from James Morris:
 "New notable features:
   - The seccomp work from Will Drewry
   - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
   - Longer security labels for Smack from Casey Schaufler
   - Additional ptrace restriction modes for Yama by Kees Cook"

Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
  apparmor: fix long path failure due to disconnected path
  apparmor: fix profile lookup for unconfined
  ima: fix filename hint to reflect script interpreter name
  KEYS: Don't check for NULL key pointer in key_validate()
  Smack: allow for significantly longer Smack labels v4
  gfp flags for security_inode_alloc()?
  Smack: recursive tramsmute
  Yama: replace capable() with ns_capable()
  TOMOYO: Accept manager programs which do not start with / .
  KEYS: Add invalidation support
  KEYS: Do LRU discard in full keyrings
  KEYS: Permit in-place link replacement in keyring list
  KEYS: Perform RCU synchronisation on keys prior to key destruction
  KEYS: Announce key type (un)registration
  KEYS: Reorganise keys Makefile
  KEYS: Move the key config into security/keys/Kconfig
  KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
  Yama: remove an unused variable
  samples/seccomp: fix dependencies on arch macros
  Yama: add additional ptrace scopes
  ...
parents 99262a3d ff2bb047
+163 −0
		SECure COMPuting with filters
		=============================

Introduction
------------

A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated.  A
certain subset of userland applications benefit by having a reduced set
of available system calls.  The resulting set reduces the total kernel
surface exposed to the application.  System call filtering is meant for
use with those applications.

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls.  The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: system call
number and the system call arguments.  This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.

Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks.  BPF programs may not dereference
pointers, which constrains all filters to solely evaluating the system
call arguments directly.

What it isn't
-------------

System call filtering isn't a sandbox.  It provides a clearly defined
mechanism for minimizing the exposed kernel surface.  It is meant to be
a tool for sandbox developers to use.  Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing.  Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.

Usage
-----

An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp.  If the architecture has
CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:

PR_SET_SECCOMP:
	Now takes an additional argument which specifies a new filter
	using a BPF program.
	The BPF program will be executed over struct seccomp_data
	reflecting the system call number, arguments, and other
	metadata.  The BPF program must then return one of the
	acceptable values to inform the kernel which action should be
	taken.

	Usage:
		prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

	The 'prog' argument is a pointer to a struct sock_fprog which
	will contain the filter program.  If the program is invalid, the
	call will return -1 and set errno to EINVAL.

	If fork/clone and execve are allowed by @prog, any child
	processes will be constrained to the same filters and system
	call ABI as the parent.

	Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
	run with CAP_SYS_ADMIN privileges in its namespace.  If these are not
	true, -EACCES will be returned.  This requirement ensures that filter
	programs cannot be applied to child processes with greater privileges
	than the task that installed them.

	Additionally, if prctl(2) is allowed by the attached filter,
	additional filters may be layered on which will increase evaluation
	time, but allow for further decreasing the attack surface during
	execution of a process.

The above call returns 0 on success and non-zero on error.
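
The steps above can be sketched in C as follows.  This is a minimal
illustration, not the samples/seccomp/ code: the names example_filter and
install_example_filter are hypothetical, the filter assumes an x86-64
task, and uname(2) is picked arbitrarily as the call to fail with EPERM.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Assumption: x86-64 only; a task using any other ABI is killed. */
struct sock_filter example_filter[] = {
	/* Load the architecture value from struct seccomp_data. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	/* If it matches, skip the kill instruction below. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* Load the system call number. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* uname(2) falls through to the errno return; others skip to allow. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Returns 0 on success, -1 with errno set on failure. */
int install_example_filter(void)
{
	struct sock_fprog prog = {
		.len = (unsigned short)ARRAY_LEN(example_filter),
		.filter = example_filter,
	};

	/* Needed unless the task has CAP_SYS_ADMIN in its namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A caller would invoke install_example_filter() once, early in the process,
and check the return value before relying on the filter being in place.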

Return values
-------------
A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest-precedence value. (For example,
SECCOMP_RET_KILL will always take precedence.)

In precedence order, they are:

SECCOMP_RET_KILL:
	Results in the task exiting immediately without executing the
	system call.  The exit status of the task (status & 0x7f) will
	be SIGSYS, not SIGKILL.

SECCOMP_RET_TRAP:
	Results in the kernel sending a SIGSYS signal to the triggering
	task without executing the system call.  The kernel will
	roll back the register state to just before the system call
	entry such that a signal handler in the task will be able to
	inspect the ucontext_t->uc_mcontext registers and emulate
	system call success or failure upon return from the signal
	handler.

	The SECCOMP_RET_DATA portion of the return value will be passed
	as si_errno.

	SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.

SECCOMP_RET_ERRNO:
	Results in the lower 16 bits of the return value being passed
	to userland as the errno without executing the system call.

SECCOMP_RET_TRACE:
	When returned, this value will cause the kernel to attempt to
	notify a ptrace()-based tracer prior to executing the system
	call.  If there is no tracer present, -ENOSYS is returned to
	userland and the system call is not executed.

	A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
	using ptrace(PTRACE_SETOPTIONS).  The tracer will be notified
	of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
	the BPF program return value will be available to the tracer
	via PTRACE_GETEVENTMSG.

SECCOMP_RET_ALLOW:
	Results in the system call being executed.

If multiple filters exist, the return value for the evaluation of a
given system call will always use the highest-precedence value.

Precedence is only determined using the SECCOMP_RET_ACTION mask.  When
multiple filters return values of the same precedence, only the
SECCOMP_RET_DATA from the most recently installed filter will be
returned.
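
The selection rule can be modelled in userland as below.  This is only a
sketch of the rule as described here, not the kernel's implementation.  It
relies on the action values in <linux/seccomp.h> being numerically
ascending in the precedence order listed above (KILL lowest, ALLOW
highest), and the function name combine_filter_results is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * results[0] is the value returned by the most recently installed
 * filter, results[n - 1] the value from the oldest one.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	for (i = 0; i < n; i++) {
		/*
		 * Compare the action bits only.  The strict "<" keeps the
		 * earlier entry on a tie, so the data of the most recently
		 * installed filter is the one that survives.
		 */
		if ((results[i] & SECCOMP_RET_ACTION) <
		    (ret & SECCOMP_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

For instance, if two filters both return SECCOMP_RET_ERRNO with different
data, the data from results[0] is what the function returns.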

Pitfalls
--------

The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value.  Why?  On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation.  If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused.  Always check the arch value!

Example
-------

The samples/seccomp/ directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.



Adding architecture support
---------------------------

See arch/Kconfig for the authoritative requirements.  In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: SIGSYS support and seccomp return
value checking.  Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
to its arch-specific Kconfig.
+164 −40
@@ -15,7 +15,7 @@ at hand.

Smack consists of three major components:
    - The kernel
    - A start-up script and a few modified applications
    - Basic utilities, which are helpful but not required
    - Configuration data

The kernel component of Smack is implemented as a Linux
@@ -23,37 +23,28 @@ Security Modules (LSM) module. It requires netlabel and
works best with file systems that support extended attributes,
although xattr support is not strictly required.
It is safe to run a Smack kernel under a "vanilla" distribution.

Smack kernels use the CIPSO IP option. Some network
configurations are intolerant of IP options and can impede
access to systems that use them as Smack does.

The startup script etc-init.d-smack should be installed
in /etc/init.d/smack and should be invoked early in the
start-up process. On Fedora rc5.d/S02smack is recommended.
This script ensures that certain devices have the correct
Smack attributes and loads the Smack configuration if
any is defined. This script invokes two programs that
ensure configuration data is properly formatted. These
programs are /usr/sbin/smackload and /usr/sin/smackcipso.
The system will run just fine without these programs,
but it will be difficult to set access rules properly.

A version of "ls" that provides a "-M" option to display
Smack labels on long listing is available.
The current git repositories for Smack user space are:

A hacked version of sshd that allows network logins by users
with specific Smack labels is available. This version does
not work for scp. You must set the /etc/ssh/sshd_config
line:
   UsePrivilegeSeparation no
	git@gitorious.org:meego-platform-security/smackutil.git
	git@gitorious.org:meego-platform-security/libsmack.git

The format of /etc/smack/usr is:
These should make and install on most modern distributions.
There are three commands included in smackutil:

   username smack
smackload  - properly formats data for writing to /smack/load
smackcipso - properly formats data for writing to /smack/cipso
chsmack    - display or set Smack extended attribute values

In keeping with the intent of Smack, configuration data is
minimal and not strictly required. The most important
configuration step is mounting the smackfs pseudo filesystem.
If smackutil is installed the startup script will take care
of this, but it can be done manually as well.

Add this line to /etc/fstab:

@@ -61,19 +52,148 @@ Add this line to /etc/fstab:

and create the /smack directory for mounting.

Smack uses extended attributes (xattrs) to store file labels.
The command to set a Smack label on a file is:
Smack uses extended attributes (xattrs) to store labels on filesystem
objects. The attributes are stored in the extended attribute security
name space. A process must have CAP_MAC_ADMIN to change any of these
attributes.

The extended attributes that Smack uses are:

SMACK64
	Used to make access control decisions. In almost all cases
	the label given to a new filesystem object will be the label
	of the process that created it.
SMACK64EXEC
	The Smack label of a process that execs a program file with
	this attribute set will run with this attribute's value.
SMACK64MMAP
	Don't allow the file to be mmapped by a process whose Smack
	label does not allow all of the access permitted to a process
	with the label contained in this attribute. This is a very
	specific use case for shared libraries.
SMACK64TRANSMUTE
	Can only have the value "TRUE". If this attribute is present
	on a directory when an object is created in the directory and
	the Smack rule (more below) that permitted the write access
	to the directory includes the transmute ("t") mode the object
	gets the label of the directory instead of the label of the
	creating process. If the object being created is a directory
	the SMACK64TRANSMUTE attribute is set as well.
SMACK64IPIN
	This attribute is only available on file descriptors for sockets.
	Use the Smack label in this attribute for access control
	decisions on packets being delivered to this socket.
SMACK64IPOUT
	This attribute is only available on file descriptors for sockets.
	Use the Smack label in this attribute for access control
	decisions on packets coming from this socket.

There are multiple ways to set a Smack label on a file:

    # attr -S -s SMACK64 -V "value" path
    # chsmack -a value path

NOTE: Smack labels are limited to 23 characters. The attr command
      does not enforce this restriction and can be used to set
      invalid Smack labels on files.

If you don't do anything special all users will get the floor ("_")
label when they log in. If you do want to log in via the hacked ssh
at other labels use the attr command to set the smack value on the
home directory and its contents.
A process can see the smack label it is running with by
reading /proc/self/attr/current. A process with CAP_MAC_ADMIN
can set the process smack by writing there.

Most Smack configuration is accomplished by writing to files
in the smackfs filesystem. This pseudo-filesystem is usually
mounted on /smack.

access
	This interface reports whether a subject with the specified
	Smack label has a particular access to an object with a
	specified Smack label. Write a fixed format access rule to
	this file. The next read will indicate whether the access
	would be permitted. The text will be either "1" indicating
	access, or "0" indicating denial.
access2
	This interface reports whether a subject with the specified
	Smack label has a particular access to an object with a
	specified Smack label. Write a long format access rule to
	this file. The next read will indicate whether the access
	would be permitted. The text will be either "1" indicating
	access, or "0" indicating denial.
ambient
	This contains the Smack label applied to unlabeled network
	packets.
cipso
	This interface allows a specific CIPSO header to be assigned
	to a Smack label. The format accepted on write is:
		"%24s%4d%4d"["%4d"]...
	The first string is a fixed Smack label. The first number is
	the level to use. The second number is the number of categories.
	The following numbers are the categories.
	"level-3-cats-5-19          3   2   5  19"
cipso2
	This interface allows a specific CIPSO header to be assigned
	to a Smack label. The format accepted on write is:
	"%s%4d%4d"["%4d"]...
	The first string is a long Smack label. The first number is
	the level to use. The second number is the number of categories.
	The following numbers are the categories.
	"level-3-cats-5-19   3   2   5  19"
direct
	This contains the CIPSO level used for Smack direct label
	representation in network packets.
doi
	This contains the CIPSO domain of interpretation used in
	network packets.
load
	This interface allows access control rules in addition to
	the system defined rules to be specified. The format accepted
	on write is:
		"%24s%24s%5s"
	where the first string is the subject label, the second the
	object label, and the third the requested access. The access
	string may contain only the characters "rwxat-", and specifies
	which sort of access is allowed. The "-" is a placeholder for
	permissions that are not allowed. The string "r-x--" would
	specify read and execute access. Labels are limited to 23
	characters in length.
load2
	This interface allows access control rules in addition to
	the system defined rules to be specified. The format accepted
	on write is:
		"%s %s %s"
	where the first string is the subject label, the second the
	object label, and the third the requested access. The access
	string may contain only the characters "rwxat-", and specifies
	which sort of access is allowed. The "-" is a placeholder for
	permissions that are not allowed. The string "r-x--" would
	specify read and execute access.
load-self
	This interface allows process specific access rules to be
	defined. These rules are only consulted if access would
	otherwise be permitted, and are intended to provide additional
	restrictions on the process. The format is the same as for
	the load interface.
load-self2
	This interface allows process specific access rules to be
	defined. These rules are only consulted if access would
	otherwise be permitted, and are intended to provide additional
	restrictions on the process. The format is the same as for
	the load2 interface.
logging
	This contains the Smack logging state.
mapped
	This contains the CIPSO level used for Smack mapped label
	representation in network packets.
netlabel
	This interface allows specific internet addresses to be
	treated as single label hosts. Packets are sent to single
	label hosts without CIPSO headers, but only from processes
	that have Smack write access to the host label. All packets
	received from single label hosts are given the specified
	label. The format accepted on write is:
		"%d.%d.%d.%d label" or "%d.%d.%d.%d/%d label".
onlycap
	This contains the label processes must have for CAP_MAC_ADMIN
	and CAP_MAC_OVERRIDE to be effective. If this file is empty
	these capabilities are effective for processes with any
	label. The value is set by writing the desired label to the
	file or cleared by writing "-" to the file.
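
The fixed-width formats for the load and cipso interfaces can be produced
with snprintf(3), as in the sketch below.  Left-justifying each label in
its 24-column field is an assumption inferred from the cipso example
string above, which it reproduces.  The function names are hypothetical
and this is not the smackutil code.

```c
#include <stdio.h>
#include <string.h>

#define SMACK_FIXED_LABEL_MAX 23	/* label limit for fixed formats */

/* Build a "%24s%24s%5s" rule for /smack/load.  Returns 0 or -1. */
int smack_format_rule(char *buf, size_t size, const char *subject,
		      const char *object, const char *access)
{
	int n;

	if (strlen(subject) > SMACK_FIXED_LABEL_MAX ||
	    strlen(object) > SMACK_FIXED_LABEL_MAX ||
	    strlen(access) != 5)
		return -1;
	n = snprintf(buf, size, "%-24s%-24s%-5s", subject, object, access);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Build a "%24s%4d%4d"["%4d"]... line for /smack/cipso.  Returns 0 or -1. */
int smack_format_cipso(char *buf, size_t size, const char *label, int level,
		       const int *cats, int ncats)
{
	size_t used;
	int n, i;

	if (strlen(label) > SMACK_FIXED_LABEL_MAX)
		return -1;
	/* Label, level, then the number of categories that follow. */
	n = snprintf(buf, size, "%-24s%4d%4d", label, level, ncats);
	if (n < 0 || (size_t)n >= size)
		return -1;
	used = (size_t)n;
	/* Append each category in its own 4-column field. */
	for (i = 0; i < ncats; i++) {
		n = snprintf(buf + used, size - used, "%4d", cats[i]);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

The resulting buffer would then be written to the corresponding smackfs
file by a process with CAP_MAC_ADMIN.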

You can add access rules in /etc/smack/accesses. They take the form:

@@ -83,10 +203,6 @@ access is a combination of the letters rwxa which specify the
kind of access permitted a subject with subjectlabel on an
object with objectlabel. If there is no rule no access is allowed.

A process can see the smack label it is running with by
reading /proc/self/attr/current. A privileged process can
set the process smack by writing there.

Look for additional programs on http://schaufler-ca.com

From the Smack Whitepaper:
@@ -186,7 +302,7 @@ team. Smack labels are unstructured, case sensitive, and the only operation
ever performed on them is comparison for equality. Smack labels cannot
contain unprintable characters, the "/" (slash), the "\" (backslash), the "'"
(quote) and '"' (double-quote) characters.
Smack labels cannot begin with a '-', which is reserved for special options.
Smack labels cannot begin with a '-'. This is reserved for special options.

There are some predefined labels:

@@ -194,7 +310,7 @@ There are some predefined labels:
	^ 	Pronounced "hat", a single circumflex character.
	* 	Pronounced "star", a single asterisk character.
	? 	Pronounced "huh", a single question mark character.
	@ 	Pronounced "Internet", a single at sign character.
	@ 	Pronounced "web", a single at sign character.

Every task on a Smack system is assigned a label. System tasks, such as
init(8) and systems daemons, are run with the floor ("_") label. User tasks
@@ -246,13 +362,14 @@ The format of an access rule is:

Where subject-label is the Smack label of the task, object-label is the Smack
label of the thing being accessed, and access is a string specifying the sort
of access allowed. The Smack labels are limited to 23 characters. The access
specification is searched for letters that describe access modes:
of access allowed. The access specification is searched for letters that
describe access modes:

	a: indicates that append access should be granted.
	r: indicates that read access should be granted.
	w: indicates that write access should be granted.
	x: indicates that execute access should be granted.
	t: indicates that the rule requests transmutation.

Uppercase values for the specification letters are allowed as well.
Access mode specifications can be in any order. Examples of acceptable rules
@@ -273,7 +390,7 @@ Examples of unacceptable rules are:

Spaces are not allowed in labels. Since a subject always has access to files
with the same label specifying a rule for that case is pointless. Only
valid letters (rwxaRWXA) and the dash ('-') character are allowed in
valid letters (rwxatRWXAT) and the dash ('-') character are allowed in
access specifications. The dash is a placeholder, so "a-r" is the same
as "ar". A lone dash is used to specify that no access should be allowed.
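
The parsing rule for access specifications can be sketched as a small C
function.  The bit values and the name parse_access are hypothetical and
chosen only for this illustration; they are not the kernel's internal
constants.

```c
#include <stddef.h>

/* Hypothetical bit values, for this sketch only. */
enum {
	ACC_READ      = 1 << 0,
	ACC_WRITE     = 1 << 1,
	ACC_EXEC      = 1 << 2,
	ACC_APPEND    = 1 << 3,
	ACC_TRANSMUTE = 1 << 4,
};

/*
 * Returns a mask of ACC_* bits, or -1 if the specification contains a
 * character other than rwxat (either case) or the dash placeholder.
 */
int parse_access(const char *spec)
{
	int mask = 0;

	if (spec == NULL)
		return -1;
	for (; *spec != '\0'; spec++) {
		switch (*spec) {
		case 'r': case 'R': mask |= ACC_READ; break;
		case 'w': case 'W': mask |= ACC_WRITE; break;
		case 'x': case 'X': mask |= ACC_EXEC; break;
		case 'a': case 'A': mask |= ACC_APPEND; break;
		case 't': case 'T': mask |= ACC_TRANSMUTE; break;
		case '-': break;	/* placeholder, grants nothing */
		default: return -1;
		}
	}
	return mask;
}
```

With this sketch, "a-r" and "ar" yield the same mask, and a lone dash
yields an empty mask, matching the description above.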

@@ -297,6 +414,13 @@ but not any of its attributes by the circumstance of having read access to the
containing directory but not to the differently labeled file. This is an
artifact of the file name being data in the directory, not a part of the file.

If a directory is marked as transmuting (SMACK64TRANSMUTE=TRUE) and the
access rule that allows a process to create an object in that directory
includes 't' access the label assigned to the new object will be that
of the directory, not the creating process. This makes it much easier
for two processes with different labels to share data without granting
access to all of their files.

IPC objects, message queues, semaphore sets, and memory segments exist in flat
namespaces and access requests are only required to match the object in
question.
+9 −1
@@ -34,7 +34,7 @@ parent to a child process (i.e. direct "gdb EXE" and "strace EXE" still
work), or with CAP_SYS_PTRACE (i.e. "gdb --pid=PID", and "strace -p PID"
still work as root).

For software that has defined application-specific relationships
In mode 1, software that has defined application-specific relationships
between a debugging process and its inferior (crash handlers, etc),
prctl(PR_SET_PTRACER, pid, ...) can be used. An inferior can declare which
other process (and its descendants) are allowed to call PTRACE_ATTACH
@@ -46,6 +46,8 @@ restrictions, it can call prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)
so that any otherwise allowed process (even those in external pid namespaces)
may attach.

These restrictions do not change how ptrace via PTRACE_TRACEME operates.

The sysctl settings are:

0 - classic ptrace permissions: a process can PTRACE_ATTACH to any other
@@ -60,6 +62,12 @@ The sysctl settings are:
    inferior can call prctl(PR_SET_PTRACER, debugger, ...) to declare
    an allowed debugger PID to call PTRACE_ATTACH on the inferior.

2 - admin-only attach: only processes with CAP_SYS_PTRACE may use ptrace
    with PTRACE_ATTACH.

3 - no attach: no processes may use ptrace with PTRACE_ATTACH. Once set,
    this sysctl cannot be changed to a lower value.
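
A program can check which mode is active before relying on PTRACE_ATTACH.
The sketch below assumes the sysctl is exposed at
/proc/sys/kernel/yama/ptrace_scope; the function names are hypothetical,
and the short name for mode 1 ("restricted ptrace") is taken from the
full Yama documentation rather than from the lines shown here.

```c
#include <stdio.h>

/* Map a ptrace_scope value to the short name used in the list above. */
const char *yama_scope_name(int scope)
{
	switch (scope) {
	case 0: return "classic ptrace permissions";
	case 1: return "restricted ptrace";
	case 2: return "admin-only attach";
	case 3: return "no attach";
	default: return "unknown";
	}
}

/*
 * Returns the current scope, or -1 if the sysctl cannot be read
 * (for example, when Yama is not enabled in the running kernel).
 */
int yama_read_scope(void)
{
	FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");
	int scope = -1;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &scope) != 1)
		scope = -1;
	fclose(f);
	return scope;
}
```

A debugger front end could call yama_read_scope() and print
yama_scope_name() of the result to explain why an attach was refused.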

The original children-only logic was based on the restrictions in grsecurity.

==============================================================
+17 −0
@@ -805,6 +805,23 @@ The keyctl syscall functions are:
     kernel and resumes executing userspace.


 (*) Invalidate a key.

	long keyctl(KEYCTL_INVALIDATE, key_serial_t key);

     This function marks a key as being invalidated and then wakes up the
     garbage collector.  The garbage collector immediately removes invalidated
     keys from all keyrings and deletes the key when its reference count
     reaches zero.

     Keys that are marked invalidated become invisible to normal key operations
     immediately, though they are still visible in /proc/keys until deleted
     (they're marked with an 'i' flag).

     A process must have search permission on the key for this function to be
     successful.
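
     As a sketch, the call can be issued through the raw syscall interface
     when no keyutils wrapper is at hand.  The wrapper name invalidate_key
     and the local key_serial_t typedef are assumptions of this example.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/keyctl.h>

/* Key serial numbers are 32-bit signed integers in the userspace ABI. */
typedef int32_t key_serial_t;

/* Returns 0 on success, -1 with errno set on failure. */
long invalidate_key(key_serial_t key)
{
	return syscall(SYS_keyctl, KEYCTL_INVALIDATE, key);
}
```

     The caller needs search permission on the key; an invalid serial
     number makes the call fail with -1.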


===============
KERNEL SERVICES
===============
+2 −1
@@ -1733,6 +1733,7 @@ S: Supported
F:	include/linux/capability.h
F:	security/capability.c
F:	security/commoncap.c 
F:	kernel/capability.c

CELL BROADBAND ENGINE ARCHITECTURE
M:	Arnd Bergmann <arnd@arndb.de>
@@ -5950,7 +5951,7 @@ SECURITY SUBSYSTEM
M:	James Morris <james.l.morris@oracle.com>
L:	linux-security-module@vger.kernel.org (suggested Cc:)
T:	git git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security.git
W:	http://security.wiki.kernel.org/
W:	http://kernsec.org/
S:	Supported
F:	security/
