
Commit 8bdc69b7 authored by Linus Torvalds
Pull cgroup updates from Tejun Heo:

 - a new PIDs controller is added.  It turns out that PIDs are actually
   an independent resource from kmem due to the limited PID space.

 - more core preparations for the v2 interface.  Once cpu side interface
   is settled, it should be ready for lifting the devel mask.
   for-4.3-unified-base was temporarily branched so that other trees
   (block) can pull cgroup core changes that blkcg changes depend on.

 - a non-critical idr_preload usage bug fix.

* 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: pids: fix invalid get/put usage
  cgroup: introduce cgroup_subsys->legacy_name
  cgroup: don't print subsystems for the default hierarchy
  cgroup: make cftype->private a unsigned long
  cgroup: export cgrp_dfl_root
  cgroup: define controller file conventions
  cgroup: fix idr_preload usage
  cgroup: add documentation for the PIDs controller
  cgroup: implement the PIDs subsystem
  cgroup: allow a cgroup subsystem to reject a fork
parents 76ec51ef 20f1f4b5
+5 −0
@@ -3219,6 +3219,11 @@ S: 69 rue Dunois
S: 75013 Paris
S: France

N: Aleksa Sarai
E: cyphar@cyphar.com
W: https://www.cyphar.com/
D: `pids` cgroup subsystem

N: Dipankar Sarma
E: dipankar@in.ibm.com
D: RCU
+2 −0
@@ -22,6 +22,8 @@ net_cls.txt
	- Network classifier cgroups details and usages.
net_prio.txt
	- Network priority cgroups details and usages.
pids.txt
	- Process number cgroups details and usages.
resource_counter.txt
	- Resource Counter API.
unified-hierarchy.txt
+85 −0
						   Process Number Controller
						   =========================

Abstract
--------

The process number controller is used to allow a cgroup hierarchy to stop any
new tasks from being fork()'d or clone()'d after a certain limit is reached.

Since it is trivial to hit the task limit without hitting any kmemcg limits in
place, PIDs are a fundamental resource. As such, PID exhaustion must be
preventable in the scope of a cgroup hierarchy by allowing resource limiting of
the number of tasks in a cgroup.

Usage
-----

In order to use the `pids` controller, set the maximum number of tasks in
pids.max (this file is not available in the root cgroup, which is exempt from
resource control). The number of processes currently in the cgroup is given by
pids.current.

Organisational operations are not blocked by cgroup policies, so it is possible
to have pids.current > pids.max. This can be done by either setting the limit to
be smaller than pids.current, or attaching enough processes to the cgroup such
that pids.current > pids.max. However, it is not possible to violate a cgroup
policy through fork() or clone(). fork() and clone() will return -EAGAIN if the
creation of a new process would cause a cgroup policy to be violated.

To set a cgroup to have no limit, set pids.max to "max". This is the default for
all new cgroups. Note that PID limits are hierarchical, so the most stringent
limit in the hierarchy is followed.
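
As an illustration of the two forms pids.max can take, here is a minimal
userspace sketch of a parser for its contents. The function name and the
PIDS_NO_LIMIT sentinel are hypothetical and are not part of the kernel
interface; only the "max" token and the plain integer form come from the text
above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel meaning "no limit"; not a kernel constant. */
#define PIDS_NO_LIMIT LLONG_MAX

/*
 * Parse the text of a pids.max file: either the token "max" or a
 * non-negative decimal integer, optionally followed by one newline.
 * Returns 0 and stores the limit in *out, or -EINVAL on malformed input.
 */
static int parse_pids_max(const char *text, long long *out)
{
	char buf[32];
	char *end;
	size_t len;
	long long val;

	assert(text != NULL && out != NULL);
	len = strlen(text);
	if (len > 0 && text[len - 1] == '\n')
		len--;
	if (len == 0 || len >= sizeof(buf))
		return -EINVAL;
	memcpy(buf, text, len);
	buf[len] = '\0';

	if (strcmp(buf, "max") == 0) {
		*out = PIDS_NO_LIMIT;
		return 0;
	}

	/* Reject signs and leading whitespace that strtoll() would accept. */
	if (buf[0] < '0' || buf[0] > '9')
		return -EINVAL;
	errno = 0;
	val = strtoll(buf, &end, 10);
	if (errno != 0 || *end != '\0')
		return -EINVAL;
	*out = val;
	return 0;
}
```

A caller would read the file into a buffer and pass it in unchanged; a result
equal to PIDS_NO_LIMIT means the cgroup itself imposes no limit, although an
ancestor still may.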

pids.current also counts the tasks in all descendant cgroups, so
parent/pids.current is always at least parent/child/pids.current.
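
The rules in this section can be modelled in a few lines of userspace C. This
is only a sketch under stated assumptions: struct pids_node and the function
names are hypothetical, and the code illustrates the documented behaviour
(attach is never refused, fork is refused with -EAGAIN when any level of the
hierarchy would exceed its limit) rather than reproducing the kernel
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PIDS_NO_LIMIT (-1LL)	/* hypothetical stand-in for "max" */

/* Hypothetical userspace model of one cgroup's pids state. */
struct pids_node {
	struct pids_node *parent;	/* NULL at the top of the modelled hierarchy */
	long long current;		/* tasks in this cgroup and its descendants */
	long long limit;		/* pids.max, or PIDS_NO_LIMIT */
};

/* Organisational operation (attach): always succeeds, limits are ignored. */
static void pids_charge(struct pids_node *node)
{
	assert(node != NULL);
	for (; node != NULL; node = node->parent)
		node->current++;
}

/* fork()/clone(): refused if any level of the hierarchy would go over. */
static int pids_try_charge(struct pids_node *node)
{
	struct pids_node *p, *q;

	assert(node != NULL);
	for (p = node; p != NULL; p = p->parent) {
		if (p->limit != PIDS_NO_LIMIT && p->current + 1 > p->limit) {
			/* Roll back the levels that were already charged. */
			for (q = node; q != p; q = q->parent)
				q->current--;
			return -EAGAIN;
		}
		p->current++;
	}
	return 0;
}
```

With a parent limited to 2 and an unlimited child, two attaches to the child
bring both counters to 2, after which pids_try_charge() on the child fails
because of the parent's limit, while a further pids_charge() still succeeds.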

Example
-------

First, we mount the pids controller:
# mkdir -p /sys/fs/cgroup/pids
# mount -t cgroup -o pids none /sys/fs/cgroup/pids

Then we create a hierarchy, set limits and attach processes to it:
# mkdir -p /sys/fs/cgroup/pids/parent/child
# echo 2 > /sys/fs/cgroup/pids/parent/pids.max
# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current
2
#

It should be noted that attempts to overcome the set limit (2 in this case) will
fail:

# cat /sys/fs/cgroup/pids/parent/pids.current
2
# ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporarily unavailable
#

Even if we migrate to a child cgroup (which doesn't have a set limit), we will
not be able to overcome the most stringent limit in the hierarchy (in this case,
parent's):

# echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current
2
# cat /sys/fs/cgroup/pids/parent/child/pids.current
2
# cat /sys/fs/cgroup/pids/parent/child/pids.max
max
# ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporarily unavailable
#

We can set a limit that is smaller than pids.current, which will stop any new
processes from being forked at all (note that the shell itself counts towards
pids.current):

# echo 1 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporarily unavailable
# echo 0 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporarily unavailable
#
+72 −8
@@ -23,10 +23,13 @@ CONTENTS
5. Other Changes
  5-1. [Un]populated Notification
  5-2. Other Core Changes
-  5-3. Per-Controller Changes
-    5-3-1. blkio
-    5-3-2. cpuset
-    5-3-3. memory
+  5-3. Controller File Conventions
+    5-3-1. Format
+    5-3-2. Control Knobs
+  5-4. Per-Controller Changes
+    5-4-1. blkio
+    5-4-2. cpuset
+    5-4-3. memory
6. Planned Changes
  6-1. CAP for resource control

@@ -372,14 +375,75 @@ supported and the interface files "release_agent" and
- The "cgroup.clone_children" file is removed.


-5-3. Per-Controller Changes
+5-3. Controller File Conventions

-5-3-1. blkio
+5-3-1. Format

In general, all controller files should be in one of the following
formats whenever possible.

- Values only files

  VAL0 VAL1...\n

- Flat keyed files

  KEY0 VAL0\n
  KEY1 VAL1\n
  ...

- Nested keyed files

  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
  ...
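
To make the flat keyed layout concrete, the following sketch looks up one key
in the contents of such a file. The function name is hypothetical; the only
thing taken from the text is the "KEY VAL\n" line shape.

```c
#include <assert.h>
#include <string.h>

/*
 * Look up `key` in the contents of a flat keyed file (one "KEY VAL" pair
 * per line) and copy its value into `val`.
 * Returns 0 if the key was found and the value fits, -1 otherwise.
 */
static int flat_keyed_get(const char *text, const char *key,
			  char *val, size_t val_len)
{
	size_t key_len;
	const char *line = text;

	assert(text != NULL && key != NULL && val != NULL && val_len > 0);
	key_len = strlen(key);
	while (*line != '\0') {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		/* The key must be followed directly by the separating space. */
		if (len > key_len && strncmp(line, key, key_len) == 0 &&
		    line[key_len] == ' ') {
			size_t n = len - key_len - 1;

			if (n >= val_len)
				return -1;
			memcpy(val, line + key_len + 1, n);
			val[n] = '\0';
			return 0;
		}
		line = eol ? eol + 1 : line + len;
	}
	return -1;
}
```

Matching on the full key plus the following space keeps a lookup of "KEY" from
accidentally matching a line that starts with "KEY0".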

For a writeable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time.  For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.
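
Because sub key pairs may come in any order and may be omitted, a reader of a
nested keyed line has to search for each sub key rather than rely on position.
A minimal sketch, with a hypothetical function name and operating on a single
line:

```c
#include <assert.h>
#include <string.h>

/*
 * Find `sub_key` in one nested keyed line of the form
 * "KEY SUB_KEY0=VAL0 SUB_KEY1=VAL1" and copy its value into `val`.
 * Returns 0 if found and the value fits, -1 otherwise.
 */
static int nested_keyed_get(const char *line, const char *sub_key,
			    char *val, size_t val_len)
{
	size_t sub_len;
	const char *p;

	assert(line != NULL && sub_key != NULL && val != NULL && val_len > 0);
	sub_len = strlen(sub_key);
	p = strchr(line, ' ');		/* skip the leading KEY */
	while (p != NULL) {
		size_t n;

		p++;			/* step past the separating space */
		n = strcspn(p, " \n");	/* length of this SUB_KEY=VAL token */
		if (n > sub_len && strncmp(p, sub_key, sub_len) == 0 &&
		    p[sub_len] == '=') {
			size_t vlen = n - sub_len - 1;

			if (vlen >= val_len)
				return -1;
			memcpy(val, p + sub_len + 1, vlen);
			val[vlen] = '\0';
			return 0;
		}
		p = strchr(p, ' ');
	}
	return -1;
}
```

A missing sub key is reported as -1, which lets the caller decide whether the
absence is an error for that particular file.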


5-3-2. Control Knobs

- Settings for a single feature should generally be implemented in a
  single file.

- In general, the root cgroup should be exempt from resource control
  and thus shouldn't have resource control knobs.

- If a controller implements ratio based resource distribution, the
  control knob should be named "weight" and have the range [1, 10000]
  and 100 should be the default value.  The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the control knobs should be named "min" and "max"
  respectively.  If a controller implements best effort resource
  guarantee and/or limit, the control knobs should be named "low" and
  "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has a configurable default value and specific overrides,
  the default setting should be keyed with "default" and appear as
  the first entry in the file.  Specific entries can use "default" as
  their value to indicate inheritance of the default value.
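
Two of the conventions above lend themselves to a short sketch: the "weight"
range with its default, and the "max" token for an unlimited value. The macro
and function names below are hypothetical, as is the choice of ULLONG_MAX to
represent infinity; only the numbers 1, 100 and 10000 and the "max" token come
from the text.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Range and default for a "weight" knob, as given in the convention. */
#define WEIGHT_MIN	1
#define WEIGHT_DEFAULT	100
#define WEIGHT_MAX	10000

/* Hypothetical in-memory representation of "no limit". */
#define LIMIT_INFINITY	ULLONG_MAX

/* A weight is acceptable only inside [WEIGHT_MIN, WEIGHT_MAX]. */
static int weight_is_valid(long weight)
{
	return weight >= WEIGHT_MIN && weight <= WEIGHT_MAX;
}

/*
 * Format a "min"/"max"/"low"/"high" value for reading: the token "max"
 * for infinity, a decimal number otherwise. Returns what snprintf() returns.
 */
static int format_limit(unsigned long long limit, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (limit == LIMIT_INFINITY)
		return snprintf(buf, len, "max\n");
	return snprintf(buf, len, "%llu\n", limit);
}
```

The range is symmetric around the default in the multiplicative sense: the
default can be scaled down by a factor of 100 (to 1) or up by a factor of 100
(to 10000).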


5-4. Per-Controller Changes

5-4-1. blkio

- blk-throttle becomes properly hierarchical.


-5-3-2. cpuset
+5-4-2. cpuset

- Tasks are kept in empty cpusets after hotplug and take on the masks
  of the nearest non-empty ancestor, instead of being moved to it.
@@ -388,7 +452,7 @@ supported and the interface files "release_agent" and
  masks of the nearest non-empty ancestor.


-5-3-3. memory
+5-4-3. memory

- use_hierarchy is on by default and the cgroup file for the flag is
  not created.
+13 −2
@@ -34,12 +34,17 @@ struct seq_file;

/* define the enumeration of all cgroup subsystems */
#define SUBSYS(_x) _x ## _cgrp_id,
#define SUBSYS_TAG(_t) CGROUP_ ## _t, \
	__unused_tag_ ## _t = CGROUP_ ## _t - 1,
enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
	CGROUP_SUBSYS_COUNT,
};
#undef SUBSYS_TAG
#undef SUBSYS

#define CGROUP_CANFORK_COUNT (CGROUP_CANFORK_END - CGROUP_CANFORK_START)

/* bits in struct cgroup_subsys_state flags field */
enum {
	CSS_NO_REF	= (1 << 0), /* no reference counting for this css */
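
The SUBSYS_TAG() trick in the hunk above can be tried in isolation. The sketch
below replaces the #include of <linux/cgroup_subsys.h> with a hypothetical
DEMO_SUBSYS_LIST macro; the subsystem names and their order are illustrative
assumptions, and canfork_index() is a hypothetical helper. The SUBSYS(),
SUBSYS_TAG() and CGROUP_CANFORK_COUNT definitions are copied from the diff.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the subsystem list header: SUBSYS() entries,
 * with SUBSYS_TAG() markers around the ones that implement can_fork.
 */
#define DEMO_SUBSYS_LIST		\
	SUBSYS(cpu)			\
	SUBSYS(memory)			\
	SUBSYS_TAG(CANFORK_START)	\
	SUBSYS(pids)			\
	SUBSYS_TAG(CANFORK_END)		\
	SUBSYS(debug)

#define SUBSYS(_x) _x ## _cgrp_id,
/*
 * A tag takes the value of the next ID, then a throwaway enumerator steps
 * the counter back by one so the following SUBSYS() reuses that value.
 */
#define SUBSYS_TAG(_t) CGROUP_ ## _t, \
	__unused_tag_ ## _t = CGROUP_ ## _t - 1,
enum cgroup_subsys_id {
	DEMO_SUBSYS_LIST
	CGROUP_SUBSYS_COUNT,
};
#undef SUBSYS_TAG
#undef SUBSYS

#define CGROUP_CANFORK_COUNT (CGROUP_CANFORK_END - CGROUP_CANFORK_START)

/* Hypothetical helper: index of a can_fork subsystem in a per-fork array. */
static inline int canfork_index(enum cgroup_subsys_id id)
{
	assert(id >= CGROUP_CANFORK_START && id < CGROUP_CANFORK_END);
	return id - CGROUP_CANFORK_START;
}
```

With this list, CGROUP_CANFORK_START shares its value with pids_cgrp_id and
CGROUP_CANFORK_END with debug_cgrp_id, so the tags mark a range without
consuming any subsystem ID and CGROUP_SUBSYS_COUNT stays at 4.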
@@ -318,7 +323,7 @@ struct cftype {
	 * end of cftype array.
	 */
	char name[MAX_CFTYPE_NAME];
-	int private;
+	unsigned long private;
	/*
	 * If not 0, file mode is set to this value, otherwise it will
	 * be figured out automatically
@@ -406,7 +411,9 @@ struct cgroup_subsys {
			      struct cgroup_taskset *tset);
	void (*attach)(struct cgroup_subsys_state *css,
		       struct cgroup_taskset *tset);
-	void (*fork)(struct task_struct *task);
+	int (*can_fork)(struct task_struct *task, void **priv_p);
+	void (*cancel_fork)(struct task_struct *task, void *priv);
+	void (*fork)(struct task_struct *task, void *priv);
	void (*exit)(struct cgroup_subsys_state *css,
		     struct cgroup_subsys_state *old_css,
		     struct task_struct *task);
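
One plausible way the three fork callbacks fit together is sketched below as a
userspace model. Everything here is an assumption made for illustration:
struct demo_subsys drops the task argument, and the driver functions and the
two example subsystems are hypothetical. The intent is only to show the
contract suggested by the signatures: can_fork may refuse and may hand back
private data, cancel_fork undoes a successful can_fork, and fork receives the
same private data once the fork goes ahead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the fork-related callbacks. */
struct demo_subsys {
	int (*can_fork)(void **priv_p);
	void (*cancel_fork)(void *priv);
	void (*fork)(void *priv);
};

/*
 * Ask each subsystem whether the fork may proceed. If one refuses, undo
 * the subsystems that had already agreed and return the refusal's error.
 */
static int demo_can_fork(const struct demo_subsys *ss, size_t n, void **priv)
{
	size_t i, j;
	int ret = 0;

	assert(ss != NULL && priv != NULL);
	for (i = 0; i < n; i++) {
		if (ss[i].can_fork == NULL)
			continue;
		ret = ss[i].can_fork(&priv[i]);
		if (ret != 0)
			break;
	}
	if (ret == 0)
		return 0;

	/* Subsystem i refused: cancel subsystems 0..i-1. */
	for (j = 0; j < i; j++)
		if (ss[j].cancel_fork != NULL)
			ss[j].cancel_fork(priv[j]);
	return ret;
}

/* The fork went ahead: pass each subsystem its private data. */
static void demo_post_fork(const struct demo_subsys *ss, size_t n, void **priv)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ss[i].fork != NULL)
			ss[i].fork(priv[i]);
}

/* Example subsystem 1: accounts every fork it is asked about. */
static int demo_charged;	/* forks currently accounted */
static int demo_forked;		/* forks that completed */

static int counter_can_fork(void **priv_p)
{
	demo_charged++;
	*priv_p = &demo_charged;
	return 0;
}

static void counter_cancel_fork(void *priv)
{
	(*(int *)priv)--;	/* give back the charge taken in can_fork */
}

static void counter_fork(void *priv)
{
	(void)priv;
	demo_forked++;
}

/* Example subsystem 2: rejects every fork. */
static int deny_can_fork(void **priv_p)
{
	(void)priv_p;
	return -EAGAIN;
}
```

When the counting subsystem is followed by the rejecting one, demo_can_fork()
returns -EAGAIN and the count is back at zero afterwards, which is the
property a cancel callback exists to guarantee.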
@@ -434,6 +441,9 @@ struct cgroup_subsys {
	int id;
	const char *name;

	/* optional, initialized automatically during boot if not set */
	const char *legacy_name;

	/* link to parent, protected by cgroup_lock() */
	struct cgroup_root *root;

@@ -491,6 +501,7 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)

#else	/* CONFIG_CGROUPS */

#define CGROUP_CANFORK_COUNT 0
#define CGROUP_SUBSYS_COUNT 0

static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {}