Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit b8ae30ee authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge branch 'sched-core-for-linus' of...

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
  stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
  sched, wait: Use wrapper functions
  sched: Remove a stale comment
  ondemand: Make the iowait-is-busy time a sysfs tunable
  ondemand: Solve a big performance issue by counting IOWAIT time as busy
  sched: Intoduce get_cpu_iowait_time_us()
  sched: Eliminate the ts->idle_lastupdate field
  sched: Fold updating of the last_update_time_info into update_ts_time_stats()
  sched: Update the idle statistics in get_cpu_idle_time_us()
  sched: Introduce a function to update the idle statistics
  sched: Add a comment to get_cpu_idle_time_us()
  cpu_stop: add dummy implementation for UP
  sched: Remove rq argument to the tracepoints
  rcu: need barrier() in UP synchronize_sched_expedited()
  sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
  sched: kill paranoia check in synchronize_sched_expedited()
  sched: replace migration_thread with cpu_stop
  stop_machine: reimplement using cpu_stop
  cpu_stop: implement stop_cpu[s]()
  sched: Fix select_idle_sibling() logic in select_task_rq_fair()
  ...
parents 4d7b4ac2 9c6f7e43
Loading
Loading
Loading
Loading
+0 −10
Original line number Diff line number Diff line
@@ -182,16 +182,6 @@ Similarly, sched_expedited RCU provides the following:
	sched_expedited-torture: Reader Pipe:  12660320201 95875 0 0 0 0 0 0 0 0 0
	sched_expedited-torture: Reader Batch:  12660424885 0 0 0 0 0 0 0 0 0 0
	sched_expedited-torture: Free-Block Circulation:  1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0
	state: -1 / 0:0 3:0 4:0

As before, the first four lines are similar to those for RCU.
The last line shows the task-migration state.  The first number is
-1 if synchronize_sched_expedited() is idle, -2 if in the process of
posting wakeups to the migration kthreads, and N when waiting on CPU N.
Each of the colon-separated fields following the "/" is a CPU:state pair.
Valid states are "0" for idle, "1" for waiting for quiescent state,
"2" for passed through quiescent state, and "3" when a race with a
CPU-hotplug event forces use of the synchronize_sched() primitive.


USAGE
+3 −51
Original line number Diff line number Diff line
@@ -211,7 +211,7 @@ provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.

CONFIG_GROUP_SCHED strives to achieve exactly that.  It lets tasks to be
CONFIG_CGROUP_SCHED strives to achieve exactly that.  It lets tasks to be
grouped and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED permits to group real-time (i.e., SCHED_FIFO and
@@ -220,38 +220,11 @@ SCHED_RR) tasks.
CONFIG_FAIR_GROUP_SCHED permits to group CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.

At present, there are two (mutually exclusive) mechanisms to group tasks for
CPU bandwidth control purposes:

 - Based on user id (CONFIG_USER_SCHED)

   With this option, tasks are grouped according to their user id.

 - Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)

   This options needs CONFIG_CGROUPS to be defined, and lets the administrator
   These options need CONFIG_CGROUPS to be defined, and let the administrator
   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
   Documentation/cgroups/cgroups.txt for more information about this filesystem.

Only one of these options to group tasks can be chosen and not both.

When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new
user and a "cpu_share" file is added in that directory.

	# cd /sys/kernel/uids
	# cat 512/cpu_share		# Display user 512's CPU share
	1024
	# echo 2048 > 512/cpu_share	# Modify user 512's CPU share
	# cat 512/cpu_share		# Display user 512's CPU share
	2048
	#

CPU bandwidth between two users is divided in the ratio of their CPU shares.
For example: if you would like user "root" to get twice the bandwidth of user
"guest," then set the cpu_share for both the users such that "root"'s cpu_share
is twice "guest"'s cpu_share.

When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem.  See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.

@@ -273,24 +246,3 @@ task groups and modify their CPU share using the "cgroups" pseudo filesystem.

	# #Launch gmplayer (or your favourite movie player)
	# echo <movie_player_pid> > multimedia/tasks

8. Implementation note: user namespaces

User namespaces are intended to be hierarchical.  But they are currently
only partially implemented.  Each of those has ramifications for CFS.

First, since user namespaces are hierarchical, the /sys/kernel/uids
presentation is inadequate.  Eventually we will likely want to use sysfs
tagging to provide private views of /sys/kernel/uids within each user
namespace.

Second, the hierarchical nature is intended to support completely
unprivileged use of user namespaces.  So if using user groups, then
we want the users in a user namespace to be children of the user
who created it.

That is currently unimplemented.  So instead, every user in a new
user namespace will receive 1024 shares just like any user in the
initial user namespace.  Note that at the moment creation of a new
user namespace requires each of CAP_SYS_ADMIN, CAP_SETUID, and
CAP_SETGID.
+4 −16
Original line number Diff line number Diff line
@@ -126,23 +126,12 @@ priority!
2.3 Basis for grouping tasks
----------------------------

There are two compile-time settings for allocating CPU bandwidth. These are
configured using the "Basis for grouping tasks" multiple choice menu under
General setup > Group CPU Scheduler:

a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" =  "user id")

This lets you use the virtual files under
"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for
each user .

The other option is:

.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.

This uses the /cgroup virtual file system and
"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
control group instead.
control group.

For more information on working with control groups, you should read
Documentation/cgroups/cgroups.txt as well.
@@ -161,8 +150,7 @@ For now, this can be simplified to just the following (but see Future plans):
===============

There is work in progress to make the scheduling period for each group
("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
("/cgroup/<cgroup>/cpu.rt_period_us") configurable as well.

The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically its not very useful _yet_
+0 −1
Original line number Diff line number Diff line
@@ -391,7 +391,6 @@ static void __init time_init_wq(void)
	if (time_sync_wq)
		return;
	time_sync_wq = create_singlethread_workqueue("timesync");
	stop_machine_create();
}

/*
+73 −2
Original line number Diff line number Diff line
@@ -73,6 +73,7 @@ enum {DBS_NORMAL_SAMPLE, DBS_SUB_SAMPLE};

struct cpu_dbs_info_s {
	cputime64_t prev_cpu_idle;
	cputime64_t prev_cpu_iowait;
	cputime64_t prev_cpu_wall;
	cputime64_t prev_cpu_nice;
	struct cpufreq_policy *cur_policy;
@@ -108,6 +109,7 @@ static struct dbs_tuners {
	unsigned int down_differential;
	unsigned int ignore_nice;
	unsigned int powersave_bias;
	unsigned int io_is_busy;
} dbs_tuners_ins = {
	.up_threshold = DEF_FREQUENCY_UP_THRESHOLD,
	.down_differential = DEF_FREQUENCY_DOWN_DIFFERENTIAL,
@@ -148,6 +150,16 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu, cputime64_t *wall)
	return idle_time;
}

static inline cputime64_t get_cpu_iowait_time(unsigned int cpu, cputime64_t *wall)
{
	u64 iowait_time = get_cpu_iowait_time_us(cpu, wall);

	if (iowait_time == -1ULL)
		return 0;

	return iowait_time;
}

/*
 * Find right freq to be set now with powersave_bias on.
 * Returns the freq_hi to be used right now and will set freq_hi_jiffies,
@@ -249,6 +261,7 @@ static ssize_t show_##file_name \
	return sprintf(buf, "%u\n", dbs_tuners_ins.object);		\
}
show_one(sampling_rate, sampling_rate);
show_one(io_is_busy, io_is_busy);
show_one(up_threshold, up_threshold);
show_one(ignore_nice_load, ignore_nice);
show_one(powersave_bias, powersave_bias);
@@ -299,6 +312,23 @@ static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b,
	return count;
}

static ssize_t store_io_is_busy(struct kobject *a, struct attribute *b,
				   const char *buf, size_t count)
{
	unsigned int input;
	int ret;

	ret = sscanf(buf, "%u", &input);
	if (ret != 1)
		return -EINVAL;

	mutex_lock(&dbs_mutex);
	dbs_tuners_ins.io_is_busy = !!input;
	mutex_unlock(&dbs_mutex);

	return count;
}

static ssize_t store_up_threshold(struct kobject *a, struct attribute *b,
				  const char *buf, size_t count)
{
@@ -381,6 +411,7 @@ static struct global_attr _name = \
__ATTR(_name, 0644, show_##_name, store_##_name)

define_one_rw(sampling_rate);
define_one_rw(io_is_busy);
define_one_rw(up_threshold);
define_one_rw(ignore_nice_load);
define_one_rw(powersave_bias);
@@ -392,6 +423,7 @@ static struct attribute *dbs_attributes[] = {
	&up_threshold.attr,
	&ignore_nice_load.attr,
	&powersave_bias.attr,
	&io_is_busy.attr,
	NULL
};

@@ -470,14 +502,15 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)

	for_each_cpu(j, policy->cpus) {
		struct cpu_dbs_info_s *j_dbs_info;
		cputime64_t cur_wall_time, cur_idle_time;
		unsigned int idle_time, wall_time;
		cputime64_t cur_wall_time, cur_idle_time, cur_iowait_time;
		unsigned int idle_time, wall_time, iowait_time;
		unsigned int load, load_freq;
		int freq_avg;

		j_dbs_info = &per_cpu(od_cpu_dbs_info, j);

		cur_idle_time = get_cpu_idle_time(j, &cur_wall_time);
		cur_iowait_time = get_cpu_iowait_time(j, &cur_wall_time);

		wall_time = (unsigned int) cputime64_sub(cur_wall_time,
				j_dbs_info->prev_cpu_wall);
@@ -487,6 +520,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
				j_dbs_info->prev_cpu_idle);
		j_dbs_info->prev_cpu_idle = cur_idle_time;

		iowait_time = (unsigned int) cputime64_sub(cur_iowait_time,
				j_dbs_info->prev_cpu_iowait);
		j_dbs_info->prev_cpu_iowait = cur_iowait_time;

		if (dbs_tuners_ins.ignore_nice) {
			cputime64_t cur_nice;
			unsigned long cur_nice_jiffies;
@@ -504,6 +541,16 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
			idle_time += jiffies_to_usecs(cur_nice_jiffies);
		}

		/*
		 * For the purpose of ondemand, waiting for disk IO is an
		 * indication that you're performance critical, and not that
		 * the system is actually idle. So subtract the iowait time
		 * from the cpu idle time.
		 */

		if (dbs_tuners_ins.io_is_busy && idle_time >= iowait_time)
			idle_time -= iowait_time;

		if (unlikely(!wall_time || wall_time < idle_time))
			continue;

@@ -617,6 +664,29 @@ static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info)
	cancel_delayed_work_sync(&dbs_info->work);
}

/*
 * Not all CPUs want IO time to be accounted as busy; this dependson how
 * efficient idling at a higher frequency/voltage is.
 * Pavel Machek says this is not so for various generations of AMD and old
 * Intel systems.
 * Mike Chan (androidlcom) calis this is also not true for ARM.
 * Because of this, whitelist specific known (series) of CPUs by default, and
 * leave all others up to the user.
 */
static int should_io_be_busy(void)
{
#if defined(CONFIG_X86)
	/*
	 * For Intel, Core 2 (model 15) andl later have an efficient idle.
	 */
	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	    boot_cpu_data.x86 == 6 &&
	    boot_cpu_data.x86_model >= 15)
		return 1;
#endif
	return 0;
}

static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
				   unsigned int event)
{
@@ -679,6 +749,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
			dbs_tuners_ins.sampling_rate =
				max(min_sampling_rate,
				    latency * LATENCY_MULTIPLIER);
			dbs_tuners_ins.io_is_busy = should_io_be_busy();
		}
		mutex_unlock(&dbs_mutex);

Loading