
Commit 0fe2d4b0 authored by Srivatsa Vaddagiri, committed by Matt Wagantall

sched: Improve HMP scheduler documentation



Various miscellaneous improvements to HMP scheduler documentation.

Change-Id: I3550ff1ffc08139fef62124a1a9d627320326319
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
parent 230bfa57
+274 −114
@@ -18,7 +18,12 @@ CONTENTS
4. CPU Power
5. HMP Scheduler
   5.1 Classification of Tasks and CPUs
   5.2 select_best_cpu()
   5.2.1 sched_boost
   5.2.2 task_will_fit()
   5.2.3 Tunables affecting select_best_cpu()
   5.2.4 Wakeup Logic of a Non-Small Task
   5.2.5 Wakeup Logic of a Small Task
   5.3 Scheduler Tick
   5.4 Load Balancer
   5.5 Real Time Tasks
@@ -123,7 +128,7 @@ since v3.7, has some perceived shortcomings when used to place tasks on HMP
systems or provide recommendations on CPU frequency.

Per-entity load tracking does not make a distinction between the ramp up
vs ramp down time of task load. It also decays task load without exception when
a task sleeps. As an example, a cpu-bound task at its peak load (LOAD_AVG_MAX or
47742) can see its load decay to 0 after a sleep of just 213ms! A cpu-bound task
running on a power-efficient cpu could thus get re-classified as not
@@ -531,7 +536,7 @@ both tasks and CPUs to aid in the placement of tasks.
  which is not idle, but lightly loaded.

  The small task threshold is set by the value
  /proc/sys/kernel/sched_small_task. This value is a percentage. If the
  task consumes this much or less of the minimum CPU in the system, the
  task is considered "small."
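  That classification can be sketched as follows; the threshold value, the
  helper name and the percent scale are illustrative assumptions, not the
  actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_SMALL_TASK 30	/* assumed /proc/sys/kernel/sched_small_task (percent) */

/* 'demand_pct' is the task's demand as a percentage of the capacity of the
 * minimum-capacity cpu in the system. */
static bool is_small_task(int demand_pct)
{
	return demand_pct <= SCHED_SMALL_TASK;
}
```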

@@ -592,24 +597,24 @@ both tasks and CPUs to aid in the placement of tasks.

- spill threshold

  Tasks will normally be placed on the lowest power-cost cluster where they
  can fit. This could result in the power-efficient cluster becoming
  overcrowded when there are too many low-demand tasks. The spill threshold
  provides a spill-over criterion, wherein low-demand tasks are allowed to be
  placed on idle or mostly-idle cpus in the high-performance cluster. Note that
  the spill-over criterion applies only to cpu clusters with lower capacity and
  does not apply to the cpu cluster with the highest capacity.

  The scheduler will avoid placing a task on a cpu in the power-efficient
  cluster if doing so could result in the cpu exceeding its spill threshold,
  which is defined by two tunables:

  /proc/sys/kernel/sched_spill_nr_run (default: 10)
  /proc/sys/kernel/sched_spill_load   (default: 100%)

  The spill threshold is only considered when deciding whether a task, which
  can fit on a power-efficient cpu, should spill over to a high-performance cpu
  because the power-efficient cpus exceed their spill threshold. A cpu is
  considered to be above its spill level if it already has sched_spill_nr_run
  runnable tasks (10 by default) or if the sum of the task's load (scaled in
  reference to the given cpu) and rq->cumulative_runnable_avg exceeds
  'sched_spill_load'.
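  The spill check described above can be condensed into a sketch; the helper
  name and the percentage-based load scale are assumptions for illustration,
  not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Defaults of the two tunables above. */
#define SCHED_SPILL_NR_RUN 10	/* sched_spill_nr_run */
#define SCHED_SPILL_LOAD   100	/* sched_spill_load (percent) */

/* Would placing a task with load 'task_load' (scaled in reference to this
 * cpu, as a percent) on a cpu with 'nr_running' runnable tasks and aggregate
 * runnable load 'cpu_load' push the cpu past its spill threshold? */
static bool over_spill_threshold(int nr_running, int cpu_load, int task_load)
{
	if (nr_running + 1 > SCHED_SPILL_NR_RUN)
		return true;
	return cpu_load + task_load > SCHED_SPILL_LOAD;
}
```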

- power band

@@ -628,82 +633,239 @@ both tasks and CPUs to aid in the placement of tasks.
  be in a different "band" and it is selected, despite perhaps having
  a higher current task load.

*** 5.2 select_best_cpu()

CPU placement decisions for a task at its wakeup or creation time are the
most important decisions made by the HMP scheduler. This section will describe
the call flow and algorithm used in detail.

The primary entry point for a task wakeup operation is try_to_wake_up(),
located in kernel/sched/core.c. This function relies on select_task_rq() to
determine the target CPU for the waking task. For fair-class (SCHED_OTHER)
tasks, that request will be routed to select_task_rq_fair() in
kernel/sched/fair.c. As part of these scheduler extensions a hook has been
inserted into the top of that function. If HMP scheduling is enabled the normal
scheduling behavior will be replaced by a call to select_best_cpu(). This
function, select_best_cpu(), represents the heart of the HMP scheduling
algorithm described in this document. Note that select_best_cpu() is also
invoked for a task being created.

The behavior of select_best_cpu() depends on several factors, such as the boost
setting, the values of several tunables, and the task's demand.

**** 5.2.1 Boost

Normally the high-performance cpu cluster is reserved for use by high-demand
tasks, i.e. tasks whose demand on the power-efficient cpu cluster exceeds
'sched_upmigrate'. This implies some amount of latency before low-demand tasks
are migrated to the high-performance cpu cluster when they experience a surge
in demand: such tasks will continue running on the power-efficient cpu cluster
until they have exhibited sufficient demand to be up-migrated. This latency
could hurt application performance in some cases. To avoid it, the scheduler
supports a boost API which removes this bar on use of the high-performance cpu
cluster. When boost is turned on, all tasks are considered eligible to make use
of high-performance cpus, irrespective of their demand.

Boost can be set either via /proc/sys/kernel/sched_boost or by invoking the
kernel API sched_set_boost().

	int sched_set_boost(int enable);

Once turned on, boost will remain in effect until it is explicitly turned off.
To allow boost to be controlled by multiple external entities (applications or
kernel modules) at the same time, the boost setting is reference counted. This
means that if two applications turn on boost, its effect is eliminated only
after both applications have turned it off. The boost_refcount variable
represents this reference count.
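A minimal user-space model of the reference-counted boost semantics described
above (an illustrative sketch only, not the kernel implementation; locking and
error handling are omitted):

```c
#include <assert.h>

static int boost_refcount;	/* number of outstanding boost requests */

/* Model of sched_set_boost(): each enable takes a reference, each disable
 * drops one; boost stays in effect while any reference is held. */
static int sched_set_boost(int enable)
{
	if (enable)
		boost_refcount++;
	else if (boost_refcount > 0)
		boost_refcount--;
	return 0;
}

static int boost_active(void)
{
	return boost_refcount > 0;
}
```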

**** 5.2.2 task_will_fit()

The overall goal of select_best_cpu() is to place a task on the least-power
cluster where it can "fit", i.e. where its cpu usage will be below the capacity
offered by that cluster. The criteria for a task to be considered as fitting in
a cluster are:

  i) When boost is active, all tasks, irrespective of their demand or priority,
     are considered to fit only on highest-capacity cluster.

 ii) A low-priority task, whose nice value is greater than
     sysctl_sched_upmigrate_min_nice or whose cgroup has its
     upmigrate_discourage flag set, is considered to fit in all clusters,
     irrespective of their capacity and of the task's cpu demand.

iii) All tasks are considered to fit in highest capacity cluster.

 iv) The task's demand scaled in reference to the given cluster should be less
     than a threshold. See the section on load_scale_factor for how task demand
     is scaled in reference to a given cpu (cluster). The threshold used is
     normally sched_upmigrate. It is possible for a task's demand to exceed the
     sched_upmigrate threshold in reference to a cluster after it has been
     up-migrated to a higher-capacity cluster. To prevent it from immediately
     coming back to the lower-capacity cluster, the task is not considered to
     "fit" on its earlier cluster until its demand has dropped below
     sched_downmigrate in reference to that earlier cluster. sched_downmigrate
     thus provides some hysteresis control.
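Condensing criteria (i)-(iv), a sketch of the fit test might look like the
following; the threshold values, the percent load scale and the
'came_from_here' flag are assumptions for illustration, not the actual
task_will_fit() code:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_UPMIGRATE   80	/* assumed sched_upmigrate (percent) */
#define SCHED_DOWNMIGRATE 60	/* assumed sched_downmigrate (percent) */

/* 'scaled_demand' is the task's demand scaled in reference to the candidate
 * cluster; 'came_from_here' indicates the task was up-migrated away from this
 * cluster, which triggers the sched_downmigrate hysteresis in (iv). */
static bool task_will_fit(bool boost, bool low_prio, bool highest_capacity,
			  int scaled_demand, bool came_from_here)
{
	if (boost)
		return highest_capacity;	/* (i) only biggest cluster fits */
	if (low_prio || highest_capacity)
		return true;			/* (ii), (iii) */
	if (came_from_here)
		return scaled_demand < SCHED_DOWNMIGRATE;	/* (iv) hysteresis */
	return scaled_demand < SCHED_UPMIGRATE;
}
```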


**** 5.2.3 Factors affecting select_best_cpu()

The behavior of select_best_cpu() is further controlled by several tunables and
by the synchronous nature of the wakeup.

a. /proc/sys/kernel/sched_small_task
	This controls the classification of tasks as small or not. Any task
	whose demand is less than this threshold will be classified as small.
	The scheduler avoids placing small tasks on idle cpus and instead
	prefers to place them on the least busy cpu in the lowest power-cost
	cluster.

b. /sys/devices/system/cpu/cpuX/sched_mostly_idle_[nr_run, load]
	This controls classification of cpus as mostly idle or not. Any cpu
	whose rq->nr_running and rq->cumulative_runnable_avg are below these
	thresholds is classified as mostly_idle. Additionally, to account for
	idle cpus with a high amount of irq-processing load, a cpu is
	considered mostly idle only when its irq_load is also less than
	'sched_cpu_high_irqload'. See the section on 'sched_cpu_high_irqload'
	for more details.

c. /sys/devices/system/cpu/cpuX/sched_mostly_idle_freq
	This controls packing behavior within a cluster. Tasks will be packed
	on a single cpu in a cluster, provided sched_mostly_idle_freq is
	non-zero for the cluster and the cluster's current frequency is less
	than sched_mostly_idle_freq. See the section on "Task packing" for more
	details about sched_mostly_idle_freq.

d. /sys/devices/system/cpu/cpuX/sched_prefer_idle
	sched_prefer_idle = 1 is a directive to the scheduler to place a
	non-small task on an idle cpu in the cluster, while sched_prefer_idle =
	0 causes the scheduler to place non-small tasks on the least busy
	mostly-idle cpu in the cluster where they can fit. sched_prefer_idle =
	0 thus enables packing behavior (where more than one task can be packed
	on the same cpu).

	sched_prefer_idle can be set differently for each cpu, although it is
	expected that all cpus in a cluster will have the same value. The
	per-cpu interface allows one to differentiate packing behavior between
	clusters: sched_prefer_idle can be set to 1 in the most power-efficient
	cluster (to disable packing of non-small tasks) while it is set to 0 in
	the highest-performance cluster (to enable packing of non-small tasks).

e. /proc/sys/kernel/sched_cpu_high_irqload
	A cpu whose irq load is greater than this threshold will not be
	considered idle or mostly idle. This threshold value is expressed in
	nanoseconds, with the default being 10000000 (10ms). See the notes on
	the sched_cpu_high_irqload tunable to understand how the irq load on a
	cpu is measured.

f. Synchronous nature of wakeup
	A synchronous wakeup is a hint to the scheduler that the task issuing
	the wakeup (i.e. the task currently running on the cpu where the wakeup
	is being processed) will "soon" relinquish the cpu. A simple example is
	two tasks communicating with each other over a pipe. When the reader
	task blocks waiting for data, it is woken by the writer task once data
	has been written to the pipe. The writer task in turn usually blocks
	waiting for the reader task to consume the data in the pipe (which may
	not have any more room for writes).

	Synchronous wakeup is accounted for by adjusting the load of a cpu to
	not include the load of the currently running task. As a result, a cpu
	that has only one runnable task and is currently processing a
	synchronous wakeup will be considered idle.

g. PF_WAKE_UP_IDLE
	Any task with this flag set will be woken up on an idle cpu (if one is
	available), independent of the sched_prefer_idle setting, its demand
	and the synchronous nature of the wakeup. Similarly, an idle cpu is
	preferred during wakeup for any task that does not have this flag set
	but is being woken by a task with PF_WAKE_UP_IDLE set. For simplicity,
	we will use the term "PF_WAKE_UP_IDLE wakeup" to signify wakeups
	involving a task with PF_WAKE_UP_IDLE set.
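Two of the checks above, the mostly-idle classification and the
synchronous-wakeup load adjustment, can be sketched as follows; the threshold
values and function names are illustrative assumptions, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define MOSTLY_IDLE_NR_RUN 3		/* assumed sched_mostly_idle_nr_run */
#define MOSTLY_IDLE_LOAD   20		/* assumed sched_mostly_idle_load (percent) */
#define CPU_HIGH_IRQLOAD   10000000LL	/* sched_cpu_high_irqload: 10ms in ns */

/* A cpu is mostly idle when both occupancy measures are at or below their
 * thresholds and its irq load is below sched_cpu_high_irqload. */
static bool mostly_idle(int nr_running, int runnable_load, long long irq_load)
{
	return nr_running <= MOSTLY_IDLE_NR_RUN &&
	       runnable_load <= MOSTLY_IDLE_LOAD &&
	       irq_load < CPU_HIGH_IRQLOAD;
}

/* For a synchronous wakeup, the waker's load is discounted, so a cpu whose
 * only activity is the waker appears idle to the placement logic. */
static int effective_load(int cpu_load, int waker_load, bool sync_wakeup)
{
	return sync_wakeup ? cpu_load - waker_load : cpu_load;
}
```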

**** 5.2.4 Wakeup Logic of a Non-Small Task "p"

Following is the order of CPU preference for a non-small task when
sched_prefer_idle (for the lowest power-cost cluster where it can fit) is 1.
Note that this same order of preference is used even for small tasks when a
PF_WAKE_UP_IDLE wakeup is involved (i.e. the small task being woken has
PF_WAKE_UP_IDLE set or is being woken by a task with PF_WAKE_UP_IDLE set).

  1. Least power-cost CPU in the least power-cost cluster where the task will
     fit, provided it is not a "PF_WAKE_UP_IDLE wakeup", the
     sched_mostly_idle_freq setting for that cluster is non-zero and the
     cluster's current frequency is less than that sched_mostly_idle_freq
     setting.

  2. Idle cpu in least power-cost cluster where task will fit. Ties broken by
     cstate (cpu in least-shallow cstate preferred) first, then by power (cpu
     with lowest power preferred) and lastly by task's previous cpu association
     (i.e. from amongst two cpus that are both idle, in the same c-state and of
     the same power cost, where the task had previously run on one of them, the
     cpu where the task previously ran is chosen).

  3. Mostly idle cpu in least power-cost cluster where task will fit.  Ties
     broken by load first (least loaded cpu preferred), then by power and lastly
     by task's previous cpu association.

  4. Least loaded busy cpu where the task will fit and where adding the task
     will not result in spill-over. Note that the spill-over criterion does not
     apply to the cluster with maximum capacity. Ties (for cpus with the same
     minimum load) are broken by power cost first and then by the task's
     previous cpu association.

  5. Idle cpu in a higher power-cost cluster where task will fit. Ties broken
     by cstate first, then by power and lastly by task's previous cpu
     association.

     A higher power-cost cluster is considered in this case because the least
     power-cost cluster where the task will fit is close to its spill
     threshold.

  6. Mostly idle cpu in a higher power-cost cluster where task will fit.
     Ties broken by load first, then by power and lastly by task's previous cpu
     association.

     A higher power-cost cluster is considered in this case because the least
     power-cost cluster where the task will fit is close to its spill
     threshold.

  7. The first least loaded idle or mostly_idle cpu in a cluster where the task
     won't fit (if such a cluster is available). Ties broken by the task's
     previous cpu association.

  8. The CPU which the task last ran on.

When sched_prefer_idle is 0, the order of preference for a non-small task is
as above, with the following changes:

	#2 and #3 are swapped in order of preference, as are #5 and #6. This
	results in a mostly-idle cpu being preferred over an idle cpu and thus
	enables packing behavior.


**** 5.2.5 Wakeup Logic of a Small Task "p"

Small tasks are treated as non-small tasks when boost is in effect, and the
logic for selecting a candidate cpu for their placement is then the same as
that described earlier for non-small tasks.

When boost is not in effect, the order of CPU preference for a small task is the
following:

  1. Least power-cost CPU in the least power-cost cluster where the task will
     fit, provided it is not a "PF_WAKE_UP_IDLE wakeup", the
     sched_mostly_idle_freq setting for that cluster is non-zero and the
     cluster's current frequency is less than that sched_mostly_idle_freq
     setting.

  2. The lowest-power CPU, if it is not idle but is mostly idle and happens to
     be the cpu where the task previously ran.

  3. A non-idle CPU in the lowest power band which is mostly idle. The first
     such CPU found (or task's previous cpu) is selected.

  4. An idle CPU in the lowest power band that is in the least shallow C-state.
     Ties (for cpus in same shallowest C-state) broken by task's previous cpu
     association.

  5. The least busy CPU in the lowest power band where adding the task will not
     result in exceeding the spill threshold. Ties (for cpus with same minimum
     load) broken by task's previous cpu association.


  6. The most power-efficient CPU outside of the lowest power band. Ties broken
     by task's previous cpu association.

*** 5.3 Scheduler Tick

@@ -804,14 +966,14 @@ low latency to run immediately, when compared to being woken to an idle cpu in
a deep sleep state. In the latter case, the task has to wait for the cpu to exit
the sleep state, which in some cases takes long enough to hurt performance.

Packing thus is a delicate matter to play with!

The following parameters control packing behavior.

- sched_small_task
	This parameter specifies demand threshold below which a task will be
classified as "small". As described in Sec 5.2 ("select_best_cpu()"), for
small-task wakeups, a busy cpu is preferred as the target rather than an idle
cpu.

- mostly_idle_load and mostly_idle_nr_run

@@ -828,10 +990,11 @@ pack all tasks on a single cpu in cluster. The cpu chosen is the first most
power-efficient cpu found while scanning cluster's online cpus.

- PF_WAKE_UP_IDLE

An idle cpu is preferred for any waking task that has this flag set in its
'task_struct.flags' field. Further, an idle cpu is preferred for any task woken
by such tasks. The PF_WAKE_UP_IDLE flag of a task is inherited by its children.
It can be modified for a task in two ways:

	> kernel-space interface
		set_wake_up_idle() needs to be called in the context of a task
@@ -841,17 +1004,13 @@ It can be modified for a task in two ways:
		/proc/[pid]/sched_wake_up_idle file needs to be written to for
		setting or clearing PF_WAKE_UP_IDLE flag for a given task

- sched_prefer_idle

This parameter enables packing behavior for non-small tasks. When set to 0,
non-small tasks are placed on mostly_idle cpus rather than idle cpus.
sched_prefer_idle can be changed independently for each cpu cluster, so it is
possible to enable packing of non-small tasks in one cluster and disable it in
another.

=====================
6. FREQUENCY GUIDANCE
@@ -877,7 +1036,7 @@ get_cpu_iowait_time_us() APIs.
    This API is invoked by governor at initialization time or whenever
    window size is changed. 'window_size' argument (in jiffy units)
    indicates the size of window to be used. The first window of size
    'window_size' is set to begin at jiffy 'window_start'

    -EINVAL is returned if per-entity load tracking is in use rather
    than window-based load tracking, otherwise a success value of 0
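A toy user-space model of that API contract (the real function lives in the
kernel; this sketch only mirrors the documented return-value behavior, and the
'window_based_tracking' flag is an assumption):

```c
#include <assert.h>
#include <errno.h>

/* Assumed stand-in for the kernel's load-tracking mode selection. */
static int window_based_tracking = 1;

/* Mirrors the documented contract: -EINVAL when window-based load tracking
 * is not in use, 0 on success. */
static int sched_set_window(unsigned long window_start,
			    unsigned int window_size)
{
	if (!window_based_tracking)
		return -EINVAL;
	(void)window_start;	/* first window begins at this jiffy */
	(void)window_size;	/* window size in jiffies */
	return 0;
}
```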
@@ -1066,7 +1225,7 @@ This tunable is a percentage. It exists to control hysteresis. Let's say a task
migrated to a high-performance cpu when it crossed 80% demand on a
power-efficient cpu. We don't let it come back to a power-efficient cpu until
its demand *in reference to the power-efficient cpu* drops less than 60%
(sched_downmigrate).

*** 7.7 sched_small_task

@@ -1142,7 +1301,7 @@ Possible values for this tunable are:
1: Use the maximum value of first M samples found in task's cpu demand
   history (sum_history[] array), where M = sysctl_sched_ravg_hist_size
2: Use the maximum of (the most recent window sample, average of first M
   samples), where M = sysctl_sched_ravg_hist_size
3: Use average of first M samples, where M = sysctl_sched_ravg_hist_size
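An illustrative model of these policy choices (HIST_SIZE stands in for
sysctl_sched_ravg_hist_size; the function name and layout are assumptions, not
the kernel code):

```c
#include <assert.h>

#define HIST_SIZE 5	/* assumed sysctl_sched_ravg_hist_size */

/* 'hist' holds the first HIST_SIZE (M) samples of the task's demand history;
 * 'recent' is the most recent window sample. */
static int task_demand(int policy, const int hist[HIST_SIZE], int recent)
{
	long sum = 0;
	int i, max = 0, avg;

	for (i = 0; i < HIST_SIZE; i++) {
		if (hist[i] > max)
			max = hist[i];
		sum += hist[i];
	}
	avg = (int)(sum / HIST_SIZE);

	switch (policy) {
	case 1:			/* max of first M samples */
		return max;
	case 2:			/* max(recent sample, average of first M) */
		return recent > avg ? recent : avg;
	case 3:			/* average of first M samples */
		return avg;
	default:		/* other policies not modeled here */
		return recent;
	}
}
```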

*** 7.13 sched_ravg_window
@@ -1270,6 +1429,7 @@ Default value: 1
Non-small tasks will prefer to wake up on idle CPUs if this tunable is set to 1.
If the tunable is set to 0, non-small tasks will prefer to wake up on mostly
idle CPUs which are not completely idle, increasing task packing behavior.
See section on "Task packing" for more details.

** 7.24 sched_min_runtime

@@ -1277,7 +1437,7 @@ Appears at: /proc/sys/kernel/sched_min_runtime

Default value: 0 (0 ms)

This tunable helps avoid frequent migration of task on account of
energy-awareness. During scheduler tick, a check is made (in migration_needed())
whether the running task needs to be migrated to a "better" cpu, which could
either offer better performance or power. When deciding to migrate task on
@@ -1351,7 +1511,7 @@ Logged when selecting the best CPU to run the task (select_best_cpu()).
- reason: reason we are picking a new CPU:
  0: no migration - selecting a CPU for a wakeup or new task wakeup
  1: move to big CPU (migration)
  2: move to little CPU (migration)
  3: move to power efficient CPU (migration)

*** 8.3 sched_cpu_load