Documentation/scheduler/sched-hmp.txt +82 −2

@@ -22,7 +22,7 @@ CONTENTS
 5.3 Scheduler Tick
 5.4 Load Balancer
 5.5 Real Time Tasks
-5.6 Stop-Class Tasks
+5.6 Task packing
 6. Frequency Guidance
 6.1 Per-CPU Window-Based Stats
 6.1 Per-task Window-Based Stats

@@ -571,15 +571,19 @@ both tasks and CPUs to aid in the placement of tasks.
 the scheduler is tracking the demand of each task it can make an
 educated guess as to whether a CPU will become idle in the near
 future.
-There are two tunable parameters which are used to determine whether
+There are three tunable parameters which are used to determine whether
 a CPU is mostly idle:

 /sys/devices/system/cpu/cpuX/sched_mostly_idle_nr_run
 /sys/devices/system/cpu/cpuX/sched_mostly_idle_load
+/sys/devices/system/cpu/cpuX/sched_mostly_idle_freq

 Note that these tunables are per-cpu. If a CPU does not have more than
 sched_mostly_idle_nr_run runnable tasks and is not more than
 sched_mostly_idle_load percent busy, it is considered mostly idle.
+Additionally, if a cpu's sched_mostly_idle_freq is non-zero and its
+current frequency is less than that threshold, the scheduler will
+attempt to pack tasks on the most power-efficient cpu in the cluster.

 - spill threshold

@@ -894,6 +898,71 @@ HMP scheduler brings in a change which avoids fast-path and always
 resorts to slow-path. Further cpu with lowest power-rating from
 candidate list of cpus is chosen as cpu for placing waking real-time
 task.

+*** 5.6 Task packing
+
+Task packing is letting one cpu take up more than one task in an
+attempt to improve power (and in some cases performance). The power
+benefit is derived by avoiding the cost of waking idle cpus from
+their deep sleep states.
+
+For example, consider a system with one cpu busy while the other cpus
+are idle and in deep sleep state. A small task in this situation
+needs to be placed on a suitable cpu. Placing the small task on the
+busy cpu will likely not hurt its performance (it is, after all, a
+low-demand task) while helping gain on power, because we avoid the
+cost associated with waking an idle cpu from deep sleep state.
+
+Task packing can have good or bad implications for power and
+performance.
+
+a. Power implications
+
+As described in the small-task wakeup example, task packing can be
+beneficial for power. However, an adverse impact on power can arise
+when packing on one cpu increases its busy time and hence results in
+a frequency increase.
+
+b. Performance implications
+
+The most obvious negative impact of packing on performance is the
+increased scheduling latency that tasks can incur. A positive impact
+on performance from packing has also been seen. This arises from the
+fact that a task woken to a busy cpu because of packing incurs very
+low latency and can run almost immediately, compared to being woken
+to an idle cpu in deep sleep state. In the latter case the task has
+to wait for the cpu to exit its sleep state, which in some cases is
+long enough to hurt performance.
+
+Packing is thus a delicate matter to play with!
+
+The following parameters control packing behavior.
+
+- sched_small_task
+
+This parameter specifies the demand threshold below which a task is
+classified as "small". As described in Sec 5.2 ("Task Wakeup and
+select_best_cpu()"), for small-task wakeups a busy cpu is preferred
+as the target rather than an idle cpu.
+
+- mostly_idle_load and mostly_idle_nr_run
+
+These are per-cpu parameters that define the mostly_idle thresholds
+for a cpu. A cpu whose load < mostly_idle_load AND whose nr_running
+< mostly_idle_nr_run is classified as mostly_idle. See the further
+description of the "mostly_idle" thresholds in Sec 5.
+
+- mostly_idle_freq
+
+This is a per-cpu parameter. If it is non-zero for a cpu that is part
+of a cluster, and the cluster's current frequency is less than this
+threshold, the scheduler will pack all tasks on a single cpu in the
+cluster. The cpu chosen is the first most power-efficient cpu found
+while scanning the cluster's online cpus.
+
+For some low band of frequency, spreading tasks over all available
+cpus can be grossly power-inefficient. As an example, consider two
+tasks that each need 500MHz. Packing them on one cpu could lead to
+1GHz. In the spread case we incur the cost of two cpus running at
+500MHz, while in the packed case we incur the cost of one cpu running
+at 1GHz. Depending on the silicon characteristics, where leakage
+power can be the dominant factor, the former can be worse for power
+than the latter. Running at a slow frequency (in the spread case) can
+actually be worse for leakage power (especially if 500MHz and 1GHz
+share the same voltage point). sched_mostly_idle_freq is set based on
+silicon characteristics and can provide a winning argument for both
+power and performance.
+
 =====================
 6. FREQUENCY GUIDANCE
 =====================

@@ -1271,6 +1340,17 @@ comparison. Scheduler will request a raise in cpu frequency when heavy
 tasks wakeup after at least one window of sleep, where window size is
 defined by sched_ravg_window. Value 0 will disable this feature.

+** 7.21 sched_mostly_idle_freq
+
+Appears at: /sys/devices/system/cpu/cpuX/sched_mostly_idle_freq
+
+Default value: 0
+
+This tunable is intended to achieve task-packing behavior based on
+cluster frequency. Hence it is strongly advised that all cpus in a
+cluster have the same value for mostly_idle_freq. For more details,
+see the section on "Task packing" (Sec 5.6).
+
 =========================
 8. HMP SCHEDULER TRACE POINTS
 =========================

drivers/base/cpu.c +41 −0

@@ -205,6 +205,42 @@ static ssize_t __ref store_sched_mostly_idle_load(struct device *dev,
 	return err;
 }

+static ssize_t show_sched_mostly_idle_freq(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, dev);
+	ssize_t rc;
+	int cpunum;
+	unsigned int mostly_idle_freq;
+
+	cpunum = cpu->dev.id;
+
+	mostly_idle_freq = sched_get_cpu_mostly_idle_freq(cpunum);
+
+	rc = snprintf(buf, PAGE_SIZE-2, "%u\n", mostly_idle_freq);
+
+	return rc;
+}
+
+static ssize_t __ref store_sched_mostly_idle_freq(struct device *dev,
+		struct device_attribute *attr, const char *buf,
+		size_t count)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, dev);
+	int cpuid = cpu->dev.id, err;
+	unsigned int mostly_idle_freq;
+
+	err = kstrtouint(strstrip((char *)buf), 0, &mostly_idle_freq);
+	if (err)
+		return err;
+
+	err = sched_set_cpu_mostly_idle_freq(cpuid, mostly_idle_freq);
+	if (err >= 0)
+		err = count;
+
+	return err;
+}
+
 static ssize_t show_sched_mostly_idle_nr_run(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {

@@ -241,6 +277,8 @@ static ssize_t __ref store_sched_mostly_idle_nr_run(struct device *dev,
 	return err;
 }

+static DEVICE_ATTR(sched_mostly_idle_freq, 0664,
+		show_sched_mostly_idle_freq, store_sched_mostly_idle_freq);
 static DEVICE_ATTR(sched_mostly_idle_load, 0664,
 		show_sched_mostly_idle_load, store_sched_mostly_idle_load);
 static DEVICE_ATTR(sched_mostly_idle_nr_run, 0664,

@@ -424,6 +462,9 @@ int __cpuinit register_cpu(struct cpu *cpu, int num)
 	if (!error)
 		error = device_create_file(&cpu->dev,
 				&dev_attr_sched_mostly_idle_nr_run);
+	if (!error)
+		error = device_create_file(&cpu->dev,
+				&dev_attr_sched_mostly_idle_freq);
 #endif

 	return error;

include/linux/sched.h +3 −0

@@ -1921,6 +1921,9 @@ extern int sched_set_cpu_mostly_idle_load(int cpu,
 					int mostly_idle_pct);
 extern int sched_get_cpu_mostly_idle_load(int cpu);
 extern int sched_set_cpu_mostly_idle_nr_run(int cpu, int nr_run);
 extern int sched_get_cpu_mostly_idle_nr_run(int cpu);
+extern int sched_set_cpu_mostly_idle_freq(int cpu,
+					  unsigned int mostly_idle_freq);
+extern unsigned int sched_get_cpu_mostly_idle_freq(int cpu);
 #else
 static inline int sched_set_boost(int enable)

kernel/sched/core.c +1 −0

@@ -8997,6 +8997,7 @@ void __init sched_init(void)
 		rq->hmp_flags = 0;
 		rq->mostly_idle_load = pct_to_real(20);
 		rq->mostly_idle_nr_run = 3;
+		rq->mostly_idle_freq = 0;
 #ifdef CONFIG_SCHED_FREQ_INPUT
 		rq->old_busy_time = 0;
 		rq->curr_runnable_sum = rq->prev_runnable_sum = 0;

kernel/sched/fair.c +62 −0

@@ -1389,6 +1389,25 @@ int sched_set_cpu_mostly_idle_load(int cpu, int mostly_idle_pct)
 	return 0;
 }

+int sched_set_cpu_mostly_idle_freq(int cpu, unsigned int mostly_idle_freq)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (mostly_idle_freq > rq->max_possible_freq)
+		return -EINVAL;
+
+	rq->mostly_idle_freq = mostly_idle_freq;
+
+	return 0;
+}
+
+unsigned int sched_get_cpu_mostly_idle_freq(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return rq->mostly_idle_freq;
+}
+
 int sched_get_cpu_mostly_idle_load(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);

@@ -1795,6 +1814,42 @@ static int skip_cpu(struct task_struct *p, int cpu, int reason)
 	return skip;
 }

+/*
+ * Select a single cpu in the cluster as the target for packing, iff the
+ * cluster frequency is less than a threshold level.
+ */
+static int select_packing_target(struct task_struct *p, int best_cpu)
+{
+	struct rq *rq = cpu_rq(best_cpu);
+	struct cpumask search_cpus;
+	int i;
+	int min_cost = INT_MAX;
+	int target = best_cpu;
+
+	if (rq->cur_freq >= rq->mostly_idle_freq)
+		return best_cpu;
+
+	/* Don't pack if current freq is low because of throttling */
+	if (rq->max_freq <= rq->mostly_idle_freq)
+		return best_cpu;
+
+	cpumask_and(&search_cpus, tsk_cpus_allowed(p), cpu_online_mask);
+	cpumask_and(&search_cpus, &search_cpus, &rq->freq_domain_cpumask);
+
+	/* Pick the first lowest-power cpu as target */
+	for_each_cpu(i, &search_cpus) {
+		int cost = power_cost(p, i);
+
+		if (cost < min_cost) {
+			target = i;
+			min_cost = cost;
+		}
+	}
+
+	return target;
+}
+
 /* return cheapest cpu that can fit this task */
 static int select_best_cpu(struct task_struct *p, int target, int reason)
 {

@@ -1906,6 +1961,9 @@ done:
 		best_cpu = fallback_idle_cpu;
 	}

+	if (cpu_rq(best_cpu)->mostly_idle_freq)
+		best_cpu = select_packing_target(p, best_cpu);
+
 	return best_cpu;
 }

@@ -7212,6 +7270,10 @@ static inline int _nohz_kick_needed_hmp(struct rq *rq, int cpu, int *type)
 	struct sched_domain *sd;
 	int i;

+	if (rq->mostly_idle_freq && rq->cur_freq < rq->mostly_idle_freq &&
+			rq->max_freq > rq->mostly_idle_freq)
+		return 0;
+
 	if (rq->nr_running >= 2 && (rq->nr_running - rq->nr_small_tasks >= 2 ||
 		rq->nr_running > rq->mostly_idle_nr_run ||
 		cpu_load(cpu) > rq->mostly_idle_load)) {