Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 39cf275a authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler changes from Ingo Molnar:
 "The main changes in this cycle are:

   - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
     van Riel, Peter Zijlstra et al.  Yay!

   - optimize preemption counter handling: merge the NEED_RESCHED flag
     into the preempt_count variable, by Peter Zijlstra.

   - wait.h fixes and code reorganization from Peter Zijlstra

   - cfs_bandwidth fixes from Ben Segall

   - SMP load-balancer cleanups from Peter Zijstra

   - idle balancer improvements from Jason Low

   - other fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
  ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
  stop_machine: Fix race between stop_two_cpus() and stop_cpus()
  sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
  sched: Fix asymmetric scheduling for POWER7
  sched: Move completion code from core.c to completion.c
  sched: Move wait code from core.c to wait.c
  sched: Move wait.c into kernel/sched/
  sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
  sched: Avoid throttle_cfs_rq() racing with period_timer stopping
  sched: Guarantee new group-entities always have weight
  sched: Fix hrtimer_cancel()/rq->lock deadlock
  sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
  sched: Fix race on toggling cfs_bandwidth_used
  sched: Remove extra put_online_cpus() inside sched_setaffinity()
  sched/rt: Fix task_tick_rt() comment
  sched/wait: Fix build breakage
  sched/wait: Introduce prepare_to_wait_event()
  sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
  sched: Remove get_online_cpus() usage
  sched: Fix race in migrate_swap_stop()
  ...
parents ad5d6989 e5137b50
Loading
Loading
Loading
Loading
+76 −0
Original line number Diff line number Diff line
@@ -355,6 +355,82 @@ utilize.

==============================================================

numa_balancing

Enables/disables automatic page fault based NUMA memory
balancing. Memory is moved automatically to nodes
that access it often.

Enables/disables automatic NUMA memory balancing. On NUMA machines, there
is a performance penalty if remote memory is accessed by a CPU. When this
feature is enabled the kernel samples what task thread is accessing memory
by periodically unmapping pages and later trapping a page fault. At the
time of the page fault, it is determined if the data being accessed should
be migrated to a local memory node.

The unmapping of pages and trapping faults incur additional overhead that
ideally is offset by improved memory locality but there is no universal
guarantee. If the target workload is already bound to NUMA nodes then this
feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
numa_balancing_migrate_deferred.

==============================================================

numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb

Automatic NUMA balancing scans tasks address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
memory node local to where the task is running.  Every "scan delay" the task
scans the next "scan size" number of pages in its address space. When the
end of the address space is reached the scanner restarts from the beginning.

In combination, the "scan delay" and "scan size" determine the scan rate.
When "scan delay" decreases, the scan rate increases.  The scan delay and
hence the scan rate of every task is adaptive and depends on historical
behaviour. If pages are properly placed then the scan delay increases,
otherwise the scan delay decreases.  The "scan size" is not adaptive but
the higher the "scan size", the higher the scan rate.

Higher scan rates incur higher system overhead as page faults must be
trapped and potentially data must be migrated. However, the higher the scan
rate, the more quickly a tasks memory is migrated to a local node if the
workload pattern changes and minimises performance impact due to remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.

numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
scan a tasks virtual memory. It effectively controls the maximum scanning
rate for each task.

numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
when it initially forks.

numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
scan a tasks virtual memory. It effectively controls the minimum scanning
rate for each task.

numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.

numa_balancing_settle_count is how many scan periods must complete before
the schedule balancer stops pushing the task towards a preferred node. This
gives the scheduler a chance to place the task on an alternative node if the
preferred node is overloaded.

numa_balancing_migrate_deferred is how many page migrations get skipped
unconditionally, after a page migration is skipped because a page is shared
with other tasks. This reduces page migration overhead, and determines
how much stronger the "move task near its memory" policy scheduler becomes,
versus the "move memory near its task" memory management policy, for workloads
with shared memory.

==============================================================

osrelease, ostype & version:

# cat osrelease
+5 −1
Original line number Diff line number Diff line
@@ -655,7 +655,11 @@ explains which is which.
		  read the irq flags variable, an 'X' will always
		  be printed here.

  need-resched: 'N' task need_resched is set, '.' otherwise.
  need-resched:
	'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
	'n' only TIF_NEED_RESCHED is set,
	'p' only PREEMPT_NEED_RESCHED is set,
	'.' otherwise.

  hardirq/softirq:
	'H' - hard irq occurred inside a softirq.
+2 −0
Original line number Diff line number Diff line
@@ -7326,6 +7326,8 @@ S: Maintained
F:	kernel/sched/
F:	include/linux/sched.h
F:	include/uapi/linux/sched.h
F:	kernel/wait.c
F:	include/linux/wait.h

SCORE ARCHITECTURE
M:	Chen Liqin <liqin.linux@gmail.com>
+1 −0
Original line number Diff line number Diff line
@@ -3,3 +3,4 @@ generic-y += clkdev.h

generic-y += exec.h
generic-y += trace_clock.h
generic-y += preempt.h
+1 −0
Original line number Diff line number Diff line
@@ -46,3 +46,4 @@ generic-y += ucontext.h
generic-y += user.h
generic-y += vga.h
generic-y += xor.h
generic-y += preempt.h
Loading