
Commit 0081a0ce authored by Linus Torvalds

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:
 "The main RCU related changes in this cycle were:

   - Removal of spin_unlock_wait()
   - SRCU updates
   - RCU torture-test updates
   - RCU Documentation updates
   - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
   - Miscellaneous RCU fixes
   - CPU-hotplug fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
  arch: Remove spin_unlock_wait() arch-specific definitions
  locking: Remove spin_unlock_wait() generic definitions
  drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
  ipc: Replace spin_unlock_wait() with lock/unlock pair
  exit: Replace spin_unlock_wait() with lock/unlock pair
  completion: Replace spin_unlock_wait() with lock/unlock pair
  doc: Set down RCU's scheduling-clock-interrupt needs
  doc: No longer allowed to use rcu_dereference on non-pointers
  doc: Add RCU files to docbook-generation files
  doc: Update memory-barriers.txt for read-to-write dependencies
  doc: Update RCU documentation
  membarrier: Provide expedited private command
  rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
  rcu: Add warning to rcu_idle_enter() for irqs enabled
  rcu: Make rcu_idle_enter() rely on callers disabling irqs
  rcu: Add assertions verifying blocked-tasks list
  rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
  rcu: Add TPS() protection for _rcu_barrier_trace strings
  rcu: Use idle versions of swait to make idle-hack clear
  swait: Add idle variants which don't contribute to load average
  ...
parents fea15437 94edf6f3
+130 −0
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
<li>	<a href="#Scheduler and RCU">Scheduler and RCU</a>.
<li>	<a href="#Tracing and RCU">Tracing and RCU</a>.
<li>	<a href="#Energy Efficiency">Energy Efficiency</a>.
<li>	<a href="#Scheduling-Clock Interrupts and RCU">
	Scheduling-Clock Interrupts and RCU</a>.
<li>	<a href="#Memory Efficiency">Memory Efficiency</a>.
<li>	<a href="#Performance, Scalability, Response Time, and Reliability">
	Performance, Scalability, Response Time, and Reliability</a>.
@@ -2532,6 +2534,134 @@ I learned of many of these requirements via angry phone calls:
Flaming me on the Linux-kernel mailing list was apparently not
sufficient to fully vent their ire at RCU's energy-efficiency bugs!

<h3><a name="Scheduling-Clock Interrupts and RCU">
Scheduling-Clock Interrupts and RCU</a></h3>

<p>
The kernel transitions between in-kernel non-idle execution, userspace
execution, and the idle loop.
Depending on kernel configuration, RCU handles these states differently:

<table border=3>
<tr><th><tt>HZ</tt> Kconfig</th>
	<th>In-Kernel</th>
		<th>Usermode</th>
			<th>Idle</th></tr>
<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
	<td>Can rely on scheduling-clock interrupt.</td>
		<td>Can rely on scheduling-clock interrupt and its
		    detection of interrupt from usermode.</td>
			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
	<td>Can rely on scheduling-clock interrupt.</td>
		<td>Can rely on scheduling-clock interrupt and its
		    detection of interrupt from usermode.</td>
			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
	<td>Can only sometimes rely on scheduling-clock interrupt.
	    In other cases, it is necessary to bound kernel execution
	    times and/or use IPIs.</td>
		<td>Can rely on RCU's dyntick-idle detection.</td>
			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
</table>

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
	scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
	and <tt>NO_HZ_IDLE</tt> do?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
	does not necessarily re-enable the scheduling-clock interrupt
	on entry to each and every system call.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>
However, RCU must be reliably informed as to whether any given
CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
also whether that CPU is executing in usermode, as discussed
<a href="#Energy Efficiency">earlier</a>.
It also requires that the scheduling-clock interrupt be enabled when
RCU needs it to be:

<ol>
<li>	If a CPU is either idle or executing in usermode, and RCU believes
	it is non-idle, the scheduling-clock tick had better be running.
	Otherwise, you will get RCU CPU stall warnings.  Or at best,
	very long (11-second) grace periods, with a pointless IPI waking
	the CPU from time to time.
<li>	If a CPU is in a portion of the kernel that executes RCU read-side
	critical sections, and RCU believes this CPU to be idle, you will get
	random memory corruption.  <b>DON'T DO THIS!!!</b>

	<br>This is one reason to test with lockdep, which will complain
	about this sort of thing.
<li>	If a CPU is in a portion of the kernel that is absolutely
	positively no-joking guaranteed to never execute any RCU read-side
	critical sections, and RCU believes this CPU to be idle,
	no problem.  This sort of thing is used by some architectures
	for light-weight exception handlers, which can then avoid the
	overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
	at exception entry and exit, respectively.
	Some go further and avoid the entireties of <tt>irq_enter()</tt>
	and <tt>irq_exit()</tt>.

	<br>Just make very sure you are running some of your tests with
	<tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
	was in fact joking about not doing RCU read-side critical sections.
<li>	If a CPU is executing in the kernel with the scheduling-clock
	interrupt disabled and RCU believes this CPU to be non-idle,
	and if the CPU goes idle (from an RCU perspective) every few
	jiffies, no problem.  It is usually OK for there to be the
	occasional gap between idle periods of up to a second or so.

	<br>If the gap grows too long, you get RCU CPU stall warnings.
<li>	If a CPU is either idle or executing in usermode, and RCU believes
	it to be idle, of course no problem.
<li>	If a CPU is executing in the kernel, the kernel code
	path is passing through quiescent states at a reasonable
	frequency (preferably about once per few jiffies, but the
	occasional excursion to a second or so is usually OK) and the
	scheduling-clock interrupt is enabled, of course no problem.

	<br>If the gap between a successive pair of quiescent states grows
	too long, you get RCU CPU stall warnings.
</ol>

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what if my driver has a hardware interrupt handler
	that can run for many seconds?
	I cannot invoke <tt>schedule()</tt> from a hardware
	interrupt handler, after all!
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
	every so often.
	But given that long-running interrupt handlers can cause
	other problems, not least for response time, shouldn't you
	work to keep your interrupt handler's runtime within reasonable
	bounds?
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
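
<p>
If a long-running handler really cannot be avoided, the mitigation
mentioned in the Quick Quiz answer can be sketched roughly as follows.
This is only an illustration: the handler name, the
<tt>foo_do_bounded_work()</tt> helper, and the device structure are
hypothetical rather than taken from any actual driver.

<blockquote>
<pre>
irqreturn_t foo_irq_handler(int irq, void *dev_id)
{
	struct foo_device *dev = dev_id;	/* hypothetical */

	while (foo_do_bounded_work(dev)) {	/* hypothetical */
		/*
		 * Periodically let RCU see this CPU as being outside
		 * of the interrupt, so that this long-running handler
		 * does not indefinitely extend grace periods.
		 */
		rcu_irq_exit();
		rcu_irq_enter();
	}
	return IRQ_HANDLED;
}
</pre>
</blockquote>

<p>
Even so, keeping the handler's runtime within reasonable bounds is
usually the better approach.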

<p>
But as long as RCU is properly informed of kernel state transitions between
in-kernel execution, usermode execution, and idle, and as long as the
scheduling-clock interrupt is enabled when RCU needs it to be, you
can rest assured that the bugs you encounter will be in some other
part of RCU or some other part of the kernel!

<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>

<p>
+85 −36
@@ -23,6 +23,14 @@ over a rather long period of time, but improvements are always welcome!
	Yet another exception is where the low real-time latency of RCU's
	read-side primitives is critically important.

	One final exception is where RCU readers are used to prevent
	the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
	for lockless updates.  This does result in the mildly
	counter-intuitive situation where rcu_read_lock() and
	rcu_read_unlock() are used to protect updates, however, this
	approach provides the same potential simplifications that garbage
	collectors do.
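
	For example, a lock-free stack pop that leans on RCU readers to
	avoid ABA might be sketched as follows.  This is purely
	illustrative: the "struct node" layout, the "top" variable, and
	the rule that popped nodes are freed only via kfree_rcu() (never
	re-pushed directly) are assumptions of the sketch.

		struct node {
			struct node *next;
			struct rcu_head rh;
		};
		struct node *top;	/* pushes use freshly allocated nodes */

		struct node *pop(void)
		{
			struct node *first, *next;

			rcu_read_lock();
			do {
				first = rcu_dereference(top);
				if (!first)
					break;
				next = READ_ONCE(first->next);
				/*
				 * Popped nodes are freed only after a grace
				 * period and never re-pushed, so "first"
				 * cannot reappear as top while this read-side
				 * critical section is in effect, and the
				 * cmpxchg() below therefore cannot be fooled
				 * by ABA.
				 */
			} while (cmpxchg(&top, first, next) != first);
			rcu_read_unlock();
			return first;	/* caller later does kfree_rcu(first, rh) */
		}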

1.	Does the update code have proper mutual exclusion?

	RCU does allow -readers- to run (almost) naked, but -writers- must
@@ -40,7 +48,9 @@ over a rather long period of time, but improvements are always welcome!
	explain how this single task does not become a major bottleneck on
	big multiprocessor machines (for example, if the task is updating
	information relating to itself that other tasks can read, there
	by definition can be no bottleneck).
	by definition can be no bottleneck).  Note that the definition
	of "large" has changed significantly:  Eight CPUs was "large"
	in the year 2000, but a hundred CPUs was unremarkable in 2017.

2.	Do the RCU read-side critical sections make proper use of
	rcu_read_lock() and friends?  These primitives are needed
@@ -55,6 +65,12 @@ over a rather long period of time, but improvements are always welcome!
	Disabling of preemption can serve as rcu_read_lock_sched(), but
	is less readable.

	Letting RCU-protected pointers "leak" out of an RCU read-side
	critical section is every bit as bad as letting them leak out
	from under a lock.  Unless, of course, you have arranged some
	other means of protection, such as a lock or a reference count
	-before- letting them out of the RCU read-side critical section.
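
	One common way to hand a pointer off safely is to acquire a
	reference count while still inside the critical section, for
	example (the "struct foo" object, its "refcnt" field, and the
	foo_put() helper are hypothetical):

		rcu_read_lock();
		p = rcu_dereference(gp);
		if (p && !atomic_inc_not_zero(&p->refcnt))
			p = NULL;		/* object already on its way out */
		rcu_read_unlock();
		if (p) {
			/* p may now be used outside the critical section... */
			do_something_with(p);	/* hypothetical */
			foo_put(p);		/* ...until the reference is dropped */
		}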

3.	Does the update code tolerate concurrent accesses?

	The whole point of RCU is to permit readers to run without
@@ -81,7 +97,7 @@ over a rather long period of time, but improvements are always welcome!
	c.	Make updates appear atomic to readers.	For example,
		pointer updates to properly aligned fields will
		appear atomic, as will individual atomic primitives.
		Sequences of perations performed under a lock will -not-
		Sequences of operations performed under a lock will -not-
		appear to be atomic to RCU readers, nor will sequences
		of multiple atomic primitives.

@@ -168,8 +184,8 @@ over a rather long period of time, but improvements are always welcome!

5.	If call_rcu(), or a related primitive such as call_rcu_bh(),
	call_rcu_sched(), or call_srcu() is used, the callback function
	must be written to be called from softirq context.  In particular,
	it cannot block.
	will be called from softirq context.  In particular, it cannot
	block.
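
	For example, a callback that simply frees the enclosing structure
	is fine, because kfree() does not block (the "struct foo" and its
	"rcu" field are hypothetical):

		static void foo_reclaim(struct rcu_head *rhp)
		{
			struct foo *fp = container_of(rhp, struct foo, rcu);

			kfree(fp);	/* non-blocking, so safe in softirq */
		}

		/* Updater, after making fp unreachable to new readers: */
		call_rcu(&fp->rcu, foo_reclaim);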

6.	Since synchronize_rcu() can block, it cannot be called from
	any sort of irq context.  The same rule applies for
@@ -178,11 +194,14 @@ over a rather long period of time, but improvements are always welcome!
	synchronize_sched_expedited(), and synchronize_srcu_expedited().

	The expedited forms of these primitives have the same semantics
	as the non-expedited forms, but expediting is both expensive
	and unfriendly to real-time workloads.	Use of the expedited
	primitives should be restricted to rare configuration-change
	operations that would not normally be undertaken while a real-time
	workload is running.
	as the non-expedited forms, but expediting is both expensive and
	(with the exception of synchronize_srcu_expedited()) unfriendly
	to real-time workloads.  Use of the expedited primitives should
	be restricted to rare configuration-change operations that would
	not normally be undertaken while a real-time workload is running.
	However, real-time workloads can use the rcupdate.rcu_normal kernel
	boot parameter to completely disable expedited grace periods,
	though this might have performance implications.
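
	For example, booting with:

		rcupdate.rcu_normal=1

	on the kernel command line causes the expedited primitives to
	fall back to ordinary (non-expedited) grace periods.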

	In particular, if you find yourself invoking one of the expedited
	primitives repeatedly in a loop, please do everyone a favor:
@@ -193,11 +212,6 @@ over a rather long period of time, but improvements are always welcome!
	of the system, especially to real-time workloads running on
	the rest of the system.

	In addition, it is illegal to call the expedited forms from
	a CPU-hotplug notifier, or while holding a lock that is acquired
	by a CPU-hotplug notifier.  Failing to observe this restriction
	will result in deadlock.

7.	If the updater uses call_rcu() or synchronize_rcu(), then the
	corresponding readers must use rcu_read_lock() and
	rcu_read_unlock().  If the updater uses call_rcu_bh() or
@@ -321,7 +335,7 @@ over a rather long period of time, but improvements are always welcome!
	Similarly, disabling preemption is not an acceptable substitute
	for rcu_read_lock().  Code that attempts to use preemption
	disabling where it should be using rcu_read_lock() will break
	in real-time kernel builds.
	in CONFIG_PREEMPT=y kernel builds.

	If you want to wait for interrupt handlers, NMI handlers, and
	code under the influence of preempt_disable(), you instead
@@ -356,23 +370,22 @@ over a rather long period of time, but improvements are always welcome!
	not the case, a self-spawning RCU callback would prevent the
	victim CPU from ever going offline.)

14.	SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(),
	synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu())
	may only be invoked from process context.  Unlike other forms of
	RCU, it -is- permissible to block in an SRCU read-side critical
	section (demarked by srcu_read_lock() and srcu_read_unlock()),
	hence the "SRCU": "sleepable RCU".  Please note that if you
	don't need to sleep in read-side critical sections, you should be
	using RCU rather than SRCU, because RCU is almost always faster
	and easier to use than is SRCU.

	Also unlike other forms of RCU, explicit initialization
	and cleanup is required via init_srcu_struct() and
	cleanup_srcu_struct().	These are passed a "struct srcu_struct"
	that defines the scope of a given SRCU domain.	Once initialized,
	the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock()
	synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu().
	A given synchronize_srcu() waits only for SRCU read-side critical
14.	Unlike other forms of RCU, it -is- permissible to block in an
	SRCU read-side critical section (demarked by srcu_read_lock()
	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
	Please note that if you don't need to sleep in read-side critical
	sections, you should be using RCU rather than SRCU, because RCU
	is almost always faster and easier to use than is SRCU.

	Also unlike other forms of RCU, explicit initialization and
	cleanup is required either at build time via DEFINE_SRCU()
	or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
	and cleanup_srcu_struct().  These last two are passed a
	"struct srcu_struct" that defines the scope of a given
	SRCU domain.  Once initialized, the srcu_struct is passed
	to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
	synchronize_srcu_expedited(), and call_srcu().	A given
	synchronize_srcu() waits only for SRCU read-side critical
	sections governed by srcu_read_lock() and srcu_read_unlock()
	calls that have been passed the same srcu_struct.  This property
	is what makes sleeping read-side critical sections tolerable --
@@ -390,10 +403,16 @@ over a rather long period of time, but improvements are always welcome!
	Therefore, SRCU should be used in preference to rw_semaphore
	only in extremely read-intensive situations, or in situations
	requiring SRCU's read-side deadlock immunity or low read-side
	realtime latency.
	realtime latency.  You should also consider percpu_rw_semaphore
	when you need lightweight readers.
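
	As a sketch of typical SRCU usage (the "my_srcu" domain and the
	"gp"/"newp"/"oldp" pointers are hypothetical):

		DEFINE_STATIC_SRCU(my_srcu);

		/* Reader, which is permitted to block: */
		idx = srcu_read_lock(&my_srcu);
		p = srcu_dereference(gp, &my_srcu);
		/* ... use p, possibly sleeping ... */
		srcu_read_unlock(&my_srcu, idx);

		/* Updater, where oldp is the value being replaced: */
		rcu_assign_pointer(gp, newp);
		synchronize_srcu(&my_srcu);	/* wait for pre-existing readers */
		kfree(oldp);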

	Note that, rcu_assign_pointer() relates to SRCU just as it does
	to other forms of RCU.
	SRCU's expedited primitive (synchronize_srcu_expedited())
	never sends IPIs to other CPUs, so it is easier on
	real-time workloads than is synchronize_rcu_expedited(),
	synchronize_rcu_bh_expedited() or synchronize_sched_expedited().

	Note that rcu_dereference() and rcu_assign_pointer() relate to
	SRCU just as they do to other forms of RCU.

15.	The whole point of call_rcu(), synchronize_rcu(), and friends
	is to wait until all pre-existing readers have finished before
@@ -435,3 +454,33 @@ over a rather long period of time, but improvements are always welcome!

	These debugging aids can help you find problems that are
	otherwise extremely difficult to spot.

18.	If you register a callback using call_rcu(), call_rcu_bh(),
	call_rcu_sched(), or call_srcu(), and pass in a function defined
	within a loadable module, then it is necessary to wait for
	all pending callbacks to be invoked after the last invocation
	and before unloading that module.  Note that it is absolutely
	-not- sufficient to wait for a grace period!  The current (say)
	synchronize_rcu() implementation waits only for all previous
	callbacks registered on the CPU that synchronize_rcu() is running
	on, but it is -not- guaranteed to wait for callbacks registered
	on other CPUs.

	You instead need to use one of the barrier functions:

	o	call_rcu() -> rcu_barrier()
	o	call_rcu_bh() -> rcu_barrier_bh()
	o	call_rcu_sched() -> rcu_barrier_sched()
	o	call_srcu() -> srcu_barrier()

	However, these barrier functions are absolutely -not- guaranteed
	to wait for a grace period.  In fact, if there are no call_rcu()
	callbacks waiting anywhere in the system, rcu_barrier() is within
	its rights to return immediately.

	So if you need to wait for both an RCU grace period and for
	all pre-existing call_rcu() callbacks, you will need to execute
	both rcu_barrier() and synchronize_rcu(), if necessary, using
	something like workqueues to execute them concurrently.
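
	For example, a module-exit path might be sketched roughly as
	follows (the helper and cache names are hypothetical, and the
	two waits are shown sequentially rather than concurrently for
	simplicity):

		static void __exit foo_exit(void)
		{
			foo_stop_posting_callbacks();	/* hypothetical */
			rcu_barrier();		/* wait for pending callbacks */
			synchronize_rcu();	/* wait for a grace period, if also needed */
			kmem_cache_destroy(foo_cache);	/* hypothetical */
		}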

	See rcubarrier.txt for more information.
+3 −6
@@ -76,15 +76,12 @@ o I hear that RCU is patented? What is with that?
	Of these, one was allowed to lapse by the assignee, and the
	others have been contributed to the Linux kernel under GPL.
	There are now also LGPL implementations of user-level RCU
	available (http://lttng.org/?q=node/18).
	available (http://liburcu.org/).

o	I hear that RCU needs work in order to support realtime kernels?

	This work is largely completed.  Realtime-friendly RCU can be
	enabled via the CONFIG_PREEMPT_RCU kernel configuration
	parameter.  However, work is in progress for enabling priority
	boosting of preempted RCU read-side critical sections.	This is
	needed if you have CPU-bound realtime threads.
	Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
	kernel configuration parameter.

o	Where can I find more information on RCU?

+21 −40
@@ -25,35 +25,35 @@ o You must use one of the rcu_dereference() family of primitives
	for an example where the compiler can in fact deduce the exact
	value of the pointer, and thus cause misordering.

o	You are only permitted to use rcu_dereference on pointer values.
	The compiler simply knows too much about integral values to
	trust it to carry dependencies through integer operations.
	There are a very few exceptions, namely that you can temporarily
	cast the pointer to uintptr_t in order to:

	o	Set bits and clear bits down in the must-be-zero low-order
		bits of that pointer.  This clearly means that the pointer
		must have alignment constraints, for example, this does
		-not- work in general for char* pointers.

	o	XOR bits to translate pointers, as is done in some
		classic buddy-allocator algorithms.

	It is important to cast the value back to pointer before
	doing much of anything else with it.

o	Avoid cancellation when using the "+" and "-" infix arithmetic
	operators.  For example, for a given variable "x", avoid
	"(x-x)".  There are similar arithmetic pitfalls from other
	arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
	The compiler is within its rights to substitute zero for all of
	these expressions, so that subsequent accesses no longer depend
	on the rcu_dereference(), again possibly resulting in bugs due
	to misordering.
	"(x-(uintptr_t)x)" for char* pointers.	The compiler is within its
	rights to substitute zero for this sort of expression, so that
	subsequent accesses no longer depend on the rcu_dereference(),
	again possibly resulting in bugs due to misordering.

	Of course, if "p" is a pointer from rcu_dereference(), and "a"
	and "b" are integers that happen to be equal, the expression
	"p+a-b" is safe because its value still necessarily depends on
	the rcu_dereference(), thus maintaining proper ordering.

o	Avoid all-zero operands to the bitwise "&" operator, and
	similarly avoid all-ones operands to the bitwise "|" operator.
	If the compiler is able to deduce the value of such operands,
	it is within its rights to substitute the corresponding constant
	for the bitwise operation.  Once again, this causes subsequent
	accesses to no longer depend on the rcu_dereference(), causing
	bugs due to misordering.

	Please note that single-bit operands to bitwise "&" can also
	be dangerous.  At this point, the compiler knows that the
	resulting value can only take on one of two possible values.
	Therefore, a very small amount of additional information will
	allow the compiler to deduce the exact value, which again can
	result in misordering.

o	If you are using RCU to protect JITed functions, so that the
	"()" function-invocation operator is applied to a value obtained
	(directly or indirectly) from rcu_dereference(), you may need to
@@ -61,25 +61,6 @@ o If you are using RCU to protect JITed functions, so that the
	This issue arises on some systems when a newly JITed function is
	using the same memory that was used by an earlier JITed function.

o	Do not use the results from the boolean "&&" and "||" when
	dereferencing.	For example, the following (rather improbable)
	code is buggy:

		int *p;
		int *q;

		...

		p = rcu_dereference(gp)
		q = &global_q;
		q += p != &oom_p1 && p != &oom_p2;
		r1 = *q;  /* BUGGY!!! */

	The reason this is buggy is that "&&" and "||" are often compiled
	using branches.  While weak-memory machines such as ARM or PowerPC
	do order stores after such branches, they can speculate loads,
	which can result in misordering bugs.

o	Do not use the results from relational operators ("==", "!=",
	">", ">=", "<", or "<=") when dereferencing.  For example,
	the following (quite strange) code is buggy:
+5 −0
@@ -263,6 +263,11 @@ Quick Quiz #2: What happens if CPU 0's rcu_barrier_func() executes
	are delayed for a full grace period? Couldn't this result in
	rcu_barrier() returning prematurely?

The current rcu_barrier() implementation is more complex, due to the need
to avoid disturbing idle CPUs (especially on battery-powered systems)
and the need to minimally disturb non-idle CPUs in real-time systems.
However, the code above illustrates the concepts.


rcu_barrier() Summary
