BACKPORT: rcu: Fix missed wakeup of exp_wq waiters

Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:

1. Tasks A and B both attempt to start an expedited grace period at
   about the same time. This grace period will have completed when the
   lower four bits of the rcu_state structure's ->expedited_sequence
   field are 0b0100, for example, when the initial value of this
   counter is zero. Task A wins, and thus does the actual work of
   starting the grace period, including acquiring the rcu_state
   structure's .exp_mutex and setting the counter to 0b0001.

2. Because Task B lost the race to start the grace period, it waits on
   ->expedited_sequence to reach 0b0100 inside of exp_funnel_lock().
   This task therefore blocks on the rcu_node structure's ->exp_wq[1]
   field, keeping in mind that the end-of-grace-period value of
   ->expedited_sequence (0b0100) is shifted down two bits before
   indexing the ->exp_wq[] field.

3. Task C attempts to start another expedited grace period, but blocks
   on ->exp_mutex, which is still held by Task A.

4. The aforementioned expedited grace period completes, so that
   ->expedited_sequence now has the value 0b0100. A kworker task
   therefore acquires the rcu_state structure's ->exp_wake_mutex and
   starts awakening any tasks waiting for this grace period.

5. One of the first tasks awakened happens to be Task A. Task A
   therefore releases the rcu_state structure's ->exp_mutex, which
   allows Task C to start the next expedited grace period, which
   causes the lower four bits of the rcu_state structure's
   ->expedited_sequence field to become 0b0101.

6. Task C's expedited grace period completes, so that the lower four
   bits of the rcu_state structure's ->expedited_sequence field now
   become 0b1000.

7. The kworker task from step 4 above continues its wakeups.

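
The slot arithmetic in steps 2 and 6 can be checked with a small
userspace sketch. It is illustrative only and is not kernel code:
seq_ctr(), exp_wq_index() and check_steps() are hypothetical stand-ins,
and the two-bit shift and 0x3 mask are assumptions taken from the
description above.

```c
/* Assumed: the low two bits of the sequence value are state bits. */
#define SEQ_CTR_SHIFT 2

/* Hypothetical stand-in for rcu_seq_ctr(): drop the low two bits. */
static unsigned long seq_ctr(unsigned long s)
{
	return s >> SEQ_CTR_SHIFT;
}

/* Slot of the four-entry ->exp_wq[] array selected by value s. */
static unsigned long exp_wq_index(unsigned long s)
{
	return seq_ctr(s) & 0x3;
}

/* Replays the values from the steps above; returns 1 if they agree. */
static int check_steps(void)
{
	unsigned long s_b = 0x4;   /* 0b0100: value Task B waits for */
	unsigned long later = 0x8; /* 0b1000: value after Task C's GP */

	return exp_wq_index(s_b) == 1 && exp_wq_index(later) == 2;
}
```

A waker that derives the slot from a snapshot taken at 0b0100 and one
that reads the counter after it has advanced to 0b1000 therefore touch
different wait queues.
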
Unfortunately, the wake_up_all() refetches the rcu_state structure's
.expedited_sequence field:

    wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

This results in the wakeup being applied to the rcu_node structure's
->exp_wq[2] field, which is unfortunate given that Task B is instead
waiting on ->exp_wq[1].

On a busy system, no harm is done (or at least no permanent harm is
done). Some later expedited grace period will redo the wakeup. But on a
quiet system, such as many embedded systems, it might be a good long
time before there is another expedited grace period. On such embedded
systems, this situation could therefore result in a system hang.

This issue manifested as a DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:

    schedule()
    synchronize_rcu_expedited()
    synchronize_rcu()
    scsi_device_quiesce()
    scsi_bus_suspend()
    dpm_run_callback()
    __device_suspend()

This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.

Fixes: 3b5f668e ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

conflicts:
    Wrap the commit message to fit 75 chars per line
    Resolved diffs between 4.9 and upstream

MTK-Commit-Id: 6b6450d5ec74ac7d6538340d63108592e491f44b
Change-Id: Ib027ff341a47d6a67ffad1c17fa4bbbe244d7a93
Signed-off-by: Cheng Jui Wang <cheng-jui.wang@mediatek.com>
CR-Id: ALPS05359266
Feature: [Module]Official Kernel Patch
(cherry picked from commit 61a05e9c0a49c6dc60fc5d7fd87672720cc78f42)