Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 723b1133 authored by Nicholas Piggin's avatar Nicholas Piggin Committed by Michael Ellerman
Browse files

powerpc/watchdog: improve watchdog comments



The overview comments in the powerpc watchdog are out of date after
several iterations and changes of the code. Bring them up to date.

Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
parent 8183d99f
Loading
Loading
Loading
Loading
+38 −20
Original line number Diff line number Diff line
@@ -26,15 +26,45 @@
#include <asm/paca.h>

/*
 * The watchdog has a simple timer that runs on each CPU, once per timer
 * period. This is the heartbeat.
 * The powerpc watchdog ensures that each CPU is able to service timers.
 * The watchdog sets up a simple timer on each CPU to run once per timer
 * period, and updates a per-cpu timestamp and a "pending" cpumask. This is
 * the heartbeat.
 *
 * Then there are checks to see if the heartbeat has not triggered on a CPU
 * for the panic timeout period. Currently the watchdog only supports an
 * SMP check, so the heartbeat only turns on when we have 2 or more CPUs.
 * Then there are two systems to check that the heartbeat is still running.
 * The local soft-NMI, and the SMP checker.
 *
 * This is not an NMI watchdog, but Linux uses that name for a generic
 * watchdog in some cases, so NMI gets used in some places.
 * The soft-NMI checker can detect lockups on the local CPU. When interrupts
 * are disabled with local_irq_disable(), platforms that use soft-masking
 * can leave hardware interrupts enabled and handle them with a masked
 * interrupt handler. The masked handler can send the timer interrupt to the
 * watchdog's soft_nmi_interrupt(), which appears to Linux as an NMI
 * interrupt, and can be used to detect CPUs stuck with IRQs disabled.
 *
 * The soft-NMI checker will compare the heartbeat timestamp for this CPU
 * with the current time, and take action if the difference exceeds the
 * watchdog threshold.
 *
 * The limitation of the soft-NMI watchdog is that it does not work when
 * interrupts are hard disabled or otherwise not being serviced. This is
 * solved by also having a SMP watchdog where all CPUs check all other
 * CPUs heartbeat.
 *
 * The SMP checker can detect lockups on other CPUs. A gobal "pending"
 * cpumask is kept, containing all CPUs which enable the watchdog. Each
 * CPU clears their pending bit in their heartbeat timer. When the bitmask
 * becomes empty, the last CPU to clear its pending bit updates a global
 * timestamp and refills the pending bitmask.
 *
 * In the heartbeat timer, if any CPU notices that the global timestamp has
 * not been updated for a period exceeding the watchdog threshold, then it
 * means the CPU(s) with their bit still set in the pending mask have had
 * their heartbeat stop, and action is taken.
 *
 * Some platforms implement true NMI IPIs, which can by used by the SMP
 * watchdog to detect an unresponsive CPU and pull it out of its stuck
 * state with the NMI IPI, to get crash/debug data from it. This way the
 * SMP watchdog can detect hardware interrupts off lockups.
 */

static cpumask_t wd_cpus_enabled __read_mostly;
@@ -47,19 +77,7 @@ static u64 wd_timer_period_ms __read_mostly; /* interval between heartbeat */
static DEFINE_PER_CPU(struct timer_list, wd_timer);
static DEFINE_PER_CPU(u64, wd_timer_tb);

/*
 * These are for the SMP checker. CPUs clear their pending bit in their
 * heartbeat. If the bitmask becomes empty, the time is noted and the
 * bitmask is refilled.
 *
 * All CPUs clear their bit in the pending mask every timer period.
 * Once all have cleared, the time is noted and the bits are reset.
 * If the time since all clear was greater than the panic timeout,
 * we can panic with the list of stuck CPUs.
 *
 * This will work best with NMI IPIs for crash code so the stuck CPUs
 * can be pulled out to get their backtraces.
 */
/* SMP checker bits */
static unsigned long __wd_smp_lock;
static cpumask_t wd_smp_cpus_pending;
static cpumask_t wd_smp_cpus_stuck;