Commit d1a9b2e2 authored Apr 11, 2025 by Douglas Anderson Committed by Doug Anderson May 06, 2025
init: reboot: Simplify the RebootMonitorThread()

Since its introduction in r.android.com/946516, the
RebootMonitorThread() has had a bit of a complex control
flow. Specifically:

1. We create it in DoReboot() but it doesn't kick off right away. It
   starts out paused and an odd number of semaphore signals will
   resume it (and even number will pause it).
2. We signal it to start via signaling the semaphore (1 semaphore
   signal = odd = resume)
3. Before starting an fsck(), we pause the monitor via semaphore (2
   semaphore signals = even = pause).
4. After finishing fsck(), we resume the monitor via semaphore (3
   semaphore signals = odd = resume).
5. Finally, before doing a few last steps we signal the thread to
   exit.

NOTE: when the code was first added RebootMonitorThread()'s timeout
was 30 seconds. As of r.android.com/1275254 it's now 300 (it can be
overridden by a property, but code seems designed for it to be long).

Let's take a step back and make a few observations.
a) We really don't want _any_ part of the reboot/shutdown sequence to
   be "unguarded" by a watchdog. Any time we pause and/or turn off the
   watchdog it's an opportunity for the system to get stuck. When the
   system is running shutdown code the user has no way to get out so
   we need some guarantee it won't take too long.
b) The cases where were pausing/unpausing the watchdog weren't really
   effective. It can be observed that in `UmountPartitions()` (which
   runs _before_ the fsck() we paused for) we finishing killing nearly
   all userspace processes (earlier we killed the
   non-shutdown-critical ones, here we kill the shutdown-critical
   ones). Assuming we have a userspace program patting something in
   /dev/watchdog* (like Pixel devices do) that watchdog program will
   be killed. Now it will be a matter of time before the kernel
   reboots anyway. Thus the pause for `fsck()` makes little sense.
   Similarly it doesn't make sense to bother stopping the shutdown
   thread before the final steps.

As per the above logic, simplify things. When we start
RebootMonitorThread() we'll just do a big delay. If the system is
still alive after the delay finishes then that's bad and we should
take evasive actions.

Given the way things work now, let's also not bother accounting for
the "clean" shutdown timeout. That number is designed to be much
smaller than this 300 second last-resort and adding it in just
increases complexity.

NOTE: future changes can continue to refine the RebootMonitorThread().
For now we'll just do the simple thing and keeps all the "evasive
actions" the same.

ALSO NOTE: If someone made an Android phone where they have heavily
tuned "ro.build.shutdown.watchdog.timeout" to be as short as possible,
this could potentially mess them up. I can't quite see a reason for
someone to have done that, though. A future change will enforce a
minimum of 60 seconds.

Bug: 409835922
Test: Hack delays in various parts of DoReboot() and see timeout.
Change-Id: I4966b6987965b25ad38a63ef026277cee1e93927
parent df66c4b2
Show whitespace changes
Inline Side-by-side
Please register or to comment