Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit b117b9a4 authored by Remi NGUYEN VAN's avatar Remi NGUYEN VAN
Browse files

Remove uptime check for crash rate limiting

When the network stack process is lost and ConnectivityModuleConnector
skips crashing, other system components generally crash shortly after
(for example NsdService trying to use tethering callbacks, or wifi
trying to communicate with IpClient). The only way to recover from the
bad state is to crash, so make sure it happens even in the first 30
minutes after boot.

This makes sure the system recovers faster, and should make it easier
to investigate and fix issues causing the network stack process crash,
as it avoids reports coming from other components.

The "at most once every 6 hours" rate limit is still present (for now),
so this should still avoid bootloops. Reducing or removing that limit
may be considered if it turns out the system is still left in a bad
state too often (causing reports of crashes in other components).

Bug: 434820757
Test: m
Flag: EXEMPT bugfix
Change-Id: I344052e1614f39c295b8c0701f46871e8c6491a2
parent e4de5321
Loading
Loading
Loading
Loading
+12 −19
Original line number Diff line number Diff line
@@ -29,7 +29,6 @@ import android.os.Build;
import android.os.Environment;
import android.os.IBinder;
import android.os.Process;
import android.os.SystemClock;
import android.os.UserHandle;
import android.provider.DeviceConfig;
import android.text.format.DateUtils;
@@ -54,7 +53,6 @@ public class ConnectivityModuleConnector {
    private static final String PREFS_FILE = "ConnectivityModuleConnector.xml";
    private static final String PREF_KEY_LAST_CRASH_TIME = "lastcrash_time";
    private static final String CONFIG_MIN_CRASH_INTERVAL_MS = "min_crash_interval";
    private static final String CONFIG_MIN_UPTIME_BEFORE_CRASH_MS = "min_uptime_before_crash";
    private static final String CONFIG_ALWAYS_RATELIMIT_NETWORKSTACK_CRASH =
            "always_ratelimit_networkstack_crash";

@@ -64,11 +62,6 @@ public class ConnectivityModuleConnector {
    // This is the default value: the actual value can be adjusted via device config.
    private static final long DEFAULT_MIN_CRASH_INTERVAL_MS = 6 * DateUtils.HOUR_IN_MILLIS;

    // Even if the network stack is lost, do not crash the system server if it was less than
    // this much after boot. This avoids bootlooping the device, and crashes should address very
    // infrequent failures, not failures on boot.
    private static final long DEFAULT_MIN_UPTIME_BEFORE_CRASH_MS = 30 * DateUtils.MINUTE_IN_MILLIS;

    private static ConnectivityModuleConnector sInstance;

    private Context mContext;
@@ -296,36 +289,36 @@ public class ConnectivityModuleConnector {
    private synchronized void maybeCrashWithTerribleFailure(@NonNull String message,
            @Nullable String packageName) {
        logWtf(message, null);
        // uptime is monotonic even after a framework restart
        final long uptime = SystemClock.elapsedRealtime();
        final long now = System.currentTimeMillis();
        final long minCrashIntervalMs = DeviceConfig.getLong(DeviceConfig.NAMESPACE_CONNECTIVITY,
                CONFIG_MIN_CRASH_INTERVAL_MS, DEFAULT_MIN_CRASH_INTERVAL_MS);
        final long minUptimeBeforeCrash = DeviceConfig.getLong(DeviceConfig.NAMESPACE_CONNECTIVITY,
                CONFIG_MIN_UPTIME_BEFORE_CRASH_MS, DEFAULT_MIN_UPTIME_BEFORE_CRASH_MS);
        final boolean alwaysRatelimit = DeviceConfig.getBoolean(DeviceConfig.NAMESPACE_CONNECTIVITY,
                CONFIG_ALWAYS_RATELIMIT_NETWORKSTACK_CRASH, false);

        final SharedPreferences prefs = getSharedPreferences();
        final long lastCrashTime = tryGetLastCrashTime(prefs);

        // Only crash if there was enough time since boot, and (if known) enough time passed since
        // the last crash.
        // Only crash if (if known) enough time passed since the last crash.
        // time and lastCrashTime may be unreliable if devices have incorrect clock time, but they
        // are only used to limit the number of crashes compared to only using the time since boot,
        // which would also be OK behavior by itself.
        // - If lastCrashTime is incorrectly more than the current time, only look at uptime
        // - If it is much less than current time, only look at uptime
        // are only used as a best-effort to avoid crashloops. When crashing the last crash time is
        // overwritten, so unless the clock is consistently jumping back-and-forth the next boot
        // will not crash again. If the clock is consistently at the same time during boot, we may
        // always skip crashing, which is not a big problem either, although it leaves the device in
        // a bad state until another component crashes or the user reboots.
        // - If lastCrashTime is incorrectly more or much less than the current time, crash.
        // - If current time is during the next few hours after last crash time, don't crash.
        //   Considering that this only matters if last boot was some time ago, it's likely that
        //   time will be set correctly. Otherwise, not crashing is not a big problem anyway. Being
        //   in this last state would also not last for long since the window is only a few hours.
        //
        // In practice when the system does not crash here, other components that depend on the
        // network stack process often end up crashing anyway, as they do not support re-syncing
        // state with a restarted network stack process.
        final boolean alwaysCrash = Build.IS_DEBUGGABLE && !alwaysRatelimit;
        final boolean justBooted = uptime < minUptimeBeforeCrash;
        final boolean haveLastCrashTime = (lastCrashTime != 0) && (lastCrashTime < now);
        final boolean haveKnownRecentCrash =
                haveLastCrashTime && (now < lastCrashTime + minCrashIntervalMs);
        if (alwaysCrash || (!justBooted && !haveKnownRecentCrash)) {
        if (alwaysCrash || !haveKnownRecentCrash) {
            // The system is not bound to its network stack (for example due to a crash in the
            // network stack process): better crash rather than stay in a bad state where all
            // networking is broken.