Fix backup-agent timeouts
Away in the misty span of very-long-ago, it was suggested that spinning a separate thread to run the backup process was wasteful, and that it could just run it inline on the dedicated HandlerThread that the backup manager uses for its own operations. That was indeed true, except that the timeout management was also using delayed messages to that handler. You see where this is going: timeouts were never actually being processed, with the effect that a badly-behaving app's backup agent could lock up the entire backup / restore system until the device was rebooted. This is bad. Backup operations are now driven as an asynchronous state machine: each step (init, call one agent to obtain data, send resulting data to the transport, finalize the backup) is handled as a formal state transition on-looper. No synchronous wait-for-completion or -timeout is performed on any thread. As an additional effect this greatly tightens up the serialization and locking semantics. We no longer have to worry about an in- flight operation involving a standalone thread spinning off on its own; everything is on the HandlerThread and can be coherently manipulated from that perspective. Along the way, this CL tightens up the per-agent error handling logic. Previously a single failed agent would abort the entire backup process, tantamount to a transport-level failure. This could mean that the aforesaid badly-behaving app's agent could in effect starve out other apps whose agents were routinely showing up later in the queue. There's some nondeterminism involved, but in practice it could and did happen. Furthermore, the failure case would reschedule *immediately* in this case, because the transport itself would see that all is well and sure, why not run a backup soon? This, as you might imagine, causes battery-life issues. Now we note that the single agent has failed, mark it for a future repeat attempt, and process the rest of the queue normally, pretending success at the transport level even though we didn't actually send any data for that app. This means that (a) we now finish running backups for everything in the queue, (b) reschedule backups only for those apps whose agents individually failed during this run, and (c) perform the retry after the normal interval [typically on the order of an hour] rather than immediately. NOTE: this CL does not retool the restore code path, just backup. Restore is similarly vulnerable to misbehaving apps, though, so a future CL will address that bug vector. Addresses bug 5074923 Change-Id: I67e3f8d06f322607881eaa4093de6d675b85ff2c
Loading
Please register or sign in to comment