Commit 8e294d45 authored Aug 31, 2011 by Christopher Tate
Fix backup-agent timeouts

Away in the misty span of very-long-ago, it was suggested that spinning
a separate thread to run the backup process was wasteful, and that it
could just run it inline on the dedicated HandlerThread that the
backup manager uses for its own operations.  That was indeed true,
except that the timeout management was also using delayed messages
to that handler.  You see where this is going: timeouts were never
actually being processed, with the effect that a badly-behaving
app's backup agent could lock up the entire backup / restore system
until the device was rebooted.

This is bad.
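
To make the failure mode concrete, here is a minimal sketch in plain
Java, not the actual backup manager code.  A single-threaded scheduled
executor stands in for the HandlerThread, and the class and method
names are hypothetical.  The inline operation arms its own timeout on
the same thread and then blocks waiting for it, so the timeout cannot
run until the wait has already given up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; a single-threaded executor plays the role of the handler thread.
class StarvedTimeoutSketch {

    /** Returns true if the timeout fired while the inline operation was still waiting. */
    static boolean timeoutFiredInTime() {
        ScheduledExecutorService handler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch timedOut = new CountDownLatch(1);
        Runnable timeout = timedOut::countDown;

        // The "backup" runs inline on the handler thread: it posts a delayed
        // timeout to that same thread, then blocks waiting for it to fire.
        Callable<Boolean> inlineOperation = () -> {
            handler.schedule(timeout, 50, TimeUnit.MILLISECONDS);
            return timedOut.await(200, TimeUnit.MILLISECONDS);
        };

        try {
            return handler.submit(inlineOperation).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            handler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout fired in time: " + timeoutFiredInTime());
    }
}
```

The sketch bounds its wait at 200 ms so that it terminates; a wait
with no bound of its own, as with a misbehaving agent, would simply
never be interrupted by the timeout.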

Backup operations are now driven as an asynchronous state machine:
each step (init, call one agent to obtain data, send resulting
data to the transport, finalize the backup) is handled as a formal
state transition on-looper.  No synchronous wait-for-completion
or -timeout is performed on any thread.
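
The shape of that state machine can be sketched roughly as follows.
This is an illustration under assumptions, not the code in this CL:
the state names, the class name, and the in-memory queue standing in
for the looper are all hypothetical.  Each step performs one
transition and then posts the next step instead of waiting on
anything.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; a plain queue of Runnables stands in for the looper.
class BackupStateMachineSketch {
    enum State { INIT, RUN_AGENT, SEND_TO_TRANSPORT, FINALIZE, DONE }

    private final Queue<Runnable> looper = new ArrayDeque<>();
    private final Queue<String> pending;
    private final List<String> log = new ArrayList<>();
    private State state = State.INIT;
    private String current;

    BackupStateMachineSketch(List<String> packages) {
        pending = new ArrayDeque<>(packages);
    }

    // Perform exactly one transition, then post the next step rather than blocking.
    private void step() {
        switch (state) {
            case INIT:
                log.add("init");
                state = State.RUN_AGENT;
                break;
            case RUN_AGENT:
                current = pending.poll();
                if (current == null) {
                    state = State.FINALIZE;          // queue drained
                } else {
                    log.add("agent:" + current);     // ask one agent for its data
                    state = State.SEND_TO_TRANSPORT;
                }
                break;
            case SEND_TO_TRANSPORT:
                log.add("send:" + current);          // hand that data to the transport
                state = State.RUN_AGENT;
                break;
            case FINALIZE:
                log.add("finalize");
                state = State.DONE;
                break;
            default:
                return;
        }
        if (state != State.DONE) {
            looper.add(this::step);
        }
    }

    // Drain the stand-in looper; every step runs on the calling thread, in order.
    List<String> run() {
        looper.add(this::step);
        Runnable next;
        while ((next = looper.poll()) != null) {
            next.run();
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> packages = java.util.Arrays.asList("com.example.a", "com.example.b");
        System.out.println(new BackupStateMachineSketch(packages).run());
    }
}
```

Because each step returns to the queue between transitions, a delayed
timeout message posted to the same queue gets a chance to run, which
is the property the inline approach lacked.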

As an additional effect this greatly tightens up the serialization
and locking semantics.  We no longer have to worry about an in-
flight operation involving a standalone thread spinning off on
its own; everything is on the HandlerThread and can be coherently
manipulated from that perspective.

Along the way, this CL tightens up the per-agent error handling
logic.  Previously a single failed agent would abort the entire
backup process, tantamount to a transport-level failure.  This could
mean that the aforesaid badly-behaving app's agent could in effect
starve out other apps whose agents were routinely showing up later
in the queue.  There's some nondeterminism involved, but in practice
it could and did happen.  Furthermore, a failed run would be
rescheduled *immediately*, because the transport itself would see
that all is well and sure, why not run a backup soon?  This, as you
might imagine, causes battery-life issues.

Now we note that the single agent has failed, mark it for a future
repeat attempt, and process the rest of the queue normally, pretending
success at the transport level even though we didn't actually send
any data for that app.  This means that (a) we now finish running
backups for everything in the queue, (b) reschedule backups only for
those apps whose agents individually failed during this run, and
(c) perform the retry after the normal interval [typically on the
order of an hour] rather than immediately.
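
The per-agent handling described above might look roughly like this.
It is a sketch under assumptions: the class, the Result type, and the
predicate standing in for "call the agent" are hypothetical, and the
transport and the rescheduling itself are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "one failed agent does not abort the queue".
class AgentFailureSketch {

    static final class Result {
        final List<String> backedUp = new ArrayList<>();
        final List<String> retryLater = new ArrayList<>();
    }

    // agentSucceeds stands in for invoking one app's backup agent.
    static Result runQueue(List<String> queue, Predicate<String> agentSucceeds) {
        Result result = new Result();
        for (String pkg : queue) {
            boolean ok;
            try {
                ok = agentSucceeds.test(pkg);
            } catch (RuntimeException e) {
                ok = false;                      // a crashing agent counts as a failure
            }
            if (ok) {
                result.backedUp.add(pkg);
            } else {
                result.retryLater.add(pkg);      // mark for the next normal interval
            }
            // Either way, carry on with the rest of the queue.
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> queue = java.util.Arrays.asList("com.example.a", "com.example.bad", "com.example.c");
        Result r = runQueue(queue, pkg -> !pkg.endsWith(".bad"));
        System.out.println("backed up: " + r.backedUp + ", retry later: " + r.retryLater);
    }
}
```

Only the packages in retryLater would be queued again, and at the
normal interval rather than right away.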

NOTE: this CL does not retool the restore code path, just backup.
Restore is similarly vulnerable to misbehaving apps, though, so a
future CL will address that bug vector.

Addresses bug 5074923

Change-Id: I67e3f8d06f322607881eaa4093de6d675b85ff2c
parent 9100473a