sched: Reset decay_count when task is enqueued
A non-zero and positive decay_count indicates the time when a task
went to sleep and thus was removed from its cfs_rq.
cfs_rq->blocked_load_avg tracks load of such "blocked" tasks.
cfs_rq->blocked_load_avg is decayed over time, which in turn reflects
the decay of the blocked tasks' load. cfs_rq->decay_counter records when
blocked_load_avg was last decayed. It is derived from rq->clock_task,
which can differ from cpu to cpu.
When tasks go to sleep, their decay_count is set to
cfs_rq->decay_counter.
When a task wakes up from sleep, its (newly decayed) load_avg needs to
be removed from cfs_rq->blocked_load_avg (as the task is no longer
blocked). The amount of decay for the task's load_avg is defined by its
sleep time, roughly derived as (cfs_rq->decay_counter -
se->decay_count). This is accomplished in __synchronize_entity_decay().
Once the task's load_avg has been decayed and subtracted from
cfs_rq->blocked_load_avg, decay_count should be reset to 0 to
indicate that the task is no longer sleeping and its load_avg has been
synchronized with the decay of blocked_load_avg. A zero decay_count thus
signifies that a task is on a runqueue and its load_avg has been decayed
and synchronized with that of cfs_rq->blocked_load_avg.
A negative decay_count on the other hand signifies a task that is
being migrated across cpus during wakeup. Let's say a task went to sleep
on CPU0 and is waking on CPU1. In this case, the task's load_avg needs to
be decayed first (over its sleep time derived as
cfs_rq0->decay_counter - se->decay_count), then subtracted from
cfs_rq0->blocked_load_avg, and finally the task's load metrics
(runnable_avg_sum) need to be decayed over its sleep time. As the task's
sleep_time is deduced from (rq->clock_task -
se->avg.last_runnable_update), and since se->avg.last_runnable_update
is in reference to CPU0's clock_task, it would be inappropriate to
deduce the task's sleep time from CPU1's rq->clock_task. Thus, in this
case, when a task is migrated to a different cpu at wakeup time, its
decay_count is set to the negative sleep time, derived as
-(CPU0 cfs_rq->decay_counter - se->avg.decay_count). This information
is used during enqueue of the task on CPU1 to adjust the task's
se->avg.last_runnable_update to (CPU1 rq->clock_task -
(-se->avg.decay_count)). This lets the task's runnable_avg_sum be
decayed correctly over its sleep time by referencing CPU1's
rq->clock_task and the task's se->avg.last_runnable_update.
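The timestamp adjustment for a wake-migration can be sketched in
user-space C. This is a simplified model, not the kernel code:
struct model_se and rebase_on_enqueue() are hypothetical names, and any
unit conversion between decay periods and clock_task is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task fields named in this message. */
struct model_se {
	int64_t  decay_count;          /* < 0: negated sleep time of a wake-migration */
	uint64_t last_runnable_update; /* timestamp in the owning CPU's clock_task */
};

/*
 * Model of the enqueue-side adjustment on the destination CPU.
 * dst_clock_task plays the role of CPU1's rq->clock_task.
 * Returns the interval over which runnable_avg_sum would then be decayed.
 */
static uint64_t rebase_on_enqueue(struct model_se *se, uint64_t dst_clock_task)
{
	/* This sketch covers only the migration path (decay_count <= 0). */
	assert(se->decay_count <= 0);

	uint64_t slept = (uint64_t)(-se->decay_count);

	assert(dst_clock_task >= slept);

	/* Re-express the time the task went to sleep in CPU1's clock. */
	se->last_runnable_update = dst_clock_task - slept;

	/* The task is now on a runqueue and synchronized. */
	se->decay_count = 0;

	return dst_clock_task - se->last_runnable_update;
}
```

For example, a task carrying decay_count = -30 that is enqueued when the
destination clock reads 500000 ends up with last_runnable_update =
499970, regardless of the stale CPU0 timestamp it arrived with.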
The bug that currently exists occurs when a task wakes up from a "short"
sleep (a couple of ms) on the same cpu where it last ran and is
subsequently migrated.
t0 -> Task A went to sleep on CPU0.
      A->se.avg.decay_count = CPU0 cfs_rq->decay_counter = t0
t1 -> Task A woke up from sleep. CPU0's cfs_rq->decay_counter is still
      t0. Because of this, __synchronize_entity_decay() does nothing.
      It also returns *without* resetting the task's decay_count.
t2 -> CPU0's blocked_load_avg is decayed. cfs_rq->decay_counter = t2
t3 -> Task A is migrated from CPU0 to CPU1. migrate_task_rq_fair()
      assumes that this is a case of migration during wakeup, as A's
      decay_count is non-zero and positive. It then deduces the task's
      sleep time as (t2-t0) and decays its load_avg over that sleep
      time. The task's decay_count is set to -(t2-t0). When the task is
      later enqueued on CPU1, its load metrics (runnable_avg_sum) are
      decayed to account for a "sleep" interval of (t2-t0), which
      is *wrong* and further results in inaccurate load information
      for the task.
The fix is to have __synchronize_entity_decay() reset decay_count
even when it deduces zero sleep time for the task.
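The t0..t3 sequence and the effect of the fix can be replayed with a
small user-space model. It is only a sketch under stated assumptions:
the model_* names are hypothetical, the structs carry just the fields
discussed here, and model_decay_load() is a crude stand-in (one halving
per period) for the kernel's real decay function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the fields named in this message. */
struct model_cfs_rq { uint64_t decay_counter; };
struct model_se     { int64_t decay_count; uint64_t load_avg; };

/* Stand-in decay: halve the load once per elapsed period. */
static uint64_t model_decay_load(uint64_t load, uint64_t periods)
{
	return periods >= 64 ? 0 : load >> periods;
}

/*
 * Model of __synchronize_entity_decay(). With reset_always == false the
 * early return on zero elapsed periods leaves decay_count untouched,
 * which is the behaviour described as buggy above.
 */
static uint64_t model_sync_decay(const struct model_cfs_rq *cfs_rq,
				 struct model_se *se, bool reset_always)
{
	uint64_t decays = cfs_rq->decay_counter - (uint64_t)se->decay_count;

	if (reset_always)
		se->decay_count = 0;
	if (!decays)
		return 0;

	se->load_avg = model_decay_load(se->load_avg, decays);
	se->decay_count = 0;
	return decays;
}

/* Model of the migration path: a positive decay_count means "still asleep". */
static void model_migrate(const struct model_cfs_rq *src, struct model_se *se,
			  bool reset_always)
{
	if (se->decay_count > 0)
		se->decay_count = -(int64_t)model_sync_decay(src, se, reset_always);
}

/* Replays t0..t3; returns the decay_count the task carries to CPU1. */
static int64_t replay_short_sleep(bool reset_always)
{
	struct model_cfs_rq cpu0 = { .decay_counter = 100 };
	struct model_se a = { .decay_count = 0, .load_avg = 1024 };

	a.decay_count = (int64_t)cpu0.decay_counter; /* t0: A sleeps on CPU0 */
	model_sync_decay(&cpu0, &a, reset_always);   /* t1: A wakes, counter unchanged */
	cpu0.decay_counter = 103;                    /* t2: blocked load is decayed */
	model_migrate(&cpu0, &a, reset_always);      /* t3: A migrates to CPU1 */
	return a.decay_count;
}
```

Without the reset, the task arrives on CPU1 with a negative decay_count
equal to -(t2-t0) and is treated as having slept for that interval. With
the reset, decay_count is already 0 at t3 and the migration leaves the
task's load untouched.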
Change-Id: I1016ecb148d62ff15ed698a5cca1a06afb73151f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>