msm: kgsl: Track the fault state of the adreno device

While recovering from a GPU fault the code does a delicate dance with
the device mutex to recover the system and replay the ringbuffer.
After snapshot the mutex is released for a while until the reset
occurs. This opens up the possibility that other commands will be sent
to the ringbuffer. On 3XX this doesn't matter because the GPU is in
fault and the CP is halted until the reset so the commands in the
ringbuffer go unnoticed. However on 4XX the CP gets reset during
snapshotting due to an erratum with the HLSQ and so we drop out of the
snapshot with a live CP which dutifully tries to execute the new
commands on the ringbuffer. But since the CP hasn't been reprogrammed
yet... hilarity ensues.

But the broader question remains - why are we trying to send commands
during this delicate time? These aren't user commands because the
dispatcher mutex is held - and the only other unsolicited commands to
the ringbuffer that are not part of a standard command batch
submission are the CP initialization sequence or... drumroll... IOMMU
context switch.

In situations where the context is invalidated immediately following
a hang the context destroy will get the mutex while the GPU is still
resetting itself, try to switch to the default IOMMU pagetable
and thus the problem occurs.

The solution is to maintain a flag of the current fault state and
gracefully bail out from trying to program the IOMMU hardware
which is going to be reset very soon anyway.
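As a rough illustration of the idea (not the code in this change), the
sketch below keeps a fault flag on a stand-in device struct and checks
it before touching the pagetable. All names here (sketch_device,
sketch_set_fault, sketch_switch_pagetable) are hypothetical, and
returning -EBUSY from the bail-out path is an assumption; the real
driver may simply skip the switch and report success.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; not the real adreno_device. */
struct sketch_device {
	atomic_bool in_fault;   /* true from fault detection until reset is done */
	int pagetable_switches; /* counts switches actually sent to the ringbuffer */
};

/* Called when a fault is detected (true) and after the reset completes (false). */
static void sketch_set_fault(struct sketch_device *dev, bool state)
{
	atomic_store(&dev->in_fault, state);
}

/*
 * Stand-in for the IOMMU context switch. While the device is in fault
 * the hardware is about to be reset, so do not submit anything to the
 * ringbuffer and bail out instead.
 */
static int sketch_switch_pagetable(struct sketch_device *dev)
{
	if (atomic_load(&dev->in_fault))
		return -EBUSY; /* assumed error code, see note above */

	dev->pagetable_switches++; /* placeholder for the real submission */
	return 0;
}
```

The flag is atomic in this sketch because the check can run on a path
that does not hold the same lock as the fault handler, which is the
window described above.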
CRs-fixed: 664630
Change-Id: Ic0dedbad79bfa9eb93b1fb800db0d41e60cc15bc
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>