
Commit 7f51b800 authored by Jordan Crouse

msm: kgsl: Track the fault state of the adreno device



While recovering from a GPU fault the code does a delicate dance with
the device mutex to recover the system and replay the ringbuffer.
After snapshot the mutex is released for a while until the reset
occurs. This opens up the possibility that other commands will be sent
to the ringbuffer. On 3XX this doesn't matter because the GPU is in
fault and the CP is halted until the reset so the commands in the
ringbuffer go unnoticed. However on 4XX the CP gets reset during
snapshotting due to an erratum with the HLSQ and so we drop out of the
snapshot with a live CP which dutifully tries to execute the new
commands on the ringbuffer. But since the CP hasn't been reprogrammed
yet... hilarity ensues.

But the broader question remains - why are we trying to send commands
during this delicate time? These aren't user commands because the
dispatcher mutex is held - and the only other unsolicited commands to
the ringbuffer that are not part of a standard command batch
submission are the CP initialization sequence or... drumroll... IOMMU
context switch.

In situations where the context is invalidated immediately following
a hang, the context destroy will get the mutex while the GPU is still
resetting itself and try to switch to the default IOMMU pagetable,
and thus the problem occurs.

The solution is to maintain a flag of the current fault state and
gracefully bail out from trying to program the IOMMU hardware
which is going to be reset very soon anyway.
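
As a rough illustration of the idea (not the driver code itself), the
flag-and-bail-out pattern can be sketched in userspace C. Everything
prefixed with sketch_ is hypothetical: the struct, the helpers and the
placeholder return value of 4 are assumptions made for the example, and
C11 atomics stand in for the kernel's set_bit()/clear_bit()/test_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ADRENO_DEVICE_FAULT bit in adreno_dev->priv. */
enum { SKETCH_DEVICE_FAULT = 6 };

/* Hypothetical, heavily simplified device: a flag word and a counter. */
struct sketch_device {
	atomic_ulong priv;           /* flag bits, playing the role of adreno_dev->priv */
	unsigned int cmds_generated; /* number of pagetable switches actually emitted */
};

static void sketch_device_init(struct sketch_device *dev)
{
	assert(dev != NULL);
	atomic_init(&dev->priv, 0UL);
	dev->cmds_generated = 0;
}

/* Userspace approximations of the kernel bit helpers. */
static void sketch_set_bit(int nr, atomic_ulong *word)
{
	atomic_fetch_or(word, 1UL << nr);
}

static void sketch_clear_bit(int nr, atomic_ulong *word)
{
	atomic_fetch_and(word, ~(1UL << nr));
}

static bool sketch_test_bit(int nr, atomic_ulong *word)
{
	return ((atomic_load(word) >> nr) & 1UL) != 0;
}

/* Fault handler entry: mark the device as being in fault. */
static void sketch_fault_begin(struct sketch_device *dev)
{
	sketch_set_bit(SKETCH_DEVICE_FAULT, &dev->priv);
}

/* Fault handler exit: reset and replay are done, allow commands again. */
static void sketch_fault_end(struct sketch_device *dev)
{
	sketch_clear_bit(SKETCH_DEVICE_FAULT, &dev->priv);
}

/*
 * Returns the number of command dwords generated. While the fault bit
 * is set nothing is generated and 0 is returned, so the caller has
 * nothing to put on the ringbuffer.
 */
static unsigned int sketch_set_pt_generate_cmds(struct sketch_device *dev)
{
	assert(dev != NULL);

	/* If we are in a fault the MMU will be reset soon anyway. */
	if (sketch_test_bit(SKETCH_DEVICE_FAULT, &dev->priv))
		return 0;

	dev->cmds_generated++;
	return 4; /* arbitrary placeholder size, not a real command count */
}
```

In this sketch a pagetable switch requested between sketch_fault_begin()
and sketch_fault_end() yields zero commands, which mirrors the early
return added to adreno_iommu_set_pt_generate_cmds() in the diff.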

CRs-fixed: 664630
Change-Id: Ic0dedbad79bfa9eb93b1fb800db0d41e60cc15bc
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
parent f897dce9
+6 −0
@@ -289,6 +289,11 @@ struct adreno_device {
 * after power collapse
 * @ADRENO_DEVICE_CORESIGHT - Set if the coresight (trace bus) registers should
 * be restored after power collapse
 * @ADRENO_DEVICE_HANG_INTR - Set if the hang interrupt should be enabled for
 * this target
 * @ADRENO_DEVICE_STARTED - Set if the device start sequence is in progress
 * @ADRENO_DEVICE_FAULT - Set if the device is currently in fault (and shouldn't
 * send any more commands to the ringbuffer)
 */
enum adreno_device_flags {
	ADRENO_DEVICE_PWRON = 0,
@@ -297,6 +302,7 @@ enum adreno_device_flags {
	ADRENO_DEVICE_CORESIGHT = 3,
	ADRENO_DEVICE_HANG_INTR = 4,
	ADRENO_DEVICE_STARTED = 5,
	ADRENO_DEVICE_FAULT = 6,
};

#define PERFCOUNTER_FLAG_NONE 0x0
+10 −0
@@ -1199,6 +1199,13 @@ static int dispatcher_do_fault(struct kgsl_device *device)

	mutex_lock(&device->mutex);

	/*
	 * Set the fault bit to make sure that no other threads try to use the
	 * GPU until we are done here
	 */

	set_bit(ADRENO_DEVICE_FAULT, &adreno_dev->priv);

	/* hang opcode */
	kgsl_cffdump_hang(device);

@@ -1511,6 +1518,9 @@ replay:
		}
	}

	/* Clear the fault bit */
	clear_bit(ADRENO_DEVICE_FAULT, &adreno_dev->priv);

	kfree(replay);
	/* restore halt indicator */
	atomic_add(halt, &adreno_dev->halt);
+4 −0
@@ -678,6 +678,10 @@ static unsigned int adreno_iommu_set_pt_generate_cmds(
	int num_iommu_units;
	unsigned int *cmds_orig = cmds;

	/* If we are in a fault the MMU will be reset soon */
	if (test_bit(ADRENO_DEVICE_FAULT, &adreno_dev->priv))
		return 0;

	num_iommu_units = kgsl_mmu_get_num_iommu_units(&device->mmu);

	pt_val = kgsl_mmu_get_pt_base_addr(&device->mmu, pt);