Commit f867d556, authored Sep 22, 2015 by Christophe Leroy; committed by Scott Wood, Mar 04, 2016

powerpc32: optimise csum_partial() loop



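On the 8xx, load latency is 2 cycles and a taken branch also costs
2 cycles. So let's unroll the loop to sum four words per iteration:
the four loads are issued ahead of the adds, and the branch is taken
once per four words instead of once per word.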

This patch improves csum_partial() speed by around 10% on both:
* 8xx (single issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
  and parallel execution)

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>

parent 48821a34

arch/powerpc/lib/checksum_32.S

+15 −1

@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
 	srwi.	r6,r4,2		/* # words to do */
 	adde	r5,r5,r0
 	beq	3f
-1:	mtctr	r6
+1:	andi.	r6,r6,3		/* Prepare to handle words 4 by 4 */
+	beq	21f
+	mtctr	r6
 2:	lwzu	r0,4(r3)
 	adde	r5,r5,r0
 	bdnz	2b
+21:	srwi.	r6,r4,4		/* # blocks of 4 words to do */
+	beq	3f
+	mtctr	r6
+22:	lwz	r0,4(r3)
+	lwz	r6,8(r3)
+	lwz	r7,12(r3)
+	lwzu	r8,16(r3)
+	adde	r5,r5,r0
+	adde	r5,r5,r6
+	adde	r5,r5,r7
+	adde	r5,r5,r8
+	bdnz	22b
 3:	andi.	r0,r4,2
 	beq+	4f
 	lhz	r0,4(r3)
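
As a reading aid, the control flow of the patched hunk can be modelled in C. This is only a sketch under stated assumptions, not kernel code: the names `adde`, `sum_words_simple` and `sum_words_unrolled` are hypothetical, the carry flag is assumed clear on entry, `len` plays the role of the byte count in r4, and the halfword and byte tail handled after label 3 is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling PowerPC "adde":
 * returns a + b + carry-in and updates the carry flag. */
static uint32_t adde(uint32_t a, uint32_t b, unsigned *carry)
{
    uint64_t t = (uint64_t)a + b + *carry;
    *carry = (unsigned)(t >> 32);
    return (uint32_t)t;
}

/* Pre-patch loop shape: one load, one add, one branch per word. */
static uint32_t sum_words_simple(const uint32_t *p, size_t len, uint32_t sum)
{
    unsigned carry = 0;            /* assumption: carry clear on entry */
    size_t nwords = len >> 2;      /* srwi. r6,r4,2 */

    for (size_t i = 0; i < nwords; i++)
        sum = adde(sum, p[i], &carry);
    return adde(sum, 0, &carry);   /* fold the last carry back in */
}

/* Post-patch loop shape: 0..3 leftover words first, then blocks of 4. */
static uint32_t sum_words_unrolled(const uint32_t *p, size_t len, uint32_t sum)
{
    unsigned carry = 0;                 /* assumption: carry clear on entry */
    size_t singles = (len >> 2) & 3;    /* andi. r6,r6,3 */
    size_t blocks = len >> 4;           /* srwi. r6,r4,4 */

    for (size_t i = 0; i < singles; i++)        /* loop at label 2 */
        sum = adde(sum, *p++, &carry);

    for (size_t i = 0; i < blocks; i++) {       /* loop at label 22 */
        /* Four loads first, then four dependent adds. */
        uint32_t a = p[0], b = p[1], c = p[2], d = p[3];
        p += 4;
        sum = adde(sum, a, &carry);
        sum = adde(sum, b, &carry);
        sum = adde(sum, c, &carry);
        sum = adde(sum, d, &carry);
    }
    return adde(sum, 0, &carry);   /* fold the last carry back in */
}
```

Both functions visit the words in the same order and share one carry chain, so they return the same value for any `len`; only the number of taken branches and the spacing between each load and its use differ.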