Documentation/edac.txt: Add Nehalem specific EDAC characteristics (31983a04)

Documentation/edac.txt

+110 −0

Original line number	Original line	Diff line number	Diff line
	@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
	7 Dec 2005		7 Dec 2005
	17 Jul 2007 Updated		17 Jul 2007 Updated

			(c) Mauro Carvalho Chehab <mchehab@redhat.com>
			05 Aug 2009 Nehalem interface

	EDAC is maintained and written by:		EDAC is maintained and written by:

	@@ -717,3 +719,111 @@ unique drivers for their hardware systems.
	The 'test_device_edac' sample driver is located at the		The 'test_device_edac' sample driver is located at the
	bluesmoke.sourceforge.net project site for EDAC.		bluesmoke.sourceforge.net project site for EDAC.

			=======================================================================
			NEHALEM USAGE OF EDAC APIs

			This chapter documents some EXPERIMENTAL mappings of the EDAC API used by
			the Nehalem EDAC driver. They will likely change in future versions
			of the driver.

			Due to the way Nehalem exports Memory Controller data, some adjustments
			were made in the i7core_edac driver. This chapter covers those differences.

			1) On Nehalem, there is one Memory Controller per QuickPath Interconnect
			(QPI). In the driver, the term "socket" means one QPI. It should also be
			associated with the CPU physical socket.

			Each MC has 3 physical read channels, 3 physical write channels and
			3 logical channels. The driver currently sees them as just 3 channels.
			Each channel can have up to 3 DIMMs.

			The smallest known unit is the DIMM. There is no information about csrows.
			As the EDAC API uses the csrow as its smallest unit, the driver exports one
			DIMM per csrow.

			Currently, it also exports the several memory controllers as just one. This
			limit will be removed in future versions of the driver.

			2) The Nehalem MC has the ability to generate errors. The driver implements this
			functionality via some error injection nodes:

			For injecting a memory error, there are some sysfs nodes, under
			/sys/devices/system/edac/mc/mc0/:

			inject_addrmatch:
			Controls the error injection mask register. It is possible to specify
			several characteristics of the address to match an error code:
			dimm = the affected dimm. Numbers are relative to a channel;
			rank = the memory rank;
			channel = the channel that will generate an error;
			bank = the affected bank;
			page = the page address;
			column (or col) = the address column.
			Each of the above values can be set to "any" to match any valid value.

			At driver init, all values are set to any.

			For example, to generate an error at rank 1 of dimm 2, for any channel,
			any bank, any page, any column:
			echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

			To return to the default behaviour of matching any, you can do:
			echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
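
			The "field:value" strings used above can be assembled by a small shell
			helper. This is only a sketch: addrmatch_string is a hypothetical name, not
			part of the driver, and the check that values are either "any" or decimal
			digits is an assumption of this example rather than a documented rule.

```shell
#!/bin/sh
# Hypothetical helper: validate "field:value" pairs against the field names
# listed for inject_addrmatch and print them as one space-separated string.
# Assumption of this sketch: a value is either "any" or decimal digits.
addrmatch_string() {
    out=""
    for pair in "$@"; do
        field=${pair%%:*}
        value=${pair#*:}
        case "$field" in
            dimm|rank|channel|bank|page|column|col) ;;
            *) echo "unknown field: $field" >&2; return 1 ;;
        esac
        case "$value" in
            any) ;;
            ''|*[!0-9]*) echo "bad value for $field: $value" >&2; return 1 ;;
        esac
        # Join pairs with a single space, without a leading space.
        out="${out:+$out }$field:$value"
    done
    echo "$out"
}

# Prints the same string as the rank 1 / dimm 2 example.
addrmatch_string dimm:2 rank:1
```

			The printed string could then be written to inject_addrmatch as root, in
			the same way as the echo commands above.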

			inject_eccmask:
			specifies which bits will be affected by the error.

			inject_section:
			specifies what ECC cache section will get the error:
			3 for both
			2 for the highest
			1 for the lowest

			inject_socket:
			specifies which QPI (or processor socket) will generate the error.
			On Xeon 35xx, it should be 0.
			On Xeon 55xx, it should be 0 or 1.

			inject_type:
			specifies the type of error, being a combination of the following bits:
			bit 0 - repeat
			bit 1 - ecc
			bit 2 - parity
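
			Since inject_type is a combination of bits, the value to write is the
			bitwise OR of the selected flags. A minimal sketch follows; the function
			name inject_type_mask is hypothetical, and only the bit positions come from
			the list above.

```shell
#!/bin/sh
# Hypothetical helper: combine named flags into an inject_type value.
# Bit positions follow the list above:
#   bit 0 = repeat (1), bit 1 = ecc (2), bit 2 = parity (4).
inject_type_mask() {
    mask=0
    for flag in "$@"; do
        case "$flag" in
            repeat) mask=$(( mask | 1 )) ;;
            ecc)    mask=$(( mask | 2 )) ;;
            parity) mask=$(( mask | 4 )) ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$mask"
}

# repeat + ecc: prints 3
inject_type_mask repeat ecc
```

			With this encoding, a value of 2 selects an ECC error alone.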

			inject_enable starts the error generation when a nonzero value
			is written.

			All inject vars can be read. Root permission is needed to write them.

			The datasheet states that the error will only be generated after a write to an
			address that matches inject_addrmatch. It seems, however, that reading will
			also produce an error.

			For example, the following code will generate an error for any write access
			at socket 0, on any DIMM/address on channel 2:

			echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
			echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
			echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
			echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
			echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
			echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
			dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

			The generated error message will look like:

			EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
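
			In this sample message, the syndrome field (0x00000040) is the hexadecimal
			form of the value 64 written to inject_eccmask. The conversion can be
			checked with printf; eccmask_hex is a hypothetical helper name used only for
			this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a decimal mask as an 8-digit hexadecimal value,
# in the same layout as the syndrome field of the sample message.
eccmask_hex() {
    printf '0x%08x\n' "$1"
}

# Prints 0x00000040
eccmask_hex 64
```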

			3) Nehalem specific Corrected Error memory counters

			Nehalem has some registers that count memory errors, reporting them in a
			way that differs from what the EDAC API allows. Because of that, a
			separate sysfs node was created to handle these counters.

			They can be read by looking at the contents of the "corrected_error_counts"
			node:

			$ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
			dimm0: 15866
			dimm1: 0
			dimm2: 27285
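
			The per-DIMM lines can be totalled with awk. The sketch below assumes the
			"dimmN: count" layout shown above and is fed the sample values instead of
			the real sysfs node; sum_corrected_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: sum "dimmN: count" lines read from stdin.
# Leading whitespace before "dimm" is tolerated.
sum_corrected_errors() {
    awk -F': *' '/^[ \t]*dimm[0-9]+:/ { total += $2 } END { print total + 0 }'
}

# Sample data copied from the output above. On a real system, use instead:
#   sum_corrected_errors < /sys/devices/system/edac/mc/mc0/corrected_error_counts
printf 'dimm0: 15866\ndimm1: 0\ndimm2: 27285\n' | sum_corrected_errors
```

			For the sample values this prints 43151 (15866 + 0 + 27285).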