
Commit 31983a04 authored by Mauro Carvalho Chehab

Documentation/edac.txt: Add Nehalem specific EDAC characteristics



As Nehalem has a different binding to the EDAC API, and its own
error injection code, document it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
parent 4157d9f5
+110 −0
@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007	Updated

(c) Mauro Carvalho Chehab <mchehab@redhat.com>
05 Aug 2009	Nehalem interface

EDAC is maintained and written by:

@@ -717,3 +719,111 @@ unique drivers for their hardware systems.
The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.

=======================================================================
NEHALEM USAGE OF EDAC APIs

This chapter documents some EXPERIMENTAL mappings of the EDAC API used by
the Nehalem EDAC driver. They are likely to change in future versions
of the driver.

Due to the way Nehalem exports Memory Controller data, some adjustments
were made in the i7core_edac driver. This chapter covers those differences.

1) On Nehalem, there is one Memory Controller per Quick Path Interconnect
   (QPI). In the driver, the term "socket" means one QPI. It should also
   correspond to the physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees them as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as its smallest unit, the
   driver exports one DIMM per csrow.

   Currently, it also exports the several memory controllers as just one.
   This limit will be removed in future versions of the driver.
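
   As an illustration of the "one DIMM per csrow" mapping, the sketch below
   lists the csrowN directories of one memory controller, each of which
   would then stand for one DIMM. The helper name list_dimm_csrows and the
   EDAC_MC override variable are hypothetical; the default path assumes the
   usual EDAC sysfs layout under mc0.

```shell
#!/bin/sh
# List the csrowN directories of one memory controller.
# With the mapping described above, each csrow entry stands for one DIMM.
# list_dimm_csrows and EDAC_MC are hypothetical names used for illustration.
list_dimm_csrows() {
    mc_dir=${1:-${EDAC_MC:-/sys/devices/system/edac/mc/mc0}}
    for d in "$mc_dir"/csrow*; do
        # Skip the unexpanded glob when no csrow directory exists.
        [ -d "$d" ] || continue
        printf '%s\n' "${d##*/}"
    done
}

list_dimm_csrows
```

   On a machine without the driver loaded, the function simply prints
   nothing.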

2) The Nehalem MC has the ability to generate errors. The driver implements
   this functionality via some error injection nodes.

   For injecting a memory error, there are some sysfs nodes, under
   /sys/devices/system/edac/mc/mc0/:

   inject_addrmatch:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code:
         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to any.

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column:
		echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

      To return to the default behaviour of matching any, you can do:
		echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
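
      The "field:value" syntax above can also be assembled by a small
      helper that rejects unknown field names before anything is written.
      This is only a sketch: build_addrmatch is a hypothetical function,
      and accepting only decimal numbers or "any" as values is an
      assumption, not something the driver is known to require.

```shell
#!/bin/sh
# Build an inject_addrmatch string from field=value arguments.
# build_addrmatch is a hypothetical helper, not part of the driver.
build_addrmatch() {
    out=""
    for arg in "$@"; do
        field=${arg%%=*}
        value=${arg#*=}
        # Only the field names listed for inject_addrmatch are accepted.
        case $field in
            dimm|rank|channel|bank|page|column|col) ;;
            *) echo "unknown field: $field" >&2; return 1 ;;
        esac
        # Assumption: a value is either "any" or a decimal number.
        case $value in
            any) ;;
            ''|*[!0-9]*) echo "bad value for $field: $value" >&2; return 1 ;;
        esac
        out="${out:+$out }$field:$value"
    done
    printf '%s\n' "$out"
}

build_addrmatch dimm=2 rank=1    # prints: dimm:2 rank:1
```

      The printed string can then be written, as root, to the
      inject_addrmatch node.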

   inject_eccmask:
       specifies which bits will be affected by the error.

   inject_section:
       specifies which ECC cache section will get the error:
		3 for both
		2 for the highest
		1 for the lowest

   inject_socket:
       specifies which QPI (or processor socket) will generate the error.
          on Xeon 35xx, it should be 0.
          on Xeon 55xx, it should be 0 or 1.

   inject_type:
       specifies the type of error, being a combination of the following bits:
		bit 0 - repeat
		bit 1 - ecc
		bit 2 - parity

   inject_enable:
       starts the error generation when a value other than 0 is written.

   All inject nodes can be read. Root permission is needed to write them.
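
   Since inject_type is a combination of bits, the value to write can be
   computed from the bit names listed for it (repeat, ecc, parity). The
   helper below is a minimal sketch; inject_type_value is a hypothetical
   name used only for illustration.

```shell
#!/bin/sh
# Compose an inject_type value from the documented bit names.
# inject_type_value is a hypothetical helper used for illustration.
inject_type_value() {
    value=0
    for flag in "$@"; do
        case $flag in
            repeat) value=$((value | 1)) ;;   # bit 0
            ecc)    value=$((value | 2)) ;;   # bit 1
            parity) value=$((value | 4)) ;;   # bit 2
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$value"
}

inject_type_value repeat ecc    # prints 3
```

   A plain ECC error without repetition corresponds to the value 2.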

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2:

   echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
   echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   The generated error message will look like:

   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
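
   When checking the result of an injection, it can help to pull single
   fields out of the message. The sketch below does so with sed; edac_field
   is a hypothetical helper, and it assumes the "name=value" layout of the
   sample message above, which may differ in other driver versions. In this
   sample, the syndrome field equals the value written to inject_eccmask
   (64 is 0x40); whether that holds in general is not stated here.

```shell
#!/bin/sh
# Extract one "name=value" field from an EDAC log line.
# edac_field is a hypothetical helper; the message layout is assumed
# to match the sample message shown in this chapter.
edac_field() {
    # $1 = field name, $2 = log line
    printf '%s\n' "$2" | sed -n "s/.*[ (]$1=\([^,) ]*\).*/\1/p"
}

line='EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))'

edac_field Channel "$line"     # prints 2
edac_field syndrome "$line"    # prints 0x00000040
printf '0x%08x\n' 64           # the inject_eccmask value, in the same format
```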

3) Nehalem specific Corrected Error memory counters

   Nehalem has some registers to count memory errors, reporting them in a
   way that differs from what the EDAC API allows. Because of that, a
   separate sysfs node was created to handle these counters.

   They can be read from the "corrected_error_counts" node:

	$ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
	dimm0: 15866
	dimm1: 0
	dimm2: 27285
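
   Because the node reports one "dimmN: count" line per DIMM, a total can
   be computed with awk. This is a minimal sketch that assumes exactly the
   output format shown above; total_corrected is a hypothetical helper
   name.

```shell
#!/bin/sh
# Sum the per-DIMM corrected error counts.
# total_corrected is a hypothetical helper; it reads "dimmN: count"
# lines on standard input and prints their sum.
total_corrected() {
    awk -F': *' '/^[[:space:]]*dimm[0-9]+:/ { sum += $2 } END { print sum + 0 }'
}

# Sample data copied from the output shown above.
printf 'dimm0: 15866\ndimm1: 0\ndimm2: 27285\n' | total_corrected    # prints 43151
```

   On a system with the driver loaded, the node itself can be used as
   input: total_corrected < /sys/devices/system/edac/mc/mc0/corrected_error_counts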