Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc.

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!). In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick. These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel. In general, this is the Field Replaceable Unit (FRU) that
gets replaced in the case of excessive errors. It is most often called a
DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group of
DIMMs.
Each channel has its own independent control (command) and data bus,
and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy in a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two channels
in the same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. For the same reason,
lockstep mode is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows are 64 bits wide for single channel
and 128 bits wide for dual channel. It may not be visible to the memory
controller, as some DIMM types have a memory buffer that can hide direct
access to it from the Memory Controller.

* Single-Ranked stick

A single-ranked stick has one chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices. The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different
sets of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row. A single socket
set has two chip-select rows, and if double-sided sticks are used these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.

Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error
detection. The :c:func:`edac_mc_alloc` function internally uses the
struct ``mem_ctl_info`` to describe the memory controllers, which is an
opaque struct for the EDAC drivers. Only the EDAC core is allowed to
touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by
calling :c:func:`edac_pci_alloc_ctl_info`. It will use the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract
'edac_device' representation at sysfs.

This set of structures, and the code that implements their APIs, allows
registering EDAC-type devices which are NOT standard memory or PCI,
like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy. For example, a cache could be
composed of L1, L2 and L3 levels of cache. Each CPU core would have its
own L1 cache, while sharing L2 and maybe L3 caches. In such a case,
those can be represented via the following sysfs nodes::

	/sys/devices/system/edac/..

	pci/		<existing pci directory (if available)>
	mc/		<existing memory device directory>
	cpu/cpu0/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	cpu/cpu1/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	...

	the L1 and L2 directories would be "edac_device_block's"

.. kernel-doc:: drivers/edac/edac_device.h
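
As a rough userspace illustration of the two-level layout described in the
sysfs example above, the following Python sketch walks a directory tree and
collects the ``ce_count`` and ``ue_count`` values found in each block
directory. It is not part of the EDAC API. The helper names
(``read_error_counts``, ``make_demo_tree``) and the counter values are
hypothetical, and the script runs against a temporary directory that only
mimics the ``cpu/cpuN/L?-cache`` shape; the layout on a real system depends on
the driver and may differ.

```python
import os
import tempfile


def read_error_counts(edac_root):
    """Map each block directory (relative to edac_root) to its counters.

    Only directories that contain a ce_count and/or ue_count file are
    reported; both files are assumed to hold a single decimal integer.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(edac_root):
        block = {}
        for name in ("ce_count", "ue_count"):
            if name in filenames:
                with open(os.path.join(dirpath, name)) as f:
                    block[name] = int(f.read().strip())
        if block:
            rel = os.path.relpath(dirpath, edac_root).replace(os.sep, "/")
            counts[rel] = block
    return counts


def make_demo_tree(root):
    """Build a throwaway tree shaped like the cpu0/cpu1 example.

    The (ce, ue) values below are made up for illustration only.
    """
    demo = {
        "cpu/cpu0/L1-cache": (3, 0),
        "cpu/cpu0/L2-cache": (1, 1),
        "cpu/cpu1/L1-cache": (0, 0),
        "cpu/cpu1/L2-cache": (2, 0),
    }
    for rel, (ce, ue) in demo.items():
        block_dir = os.path.join(root, *rel.split("/"))
        os.makedirs(block_dir)
        for name, value in (("ce_count", ce), ("ue_count", ue)):
            with open(os.path.join(block_dir, name), "w") as f:
                f.write(f"{value}\n")


# Exercise the reader against the demo tree instead of the real sysfs.
with tempfile.TemporaryDirectory() as tmp:
    make_demo_tree(tmp)
    demo_counts = read_error_counts(tmp)

for block, c in sorted(demo_counts.items()):
    print(block, "ce:", c["ce_count"], "ue:", c["ue_count"])
```

On a machine with an EDAC driver loaded, the same reader could be pointed at
``/sys/devices/system/edac`` instead of the demo tree, assuming the counters
there are readable by the calling user.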