Silicon Chip – July 2019




Pretty much all modern processors are fabricated with a CMOS process, ie, with a chip made up of N-channel and P-channel MOSFETs formed from doped semiconductor layers and insulating oxide layers, plus metal layers to form the wiring which distributes power and signals between the transistors.
In CMOS devices, radiation can result in the accumulation of charge in the oxide layer, leading to a shift in the gate-source voltage for a given drain current.
NMOS devices typically see a lowering of the threshold voltage, increasing current both when the device is off and when it is on. PMOS devices tend to get ‘weaker’, ie, higher gate voltages are required to turn the device on, and when on, the drive strength is decreased.
This is not the only way in which CMOS devices are degraded by exposure to high-energy particles: other processes tend to result in a linearisation of the drain current vs. gate voltage curve, which for both NMOS and PMOS devices leads to an increase in the gate voltage required to turn the device fully on.
These defects are effectively permanent and accumulate until the transistor is entirely unusable. It is quite easy to measure this damage; techniques such as deliberately timing-critical ‘canary’ logic paths, structures such as ring oscillators, or even parameters such as the total power consumed by a device can be monitored during operation, with changes indicating impending failure.
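The canary idea can be sketched in a few lines of code. This is an illustrative model only, not something from the article: the baseline frequency, the 5% trip point and all names are assumptions. The premise is that a ring oscillator slows down as threshold voltages drift, so comparing its measured frequency against a commissioning-time baseline flags accumulating damage before the functional logic fails.

```python
# Hypothetical sketch of on-chip degradation monitoring: a ring
# oscillator's frequency falls as transistor threshold voltages
# drift, so comparing it against a baseline recorded at
# commissioning flags ageing damage early. The baseline value and
# the 5% trip point below are illustrative assumptions.

BASELINE_HZ = 1_000_000_000   # frequency measured when the part was new
TRIP_FRACTION = 0.05          # raise the alarm once frequency sags by 5%

def degradation_alarm(measured_hz: float) -> bool:
    """Return True if the canary oscillator has slowed enough to
    indicate impending failure of nearby logic paths."""
    drift = (BASELINE_HZ - measured_hz) / BASELINE_HZ
    return drift >= TRIP_FRACTION

# A healthy part barely drifts; a degraded one trips the alarm.
print(degradation_alarm(995_000_000))  # 0.5% drift -> False
print(degradation_alarm(940_000_000))  # 6% slowdown -> True
```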
As CMOS circuits have continued to shrink in size, radiation strong
enough to alter the electronic state of a circuit but not so strong as
to permanently damage it has become a common concern. For a
while, the decay of radioactive lead isotopes in solder joints
was a significant source of single-event upsets, but these days, the
dominant source of SEUs is exposure to cosmic radiation.
The digital circuits most sensitive to single-event upsets are those in which the state is indicated by voltages held in a multi-stable circuit, such as the classic six-transistor SRAM cell, where a pair of cross-coupled inverters stores a single bit of information and is isolated when not in use.
As the size of the four MOSFETs, the local interconnect, and the operating voltage have decreased over time, there has been a significant decrease in the amount of energy required for an energetic particle to change the state of such a bit cell. Non-array elements like latches and flip-flops, and other array memories including DRAMs and flash memories, are also susceptible.
One way that the reliability of these cells has been increased in
the face of radiation is to spread the transistor gates over wider
areas to ensure that ion strikes affect only a single node potential
rather than two or more.
Fortunately, the decrease in size of CMOS circuits has also allowed an increase in complexity which can also be utilised to combat radiation-induced events. So in addition to lower level design techniques like the increased gate area mentioned above, it is also possible to add redundancy to critical flip-flop cells, or even add error detection and correction coding to critical registers.
Higher level protection techniques can also be used, including
active software- or microcode-driven ‘scrubbing’ of critical memory
contents, replicating critical logic blocks to operate in lock-step,
with majority vote comparators, or ‘stop and retry’ logic which
causes the processor to recalculate any results where the veracity
of the previous calculation may be in question.
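Two of these higher-level protections, lock-step replication with majority voting and ‘stop and retry’, can be sketched in software. In real silicon these are hardware structures; the model below is a hypothetical illustration, and all names in it are invented for the example.

```python
# Illustrative model (not from the article) of triple modular
# redundancy with a majority vote, plus stop-and-retry when the
# three lock-step replicas disagree.

def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results: each output bit
    is whatever at least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)

def run_protected(compute, retries: int = 3) -> int:
    """Run three lock-step copies of a calculation; if any copy
    disagrees with the majority, discard and recalculate."""
    for _ in range(retries):
        a, b, c = compute(), compute(), compute()
        voted = majority_vote(a, b, c)
        if a == b == c == voted:
            return voted        # unanimous: accept immediately
        # otherwise a single upset was outvoted; retry to confirm
    return voted                # best effort after exhausting retries

# A single flipped bit in one replica is simply outvoted:
print(hex(majority_vote(0xCAFE, 0xCAFE, 0xCAFF)))  # -> 0xcafe
```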
Where field programmable gate arrays (FPGAs) are used, or
other chips with configurable logic blocks, it is also possible to
perform ‘online’ reprogramming of any logic blocks where a fault
has been detected.
In chips where robustness is critical, designers even go so far as to add ‘fault injection’ logic. This allows the fault mitigation techniques described above to be more rapidly and thoroughly tested, compared to what is possible with typical lab-based radiation tests.

An example: reliable instruction fetching
One critical function in any microprocessor is instruction fetching. The processor needs a continual supply of instructions to tell each of the processor’s functional units what they should be doing at any point in time.
It’s vital that this be done at high speed (otherwise the microprocessor might remain idle), but it is even more critical that this be done reliably, as a corrupt instruction could easily lead to a variety of different errors, including potentially subtle corruption of program state, rather than an immediate crash or hang.
To meet the speed requirement, instruction fetching is typically performed with a hierarchy of logic blocks, each ‘closer to the action’ than the next. At the top level is typically a high-speed instruction cache, which stores a limited number of the most frequently executed instructions, eg, the bodies of frequently-called functions.
If for any reason this top-level cache is unable to immediately provide an instruction to be executed, the result will be an undesirable stall of the microprocessor while the cache attempts to fetch instructions from slower cache levels, memory, or perhaps even a disk or network.
Due to its limited size and speed-critical nature, radiation hardening of a top-level instruction cache frequently involves maintaining a completely separate copy. This copy is kept physically separated from the original to the maximum practical extent, to ensure that a radiation strike corrupts only one of the copies.
For speed reasons, typically only the original is “plumbed through” to the processor’s core functional units, and an independent unit is tasked with checking that both the primary cache and its copy provide identical results.
If a mismatch is detected, a high-speed “stop!” signal is asserted to pause the rest of the processor before a potentially incorrect instruction is executed. This remains asserted until a more complex mechanism (such as an error correcting code) provides a known-good instruction and restores the correct entry to both the original cache and the copy.
This “stop!” signal is frequently one of the most speed-critical paths in the entire processor, and sometimes the most critical of all. Given that it toggles relatively rarely, it is often implemented using special, power-hungry, high-speed circuit techniques.
Moving away from the high-speed core of a processor, error-correction techniques which take correspondingly longer to apply become justified. As the size of caches and memories increases, making complete copies of them becomes less practical.
So lower-level caches and main memories are frequently protected with modified Hamming codes where, for example, 64 bits of data are encoded into 72 bits so that the corruption of any two of the 72 bits can be detected, and the corruption of any one of the 72 bits can be seamlessly corrected.
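A working (if unoptimised) sketch of such a (72, 64) SECDED code follows. It uses the standard extended Hamming construction, with check bits at power-of-two bit positions plus an overall parity bit; the “modified” codes in real memory controllers are equivalent in strength but arranged differently for hardware efficiency, so treat this as an illustration of the principle, not of any particular part.

```python
# Illustrative (72,64) extended Hamming SECDED code: corrects any
# single-bit error and detects any double-bit error. Bit 0 is the
# overall parity; bits 1,2,4,...,64 are Hamming check bits; the
# remaining 64 positions carry data.

CHECKS = (1, 2, 4, 8, 16, 32, 64)

def _syndrome(word: int) -> int:
    s = 0
    for p in CHECKS:
        parity = 0
        for i in range(1, 72):
            if i & p and (word >> i) & 1:
                parity ^= 1
        if parity:
            s |= p
    return s

def secded_encode(data: int) -> int:
    word, pos = 0, 1
    for i in range(64):                # scatter data into non-check slots
        while pos & (pos - 1) == 0:    # skip power-of-two positions
            pos += 1
        if (data >> i) & 1:
            word |= 1 << pos
        pos += 1
    word |= _syndrome(word) and 0      # (check bits start cleared)
    s = _syndrome(word)
    for p in CHECKS:                   # set each check bit to even parity
        if s & p:
            word |= 1 << p
    if bin(word).count("1") & 1:       # bit 0: overall parity
        word |= 1
    return word

def secded_decode(word: int):
    """Return (corrected 64-bit data, status) for a 72-bit codeword."""
    s = _syndrome(word)
    overall = bin(word).count("1") & 1
    if s and overall:
        word ^= 1 << s                 # single error: syndrome = position
        status = "corrected"
    elif s:
        return 0, "double-bit error detected"
    elif overall:
        word ^= 1                      # the parity bit itself flipped
        status = "corrected"
    else:
        status = "ok"
    data, pos = 0, 1
    for i in range(64):                # gather data bits back out
        while pos & (pos - 1) == 0:
            pos += 1
        if (word >> pos) & 1:
            data |= 1 << i
        pos += 1
    return data, status

code = secded_encode(0xDEADBEEFCAFEF00D)
data, status = secded_decode(code ^ (1 << 37))   # a single upset
print(hex(data), status)               # -> 0xdeadbeefcafef00d corrected
```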
In a radiation-hardened environment, main memories are frequently guarded with additional, software-based scrubbers which continually calculate and recalculate checksums for instruction memory blocks, and compare those against known-good values.
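A minimal sketch of such a scrubber is below, assuming CRC-32 checksums and a 4KiB block size; both are arbitrary choices made for illustration, as are the function names.

```python
# Illustrative software scrubber: walk instruction memory in
# fixed-size blocks, recompute each block's checksum, and report
# any block that no longer matches the value recorded at load
# time. CRC-32 and the 4 KiB block size are assumptions.
import zlib

BLOCK = 4096  # bytes per scrubbed block

def record_checksums(memory: bytes) -> list:
    """Checksums taken once, when the known-good image is loaded."""
    return [zlib.crc32(memory[i:i + BLOCK])
            for i in range(0, len(memory), BLOCK)]

def scrub(memory: bytes, known_good: list) -> list:
    """Return indices of blocks whose checksum no longer matches."""
    return [n for n, good in enumerate(known_good)
            if zlib.crc32(memory[n * BLOCK:(n + 1) * BLOCK]) != good]

image = bytes(range(256)) * 64          # 16 KiB dummy "instruction memory"
good = record_checksums(image)
corrupt = bytearray(image)
corrupt[5000] ^= 0x10                   # an upset lands in block 1
print(scrub(bytes(corrupt), good))      # -> [1]
```

A real scrubber would then re-fetch or decode a known-good copy of the flagged block, rather than merely reporting it.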
These blocks can be encoded with quite complex codes, needing thousands or millions of machine cycles to correct an error, but can be designed in such a way as to virtually assure recovery of the original data whilst still maintaining a relatively low overhead in terms of space required to store the encoded data.

How modern semiconductors are radiation hardened – by Duraid Madina


SC