Silicon Chip – July 2019




Pretty much all modern processors are fabricated with a CMOS process, ie, with a chip made up of N-channel and P-channel MOSFETs formed from doped semiconductor layers and insulating oxide layers, plus metal layers to form the wiring which distributes power and signals between the transistors.
In CMOS devices, radiation can result in the accumulation of charge in the oxide layer, leading to a shift in the gate-source voltage for a given drain current.
NMOS devices typically see a lowering of the threshold voltage, increasing current both when the device is off and when it is on. PMOS devices tend to get ‘weaker’, ie, higher gate voltages are required to turn the device on, and when on, the drive strength is decreased.
This is not the only way in which CMOS devices are degraded by exposure to high-energy particles: other processes tend to result in a linearisation of the drain current vs. gate voltage curve, which for both NMOS and PMOS devices leads to an increase in the gate voltage required to turn the device fully on.
These defects are effectively permanent and accumulate until the transistor is entirely unusable. It is quite easy to measure this damage; techniques such as deliberately timing-critical ‘canary’ logic paths, structures such as ring oscillators, or even parameters such as the total power consumed by a device can be monitored during operation, with changes indicating impending failure.
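The canary idea can be sketched in a few lines of code. This is an illustrative model only, not something from the article: the baseline frequency, the 5% trip point and all names are assumptions. The premise is that a ring oscillator slows down as threshold voltages drift, so comparing its measured frequency against a commissioning-time baseline flags accumulating damage before the functional logic fails.

```python
# Hypothetical sketch of on-chip degradation monitoring: a ring
# oscillator's frequency falls as transistor threshold voltages
# drift, so comparing it against a baseline recorded at
# commissioning flags ageing damage early. The baseline value and
# the 5% trip point below are illustrative assumptions.

BASELINE_HZ = 1_000_000_000   # frequency measured when the part was new
TRIP_FRACTION = 0.05          # raise the alarm once frequency sags by 5%

def degradation_alarm(measured_hz: float) -> bool:
    """Return True if the canary oscillator has slowed enough to
    indicate impending failure of nearby logic paths."""
    drift = (BASELINE_HZ - measured_hz) / BASELINE_HZ
    return drift >= TRIP_FRACTION

# A healthy part barely drifts; a degraded one trips the alarm.
print(degradation_alarm(995_000_000))  # 0.5% drift -> False
print(degradation_alarm(940_000_000))  # 6% slowdown -> True
```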
As CMOS circuits have continued to shrink in size, radiation strong
enough to alter the electronic state of a circuit but not so strong as
to permanently damage it has become a common concern. For a
while, the decay of radioactive lead isotopes in solder joints
was a significant source of single-event upsets, but these days, the
dominant source of SEUs is exposure to cosmic radiation.
The digital circuits most sensitive to single-event upsets are those in which the state is indicated by voltages held in a multi-stable circuit, such as the classic six-transistor SRAM cell, where a pair of cross-coupled inverters stores a single bit of information and is isolated when not in use.
As the size of the four MOSFETs, the local interconnect, and the operating voltage have decreased over time, there has been a significant decrease in the amount of energy required for an energetic particle to change the state of such a bit cell. Non-array elements like latches and flip-flops, and other array memories including DRAMs and flash memories, are also susceptible.
One way that the reliability of these cells has been increased in
the face of radiation is to spread the transistor gates over wider
areas to ensure that ion strikes affect only a single node potential
rather than two or more.
Fortunately, the decrease in size of CMOS circuits has also allowed an increase in complexity which can also be utilised to combat radiation-induced events. So in addition to lower level design techniques like the increased gate area mentioned above, it is also possible to add redundancy to critical flip-flop cells, or even add error detection and correction coding to critical registers.
Higher level protection techniques can also be used, including
active software- or microcode-driven ‘scrubbing’ of critical memory
contents, replicating critical logic blocks to operate in lock-step,
with majority vote comparators, or ‘stop and retry’ logic which
causes the processor to recalculate any results where the veracity
of the previous calculation may be in question.
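Two of these higher-level protections, lock-step replication with majority voting and ‘stop and retry’, can be sketched in software. In real silicon these are hardware structures; the model below is a hypothetical illustration, and all names in it are invented for the example.

```python
# Illustrative model (not from the article) of triple modular
# redundancy with a majority vote, plus stop-and-retry when the
# three lock-step replicas disagree.

def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results: each output bit
    is whatever at least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)

def run_protected(compute, retries: int = 3) -> int:
    """Run three lock-step copies of a calculation; if any copy
    disagrees with the majority, discard and recalculate."""
    for _ in range(retries):
        a, b, c = compute(), compute(), compute()
        voted = majority_vote(a, b, c)
        if a == b == c == voted:
            return voted        # unanimous: accept immediately
        # otherwise a single upset was outvoted; retry to confirm
    return voted                # best effort after exhausting retries

# A single flipped bit in one replica is simply outvoted:
print(hex(majority_vote(0xCAFE, 0xCAFE, 0xCAFF)))  # -> 0xcafe
```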
Where field programmable gate arrays (FPGAs) are used, or
other chips with configurable logic blocks, it is also possible to
perform ‘online’ reprogramming of any logic blocks where a fault
has been detected.
In chips where robustness is critical, designers even go so far as to add ‘fault injection’ logic. This allows the fault mitigation techniques described above to be more rapidly and thoroughly tested, compared to what is possible with typical lab-based radiation tests.

An example: reliable instruction fetching
One critical function in any microprocessor is instruction fetching. The processor needs a continual supply of instructions to tell each of the processor’s functional units what they should be doing at any point in time.
It’s vital that this be done at high speed (otherwise the microprocessor might remain idle), but it is even more critical that this be done reliably, as a corrupt instruction could easily lead to a variety of different errors, including potentially subtle corruption of program state, rather than an immediate crash or hang.
To meet the speed requirement, instruction fetching is typically performed with a hierarchy of logic blocks, each ‘closer to the action’ than the next. At the top level is typically a high-speed instruction cache, which stores a limited number of the most frequently executed instructions, eg, the bodies of frequently-called functions.
If for any reason this top-level cache is unable to immediately provide an instruction to be executed, the result will be an undesirable stall of the microprocessor while the cache attempts to fetch instructions from slower cache levels, memory, or perhaps even a disk or network.
Due to its limited size and speed-critical nature, radiation hardening of a top-level instruction cache frequently involves maintaining a completely separate copy. This copy is kept physically separated from the original to the maximum practical extent, to ensure that a radiation strike corrupts only one of the copies.
For speed reasons, typically only the original is “plumbed through” to the processor’s core functional units, and an independent unit is tasked with checking that both the primary cache and its copy provide identical results.
If a mismatch is detected, a high-speed “stop!” signal is asserted to pause the rest of the processor before a potentially incorrect instruction is executed. This remains asserted until a more complex mechanism (such as an error correcting code) provides a known-good instruction and restores the correct entry to both the original cache and the copy.
This “stop!” signal is frequently one of the most speed-critical paths in the entire processor, and sometimes the most critical of all. Given that it toggles relatively rarely, it is often implemented using special, power-hungry, high-speed circuit techniques.
Moving away from the high-speed core of a processor, error-correction techniques which take correspondingly longer to apply become justified. As the size of caches and memories increases, making complete copies of them becomes less practical.
So lower-level caches and main memories are frequently protected with modified Hamming codes where, for example, 64 bits of data are encoded into 72 bits so that the corruption of any two of the 72 bits can be detected, and the corruption of any one of the 72 bits can be seamlessly corrected.
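A working (if unoptimised) sketch of such a (72, 64) SECDED code follows. It uses the standard extended Hamming construction, with check bits at power-of-two bit positions plus an overall parity bit; the “modified” codes in real memory controllers are equivalent in strength but arranged differently for hardware efficiency, so treat this as an illustration of the principle, not of any particular part.

```python
# Illustrative (72,64) extended Hamming SECDED code: corrects any
# single-bit error and detects any double-bit error. Bit 0 is the
# overall parity; bits 1,2,4,...,64 are Hamming check bits; the
# remaining 64 positions carry data.

CHECKS = (1, 2, 4, 8, 16, 32, 64)

def _syndrome(word: int) -> int:
    s = 0
    for p in CHECKS:
        parity = 0
        for i in range(1, 72):
            if i & p and (word >> i) & 1:
                parity ^= 1
        if parity:
            s |= p
    return s

def secded_encode(data: int) -> int:
    word, pos = 0, 1
    for i in range(64):                # scatter data into non-check slots
        while pos & (pos - 1) == 0:    # skip power-of-two positions
            pos += 1
        if (data >> i) & 1:
            word |= 1 << pos
        pos += 1
    word |= _syndrome(word) and 0      # (check bits start cleared)
    s = _syndrome(word)
    for p in CHECKS:                   # set each check bit to even parity
        if s & p:
            word |= 1 << p
    if bin(word).count("1") & 1:       # bit 0: overall parity
        word |= 1
    return word

def secded_decode(word: int):
    """Return (corrected 64-bit data, status) for a 72-bit codeword."""
    s = _syndrome(word)
    overall = bin(word).count("1") & 1
    if s and overall:
        word ^= 1 << s                 # single error: syndrome = position
        status = "corrected"
    elif s:
        return 0, "double-bit error detected"
    elif overall:
        word ^= 1                      # the parity bit itself flipped
        status = "corrected"
    else:
        status = "ok"
    data, pos = 0, 1
    for i in range(64):                # gather data bits back out
        while pos & (pos - 1) == 0:
            pos += 1
        if (word >> pos) & 1:
            data |= 1 << i
        pos += 1
    return data, status

code = secded_encode(0xDEADBEEFCAFEF00D)
data, status = secded_decode(code ^ (1 << 37))   # a single upset
print(hex(data), status)               # -> 0xdeadbeefcafef00d corrected
```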
In a radiation-hardened environment, main memories are frequently guarded with additional, software-based scrubbers which continually calculate and recalculate checksums for instruction memory blocks, and compare those against known-good values.
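A minimal sketch of such a scrubber is below, assuming CRC-32 checksums and a 4KiB block size; both are arbitrary choices made for illustration, as are the function names.

```python
# Illustrative software scrubber: walk instruction memory in
# fixed-size blocks, recompute each block's checksum, and report
# any block that no longer matches the value recorded at load
# time. CRC-32 and the 4 KiB block size are assumptions.
import zlib

BLOCK = 4096  # bytes per scrubbed block

def record_checksums(memory: bytes) -> list:
    """Checksums taken once, when the known-good image is loaded."""
    return [zlib.crc32(memory[i:i + BLOCK])
            for i in range(0, len(memory), BLOCK)]

def scrub(memory: bytes, known_good: list) -> list:
    """Return indices of blocks whose checksum no longer matches."""
    return [n for n, good in enumerate(known_good)
            if zlib.crc32(memory[n * BLOCK:(n + 1) * BLOCK]) != good]

image = bytes(range(256)) * 64          # 16 KiB dummy "instruction memory"
good = record_checksums(image)
corrupt = bytearray(image)
corrupt[5000] ^= 0x10                   # an upset lands in block 1
print(scrub(bytes(corrupt), good))      # -> [1]
```

A real scrubber would then re-fetch or decode a known-good copy of the flagged block, rather than merely reporting it.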
These blocks can be encoded with quite complex codes, needing thousands or millions of machine cycles to correct an error, but can be designed in such a way as to virtually assure recovery of the original data whilst still maintaining a relatively low overhead in terms of space required to store the encoded data.

How modern semiconductors are radiation hardened – by Duraid Madina


SC