# COTS for the LHC radiation environment: the rules of the game

# F. Faccio

CERN, CH-1211 Geneva 23, Switzerland (e-mail: Federico.Faccio@cern.ch)

# Abstract

The use of Commercial Off The Shelf (COTS) components in the LHC raises a series of questions concerning their reliability in a radiation environment. Unfortunately, most often there is no alternative to the use of commercial-grade components, and system designers have to manage the risk associated with their use. This paper identifies the main sources of radiation-induced problems that are likely to affect COTS in the LHC, and indicates strategies to achieve reliable systems. Existing sources of data on radiation effects are pointed out, and indications on how to interpret these data for the LHC environment are given. Moreover, appropriate test methodologies are discussed.

# I. INTRODUCTION

Most electronic systems in the LHC experiments and along the machine rely on the use of Commercial Off The Shelf (COTS) components. Though the radiation environment in the regions where COTS parts will be used is not as demanding as in the trackers and in some parts of the electromagnetic calorimeters, it is still severe enough to heavily affect the performance of most commercial electronic components.

Ensuring the reliability of COTS-based systems is a challenge that LHC teams have to take up, since in the vast majority of cases the use of hi-rel radiation-hard components is not an option. This class of components, available because of the needs of the military and Space markets, is generally extremely expensive. Since LHC systems need hundreds to thousands of parts or more, the available budget does not allow such components to be purchased systematically. Moreover, their range of functional performance is more limited than that of their counterparts available in the commercial marketplace.

The use of COTS complicates the radiation hardness assurance process. Contrary to radiation-hard parts, there is normally no information on what is actually inside the package: one only knows that the part satisfies the specifications reported in the datasheet. Only testing can give an indication of the radiation tolerance of the part, and indicate which malfunctions (transient or permanent) can occur in a radiation environment. Part-to-part variability in the radiation response is common: the logistic effort required to ensure the traceability of the components and to qualify them is significant. Therefore, the cost associated with the use of COTS is considerably higher than the bare part cost. Moreover, it is very difficult to estimate this cost in advance, since testing often unveils unforeseen problems.

With all these problems in mind, the aim of this paper is to help system designers identify the radiation effects that can possibly disrupt the correct functioning of their systems, and to discuss a systematic approach to ensure the reliability of their COTS-based systems in the LHC. After a short introduction on the main radiation effects on electronic devices, emphasis is placed on the importance of risk management and on the fundamental steps to follow in dealing with the use of COTS in a radiation environment. Finally, issues associated with COTS procurement, plastic packaging and burn-in are discussed.

As a starting point, it is mandatory to clarify which components can be defined as COTS. This might seem trivial, but several definitions exist. Sometimes, radiation-hard parts from a manufacturer's catalogue are considered COTS [1]. In this paper, I define as COTS any component in the part list of a manufacturer for which no specific effort has been made to improve, assure, or in most cases even test the radiation tolerance.

# II. SUMMARY OF RADIATION EFFECTS

Radiation effects in electronic devices can be divided in two main categories: cumulative effects and Single Event Effects (SEE), as shown in Figure 1.



Figure 1: Summary of radiation effects.

# A. Cumulative effects

Cumulative effects are due to the creation or activation of microscopic defects in the device. Each individual defect does not significantly affect the device characteristics, but the steady accumulation of defects causes measurable effects which can ultimately lead to device failure.

#### 1) Total Ionizing Dose (TID)

TID effects are due to the energy deposited in the electronics by radiation in the form of ionization. The unit for TID in the International System is the Gray (Gy), but the radiation effects community still widely uses the old unit, the rad. The conversion between the two units is easy: 1 Gy = 100 rad.

The performance of electronics is affected by the dose deposited in the silicon dioxide used in semiconductor devices for isolation purposes. Ionization in this material leads to the generation of electron-hole pairs, which can be separated by a local electric field. Holes can be trapped in the oxide or migrate to the Si-SiO<sub>2</sub> interface to participate in the complex mechanism of interface states creation. Both kinds of defects (trapped holes or interface states) accumulate to affect the behaviour of the semiconductor devices.

The consequent macroscopic effect varies with the technology. In CMOS technologies the threshold voltage of transistors shifts, their mobility and transconductance decrease, their noise and matching performance degrade, and leakage currents appear. In bipolar technologies, transistor gain decreases and leakage currents appear.

#### 2) Displacement damage

Non-ionizing energy losses in silicon cause atoms to be displaced from their normal lattice sites, seriously degrading the electrical characteristics of semiconductor devices. For displacement damage, it is common practice to express the radiation environment in terms of the particle fluence (particles/cm<sup>2</sup>). Since the induced damage is a function of the particle nature and energy, the Non-Ionizing Energy Loss (NIEL) is used as a parameter to correlate the effects observed in different radiation environments. Though this correlation is not free from uncertainties and fails in some cases [2], it is still commonly used to translate a complex radiation environment into a simpler mono-energetic equivalent, namely 1 MeV neutrons [3, 4].
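For reference, the NIEL scaling described above is commonly written as a weighted integral of the particle spectrum. The symbols below are introduced here for illustration and do not appear in the original text: $\phi_i(E)$ is the differential fluence of particle species $i$, $D_i(E)$ is its displacement damage function (proportional to the NIEL), and $\kappa_i(E)$ is the resulting hardness factor relative to 1 MeV neutrons.

$$
\Phi_{\mathrm{eq}} = \sum_i \int \kappa_i(E)\,\phi_i(E)\,\mathrm{d}E,
\qquad
\kappa_i(E) = \frac{D_i(E)}{D_{\mathrm{n}}(1\ \mathrm{MeV})}
$$

Under this hypothesis, two environments with the same $\Phi_{\mathrm{eq}}$ are expected to produce the same displacement damage, within the uncertainties already noted for the NIEL correlation.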

The macroscopic effect of displacement damage varies with the technology. CMOS transistors are practically unaffected up to particle fluences much higher than those expected at the LHC. In bipolar technologies, displacement damage increases the bulk component of the transistor base current, leading to a decrease in gain. Other devices sensitive to displacement damage include some types of light sources, photodetectors and optocouplers.

# B. Single Event Effects (SEE)

These effects are due to the direct ionization by a single particle, able to deposit sufficient energy in ionization processes to disturb the operation of the device. In the LHC, the charged hadrons and the neutrons representing the particle environment do not directly deposit enough energy to generate a SEE. Nevertheless, they might induce a SEE through nuclear interaction in the semiconductor device or in its close proximity. The recoil from the interaction is indeed often capable of a sufficient energy deposition. In the special case of photodiodes used as optical detectors and in optocouplers, the direct ionization from charged hadrons might trigger a transient error.

Most available SEE data refer to heavy ion irradiation tests, and express the sensitivity of the components as a function of the Linear Energy Transfer (LET) of the incoming particle. The details on how to interpret such data in view of LHC applications are given in paragraph IV.D. As a rule of thumb, devices with a threshold LET below 15 MeVcm<sup>2</sup>mg<sup>-1</sup> can be sensitive to SEE in the LHC environment. Below this value, the lower the threshold, the higher the sensitivity of the component.

Due to their statistical nature, it is possible to speak of SEEs only in terms of their probability to occur, which will depend on the device and on the flux and nature of particles. Therefore, the best one can do is to estimate their rate in the radiation environment.
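A minimal sketch of such a rate estimate is given below. It assumes that the sensitivity of a device can be summarised by a single cross-section (in cm<sup>2</sup> per device) measured in a representative beam, so that the mean rate is the product of cross-section and flux, and that events follow Poisson statistics. The function names and all numerical values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def see_rate_per_second(cross_section_cm2: float, flux_per_cm2_s: float) -> float:
    """Mean SEE rate of one device (events/s): cross-section times particle flux."""
    if cross_section_cm2 < 0 or flux_per_cm2_s < 0:
        raise ValueError("cross-section and flux must be non-negative")
    return cross_section_cm2 * flux_per_cm2_s


def expected_events(cross_section_cm2, flux_per_cm2_s, n_devices, seconds):
    """Mean number of events in a system of identical devices over a time span."""
    return see_rate_per_second(cross_section_cm2, flux_per_cm2_s) * n_devices * seconds


def probability_at_least_one(mean_events: float) -> float:
    """Poisson probability of observing one or more events for a given mean."""
    return 1.0 - math.exp(-mean_events)


# Hypothetical numbers, for illustration only (not taken from the paper).
mean = expected_events(
    cross_section_cm2=1e-13,  # assumed per-device cross-section
    flux_per_cm2_s=1e4,       # assumed hadron flux
    n_devices=1000,           # assumed number of identical devices
    seconds=1e7,              # assumed operating time
)
print(f"expected events: {mean:.1f}")
print(f"P(at least one): {probability_at_least_one(mean):.5f}")
```

Scaling by the number of devices matters in practice: a rate that is negligible for one component can become significant when multiplied by the thousands of parts in a system.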

The family of SEE is very wide, the main members are listed in the following sub-sections.

#### 1) Permanent SEEs

Also known as "Hard errors", they may be destructive. Single Event Latchup (SEL) occurs in CMOS technologies. The onset of a parasitic npnp thyristor can be triggered by the ionizing energy deposition in a sensitive point of the circuit. This leads to an almost short-circuit current on the power lines, which can permanently damage the device. Sometimes, this condition can be local and the current limited (microlatch), but the effect can still be destructive.

Single Event Burnout (SEB) occurs in power MOSFETs, BJT and diodes when these power devices are in the "off" state. The short-circuit current induced across the high voltage junction can permanently damage the device.

Single Event Gate Rupture (SEGR) also affects power MOSFETs in the "off" state. The gate dielectric can be permanently damaged when, due to the energy deposited by an incoming particle, the electric field across the oxide is temporarily increased beyond the breakdown limit.

Stuck Bits have been observed in SRAM and DRAM circuits irradiated with heavy ions [5]. The state of the memory point is permanently changed to a logic value, without the possibility to rewrite the correct value. This event was traced back to the ionization energy deposition of a single ion with high Linear Energy Transfer (LET). Modern technologies should not be very sensitive to this effect [6], which has been observed only once during proton irradiation, for a 16 Mbit DRAM from Hitachi [7].

#### 2) Static SEEs

Static effects are not destructive, and happen whenever one or more bits of information stored by a logic circuit are overwritten by the charge collection following the ionization event. This effect is called Single Event Upset (SEU). A special case of SEU is called Single Event Functional Interrupt (SEFI). This happens in complex circuits due to an error induced on a bit of information controlling a special function of the circuit (most often, a special test mode). A reset is necessary to bring the circuit back to the operational condition.

#### 3) Transient SEEs

Charge collection from an ionization event creates a spurious signal that can propagate in the circuit. This can happen in most technologies, and its effect varies very significantly with the device, the amplitude of the initial current pulse, and the time of the event with respect to the circuit. Typical examples are transient pulses in combinational logic, which can propagate and ultimately be latched in a register, and rail-to-rail voltage pulses at the output of operational amplifiers (SET).

# III. RISK MANAGEMENT

The use of COTS in the LHC radiation environment brings along one direct consequence: *risk avoidance is impossible*. This applies also to Space systems, which fly commercial components more and more frequently.

In this scenario, the only possible approach is focused on *risk management*. In particular, our aim of having the LHC running with the experiments taking good data does not necessarily imply that any temporary failure of any component is unacceptable. This holds, of course, provided that no vital function of the overall system is touched, and that sufficient margins have been foreseen to recover from the failure condition. In this context, the radiation hazard can be dealt with much more effectively at the system level than at the component level.

The above few lines highlight the heart of the whole problem of COTS reliability in the LHC radiation environment: *the risk should be evaluated in the context of the system functionality*. Radiation is just another, even though sometimes more severe, threat to the reliability of the system, and has to be tackled with the aim of keeping the global system functionality alive in the long run.

System designers are already used to including Design Margins in their developments, to take into account that the components used to build the system are not ideal but exhibit a variability around some nominal value, and that the environmental conditions are not as benign as in the test laboratory (noise on power supplies, electromagnetic disturbances, temperature variations, ...). Radiation is yet another source of "non-ideality" of the environment, which can also be included when defining the Design Margins. In this approach, the radiation tolerance requirements on the single component are defined from the system requirements (top-down approach).

There is one major difficulty in this risk management approach, at least as far as the risks related to radiation effects are concerned. The system designer should in fact either have a deep knowledge of the radiation hazard her/himself, or work in close collaboration with a radiation effects engineer. This second solution is chosen, for instance, by NASA for space flight projects. The High Energy Physics (HEP) community, however, has a young tradition in radiation effects in semiconductor devices, and very few engineers have a good understanding of radiation effects. As a consequence, the great majority of them are learning how radiation affects electronic devices during the system development, very often when the development has reached quite an advanced state. The inertia this learning process generates might lead to the need to apply more stringent requirements on COTS components, to undue pressure on test engineers, or even to the need to redesign the system. All that translates into delays and higher costs.

# IV. SYSTEMATIC APPROACH TO DEAL WITH THE RADIATION HAZARD

The above arguments show how important it is for a system designer to have at least some knowledge of how to deal with the radiation hazard. In this section, I discuss some of the fundamental steps that should be part of the methodology used when developing a system for a radiation environment. These steps are schematically summarised in Figure 2, and are individually discussed in the following.



Figure 2: Approach to deal with the radiation hazard.

# A. The radiation environment

The electronics at the LHC will be exposed to high hadron fluxes. In the experiments, pions will be the main component of such fluxes very close to the collision point. Neutrons will instead dominate the particle fluxes in the rest of the experiments, in particular in all the regions where COTS are planned to be used.

A good comprehension of the radiation environment implies the knowledge of the environment characteristics in terms that are meaningful to estimate its impact on the electronics. The characteristics of the radiation environment that a system designer should ideally know are summarised in Table 1.

Only in the past couple of years, when SEEs started to be considered as a potential threat to LHC electronics, has the energy distribution of the particles been recognised as an important parameter to characterise the radiation environment. If the detailed energy distribution is not available, the total flux/fluence of all hadrons above 20 MeV is a sufficient parameter to allow the estimate of the SEU rates. For destructive SEEs, it is nevertheless useful to have at least an idea of the highest energy of the hadrons in the environment. The 1 MeV equivalent neutron fluence used to evaluate displacement damage effects is instead useless for evaluating SEE rates.

Table 1: Required characteristics of the radiation environment where a system has to operate

| Environment characteristic                                               | Effect              |
|--------------------------------------------------------------------------|---------------------|
| Total Ionizing Dose                                                      | TID                 |
| 1 MeV equivalent neutron flux/fluence                                    | Displacement damage |
| Flux/fluence of the main particle species, and their energy distribution | SEEs                |

The present understanding of the LHC radiation environment is certainly not complete. The study is performed with the aid of powerful, though complex, simulation programs, which require a knowledgeable physicist to output reliable data. In many cases there is not an identified and experienced physicist in charge of this delicate and important task, and in this situation system designers do not know where to get the necessary information. It should be stressed that this leads either to a delay in recognising a possible radiation hazard, or to the application of "generous" and sometimes arbitrary Safety Factors on top of the estimated parameters. This in turn may lead to heavy over-specification of the radiation requirements, which means extra costs.

The issue of the Safety Factors is very important. The "most probable" radiation levels output from the simulation are affected by an uncertainty, and only the physicist running the simulation can estimate it. The uncertainty varies with the position in the experiment/machine, with the materials surrounding the specific area, and with the local energy distribution of particles. It is certainly a possibility to add the estimated error on top of the most probable value systematically, and define a "worst case environment". This comes at the cost of over-specifying the radiation requirements. Instead, the safety margin can be varied for components with different importance in the system, and the safest (and more expensive) requirements can be defined for the vital components only.
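The idea of grading the safety margin by component importance can be sketched as follows. This is only an illustration under stated assumptions: the criticality classes, the safety factor values and the component names are hypothetical, and in a real project the factors would have to be derived from the uncertainty estimated by the physicist running the simulation.

```python
from dataclasses import dataclass

# Hypothetical safety factors per criticality class. Real values have to come
# from the simulation uncertainty estimate and from the system requirements.
SAFETY_FACTORS = {"vital": 5.0, "important": 3.0, "non_critical": 1.5}


@dataclass(frozen=True)
class Component:
    name: str
    criticality: str           # key into SAFETY_FACTORS
    simulated_tid_krad: float  # "most probable" dose from the simulation


def required_tid_krad(component: Component) -> float:
    """TID level the component must tolerate, including its safety factor."""
    return component.simulated_tid_krad * SAFETY_FACTORS[component.criticality]


# Two hypothetical parts exposed to the same simulated dose.
parts = [
    Component("link controller", "vital", 2.0),
    Component("monitoring ADC", "non_critical", 2.0),
]
for part in parts:
    print(f"{part.name}: qualify to {required_tid_krad(part):.1f} krad")
```

With this kind of bookkeeping, the most demanding (and most expensive) qualification level is reserved for the components whose failure would compromise the system.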

# B. Effects of the environment on the electronics

Once the radiation environment is clearly understood, it is possible to go one step further and analyse how it could affect the functionality of the electronics. In this section, I present an overview of the effects that the main classes of electronic and optoelectronic components are likely to experience in LHC.

#### 1) CMOS technologies

Components manufactured in CMOS technologies are generally sensitive to TID and SEEs. They are instead unaffected by displacement damage.

The threshold for TID-induced failure of CMOS components varies widely. Dose rate effects have of course to be taken into account, because they can significantly change the failure threshold. Typically, most CMOS components can stand dose levels of the order of 5-10 krad. Very few of them fail below 3 krad, some can make it to 30-50 krad, and a small minority can survive up to 100 krad. In logic circuits, failure often appears as an increase of the power supply current above the maximum specification. High precision circuits, relying on very demanding electrical parameters (such as 14-bit or higher ADCs), can exhibit an enhanced sensitivity to TID and should therefore be used carefully.

Amongst SEEs, latchup (SEL) is traditionally considered as an important threat to CMOS components for Space applications. In the LHC radiation environment, SEL is nevertheless not very likely to happen: so far only a limited number of components amongst the great and varied panoply of devices tested has shown a sufficiently low SEL threshold to be triggered in an environment where heavy ions are absent. The best known example is the K5 microprocessor from AMD, extremely sensitive to latchup even though fabricated using an epitaxial substrate [8]. Devices having shown SEL under proton irradiation include SRAM memories from several manufacturers (Cypress, NEC, Toshiba and EDI) [9]. Also several ADCs (from Crystal Semiconductor, Datel, Space Electronics Inc., Analog Devices), DSPs (from Motorola and Analog Devices) and FPGAs (from Xilinx and Actel) have in the past shown a suspiciously low threshold [10] that would not exclude possible latchup in an LHC-like environment.

With the down-scaling of CMOS technologies, and the accompanying decrease in thickness of the gate oxide, some concern has emerged for a possible sensitivity to SEGR [11] or Soft Breakdown (SB, a mechanism similar to SEGR but characterised by a smaller gate current). Recent results [12, 13, 14] have pointed out how the critical field for both SEGR and SB increases for low LET of the incoming particles. Even though all these experimental studies require further work to get a more thorough understanding of the phenomenon, they indicate that oxide breakdown is very unlikely to happen in an environment where heavy ions are absent. The critical voltage across the oxide for breakdown in that case would largely exceed the maximum applied voltage.

Single Event Upset can disrupt the operation of CMOS components in several ways, therefore several categories of devices will be separately treated.

# Memories (SRAM, DRAM, Flash, EEPROM)

<u>SRAMs</u> are sensitive to SEU. In these devices the radiation sensitivity levels are observed to vary significantly, but the threshold for upset is generally quite low (often below 1 MeVcm<sup>2</sup>mg<sup>-1</sup>) [10]. These circuits are sometimes subject to Multiple Bit Upset (MBU), with more than one memory point being corrupted by the charge deposition originated from the same particle [15]. Stuck bits are sometimes observed, but only during irradiation with particles of high LET. Newer generations of SRAMs, using a 6T cell design, are expected to have an improved SEU and TID behaviour [15, 16].

DRAMs have historically been considered as devices very sensitive to SEU, since the first circuit errors induced by radiation were observed on these components back in the seventies. Their high sensitivity is due to their characteristic of passively (that is, without active signal regeneration) storing binary information as charge in a circuit node. In addition, the read amplifiers sensing the small amount of charge stored have reduced noise margins and are also quite susceptible to radiation-induced charge perturbation. Though the amount of charge stored decreases steadily from one technology generation to the next, the estimated error rate in a radiation environment has been found to decrease [17]. This is due to the phenomenal decline in the cell area which has accompanied the down-scaling of DRAM technologies, and which dominates over the decrease in the critical charge. In other words, modern DRAMs have a lower critical charge for upset (often below 1 MeVcm<sup>2</sup>mg<sup>-1</sup>), but an overall reduced sensitivity to SEU [17]. To further improve the situation, techniques to increase the signal-over-noise ratio have been introduced in commercial DRAMs. They include special design of the storage capacitor and of the memory cell, sense amplifier design, and the introduction of Error Detection And Correction (EDAC) where adjacent physical cells do not belong to the same logical word, to limit the impact of multiple bit upset. Overall, modern DRAMs seem to have comparable SEU cross-sections to SRAMs in a proton environment [10]. It should nevertheless be noted that the SEU sensitivity of DRAM families using different capacitance technologies changes considerably (up to 3 orders of magnitude in the cross-section) [18]. Since DRAMs include increasingly complex control circuitry, the upset of a single control register can place the whole circuit in a special test mode and eventually lead to a "lock-up" condition needing a reinitialisation or a power cycle to be eliminated [19, 20]. The cross-section of such events is fortunately very small.
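The benefit of keeping adjacent physical cells in different logical words can be illustrated with a short sketch. It assumes a hypothetical row of 32 cells organised as 4 words of 8 bits, and an EDAC code able to correct one bit error per word. The two layout functions are illustrative and do not describe the internal organisation of any actual DRAM.

```python
from collections import Counter
from typing import Iterable, Tuple


def interleaved(cell: int, n_words: int) -> Tuple[int, int]:
    """Interleaved layout: neighbouring cells belong to different words."""
    return cell % n_words, cell // n_words  # (word index, bit index)


def contiguous(cell: int, bits_per_word: int) -> Tuple[int, int]:
    """Contiguous layout: all bits of a word sit next to each other."""
    return cell // bits_per_word, cell % bits_per_word  # (word index, bit index)


def errors_per_word(upset_cells: Iterable[int], layout, size: int) -> Counter:
    """Count how many upset bits land in each logical word."""
    return Counter(layout(cell, size)[0] for cell in upset_cells)


# One particle upsets three adjacent physical cells (a multiple bit upset).
upset_cells = [5, 6, 7]
print("interleaved:", dict(errors_per_word(upset_cells, interleaved, 4)))
print("contiguous: ", dict(errors_per_word(upset_cells, contiguous, 8)))
```

In the interleaved layout each affected word carries a single error, which a single-error-correcting code can repair. In the contiguous layout the same event places three errors in one word, beyond the capability of such a code.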

In Flash Memories, SEU effects are dominated by errors in their complex internal architecture rather than in the non-volatile storage array [21, 22]. Test runs have shown that no error is produced when the devices are irradiated in an un-powered mode, and that errors in the memory array can only be produced for high LET of the incoming particle (of the order of 40 MeVcm<sup>2</sup>mg<sup>-1</sup>). Since errors occur mostly in the complex control circuitry, the functional consequences of the SEU can be multiple. The circuit often requires a power cycling to recover the correct functionality. Sometimes, a steep increase of the current consumption is observed during or even after irradiation, probably due to logic conflicts in some internal register, address or buffer. Seldom, this current can be so high as to destroy the device. Compared to DRAMs and SRAMs, the sensitivity of Flash memories is nevertheless generally much lower. The threshold charge for upset is considerably higher (LET of 7 MeVcm<sup>2</sup>mg<sup>-1</sup> or more), and the cross-section is much lower because only a small portion of the circuit, containing the control logic, is sensitive. Proton test data show a cross-section typically 100-1000 times smaller compared to DRAMs or SRAMs [10].

<u>*EEPROMs*</u> also have a higher threshold for upset than DRAMs and SRAMs [10]. In read mode operation, EEPROMs are not very sensitive to SEU, showing a threshold LET typically higher than 11 MeVcm<sup>2</sup>mg<sup>-1</sup> [23, 24]. In write mode, they are more sensitive and the threshold can be as low as 7 MeVcm<sup>2</sup>mg<sup>-1</sup>. SEFI has been observed in EEPROMs, introducing systematic errors in words at various address locations [20]. Sometimes these errors required power cycling to be removed, and the threshold LET for this error was such that it could possibly occur in an LHC-like radiation environment.

### Field Programmable Gate Arrays (FPGA)

In programmable devices, it is critical that the configuration information remains reliably valid during operation. In this respect, antifuse-based components (such as the FPGAs from Actel) have an advantage over SRAM-based devices (such as those from Xilinx), because the configuration cannot be corrupted by SEUs. SRAM-based components nevertheless dominate the commercial market because they offer the highest gate number, the highest speed, and the highest flexibility (they can be reconfigured easily). The number of available programmable gates in a chip is steadily increasing, having exceeded 1 million recently. This is accompanied by a sharp increase in the chip complexity, which translates into more complex radiation effects.

SRAM-based FPGAs, which are configured by loading state information into SRAM cells, are prone to SEUs affecting their configuration memory [25, 26]. The number of configuration bits needed on average to program one used gate varies from vendor to vendor, and can easily be of the order of 30 (for some Xilinx devices). The consequence of an SEU in a configuration bit varies from no observable effect (if that bit was not used, or if there was redundancy in the program) to the destruction of the whole device [25]. This can happen if two output drivers internal to the chip are connected, resulting in a high-current state that can exceed the maximum tolerable value. Other problems may come from bus fights on internal tri-state busses, isolation of pull-up resistors on tri-state busses, changes in output slew rates, changes of input delays, or input modules being turned into an output configuration (this might lead to overstress of other components on the board). The available test data show that these effects occur with a very low LET threshold, and are observed during a 100 MeV neutron irradiation [27]. They will therefore also happen in the LHC radiation environment, and the exercise of estimating their rate is mandatory to introduce the necessary correction scheme, which requires the reprogramming of the configuration. This operation might be complex and time consuming, depending on how it has been foreseen in the system: where the backup configuration is stored, how it is going to be transferred to the FPGA, and how the error is localised in the system. Some state-of-the-art FPGAs, such as the Virtex series from Xilinx, allow for reading out the configuration bits without interfering with the chip functionality [28, 29]. This characteristic might be used to quickly localise the errors in the program.
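As an illustration of how a configuration readback could be used to localise upsets, the sketch below compares a readback image with a stored reference ("golden") copy and lists the positions of the bits that differ. It is a generic comparison under simple assumptions: both images are available as byte strings of equal length, and every bit is expected to match. The function name is hypothetical and this is not the readback interface of any vendor.

```python
from typing import List, Tuple


def find_upset_bits(golden: bytes, readback: bytes) -> List[Tuple[int, int]]:
    """Return (byte index, bit index) for every bit that differs from the reference."""
    if len(golden) != len(readback):
        raise ValueError("reference and readback images must have the same length")
    upsets = []
    for byte_index, (ref, got) in enumerate(zip(golden, readback)):
        diff = ref ^ got  # bits set to 1 mark the mismatches in this byte
        for bit in range(8):
            if diff & (1 << bit):
                upsets.append((byte_index, bit))
    return upsets


# Hypothetical two-byte configuration with one flipped bit in the first byte.
golden = bytes([0b10101010, 0xFF])
readback = bytes([0b10101000, 0xFF])
print(find_upset_bits(golden, readback))
```

The list of mismatches tells the system which part of the configuration needs to be rewritten, and counting them over time gives a direct measurement of the upset rate in the actual environment.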

Antifuse-based FPGAs, instead, have a non-volatile configuration that can be programmed only once. Therefore, it cannot be corrupted by SEU. Oxide-Nitride-Oxide (ONO) antifuses, which until recently represented the standard technology, are subject to a destructive SEE, but the threshold for this effect has been measured to be generally high enough to require heavy ions to be triggered [25]. Newer technologies, such as the metal-to-metal amorphous silicon antifuses, have been shown to be possibly even more resistant to destructive events, and their spread is motivated by a considerable increase in integration density and speed performance [26]. This technology is used, for instance, in the latest SX series from Actel: test results recently presented on components from this series (A54SX32 and A54SX32A) have shown no destructive events up to an LET of 100 MeVcm<sup>2</sup>mg<sup>-1</sup> [30].

Flip-flops and combinatorial logic gates integrated in both SRAM and antifuse-based FPGAs are prone to SEU [25, 26]. The usual mitigation techniques can be used to limit the impact of such errors on the system, in particular Triple Modular Redundancy (TMR) can successfully be implemented in antifuse-based devices [31]. In SRAM-based devices, the routing implementing the TMR scheme can be affected by upsets in the configuration logic. In the new Virtex series from Xilinx, TMR can be instead safely implemented via a hard-wired AND-OR logic structure existing as Tri-State Buffers (BUFTs) [28]. Since different implementations are actually possible for flip-flop cells, the SEU sensitivity is different for each implementation. For instance, the R-cell in the Actel SX series has a high threshold for SEU (about 7 MeVcm<sup>2</sup>mg<sup>-1</sup>) [30]. No upsets have been observed during irradiation with 55 MeV protons, and from all these results it seems that the SEU rate for this cell in LHC would be very low.
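The principle of Triple Modular Redundancy can be summarised with a bitwise majority voter, sketched below as a software model. It assumes three redundant copies of the same register value, of which at most one is corrupted at any time. The function names are hypothetical, and the sketch models only the logic function, not the way a voter is implemented in any specific FPGA.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant copies (AND-OR form)."""
    return (a & b) | (a & c) | (b & c)


def copies_disagree(a: int, b: int, c: int) -> bool:
    """True if at least one copy differs, so the upset can be logged or corrected."""
    return not (a == b == c)


good = 0b1011
upset = good ^ 0b1000  # one copy with a single flipped bit

voted = majority_vote(good, good, upset)
print(f"voted value: {voted:04b}, disagreement: {copies_disagree(good, good, upset)}")
```

The voter masks the error as long as only one copy is wrong. Detecting the disagreement is still useful, because an uncorrected copy leaves the system exposed to a second upset in another copy.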

SEUs in JTAG circuitry (in particular in the TAP controller) in both SRAM and antifuse-based devices can lead to functional interrupt (SEFI), though the cross-section of such an event is generally very low. Both Xilinx and Actel propose solutions to this problem, either by ensuring a stable Test-Logic-Reset state or by taking care that this state is re-established within 5 cycles of the test clock TCK [28, 32].

Both Xilinx and Actel propose specific products for radiation applications, for which they guarantee TID tolerance up to tens to hundreds of krad, and SEL immunity [33]. These products are manufactured using the standard masks on a thin epitaxial substrate, which improves the SEL performance considerably. It should be noted that SEL data are widely available for state-of-the-art devices in these product lines, but not for their standard commercial counterparts. Since each FPGA manufacturer actually uses several foundries, it is reasonable to expect a wide variability in the performance of the same device, especially for TID and SEL. This has in fact been observed [25, 26].

One final note about SEU in FPGAs concerns the mitigation techniques. Both Xilinx and Actel show an interest in the space and avionics market, and have therefore produced documents advising mitigation techniques against SEU in their products [28, 31, 34]. Actel has also implemented some of these techniques in the software used to program their devices [35, 36].

#### Microprocessors and DSPs

Microprocessors and DSPs are complex circuits comprised of several major functional sections, and it is unlikely that these sections will be in use simultaneously during the processing of a program. The application software the part is executing determines how many sections and registers are in use at any one time. Moreover, each section might have a different sensitivity to SEU. Therefore, the SEU-induced effects in microprocessors and DSPs are strongly application-dependent, and it is very difficult to generalise the results to the whole device category. Testing is performed running the same program foreseen in the real environment, or sometimes with programs developed on purpose to exercise each section separately.

For all the above reasons, the consequence of an SEU can be very different: no observable effect on the program execution, code and/or keyboard stopped (power cycle required), calculation error, cache data error (sometimes also requiring power cycle), etc. Some examples of recent SEU testing on microprocessors and DSPs can be found in [10, 37, 38, 39, 40]. What can be stated in general terms is that for most devices the SEU threshold is sufficiently low to have errors happening in a proton environment [40], therefore in the LHC.

### 2) Bipolar technologies

Bipolar technologies are sensitive to TID effects, displacement damage and SEEs.

As for CMOS technologies, TID effects in bipolar technologies are due to charge trapping in the oxide and creation/activation of interface states. This can lead either to the inversion of silicon under a thick oxide, which opens a conductive channel (leakage current), or to a degradation of the transistor gain (increase of the surface base current). The latter effect is generally more pronounced in lateral PNP transistors, the vertical PNP being the least sensitive (even less than the NPN) [41]. Also, this effect is more pronounced when the transistor operates in a regime of low injection. To complicate the picture about TID effects, especially as regards the qualification of components, bipolar linear circuits have been found to present an Enhanced Low Dose Rate effect (ELDR). Since the first publication on this effect back in 1991 [42], a great variety of circuits from different manufacturers have been found prone to this effect. The TID-induced damage appears to be enhanced for low dose rates. This effect is variable from one technology to another, and in some cases it does not seem to saturate down to rates of the order of 0.002 rad/s [43]. Under low rate (0.1 rad/s), the gain degradation can be enhanced by a factor of 10-20 (or more) with respect to a high rate irradiation (1 krad/s). The ELDR effect has been frequently observed in operational amplifiers and comparators (such as the LM101, LM324, OP42, LM111, LM158), and in voltage regulators (like the LM137, LM117) [41, 44, 45]. Also for bipolar devices, the tolerated TID levels present a wide variability, but most devices should be able to stand doses of the order of 2-3 krad, with a number of components surviving beyond some 10 krad.

The sensitivity of bipolar transistors to displacement damage is due to the radiation-induced increase of the bulk component of the base current. Therefore, this effect is particularly important for devices with a thick base region (and low bandwidth), such as lateral and substrate PNP [46]. Unfortunately, even though new processes with high bandwidth and thin base are available, most linear ICs are still manufactured in old junction-isolated processes. When these processes were developed, lateral PNPs had poor reproducibility and performance, and they were therefore carefully avoided in critical points of the circuits. Nowadays these limitations have been overcome, and these sensitive devices are now commonly used in critical positions, such as for input stages (examples are the LM111, LM139 and LM124) [41]. This of course worsens the IC sensitivity to displacement damage. Again, these effects have typically been observed in voltage regulators, comparators and operational amplifiers [46, 47]. The radiation data on such components indicate that PNP transistors in a typical junction-isolated process generally start to be affected beyond a fluence of about $3 \cdot 10^{11} \text{ p/cm}^2$ (with 50 MeV protons), whilst for NPN transistors this fluence needs to increase to $3 \cdot 10^{12} \text{ p/cm}^2$ [46]. Since the majority of data on linear components refer to 50-200 MeV proton tests, it is necessary to translate them into 1 MeV equivalent neutrons. This can be done using the appropriate Non Ionizing Energy Loss (NIEL) ratio. In particular, 50 MeV protons are about 1.75 times more damaging than 1 MeV equivalent neutrons [47]. This ratio has been reasonably confirmed by experiments on a particular linear circuit, an LM111 comparator [47].
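As a minimal sketch of the conversion just described, the snippet below scales a 50 MeV proton fluence to a 1 MeV equivalent neutron fluence using the ratio of about 1.75 quoted above. The function and table names are illustrative assumptions; ratios for other proton energies would have to be taken from NIEL tables and are deliberately not filled in here.

```python
# Damage ratio relative to 1 MeV neutrons, keyed by proton energy in MeV.
# Only the 50 MeV value (about 1.75) is quoted in the text; other energies
# are intentionally left out rather than guessed.
NIEL_RATIO_TO_1MEV_NEUTRON = {50: 1.75}


def to_1mev_neutron_equivalent(proton_fluence_cm2, proton_energy_mev=50):
    """Convert a proton fluence (p/cm^2) to 1 MeV equivalent neutrons (n/cm^2)."""
    return proton_fluence_cm2 * NIEL_RATIO_TO_1MEV_NEUTRON[proton_energy_mev]


# Onset fluences quoted in the text for a typical junction-isolated process.
pnp_onset = to_1mev_neutron_equivalent(3e11)
npn_onset = to_1mev_neutron_equivalent(3e12)
print(f"PNP onset: {pnp_onset:.2e} n/cm^2 (1 MeV eq.)")
print(f"NPN onset: {npn_onset:.2e} n/cm^2 (1 MeV eq.)")
```

With these numbers, the PNP and NPN onset fluences correspond to roughly $5 \cdot 10^{11}$ and $5 \cdot 10^{12}$ 1 MeV equivalent neutrons per cm<sup>2</sup>, which can be compared directly with the simulated neutron fluence at the location of interest.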

It should be noted that displacement damage and TID effects will both simultaneously affect the gain of bipolar transistors in LHC, since they increase two separate components of the base current.

Amongst SEE, SEL is not considered as a problem for bipolar technologies, since it has never been observed in any circuit. Radiation induced transients are instead often observed at the output of comparators such as the LH139, the LM111 and the LM119 [48]. They are called Single Event Transients (SET), their amplitude can be rail-to-rail, and they have a fast rise time with exponential decay (typically, 5 $\mu$s). This effect is due to the ionization energy deposited in a sensitive node of the linear circuit, most probably the input, by an incident particle. This signal is amplified by the circuit, and can be transmitted at the output as rail-to-rail. Depending on the application, it is sometimes possible to prevent this spurious signal from propagating by simply adding a low-pass filter at the circuit output.

### 3) Power devices

Power devices, other than being sensitive to TID and displacement damage effects, are also subject to destructive events such as SEB and SEGR.

Power MOSFETs, bipolar (IGBT) and diodes can experience SEB in a radiation environment [49, 50]. Most of the radiation tests on these devices are run using heavy ions, but recently also neutrons and protons have been used to successfully induce burnout (including 14 MeV neutrons in some cases). The aim of the tests is to find the drain-source voltage $V_{ds}$ at which the device can safely operate. This normally requires a de-rating of the component below the rated $V_{ds}$, this de-rating increasing for higher energy of the protons or neutrons [51], or for higher LET of the heavy ions. The higher the rated voltage of the device, the larger the percentage de-rating generally required (500 V MOSFETs can require operation at 65% of the rated $V_{ds}$, whilst 200 V devices might be safe up to 95% [51]). The cross-section for SEB is not negligible if the part is operated at the rated value: values of the order of $10^{-7}$ cm<sup>2</sup> for high-energy neutrons have been measured [51]. It is interesting to note that p-channel power MOSFETs are much less sensitive to burnout [52], and they are considered insensitive in a hadron environment such as the LHC. High power diodes and GTO (gate turn-off) thyristors have also shown evidence of SEB induced by cosmic rays, even though they were normally operated at 50-60% of their rated voltage [52].

SEGR is a radiation-induced breakdown of the gate oxide in power MOSFETs (both n- and p-channel) and IGBTs [49, 50]. Only recently, this catastrophic SEE has been observed during a proton irradiation (44 and 200 MeV) [53]: it might therefore be a concern in the LHC radiation environment. It appears, as it can be intuitively suspected, that devices with thicker gate oxides are to be preferred, since they are more resistant to SEGR. The results of this recent work indicate that the critical electrical parameter affecting the sensitivity to proton-induced SEGR is the applied gate voltage rather than the source-drain voltage. Power MOSFETs with gate oxide thickness of 30 nm (the thinnest gate thickness available in commercial devices) are subject to SEGR in the "off" state for gate voltages exceeding -20 V.

It should be pointed out that the vast majority of radiation test data on power devices refer to heavy ion irradiation runs. From such data, it is not straightforward to infer the de-rating necessary to avoid SEB and SEGR in an LHC-like environment. This is particularly true since heavy ion data most often are taken for LETs higher than 25 MeVcm<sup>2</sup>mg<sup>-1</sup>. Instead, one would rather prefer data taken with LETs of the order of 10-15 MeVcm<sup>2</sup>mg<sup>-1</sup>, not far from the maximum LET of Si recoils in hadron-Si inelastic collisions. In the absence of such data, it is possible to apply the de-rating indicated by experiments run with higher LETs (for instance, 26 MeVcm<sup>2</sup>mg<sup>-1</sup>). This de-rating might be excessive in the LHC environment, but should ensure reliable device operation.

### 4) Optocouplers

Optocouplers are worth mentioning in this context because of their sensitivity to radiation effects. Since they are often used in DC-DC converters, they determine the radiation tolerance of this important class of components. Until recently, optocouplers were thought to be sensitive only to TID effects. In the last few years, several works have pointed out the extreme sensitivity of some optocouplers to displacement damage, with a severe decrease (a factor of ten) of the Current Transfer Ratio (CTR) already after proton fluences of the order of $1$-$5 \cdot 10^{10}$ p/cm<sup>2</sup> [54, 55]. This was the case for the 4N49 from Micropac and Optek and for the P2824 from Hamamatsu. The dominant mechanism for this effect has been traced to degradation of the LED, but a decrease in the photoresponse of the transistor also contributes to the overall degradation [54]. Other devices, using a different LED and a different mechanical coupling LED/phototransistor, have instead shown a good resistance to displacement damage. This was the case for the 6N140 [54], the 6N134 [55] and the 6N139 [56], all manufactured by HP. Optocouplers to be used in the LHC should therefore be carefully selected.

Optocouplers are also sensitive to Single Event Transients (SET). Their sensitivity increases with the speed of the component, therefore this effect is getting more and more pronounced [57]. Transients are induced by the charge deposited in the photodetector, which is a very efficient particle detector as well! The sensitivity depends on the type of photodetector used, but often the direct ionization from a proton with energy below some 200 MeV is sufficient to induce SET [58]. Recoils and secondary particles produced by the interaction of the primary particle in the device itself also contribute to the observed SET rate. This effect in optocouplers might induce a transient output dropout in DC-DC converters.

# C. Definition of the radiation requirements

The knowledge of the system where a component has to operate is necessary to set the requirement concerning its radiation tolerance. The approach should be top-down: the system impact of the radiation effects on the component has to be evaluated to set the requirement. This is particularly true for SEU and SET, since the effect of upsets and transients, and their propagation, is highly system-dependent.

For cumulative effects, the radiation requirements are based on the estimated environment (TID and equivalent 1 MeV neutron fluence), with the addition of some safety factor. The need to ensure the reliability of the system, which calls for large safety factors, has to be balanced against the need to avoid largely over-specifying the requirements. For instance, a safety factor of 25 on top of an estimated TID of 2 krad (a level which is easily tolerated by the vast majority of components) leads to a requirement of 50 krad, too high for most COTS. The cost consequence in this case might be very heavy, since it might require the use of radiation-hard components.

The safety factor is determined by several reasons. The uncertainties on the estimated radiation levels, the application of incomplete test procedures, and the wide variability in radiation performance of COTS, all contribute to the safety factor. It can therefore be significantly reduced by refining the simulation of the radiation environment, and by using correct and complete test procedures. Moreover, it can be relaxed in some cases if the component does not perform a vital task, or if it can be easily accessed and replaced.

Setting the radiation requirements for destructive SEEs is less straightforward. In an environment dominated by heavy ions, the requirement could be more easily fixed: the threshold LET for the destructive SEE (SEL or SEB or SEGR) has to be higher than the maximum LET of the particles in the environment. To follow a similar approach for the LHC radiation environment, one could require that the LET threshold is higher than the maximum LET of Si recoils (about 15 MeVcm<sup>2</sup>mg<sup>-1</sup>). Also in this case, a safety factor could be applied on the threshold LET. Since to satisfy this requirement it would be necessary to run a heavy ion irradiation, which is not easy for encapsulated commercial devices and also demands an additional radiation test, an alternative approach can be followed. The requirement could be formulated in terms of cross-section for the destructive SEEs to occur in a high-energy hadron environment. The required cross-section has to be set in agreement with the estimated particle fluence in the radiation environment.

As a practical example, let's assume that a component has to survive without any SEL in a position where the estimated hadron fluence (above 20 MeV and including a safety margin from the simulation uncertainties) is $10^{11}$ cm<sup>-2</sup>. The requirement on the cross-section might be set to $10^{-11}$ cm<sup>2</sup> during an irradiation with 200 MeV protons (that is, no SEL observed up to a fluence similar to the one expected). It is important to understand that the cross-section measurement only gives statistical information: maybe no SEL has been observed because the cross-section is $10^{-12}$ cm<sup>2</sup>, which does not ensure that no SEL will be observed in LHC. Actually, in this example, it would be statistically reasonable to expect 1 SEL in one device out of ten during operation in the LHC. Once more, setting the requirement in this case is part of the risk management strategy, and the knowledge of the system helps in choosing a reasonable safety margin. Moreover, it is often possible to get help from available heavy ion data (to get a feeling of the threshold of the SEEs), or from technological considerations.
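The statistical point of this example can be sketched numerically. The snippet below assumes that SEL events follow Poisson statistics, which is a standard assumption for rare independent events but is not stated explicitly in the text; the function names are illustrative. It computes the expected number of SEL per device, the probability of at least one SEL, and the upper limit on the cross-section that a test with zero observed events can actually support.

```python
import math


def expected_events(cross_section_cm2, fluence_cm2):
    """Mean number of events per device for a given cross-section and fluence."""
    return cross_section_cm2 * fluence_cm2


def prob_at_least_one(cross_section_cm2, fluence_cm2):
    """Poisson probability of observing one or more events."""
    return 1.0 - math.exp(-expected_events(cross_section_cm2, fluence_cm2))


def upper_limit_cross_section(test_fluence_cm2, confidence=0.95):
    """Upper limit on the cross-section when zero events were observed."""
    return -math.log(1.0 - confidence) / test_fluence_cm2


LHC_FLUENCE = 1e11   # hadrons/cm^2 above 20 MeV, from the example in the text
TRUE_SIGMA = 1e-12   # cm^2, the undetected cross-section assumed in the text

mean_sel = expected_events(TRUE_SIGMA, LHC_FLUENCE)
p_sel = prob_at_least_one(TRUE_SIGMA, LHC_FLUENCE)
sigma_limit = upper_limit_cross_section(1e11)

print(f"expected SEL per device: {mean_sel:.2f}")
print(f"probability of at least one SEL: {p_sel:.3f}")
print(f"95% upper limit after a clean test to 1e11 cm^-2: {sigma_limit:.1e} cm^2")
```

Under these assumptions, a device with a cross-section of $10^{-12}$ cm<sup>2</sup> has a mean of 0.1 SEL over the LHC lifetime, which is the "one device out of ten" of the text, while a clean test up to $10^{11}$ cm<sup>-2</sup> only excludes cross-sections above about $3 \cdot 10^{-11}$ cm<sup>2</sup> at the 95% confidence level.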

As regards SEU and SET, the requirements can only be set on the basis of the system impact. For some parts of some components, basically handling data, a relatively high SEU rate can be accepted. For other parts of the same components the acceptable rate is much lower. This is the case, for instance, for setup parameters in front-end chips (containing bias and thresholds information), for bunch and event identifiers, for programs in SRAM-based FPGAs and DSPs, and for JTAG TAP state machines. The requirement in this case is set in terms of an acceptable rate of errors in LHC. The rate can be estimated on the basis of the cross-section measured in a mono-energetic hadron beam, as will be specified in the following section.

# D. Identification of the candidate components

Once the electrical and radiation specifications have been set, it is possible to look for candidate components able to meet them. To this purpose, available radiation data are a valid help in identifying parts that can potentially satisfy the radiation requirements. Radiation data are reported in several databases accessible on the web [59], and maintained by institutes and agencies mainly involved in space missions. In particular, the compendia available in the "JPL radiation effects database" web page are very helpful in comparing the radiation performance of classes of components and, within the same class, of different products with similar functions. Useful links to web pages are reported in the CERN RD49/COTS web page [60]. Very useful data can also be found in the "Workshop records" of the annual NSREC conference [61], a workshop organised to allow the spread and sharing of radiation test data. The IEEE Transactions on Nuclear Science volume issued in December every year, and dedicated to the papers presented at the NSREC, is also a good source of radiation data and of insight into the radiation effects. Useful data can be found also in the ESA/ESTEC web page, in particular in the annual "OCA final presentation day" web page [62]. ESA also has a database that should be in the public domain in the near future. The problem with all these databases is that they are not usually updated very regularly, and they often do not include data on state-of-the-art components. In this respect, the "Workshop records" and the papers from NSREC have the advantage of containing "fresher" data. For FPGAs, up-to-date test data can normally be found in the web page of the manufacturer [33].

If the interpretation of TID data from databases is straightforward, the situation for SEEs is different. In particular, most often data refer to heavy ion tests, which do not directly apply to the LHC environment.

For destructive SEEs, all components exhibiting sensitivity at LET below about 15 MeVcm<sup>2</sup>mg<sup>-1</sup> have to be considered in potential danger. Below this value, the lower the LET threshold, the higher is the risk.

For SEU sensitivity, it is worthwhile to give some guidelines on how to interpret data that can be found in databases or in publications. These guidelines are based on a recent simulation work addressing upset rates in the LHC radiation environment [63]. This work has shown that the upset rates will be dominated by the interaction of all hadrons above 20 MeV with silicon nuclei in the integrated circuits. These guidelines can help designers to have a rough but useful estimate of the error rates.

Whenever SEU data referring to high energy (60-200 MeV or higher) proton or neutron irradiation are available, the estimate of the SEU rate in LHC is straightforward. Data are normally reporting the SEU cross-section measured during the irradiation of the component with a mono-energetic proton or neutron beam. The typical energy used varies from 60 to 200 MeV (above some 20 MeV, contributions to SEU from neutrons and protons can be considered as very closely similar [63, 64]). In this case, the error rate in LHC can be estimated by multiplying the reported cross-section at the highest energy by the expected hadron flux (all hadrons above 20 MeV) in LHC.

As an example, let's take the case of an FPGA from Xilinx, the XC4010XL, for which an SEU cross-section of $4.4 \cdot 10^{-15}$ cm<sup>2</sup>/bit has been measured during a 100 MeV neutron test. For an LHC estimated hadron flux of $2 \cdot 10^3$ cm<sup>-2</sup>s<sup>-1</sup> (that is, a fluence of about $10^{11}$ cm<sup>-2</sup> during 10 years operation, counted as $5 \cdot 10^7$ s at maximum luminosity), this corresponds to $8.8 \cdot 10^{-12}$ errors/(bit·s). Since each chip contains about 283000 configuration bits, the configuration error rate for the chip is $2.5 \cdot 10^{-6}$ s<sup>-1</sup>. For a set of 110 FPGAs, we can estimate that one of them will lose a configuration bit (possibly needing the full reconfiguration of the whole chip) every hour.
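The arithmetic of this example is easy to reproduce for other devices. The short sketch below uses only the numbers quoted in the text; the variable names are illustrative.

```python
# Inputs quoted in the text for the XC4010XL example.
SIGMA_PER_BIT_CM2 = 4.4e-15   # SEU cross-section per bit (100 MeV neutron test)
HADRON_FLUX_CM2_S = 2e3       # LHC hadron flux above 20 MeV
CONFIG_BITS = 283_000         # configuration bits per chip
N_CHIPS = 110                 # FPGAs in the system
OPERATION_TIME_S = 5e7        # 10 years counted at maximum luminosity

rate_per_bit = SIGMA_PER_BIT_CM2 * HADRON_FLUX_CM2_S   # errors/(bit*s)
rate_per_chip = rate_per_bit * CONFIG_BITS             # errors/(chip*s)
rate_system = rate_per_chip * N_CHIPS                  # errors/s in the system
mean_time_between_upsets_s = 1.0 / rate_system
total_fluence = HADRON_FLUX_CM2_S * OPERATION_TIME_S   # hadrons/cm^2 in 10 years

print(f"rate per bit:   {rate_per_bit:.2e} /s")
print(f"rate per chip:  {rate_per_chip:.2e} /s")
print(f"system rate:    {rate_system:.2e} /s")
print(f"mean time between configuration upsets: {mean_time_between_upsets_s / 3600:.2f} h")
```

The mean time between configuration upsets in the whole system comes out at about 3650 s, that is, roughly one per hour as stated above.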

Whenever the SEU data are available only for heavy ion tests, the estimate of the error rate in LHC is more complex and requires the availability of the four parameters of the Weibull curve fitting the experimental points. With these parameters, it is possible with some hypothesis on the Sensitive Volume (SV) size, to reproduce the cross-section curve as a function of the energy deposited in the SV, as shown in Figure 3.

At this point, it is necessary to know the probability, in the LHC environment, of the deposition of any energy $E_{dep}$ in the SV. This requires running a simulation where the primary hadrons interact with the silicon nuclei, and the interaction products (secondary hadrons and recoils) are transported in the silicon and their energy deposition in the SV is computed. This was the heart of the work in [63], and the probability curve for an LHC-like primary particle spectrum has been produced. Since this probability curve is not very sensitive to small changes in the particle energy spectrum, the computed curve can be used for any LHC environment for approximate rate computations. The folding of the probability curve on top of the Weibull curve, as shown in Figure 3, leads to the estimated cross-section for the component in the LHC.

This cross-section has to be multiplied by the hadron flux (all hadrons above 20 MeV) to obtain the estimated error rate. This procedure requires that the probability curve from the simulation is known [65]. Details on this procedure can be found in [63].
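The folding step can be illustrated with a short numerical sketch. It is not the procedure of [63], only a simplified illustration under several assumptions: the Weibull curve is written in its usual cumulative form and normalised to 1, the saturation cross-section is assumed to be absorbed into the normalisation of the probability curve, the probability curve is taken as a differential density in cm<sup>2</sup>/MeV, and the heavy ion LET is converted to deposited energy through a hypothetical SV thickness. The exponential density and all Weibull parameters below are invented for illustration and are not simulation results.

```python
import math

SI_DENSITY_MG_CM3 = 2330.0  # density of silicon (2.33 g/cm^3)


def let_to_energy_mev(let_mev_cm2_mg, sv_thickness_um):
    """Energy deposited over the assumed SV thickness (1 um = 1e-4 cm)."""
    return let_mev_cm2_mg * SI_DENSITY_MG_CM3 * sv_thickness_um * 1e-4


def weibull_fraction(e_dep, e0, width, shape):
    """Normalised Weibull curve: 0 below the threshold e0, tending to 1 above."""
    if e_dep <= e0:
        return 0.0
    return 1.0 - math.exp(-(((e_dep - e0) / width) ** shape))


def fold(prob_density, e0, width, shape, e_max, steps=2000):
    """Trapezoidal integral of prob_density(E) * weibull_fraction(E) on [0, e_max]."""
    h = e_max / steps
    total = 0.0
    for i in range(steps + 1):
        e = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * prob_density(e) * weibull_fraction(e, e0, width, shape)
    return total * h


def density(e_dep):
    """Hypothetical probability density (cm^2/MeV); NOT the curve from [63]."""
    return 1e-13 * math.exp(-e_dep / 0.5)


SV_UM = 1.0                                # hypothetical SV thickness
width = let_to_energy_mev(20.0, SV_UM)     # hypothetical Weibull width
sigma_low_threshold = fold(density, let_to_energy_mev(3.0, SV_UM), width, 1.5, e_max=20.0)
sigma_high_threshold = fold(density, let_to_energy_mev(15.0, SV_UM), width, 1.5, e_max=20.0)

print(f"LET threshold  3: {sigma_low_threshold:.2e} cm^2")
print(f"LET threshold 15: {sigma_high_threshold:.2e} cm^2")
```

Even with invented inputs, the sketch reproduces the qualitative behaviour of Figure 3: the folded cross-section falls steeply as the threshold moves towards the tail of the energy deposition probability curve.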

Whenever the full Weibull curve is not available in the database, it is not possible to estimate a rate. Just to have a feeling, components with LET threshold below 5 MeVcm<sup>2</sup>mg<sup>-1</sup> will be quite sensitive to SEU in the LHC environment. If the LET threshold is instead above 15 MeVcm<sup>2</sup>mg<sup>-1</sup>, the error rate should be negligible.



[Figure 3 image not reproduced; horizontal axis: deposited energy]

Figure 3: Example of folding the probability curve for the energy depositions in the SV on the Weibull curve. In this example, three probability curves are shown. In the environment 1, the component will not be sensitive to SEU. In environments 2 and 3, the component will be increasingly sensitive, the cross-section for SEU in the environment being proportional to the shaded areas.

# E. Test of the candidate components

Irradiation data contained in databases should never be used as a qualification source for COTS, but only as a tool to identify candidate components. After this preselection of candidate components, testing is mandatory. Several issues on testing are discussed in the following.

### 1) Radiation source

The proper radiation source has to be selected for each radiation effect. TID effects are, as a common practice, tested using a <sup>60</sup>Co source. For displacement damage, low energy neutron sources are preferred, for which it is simple to express the fluence in 1 MeV neutron equivalent. For SEE testing, different hadron sources can be used, with the requirement that the energy of the particles is high enough (I recommend 60 MeV or higher). The reason for preferring mono-energetic beams is that they allow the measurement of the cross-section at one precise value of particle energy, and with the use of moderators it is possible to repeat the measurement at different energies. Therefore, it is easily possible to estimate the error rate in the LHC. Instead, when using an irradiation facility where the particle species and energy spectra are multiple, it is not possible to tell which particle or energy is responsible for the observed SEEs, and therefore to make any rate prediction for the LHC.

In the case of SEU testing, 60-200 MeV protons represent a very good candidate: they are accessible in several laboratories in both Europe and the United States. The vast majority of the components, especially those for which the error rate will be higher, can be very effectively tested with a 60 MeV proton beam. For a few components with low sensitivity, such test can nevertheless lead to a significant under-estimate of the error rate. This should be taken into account, but it should be noted that this under-estimate concerns at worst those components that will experience a low SEU rate in LHC anyway.

In the case of destructive SEEs, an under-estimate of the rate can be unacceptable if the loss of the component is forbidden in the system. Therefore, 60 MeV proton beams can be helpfully used to make a first screening of the candidate components, but the final qualification should be made with higher energy beams (200 MeV or higher).

In some cases, especially when the resources for testing are limited and the available time is short, it is possible to test for several radiation effects using a single radiation source. Such a procedure has been recently proposed and adopted for the electronics that will be installed in the hadron calorimeter, the muon chambers and the cavern of CMS [66]. The test plan in that case is based on the use of 60-200 MeV proton beams to simultaneously test the components for TID, displacement damage and SEEs.

Though there are several facilities where it is possible to run irradiation tests with mono-energetic proton beams, the access to them might require a long-term notice and might be very expensive (up to 500-700 USD per hour). To help users in getting more easily beam time, through the RD49/COTS project it has been possible to reach an agreement with the Cyclotron Research Center (CRC) in Louvain-la-Neuve, Belgium. The cyclotron at CRC can accelerate protons up to 60 MeV and can also deliver almost mono-energetic neutron beams [67, 68]. The access has been organized as 3-4 "irradiation campaigns" per year, each regrouping several users, and CRC only asks for a contribution covering electricity, taking charge of the rest of the actual beam cost. All LHC collaborations can benefit from this agreement upon request [69].

### 2) Irradiation procedure

To get reliable results, a correct irradiation procedure has to be followed. TID testing of CMOS components requires prompt (oxide trapped charge) and slow (interface states) charge trapping to be tested. The physics behind these phenomena is reasonably understood, and the existing procedures [70, 71] are effective in evidencing the worst-case response of the component. All procedures require a high dose rate irradiation followed by high temperature annealing, with the device constantly under bias.

For bipolar components, the recently observed ELDR effect complicates the testing procedure, since the basic mechanism behind this phenomenon is not yet understood. Most of the proposed procedures focus on an irradiation at high temperature, which considerably increases the testing complexity. To date, the most complete set of recommendations on bipolar TID testing is probably the one from JPL [72].

Whenever the specified level is below 30 krad, a test at Low Dose Rate (LDR) is manageable in a period of a few weeks. In this case, the recommendation is to test the device at room temperature both at high rate (50 rad/s) and at low rate (preferably 0.005 rad/s) and compare the device response. If the part fails at 1.5 times the specified TID in any of the two tests, it should not be used. The situation is more complex for specified levels above 30 krad, when the LDR test requires too much time. In this case, the recommendation is to test the device up to 30 krad in three different conditions [73]: at high rate (50 rad/s) and room temperature, at moderate rate (1 rad/s) and high temperature (90°C), at low rate (0.005 rad/s) and room temperature. The comparison of the three results can highlight the component sensitivity to ELDR effects. If the moderate rate at high temperature and the low rate give comparable results, then use the high temperature test up to the specified TID level to qualify the component. JPL also recommends in this case using an additional safety factor of 2 on the TID level.
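The recommendation above can be summarised as a small decision helper. This is only a sketch of the rules as stated in this paragraph: the function and dictionary names are illustrative, and the treatment of a specified level of exactly 30 krad (grouped here with the higher levels) is an assumption, since the text only distinguishes "below" and "above". A second helper estimates the irradiation time for a given dose and dose rate, which is what makes the low dose rate test practical or not.

```python
HIGH_RATE = {"rate_rad_s": 50.0, "temperature": "room"}
MODERATE_RATE_HOT = {"rate_rad_s": 1.0, "temperature": "90 C"}
LOW_RATE = {"rate_rad_s": 0.005, "temperature": "room"}


def bipolar_tid_test_plan(specified_tid_krad):
    """Return the irradiation conditions suggested in the text for a bipolar part."""
    if specified_tid_krad < 30.0:
        # Two room-temperature tests, each up to 1.5 times the specified TID.
        return {"max_dose_krad": 1.5 * specified_tid_krad,
                "conditions": [HIGH_RATE, LOW_RATE]}
    # Three comparison tests up to 30 krad; if the hot moderate-rate and the
    # low-rate results agree, the hot test is then extended to the specified TID.
    return {"max_dose_krad": 30.0,
            "conditions": [HIGH_RATE, MODERATE_RATE_HOT, LOW_RATE]}


def irradiation_time_days(dose_krad, rate_rad_s):
    """Exposure time in days needed to reach dose_krad at a constant dose rate."""
    return dose_krad * 1e3 / rate_rad_s / 86400.0


plan = bipolar_tid_test_plan(5.0)  # hypothetical 5 krad requirement
longest_days = max(irradiation_time_days(plan["max_dose_krad"], c["rate_rad_s"])
                   for c in plan["conditions"])
print(f"test up to {plan['max_dose_krad']} krad, longest run {longest_days:.1f} days")
```

For the hypothetical 5 krad requirement used here, the low rate run up to 7.5 krad takes about 17 days, while the same dose at 50 rad/s is reached in a few minutes.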

Since the effect of bias on the TID effects in bipolar circuits is variable with the technology, it is advised to bias the device during test in a way closely similar to the real application. Often the worst damage is observed when all device terminals are grounded: this bias condition is therefore often recommended.

The displacement damage test is generally quite simple: devices are normally exposed at room temperature and with all terminals grounded (though there is sometimes an effect of bias, this condition is seen as the worst case).

Destructive SEEs need a special test setup to be developed. The destruction of the device through SEB or SEL can be avoided by using a circuit that detects the increase in the supply current and temporarily cuts the power to re-establish the correct functionality. The delay and the duration of the power cut have to be carefully chosen to sufficiently protect the device (typically, the delay should be as short as possible and the duration of the order of 50-100 ms). In this way, it is possible to accumulate sufficient statistics on these destructive events to estimate a cross-section, which in turn allows the failure rate in LHC to be computed. For SEGR, unfortunately, there is to date no procedure avoiding the device destruction, therefore the test should be repeated on several components to get an idea of the cross-section. An important aim of the SEB and SEGR testing is to find the de-rating conditions sufficient to ensure the reliability of the component in LHC [49]. Therefore, the testing should be repeated at different bias conditions.

SEU and SET effects should also be measured using a dedicated setup to count the number of radiation-induced errors. The setup changes considerably with the component under test, therefore it is impossible to give a general description of it. The aim is the measure of the cross-section, for which the number of errors has to be divided by the particle fluence. In the test of devices that can have complex SEU effects, such as microprocessors, it is important to perform the test in a condition representative of the final application.

For proton irradiation, it is often useful to repeat the measurement of the cross-section for several proton energies, which can be done by decreasing the beam energy with moderators of variable thickness. During the irradiation, one should monitor the state of the device to promptly detect the possible occurrence of SEFI or SEL. To estimate the SEU cross-section, it is important to know the effective particle fluence: the irradiation periods during which the device is not correctly functioning (for instance, because of SEFI) should not be counted. This procedure also allows the estimation of the cross-section for events like SEFI, which are a more serious concern than simple SEU.
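The bookkeeping of the effective fluence can be sketched as follows. The run log format (a list of intervals flagged as functional or not), the constant beam flux and all numbers are hypothetical, chosen only to illustrate how the dead periods are excluded before dividing the error count by the fluence.

```python
def effective_fluence(flux_cm2_s, intervals):
    """Fluence accumulated only while the device was functioning correctly.

    intervals: list of (duration_s, device_ok) tuples from the run log.
    """
    live_time_s = sum(duration for duration, device_ok in intervals if device_ok)
    return flux_cm2_s * live_time_s


def cross_section(n_events, fluence_cm2):
    """Cross-section in cm^2: number of events divided by the particle fluence."""
    return n_events / fluence_cm2


# Hypothetical run: constant flux, one SEFI leaving the device dead for 120 s.
FLUX = 1e8  # protons/(cm^2 s)
run_log = [(600.0, True), (120.0, False), (480.0, True)]

fluence = effective_fluence(FLUX, run_log)
seu_sigma = cross_section(54, fluence)    # 54 upsets counted while functional
sefi_sigma = cross_section(1, fluence)    # 1 SEFI observed in the same run

print(f"effective fluence: {fluence:.2e} cm^-2")
print(f"SEU cross-section:  {seu_sigma:.2e} cm^2")
print(f"SEFI cross-section: {sefi_sigma:.2e} cm^2")
```

Counting the full 1200 s of beam time instead of the 1080 s of live time would under-estimate the SEU cross-section by about 10% in this invented run.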

Due to the high energy of the hadron beams normally used for SEE characterisation, testing can be performed in air and with packaged devices.

Since temperature differently affects the SEE sensitivity of components, the testing temperature should be carefully chosen to be either representative of the real condition during device operation, or to give a worst case response. For instance, high temperature increases SEL and decreases SEB sensitivity.

Finally, whichever the radiation effect to be tested, there is the question on how many devices to test. The answer is different if the test is aimed at exploring the radiation performance (pre-selection of parts) or at qualifying a component to be used in the LHC. For the qualification, the number of devices to be tested has to be such that the result is representative of the whole population of parts to be used. Of course this is easier, and therefore the sample size can be smaller, if the origin of the components is known and certified, which ensures a reasonable homogeneity. The sample size also changes with the radiation effect: TID effects are generally more sensitive to technology changes than SEE effects, the sample size has therefore to be greater. As an indication, a sample size of 10 for TID and displacement damage and of 5 for SEEs is reasonable for qualification purposes. These numbers can nevertheless be lowered in presence of a wide radiation tolerance margin with respect to the requirements and of a homogeneous device response. On the other hand, they can also be increased when the tested samples show very different radiation response and/or just enough tolerance to satisfy the requirements.

### 3) Board-level testing and hybrid devices

Component-level testing allows measuring the radiation effects on several electrical parameters as well as on the overall component functionality. Board-level testing, instead, gives only information on the device radiation response within a specific system: several devices in the board could well be close to failure without giving any sign of problem within the system. In this case, failure could occur in another board, for which the initial electrical performance of the components is different from that of the components in the tested board. In other words, there is no information on whether a sufficient safety margin exists for the board to correctly operate. Additionally, in case of failure during the test, it is not always straightforward to understand the device and the phenomenon responsible, especially for what concerns non-destructive SEEs.

For all these reasons, board-level testing is not advised as a qualification tool, but only as a go/no-go test. If this recommendation is not followed, the safety factor on the radiation requirements should be increased significantly (sometimes, a factor of 3 is advised [72]). Board-level testing can be useful in pointing out how errors can propagate in the system. For instance, it can highlight SEU-induced failure mechanisms that can still be corrected within the system and that are not easily anticipated from component-level testing.

Similar considerations can be made for hybrid devices, for which it is often not even known which parts are actually mounted on the hybrid. Hybrid manufacturers consider their designs proprietary, and are not willing to give any information about the individual parts. Sometimes, even if the manufacturer is collaborative, only limited information is available. Examples of these problems are reported in [72] and [19] for power converters. In one case [72], much lower radiation tolerance was observed in a late version of a DC-DC power converter with respect to an earlier version previously tested. The later version failed after 2 krad in a proton environment, versus 30 krad for the earlier version. It turned out that the earlier version was inadvertently made with an LED within the optocoupler that was relatively insensitive to proton displacement damage. The later version, instead, used an LED from the same manufacturer but with a different wavelength, which was extremely sensitive to displacement damage. The hybrid manufacturer, though, had no record of which version of the optocoupler was used in its various products.

### F. Engineering the system

After testing, the results are analysed in view of the use of the parts in the system. The possible outcomes are summarised in Figure 4. For some of the candidate components, testing might have revealed a radiation tolerance lower than expected and insufficient to satisfy the requirements. Alternative components might at this point be sought, repeating the procedure of preselection and testing. If such components cannot be found, other strategies can be followed.



Figure 4: Flowchart of the possible actions following the test of candidate components.

For instance, a better study of the radiation environment, with the correct definition of all the materials and geometry, might lead to a redefinition of the radiation hazard, in particular to a decrease of the safety factors. The component might in that case satisfy the new requirements. As far as SEU is concerned, mitigation techniques can be successfully used: Error Detection And Correction (EDAC), Triple Modular Redundancy (TMR) within the component, component redundancy and watchdog techniques can all contribute to considerably lowering the radiation requirements on the components. A detailed discussion of these mitigation techniques is beyond the scope of this paper, and can be found in other works [74, 75] and related references.
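As a minimal illustration of one of these techniques, the sketch below shows bitwise majority voting over three redundant copies of a register, which is the basic operation behind TMR. The function names and the scrubbing step are hypothetical and serve only as an example in software; in an actual component the voter would typically be implemented in logic, and the cited references should be consulted for real designs.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant copies of the same word."""
    return (a & b) | (a & c) | (b & c)


def read_and_scrub(copies):
    """Vote over three stored copies, then rewrite all of them with the result.

    Returns the voted word and a flag telling whether any copy disagreed.
    """
    if len(copies) != 3:
        raise ValueError("TMR needs exactly three copies")
    voted = tmr_vote(*copies)
    disagreement = any(copy != voted for copy in copies)
    copies[:] = [voted] * 3  # scrubbing: keeps upsets from accumulating
    return voted, disagreement


if __name__ == "__main__":
    register = [0b10110010] * 3
    register[1] ^= 0b00001000  # emulate a single-bit upset in one copy
    value, upset_seen = read_and_scrub(register)
    print(bin(value), upset_seen)
```

A single upset in any one copy is outvoted. Two upsets hitting the same bit in different copies before the next vote would defeat the scheme, which is why the copies are rewritten after each read in this sketch.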

Another strategy, which might sometimes be applied, is to plan for the replacement of the components during the 10-year operation of the LHC. Some of the components might in fact be accessible enough to make such a solution the cheapest and the simplest.

Where and how the initial requirements can be lowered depends on system-level considerations and is part of the more global issue of risk management already discussed. This process has to occur merging the knowledge on the system function and implementation and on the radiation effects. The full cycle described above can be iterated several times and for several parts. The final objective is to reach a reliable system configuration, which can be implemented with qualified components. Since the parts are COTS, the qualification has to be made by the users and implies an effort in procuring a traceable and homogeneous lot of components.

# V. COTS PROCUREMENT

The variability in radiation tolerance is unfortunately well known for COTS components. Big semiconductor vendors produce their components in several manufacturing plants worldwide: parts manufactured on different production lines are very likely to have different radiation tolerance. Moreover, provided the specifications in the datasheet are met, the manufacturer can introduce as many changes as it wishes in the product, without even notifying the customers. Those changes, ranging from some steps in the technology to the redesign of the circuit in another technology, are usually not documented. Different parts therefore cannot be distinguished by looking at the packaged part.

The procurement of COTS parts for a radiation environment therefore poses a problem. Ideally, the buyer wants to have a homogeneous lot of components, all coming from the same process line and manufactured at the same time. In this way, the characterisation of the radiation response of a small sample of parts is representative of the performance of the whole lot. Unfortunately this request is an exception in the commercial marketplace, and manufacturers are not used to ensuring such a high level of traceability of their parts for their customers. To complicate the picture further, parts are often purchased from distributors. The distributor can ensure that all the parts delivered come from the same commercial lot, which has no direct relation to a production lot. The components within the same commercial lot can still differ considerably from each other (they might even come from different foundries).

In cases where the order volume is interesting for the vendor, it might be possible to negotiate better traceability conditions with the manufacturer/distributor. Since the LHC is a negligible market for big semiconductor manufacturers, this level of collaboration is often left to the good will of the company. Nevertheless, all efforts should be made to ensure the purchase of a lot of parts as homogeneous as possible. Qualification radiation testing on a sufficient number of samples will then indicate the homogeneity of the radiation response. If the homogeneity seems to be poor, the sample size has to be increased, and the radiation performance has to be carefully compared with the requirements for the part.
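The comparison described above can be sketched numerically. The Python example below is a minimal illustration assuming that each tested sample yields a single figure of merit (here a hypothetical TID failure level in krad). The function name and the use of the relative standard deviation as a homogeneity indicator are choices made for this sketch, not a procedure prescribed here, and no acceptance thresholds are implied.

```python
from statistics import mean, stdev


def lot_summary(failure_levels_krad, requirement_krad):
    """Summarise the spread of a tested sample and its worst-case margin.

    failure_levels_krad: dose at which each tested sample failed (assumed data).
    requirement_krad: required tolerance, safety factors already included.
    """
    if len(failure_levels_krad) < 2:
        raise ValueError("at least two samples are needed to estimate spread")
    # Relative spread: sample standard deviation divided by the mean.
    spread = stdev(failure_levels_krad) / mean(failure_levels_krad)
    # Worst-case margin: weakest tested device against the requirement.
    worst_margin = min(failure_levels_krad) / requirement_krad
    return {"relative_spread": spread, "worst_case_margin": worst_margin}


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    tight_lot = [30.0, 32.0, 28.0, 31.0, 29.0]
    mixed_lot = [30.0, 12.0, 28.0, 11.0, 29.0]
    for name, lot in (("tight", tight_lot), ("mixed", mixed_lot)):
        print(name, lot_summary(lot, requirement_krad=10.0))
```

The margin is computed on the weakest sample rather than on the mean, because a lot with a poor homogeneity can show a comfortable average while still containing parts that barely meet the requirement.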

The ideal approach to the variability problem, which is often followed by Space agencies, is to reach an agreement with the manufacturer allowing the freezing of a production lot of parts waiting for the results of a radiation test run on a few samples. The lot is actually bought only if the samples have passed the test, otherwise the procedure is repeated for another production lot. In this way, it is possible to purchase a homogeneous and qualified lot of parts. Such an agreement can sometimes be reached, for some products with limited market, for a relatively small number of purchased parts.

# VI. PLASTIC PACKAGING AND BURN-IN

The effects of plastic packaging and burn-in on the radiation tolerance, in particular for TID effects, have been investigated in several recent works. Since plastic and ceramic packaging processes differ in their thermal cycles, some difference in their TID tolerance cannot be excluded.

Some early work showed enhanced radiation degradation in plastic-packaged CMOS devices that had previously been burned in [76]. Similarly, differences in the radiation response of burned-in bipolar components have also been reported [77]. More recently, an extensive study on individual bipolar and CMOS transistors concluded the opposite, that is, a lower degradation of plastic-encapsulated devices [78]. Moreover, the differences induced by burn-in were within the experimental error. No evidence of enhanced degradation in plastic-encapsulated transistors and circuits was found in [79] either.

In an attempt to summarise all the literature work on the subject, JPL concluded that there is no evidence to date suggesting that plastic package directly influences the radiation response unless high temperature burn-in (above the maximum operating temperature) is used for the qualification process [72]. A reasonable recommendation is nevertheless that any radiation test aimed at qualification should be representative of the parts actually used in the system in terms of packaging and burn-in [78].

# VII. CONCLUSION

The unavailability or unaffordability of qualified radiation-hard components moves the focus of the radiation hardness assurance from the individual component to the system level. The use of COTS parts subject to a wide range of radiation effects brings along complications on radiation performance variability and traceability and makes risk avoidance practically impossible. In these conditions, risk management is a forced approach that can be applied more efficiently at the system level.

Unfortunately, the complexity of the issues involved is such that no universal recipe exists. Reliability of systems using COTS in LHC is a game whose main rule is the understanding of the environment, of the radiation hazard and of the system performance. This requires merging a wide spectrum of competencies in a single team, which represents the true challenge for all teams working for LHC.

# ACKNOWLEDGEMENTS

I am grateful to G. Stefanini and J. Christiansen at CERN for having reviewed the manuscript, and for the useful suggestions that helped to improve it.

# REFERENCES

- [1] P.S.Winokur *et al.*, "Use of COTS Microelectronics in Radiation Environments", IEEE Trans. Nucl. Science, Vol.46, No.6, December 1999, p.1494
- [2] S.J.Watts, "Status of the RD48/ROSE Collaboration", proceedings of the 5<sup>th</sup> LEB Workshop, Snowmass, Colorado, 20-24 September 1999, p.201
- [3] M.Huhtinen, "Displacement damage what is it", http://cmsdoc.cern.ch/~huu/tut2.pdf
- [4] RD48/ROSE Collaboration Status Report 1997, CERN/LHCC 97-39, accessible on line through the RD48 web page http://rd48.web.cern.ch/RD48/
- [5] R.Koga *et al.*, "On the Suitability of Non-Hardened High Density SRAMs for Space Applications", IEEE Trans. Nucl. Science, Vol.38, No.6, December 1991, p.1507
- [6] T.R.Oldham *et al.*, "Total Dose Failures in Advanced Electronics from Single Ions", IEEE Trans. Nucl. Science, Vol.40, No.6, December 1993, p.1820
- [7] R.Harboe-Sorensen *et al.*, "Heavy Ion, Proton and Co-60 Radiation Evaluation of 16Mbit DRAM Memories for Space Application", 1995 IEEE Radiation Effects Data Workshop, p.42
- [8] A.H.Johnston *et al.*, "Latchup in Integrated Circuits from Energetic Protons", IEEE Trans. Nucl. Science, Vol.44, No.6, December 1997, p.2367
- [9] R.Harboe-Sorensen, "A Summary of ESA's Ground Test Data", 1997 RADECS Conference Data Workshop, p.89
- [10] J.R.Coss *et al.*, "Device SEE Susceptibility Update: 1996-1998", 1999 IEEE NSREC Data Workshop, p.60
- [11] A.H.Johnston *et al.*, "Using Commercial Semiconductor Technologies in Space", Proceedings of the Third RADECS Conference, P.175-182 (1995)
- [12] F.W.Sexton *et al.*, "Single Event Gate Rupture in Thin Gate Oxides", IEEE Trans. Nucl. Science, Vol.44, No.6, December 1997, p.2345
- [13] A.H.Johnston *et al.*, "Breakdown of Gate Oxides During Irradiation with Heavy Ions", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2500
- [14] F.W.Sexton *et al.*, "Precursor Ion Damage and Angular Dependence of Single Event Gate Rupture in Thin Oxides", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2509
- [15] C.Poivey *et al.*, "Radiation Characterisation of Commercially Available 1Mbit/4Mbit SRAMs for Space Applications", 1998 NSREC Data Workshop, p.68

- [16] A.J.Lelis *et al.*, "Radiation Response of Advanced Commercial SRAMs", IEEE Trans. Nucl. Science, Vol.43, No.6, December 1996
- [17] L.W.Massengill, "Cosmic and Terrestrial Single-Event Radiation Effects in Dynamic Random Access Memories", IEEE Trans. Nucl. Science, Vol.43, No.2, April 1996, p.576
- [18] G.J.Hofman *et al.*, "Light-Hadron Induced SER and Scaling Relations for 16- and 64-Mb DRAMS", IEEE Trans. Nucl. Science, Vol.47, No.2, April 2000, p.403
- [19] K.A.LaBel *et al.*, "Emerging Radiation Hardness Assurance (RHA) Issues: A NASA Approach for Space Flight Programs", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2727
- [20] R.Koga *et al.*, "Single Event Functional Interrupt (SEFI) Sensitivity in Microcircuits", proc. of RADECS97, 15-19 September 1997, Cannes, France, p.311
- [21] H.R.Schwartz *et al.*, "Single Event Upset in Flash Memories", IEEE Trans. Nucl. Science, Vol.44, No.6, December 1997, p.2315
- [22] D.N.Nguyen *et al.*, "Radiation Effects on Advanced Flash Memories", IEEE Trans. Nucl. Science, Vol.46, No.6, December 1999, p.1744
- [23] K.A.LaBel *et al.*, "Current Single Event Effect Test Results for Candidate Spacecraft Electronics", 1997 NSREC Data Workshop, p.19
- [24] P.Hsu *et al.*, "Single Event Effects and Total Ionizing Dose Results of a Low Voltage EEPROM", presented at the 2000 NSREC, to be published in the Conference Data Workshop
- [25] R.Katz *et al.*, "Radiation Effects on Current Field Programmable Technologies", IEEE Trans. Nucl. Science, Vol.44, No.6, December 1997, p.1945
- [26] R.Katz *et al.*, "Current Radiation Issues for Programmable Elements and Devices", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2600
- [27] M.Ohlsson *et al.*, "Neutron Single Event Upsets in SRAM-Based FPGAs", 1998 IEEE NSREC Data Workshop, p.177, http://www.xilinx.com/appnotes/FPGA\_NSREC98.pdf
- [28] C.Carmichael *et al.*, "SEU Mitigation Techniques for Virtex FPGAs in Space Applications", presented at MAPLD99, http://www.xilinx.com/appnotes/VtxSEU.pdf
- [29] E.Fuller *et al.*, "Radiation Test Results of the Virtex FPGA and ZBT SRAM for Space Based Reconfigurable Computing", presented at MAPLD99, http://www.xilinx.com/appnotes/VtxTest.pdf
- [30] R.Koga *et al.*, "SEE Sensitivity of FPGAs with Amorphous Silicon Antifuse", presented at the 12<sup>th</sup> SEE Symposium, Manhattan Beach, Ca, April 2000
- [31] Actel Application Note, "Design Techniques for Radiation-Hardened FPGAs", September 1997, http://www.actel.com/appnotes/5192642.pdf
- [32] Actel Technical Paper, "Using IEEE 1149.1 JTAG Circuitry in Actel SX Devices", http://www.actel.com/products/devices/radhard/JTAG\_SX.pdf
- [33] See the web pages of Xilinx and Actel for presentation of their products and for a collection of application notes and papers on the subject: http://www.actel.com/hirel/, http://www.xilinx.com/products/hirel\_qml.htm#Radiation\_Hardened

- [34] Xilinx Application Note, "SEU Mitigation Techniques for the XQR4000XL", http://www.xilinx.com/xapp/xapp181.pdf
- [35] Actel Application Note, "Enhanced Tools for Minimizing Single Event Upset Effects", http://www.actel.com/products/devices/radhard/R298Note.pdf
- [36] Actel Application Note, "Using Synplify to Design in Actel Radiation-Hardened FPGAs", http://www.actel.com/appnotes/SynplifyRH.pdf
- [37] S.H.Crain *et al.*, "Single Event Effects Test Results for the 80C186 and 80C286 Microprocessors and the SMJ320C30 and SMJ320C40 Digital Signal Processors", 1998 NSREC Conference Data Workshop, p.51
- [38] S.H.Crain *et al.*, "Radiation Effects in a Fixed-Point Digital Signal Processor", 1999 NSREC Conference Data Workshop, p.30
- [39] V.Asenek *et al.*, "SEU Induced Errors Observed in Microprocessor Systems", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2876
- [40] D.M.Hiemstra *et al.*, "Single Event Upset Characterization of the Pentium MMX and Pentium II Microprocessors using Proton Irradiation", IEEE Trans. Nucl. Science, Vol.46, No.6, December 1999, p.1453
- [41] D.W.Emily, "Total Dose Response of Bipolar Microcircuits", Short course presented at the 1996 NSREC conference
- [42] E.W.Enlow *et al.*, "Response of Advanced Bipolar Process to Ionizing Radiation", IEEE Trans. Nucl. Science, Vol.38, No.6, December 1991, p.1342
- [43] A.H.Johnston *et al.*, "Enhanced Damage in Linear Bipolar Integrated Circuits at Low Dose Rate", IEEE Trans. Nucl. Science, Vol.42, No.6, December 1995, p.1650
- [44] R.L.Pease *et al.*, "Evaluation of Proposed Hardness Assurance Method for Bipolar Linear Circuits with ELDR Sensitivity", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2665
- [45] A.H.Johnston *et al.*, "Enhanced Damage in Bipolar Devices at Low Dose Rates: Effects at Very Low Dose Rates", IEEE Trans. Nucl. Science, Vol.43, No.6, December 1996, p.3049
- [46] B.G.Rax *et al.*, "Displacement Damage in Bipolar Linear Integrated Circuits", IEEE Trans. Nucl. Science, Vol.46, No.6, December 1999, p.1660
- [47] B.G.Rax *et al.*, "Proton Damage Effects in Linear Integrated Circuits", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2632
- [48] J.Howard, "SETs in 139 Comparators", and S.Buchner, "Total-Dose and Displacement-Damage Effects on the Single-Event Transients in Voltage Comparators", papers presented at the 12<sup>th</sup> SEE symposium, Manhattan Beach, Ca, April 2000
- [49] J.L.Titus *et al.*, "Experimental Studies of Single-Event Gate Rupture and Burnout in Vertical Power MOSFETs", IEEE Trans. Nucl. Science, Vol.43, No.2, April 1996, p.533
- [50] G.H.Johnson *et al.*, "Catastrophic Single-Event Effects in the Natural Space Radiation Environment", Short Course presented at the 1996 NSREC conference
- [51] D.L.Oberg *et al.*, "First Observations of Power MOSFET Burnout with High Energy Neutrons", IEEE Trans. Nucl. Science, Vol.43, No.6, December 1996, p.2913

- [52] E.Normand *et al.*, "Neutron-Induced Single Event Burnout in High Voltage Electronics", IEEE Trans. Nucl. Science, Vol.44, No.6, December 1997, p.2358
- [53] J.L.Titus *et al.*, "Proton-Induced Dielectric Breakdown of Power MOSFETs", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2891
- [54] B.G.Rax *et al.*, "Total Dose and Proton Damage in Optocouplers", IEEE Trans. Nucl. Science, Vol.43, No.6, December 1996, p.3167
- [55] R.Reed *et al.*, "Test Report of Proton and Neutron Exposures of Devices that utilize Optical Components and are contained in the CIRS Instrument", http://flick.gsfc.nasa.gov/radhome/papers/i090397.html
- [56] B.Hallgren, report to the 2<sup>nd</sup> RADWG meeting, http://pignard.home.cern.ch/pignard/radiationworki/default.htm
- [57] K.A.LaBel *et al.*, "Proton-Induced Transients in Optocouplers: In-Flight Anomalies, Ground Irradiation Test, Mitigation and Implications", IEEE Trans. Nucl. Science, Vol.44, No.6, December 1997, p.1885
- [58] A.H.Johnston *et al.*, "Angular and Energy Dependence of Proton Upset in Optocouplers", IEEE Trans. Nucl. Science, Vol.46, No.6, December 1999, p.1335
- [59] The main web pages containing radiation test data are from JPL (http://radnet.jpl.nasa.gov/cgi-win/ 1/FrontPage CGI Project?|main), GSFC (http://radhome.gsfc.nasa.gov/top.htm), DTRA (http://erric.dasiac.com/)
- [60] http://rd49.web.cern.ch/RD49/cotswelcome.html
- [61] "IEEE Radiation Effects Data Workshop", a small volume published by IEEE every year and containing papers presented at the Data Workshop held in conjunction with the NSREC conference. These volumes can be ordered from IEEE.
- [62] http://www.estec.esa.nl/qcswww/tos\_qca/
- [63] M.Huhtinen and F.Faccio, "Computational method to estimate Single Event Upset in an accelerator environment", Nucl. Inst. and Meth. A450, p.155, August 2000
- [64] G.J.Hofman *et al.*, "Light-Hadron Induced SER and Scaling Relations for 16- and 64-Mb DRAMS", IEEE Trans. Nucl. Science, Vol.47, No.2, April 2000, p.403
- [65] The author and Mika Huhtinen, both at CERN, can quickly assist you in this calculation. You can contact them giving the four parameters of the heavy ions Weibull fit, and they will provide you with the component cross-section in an LHC-like environment. Contact: Federico.Faccio@cern.ch or Mika.Huhtinen@cern.ch
- [66] "A global radiation test plan for CMS electronics in HCAL, Muons and Experimental Hall", http://cmsdoc.cern.ch/~faccio/proced.pdf
- [67] G.Berger *et al.*, "CYCLONE A Multipurpose Heavy Ion, Proton and Neutron SEE Test Site", 1997 RADECS Conference Data Workshop, p.51
- [68] http://www.cyc.ucl.ac.be/
- [69] For more information or to request beam time through this program, contact the coordinator: federico.faccio@cern.ch

- [70] ESA SCC Basic Specification 22900, available on the Web at: http://atlas.web.cern.ch/Atlas/GROUPS/FRONTEND/WWW/22900.pdf
- [71] US MIL-STD-883E Method 1019.4/1019.5 "Ionizing Radiation (Total Dose) Test Procedure", http://www.dscc.dla.mil/Downloads/MilSpec/Docs/MIL-STD-883/std883inc-1000.pdf
- [72] A.H.Johnston, G.M.Swift, "Radiation Test Requirements for Ionization and Displacement Damage", JPL internal report, July 1999
- [73] A.H.Johnston (JPL), private communication
- [74] K.A.LaBel and M.M.Gates, "Single-Event-Effect Mitigation from a System Perspective", IEEE Trans. Nucl. Science, Vol.43, No.2, April 1996, p.654
- [75] J.D.Kinnison, "Achieving Reliable, Affordable Systems", Short Course presented at the 1998 NSREC conference
- [76] S.D.Clark *et al.*, "Plastic Packaging and Burn-in Effects on Ionizing Dose Response in CMOS Microcircuits", IEEE Trans. Nucl. Science, Vol.42, No.6, December 1995, p.1607
- [77] C.Barillot *et al.*, "Effects of Reliability Screening Tests on Bipolar Integrated Circuits During Total Dose Irradiation", IEEE Trans. Nucl. Science, Vol.45, No.6, December 1998, p.2638
- [78] J.J.Wall *et al.*, "The Effects of Space Radiation and Burn-in on Plastic Encapsulated Semiconductors", 1999 NSREC Data Workshop, p.96
- [79] J.L.Gorelick *et al.*, "Radiation Evaluation of Plastic Encapsulated Transistors and Microcircuits for Use in Space Applications", 1999 NSREC Data Workshop, p.102