Why do SSDs fail? Reasons for SSD failures.


Why do SSDs fail?



SSDs (Solid State Drives) store data in Flash-NAND chips. They are faster and more efficient (at least that's what manufacturers and sellers claim) than hard disks. They are quiet and consume less electricity. Because they store data on integrated circuits, they do not need a complex, precise, and fault-prone mechanism.
The resistance of SSDs to mechanical failures has become a reliability dogma in marketing materials. Falling prices have made SSDs very popular in recent years, and at the same time reality has brutally dispelled the myth that they are failure-free. In theory it is possible to build electronic devices that operate without failure for years or even decades, but in practice it is not hard to come across computer components that fail on their first day of operation. And very often it is the SSD that refuses to cooperate as early as the operating system installation stage. So why do SSDs have such a high failure rate, and to what extent can they be trusted as data carriers?



Data storage in SSD



Modified npn field-effect transistors (FETs) are responsible for storing data in media based on Flash-NAND chips (not only SSDs, but also pendrives, memory cards and the memories built into devices such as smartphones). Such a transistor consists of three electrodes - areas of the semiconductor: two regions rich in electrons (and therefore negatively charged, hence the designation "n" for "negative"), the source and the drain, separated by an electron-poor, positively charged region (hence the letter "p" for "positive"). In these transistors, electrons are the charge carriers responsible for the flow of current. Between the source and drain there is an electrically insulated gate used to control the transistor. There are also pnp transistors, in which the source and drain are positively charged and the gate negatively, but these types of transistors are not used in flash chips.
Field-effect transistors are called unipolar because the current in them is carried only by majority carriers. For npn transistors these are electrons, and for pnp - the so-called holes. In addition to unipolar transistors, bipolar transistors can be found in other applications, in which electric current is carried by both majority and minority carriers. Bipolar transistors are also called junction or layer transistors.
In a field-effect transistor, under the influence of an electric field (which is why it is called "field-effect"), a conductive path called the n-channel is created, allowing electrons to flow between the source and drain. If we apply a voltage to the gate, the n-channel closes and current stops flowing between the source and drain. Transistors in which the n-channel is normally open and closes when a voltage is applied to the gate are called depletion-mode transistors; electronics also uses enhancement-mode transistors, in which applying a voltage to the gate opens the channel.
To use such a transistor for non-volatile data storage, the gate had to be modified. It was divided into a control gate and, the part most important for storing information, a floating gate. The floating gate is an electrically insulated area in which an electric charge can be accumulated and retained after the device is disconnected from the power supply. And that is what storing data is all about - not losing it when the medium is disconnected from power.
If we accumulate an electric charge in the floating gate, we achieve the effect of closing the n-channel, just as if we had applied a voltage to the gate. If the floating gate is empty, the n-channel remains open and current can flow freely between the source and drain. An empty floating gate is therefore usually interpreted as a logical one, and a charged gate that closes the n-channel as a logical zero.
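As a minimal illustration, the simplified read rule described above can be sketched in a few lines of Python (the function name and the open/closed model are purely illustrative; real NAND sensing is considerably more involved):

    # Simplified SLC read rule: an empty floating gate leaves the n-channel
    # open and reads as a logical 1; a charged gate closes the channel and
    # reads as a logical 0.

    def read_slc_cell(floating_gate_charged: bool) -> int:
        return 0 if floating_gate_charged else 1

    assert read_slc_cell(floating_gate_charged=False) == 1   # erased cell
    assert read_slc_cell(floating_gate_charged=True) == 0    # programmed cell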



Data addressing in SSD



From a user's perspective, we are accustomed to addressing data in the logical structures of file systems: every day we use partitions, files and directories, and if we take a somewhat closer interest in computers, we also know about clusters and sectors. NAND chips, on the other hand, take no interest in us at all - they know nothing about sectors.
LBA sectors are used in communication between devices and software. On its external interface, for compatibility with communication protocols such as ATA, an SSD accepts commands that operate on specific LBA addresses and returns data for them. Internal communication between the controller and the memory chips, however, follows the ONFI standard.
According to this standard, data is addressed in pages and blocks. A page is the minimum unit of reading and writing (programming). Its size corresponds to the size of the page register and currently reaches approximately 16 kB - the equivalent of 32 LBA sectors of 512 B each, which the controller cuts out of the appropriate page in response to the computer's request. In addition to user sectors, a page also contains redundant information storing various kinds of data necessary for the correct operation of the SSD. The arrangement of user data and redundant information within a page is called the page format.
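As a rough sketch of the arithmetic involved, the Python snippet below locates an LBA sector inside a page, assuming a page that carries 32 user sectors of 512 B each; real page formats also contain redundant areas and differ between manufacturers:

    SECTOR_SIZE = 512          # bytes per LBA sector
    SECTORS_PER_PAGE = 32      # 32 * 512 B = 16 kB of user data per page

    def locate_sector(lba: int) -> tuple[int, int]:
        """Return (logical page number, byte offset within the page) for an LBA."""
        page = lba // SECTORS_PER_PAGE
        offset = (lba % SECTORS_PER_PAGE) * SECTOR_SIZE
        return page, offset

    print(locate_sector(1000))   # -> (31, 4096): LBA 1000 sits 4 kB into page 31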
How does the controller know in which page the sector it is looking for is stored? The subsystem that translates logical addresses into physical ones (FTL - Flash Translation Layer) is responsible for this; we will return to it several times. The next addressing unit is the block, consisting of anywhere from several dozen to several hundred pages. The block is the minimum unit of data erasure.
Unlike magnetic media, Flash chips cannot directly overwrite previous content. We can only program transistors whose floating gates have previously been erased - emptied of electrons. This is why brand-new or completely empty chips read on a programmer return the value 0xFF. Empty blocks and unprogrammed trailing portions of pages also return this value, if such unused areas exist in the page format.
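The following Python sketch illustrates this asymmetry, under the simplifying assumption that a byte stands for a group of SLC cells: programming can only clear bits from 1 to 0, and only a block erase (not modeled here) restores them to 1:

    ERASED = 0xFF   # an erased NAND byte reads as all ones

    def program_byte(current: int, new: int) -> int:
        """Programming can only turn 1s into 0s; anything else needs an erase first."""
        result = current & new
        if result != new:
            raise ValueError("cannot overwrite without erasing the block first")
        return result

    first = program_byte(ERASED, 0b10101010)   # fine: only 1 -> 0 transitions
    # program_byte(first, 0b11110000)          # would raise: requires an erase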



Basic operations and their impact on the wear of Flash-NAND chips



Since with Flash chips we have no physical ability to directly overwrite existing content, we can only program (write to) transistors whose floating gates are empty. Three basic operations therefore have to be supported: programming the desired content, reading it, and erasing outdated data. Physically, editing involves reading the original content into a buffer, modifying it, and writing it to a different physical location.
Here again we touch on the Flash Translation Layer, which must record the relocation of the LBA addresses to a new physical location and designate their original location for erasure. Erased blocks can be reused during subsequent writes, at which point new LBA addresses are assigned to them in the translation tables.
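A minimal sketch of this bookkeeping, assuming a toy Python FTL with only a handful of physical pages (real FTLs also handle wear leveling, garbage collection and power-loss protection):

    class TinyFTL:
        def __init__(self, physical_pages: int):
            self.mapping = {}                        # logical page -> physical page
            self.free = list(range(physical_pages))  # pool of erased pages
            self.stale = []                          # pages waiting for a block erase

        def write(self, logical_page: int, data: bytes) -> int:
            new_phys = self.free.pop(0)              # program a fresh, erased page
            old_phys = self.mapping.get(logical_page)
            if old_phys is not None:
                self.stale.append(old_phys)          # old copy becomes invalid
            self.mapping[logical_page] = new_phys
            return new_phys

    ftl = TinyFTL(physical_pages=8)
    ftl.write(0, b"version 1")
    ftl.write(0, b"version 2")        # same logical page, different physical page
    print(ftl.mapping, ftl.stale)     # -> {0: 1} [0]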
Flash-NAND chips wear out gradually by their very nature. This has to do with the way electrons are placed into the floating gates during programming and released during erasing. Because the floating gate is electrically separated from the rest of the transistor by an insulator, electrons that are to be placed in it or released from it must overcome the potential barrier created by that insulator. This is usually done using Fowler-Nordheim tunneling, a phenomenon known from quantum mechanics.
Fowler-Nordheim tunneling uses the wave properties of electrons to overcome the potential barrier, but it requires higher voltages, reaching a dozen or even 20 V. The operation involves energy losses dissipated as Joule heat (familiar from chips that need heat sinks, and from electric heaters) and stresses the insulator, damaging it over time. A degraded insulator no longer holds the charge in the floating gate, which allows electrons to escape and thus leads to data loss.
Reading involves applying a voltage between the source and the drain (we check whether the n-channel is open or closed, and thus indirectly whether the floating gate is empty or charged). If the transistor is open, the controller interprets it as a logical one; if closed, as a zero. This operation does not stress the floating-gate insulator and is therefore neutral for the life of the chip. That is why the lifespan of NAND chips is measured in the number of program/erase (P/E) cycles.



How about increasing the number of bits stored in the transistor?



Flash-NAND chips were very expensive in their early days, so it is no surprise that manufacturers looked for ways to improve the capacity-to-price ratio. One way to double the capacity of a chip was to store two bits of data in a single transistor. This effect can be achieved by charging the floating gate to a specific level, which closes the n-channel to a correspondingly controlled degree. To store two bits in one transistor, we need to distinguish 4 logical states (00, 01, 10 and 11), and therefore 4 floating-gate charge levels corresponding to these states.
Naturally, the accountants liked this idea and expected engineers to push multi-state technology further, but it was not that simple. Placing a third bit in the transistor no longer doubles the chip's capacity, but increases it by only half. And what happens to the transistor's charge levels? Their number still has to double. If we want to store as many as three bits in one transistor, we need to distinguish 8 charge levels corresponding to the logical values from 000 to 111. Of course, each subsequent bit added to the transistor increases the capacity of the chip by a smaller and smaller fraction, while the number of distinguishable floating-gate charge levels required grows exponentially.
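The worked numbers behind this trade-off, for 1 to 5 bits per cell, as a small Python check:

    # Each extra bit per cell doubles the number of charge levels that must be
    # distinguished, but adds a smaller and smaller fraction of extra capacity.

    for bits in range(1, 6):                  # SLC, MLC, TLC, QLC, PLC
        levels = 2 ** bits
        if bits == 1:
            print(f"{bits} bit/cell:  {levels:2d} levels (baseline)")
        else:
            gain = (bits / (bits - 1) - 1) * 100
            print(f"{bits} bits/cell: {levels:2d} levels, +{gain:.0f}% capacity over {bits - 1}-bit cells")

    # 2 bits/cell:  4 levels, +100% capacity over 1-bit cells
    # 3 bits/cell:  8 levels, +50% capacity over 2-bit cells
    # 4 bits/cell: 16 levels, +33% capacity over 3-bit cells
    # 5 bits/cell: 32 levels, +25% capacity over 4-bit cells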
The deteriorating signal-to-noise ratio resulting from the ever smaller gaps between the voltage values representing successive logical states encourages read errors and bit errors. The programming of transistors must also be carried out ever more precisely, because errors can creep in during writing as well. Injecting electrons into floating gates by quantum tunneling does not allow for repeatable precision and accuracy; at best, the gates can be charged with approximately the intended number of electrons. This means that many floating gates hold charges close in value to neighboring logical states, and some end up representing states other than the intended ones.
Theoretically, it would be possible to respond to each write error by repeating the operation, but in practice this is not feasible. When reprogramming a page of thousands of bytes, i.e. tens of thousands of bits, it is very likely that some errors will occur again on the next write - you could wait forever for a write to complete without errors. And let's not forget that each subsequent write wears the chip and brings it closer to final failure. Therefore, as long as the number of errors is acceptably small, one has to give up the pursuit of perfection and rely on the mathematics of error correction codes (ECC - Error Correction Code).
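A sketch of the decision the controller makes on every read, assuming an illustrative ECC strength (the actual correction capability depends on the code used and the page format):

    ECC_STRENGTH = 40   # correctable bits per codeword; purely illustrative

    def handle_read(bit_errors_found: int) -> str:
        if bit_errors_found == 0:
            return "clean read"
        if bit_errors_found <= ECC_STRENGTH:
            return f"corrected {bit_errors_found} bit errors"
        return "uncorrectable error - data lost, block should be retired"

    print(handle_read(3))    # corrected 3 bit errors
    print(handle_read(75))   # uncorrectable error - data lost, block should be retired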
When chips offering two bits per transistor were introduced to the market, the older chips storing one bit of data per transistor were designated SLC (Single Level Cell), while those with two bits per transistor were called MLC (Multi Level Cell). Chips storing three bits in each transistor are TLC (Triple Level Cell), and the newest currently available Flash-NAND chips, marked QLC (Quad Level Cell), store 4 bits in each transistor. And since the MLC designation is sometimes used generically for all types of multi-state memory, some less honest sellers also use the abbreviation to label inferior TLC and QLC chips.
Placing additional bits of information in a transistor not only lowers the signal-to-noise ratio, but also hurts the performance and lifespan of Flash-NAND chips. Reading requires comparing the voltage at which the transistor opens against several reference values, which takes time. Programming is also carried out in several stages, which means not only longer operation times but also greater stress on the floating-gate insulators. As a consequence, the insulator degrades faster, and the lifespan of the chips, which for SLC memory exceeded 100,000 program/erase operations, drops for MLC memory to a still reasonable level of several thousand P/E cycles. For TLC memory, the values declared by manufacturers are usually in the range of 3-5 thousand cycles, but the weakest chips of this class withstand only about 1,500 cycles. For the latest QLC memories, the lifespan drops to a few hundred (typically about 600) program/erase operations.
Due to the drastic decline in the durability of NAND chips, manufacturers resort to the marketing trick of replacing information about the available program/erase resource with the TBW (Total Bytes Written) parameter. The information that a 1 TB SSD has a TBW of 1.5 PB certainly inspires much more confidence than the fact that the chips used in it have a lifespan of one and a half thousand P/E cycles. We can calculate this lifespan by dividing the TBW parameter by the capacity of the medium. And let's not forget that the minimum write unit is a page, often 8 or 16 kB, so we usually consume this write allowance much faster than it might seem at first glance.
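The conversion described above, using the figures from the text (a 1 TB drive with a 1.5 PB TBW rating), as a quick Python check:

    capacity_tb = 1.0        # drive capacity in TB
    tbw_tb = 1500.0          # 1.5 PB of rated writes, expressed in TB

    pe_cycles = tbw_tb / capacity_tb
    print(pe_cycles)         # -> 1500.0 implied P/E cycles

    # Because the smallest write unit is a whole page (often 8 or 16 kB), many
    # small scattered writes consume this budget faster than the amount of user
    # data alone would suggest.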
Despite the exponentially growing problems with bit errors and chip lifespan, some manufacturers are already announcing chips that store 5 bits in each transistor, to be marked with the PLC symbol. PLC chips would have to distinguish 2^5 = 32 charge levels. With a nominal supply voltage of 3.3 V, this means distinguishing successive logical states roughly every 0.1 V. At the same time, it is hard to expect the lifespan of such chips to exceed 100 program/erase operations. For a rewritable medium, that is not much at all.
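An order-of-magnitude check of the ~0.1 V figure quoted above:

    supply_voltage = 3.3            # nominal supply voltage in volts
    levels = 2 ** 5                 # 32 charge states for a 5-bit PLC cell

    spacing = supply_voltage / levels
    print(round(spacing, 3))        # -> 0.103 V between adjacent logical states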



Reducing the size of the transistor



Efforts to shrink components have been underway since the beginning of modern electronics, even before integrated circuits appeared. Shrinking the lithographic process allows the production of ever cheaper, smaller, less power-hungry, less heat-generating, and at the same time more efficient and more highly integrated circuits. This process also applies to flash chips: if we reduce the size of the transistors, we can cram more of them into an integrated-circuit package of standardized dimensions, and thus increase its capacity.
But this process cannot continue indefinitely. Here we run into the limitation of the physical size of atoms - a silicon (Si) atom, for example, has a diameter of less than a quarter of a nanometer. For transistors only a few nanometers in size, it is very difficult to move production from the laboratory to the conditions of factory mass production.
Another obstacle to shrinking transistors is the need to use light of ever shorter wavelengths in the lithography process. Already, wavelengths at the very edge of extreme ultraviolet are required; if transistors are to shrink further, X-rays will be needed. Using ever higher-frequency radiation also demands an ever cleaner environment, which is why manufacturing processes using extreme ultraviolet have to be carried out in a vacuum.
This problem also affects processor manufacturers, who find it increasingly difficult to meet market expectations due to production difficulties and high scrap rates. Moore's Law, which for decades held that the number of transistors in a chip doubles every year and a half, has recently been revised; it is currently assumed that the number of transistors doubles every two years. It is possible that Moore's Law will soon stop holding altogether.
Reducing the size of a transistor also means reducing the size of its components, including the thickness of the insulator layer and the volume of the floating gate. The thickness of the insulator determines its durability and how effectively it retains the charge accumulated in the floating gate - critical factors for reliable data storage. Too thin an insulator layer not only degrades more easily during erase and program operations, but also lets individual electrons escape, which in turn can shift the charge state far enough that, on reading, the content is interpreted as a different logical state.
The volume of the floating gate - or more precisely, the number of atoms it contains - also has a significant impact on information storage. Electrons, being negatively charged, repel one another, so despite their small size they cannot be packed into the floating gate in arbitrarily large quantities. The electrons must sit in the outer valence shells of the atoms, where for most types of atoms there can be at most 8 electrons per atom. This limitation follows from quantum mechanics - the Pauli exclusion principle, which states that each orbital can hold at most two electrons; silicon, from which the transistors are made, has 4 such orbitals in its outer valence shell.
The recommended thickness of the insulator layer for durable and safe data storage is approximately 4 nm; in chips made in 15 nm lithography it drops to approximately 2 nm. The number of electrons that can be stored in the floating gate also decreases, from several thousand to approximately one thousand. In practice, this means that in the latest TLC and QLC chips the escape of just a few dozen electrons causes an incorrect logical state to be read. And the thinner the insulator, the more easily electrons escape. It should therefore come as no surprise that the highest failure rates occur in chips made in lithography below 20 nm that at the same time store three or four bits per transistor.
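A rough back-of-the-envelope calculation behind the "few dozen electrons" claim, using order-of-magnitude figures rather than vendor data:

    electrons_in_gate = 1000        # order of magnitude for a small floating gate
    qlc_levels = 2 ** 4             # 16 charge states in a QLC cell

    gap = electrons_in_gate / qlc_levels
    print(gap)                      # -> 62.5 electrons between adjacent states

    # With only a few dozen electrons separating neighbouring states, losing a
    # few dozen of them through a thin insulator is enough to flip the value read.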



3D-NAND



Another way to increase the capacity of NAND chips is to stack transistors in multiple layers, one above another. This multiplies the memory capacity without increasing the surface area of the chip. The solution appeared relatively recently, although the idea itself seems so obvious that it should have arrived long ago. Well, it is not without its drawbacks.
The first problem is induction between adjacent transistors. It occurred even in two-dimensional, planar chips, where it created parasitic capacitances and with them the risk of bit errors. What happens when further layers are added to the chip? In addition to the influence of the charges accumulated in neighboring transistors in the same plane, there is the additional influence of charges located in the layers above and below.
Another problem is the previously mentioned Joule heat, emitted especially during erase and program operations. We already know that it promotes degradation of the floating-gate insulator, so it should be dissipated into the environment as quickly as possible.
The rate of heat dissipation depends on many factors, the most important of which is the surface area through which the heat is released. This is why the best heat sinks have many thin fins that give them a large surface area, and why adding a heat sink increases the surface through which heat can leave the chip. In multi-layer chips, however, the crux of the problem is preventing heat from accumulating inside the chip and properly removing it from between the layers.
Both problems grow as the number of layers increases and the spacing between them decreases. The electromagnetic field falls off with the square of the distance, so the smaller the distances between transistors, the stronger the inductive interactions between the charges stored in them. Thermal effects also become more destructive as the features placed in the silicon structure shrink.



The most common failure mechanism of semiconductor data carriers



We already know that Flash-NAND chips wear out in operation, and that this wear is the main cause of bit errors. Typically, when the number of bit errors in a block exceeds what the ECC codes can correct, the block is considered damaged, entered on the defect list and excluded from further use. Until recently, defect management algorithms were effective enough that warning signs such as read problems or damage to files and logical structures practically never appeared before a failure. Even so, failures usually strike suddenly - the computer resets out of the blue or will not load the operating system after startup, and a moment of diagnostics shows that the BIOS either does not see the SSD at all or recognizes it under some strange name, with zero or suspiciously low capacity.
This is because wear and damage affect not only the blocks storing user information, but also the blocks containing the translation tables essential for correct data addressing. If the problem hits entries in those tables, logical addresses can no longer be correctly matched to physical ones, and the controller is unable to build an image of the logical structures or provide access to user files. If errors appear in this (or any other important) part of the firmware, the controller cuts off access to the NAND chips. In response to the BIOS identification request, instead of the SSD model name it either remains in a busy state (hangs) or presents the so-called technological passport (e.g. SATAFIRM S11), and instead of the capacity of the whole drive it tends to report the capacity of some available buffer.
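As a final sketch, the snippet below illustrates why a damaged translation table makes the whole drive inaccessible even though most data pages are physically intact (the structures and messages are purely illustrative):

    physical_pages = {7: b"user data", 12: b"more user data"}   # intact NAND content
    translation_table = {}   # corrupted or unreadable -> effectively empty

    def read_logical_page(logical_page: int) -> bytes:
        phys = translation_table.get(logical_page)
        if phys is None:
            raise IOError("logical address cannot be resolved - drive appears dead")
        return physical_pages[phys]

    try:
        read_logical_page(0)
    except IOError as exc:
        print(exc)   # the data is still in the chips, but there is no way to reach it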