
1.3 Fault Recovery in TrueFFS

A fault can occur whenever data is written to flash: in response to a write request from the file system, during garbage collection, or even while TrueFFS formats or erases the flash. In all cases, TrueFFS can recover from the fault. If the fault interrupts data that is being written to flash for the first time, that data is lost. However, TrueFFS takes great care to ensure that all data already resident in flash remains recoverable. Thus, the file and directory structures of the disk are retained.

1.3.1 "Erase after Write" Guarantees Data Integrity

The key to the robustness of TrueFFS is the fact that it uses an "erase after write" algorithm. When updating a sector on the flash medium, the previous data is not erased until after the update operation has completed and the newly stored data has been verified. One consequence of this "erase after write" algorithm is that a data sector cannot be in a partially written state. Either the operation completed, in which case the new sector is valid, or the operation did not complete, in which case the old sector is still valid.
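The ordering described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not TrueFFS source code: the SimFlash structure, the sim_* functions, the sector sizes, and the fail_next_write hook are all hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SECTOR_SIZE 16
#define NUM_PHYS    8
#define NUM_LOGICAL 4
#define UNMAPPED    (-1)

typedef enum { SECTOR_FREE, SECTOR_VALID, SECTOR_GARBAGE } SectorState;

/* Hypothetical in-memory model of a flash medium (not a TrueFFS type). */
typedef struct {
    SectorState   state[NUM_PHYS];
    unsigned char data[NUM_PHYS][SECTOR_SIZE];
    int           map[NUM_LOGICAL];  /* logical -> physical sector */
    bool          fail_next_write;   /* hook: simulate an interrupted write */
} SimFlash;

void sim_init(SimFlash *f)
{
    memset(f, 0, sizeof *f);                 /* every sector SECTOR_FREE */
    memset(f->data, 0xFF, sizeof f->data);   /* erased flash reads as 0xFF */
    for (int i = 0; i < NUM_LOGICAL; i++)
        f->map[i] = UNMAPPED;
}

/* Program one physical sector, then verify it by reading it back.
 * A simulated fault stops the write halfway through the sector. */
static bool sim_program(SimFlash *f, int phys, const unsigned char *src)
{
    size_t n = f->fail_next_write ? SECTOR_SIZE / 2 : SECTOR_SIZE;
    f->fail_next_write = false;
    memcpy(f->data[phys], src, n);
    return memcmp(f->data[phys], src, SECTOR_SIZE) == 0;
}

/* Update a logical sector: write and verify the new copy first, and
 * retire the old copy only after the new one is known to be good. */
bool sim_update(SimFlash *f, int logical, const unsigned char *src)
{
    assert(logical >= 0 && logical < NUM_LOGICAL);

    int fresh = -1;
    for (int p = 0; p < NUM_PHYS; p++) {
        if (f->state[p] == SECTOR_FREE) { fresh = p; break; }
    }
    if (fresh < 0)
        return false;                       /* no free sector in this model */

    if (!sim_program(f, fresh, src)) {
        f->state[fresh] = SECTOR_GARBAGE;   /* discard the partial copy */
        return false;                       /* old mapping is untouched */
    }

    int old = f->map[logical];
    f->state[fresh] = SECTOR_VALID;
    f->map[logical] = fresh;
    if (old != UNMAPPED)
        f->state[old] = SECTOR_GARBAGE;     /* old copy retired last */
    return true;
}

/* Return the current contents of a logical sector, or NULL if unmapped. */
const unsigned char *sim_read(const SimFlash *f, int logical)
{
    int p = f->map[logical];
    return p == UNMAPPED ? NULL : f->data[p];
}
```

In this model, a call to sim_update() that is interrupted leaves sim_read() returning the previous contents, which is the "old sector is still valid" case described above.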

This has obvious advantages for the stability of user data already written to the flash medium. If the write phase of an update operation fails, the old data is not lost or in any way corrupted. The "erase after write" algorithm also has positive consequences for the coherency of the flash memory map. Although TrueFFS uses a RAM-resident mapping table to track the contents of flash memory, it is careful never to store critical mapping information in RAM only. Thus, if an update to the mapping information is interrupted by a power loss or the removal of the flash medium, the old version of the mapping information is still valid.

When power is restored (or the medium is reconnected), TrueFFS is able to use information resident in flash to reconstruct or verify the RAM-resident version of the flash mapping table. Given that the mapping information could reside anywhere on the medium, this might, at first, seem to be an impossible task.

Fortunately, each erase unit in flash memory maintains header information at a predictable location. By carefully cross-checking the header information in each erase unit, TrueFFS is able to rebuild or verify a RAM copy of the flash mapping table. Thus, after data is safely written to flash, that data is essentially immune to power failures. In fact, the only consequence of such interruptions is the need to restart any garbage collection that might have been under way at the time of the failure. ¹
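The rebuild step can be pictured as a scan over the erase units. The following is a minimal sketch, assuming a hypothetical two-field header, an arbitrary magic value, and a single alternate header offset; none of these names or layouts are taken from TrueFFS itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_UNITS      4
#define UNIT_SIZE      64
#define ALT_HDR_OFFSET 32            /* hypothetical alternate header offset */
#define HDR_MAGIC      0x54464653u   /* arbitrary marker for this sketch */
#define NO_UNIT        (-1)

/* Hypothetical erase-unit header (not the real on-flash layout). */
typedef struct {
    uint32_t magic;         /* marks a valid header */
    uint32_t logical_unit;  /* logical unit stored in this erase unit */
} UnitHeader;

/* Look for a valid header at offset 0, then at the alternate offset.
 * Returns 1 and fills *out if one is found, 0 otherwise. */
static int read_header(const unsigned char *unit, UnitHeader *out)
{
    static const size_t offsets[] = { 0, ALT_HDR_OFFSET };
    for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        memcpy(out, unit + offsets[i], sizeof *out);
        if (out->magic == HDR_MAGIC)
            return 1;
    }
    return 0;
}

/* Rebuild the RAM-resident logical -> physical unit table using only
 * what is stored in flash. flash points at NUM_UNITS * UNIT_SIZE bytes.
 * Returns the number of units whose headers were accepted. */
int rebuild_unit_map(const unsigned char *flash, int map[NUM_UNITS])
{
    assert(flash != NULL && map != NULL);

    int found = 0;
    for (int l = 0; l < NUM_UNITS; l++)
        map[l] = NO_UNIT;

    for (int phys = 0; phys < NUM_UNITS; phys++) {
        UnitHeader h;
        if (!read_header(flash + (size_t)phys * UNIT_SIZE, &h))
            continue;                        /* no valid header in this unit */
        if (h.logical_unit >= NUM_UNITS || map[h.logical_unit] != NO_UNIT)
            continue;                        /* out of range or duplicate claim */
        map[h.logical_unit] = phys;
        found++;
    }
    return found;
}
```

Because the table is derived entirely from the headers, the same routine can either populate an empty RAM table after a power loss or be compared against an existing table to verify it.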

1.3.2 Recovering from Failures During Write or Erase Operations

A write or erase operation can fail because of a hardware problem or a power failure. To prevent the possible loss of data, the success of each write operation is monitored and verified in TrueFFS. Most flash components use an on-chip register to report the success of an operation. TrueFFS uses this register (if available) to verify successful writing and erasing. In addition, TrueFFS verifies the operation by reading back the actual written data and comparing it to the user data.
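The two checks described above (status register, then read-back) can be sketched as follows. This is an illustration only: the SimChip model, the STATUS_FAIL bit, and the stuck_at fault hook are assumptions made for the example and do not correspond to any particular flash part or to the TrueFFS driver interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CHIP_SIZE   64
#define STATUS_FAIL 0x10u   /* hypothetical "program failed" status bit */

/* Hypothetical model of a flash chip (not a real device or driver type). */
typedef struct {
    uint8_t mem[CHIP_SIZE];
    uint8_t status;      /* simulated on-chip status register */
    bool    has_status;  /* some parts do not provide one */
    int     stuck_at;    /* byte index that cannot be programmed, or -1 */
} SimChip;

/* Simulated program operation: programming can only clear bits (1 -> 0),
 * so the result is the AND of the old contents and the new data. */
static void chip_program(SimChip *c, size_t off, const uint8_t *src, size_t len)
{
    c->status = 0;
    for (size_t i = 0; i < len; i++) {
        if ((int)(off + i) == c->stuck_at) {
            c->status |= STATUS_FAIL;    /* the chip reports the failure */
            continue;
        }
        c->mem[off + i] &= src[i];
    }
}

/* Write, then verify twice: once through the status register (when the
 * part has one) and once by comparing the stored bytes with the source. */
bool write_verified(SimChip *c, size_t off, const uint8_t *src, size_t len)
{
    assert(c != NULL && src != NULL);
    if (off > CHIP_SIZE || len > CHIP_SIZE - off)
        return false;                    /* request falls outside the chip */

    chip_program(c, off, src, len);
    if (c->has_status && (c->status & STATUS_FAIL))
        return false;
    return memcmp(c->mem + off, src, len) == 0;
}
```

The read-back comparison catches cases the status register alone would miss in this model, such as programming over a location that was not erased first.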

If the first attempt at a write operation fails, the failure is usually not reported back to the user. Instead, TrueFFS uses its dynamic mapping ability to retry the write operation at a different location on the flash medium. This ensures the integrity of the data by making failure recovery automatic. This write-error recovery mechanism becomes particularly valuable as the flash medium approaches its cycling limit (end-of-life). At that time, flash write/erase failures become more frequent, but the only user-observed effect is a gradual decline in performance (because of the need for write retries).
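A retry loop of this kind might look like the sketch below. It is a simplified model, assuming a hypothetical SimMedium structure in which each sector holds a single integer and worn-out sectors always fail; the names and the retry counter are inventions for the illustration, not TrueFFS internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PHYS  6
#define NO_SECTOR (-1)

/* Hypothetical model of a medium with some worn-out sectors. */
typedef struct {
    bool worn_out[NUM_PHYS];  /* simulated: writes to these sectors fail */
    bool in_use[NUM_PHYS];    /* holds data, or was retired after a failure */
    int  stored[NUM_PHYS];    /* payload: one int per sector in this model */
    int  retries;             /* running count of retried writes */
} SimMedium;

/* Attempt one write and verify it by reading the value back. */
static bool try_write(SimMedium *m, int phys, int value)
{
    if (m->worn_out[phys])
        return false;
    m->stored[phys] = value;
    return m->stored[phys] == value;
}

/* Write a value, moving on to another free sector whenever an attempt
 * fails. Returns the sector that accepted the data, or NO_SECTOR if
 * every remaining sector failed. The caller sees only the final result. */
int write_with_remap(SimMedium *m, int value)
{
    assert(m != NULL);

    bool first_attempt = true;
    for (int p = 0; p < NUM_PHYS; p++) {
        if (m->in_use[p])
            continue;
        if (!first_attempt)
            m->retries++;            /* each extra attempt costs time */
        first_attempt = false;

        m->in_use[p] = true;         /* now holds data, or is retired */
        if (try_write(m, p, value))
            return p;
    }
    return NO_SECTOR;
}
```

The retries counter stands in for the performance cost mentioned above: as more sectors wear out, more attempts are needed per write, while successful writes still look the same to the caller.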

1.3.3 Recovering from Failures During Garbage Collection

TrueFFS reclaims garbage space (space occupied by sectors that have been deleted by the host) by moving data from one unit to another (a transfer unit) and then erasing the original unit. If a persistent flash failure makes it impossible to perform the write operations needed to move the data, or to erase the old unit, the garbage collection operation fails.

To minimize the possibility that the write part of a transfer fails, TrueFFS formats the flash medium to contain more than one transfer unit. Thus, if the write to one transfer unit fails, TrueFFS retries the write using a different transfer unit. If all transfer units fail, this does not have a direct effect on the user data (all of which is already safely stored). However, the medium no longer accepts new data and becomes a read-only device.
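The fallback across transfer units can be sketched as follows. This is a simplified model under stated assumptions: the Unit and Medium structures, the UnitRole values, and the write_fails fault flag are hypothetical, and live sectors keep their positions during the copy so that no sector remapping needs to be modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SECTORS_PER_UNIT 4
#define NUM_UNITS        4

typedef enum { UNIT_DATA, UNIT_TRANSFER, UNIT_BAD } UnitRole;

/* Hypothetical model of an erase unit (not a TrueFFS structure). */
typedef struct {
    UnitRole role;
    bool     write_fails;             /* simulated persistent write fault */
    bool     live[SECTORS_PER_UNIT];  /* false = free or garbage sector */
    int      data[SECTORS_PER_UNIT];
} Unit;

typedef struct {
    Unit units[NUM_UNITS];
    bool read_only;                   /* set once no transfer unit works */
} Medium;

/* Copy only the live sectors of src into dst; garbage is left behind. */
static bool copy_live(const Unit *src, Unit *dst)
{
    if (dst->write_fails)
        return false;
    for (int s = 0; s < SECTORS_PER_UNIT; s++) {
        dst->live[s] = src->live[s];
        if (src->live[s])
            dst->data[s] = src->data[s];
    }
    return true;
}

/* Reclaim one data unit: try each transfer unit in turn, and erase the
 * original only after a copy has succeeded. If every transfer unit
 * fails, the stored data is left in place and the medium goes read-only. */
bool reclaim_unit(Medium *m, int victim)
{
    assert(victim >= 0 && victim < NUM_UNITS);
    assert(m->units[victim].role == UNIT_DATA);

    Unit *src = &m->units[victim];
    for (int t = 0; t < NUM_UNITS; t++) {
        Unit *dst = &m->units[t];
        if (dst->role != UNIT_TRANSFER)
            continue;
        if (!copy_live(src, dst)) {
            dst->role = UNIT_BAD;     /* retire it, try the next one */
            continue;
        }
        dst->role = UNIT_DATA;
        memset(src->live, 0, sizeof src->live);  /* erase after the copy */
        src->role = UNIT_TRANSFER;    /* old unit becomes the new spare */
        return true;
    }
    m->read_only = true;
    return false;
}
```

Note that the failure path never touches the source unit, which mirrors the point above that exhausting the transfer units does not affect data already stored.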

1.3.4 Recovering from Failures During Formatting

In some cases, sections of the flash medium are found to be unusable (typically because they cannot be erased) when the flash is first formatted. Provided the number of bad units does not exceed the number of transfer units, formatting can succeed and the medium is usable. The only adverse effect observed by the user is that the formatted capacity of the flash medium is reduced.
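The acceptance rule above can be expressed as a simple check. The sketch below is illustrative only: the FormatResult structure and check_format() function are hypothetical, and the assumption that each bad unit simply reduces the count of usable units by one is a simplification made for the example, not a statement of how TrueFFS computes formatted capacity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a format attempt (not a TrueFFS type). */
typedef struct {
    bool   ok;            /* true if formatting can succeed */
    size_t bad_units;     /* units that could not be erased */
    size_t usable_units;  /* units remaining after bad ones are excluded */
} FormatResult;

/* erasable[i] is the outcome of a trial erase of unit i. Formatting is
 * accepted only while bad units do not outnumber the transfer units. */
FormatResult check_format(const bool *erasable, size_t total_units,
                          size_t transfer_units)
{
    assert(erasable != NULL);

    FormatResult r = { false, 0, 0 };
    for (size_t i = 0; i < total_units; i++) {
        if (!erasable[i])
            r.bad_units++;
    }
    r.usable_units = total_units - r.bad_units;  /* assumed capacity model */
    r.ok = r.bad_units <= transfer_units;
    return r;
}
```

For example, with eight units and two transfer units, one unerasable unit still yields a usable medium with seven good units, whereas three unerasable units cause the check to fail.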

1: The header information, normally at offset 0 of an erase unit, can also reside at an alternate offset if offset 0 is unusable. As a result, the location of the header information is dynamic.