Fault tolerant computer architecture
For many years, most computer architects have pursued one primary goal: performance. Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore's law into remarkable increases in performance. Recently, however, the bounty provided by Moore's law h...
Main Author: | |
---|---|
Format: | Electronic |
Language: | English |
Published: | San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool Publishers, c2009. |
Series: | Synthesis lectures on computer architecture (Online), # 5. |
Subjects: | |
Table of Contents:
- Introduction
- Goals of this book
- Faults, errors, and failures
- Masking
- Duration of faults and errors
- Underlying physical phenomena
- Trends leading to increased fault rates
- Smaller devices and hotter chips
- More devices per processor
- More complicated designs
- Error models
- Error type
- Error duration
- Number of simultaneous errors
- Fault tolerance metrics
- Availability
- Reliability
- Mean time to failure
- Mean time between failures
- Failures in time
- Architectural vulnerability factor
- The rest of this book
- References
- Error detection
- General concepts
- Physical redundancy
- Temporal redundancy
- Information redundancy
- The end-to-end argument
- Microprocessor cores
- Functional units
- Register files
- Tightly lockstepped redundant cores
- Redundant multithreading without lockstepping
- Dynamic verification of invariants
- High-level anomaly detection
- Using software to detect hardware errors
- Error detection tailored to specific fault models
- Caches and memory
- Error code implementation
- Beyond EDCs
- Detecting errors in content addressable memories
- Detecting errors in addressing
- Multiprocessor memory systems
- Dynamic verification of cache coherence
- Dynamic verification of memory consistency
- Interconnection networks
- Conclusions
- References
- Error recovery
- General concepts
- Forward error recovery
- Backward error recovery
- Comparing the performance of FER and BER
- Microprocessor cores
- FER for cores
- BER for cores
- Single-core memory systems
- FER for caches and memory
- BER for caches and memory
- Issues unique to multiprocessors
- What state to save for the recovery point
- Which algorithm to use for saving the recovery point
- Where to save the recovery point
- How to restore the recovery point state
- Software-implemented BER
- Conclusions
- References
- Diagnosis
- General concepts
- The benefits of diagnosis
- System model implications
- Built-in self-test
- Microprocessor core
- Using periodic BIST
- Diagnosing during normal execution
- Caches and memory
- Multiprocessors
- Conclusions
- References
- Self-repair
- General concepts
- Microprocessor cores
- Superscalar cores
- Simple cores
- Caches and memory
- Multiprocessors
- Core replacement
- Using the scheduler to hide faulty functional units
- Sharing resources across cores
- Self-repair of noncore components
- Conclusions
- References
- The future
- Adoption by industry
- Future relationships between fault tolerance and other fields
- Power and temperature
- Security
- Static design verification
- Fault vulnerability reduction
- Tolerating software bugs
- References