Fault tolerant computer architecture

For many years, most computer architects have pursued one primary goal: performance. Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore's law into remarkable increases in performance. Recently, however, the bounty provided by Moore's law h...

Bibliographic Details
Main Author: Sorin, Daniel J.
Format: Electronic
Language: English
Published: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool Publishers, c2009.
Series: Synthesis lectures on computer architecture (Online), # 5.
Table of Contents:
  • Introduction
  • Goals of this book
  • Faults, errors, and failures
  • Masking
  • Duration of faults and errors
  • Underlying physical phenomena
  • Trends leading to increased fault rates
  • Smaller devices and hotter chips
  • More devices per processor
  • More complicated designs
  • Error models
  • Error type
  • Error duration
  • Number of simultaneous errors
  • Fault tolerance metrics
  • Availability
  • Reliability
  • Mean time to failure
  • Mean time between failures
  • Failures in time
  • Architectural vulnerability factor
  • The rest of this book
  • References
  • Error detection
  • General concepts
  • Physical redundancy
  • Temporal redundancy
  • Information redundancy
  • The end-to-end argument
  • Microprocessor cores
  • Functional units
  • Register files
  • Tightly lockstepped redundant cores
  • Redundant multithreading without lockstepping
  • Dynamic verification of invariants
  • High-level anomaly detection
  • Using software to detect hardware errors
  • Error detection tailored to specific fault models
  • Caches and memory
  • Error code implementation
  • Beyond EDCs
  • Detecting errors in content addressable memories
  • Detecting errors in addressing
  • Multiprocessor memory systems
  • Dynamic verification of cache coherence
  • Dynamic verification of memory consistency
  • Interconnection networks
  • Conclusions
  • References
  • Error recovery
  • General concepts
  • Forward error recovery
  • Backward error recovery
  • Comparing the performance of FER and BER
  • Microprocessor cores
  • FER for cores
  • BER for cores
  • Single-core memory systems
  • FER for caches and memory
  • BER for caches and memory
  • Issues unique to multiprocessors
  • What state to save for the recovery point
  • Which algorithm to use for saving the recovery point
  • Where to save the recovery point
  • How to restore the recovery point state
  • Software-implemented BER
  • Conclusions
  • References
  • Diagnosis
  • General concepts
  • The benefits of diagnosis
  • System model implications
  • Built-in self-test
  • Microprocessor core
  • Using periodic BIST
  • Diagnosing during normal execution
  • Caches and memory
  • Multiprocessors
  • Conclusions
  • References
  • Self-repair
  • General concepts
  • Microprocessor cores
  • Superscalar cores
  • Simple cores
  • Caches and memory
  • Multiprocessors
  • Core replacement
  • Using the scheduler to hide faulty functional units
  • Sharing resources across cores
  • Self-repair of noncore components
  • Conclusions
  • References
  • The future
  • Adoption by industry
  • Future relationships between fault tolerance and other fields
  • Power and temperature
  • Security
  • Static design verification
  • Fault vulnerability reduction
  • Tolerating software bugs
  • References