Table of Contents: Fault tolerant computer architecture

Fault tolerant computer architecture

For many years, most computer architects have pursued one primary goal: performance. Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore's law into remarkable increases in performance. Recently, however, the bounty provided by Moore's law h...


Bibliographic Details
Main Author:	Sorin, Daniel J.
Format:	Electronic
Language:	English
Published:	San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool Publishers, c2009.
Series:	Synthesis lectures on computer architecture (Online), # 5.
Subjects:	Fault-tolerant computing; Self-stabilization (Computer science); Computer architecture.

Table of Contents:

Introduction
Goals of this book
Faults, errors, and failures
Masking
Duration of faults and errors
Underlying physical phenomena
Trends leading to increased fault rates
Smaller devices and hotter chips
More devices per processor
More complicated designs
Error models
Error type
Error duration
Number of simultaneous errors
Fault tolerance metrics
Availability
Reliability
Mean time to failure
Mean time between failures
Failures in time
Architectural vulnerability factor
The rest of this book
References
Error detection
General concepts
Physical redundancy
Temporal redundancy
Information redundancy
The end-to-end argument
Microprocessor cores
Functional units
Register files
Tightly lockstepped redundant cores
Redundant multithreading without lockstepping
Dynamic verification of invariants
High-level anomaly detection
Using software to detect hardware errors
Error detection tailored to specific fault models
Caches and memory
Error code implementation
Beyond EDCs
Detecting errors in content addressable memories
Detecting errors in addressing
Multiprocessor memory systems
Dynamic verification of cache coherence
Dynamic verification of memory consistency
Interconnection networks
Conclusions
References
Error recovery
General concepts
Forward error recovery
Backward error recovery
Comparing the performance of FER and BER
Microprocessor cores
FER for cores
BER for cores
Single-core memory systems
FER for caches and memory
BER for caches and memory
Issues unique to multiprocessors
What state to save for the recovery point
Which algorithm to use for saving the recovery point
Where to save the recovery point
How to restore the recovery point state
Software-implemented BER
Conclusions
References
Diagnosis
General concepts
The benefits of diagnosis
System model implications
Built-in self-test
Microprocessor core
Using periodic BIST
Diagnosing during normal execution
Caches and memory
Multiprocessors
Conclusions
References
Self-repair
General concepts
Microprocessor cores
Superscalar cores
Simple cores
Caches and memory
Multiprocessors
Core replacement
Using the scheduler to hide faulty functional units
Sharing resources across cores
Self-repair of noncore components
Conclusions
References
The future
Adoption by industry
Future relationships between fault tolerance and other fields
Power and temperature
Security
Static design verification
Fault vulnerability reduction
Tolerating software bugs
References
