ECE 254 / CPS 225

Fault-Tolerant and Testable Computing Systems

Fall 2006
Professor Daniel J. Sorin

                  

 Course Objective and Content
Objective: To provide students with an understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and practical knowledge of real fault-tolerant systems.

Content: The main themes of this course are: technological reasons for faults, fault models, information redundancy, spatial redundancy, backward and forward error recovery, fault-tolerant hardware and software, modeling and analysis, testing, and design for test.

The course includes a project that will allow the students to apply what they have learned in class.

Prerequisites: ECE 152 or CPS 104 or consent of instructor.

 Class Location and Hours

 

Class meets M/W/F from 10:20am - 11:10am.

Location: 201 Hudson Hall

 Instructor

 

Professor Daniel J. Sorin

Office: 209C Hudson Hall

Office Hours: Monday 2:00-3:00, Wednesday 11:10-12:00 (after class)

Email: Email Address of Daniel Sorin

 Textbook

 

There is no textbook for this course.  If you still think you'd like one, let me know and I can recommend one.

 Assignments and Grading

Students are responsible for:

The project is a significant assignment that requires:
Deadlines will be enforced except under extreme circumstances. I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it. Each day late will result in a 10% reduction of the grade given.


Academic Misconduct: I will not tolerate academically dishonest work.  This includes cheating on the homework and exams and plagiarism on the project.  
Be careful on the project to cite prior work and to give proper credit to others' work. 

 Course Topics, Lecture Notes, and Readings

I will post lecture notes (in PowerPoint format) the night before I cover them in class.  Feel free to print them out and bring them to class with you.

Topic | Readings
Introduction: Terminology and Metrics | (none listed)
Faults and Their Causes | "IBM Experiments in Soft Fails in Computer Electronics" (Ziegler); "A Large-Scale Study of Failures in High-Performance Computing Systems" (Schroeder and Gibson); "Why Do Internet Services Fail, and What Can be Done About It?" (Oppenheimer et al.)
General Fault Tolerance Concepts (physical redundancy, error detecting/correcting codes, re-execution techniques, backward error recovery) | "The Teramac Custom Computer: Extending the Limits with Defect Tolerance" (Culbertson et al.); "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors" (Rotenberg); "A Survey of Rollback-Recovery Protocols in Message-Passing Systems" (Elnozahy et al.)
Real Systems: Fault Tolerant Hardware | "RAS Strategy for IBM S/390 G5 and G6" (Mueller et al.); "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory" (Dell); "End-to-End Arguments in System Design" (Saltzer et al.)
Real Systems: Fault Tolerant Software | "Proactive Management of Software Aging" (Castelli et al.); "Software Implemented Fault Tolerance: Technologies and Experience" (Huang and Kintala); "Web Search for a Planet: The Google Cluster Architecture" (Barroso et al.)
Modeling and Evaluation | "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic" (Shivakumar et al.); "The Impact of Technology Scaling on Lifetime Reliability" (Srinivasan et al.)
Testing, Design for Test, and Verification | "Validating the Pentium 4 Microprocessor" (Bentley and Gray); "IDDQ Test: Will It Survive the DSM Challenge?" (Sabade and Walker)
 
 Homework Assignments

Homework #1, Due Friday, Sept 15 in class

Homework #2, Due Monday, Sept 25 in class (Exercise 3.2 must be emailed to me by 10:00am on Sept 25)

Homework #3, Due Monday, Oct 30 in class (Exercise 6.4 must be emailed to me by 10:00am on Oct 30)

Homework #4, Due Monday, Nov 20 in class

 Tentative Schedule (subject to change)

 

Week | Monday | Wednesday | Friday
Aug 28 | introduction | faults and their causes | faults and their causes
Sep 4 | "IBM Experiments in Soft Fails" | "A Large-Scale Study of Failures" | "Why Do Internet Services Fail?"
Sep 11 | basic FT concepts | physical redundancy | "The Teramac"
Sep 18 | information redundancy | re-execution techniques | "AR-SMT"
Sep 25 | backward error recovery | "A Survey of Rollback-Recovery Protocols" | review for midterm
Oct 2 | MIDTERM EXAM | FT microprocessors, "RAS Strategy for IBM S/390" | FT memory, "Chipkill Memory"
Oct 9 | FALL BREAK | FT disks | FT networks, "End-to-End Arguments"
Oct 16 | FT multiprocessors (project proposals due) | FT software, "Proactive Management of Software Aging" | FT software, "Software Implemented Fault Tolerance"
Oct 23 | FT software, "The Google Cluster Architecture" | modeling/evaluation, "Modeling the Effect of Technology Trends" | modeling/evaluation
Oct 30 | modeling/evaluation, "The Impact of Technology Scaling on Lifetime Reliability" | modeling/evaluation | modeling/evaluation
Nov 6 | testing (project progress reports due) | testing | testing
Nov 13 | design for test | "Validating the Pentium 4 Microprocessor" | "IDDQ Test"
Nov 20 | review for final | THANKSGIVING | THANKSGIVING
Nov 27 | PROJECT PRESENTATIONS [PROJECTS DUE NOV 27]