| ECE 254 / CPS 225 |
| Fault-Tolerant and Testable Computing Systems |
| Fall 2008 |
| Professor Daniel J. Sorin |
| Course Objective and Content |
| Objective: To provide students with an understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and practical knowledge of real fault-tolerant systems. |
| Content: The main themes of this course are: technological reasons for faults, fault models, information redundancy, spatial redundancy, backward and forward error recovery, fault-tolerant hardware and software, modeling and analysis, testing, and design for test. |
| The course includes a project that will allow students to apply what they have learned in class. |
| Prerequisites: ECE 152 or CPS 104 or consent of instructor. |
| Class Location and Hours |
Class meets M/W/F from 10:20am - 11:10am.
Location: TBD
| Instructor |
Office: 209C Hudson Hall
Office Hours: TBD
Email: 
| Textbook |
There is no textbook for this course. If you still think you'd like one, let me know and I can recommend one.
| Assignments and Grading |
Students are responsible for:
| The project is a significant assignment that requires: |
| Deadlines will be enforced except under extreme circumstances. I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it. Each day late will result in a 10% reduction of the grade given. |
| Academic Misconduct: I will not tolerate academically dishonest work. This includes cheating on the homework and exams and plagiarism on the project. Be careful on the project to cite prior work and to give proper credit to others' work. |
| Course Topics, Lecture Notes, and Readings |
| Homework Assignments |
Homework #1, Due TBD in class
| Tentative Schedule (subject to change) |
| Week | Monday | Wednesday | Friday |
| Aug 25 | introduction | faults and their causes | faults and their causes |
| Sep 1 | "IBM Experiment in Soft Fails" | "A Large-Scale Study of Failures" | "Why Do Internet Services Fail?" |
| Sep 8 | basic FT concepts | physical redundancy | "The Teramac" |
| Sep 15 | information redundancy | re-execution techniques | "AR-SMT" |
| Sep 22 | backward error recovery | "A Survey of Rollback-Recovery Protocols" | "End-to-End Arguments" |
| Sep 29 | FT microprocessors, "RAS Strategy for IBM S/390" | "Argus" | review for midterm |
| Oct 6 | MIDTERM EXAM | "SWIFT" | FT memory and disks |
| Oct 13 | FALL BREAK | FT networks | FT software, "Proactive Management of Software Aging" |
| Oct 20 | FT software, "Software Implemented Fault Tolerance" | FT software, "The Google Cluster Architecture" | modeling/evaluation, "Modeling the Effect of Technology Trends" |
| Oct 27 | modeling/evaluation | modeling/evaluation, "The Impact of Technology Scaling on Lifetime Reliability" | modeling/evaluation |
| Nov 3 | modeling/evaluation | testing | testing |
| Nov 10 | testing | design for test | "Validating the Pentium 4 Microprocessor" |
| Nov 17 | "IDDQ Test" | review for final | PROJECT PRESENTATIONS |
| Nov 24 | PROJECT PRESENTATIONS | THANKSGIVING | |
| Dec 1 | READING PERIOD | | |