
Friday, August 05, 2011

Fault-tolerant Digital Systems




Course Synopsis

Our daily lives are becoming increasingly dependent on computer systems, from small embedded computers to large-scale data centers. Any disruption or malfunction of these systems can have devastating consequences for society as a whole. The reliability and availability of these systems are thus essential for our quality of life and for the smooth functioning of society. It is therefore important to build computer systems that operate correctly in the face of errors and failures.
This course focuses on the design of fault-tolerant and reliable computer systems. In particular, we will attempt to understand the root causes of faults in computer systems and their impact. We will study both traditional and cutting-edge techniques for providing fault tolerance and error resilience. Finally, we will explore practical applications of these techniques in the context of real systems.
An important thread that runs through the course is the evaluation of fault-tolerant systems. To this end, we will study techniques ranging from analytical modeling to empirical validation. The assignments will give you hands-on exposure to cutting-edge tools and techniques for dependability evaluation, and will prepare you for the final project. You are encouraged (but not required) to work on a project related to your research interests. The final project constitutes a significant part of the grade.

Topics Covered

Some slides are based on Prof. Saurabh Bagchi’s slides for “Fault Tolerant Computer System Design” (ECE 695B) at Purdue University. Used with permission.
Topic | Lectures | Sub-topics
Introduction and Overview | 3 | Introduction to the course; Basic concepts; Sources of faults in computer systems
Modeling and Evaluation - 1 | 2 | Probability review and discrete probability; Continuous probability and TMR
Hardware fault-tolerance | 2 | Architectural techniques
Modeling and Evaluation - 2 | 2 | Markov processes, Stochastic Activity Networks
Software fault-tolerance | 3 | N-version programming, recovery blocks, robust data structures and process pairs
Modeling and Evaluation - 3 | 2 | Fault-injection: techniques and tools, Formal methods
Parallel and Distributed systems | 4 | Check-pointing and recovery; Byzantine fault-tolerance and Paxos
Case Studies | 2 | Stratus and AT&T systems

Textbooks

There is NO required textbook. However, the following books are recommended:
  1. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems – Design and Evaluation, 3rd edition, 1999, A.K. Peters, Limited.
  2. K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.
