Tutorial on Architectural Design for Soft Errors

Dr. Joel Emer & Dr. Shubu Mukherjee, Intel Corporation
Sunday, June 21st, Morning session.

As kids, many of us were fascinated by black holes and solar flares in deep space. Little did we know that particles from deep space could affect computing systems at the surface of the earth, causing blue screens and incorrect bank balances. CMOS technology has shrunk to a point where radiation from deep space and from packaging material has started causing such malfunctions at an increasing rate. These radiation-induced errors are termed "soft" because the state of one or more bits in a silicon chip can flip temporarily without damaging the hardware. The lack of any appropriate shielding material has led the design community to look for process, circuit, architectural, and software solutions to mitigate the effect of soft errors.

This tutorial will cover architectural techniques to tackle the soft error problem. Computer architecture has long coped with various types of faults, including ones induced by radiation. For example, error correction codes are commonly used in memory systems. High-end systems have often used redundant copies of hardware to detect faults and recover from errors. Many of these solutions have, however, been prohibitively expensive and difficult to justify in the mainstream commodity computing market.

The need to find cheaper reliability solutions has driven a whole new class of quantitative analysis of soft errors and corresponding solutions that mitigate their effects. This tutorial will cover the new methodologies for quantitative analysis of soft errors as well as novel cost-effective architectural techniques to mitigate them. This tutorial will also re-evaluate traditional architectural solutions in the context of the new quantitative analysis.

More specifically, in this tutorial we will cover:

Much of the material in this tutorial will be based on the book "Architecture Design for Soft Errors." Elsevier, Inc. holds the copyright (c) to some of the material covered in the tutorial and in this announcement.

Dr. Joel S. Emer is an Intel Fellow working in the Digital Enterprise Group, where he is director of micro-architecture research. Before joining Intel he spent 22 years as a Digital/Compaq employee, where he worked on processor architecture, performance analysis, and performance modeling methodologies for a number of VAX and Alpha CPUs. He is widely recognized for his architecture contributions, including pioneering efforts in simultaneous multithreading, and for his seminal work on the now pervasive quantitative approach to processor evaluation. He has also researched heterogeneous distributed systems and networked file systems at DEC and during a three-year sabbatical at MIT. His current research interests include processor reliability, multithreaded processor organizations, techniques for increased instruction-level parallelism, pipeline organization, instruction and data cache organizations, branch prediction schemes, and performance modeling. Dr. Emer holds a Ph.D. in Electrical Engineering from the University of Illinois, and M.S.E.E. and B.S.E.E. degrees from Purdue University. He is also a Fellow of both the ACM and the IEEE.

Dr. Shubu Mukherjee is a Principal Engineer and Director of Intel's SPEARS Group (Simulation and Pathfinding of Efficient and Reliable Systems). The SPEARS Group is responsible for spearheading architectural change and innovation in the delivery of enterprise processors and chipsets by building and supporting simulation and analytical models of performance, power, and reliability. Dr. Mukherjee is widely recognized both within and outside Intel as one of the experts on architecture design for soft errors. He has made pioneering contributions to the design of Intel's System Environment Monitoring Agent (SEMA), which runs on more than 200,000 processor cores within Intel; architectural vulnerability modeling for soft errors; Redundant Multithreading (RMT) techniques; the creation of performance modeling infrastructures called Cameroon (jointly with a team of Intel engineers) and Asim (jointly with Dr. Joel Emer); the design of the Alpha 21364 interconnection network; and the creation of the first shared memory prediction scheme. Prior to joining Intel, Dr. Mukherjee worked at Compaq for 3 years and at Digital Equipment Corporation for 10 days. Dr. Mukherjee received his B.Tech. from the Indian Institute of Technology, Kanpur, where he now serves as adjunct faculty. He received his M.S. and Ph.D. from the University of Wisconsin-Madison. He is a Fellow of the IEEE. He was the General Chair of ASPLOS (Architectural Support for Programming Languages and Operating Systems), 2004. He has co-authored over 40 external papers. He holds 16 patents and has filed over 25 more at Intel. Dr. Mukherjee's book, "Architecture Design for Soft Errors," was published in February 2008. Dr. Mukherjee serves on the Editorial Board of IEEE Computer Architecture Letters (CAL), as an Associate Editor of IEEE Transactions on Dependable and Secure Computing (TDSC), on National Science Foundation (NSF) panels, on numerous technical program committees, on Intel Corporation's patent committee, and on the Board of Trustees of Merrimack Repertory Theatre.