ECE 259 / CPS 221

Advanced Computer Architecture II

Spring 2006
Professor Daniel J. Sorin

                  

 Objectives
The objective of this course is to provide students with an understanding of parallel computer architectures.  Students will read research papers, lead in-class discussions of papers, perform a research project, and present their research projects in both written and oral formats.

The course focuses on both the design and evaluation of multiprocessor systems. The main design themes of this course are: parallel programming, system organizations, shared memory multiprocessors, memory consistency models, interconnection networks, high availability systems, interactions with current microprocessor and I/O technology, novel architectures, and emerging technologies.  The evaluation portion of this course will focus on metrics, modeling, simulation, and workloads for benchmarking.

Prerequisites: ECE 252, CPS 220, or consent of instructor.

 Class Location and Hours

 

Class meets Monday/Wednesday/Friday from 10:20am - 11:10am.

Location: LSRC D106.

 Instructor

 

Professor Daniel J. Sorin

Office: 209C Hudson Hall

Office Hours: Monday 11:15-noon, Wednesday 2:45-3:30

Email: Email Address of Daniel Sorin

 Materials
This course has a textbook for background material and for reference, but the emphasis of the class will be discussions of research papers.  

Textbook: Parallel Computer Architecture: A Hardware/Software Approach, by David Culler and J.P. Singh.

 Assignments and Grading
This is a graduate-level class that will not require "busy work."  This class will, however, require that students learn the reading material and learn how to present research in both written and oral formats (see Hill and Patterson for useful advice for presentations).  Communication is very important in this class.  Students who struggle with reading and writing are encouraged to take this course but should expect to work hard and to improve their communication skills in the process.

Students are responsible for:

The project is a semester-long assignment that should reflect the goal of being no more than "a stone's throw" away from a research paper.  As such, the project will require:

Deadlines will be enforced except under extreme circumstances.  I would prefer that you turn in something not quite done on the due date rather than wait until after the deadline to try to finish it.  Any assignment/project that is less than 24 hours late will lose 50% of its credit.  Any assignment/project that is more than 24 hours late will receive a zero.

Academic Misconduct: I will not tolerate academically dishonest work.  This includes cheating on the final exam and plagiarism on the project.  Be careful on the project to cite prior work and to give proper credit to others' research.

 Programming Assignments

There are two programming assignments: one using shared memory and one using message passing.  Both assignments are to be done INDIVIDUALLY!  Here is some documentation on how to begin parallel programming.  So that you gain some appreciation for the differences between shared memory and message passing, you will solve the same problem (computing prime numbers) in both programming models.  On each assignment, the fastest correct program will earn a 5% bonus.

Assignment #1: Use shared memory (pthreads) to compute the Nth prime number, where N is a command-line input to your program.
          You must email your program to me by 10:00am on Wednesday, January 25.
Assignment #2: Use message passing (MPI) to compute the Nth prime number, where N is a command-line input to your program.
          You must email your program to me by 10:00am on Wednesday, February 1.
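
To make the shared-memory style concrete, here is a minimal sketch in Python of one way threads could cooperate on the Nth-prime problem. It is only an illustration, not the expected solution: the assignment itself calls for pthreads in C, and CPython's global interpreter lock means this version shows the structure (a shared work counter guarded by a lock, a shared result array) rather than a real speedup. The names nth_prime_shared, num_threads, and chunk are hypothetical, and the search limit relies on the known bound p_n < n(ln n + ln ln n) for n >= 6.

```python
import math
import threading


def is_prime(k):
    """Trial division; adequate for a sketch, not tuned for speed."""
    if k < 2:
        return False
    if k < 4:
        return True
    if k % 2 == 0:
        return False
    for d in range(3, math.isqrt(k) + 1, 2):
        if k % d == 0:
            return False
    return True


def nth_prime_shared(n, num_threads=4, chunk=1000):
    """Return the nth prime (1-indexed) using threads that share memory."""
    if n < 1:
        raise ValueError("n must be >= 1")

    # Upper bound on the nth prime: p_n < n(ln n + ln ln n) for n >= 6.
    # For n < 6 a fixed limit of 15 covers the first six primes.
    if n < 6:
        limit = 15
    else:
        limit = int(n * (math.log(n) + math.log(math.log(n)))) + 1

    flags = bytearray(limit + 1)   # shared: flags[k] == 1 means k is prime
    next_start = [2]               # shared: next unclaimed candidate
    lock = threading.Lock()        # protects next_start

    def worker():
        while True:
            # Claim the next chunk of candidates under the lock.
            with lock:
                start = next_start[0]
                next_start[0] = start + chunk
            if start > limit:
                return
            # Each thread writes only to indices in its own chunk.
            for k in range(start, min(start + chunk, limit + 1)):
                if is_prime(k):
                    flags[k] = 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # After all threads finish, scan the shared array in order.
    count = 0
    for k, flag in enumerate(flags):
        if flag:
            count += 1
            if count == n:
                return k
    raise RuntimeError("upper bound too small")
```

For example, nth_prime_shared(100) returns 541. In a message-passing version, the shared flags array and the locked counter would go away: each process would own its range of candidates and send its results to a coordinator in explicit messages.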

 Lecture Notes

I will post lecture notes (in PowerPoint format) shortly before I cover them in class.  

Segment 1: Introduction

Segment 2: Parallel Programming

Segment 3: System Organizations and Scalable Machines without Cache Coherence

Segment 4: Shared Memory and Cache Coherence
               4.1: Snooping
               4.2: Directories
               4.3: CMPs and COMA

Segment 5: Memory Consistency and Synchronization Optimizations

Segment 6: Interconnection Networks

Segment 7: Evaluation

Segment 8: Availability

No slides for material past this point.

 Paper Presentation Notes
Split-C (Dan Sorin)

CM-5  (Pete Golden)

Starfire  (Mahmut Yilmaz)

Multicast Snooping (Derek Hower)

AlphaServer GS320 (Curt Harting)

Token Coherence (Jerry Wu)

Piranha (Garver Moore)

Niagara (Bogdan Romanescu)

R-NUMA (Anita Lungu)

Wildfire (Terry Arnold)

SC+ILP=RC (Costi Pistol)

Speculative Lock Elision (Nathan Sadler)

Virtual Channel Flow Control (John Calandrino)

Alpha 21364 Interconnection Network (Luis Campos)

AMVA Model (Bogdan Romanescu)

Simics (Clif Kerr)

WRL Commercial Workloads (Jerry Wu) 

Simulating $2M Server (Derek Hower)

IBM Mainframes (Mahmut Yilmaz)

SafetyNet (Anita Lungu)

ROC (John Calandrino)

Tarantula (Curt Harting)

Raw (Garver Moore)

 Topics and Readings

Readings in italics are optional material.

Theme: Introduction

1. Why Study Multiprocessors (parallelism, limits, Amdahl’s Law)
   Readings:
     Culler/Singh 1.0, 1.1
     "Rise and Fall of MP Research"

Theme: Parallel Programming

2. Programming Models (message passing, shared memory, performance and scaling)
   Readings:
     Culler/Singh 2, 3 (can skim/skip 3.5)
     "Parallel Programming in Split-C"

3. Synchronization Basics (atomic operations, locks, barriers)
   Readings:
     Culler/Singh 2.3.4-2.3.6, 5.5

Theme: Machine Organizations and Scalable Systems without Cache Coherence

4. System Organizations (SIMD: MMX, vectors, DSP; MIMD)
   Readings:
     Culler/Singh 1.2

5. Scalable, Non-Coherent Multiprocessors (message passing: Paragon, CM5, active messages; shared physical memory: Cray T3E)
   Readings:
     Culler/Singh 7.2, 7.5, 7.6
     "The Network Architecture of the Connection Machine CM-5"
     "Synchronization and Communication in the Cray T3E Multiprocessor"
     "Active Messages: A Mechanism for Integrated Communication and Computation"

Theme: Cache-Coherent Shared Memory Multiprocessors

6. Shared Memory & Cache Coherence
   Readings:
     Culler/Singh 5.0, 5.1

7. Snooping Cache Coherence
   Readings:
     Culler/Singh 5.3-5.7 (skim 5.4), 6
     "Starfire: Extending the SMP Envelope"
     "Multicast Snooping: A New Coherence Method Using a Multicast Address Network"
     "Timestamp Snooping: An Approach for Extending SMPs"
     "Gigaplane: A High Performance Bus for Large SMPs"

8. Directory Cache Coherence
   Readings:
     Culler/Singh 8 (skim 8.3)
     "The Stanford DASH Multiprocessor"
     "Architecture and Design of AlphaServer GS320"
     "An Evaluation of Directory Schemes for Cache Coherence"

9. Advanced Coherence Topics: Token Coherence, Chip Multiprocessors
   Readings:
     "Token Coherence: Decoupling Performance and Correctness"
     "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"
     "Niagara: A 32-way Multithreaded SPARC Processor"

10. COMA - Cache Only Memory Arch (classic: DDM, KSR-1; new: S-COMA, R-NUMA, Wildfire)
    Readings:
      Culler/Singh 9.2.2
      "DDM--A Cache-Only Memory Architecture"
      "Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA"
      "WildFire: A Scalable Path for SMPs"

Theme: Memory Consistency Models

11. Memory Consistency Basics
    Readings:
      Culler/Singh 5.2, 9.1
      "Shared Memory Consistency Models: A Tutorial"

12. Consistency Optimizations (speculation, Scheurich's optimization)
    Readings:
      "Two Techniques to Enhance the Performance of Memory Consistency Models"
      "Is SC + ILP = RC?"

13. Synchronization Optimizations
    Readings:
      "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution"
      "Efficient Synchronization: Let Them Eat QOLB"

Theme: Interconnection Networks

14. Interconnection Network Basics (topology, routing, flow control)
    Readings:
      Culler/Singh 10
      "The Alpha 21364 Network Architecture"

15. Deadlock Avoidance (virtual channels, turn model, hot-potato routing)
    Readings:
      "Virtual Channel Flow Control"
      "A Survey of Wormhole Routing Techniques in Direct Networks" [includes "Turn Model" concept]

Theme: Evaluation Tools and Methodology

16. Evaluation: Metrics & Modeling (scalability, throughput, why not IPC?; mathematical modeling of performance)
    Readings:
      Culler/Singh 4 (skim 4.4)
      "Cost-Effective Parallel Computing"
      "Analytic Evaluation of Shared-Memory Parallel Systems with ILP Processors"

17. Evaluation: Simulation (precision vs. performance; full-system, parallel host)
    Readings:
      Culler/Singh 4 (skim 4.4)
      "Simics: A Full System Simulation Platform"
      "The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers"

18. Evaluation: Workloads (scientific vs. commercial, TLP, importance of benchmark selection)
    Readings:
      "Memory System Characterization of Commercial Workloads"
      "Simulating a $2M Commercial Server on a $2K PC"

Theme: Reliability and Availability

19. Available Computers
    Readings:
      "IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective"
      "Fault-Tolerant Systems in Commercial Applications" [survey of classic FT systems]

20. Current Topics in Availability
    Readings:
      "SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery"
      "Embracing Failure: A Case for Recovery-Oriented Computing (ROC)"

Theme: Novel Architectures

21. Vector Machines
    Readings:
      "The Cray-1 Computer System"
      "Tarantula: A Vector Extension to the Alpha Architecture"
      "Optimizing Compiler for a CELL Processor" (do not worry about compiler details!)

22. Dataflow
    Readings:
      Culler/Singh 1.2.6
      "Executing a Program on the MIT Tagged-Token Dataflow Architecture"

23. Grid Architectures
    Readings:
      "Baring It All to Software: Raw Machines"
      "A Design Space Evaluation of Grid Processor Architectures"

24. Supercomputing
    Readings:
      "Blue Gene: A Vision for Protein Science Using a Petaflop Supercomputer"

Theme: Interactions with Processors and I/O

25. Microarchitectural Effects (parallelism: ILP, MLP, TLP)
    Readings:
      "An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors"

26. I/O
    Readings:
      "Making Network Interfaces Less Peripheral"

Theme: New Technology

27. Quantum Computing
    Readings:
      "A Practical Architecture for Reliable Quantum Computers"

28. Bio/Molecular Computing
    Readings:
      "Circuit and System Architecture for DNA-Guided Self-Assembly of Nanoelectronics"