Embedded Computer Architecture

2021 - 2022 (Second Quartile)

Code :         5SIA0
Credits:      5 ECTS
Lecturers : Prof. dr. Henk Corporaal
Tel. :           +31-40-247 5195 (secr.) 5462 (office)
Email:        H.Corporaal at tue.nl

Project assistance:
- Ali Banagozar (a.banagozar at tue.nl), specifically for the CGRA lab
- Sun Wei (w.sun at tue.nl), for the GPU lab
- Patrick Wijnings (p.w.a.wijnings at tue.nl), for lab on Architecture Exploration

with some (background) help from:
- Barry de Bruin (e.d.bruin at tue.nl)

News

Jan 20: All lecture slides, including wrap-up, are now online.
Lecture starts: Monday, November 15, 10.45 h.
and following lecture Wednesday November 17, 17.30h.
Note: this website will be updated continuously during the course.
We also use oncourse.tue.nl; search for the course Embedded Computer Architecture (5SIA0) in Electrical Engineering.
Interesting article about processor design: a 90-minute guide!
Tentative Course schedule.

Description

When looking at future embedded systems and their design, especially (but not exclusively) in the multi-media and machine learning domains, we observe several problems:

high performance (1 TOPS and far beyond) has to be combined with low power, often less than 1 Watt;
many systems are mobile and battery operated;
time-to-market (the time available to get your design done) keeps shrinking;
most embedded processing systems have to be extremely low cost;
the applications show more dynamic behavior (resulting in greatly varying quality and performance requirements);
more and more the implementer requires flexible and programmable solutions, allowing late changes of used algorithms and system updates;
reliability becomes an issue with denser silicon technologies and more on-chip circuitry;
there is a huge latency gap between processors and memories; and
design productivity does not keep pace with the increasing design complexity.
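
As a rough illustration of the first requirement, the energy budget per operation follows directly from the two figures quoted above, assuming 1 TOPS means 10^12 operations per second and that the full 1 Watt is spent on those operations:

```latex
E_{\text{op}} = \frac{P}{\text{throughput}}
             = \frac{1\,\text{W}}{10^{12}\,\text{op/s}}
             = 10^{-12}\,\text{J/op}
             = 1\,\text{pJ/op}
```

So a 1 TOPS system within a 1 Watt budget can spend about one picojoule per operation, including everything needed to fetch its operands.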

To solve these problems we foresee the use of programmable multi-processor platforms with an advanced memory hierarchy, combined with an advanced design trajectory. These platforms may contain different processors, ranging from general-purpose processors to processors and accelerators that are highly tuned for a specific application or application domain. This course treats the design principles of current advanced processor architectures, shows how to program them, and compares their efficiency in terms of cost, power and performance. Furthermore, the tuning of processor architectures is treated.

Several advanced Multi-Processor Platforms, combining discussed processors, are treated. A set of 3 mandatory very advanced lab exercises complements this course.

Purpose:
This course aims at getting an understanding of the processor architectures which will be used in future multi-processor platforms, including their memory hierarchy, especially for the embedded domain. Treated processors range from general purpose to highly optimized ones. Trade-offs will be made between performance, flexibility, programmability, energy consumption and cost. It will be shown how to tune processors in various ways.

We study the architecture, organization and use of the newest (micro)processors currently on the market, as well as the latest research developments in computer architecture. Architectures exploiting instruction-level parallelism (ILP), data-level parallelism (DLP, as exploited by GPUs), thread-level and task-level parallelism are treated. Starting from basic architecture concepts, we will end with a discussion of the latest commercial processors.

This course also treats how processors can be combined in a multiprocessing platform, e.g. by using a Network-on-Chip. Inter-processor communication issues will be dealt with. Furthermore, some loop transformation techniques needed for exploiting ILP, parallelism and data reuse will be treated.
Note: code generation is treated far more extensively in a separate course on Parallelization, Compilers and Platforms (5LIM0).
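
To give a flavour of what a loop transformation for ILP looks like, here is a minimal sketch in C of loop unrolling with separate accumulators. It is not taken from the course material; the function names are hypothetical and the unroll factor of 4 is an arbitrary choice for illustration.

```c
#include <stddef.h>

/* Baseline: every addition depends on the previous one through `sum`,
 * so the loop forms one long dependence chain. */
double sum_simple(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4 with separate accumulators: the four additions in the
 * loop body are independent of each other, which exposes ILP to a
 * superscalar or VLIW processor. */
double sum_unrolled4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder iterations when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because the transformed loop adds the elements in a different order, floating-point results can differ slightly from the baseline for general inputs; this is one reason compilers do not always apply the transformation on their own.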

Special emphasis will be on quantifying design decisions in terms of energy, performance and cost. The intention of the course is to give students the ability to understand the design principles and operation of new (multi-)processor architectures, and evaluate them both qualitatively and quantitatively. Although we treat several examples, the emphasis will be on architecture concepts. Furthermore, 3 intensive lab exercises are part of the course; they will teach you the design space of multi- and graphics processors.

We may invite several guest lecturers to treat state-of-the-art topics in the computer architecture area.

Topics:

Basic principles (like RISC and instruction set design), pipelining and its consequences.

All processor architecture varieties, including VLIW (very long instruction word, including TTAs, Transport Triggered Architectures), Superpipelined, Superscalar, SIMD (single instruction, multiple data, used in vector and sub-word parallel processors) and MIMD (multiple instruction, multiple data) architectures; SMT (Simultaneous Multi-Threading); CGRAs (Coarse Grain Reconfigurable Arrays), and Accelerators.

Concepts like: Out-of-Order and speculative execution; Branch prediction; Data (value) prediction; Design of advanced memory hierarchies; Near- and In-memory computing; Memory coherency and consistency; Multi-threading; Exploiting task-level and instruction-level parallelism; Inter-processor communication models; Input and output; Network Communication Architecture; and Networks-on-Chip.

In all cases it is shown how to program these architectures. Furthermore, their combination and interconnection in an MPSoC (Multi-Processor System-on-Chip) is treated.

Most of the topics will be supplemented by very elaborate hands-on exercises.

Book and Handouts

The lecture slides will be made available during the course; see also below.
Papers and other reading material.

Book:

Computer Architecture: A Quantitative Approach
Hennessy and Patterson
6th Edition (make sure you have this edition!)
November 2017
Especially study Chapters 1-5, and 7, and Appendices A-C,
Background reading E, F and K

Alternative book, less recent but more conceptual, strongly recommended for background reading:

Parallel Computer Organization and Design
Authors: Michel Dubois, Murali Annavaram, and Per Stenström

Cambridge university press
October 2012

In addition: Learn Chapter 2 on Computer Architecture Trends
taken from the book "Microprocessor Architectures, from VLIW to TTA"
by Henk Corporaal, publisher John Wiley, 1998.

Slides (per topic; see also the course description)

Note: all slides will be updated during the course
Chapter numbers correspond, unless mentioned otherwise, to the book of Dubois. Related chapters from Hennessy and Patterson's book are mentioned in the slide sets.

Course introduction + Guidelines of this lecture, including preliminary Course Schedule
Fundamentals of Computer Architecture
RISC: recap ISA, based on the MIPS architecture
Memory hierarchy 1
Simulation + Gem5 + Lab 3 (see oncourse.tue.nl)
Loop transformations
RISC: recap pipelining and hazards
Memory hierarchy 2
DLP and SIMD architectures
GPU: Graphics Processing Units

Guest lecture by Gert-Jan van de Braak (Philips Medical Systems)

Instruction Level Parallel Architectures 1
Instruction Level Parallel Architectures 2
Instruction Level Parallel Architectures 3
Coarse Grain Reconfigurable Architecture + Lab 1 (see oncourse.tue.nl)
Thread-level parallelism + Multi-Processors
Networks
Coherence, Synchronization, and Consistency
Domain Specific Architectures + Machine Learning

Guest lecture by Maurice Peemen (Thermo Fisher, Eindhoven)

Wrap-Up

Slides and material corresponding to labs: Go to oncourse.tue.nl for all the lab material !!

Hands-on lab work

note: lab courses will be updated during the course;
check especially the oncourse.tue.nl site !!

To become a very good computer architect you have to practice a lot. Therefore, as part of this course we have put a lot of effort into preparing 3 very interesting lab assignments. For each lab there is a website with all the required documentation and preparation material. These lab assignments can be done (largely, but check the wiki pages) on your own laptop, with remote access to our server systems for certain parts.
For every lab you have to write a report, which has to be sent to one of the course assistants, or via the oncourse web site (always check the lab specific instructions).

Lab DSE: Using GEM5 simulator for architecture design space exploration (DSE)

In this lab, you'll get some hands-on experience using the GEM5 simulator; see also the Gem5 slides.
The main goal of this lab is to learn how a simulator can be used to explore the impact of architectural modifications on performance, and in particular how design choices in the processor and cache hierarchy affect the trade-off between energy efficiency and performance. Furthermore, you will learn: how to do basic power modelling of a full system; how to deal with a large number of simulation parameters and big output data files; and how to write C code that optimally exploits a given cache hierarchy.
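
As an illustration of C code that takes the cache hierarchy into account, the sketch below contrasts a naive matrix transpose with a tiled version. This is not the lab assignment itself; the function names and the tile size `TILE` are hypothetical, and a suitable tile size depends on the cache configuration you simulate.

```c
#include <stddef.h>

#define TILE 32  /* hypothetical tile size; tune to the simulated cache */

/* Naive transpose: reads src row by row but writes dst column by column,
 * so consecutive writes are n elements apart in memory. */
void transpose_naive(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: works on TILE x TILE blocks so that the parts of src
 * and dst touched by one block can stay in the cache together. */
void transpose_tiled(const float *src, float *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block bounds when n is not a multiple of TILE. */
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
}
```

Both functions compute the same result; only the order of the memory accesses differs, which is exactly the kind of effect a cache simulation can make visible in its miss counts.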

Lab GPU: Programming Graphics Processing Units

Graphics processing units (GPUs) can contain up to thousands of Processing Engines (PEs). They achieve performance levels of teraFLOPS (10^12 floating-point operations per second). In the past GPUs were very dedicated, not generally programmable, and could only be used to speed up graphics processing. Today they are becoming more and more general purpose and even appear in high-end embedded systems. The latest GPUs of ATI/AMD and NVIDIA can be programmed in CUDA and/or OpenCL. For this lab we will use GPUs together with the OpenCL (based on C) programming environment.

For details on the assignment see the oncourse site.
First study the GPU lecture and the matrix multiplication example in the cookbook before starting the real assignment.
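
To show the data-parallel structure behind a GPU matrix multiplication, here is a minimal sketch in plain C. It is not the cookbook example and it does not use the OpenCL API: `matmul_work_item` and `matmul_launch` are hypothetical names, and the two loops in `matmul_launch` merely stand in for what a GPU would execute as a grid of parallel work-items.

```c
#include <stddef.h>

/* Body of one hypothetical work-item: computes a single element
 * C[row][col] of the n x n product C = A * B (row-major storage). */
static void matmul_work_item(const float *A, const float *B, float *C,
                             size_t n, size_t row, size_t col)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; k++)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

/* Sequential stand-in for launching an n x n grid of work-items.
 * On a GPU the iterations of these two loops could run in parallel,
 * because no work-item reads another work-item's output. */
void matmul_launch(const float *A, const float *B, float *C, size_t n)
{
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++)
            matmul_work_item(A, B, C, n, row, col);
}
```

In an actual OpenCL kernel the row and column indices would come from the runtime rather than from loop variables; see the GPU lecture and the cookbook for how that is done.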

Lab CGRA: Processor Design Space Exploration, based on the CGRA from TU/e

You will be asked to design and optimize a low-power and very flexible CGRA (Coarse Grain Reconfigurable Array) processor for a certain application. You can both configure the CGRA and make a smart mapping of the application onto this CGRA. In particular, you have to trade off performance against energy consumption.

Detailed documentation about these assignments will be put on our 5SIA0 oncourse.tue.nl website; you need to log in.

Student presentations guidelines

As part of this lecture you have to study a hot topic related to this course, and make a short slide presentation about this topic.
The slides have to be presented during the oral exam (usually at the beginning of Q3).

Guidelines are as follows:

Choose one hot topic which interests you and which is highly related to this course!
Select one technical (in depth) research paper from the web, based on this topic.
The paper should be published in 2019 or later.
The paper should have sufficient technical depth; i.e. it should clearly explain all the details of the proposed method or solution. So, for example, do not choose company white papers or business papers. You can also check whether the paper is from a well-regarded journal or conference, such as those of IEEE or ACM (see e.g. IEEE.org and ACM.org). For example, have a look at the following conferences:

ISCA: International Symposium on Computer Architecture: iscaconf.org
IEEE MICRO: Symposium on Microprocessor Architectures: www.microarch.org
ASPLOS: Architectural Support for Programming Languages and Operating Systems: asplos-conference.org
ICS: International Conference on Supercomputing: www.ics-conference.org
ISSCC: International Solid State Circuits Conference: isscc.org
DAC: Design Automation Conference: www.dac.com
DATE: Design Automation and Test in Europe: www.date-conference.com
CODES (Hardware-Software Codesign) + ISSS (International Symposium on System Synthesis): www.codes-isss.org
CASES: Compilers, Architectures, and Synthesis for Embedded Systems: www.casesconference.org
HPCA: High-Performance Computer Architecture: www.hpcaconf.org
PACT: parallel architectures and compilation techniques: www.eecg.toronto.edu/pact

A much larger list can be found here.
PRESENTATION:
You should make a PowerPoint presentation on your topic; max 5 min. per presentation (e.g. 5 slides: one slide introducing the problem, then the approach and results of the paper, and a final conclusion with suggestions from your side on this topic; add / use clear pictures to explain the approach).
The presentation should contain at least the following:

Summary of the paper contributions (including technical details)
Your evaluation of the paper and topic

strong points
weak points
applicability of proposed methodology / solution
indicate new / future directions of research

In order to evaluate the paper you may wish to read related material on the same topic.
Your presentation will be evaluated by us. This evaluation will be taken into account for the final grading.

Examination

The examination will be a written exam, online (using your laptop), about the treated course theory (all treated slides + corresponding parts of the used book), and the 3 labs. Maximum grading:

for the three lab reports, each 1 point
written exam 7 points

of which 4.5 points directly related to the labs and corresponding architectures, and
2.5 points for the remaining theory questions (book + slides)

Note, the written exam includes many questions directly related to your labs.
When: in the exam period (check the time table).

In addition, as a small bonus (max 1 point), there will be a short oral presentation of a recent article you studied. This is not mandatory.