Embedded Computer Architecture
2011 - 2012 (1st semester)
Code : 5KK73
Credits: 5 ECTS
Lecturers: Prof. dr. Henk Corporaal, Dr. Bart Mesman
Tel. : +31-40-247 5195 / 3653 (secr.) 5462 (office)
Email: B.Mesman at tue.nl; H.Corporaal at tue.nl
Project assistance: Yifan He (Y.He at tue.nl), Dongrui She
(D.She at tue.nl), Akash Kumar (A.Kumar at tue.nl), and
Shakith Fernando (S.Fernando at tue.nl)
News
- Jan 19, 2012: All material and slides are now online.
- Third lab exercise on Data Memory Management is online!
Deadline is Friday, January 20, 2012.
- Second lab can be started; deadline is Dec 18.
- Updated schedule for 2011
- Student internship / thesis projects in the Computer Architecture area
- Oct 17: First lab on designing your own VLIW is now online; check
lab1. Akash Kumar and Shakith Fernando can help you with questions
(see email addresses above).
- Menno Lindwer (INTEL - Silicon Hive) will introduce the lab and
required tools on Oct 17.
- Start lectures on Monday, September 5, 2011, Location EE 9.05
Information on the course:
Description
When looking at future embedded systems and their design, especially
(but not exclusively) in the multi-media domain, we observe several
problems:
- high performance (100 GOPS and far beyond) has to be combined
with low power (many systems are mobile);
- time-to-market (to get your design done) constantly reduces;
- most embedded processing systems have to be extremely low cost;
- the applications show more dynamic behavior (resulting in
greatly varying quality and performance requirements);
- more and more the implementer requires flexible and programmable
solutions;
- a huge latency gap exists between processors and memories; and
- design productivity does not keep up with the increasing design
complexity.
In order to solve these problems we foresee the use of programmable
multi-processor platforms with an advanced memory hierarchy, together
with an advanced design trajectory. These platforms may contain
different processors, ranging from general purpose processors to
processors which are highly tuned for a specific application or
application domain. This course treats several processor architectures,
shows how to program and generate (compile) code for them, and compares
their efficiency in terms of cost, power and performance. Furthermore,
the tuning of processor architectures is treated. Several advanced
Multi-Processor Platforms, combining the discussed processors, are
treated. A set of lab exercises complements the course.
Purpose:
This course aims to provide an understanding of the processor
architectures which will be used in future multi-processor platforms,
including their memory hierarchy, especially for the embedded domain.
Treated processors range from general purpose to highly optimized ones.
Tradeoffs will be made between performance, flexibility,
programmability, energy consumption and cost. It will be shown how to
tune processors in various ways.
Furthermore, this course looks into the required design trajectory,
concentrating on code generation, scheduling, and on efficient data
management (exploiting the advanced memory hierarchy) for high
performance and low power.
The student will learn how to apply a methodology for a step-wise
(source code) transformation and mapping trajectory, going from an
initial specification to an efficient and highly tuned implementation
on a particular platform. The final implementation can be an order of
magnitude more efficient in terms of cost, power, and performance.
Topics:
In this course we treat different processor architectures: DSPs (digital
signal processors), VLIWs (very long instruction word, including
Transport Triggered Architectures), ASIPs (application specific
processors), and highly tuned, weakly programmable processors. In all
cases it is shown how to program these architectures. Code generation
techniques, especially for VLIWs, are treated, including methods to
optimize code at source or assembly level. Furthermore the design of
advanced data and instruction memory hierarchies will be detailed. A
methodology is discussed for the efficient use of the data memory
hierarchy.
Most of the topics will be supplemented by hands-on exercises.
For more information see:
course description and schedule.
The lecture slides will be made available during the course; see also
below.
Papers and other reading material
- Study Chapter 2 on Computer Architecture Trends
from "Microprocessor Architectures, from VLIW to TTA" by Henk
Corporaal, publisher John Wiley, 1998.
- Related to Data Memory Management:
- A paper about data reuse.
Formalized methodology for data reuse exploration in hierarchical
memory mapping.
J.Ph. Diguet et al.
- Code transformations.
Code transformations for data transfer and storage exploration
preprocessing multimedia processors.
Francky Catthoor, Nikil D. Dutt, Koen Danckaert and Sven Wuytack
IEEE Design and Test of Computers, May-June 2001
- Data storage components.
Random-access data storage components in customized architectures
Lode Nachtergaele, Francky Catthoor and Chidamber Kulkarni
IEEE Design and Test of Computers, May-June 2001
- Data optimizations.
Data memory organization and optimizations in Application Specific
systems
P.R. Panda et al.
IEEE Design and Test of Computers, May-June 2001
Slides (per topic; see also the course description)
** Slides as far as available; will be updated regularly during the
course.
Student presentations guidelines
As part of this lecture you have to study a hot topic related to this
course, and make a short slide presentation about this topic.
The slides have to be presented during the oral exam.
Guidelines are as follows:
- Choose one hot topic which interests you and which is highly related
to this course.
- Select two technical (in depth) research papers from the web, based
on this topic; both papers should be from different groups but about
the same topic.
- The papers should have sufficient technical depth; i.e. they should
clearly explain all the details of the proposed method or solution. You
can also check whether the paper is from a well-regarded journal or
conference, like the IEEE or ACM conferences and journals (see e.g.
IEEE.org and ACM.org). E.g., have a look at the following conferences:
- DATE: Design Automation and Test in Europe: www.date-conference.com
- CODES (Hardware-Software Codesign) + ISSS (International Symposium on
System Synthesis): www.codes-isss.org
- CASES: Compilers, Architectures, and Synthesis for Embedded Systems:
www.casesconference.org
- IEEE MICRO: Symposium on Micro Arch: www.microarch.org
- HPCA: High-Performance Computer Architecture: www.hpcaconf.org
- PACT: Parallel Architectures and Compilation Techniques:
www.eecg.toronto.edu/pact
- A larger list can be found here.
- The papers should be published in 2010 or later (try to
choose recent papers).
- You should make a PowerPoint presentation on your topic; max. 10
min. per presentation (so about 10 slides; e.g. a few slides
introducing the problem, then the approach and results of each paper,
and a final conclusion and suggestions from your side on this topic;
add / use clear pictures to explain the approach).
- The presentation should contain at least the following:
- Summary of the papers' contribution (including technical details)
- Your evaluation of the papers and topic
- strong points
- weak points
- applicability of proposed methodology / solution
- indicate new / future directions of research
- In order to evaluate the papers you may have to read related
material on the same topic.
- Your presentation will be evaluated by us. This evaluation will
be taken into account for the final grading.
Hands-on lab work
To become a very good Embedded Computer Architect you have to practice
a lot. Therefore, as part of this course we have put a lot of effort
into preparing 3 very interesting lab assignments. For each lab there
is a website with all the required documentation and preparation
material. These lab assignments can be done on your own laptop, with,
for certain parts, remote access to our server systems.
For every lab you have to write a report, which has to be sent to one
of the course assistants.
Hands-on 1:
Processor Design Space Exploration, based on the Silicon Hive
Architecture
In the past we had several architecture design space exploration (DSE)
labs, using the Transport Triggered Architecture (TTA) framework,
using the Imagine Processor, and one using the AR|T tools. This year we
base the first lab on the reconfigurable processor from Silicon Hive.
For this exercise:
- Check the link http://www.es.ele.tue.nl/~akash/5kk73.php
You'll find several pdf files. Have a look at
all of them first.
- Then check the start-up guide in detail.
- Thereafter start with the assignment. It also describes the
deliverables you have to send in (as a small report). The report
should be sent to Akash Kumar. He can also help you with questions.
Hands-on 2:
Platform Programming
In this lab you are asked to program a (multi-)processor platform. In
the past we developed various labs:
- using the Wica platform (with a 320 PE SIMD processor, the
Xetal); see (c) below.
- using the CELL platform (CELL is used in e.g. the PlayStation 3;
besides a PowerPC RISC processor it has up to 8 other processors for
the high-performance kernels; these processors also exploit sub-word
SIMD); see (b) below.
This year, 2011, we will take an x86 plus graphic processing unit (GPU)
as platform.
Programming Graphic Processing Units
Graphic processing units (GPUs) can contain up to hundreds of Processing
Engines (PEs). They achieve performance levels of hundreds of GFLOPS
(1 GFLOPS = 10^9 floating point operations per second). In the past
GPUs were very dedicated, not generally programmable, and could only be
used to speed up graphics processing. Today, they are becoming more and
more general purpose. The latest GPUs of ATI and NVIDIA can be
programmed in C and OpenCL. For this lab we will use NVIDIA GPUs
together with the CUDA (based on C) programming environment. Start with
setting up the CUDA environment, studying the available learning
materials, and running the example programs.
We added one extensive example program, about matrix multiplication,
which demonstrates various GPU programming optimizations.
You will see that getting something running with CUDA is not so
difficult, but getting it to run efficiently takes quite some effort.
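As a rough illustration of one such optimization, the sketch below shows
tiling (blocking) of a matrix multiplication in plain C. It is not the
course's example program and it runs on the CPU; in a CUDA version each
tile would typically be staged in on-chip shared memory by a thread
block. The function names and the tile size TILE = 16 are assumptions
made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16  /* assumed tile edge; on a GPU this would relate to the thread-block size */

/* Straightforward triple loop: c = a * b for n x n row-major matrices. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Tiled version: works on TILE x TILE sub-blocks so that the data of one
   block is reused many times before the next block is touched. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = (ii + TILE < n) ? ii + TILE : n;
                size_t k_end = (kk + TILE < n) ? kk + TILE : n;
                size_t j_end = (jj + TILE < n) ? jj + TILE : n;

                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; the tiled one only changes the
order in which the data is visited, which is the data-reuse idea that
shared-memory tiling on a GPU also exploits.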
After studying the example and learning material you have to perform
your own assignment and hand in a small report. The purpose is
to use your GPU as efficiently as possible.
All the details about this assignment can be found on the GPU-assignment site.
The assignment is made by Dongrui She and Zhenyu Ye. For questions
contact d.she _at_ tue.nl.
When finished, send in a small report about your result and various
applied optimizations to Dongrui She.
Hands-on 3: Exploiting the data memory hierarchy for high
performance and low power
In this exercise you are asked to optimize a C algorithm by using the
discussed data management techniques. This should result in an
implementation which shows much improved memory behavior, which in turn
improves performance and energy consumption. In this exercise we mainly
concentrate on reducing energy consumption. You need to download the
following, and follow the instructions.
The 2011 assignment can be found here.
The algorithm is based on Harris corner detection.
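To give a feeling for the kind of source-level transformation meant
here, the sketch below applies loop fusion to a small gradient
computation. Harris corner detection works with products of image
gradients, but this is not the assignment code: the function names, the
simple horizontal difference, and the integer pixel type are all
assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Two-pass version: the horizontal gradient is first written to a
   full-size temporary array gx, then read back to form its square.
   Border columns (x = 0 and x = w - 1) are left untouched. */
void grad_sq_two_pass(const int *img, int *gx, int *gxx, size_t w, size_t h)
{
    assert(img && gx && gxx);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 1; x + 1 < w; x++)
            gx[y * w + x] = img[y * w + x + 1] - img[y * w + x - 1];

    for (size_t y = 0; y < h; y++)
        for (size_t x = 1; x + 1 < w; x++)
            gxx[y * w + x] = gx[y * w + x] * gx[y * w + x];
}

/* Fused version: the gradient lives only in a local variable, so the
   temporary array and its write and read traffic disappear. */
void grad_sq_fused(const int *img, int *gxx, size_t w, size_t h)
{
    assert(img && gxx);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 1; x + 1 < w; x++) {
            int g = img[y * w + x + 1] - img[y * w + x - 1];
            gxx[y * w + x] = g * g;
        }
    }
}
```

The fused loop produces the same output while removing one full-image
array from the memory footprint; fewer accesses to large arrays is what
reduces energy in the memory hierarchy.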
You will start with a default platform, containing 2 levels of cache.
First calculate the results of your code optimizations
for this platform. Thereafter you are free to tune the platform for the
given application, e.g. by changing the caches, or even by
using scratchpad memory (SRAM) instead of, or in addition to, caches.
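To see why cache parameters and access order matter, a toy model can
help. The sketch below counts misses of a single-level direct-mapped
cache for row-major versus column-major traversal of a 2D array. It is
only an illustration under stated assumptions: it is not the lab's
simulator or platform, and all names (dm_cache, traverse_misses) and
sizes (4-byte elements, MAX_LINES) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LINES 1024   /* upper bound on cache lines in this toy model */
#define ELEM_BYTES 4     /* assumed element size, e.g. a 32-bit value */

typedef struct {
    size_t num_lines;          /* number of cache lines */
    size_t line_bytes;         /* bytes per cache line */
    uint64_t tag[MAX_LINES];   /* tag stored in each line */
    int valid[MAX_LINES];      /* 1 if the line holds data */
    unsigned long hits;
    unsigned long misses;
} dm_cache;

void cache_init(dm_cache *c, size_t num_lines, size_t line_bytes)
{
    assert(num_lines > 0 && num_lines <= MAX_LINES && line_bytes > 0);
    c->num_lines = num_lines;
    c->line_bytes = line_bytes;
    c->hits = 0;
    c->misses = 0;
    for (size_t i = 0; i < num_lines; i++)
        c->valid[i] = 0;
}

/* Look up one byte address; on a miss the line is replaced. */
void cache_access(dm_cache *c, uint64_t addr)
{
    uint64_t block = addr / c->line_bytes;
    size_t index = (size_t)(block % c->num_lines);
    uint64_t tag = block / c->num_lines;

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
    } else {
        c->misses++;
        c->valid[index] = 1;
        c->tag[index] = tag;
    }
}

/* Read every element of a rows x cols row-major array once, in either
   row-major or column-major order, and return the number of misses. */
unsigned long traverse_misses(size_t rows, size_t cols, int column_major,
                              size_t num_lines, size_t line_bytes)
{
    dm_cache c;
    cache_init(&c, num_lines, line_bytes);

    if (column_major) {
        for (size_t x = 0; x < cols; x++)
            for (size_t y = 0; y < rows; y++)
                cache_access(&c, (uint64_t)(y * cols + x) * ELEM_BYTES);
    } else {
        for (size_t y = 0; y < rows; y++)
            for (size_t x = 0; x < cols; x++)
                cache_access(&c, (uint64_t)(y * cols + x) * ELEM_BYTES);
    }
    return c.misses;
}
```

For a 64 x 64 array and a cache of 16 lines of 32 bytes, this model
gives 512 misses for row-major traversal and 4096 for column-major
traversal; enlarging the cache to 1024 lines brings the column-major
case down to 512 as well. Changing the loop order or the platform are
thus two routes to the same improvement.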
Good luck!
Examination
The examination will be oral, covering the treated course theory, the
lab report(s), and studied articles.
Date: 4th week of January 2011 or beginning of February.
Grading depends on your results on theory, lab exercises and your
presentation.
Related material and other links
Interesting processor architectures:
- The Cell architecture, made by Sony, IBM and Toshiba, and used e.g.
in PlayStation 3
- TRIPS architecture, combining several types of parallelism
- The tile based RAW architecture from MIT
- Imagine, a hybrid SIMD - VLIW architecture from Stanford
- Merrimac, the successor of the Imagine
- ChipCon, check e.g. their system-on-chip: CC1110
- MAXQ from MAXIM, Dallas; a Transport Triggered Architecture
- Aethereal, a Network-on-Chip from Philips