Embedded Computer Architecture
2012 - 2013 (1st semester)
Code : 5KK73
Credits: 5 ECTS
Lecturers : Prof. dr. Henk Corporaal, Dr. Bart Mesman
Tel. : +31-40-247 5195 / 3653 (secr.) 5462 (office)
Email: B.Mesman at tue.nl; H.Corporaal at tue.nl
Project assistance: Yifan He (Y.He at tue.nl), Dongrui She
(D.She at tue.nl), and Akash Kumar (A.Kumar at tue.nl), Shakith
Fernando (S.Fernando at tue.nl)
- Jan 17, 2013: all slides are now online. Good luck with
the preparation and the exams!!
- Note that you also have to study one very recent technical
in-depth paper (see the guidelines below)
- For the second lab assignment: check the GPU assignment site.
- Slides about Neural Architectures have been added.
- Slides of the Silicon Hive tools have been put online.
- An article about processor design: a 90 minute guide!
- Lab1-AES slides are available;
Lab1 starts Oct 22.
- September 11, 2012: start of the lectures
Information on the course:
When looking at future embedded systems and their design, especially
(but not exclusively) in the multi-media domain, we observe several
problems:
- high performance (100 GOPS and far beyond) has to be combined
with low power (many systems are mobile);
- the time-to-market (to get your design done) constantly reduces;
- most embedded processing systems have to be extremely low
cost;
- the applications show more dynamic behavior (resulting in
greatly varying quality and performance requirements);
- more and more, the implementer requires flexible and
programmable solutions;
- there is a huge latency gap between processors and memories; and
- design productivity does not cope with the increasing design
complexity.
In order to solve these problems we foresee the use of programmable
multi-processor platforms, having an advanced memory hierarchy,
together with an advanced design trajectory. These platforms may
contain different processors, ranging from general purpose
processors to processors which are highly tuned for a specific
application or application domain. This course treats several
processor architectures, shows how to program and generate (compile)
code for them, and compares their efficiency in terms of cost, power
and performance. Furthermore, the tuning of processor architectures
is discussed. Several advanced multi-processor
platforms, combining the discussed processors, are treated. A set of
lab exercises complements the course.
This course aims at getting an understanding of the processor
architectures which will be used in future multi-processor
platforms, including their memory hierarchy, especially for the
embedded domain. Treated processors range from general purpose to
highly optimized ones. Tradeoffs will be made between performance,
flexibility, programmability, energy consumption and cost. It will
be shown how to tune processors in various ways.
Furthermore this course looks into the required design
trajectory, concentrating on code generation, scheduling, and on
efficient data management (exploiting the advanced memory
hierarchy) for high performance and low power. The student will
learn how to apply a methodology for a step-wise (source code)
transformation and mapping trajectory, going from an initial
specification to an efficient and highly tuned implementation on a
particular platform. The final implementation can be an order of
magnitude more efficient in terms of cost, power, and performance.
In this course we treat different processor architectures: DSP
(digital signal processors), VLIWs (very long instruction word,
including Transport Triggered Architectures), ASIPs (application
specific processors), and highly tuned, weakly programmable
processors. In all cases it is shown how to program these
architectures. Code generation techniques, especially for VLIWs, are
treated, including methods to optimize code at source or assembly
level. Furthermore the design of advanced data and instruction
memory hierarchies will be detailed. A methodology is discussed for
the efficient use of the data memory hierarchy.
Most of the topics will be supplemented by hands-on exercises.
For a preliminary schedule see: schedule.
The lecture slides will be made available during the course; see below.
Papers and other reading material
- Study Chapter 2 on Computer Architecture Trends
from "Microprocessor Architectures, from VLIW to TTA" by Henk
Corporaal, publisher John Wiley, 1998.
- Related to Data Memory Management:
- A paper about data reuse:
"Formalized methodology for data reuse exploration in
hierarchical memory mapping."
- Code transformations:
"Code transformations for data transfer and storage exploration
preprocessing in multimedia processors."
Francky Catthoor, Nikil D. Dutt, Koen Danckaert and Sven Wuytack,
IEEE Design and Test of Computers, May-June 2001
- Data storage components:
"Random-access data storage components in customized architectures."
Lode Nachtergaele, Francky Catthoor and Chidamber Kulkarni,
IEEE Design and Test of Computers, May-June 2001
- Data optimizations:
"Data memory organization and optimizations in application-specific systems."
P.R. Panda et al.,
IEEE Design and Test of Computers, May-June 2001
Slides (per topic; see also the course description)
** Slides as far as available; will be updated regularly during the course.
- Overview of this lecture
- Topic 1: RISC processors: the MIPS
instruction set architecture (ISA) and processor implementation details
- Topic 2: VLIWs: Very Long
Instruction Word Architectures
- Topic 3: ASIPs and energy-efficient architectures
Lecture by Bart Mesman
- Topic 4: VLIW: compilation
/ code generation
- Topic 5:
Silicon Hive VLIW cores; Introduction to the first lab
Guest lecture by Ir. Menno Lindwer, PDEng
- Topic 6: SDRAM memory and
control & Mixed Time-Criticality
Guest lecturer: Ir. Sven Goossens
- Topic 7: Graphic Processing Units
(GPUs): what's inside and how to program them efficiently?
Guest lecturer: MSc. Zhenyu Ye
- Topic 8: Architecture
Accelerators for Neural Networks
Guest lecturer: Ir. Maurice Peemen
- Topic 9: DMM: Data Memory Management
- Topic 10:
- Topic 11: Design of
Embedded Signal Processing Systems using MATLAB and Simulink
Guest lecture by Giorgia Zucchelli, from MathWorks
- Topic 12: NoCs: Networks-on-Chip
Guest lecturer Prof Kees Goossens
As part of this lecture you have to study a hot topic related to
this course, and make a short slide presentation about this topic.
The slides have to be presented during the oral exam.
Guidelines are as follows:
- Choose one hot topic
which interests you and which is highly related to this course.
- Select one technical (in depth) research paper from the web,
based on this topic.
- The paper should have
sufficient technical depth; i.e. it should clearly
explain all the details of the proposed method or solution. So
e.g. do not choose company white papers or business papers. You can
also check whether the paper is from well-regarded journals or
conferences, like IEEE or ACM conferences and journals
(see e.g. IEEE.org and ACM.org). E.g., have a look at
the following conferences:
- DATE: Design, Automation and Test in Europe
- CODES (International Conference on Hardware/Software
Codesign) + ISSS (International Symposium on System Synthesis)
- CASES: International Conference on Compilers,
Architectures, and Synthesis for Embedded Systems
- MICRO: International Symposium on Micro Arch: www.microarch.org
- HPCA: International Symposium on High-Performance
Computer Architecture: www.hpcaconf.org
- PACT: International Conference on Parallel
architectures and compilation techniques
- A larger list can be found here.
- The paper should be published in 2010 or later
(try to choose a
very recent paper).
- You should make a PowerPoint presentation on your topic; max. 10 minutes per presentation
(so about 10 slides;
e.g. a few slides introducing the problem, then the approach and
results of the paper, and finally conclusions and suggestions from
your side on this topic; add / use clear pictures to explain the ideas).
- The presentation should contain at least the following:
- Summary of
the paper's contribution (including technical details)
- Your evaluation
of the paper and topic
- strong points
- weak points
- applicability of proposed methodology / solution
- indicate new / future directions of research
- In order to evaluate the paper you may have to read related material on the same topic.
- Your presentation will be evaluated by us. This evaluation
will be taken into account in the final grading.
Hands-on lab work
To become a very good embedded computer architect you have to
practice a lot. Therefore, as part of this course, we have put a lot
of effort into preparing three very interesting lab assignments. For each
lab there is a website with all the required documentation and
preparation material. These lab assignments can be made on your own
laptop, with, for certain parts, remote access to our server systems.
For every lab you have to write a report, which has to be sent to
one of the course assistants.
Processor Design Space Exploration, based on the Silicon Hive tools
In the past we had several architecture design space exploration
(DSE) labs: one using the Transport Triggered Architecture (TTA)
framework, one using the Imagine processor, and one using the AR|T
tools. This year we base the first lab on the reconfigurable
processor from Silicon Hive.
For this exercise:
- Check the link http://www.es.ele.tue.nl/~akash/5kk73.php
You'll find several pdf files; have a
look at all of them first.
- Then check the start-up guide in detail.
- Thereafter, start with the assignment. It also describes which
deliverables you have to send in (as a small report).
The report should be sent to Akash Kumar. He can also help you
with questions.
In this lab you are asked to program a (multi-)processor platform.
In the past we developed various labs:
- using the Wica platform (with a 320-PE SIMD processor, the
Xetal); see (c) below.
- using the CELL platform (the CELL is used in e.g. the
PlayStation 3; besides a PowerPC RISC processor it has up to 8
other processors for the high-performance kernels; these
processors also exploit sub-word SIMD); see (b) below.
This year, 2012, we will take an x86 plus graphics processing unit
(GPU) as the platform.
Programming Graphic Processing Units
Graphics processing units (GPUs) can contain up to hundreds of
Processing Engines (PEs). They achieve performance levels of
hundreds of GFLOPS (10^9 floating point operations per second). In
the past GPUs were very dedicated, not generally programmable,
and could only be used to speed up graphics processing. Today
they have become more and more general purpose. The latest GPUs of ATI
and NVIDIA can be programmed in C and OpenCL. For this lab we will
use NVIDIA GPUs together with the CUDA (C-based) programming
environment. Start with setting up the CUDA environment, studying
the available learning materials, and running the example programs.
We added one extensive example program, about matrix multiplication,
which demonstrates various GPU programming optimizations.
You will see that getting something running using CUDA is not so
difficult, but getting it running efficiently will take quite some
effort.
After studying the example and learning material you have to carry out
your own assignment and hand in a small report. The purpose is
to use your GPU as efficiently as possible.
All the details about this assignment can be found on the GPU-assignment
site.
The assignment was made by Dongrui She and Zhenyu Ye. For questions
contact d.she _at_ tue.nl.
When finished, send a small report about your results and the
applied optimizations to Dongrui She.
Hands-on 3: Exploiting the data memory hierarchy for high
performance and low power
In this exercise you are asked to optimize a C algorithm by using
the discussed data management techniques. This should result in an
implementation which shows much improved memory behavior, which
improves both performance and energy consumption. In this exercise we
mainly concentrate on reducing energy consumption. You need to
download the following, and follow the instructions.
The 2011 year assignment can be found here.
The algorithm is based on Harris corner detection.
You will start with a default platform, containing 2 levels of
cache. First calculate the results of your code optimizations
for this platform. Thereafter you are free to tune the platform for
the given application, e.g. changing the caches, or even
use ScratchPad memory (SRAM) instead of, or in addition to, caches.
The examination will be oral, covering the treated course theory, the
lab report(s), and the studied articles.
Likely week: 4th week of January 2013.
Grading depends on your results on the theory, the lab exercises and your
presentation.
Related material and other links
Interesting processor architectures:
- The Cell
architecture, made by Sony, IBM and Toshiba, and used e.g.
in the PlayStation 3
- An architecture combining several types of parallelism
- The tile-based RAW
architecture from MIT
- Imagine, a hybrid SIMD-VLIW architecture from Stanford
- Merrimac, the
successor of the Imagine
- ChipCon; check e.g. their system-on-chip: CC1110
- A Transport Triggered Architecture from MAXIM, Dallas
- A Network-on-Chip from Philips
Back to homepage of Henk Corporaal