Embedded Computer Architecture

2014 - 2015 (1st semester)

Code : 5KK73
Credits : 5 ECTS
Lecturers : Prof. dr. Henk Corporaal, Dr. Bart Mesman
Tel. : +31-40-247 5195 / 3653 (secr.) 5462 (office)
Email:  B.Mesman at tue.nl; H.Corporaal at tue.nl
Project assistance: Yifan He (Y.He at tue.nl), Luc Waeijen (L.J.W.Waeijen at tue.nl), Mark Wijtvliet (M.Wijtvliet at tue.nl), and Shakith Fernando (S.Fernando at tue.nl)


Information on the course:


When looking at future embedded systems and their design, especially (but not exclusively) in the multi-media domain, we observe several problems. To solve these problems we foresee the use of programmable multi-processor platforms with an advanced memory hierarchy, together with an advanced design trajectory. These platforms may contain different processors, ranging from general purpose processors to processors that are highly tuned for a specific application or application domain. This course treats several processor architectures, shows how to program and generate (compile) code for them, and compares their efficiency in terms of cost, power and performance. Furthermore, the tuning of processor architectures is treated.

Several advanced Multi-Processor Platforms, combining discussed processors, are treated. A set of lab exercises complements the course.

This course aims to provide an understanding of the processor architectures that will be used in future multi-processor platforms, including their memory hierarchy, especially for the embedded domain. The treated processors range from general purpose to highly optimized ones. Trade-offs will be made between performance, flexibility, programmability, energy consumption and cost. It will be shown how to tune processors in various ways.

Furthermore this course looks into the required design trajectory, concentrating on code generation, scheduling, and on efficient data management (exploiting the advanced memory hierarchy) for high performance and low power. The student will learn how to apply a methodology for a step-wise (source code) transformation and mapping trajectory, going from an initial specification to an efficient and highly tuned implementation on a particular platform. The final implementation can be an order of magnitude more efficient in terms of cost, power, and performance.


In this course we treat different processor architectures: DSPs (digital signal processors), VLIWs (very long instruction word processors, including Transport Triggered Architectures, or TTAs), SIMDs (Single Instruction Multiple Data), GPUs (Graphics Processing Units), ASIPs (application-specific instruction-set processors), and highly tuned, weakly programmable processors. In all cases it is shown how to program these architectures. Furthermore, their combination and interconnection in MPSoCs (Multi-Processor Systems-on-Chip) is treated.
Code generation techniques exploiting instruction level parallelism, especially for VLIWs, are treated, including methods to optimize code at source or assembly level. Furthermore the design of advanced data and instruction memory hierarchies will be detailed. A methodology is discussed for the efficient use of the data memory hierarchy.

Most of the topics will be supplemented by very elaborate hands-on exercises.
For a preliminary schedule see: schedule.


The lecture slides will be made available during the course; see also below.
Papers and other reading material

Slides (per topic; see also the course description)

** Slides as far as available; not all links are updated yet; they will be updated regularly during the course.

Student presentations guidelines

As part of this lecture you have to study a hot topic related to this course, and make a short slide presentation about this topic.
The slides have to be presented during the oral exam.

Guidelines are as follows:

Hands-on lab work

To become a very good Embedded Computer Architect you have to practice a lot. Therefore, as part of this course, we have put a lot of effort into preparing 3 very interesting lab assignments. For each lab there is a website with all the required documentation and preparation material. These lab assignments can be made on your own laptop, with, for certain parts, remote access to our server systems.
For every lab you have to write a report, which has to be sent to one of the course assistants.
** labs will be put online during the course **

Hands-on 1: Processor Design Space Exploration, based on the Silicon Hive architecture from Intel

You will be asked to design and optimize a low power VLIW processor for the ECG application. In particular you have to trade off performance against energy consumption.
Details and instructions about this assignment can be found here.

Hands-on 2: Programming Graphics Processing Units

Graphics processing units (GPUs) can contain up to thousands of processing engines (PEs). They achieve performance levels of teraflops (10^12 floating point operations per second). In the past GPUs were very dedicated, not generally programmable, and could only be used to speed up graphics processing. Today they have become more and more general purpose. The latest GPUs of ATI/AMD and NVIDIA can be programmed in C and OpenCL. For this lab we will use GPUs together with the OpenCL (C-based) programming environment.

After studying the example and learning material you have to perform your own assignment and hand in a small report. This year's assignment is about generating money: coins, called Coinporaals. Generate as many as possible by making your program extremely efficient.

All the details about this assignment can be found on the GPU-assignment site.

Hands-on 3: Exploiting the data memory hierarchy for high performance and low power

In this exercise you are asked to optimize a C algorithm using the discussed data management techniques. This should result in an implementation with much improved memory behavior, which benefits both performance and energy consumption. In this exercise we mainly concentrate on reducing energy consumption. You need to download the following, and follow the instructions.

The 2014/2015 assignment can be found here. The algorithm is based on cavity detection, as discussed in the DMM lectures. You will start with a default platform with all images in external DRAM memory. First calculate the results of your code optimizations for this platform. Thereafter you are free to tune the platform for the given application, by using scratchpad memory (SRAM), with one or even two levels.
Good luck! Show how much energy and performance improvement you can achieve.


The examination will be oral, covering the treated course theory, the lab report(s), and the studied articles.
Likely week: the 4th week of January 2015, or the first week of February. We will discuss the dates with you.
Grading depends on your results on the theory, the lab exercises, and your presentation.

Related material and other links

Interesting processor architectures:

Back to homepage of Henk Corporaal