Platform-based Design

2009-2010 (1st semester)

Code : 5KK70
Credits: 5 ECTS
Lecturers : Prof. dr. Henk Corporaal, Dr. Bart Mesman
Tel. : +31-40-247 5195 / 3653 (secr.) 5462 (office)
Email: B.Mesman at tue.nl; H.Corporaal at tue.nl
Project assistance: Yifan He (Y.He at tue.nl), Dongrui She (D.She at tue.nl), and Akash Kumar (A.Kumar at tue.nl)

News

18 Jan: added final slides: the architecture modeling slides
4 Jan: all DMM lecture slides are now online
21 Dec: the 3rd lab assignment has been added
7 Dec: added slides on SDRAM and predictable and composable memory control
20 Nov: added NOC slides for coming Monday
Note: GPU lab deadline is Monday Dec 14
16 Nov: 2nd lab assignment on GPUs online (see assignment 2a)
13 Nov: Added slides for MPSoC, Loop transformations and GPUs
Check the adapted schedule for 2009
Note: Material and links will be updated/added during the course
6 Sept 2009: First lecture slides have been added

Description

When looking at future embedded systems and their design, especially (but not exclusively) in the multi-media domain, we observe several problems:

high performance (10 GOPS and beyond) has to be combined with low power (many systems are mobile);
time-to-market (to get your design done) constantly reduces;
most embedded processing systems have to be extremely low cost;
the applications show more dynamic behavior (resulting in greatly varying quality and performance requirements);
more and more the implementer requires flexible and programmable solutions;
there is a huge latency gap between processors and memories; and
design productivity does not cope with the increasing design complexity.

To solve these problems we foresee the use of programmable multi-processor platforms with an advanced memory hierarchy, combined with an advanced design trajectory. These platforms may contain different processors, ranging from general-purpose processors to processors that are highly tuned for a specific application or application domain. This course treats several processor architectures, shows how to program and generate (compile) code for them, and compares their efficiency in terms of cost, power and performance. Furthermore, the tuning of processor architectures is treated.

Several advanced Multi-Processor Platforms, combining discussed processors, are treated. A set of lab exercises complements the course.

Purpose:
This course aims at getting an understanding of the processor architectures which will be used in future multi-processor platforms, including their memory hierarchy, especially for the embedded domain. Treated processors range from general purpose to highly optimized ones. Tradeoffs will be made between performance, flexibility, programmability, energy consumption and cost. It will be shown how to tune processors in various ways.

Furthermore this course looks into the required design trajectory, concentrating on code generation, scheduling, and on efficient data management (exploiting the advanced memory hierarchy) for high performance and low power. The student will learn how to apply a methodology for a step-wise (source code) transformation and mapping trajectory, going from an initial specification to an efficient and highly tuned implementation on a particular platform. The final implementation can be an order of magnitude more efficient in terms of cost, power, and performance.

Topics:

In this course we treat different processor architectures: DSPs (digital signal processors), VLIWs (very long instruction word processors, including Transport Triggered Architectures), ASIPs (application-specific instruction-set processors), and highly tuned, weakly programmable processors. In all cases it is shown how to program these architectures. Code generation techniques, especially for VLIWs, are treated, including methods to optimize code at source or assembly level. Furthermore, the design of advanced data and instruction memory hierarchies will be detailed. A methodology is discussed for the efficient use of the data memory hierarchy.
Most of the topics will be supplemented by hands-on exercises.
For more information on course and lecture schedule see: course description

Handouts

The lecture slides will be made available during the course; see also below.
Papers and other reading material

Study Chapter 2, on Computer Architecture Trends,
from "Microprocessor Architectures: From VLIW to TTA" by Henk Corporaal, publisher John Wiley, 1998.
Related to Data Memory Management:

A paper about data reuse. Formalized methodology for data reuse exploration in hierarchical memory mapping.
J.Ph. Diguet et al.
Code transformations. Code transformations for data transfer and storage exploration preprocessing in multimedia processors.
Francky Catthoor, Nikil D. Dutt, Koen Danckaert and Sven Wuytack
IEEE Design and Test of Computers, May-June 2001
Data storage components. Random-access data storage components in customized architectures
Lode Nachtergaele, Francky Catthoor and Chidamber Kulkarni
IEEE Design and Test of Computers, May-June 2001
Data optimizations. Data memory organization and optimizations in application-specific systems
P.R. Panda et al.
IEEE Design and Test of Computers, May-June 2001

Slides (per topic; see also the course description)

Slides as far as available (will be updated regularly during the course).

Overview of this lecture
Topic 1: Introduction + Programmable CPU / RISC cores
Detailed discussion of the MIPS architecture and implementation, based on the book by Patterson and Hennessy, Computer Organization and Design
Topic 2: Programmable Digital Signal Processors
Topic 3: VLIW architectures
Topic 4: ILP compilation
Topic 5: ASIPs: Application Specific Instruction-set Processors
Topic 6: SIMD: Single Instruction Multiple Data architectures
Topic 7: Silicon Hive VLIW cores; Introduction to the first lab assignment
Guest lecture by Ir. Menno Lindwer, PDEng
Topic 8: MPSoC and CELL Platform
Topic 9: Loop transformation overview
Topic 10: GPU: Graphics Processing Unit
Introduction to the second lab exercise
By Zhenyu Ye

Topic 11: MPSoC continued, Real-time Scheduling
Topic 12: NOC: Networks-on-Chip. How to connect all your IP blocks.
Guest lecture by Prof. Kees Goossens
Topic 13: Making memory access predictable and composable
Guest lecture by MSc. Benny Akesson

Topic 14: DMM: Data Memory Management. Optimizing the memory use of your program.

Recap on Caches and Memory
part a: Overview of the whole methodology
part b: Platform & Layer independent steps
part c: Platform & Layer dependent steps
part d: Data layout for Cache based memory hierarchy

Topic 15: Architecture Modeling
In particular how to model a register file, and application to the Imagine processor
Student presentations
See below for details.

Student presentations guidelines

As part of this course you have to study a hot topic related to the course and make a short slide presentation about it.
The slides have to be presented during the oral exam.

Guidelines are as follows:

Choose one hot topic which interests you and which is highly related to this course.
Select two technical (in-depth) research papers from the web on this topic; the two papers should come from different groups but address the same topic.
The papers should have sufficient technical depth, i.e. each should clearly explain all the details of the proposed method or solution. You can also check whether a paper comes from a well-regarded journal or conference, such as the IEEE or ACM conferences and journals (see e.g. IEEE.org and ACM.org). For example, have a look at the following conferences:

DATE: Design Automation and Test in Europe: www.date-conference.com
CODES (Hardware-Software Codesign) + ISSS (International Symposium on System Synthesis): www.codes-isss.org
CASES: Compilers, Architectures, and Synthesis for Embedded Systems: www.casesconference.org
MICRO: IEEE/ACM International Symposium on Microarchitecture: www.microarch.org
HPCA: High-Performance Computer Architecture: www.hpcaconf.org
PACT: Parallel Architectures and Compilation Techniques: www.eecg.toronto.edu/pact

A larger list can be found here.
The papers should be published after 2000 (try to choose recent papers).
You should make a PowerPoint presentation on your topic of at most 10 minutes (so about 10 slides: e.g. a few slides introducing the problem, then the approach and results of each paper, and finally your own conclusions and suggestions on this topic; use clear pictures to explain the approach).
The presentation should contain at least the following:

Summary of the papers' contributions (including technical details)
Your evaluation of the papers and topic

strong points
weak points
applicability of proposed methodology / solution
indicate new / future directions of research

In order to evaluate the papers you may have to read related material on the same topic.
Your presentation will be evaluated by us. This evaluation will be taken into account for the final grading.

Hands-on lab work

During the course there are three lab exercises to complete (so-called hands-on exercises); see also the links below. They will be explained at the corresponding lectures.

Hands-on 1: Processor Design Space Exploration, based on the Silicon Hive Architecture

In the past we had several architecture design space exploration (DSE) labs: one using the Transport Triggered Architecture (TTA) framework, one using the Imagine Processor, and one using the AR|T tools. This year we base the first lab on the reconfigurable processor from Silicon Hive.
For this exercise:

Check the link http://www.ics.ele.tue.nl/~akash/education/5kk70/
You'll find several pdf files. Have a look at all of them first.
Then check the start-up guide in detail.
Then start with the assignment. It also describes the deliverables you have to send in (as a small report). The report should be sent to Akash Kumar, who can also help you with questions.

Hands-on 2: Platform Programming

In this lab you are asked to program a multi-processor platform. In the past we developed various labs:

using the WiCa platform (with a 320 PE SIMD processor, the Xetal); see (c) below.
using the CELL platform (CELL is used in e.g. the PlayStation 3; besides a PowerPC RISC processor it has up to 8 other processors for the high-performance kernels; these processors also exploit sub-word SIMD); see (b) below.

This year, 2009, we will take an x86 processor plus a graphics processing unit as the platform. So you should make assignment (a) below.

a. Programming Graphics Processing Units

Graphics processing units (GPUs) can contain up to hundreds of Processing Engines (PEs). They achieve performance levels of hundreds of GFLOPS (1 GFLOPS = 10^9 floating-point operations per second). In the past GPUs were highly dedicated, not generally programmable, and could only be used to speed up graphics processing. Today they are becoming more and more general purpose. The latest GPUs of ATI and NVIDIA can be programmed in C and OpenCL. For this lab we will use NVIDIA GPUs together with the CUDA (based on C) programming environment. Start with setting up the CUDA environment, studying the available learning materials, and running the example programs.
We added one extensive example program, about matrix multiplication, which demonstrates various GPU programming optimizations.
You will see that getting something running with CUDA is not so difficult, but getting it to run efficiently takes quite some effort.
After studying the example and learning material you have to perform your own assignment and hand in a small report. The purpose is
to use your GPU as efficiently as possible.
All the details about this assignment can be found on the GPU-assignment site.
The assignment is made by Dongrui She and Zhenyu Ye. For questions contact d.she _at_ tue.nl.
When finished, send in a small report about your result and various applied optimizations to Dongrui She.
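
As a rough illustration of the kind of optimization a matrix multiplication example can demonstrate, the sketch below shows loop tiling (blocking) in plain C. It is only a CPU analogue and not the CUDA example from the assignment site: on a GPU the same idea is typically applied by staging tiles in on-chip shared memory. The function names and the tile size TILE are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile size; tune it to the fast memory that is available */

/* Reference version: c = a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Tiled version: same result, but the work is reordered into
 * TILE x TILE blocks so that each block of a and b is reused
 * while it is still in fast memory. */
void matmul_tiled(size_t n, const float *a, const float *b, float *c)
{
    assert(n % TILE == 0);  /* simplification: n must be a multiple of TILE */

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                /* multiply one tile of a with one tile of b */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t j = jj; j < jj + TILE; j++) {
                        float sum = c[i * n + j];
                        for (size_t k = kk; k < kk + TILE; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

Both functions compute the same product; the tiled one only changes the order of the work. Whether and how much this helps depends on the matrix size and the memory hierarchy, so it should be measured rather than assumed.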

b. Programming the CELL Broadband Engine

The CELL contains a PowerPC processor and 8 SPEs (Synergistic Processing Elements), of which you can use 6 (number 7 is used by the operating system and number 8 is not guaranteed to be functional). The CELL processor is part of the Sony PlayStation 3, which we will use as target. A good simulator and compiler environment is also available.
All details about the architecture, the simulation and compiler environment, and example programs can be found at the CELL-assignment page. Read this page carefully and follow the instructions.

c. Programming the WiCa 1.1 board

The WiCa 1.1 board is developed by Philips and NXP and is meant for use in smart cameras. It contains, among other components, the Xetal SIMD image processing chip with 320 Processing Elements, and an 8051 microcontroller. To observe the world it contains two image sensors; this even allows for stereo vision and depth calculation.
To connect to your PC it has a USB interface, but you can also attach a ZigBee low-power interface to make a smart wireless sensor network.
The assignment is on using the image sensors to detect simple objects, their movement, and if possible, cooperate with other boards.
All details about the WiCa platform, the simulation and compiler environment, and example programs can be found at the WiCa-assignment page. Read this page carefully and follow the instructions.

Hands-on 3: Exploiting the data memory hierarchy for high performance and low power

In this exercise you are asked to optimize a C algorithm by using the discussed data management techniques. This should result in an implementation with much improved memory behavior, which in turn improves performance and energy consumption. In this exercise we mainly concentrate on reducing energy consumption. You need to download the following, and follow the instructions:

Guidelines. This describes stepwise what you should do.
The algorithm and other required files.
Send the report to Corporaal, and bring a hardcopy + final code to the oral exam.
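
To give a flavor of what a source-level data management transformation looks like, the sketch below applies loop fusion to a made-up two-step kernel (scale, then clip). It is not the lab algorithm; the function names are hypothetical, and the number of array accesses is used only as a crude stand-in for memory energy.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: two passes that communicate through an
 * intermediate array tmp. Returns the number of array accesses. */
unsigned long scale_clip_two_pass(size_t n, const int *in, int *tmp, int *out)
{
    unsigned long accesses = 0;
    assert(in && tmp && out);

    for (size_t i = 0; i < n; i++) {   /* pass 1: scale */
        tmp[i] = 2 * in[i];
        accesses += 2;                 /* read in[i], write tmp[i] */
    }
    for (size_t i = 0; i < n; i++) {   /* pass 2: clip to 255 */
        out[i] = tmp[i] > 255 ? 255 : tmp[i];
        accesses += 2;                 /* read tmp[i], write out[i] */
    }
    return accesses;
}

/* Fused form: one pass; the intermediate value stays in a local
 * variable, so the tmp array and its accesses disappear. */
unsigned long scale_clip_fused(size_t n, const int *in, int *out)
{
    unsigned long accesses = 0;
    assert(in && out);

    for (size_t i = 0; i < n; i++) {
        int t = 2 * in[i];             /* intermediate kept local */
        out[i] = t > 255 ? 255 : t;
        accesses += 2;                 /* read in[i], write out[i] */
    }
    return accesses;
}
```

For n elements the two-pass form counts 4n array accesses and the fused form 2n, while both produce the same output. A real energy estimate would also have to weigh which memory level each access hits.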

Examination

The examination is oral and covers the treated course theory, the lab report(s), and the studied articles.
Date: 4th week of January 2010
Grading depends on your results on theory, lab exercises and your presentation.