Embedded Systems in Silicon (2005-2006)


Code : TD5102
Lecturer : Prof. Dr. Henk Corporaal
Email : H.Corporaal at tue.nl
Phone : TU/e: +31-40-247 5195 or 3653 (secr. TU/e) / 5462 (office TU/e)
NUS-ECE-DTI +65-6874 4188 (secr. DTI) / 4182 (office DTI)
Assistance: Dr. Yajun Ha: NUS-ECE E1-08-17, elehy at nus.edu.sg, tel 2258
MSc. Valentin Gheorghita (TU/e: LCC compiler): S.V.Gheorghita at tue.nl
Ir. Sander Stuijk (TU/e: SystemC and MIPS models): s.stuijk at tue.nl
Prerequisites: A course in computer architecture;
Programming experience in C, C++, or an equivalent language

News and Update


Information on the course:

Objective

When looking at future embedded systems and their design, especially (but not exclusively) in the multi-media domain, we observe several problems. To solve these problems we propose to use programmable multi-processor platforms with an advanced memory hierarchy, together with an advanced design flow for these platforms. The treated design flow starts with an executable specification of your application and ends with a very efficient mapping (in terms of performance and energy consumption) onto a certain platform. This course treats how to design these future embedded systems, solving the mentioned problems with the above solutions.
Note that we do not treat how to go from an idea to an executable specification. For that high-level part of the design flow, see e.g. the DTI course TD5101, Specification of Complex Hardware/Software Systems.

Topics

In this course we treat a selection of the following topics:

1. Overview

We start by discussing recent trends and platform developments, explain what we mean by mapping and design space exploration, and give an overview of the design flow trajectory for embedded systems; in particular for streaming-based systems, as found in the multi-media application domain.

2. Platform and platform components

As we already mentioned, we foresee that platforms will raise the abstraction level for future application and system designers. Platforms are essentially multi-processor systems, often realized on a single chip. Depending on the application domain you will see a variety of processors, including RISC and VLIW processors and domain-specific accelerators. We will treat the following:

3. Data Management

In this part the emphasis is on efficient data management: exploiting the advanced memory hierarchy to achieve high performance and low power. We distinguish the management of dynamic (heap-based) data structures and of (big) static data structures, like (multi-dimensional) arrays.
The student will learn how to apply a methodology for a step-wise (source code) transformation and mapping trajectory, going from an initial specification to an efficient and highly tuned implementation on a particular platform. The final implementation can be an order of magnitude more efficient in terms of cost, power, and performance.
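To give a flavour of such source code transformations (this is only an illustration, not material from the course; the array names and sizes are made up), a typical step is loop fusion followed by array contraction: a large intermediate array that is produced by one loop and immediately consumed by the next can often be reduced to a scalar, so the data never has to leave the registers or a small local memory.

    // Illustrative sketch only; array names and sizes are hypothetical.
    #include <cstdio>

    const int N = 1024;
    int in[N], out[N];

    // Before the transformation, an intermediate array 'tmp' of N elements
    // lives in (off-chip) memory:
    //   int tmp[N];
    //   for (int i = 0; i < N; i++) tmp[i] = in[i] * 3;
    //   for (int i = 0; i < N; i++) out[i] = tmp[i] + in[i];

    // After loop fusion and array contraction, 'tmp' becomes a scalar,
    // so each element is produced and consumed without a memory round trip.
    void transformed()
    {
        for (int i = 0; i < N; i++) {
            int tmp = in[i] * 3;      // produced ...
            out[i] = tmp + in[i];     // ... and consumed immediately
        }
    }

    int main()
    {
        for (int i = 0; i < N; i++) in[i] = i;
        transformed();
        printf("%d\n", out[N - 1]);   // simple check: 4 * (N - 1)
        return 0;
    }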

4. Task Concurrency Management

In this part we will show you how to partition applications, and in particular which metrics are suitable to evaluate, at a high level, the quality of a certain task partition. This will be based on the Yapi programming language/environment and the CAST tooling from Sander Stuijk (TU/e).
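As a purely illustrative sketch (the lectures define the actual metrics used in the course), two common high-level indicators of partition quality are the load balance over the processors and the ratio of inter-task communication to computation; the task workloads and communication volume below are made-up numbers.

    // Illustrative sketch; workloads and communication volumes are made up.
    #include <cstdio>
    #include <vector>
    #include <algorithm>
    #include <numeric>

    int main()
    {
        // Computation cost (e.g., cycles) assigned to each of two processors.
        std::vector<long> load = {1200, 800};
        // Data exchanged between the two partitions (e.g., bytes per frame).
        long communication = 300;

        long total   = std::accumulate(load.begin(), load.end(), 0L);
        long maxLoad = *std::max_element(load.begin(), load.end());

        // Load balance: 1.0 means perfectly balanced, larger means worse.
        double balance   = (double)maxLoad * load.size() / total;
        // Communication-to-computation ratio: lower is better.
        double commRatio = (double)communication / total;

        printf("load balance   : %.2f\n", balance);
        printf("comm/comp ratio: %.2f\n", commRatio);
        return 0;
    }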

5. Code generation

Platform-based design abstracts from the underlying processor components. This means we have to rely on high-quality compilers to bridge the gap between a high-level language (like C or C++) and the processor ISA (instruction set architecture). Since many processors exploit some kind of instruction-level parallelism, compilers have to be extended with a so-called scheduling phase. In this part we will discuss different scheduling algorithms, from basic block scheduling to modulo software pipelining.
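To give a feel for what such a scheduling phase does (a minimal sketch only, not one of the algorithms as treated in the lectures; the dependence graph, latencies and issue width are invented), a basic block list scheduler repeatedly picks, from the operations whose predecessors have finished, ones that can still be issued in the current cycle:

    // Minimal list-scheduling sketch for one basic block.
    #include <cstdio>
    #include <vector>

    struct Op {
        const char*      name;
        std::vector<int> deps;    // indices of operations this op depends on
        int              latency; // result ready 'latency' cycles after issue
        int              issueCycle; // -1 = not yet scheduled
    };

    int main()
    {
        // a = ld; b = ld; c = a + b; d = c * 2  (loads take 2 cycles, ALU 1)
        std::vector<Op> ops = {
            {"ld a",  {},     2, -1}, {"ld b",  {},  2, -1},
            {"add c", {0, 1}, 1, -1}, {"mul d", {2}, 1, -1},
        };
        const int issueWidth = 2;              // at most 2 ops per cycle

        unsigned scheduled = 0;
        for (int cycle = 0; scheduled < ops.size(); cycle++) {
            int issued = 0;
            for (auto& op : ops) {
                if (op.issueCycle >= 0 || issued == issueWidth) continue;
                bool ready = true;             // all predecessors finished?
                for (int d : op.deps)
                    if (ops[d].issueCycle < 0 ||
                        ops[d].issueCycle + ops[d].latency > cycle) ready = false;
                if (ready) {
                    op.issueCycle = cycle;
                    issued++;
                    scheduled++;
                    printf("cycle %d: %s\n", cycle, op.name);
                }
            }
        }
        return 0;
    }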


Book and Handouts

Computer Organization and Design
- The Hardware/Software Interface
3rd Edition


David A. Patterson and John L. Hennessy
Morgan Kaufmann Publishers
The above book will be used for the RISC architecture and processor implementation part (mainly chapters 2, 5 and 6). We also recommend the companion book by Hennessy and Patterson, Computer Architecture: A Quantitative Approach (currently in its 3rd edition). That book discusses recent trends in computer architecture and advanced architecture concepts.
Besides the above book, we will distribute handouts in the form of papers and slides (see below).

 

Slides

Missing slides will be put on the web as we proceed during the course.

Papers and other documentation to read

Lab exercises

These will be partly updated during the course!

A. MIPS assembly programming exercise

In this exercise we use the SPIM simulator to program the MIPS processor in assembly language. See the home page of the SPIM simulator (for the MIPS R2000/R3000 architectures). You are asked to write and demonstrate a program implementing an interesting algorithm which contains at least a non-leaf function and/or a recursive function. Hand in both the C code and the assembly code of your program.
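For instance (just to indicate the kind of program meant, not a prescribed assignment), a small recursive function such as the one below is automatically non-leaf and already exercises argument passing, the stack, and saving the return address; you would then translate it to MIPS assembly yourself and demonstrate it in SPIM.

    /* Illustrative example of the kind of program meant: a recursive,
       hence non-leaf, function.  Translate it to MIPS assembly yourself
       and run it under SPIM. */
    #include <stdio.h>

    int fib(int n)                      /* recursive: calls itself twice */
    {
        if (n < 2)
            return n;
        return fib(n - 1) + fib(n - 2); /* needs the stack to save $ra and n */
    }

    int main(void)
    {
        printf("fib(10) = %d\n", fib(10));   /* expected: 55 */
        return 0;
    }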

B. Design and mapping to FPGA

We organize two lab sessions. For details about these sessions, see here.

C. Design and implementation of an embedded RISC processor

Note: the mapping to FPGA in this exercise is optional, but highly recommended. This means you have a couple of options: 1) do everything using SystemC simulation only; 2) use the mapping tools to FPGA, giving you e.g. area and timing estimates (this can still be done on a PC); 3) really get things running on the FPGA board (giving you the tremendous speed of an FPGA). You have to demonstrate at least the first option.

For this exercise we use two small MIPS processors. Both processors are described in SystemC, a language in which you can describe both hardware and software; see www.systemc.org. Study the SystemC user manual, which you can find on this website, especially the example in chapter two. The exercise contains 4 assignments, as described below; for further details see the MIPS Lab exercises.
Note that for the NUS-specific instruction we had to adapt the documentation about the mMIPS, and specifically its design flow; therefore see MIPS Lab at NUS.
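If you have not used SystemC before, the following minimal module (a sketch only, in the spirit of the chapter-two example of the user manual; the module and signal names are our own and do not appear in the mMIPS sources) shows the basic ingredients you will encounter: ports, a process method, and a sensitivity list.

    // Minimal SystemC sketch (module and signal names are illustrative only):
    // a combinational adder whose output follows its inputs.
    #include <systemc.h>

    SC_MODULE(adder)
    {
        sc_in<int>  a, b;        // input ports
        sc_out<int> sum;         // output port

        void compute() { sum.write(a.read() + b.read()); }

        SC_CTOR(adder)
        {
            SC_METHOD(compute);  // process: re-run compute ...
            sensitive << a << b; // ... whenever a or b changes
        }
    };

    int sc_main(int argc, char* argv[])
    {
        sc_signal<int> a, b, sum;
        adder add1("add1");
        add1.a(a); add1.b(b); add1.sum(sum);

        a = 2; b = 3;
        sc_start(1, SC_NS);      // simulate long enough for the method to run
        cout << "sum = " << sum.read() << endl;   // expected: 5
        return 0;
    }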

D. Optimize the memory use of a C algorithm

In this exercise you are asked to optimize a C algorithm by applying the discussed data management techniques. This should result in an implementation with a much improved memory behavior, which improves both performance and energy consumption; in this exercise we mainly concentrate on reducing energy consumption. You need to download the following and follow the instructions:

Interesting links and other material

Examination

The examination will be oral; the dates will likely be in the second week of January 2006. Grading will be 50% based on your lab exercises and reports, and 50% on the theory discussed during the lectures.


Back to homepage of Henk Corporaal