
Francisco Barat
Introduction and motivation:
VLIW processors can achieve the performance levels required
by future multimedia applications. However, they incur energy
penalties due to their parallelism. Techniques that improve performance while
reducing or holding constant the energy consumption are needed to increase the
lifetime of mobile multimedia terminals.
Hardware contribution:
The topic of this dissertation is CRISP, a coarse-grained
reconfigurable instruction set processor that accelerates multimedia
applications in a power-efficient manner. The
strength of this architecture rests on the following architectural
features:
- A scalable instruction fetch engine: Because instruction
bandwidth requirements vary within standard multimedia applications, we have
developed a scalable instruction fetch path whose energy consumption scales
with the instruction bandwidth. During inner loops (highly parallel), the
fetch unit can provide maximum bandwidth through a cluster of
energy-efficient loop buffers. For the rest of the code, the fetch unit accesses an
instruction cache whose width is matched to the limited instruction-level
parallelism.
- Software controlled functional unit chaining: Data
dependences are the main performance limitation of inner loops in very
wide VLIW processors. Functional unit chaining permits the execution of two
data-dependent operations in a single cycle through a simple modification to
the bypass network of the processor. Chaining can greatly reduce the length of
recurrent dependences, which is a step toward
achieving single-cycle loops (the most energy-efficient case with regard to
the instruction memory).
Additionally, to support wide VLIW processors, we use
a clustered datapath architecture.
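The effect of functional unit chaining on recurrent dependences can be
illustrated with the recurrence-constrained minimum initiation interval used
in modulo scheduling: the total latency around a dependence cycle divided by
its iteration distance, rounded up. The sketch below is only an illustration
under stated assumptions and is not the algorithm of the CRISP compiler. It
assumes that only single-cycle operations can be chained, that chained pairs
are formed greedily along the cycle, and that a chained pair completes in one
cycle. The function names and the example latencies are hypothetical.

```python
from math import ceil


def recurrence_latency(latencies, chaining=False):
    """Total latency (in cycles) along one dependence cycle.

    latencies: latency of each operation on the cycle, in dependence order.
    chaining:  if True, two adjacent single-cycle operations are assumed to
               execute back to back in one cycle (hypothetical model).
    """
    total = 0
    i = 0
    while i < len(latencies):
        can_chain = (
            chaining
            and i + 1 < len(latencies)
            and latencies[i] == 1
            and latencies[i + 1] == 1
        )
        if can_chain:
            total += 1  # the chained pair costs a single cycle
            i += 2
        else:
            total += latencies[i]
            i += 1
    return total


def rec_mii(latencies, distance, chaining=False):
    """Recurrence-constrained lower bound on the initiation interval."""
    return ceil(recurrence_latency(latencies, chaining) / distance)


# Two dependent single-cycle operations carried across one iteration.
print(rec_mii([1, 1], distance=1))                  # 2
print(rec_mii([1, 1], distance=1, chaining=True))   # 1

# A two-cycle operation followed by two single-cycle operations.
print(rec_mii([2, 1, 1], distance=1))                 # 4
print(rec_mii([2, 1, 1], distance=1, chaining=True))  # 3
```

In this simplified model, a recurrence made of two single-cycle operations
no longer prevents a loop from starting a new iteration every cycle, which
is the single-cycle loop case mentioned above.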
Software contribution:
The above hardware features do not provide any improvement
if they are not used by the compiler. To this end, we have developed a set of
optimizations to exploit them:
- A simple datapath clustering algorithm for clusters
with three or four functional units.
- A version of software pipelining that uses functional
unit chaining to increase performance.
- A version of software pipelining that takes into account
the scalable instruction fetch and minimizes the instruction bandwidth while
ensuring optimal performance.
The above compiler optimizations have all been integrated
in a prototype compiler based on Trimaran.