Presented here is a walk-through of preparing the JPEG decompression algorithm for the combined SDF3-MAMPS tool flow. This page starts by introducing a parallelized version (KPN) of the JPEG decompression application and then gives a step-by-step description of how to create a working implementation on the MAMPS platform.
The source code used in the following examples can be found in the downloads section.
1. The application
1.1 KPN model
The JPEG decompression algorithm can be expressed using the KPN below. Such a KPN graph can either be constructed manually from the application or by using tools like PNGen.
JPEG decompression is done in five steps as can be seen from the KPN.
- The VLD process takes the input file, parses the file headers and performs the variable-length decoding (decompression). It produces a minimal coded unit (MCU), which consists of one or more (up to ten) 8x8 pixel blocks of data (depending on the compression options) representing a part of the image.
- The IQZZ process takes 8x8 pixel blocks of data, performs inverse quantization and reordering on the block, and forwards the block to the next step.
- The IDCT process calculates an inverse discrete cosine transform of the 8x8 pixel blocks.
- The CC process combines the 8x8 pixel blocks inside an MCU into RGB pixel values.
- The Raster process puts the pixel values for the MCU in place in the image and writes the image to disk when the conversion is complete.
NOTE: A more complete description of the JPEG algorithm can be found here
The SubHeader1 and SubHeader2 parameters store information from the JPEG file header (image size, block size, etc...) and are used as preparation for the translation of the KPN into SDF. The information in these parameters is completely contained in the JPEG file header and therefore not necessarily part of the KPN graph.
1.2 Sequential implementation
The implementations of the KPN processes can be found in the actors folder, with their declarations in actors/actors.h. All tokens communicated between processes are defined in actors/structures.h.
The actors.h file lists a function for each process except for the VLD process which is implemented by two functions.
- init_header_vld, opens the input file, parses the file headers and initializes all parameters.
- header_vld, implements a single firing of the VLD actor, extracting one block of data from the file.
actors/actors.h
#ifndef _JPEG_ACTORS_H_INCLUDED
#define _JPEG_ACTORS_H_INCLUDED
#include "structures.h"
#ifdef __cplusplus
extern "C" {
#endif
void init_header_vld(FValue * mcu_after_vld, SubHeader1 * SH1, SubHeader2 * SH2);
void header_vld(FValue * mcu_after_vld, SubHeader1 * SH1, SubHeader2 * SH2);
void iqzz(const FValue * V, FBlock * B);
void idct(const FBlock * input, PBlock * output);
void cc(const SubHeader1 * SH1, const PBlock * PB, ColorBuffer * CB);
void raster(const SubHeader2 * SH2, const ColorBuffer * CB);
#ifdef __cplusplus
}
#endif
#endif /* _JPEG_ACTORS_H_INCLUDED */
The sequential version of the JPEG decoder is provided in the file sequential/run.c and listed below. The number of 8x8 pixel blocks in an MCU is given as part of the SubHeader1 structure (n_comp). The code in run.c is a simple C program that calls the KPN processes in a fixed order. This program is a working version of the algorithm and can be considered the starting point for this walk-through.
sequential/run.c
#include <stdlib.h>
#include "../actors/actors.h"

int main(const int argc, const char **argv)
{
    init_header_vld(NULL, NULL, NULL);
    while (1) {
        int i;
        SubHeader1 sh1;
        SubHeader2 sh2;
        FValue MCU[10];
        PBlock PB[10];
        ColorBuffer CB;

        header_vld(MCU, &sh1, &sh2);
        for (i = 0; i < sh1.n_comp; i++) {
            FBlock FB;
            iqzz(&MCU[i], &FB);
            idct(&FB, &PB[i]);
        }
        cc(&sh1, PB, &CB);
        raster(&sh2, &CB);
    }
    return 0;
}
The sequential version of the JPEG decoder can be compiled using the supplied makefile with the command make jpeg. Running the executable created by this command (jpeg) creates a file output.bmp, which contains the bitmap version of the input file (surfer.jpg).
NOTE: The file expected-output.bmp contains a known good version of the output for comparison.
2. Creating the HAPI implementation
Up to this point the algorithm has only been tested in its sequential version. Hidden dependencies between actors may still exist, and these lead to strange results when the actors run in parallel. Running the actors as a process network helps to expose such dependencies. The HAPI library is a header-only C++ library that, built on top of the SystemC library, allows process networks to be created.
2.1 The creation of processes
Each KPN process is translated into a HAPI process module. This is done by writing a small C++ class around the process implementation. An object of this class performs the necessary communication and calls the process implementation.
The VLD process is used as an example because it is the only process that has an initialization step. All other processes are translated in a similar way.
NOTE: The HAPI process implementation can be found in the hapi folder where each process has a .h file for the implementation.
A HAPI process implementation usually consists of four parts:
- File header and class definition
- Ports definition, each port of the process is listed here with the type of token that is communicated.
- The Constructor which takes a name parameter. The constructor is a good place to do process initialization (i.e. calling the init_header_vld function).
- A main() method. This method contains an infinite loop implementing the process. It performs the reading and writing of data from and to the process ports.
NOTE: The implementation of the HAPI model has been somewhat simplified by removing the variability in the number of 8x8 pixel blocks inside an MCU. The implementation fixes this number at 10 (the worst-case situation), as can be seen from the acquireSpace and write calls for outport0.
hapi/vld.h (part 1)
#include "hapi.h"
#include "../actors/actors.h"
class VLD:public Process {
public:
hapi/vld.h (part 2)
    OutPort < FValue > outport0;
    OutPort < SubHeader1 > outport1;
    OutPort < SubHeader2 > outport2;
hapi/vld.h (part 3)
    VLD(std::string name):Process(name), outport0("outport0"), outport1("outport1"), outport2("outport2") {
        init_header_vld(NULL, NULL, NULL);
    }
hapi/vld.h (part 4)
    void main() {
        FValue mcu_after_vld[10];
        SubHeader1 subheader1;
        SubHeader2 subheader2;
        do {
            outport0->acquireSpace(10);
            outport1->acquireSpace(1);
            outport2->acquireSpace(1);
            cout << "VLD" << endl;
            header_vld(mcu_after_vld, &subheader1, &subheader2);
            outport0->write(mcu_after_vld, 10);
            outport1->write(subheader1);
            outport2->write(subheader2);
        } while (true);
    }
};
2.2 Constructing the network
The next step is to construct the network that connects all the processes and forms the KPN graph.
The HAPI framework shows its limitations at this step. KPN specifies that all edges in the network behave as unbounded FIFO queues, but HAPI does not provide a data type capable of this.
Two solutions for this problem exist:
- Use the YAPI framework. This framework is similar to HAPI but does provide an appropriate data type. However, YAPI has not been maintained on its project site for a while, and compiling it with a newer version of GCC gives conflicts.
- Use bounded FIFO queues with a sufficiently large bound. This is the easier solution. Choosing the bounds for this example is relatively simple, but it can prove difficult for more complex examples.
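The second option can be illustrated with a small ring buffer. The following is only a sketch in plain C and not the HAPI Fifo class; the names bounded_fifo, fifo_put and fifo_get are made up for this illustration. Where a real KPN edge would block the writer on a full queue (or the reader on an empty one), the sketch simply returns 0.

```c
#include <stddef.h>

/* Hypothetical bounded FIFO of int tokens; NOT the HAPI Fifo class. */
#define FIFO_CAPACITY 10   /* the bound, e.g. the worst-case number of blocks in an MCU */

typedef struct {
    int    data[FIFO_CAPACITY];
    size_t head;    /* index of the oldest token */
    size_t count;   /* number of tokens currently stored */
} bounded_fifo;

/* Returns 1 on success, 0 if the FIFO is full (a real writer would block here). */
static int fifo_put(bounded_fifo *f, int token)
{
    if (f->count == FIFO_CAPACITY)
        return 0;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = token;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (a real reader would block here). */
static int fifo_get(bounded_fifo *f, int *token)
{
    if (f->count == 0)
        return 0;
    *token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}
```

If the bound is chosen too small, a writer can block forever and the network deadlocks, which is why the bound has to be "sufficiently large" for the behavior to match the unbounded KPN.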
The implementation for the HAPI process network can be found in hapi/top.h. This file:
- Starts with including the HAPI process definitions as written in the previous step.
- Creates a class top which is derived from the HAPI ProcessNetwork.
- Instantiates all processes and FIFO buffers; the size of a FIFO is given as the second parameter of its constructor.
- Connects them according to the network graph.
- Calls the init_PN method from its parent to initialize the network.
hapi/top.h
#include "vld.h"
#include "iqzz.h"
#include "idct.h"
#include "cc.h"
#include "raster.h"

class top:public ProcessNetwork {
public:
    VLD * vld;
    IQZZ *iqzz;
    IDCT *idct;
    CC *cc;
    Raster *raster;

    Fifo < FValue > *vld2iqzz0;
    Fifo < SubHeader1 > *vld2cc0;
    Fifo < SubHeader2 > *vld2raster0;
    Fifo < FBlock > *iqzz2idct0;
    Fifo < PBlock > *idct2cc0;
    Fifo < ColorBuffer > *cc2raster0;

    top(std::string name):ProcessNetwork(name) {
        vld2iqzz0 = new Fifo < FValue > ("vld2iqzz0", 10);
        vld2cc0 = new Fifo < SubHeader1 > ("vld2cc0", 1);
        vld2raster0 = new Fifo < SubHeader2 > ("vld2raster0", 1);
        iqzz2idct0 = new Fifo < FBlock > ("iqzz2idct0", 1);
        idct2cc0 = new Fifo < PBlock > ("idct2cc0", 10);
        cc2raster0 = new Fifo < ColorBuffer > ("cc2raster0", 1);

        vld = new VLD("vld");
        vld->outport0(*vld2iqzz0);
        vld->outport1(*vld2cc0);
        vld->outport2(*vld2raster0);

        iqzz = new IQZZ("iqzz");
        iqzz->inport0(*vld2iqzz0);
        iqzz->outport0(*iqzz2idct0);

        idct = new IDCT("idct");
        idct->inport0(*iqzz2idct0);
        idct->outport0(*idct2cc0);

        cc = new CC("cc");
        cc->inport0(*idct2cc0);
        cc->inport1(*vld2cc0);
        cc->outport0(*cc2raster0);

        raster = new Raster("raster");
        raster->inport0(*cc2raster0);
        raster->inport1(*vld2raster0);

        init_PN();
    }
};
2.3 The final steps
The last few steps are relatively simple. An sc_main function needs to be written to serve as the entry point for the SystemC application. This function needs to instantiate the process network (as created in the previous step) and call sc_start(). The hapi/main.cc file implements this last part of the HAPI implementation.
hapi/main.cc
#include "top.h"
using namespace sc_core;
int sc_main(int argc, char *argv[])
{
    top top1("Top1");
    sc_start();
    return 0;
}
Building the HAPI version can be done with the supplied makefile using the command make jpeg-hapi.
3. Translation of the KPN into an SDF model
Transforming the KPN into an SDF model is a relatively simple step. The SDF model uses the term actor where KPN uses process. Actors differ from processes in that any process state must be modeled explicitly by using self-edges. The SDF graph furthermore adds token rates and initial tokens to the edges, and adds two edges for communicating information from the image header to the CC and Raster actors.
Combining this gives the following graph:
Using SDF to model the application implies that token rates are fixed in the graph. The token rate for the output port of the VLD (producing 1 MCU of up to 10 blocks at a time) is therefore fixed at the worst-case number of 10. The same holds for the input of the CC actor.
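Fixed rates make it possible to check that the graph is consistent: for every channel, the number of tokens produced per iteration must equal the number consumed. The sketch below checks these balance equations for the decoder, using the rates from the example's usecase.xml (self-edges omitted). The type and function names are made up for this illustration and are not part of SDF3.

```c
#include <stddef.h>

/* One SDF channel: src produces `prod` tokens per firing, dst consumes `cons`. */
typedef struct {
    int src, dst;
    int prod, cons;
} sdf_channel;

enum { VLD, IQZZ, IDCT, CC, RASTER, N_ACTORS };

/* Channels and rates of the JPEG decoder graph (self-edges omitted). */
static const sdf_channel jpeg_channels[] = {
    { VLD,  IQZZ,   10, 1  },   /* vld2iqzz              */
    { IQZZ, IDCT,   1,  1  },   /* iqzz2idct             */
    { IDCT, CC,     1,  10 },   /* idct2cc               */
    { CC,   RASTER, 1,  1  },   /* cc2raster             */
    { VLD,  CC,     1,  1  },   /* vld2cc_subheader1     */
    { VLD,  RASTER, 1,  1  },   /* vld2raster_subheader2 */
};

/* q[a] is the number of firings of actor a in one graph iteration.
 * Returns 1 if q[src] * prod == q[dst] * cons holds on every channel. */
static int is_balanced(const sdf_channel *ch, size_t n, const int *q)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (q[ch[i].src] * ch[i].prod != q[ch[i].dst] * ch[i].cons)
            return 0;
    return 1;
}
```

With these rates, one iteration of the graph consists of one firing of VLD, CC and Raster and ten firings each of IQZZ and IDCT, that is q = {1, 10, 10, 1, 1}.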
4. Creating the use-case specification for SDF3
The use-case specification for SDF3 is written manually, based on the SDF graph and the SDF3 use-case XML specification. This is the usecase.xml file in the example.
This file starts with a set of use-case descriptions (sets of applications) that should be mapped by the tool. Only one application is mapped in the current example so only one use-case is specified.
usecase.xml (snippet)
<usecase name='jpeg'>
    <application name='jpegdecoder' weight='1' />
</usecase>
The use-case section is followed by an applicationGraph node for each application listed in the use-cases. Each application-graph is constructed in two parts.
- The SDF graph
- The properties of the actors and channels in the SDF graph
4.1 Creating the SDF graph in XML format
Writing the XML code for the SDF graph is pretty straight-forward. Each actor is defined as an actor node (the type attribute of the actor node can be ignored) with its input and output ports and each port is assigned a rate that equals the rate in the graph. Channels are defined with the channel tag, connected to their source and destination and optionally filled with a number of initial tokens.
usecase.xml (snippet)
<sdf name="jpegdecoder" type="JPEG_decoder">
    <actor name="vld" type="A0">
        <port name="out" type="out" rate="10" />
        <port name="out_subheader1" type="out" rate="1" />
        <port name="out_subheader2" type="out" rate="1" />
        <port name="state_in" type="in" rate="1" />
        <port name="state_out" type="out" rate="1" />
    </actor>
    <actor name="iqzz" type="A1">
        <port name="in" type="in" rate="1" />
        <port name="out" type="out" rate="1" />
    </actor>
    <actor name="idct" type="A2">
        <port name="in" type="in" rate="1" />
        <port name="out" type="out" rate="1" />
    </actor>
    <actor name="cc" type="A3">
        <port name="in" type="in" rate="10" />
        <port name="in_subheader1" type="in" rate="1" />
        <port name="out" type="out" rate="1" />
    </actor>
    <actor name="raster" type="A3">
        <port name="in" type="in" rate="1" />
        <port name="in_subheader2" type="in" rate="1" />
        <port name="state_in" type="in" rate="1" />
        <port name="state_out" type="out" rate="1" />
    </actor>
    <channel name="vld2iqzz" srcActor="vld" srcPort="out" dstActor="iqzz" dstPort="in" />
    <channel name="vld2cc_subheader1" srcActor="vld" srcPort="out_subheader1" dstActor="cc" dstPort="in_subheader1" />
    <channel name="vld2raster_subheader2" srcActor="vld" srcPort="out_subheader2" dstActor="raster" dstPort="in_subheader2" />
    <channel name="vld_state" srcActor="vld" srcPort="state_out" dstActor="vld" dstPort="state_in" initialTokens='1' />
    <channel name="iqzz2idct" srcActor="iqzz" srcPort="out" dstActor="idct" dstPort="in" />
    <channel name="idct2cc" srcActor="idct" srcPort="out" dstActor="cc" dstPort="in" />
    <channel name="cc2raster" srcActor="cc" srcPort="out" dstActor="raster" dstPort="in" />
    <channel name="raster_state" srcActor="raster" srcPort="state_out" dstActor="raster" dstPort="state_in" initialTokens='1' />
</sdf>
The next section in the use-case description lists the properties of the SDF graph elements (actors and channels). Each actor has an actorProperties node and each channel has a channelProperties node in this section.
4.2 Adding the properties of the actors
This part will explain the values in the actorProperties node. The actorProperties node contains information about the implementation of the actor.
The following properties are defined in this node:
- Which function is used to implement the actor and (optionally) which function should be used for initialization.
- The processor types this actor can be mapped to.
- The Worst-Case Execution-Time (WCET) and memory usage for each implementation
- The files required to compile the implementation.
- The mapping of actor ports to the function arguments of the implementation.
4.2.1 Determining WCET
The WCET of an actor is probably the most difficult property to determine. There are some automated tools that can perform WCET analysis of C code but they all have their limitations (both in functionality and in availability).
Before choosing numbers, it helps to consider why the WCET of an actor is used in the calculations. SDF3 uses the WCET to determine the real-time behavior and to verify the timing constraints (e.g. frame rate) of the application. Under-estimating the WCET leads to missed (hard) real-time constraints, while over-estimating the WCET makes the calculations reserve more resources (computation time) than needed and can therefore lead SDF3 to reject a design as infeasible.
The numbers used in the example have been gathered by profiling using a combination of two methods:
- Instruction counting of simple blocks (using a cross compiler to determine the instruction count for the microblaze target architecture)
- Simulation of the application while keeping a track of the number of instructions executed in these basic blocks.
This gives a coarse estimate, which can be used for mapping soft real-time applications but is not usable for hard real-time applications.
NOTE: the code used for counting the instructions for the actors is not included in the example due to licensing issues with the tools used
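The arithmetic behind this profiling approach can be sketched as follows. This is not the code that was used to obtain the numbers in the example; the names basic_block and estimate_cycles are hypothetical, and the sketch assumes that every instruction takes one clock cycle, which ignores memory stalls and multi-cycle instructions.

```c
#include <stddef.h>

/* One profiled basic block: the instruction count taken from the
 * cross-compiled code and the execution count observed in simulation. */
typedef struct {
    unsigned long instructions;
    unsigned long executions;
} basic_block;

/* Estimated cycles for one actor firing.
 * ASSUMPTION: one clock cycle per instruction. */
static unsigned long estimate_cycles(const basic_block *bb, size_t n)
{
    unsigned long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += bb[i].instructions * bb[i].executions;
    return total;
}
```

To approximate a worst case rather than an average, the execution counts should be taken from the most expensive firing observed, for instance an MCU with the maximum number of blocks.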
4.2.2 Determining memory usage
Calculating the memory usage is much easier. The usecase.xml file format requires three types of memory to be specified:
- .code: instruction memory
- .data: data memory (both initialized and uninitialized)
- sharedVar: runtime data memory (stack)
NOTE: The sharedVar memory entry is not used in the current version but should still be specified with size 0.
The MAMPS platform requires that no heap is used in the actor code (i.e. no malloc and related functions). This implies that all memory is allocated statically and, thus, reserved in the compiled object file for the actor implementation. The easiest method to gather the instruction and data size of an actor is therefore to look at the object files used in its implementation.
The VLD actor is constructed from several files (header_vld.c, huffman.c, and utilities.c). Compiling these files with the Microblaze cross-compiler and running size on the object files yields the following:
mb-gcc -c -o actors/header_vld.o actors/header_vld.c
mb-gcc -c -o actors/huffman.o actors/huffman.c
mb-gcc -c -o actors/utilities.o actors/utilities.c
size actors/header_vld.o actors/huffman.o actors/utilities.o
   text    data     bss     dec     hex filename
   7091       0     444    7535    1d6f actors/header_vld.o
   2538      16    1137    3691     e6b actors/huffman.o
   1227       0       4    1231     4cf actors/utilities.o
Adding the numbers in the text column gives the code size of the actor (7091 + 2538 + 1227 = 10856), and adding the numbers in both the data and bss columns gives the size of the required data memory (0 + 16 + 0 + 444 + 1137 + 4 = 1601).
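The same sums can be written down as a small C sketch, which is convenient when an actor consists of many object files. The numbers are copied from the size output above; the type and function names are made up for this illustration.

```c
#include <stddef.h>

/* One row of the `size` output. */
typedef struct {
    const char *file;
    unsigned text, data, bss;
} section_sizes;

/* Values copied from the size output for the VLD actor. */
static const section_sizes vld_objects[] = {
    { "actors/header_vld.o", 7091,  0,  444 },
    { "actors/huffman.o",    2538, 16, 1137 },
    { "actors/utilities.o",  1227,  0,    4 },
};

/* Size for the .code memoryElement: sum of all text sections. */
static unsigned code_size(const section_sizes *o, size_t n)
{
    unsigned total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += o[i].text;
    return total;
}

/* Size for the .data memoryElement: sum of all data and bss sections. */
static unsigned data_size(const section_sizes *o, size_t n)
{
    unsigned total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += o[i].data + o[i].bss;
    return total;
}
```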
4.2.3 Completing the actorProperties tag
Putting these numbers in the actorProperties tag for the VLD actor gives the following result:
usecase.xml (snippet)
<actorProperties actor="vld">
    <function name='header_vld' />
    <initFunction name='init_header_vld' />
    <processor type="microblaze0" default="true">
        <executionTime>37235</executionTime>
        <memoryElement name='.code'>
            <size>10856</size>
            <accessCnt>1</accessCnt>
            <accessType>IFetch</accessType>
            <accessSize>word</accessSize>
        </memoryElement>
        <memoryElement name='.data'>
            <size>1601</size>
            <accessCnt>1</accessCnt>
            <accessType>DRead,DWrite</accessType>
            <accessSize>halfword</accessSize>
        </memoryElement>
        <memoryElement name='sharedVar'>
            <size>0</size>
            <accessCnt>1</accessCnt>
            <accessType>DRead,DWrite</accessType>
            <accessSize>byte</accessSize>
        </memoryElement>
        <fileToCompile name='actors/header_vld.c' />
        <fileToCompile name='actors/huffman.c' />
        <fileToCompile name='actors/huffman.h' />
        <fileToCompile name='actors/structures.h' />
        <fileToCompile name='actors/utilities.c' />
        <fileToCompile name='actors/utilities.h' />
    </processor>
    <portMapping portName='out' arg='1' />
    <portMapping portName='out_subheader1' arg='2' />
    <portMapping portName='out_subheader2' arg='3' />
</actorProperties>
4.3 Adding the properties for the channels
The last part of the graph properties (the channelProperties) is easy to complete. The only value required for the channel properties is the size of the token to be communicated.
Two different kinds of channels can be noted in this part:
- Channels that are used for calculation purposes (i.e. self edges for actors that keep state information)
- Channels that are used for the transfer of tokens
Channels of the first type usually do not have a physical implementation. Instead, their data is typically held in static global variables in the actor code. This implies that the data held on such an edge is already counted as part of the actor memory. These channels can therefore be assigned a token size of 0.
usecase.xml (snippet)
<channelProperties channel="vld_state">
    <tokenSize sz="0" />
</channelProperties>
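What such a state-carrying actor looks like in C can be sketched as follows. The actor and its counter are hypothetical and do not come from the example sources; the point is only that the state lives in a static variable and not in a token.

```c
/* Hypothetical actor state. A static variable ends up in the data/bss
 * section of the actor's object file, so it is already included in the
 * .data size reported by `size`. */
static unsigned long firings_done = 0;

/* One firing of the hypothetical actor: the state is read and updated in
 * place. In the SDF graph this dependency between consecutive firings is
 * modeled by a self-edge with one initial token and a token size of 0. */
unsigned long example_actor_fire(void)
{
    firings_done++;
    return firings_done;
}
```

The single initial token on the self-edge keeps two firings of the same actor from overlapping, which matches the fact that both firings would access the same static variable.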
The token-size for channels of the second type should be set to sizeof(token_type). This value can be obtained by using the following C program and replacing token with the type of token that is communicated over the channel:
get_size.c
#include <stdio.h>
#include "actors/structures.h"

int main(void)
{
    /* sizeof yields a size_t; cast it so that it matches the %lu format */
    printf("Size of token: %lu\n", (unsigned long) sizeof(token));
    return 0;
}
4.4 Completing the usecase.xml
Finally, SDF3 needs a time constraint for the execution of the application. This constraint is expressed in iterations of the SDF graph per clock cycle of the system (which runs at 50 MHz on the MAMPS platform).
usecase.xml (snippet)
<graphProperties>
    <timeConstraints>
        <throughput>0.000000005</throughput>
    </timeConstraints>
</graphProperties>
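Because the constraint is given per clock cycle, it is easy to misread. The conversion can be sketched as follows, assuming the 50 MHz clock mentioned above; the function names are made up for this illustration.

```c
/* Clock frequency of the MAMPS platform as stated in the text. */
#define MAMPS_CLOCK_HZ 50000000.0

/* Converts a throughput in iterations per clock cycle
 * to iterations per second. */
static double iterations_per_second(double per_cycle)
{
    return per_cycle * MAMPS_CLOCK_HZ;
}

/* Number of clock cycles available for one graph iteration. */
static double cycles_per_iteration(double per_cycle)
{
    return 1.0 / per_cycle;
}
```

For the value used here, 0.000000005 iterations per cycle corresponds to about 0.25 iterations per second, or roughly 200 million clock cycles per iteration.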
NOTE: The completed usecase.xml can be found with the example code
5. Rewriting the actor code for the MAMPS platform
Up to this point only a few restrictions have been imposed on the actor implementations. However, there are a few things to note before trying to run the code on the MAMPS platform.
The MAMPS platform, being an embedded platform, has limited support for the standard C library. The two most important differences are that 1) the MAMPS platform has no HEAP, and 2) file-IO is implemented in a not-so-standard way.
5.1 No HEAP
The HEAP section is normally used by programs to store dynamically allocated memory. The absence of a HEAP section therefore implies that no dynamic memory allocation (e.g. malloc, calloc, free, mmap, ...) is possible on the MAMPS platform. Care should be taken to replace any dynamically allocated memory by statically allocated memory of a fixed size.
This implies that the size of some buffers needs to be fixed at its worst-case value, which in turn can imply restrictions on the input data. One example is restricting the size of an image that is decompressed in order to determine a maximum size for the image buffer.
The example code does this in actors/raster.c by defining a 64x64 pixel image (with 3-byte color values) as the maximum output size.
actors/raster.c (snippet)
#define IMAGE_SIZE (3*64*64)
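How a fixed-size buffer replaces a malloc call can be sketched as follows. This is not the code from actors/raster.c; the names MAX_WIDTH, MAX_HEIGHT, IMAGE_BYTES and pixel_at are hypothetical. The important part is the explicit check that rejects images that do not fit.

```c
#include <stddef.h>

#define MAX_WIDTH   64
#define MAX_HEIGHT  64
#define IMAGE_BYTES (3 * MAX_WIDTH * MAX_HEIGHT)   /* 3 bytes per pixel */

/* Statically allocated, replacing something like malloc(3 * width * height). */
static unsigned char image[IMAGE_BYTES];

/* Returns a pointer to the first color byte of pixel (x, y) in an image of
 * the given size, or NULL if the image exceeds the static buffer or the
 * coordinates lie outside the image. */
static unsigned char *pixel_at(unsigned width, unsigned height,
                               unsigned x, unsigned y)
{
    if (width > MAX_WIDTH || height > MAX_HEIGHT || x >= width || y >= height)
        return NULL;
    return &image[3 * ((size_t) y * width + x)];
}
```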
5.2 File-IO
Removing the file-IO dependencies from an application can be more cumbersome. The MAMPS platform provides access to the Xilinx FatFS library which implements a limited subset of the file-IO functionality available in the C library. These functions have a different name (names are prefixed with sysace_) and offer limited functionality.
The available functions in the Xilinx FatFS library are (directory support is disabled on the MAMPS platform):
- sysace_fopen
- sysace_fclose
- sysace_fread
- sysace_fwrite
These functions are enough to perform the most rudimentary file-IO but most applications will also be using different functions (like ftell and fseek) to perform random access reads to the input data.
Two methods exist to avoid this problem. The first method is to remove the unsupported function calls from the actor implementations; the second is to write a replacement (stub) for the missing functions.
- Removing unsupported function calls often implies rewriting parts of the code (which means that the actor metrics need to be determined again). One way to achieve this is to read the input file entirely before starting to process the data, but this can require a large amount of data memory for storing the input file. Storing the input data in memory does, however, give completely random access to it and can simplify the transition for applications that rely on random access.
- Writing a replacement (stub) for missing functions can also give good results without having to change (a lot of) actor code. It can be difficult however to access some data that might be needed by the replacement implementations. Writing a replacement for ftell, for example, requires knowledge of the current position in the file, and a replacement for fseek requires changing the data inside the structures used for reading and writing data.
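The first method can be sketched as follows: the input is read once into a static buffer, after which tell, seek and getc become simple index operations. The sketch uses the standard C library so that it can be tried on a PC; on the MAMPS platform the fread and fgetc calls would be replaced by their sysace_ counterparts. All names (INPUT_MAX, mem_load, mem_getc, mem_tell, mem_seek) are hypothetical, and the 64 KiB bound is an assumption.

```c
#include <stdio.h>
#include <stddef.h>

#define INPUT_MAX 65536   /* ASSUMED upper bound on the input file size */

static unsigned char input_data[INPUT_MAX];
static size_t input_size = 0;
static size_t input_pos  = 0;

/* Reads the whole stream into the static buffer.
 * Returns 0 on success, -1 if the file does not fit. */
static int mem_load(FILE *f)
{
    input_size = fread(input_data, 1, INPUT_MAX, f);
    input_pos  = 0;
    /* A completely filled buffer is only fine if no data is left. */
    if (input_size == INPUT_MAX && fgetc(f) != EOF)
        return -1;
    return 0;
}

/* Replacement for fgetc on the in-memory copy. */
static int mem_getc(void)
{
    return input_pos < input_size ? input_data[input_pos++] : EOF;
}

/* Replacement for ftell: the position is simply the buffer index. */
static long mem_tell(void)
{
    return (long) input_pos;
}

/* Replacement for fseek(..., SEEK_SET): absolute seek, in both directions. */
static int mem_seek(long offset)
{
    if (offset < 0 || (size_t) offset > input_size)
        return -1;
    input_pos = (size_t) offset;
    return 0;
}
```

The price of this approach is the INPUT_MAX bytes of data memory, which have to be added to the .data size of the actor.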
5.2.1 Extended MAMPS version
The two functions ftell and fseek are the most difficult to replace from the application context. A choice was therefore made to add support (be it limited) for both functions to the MAMPS platform library.
- The behavior of ftell is implemented in long sysace_ftell(SYSACE_FILE *stream).
- The behavior of fseek is implemented in int sysace_fseek(SYSACE_FILE *stream, long offset, int whence). However, sysace_fseek is currently limited in its capabilities: it only allows forward seeking (offset > 0) from the current position (whence = SEEK_CUR).
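What a forward-only seek relative to the current position amounts to can be illustrated by skipping bytes with plain reads. The sketch below is not the implementation of sysace_fseek; skip_forward is a hypothetical helper written against the standard C library so that it can be tried on a PC.

```c
#include <stdio.h>

/* Hypothetical helper: moves `offset` bytes forward from the current
 * position by reading and discarding data.
 * Returns 0 on success, -1 on a negative offset or a short read. */
static int skip_forward(FILE *stream, long offset)
{
    unsigned char scratch[64];

    if (offset < 0)
        return -1;   /* backward seeks are not supported */
    while (offset > 0) {
        size_t want = offset < (long) sizeof scratch ? (size_t) offset
                                                     : sizeof scratch;
        if (fread(scratch, 1, want, stream) != want)
            return -1;
        offset -= (long) want;
    }
    return 0;
}
```

Actor code that only skips ahead, for example over a header segment whose length is known, fits this restriction; code that jumps back to an earlier position has to be restructured.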
5.2.2 Wrapper code for the example project
The JPEG-decoder example chooses to use the extended MAMPS version (with sysace_ftell and sysace_fseek) and adds a small wrapper library for compatibility with the PC platform. This wrapper library extends the functionality even further with the sysace_fgetc and sysace_putc functions.
Compatibility with the PC platform (for testing with the sequential and HAPI versions) is provided by a set of definitions in the actors/stdio-wrapper.h file. These definitions map the sysace_* functions onto their normal C library variants.
NOTE: the C definition microblaze can be used to distinguish between the PC and MAMPS platforms.
actors/stdio-wrapper.h
#include <stdio.h>
#ifndef STDIO_WRAPPER_INCLUDED
#define STDIO_WRAPPER_INCLUDED
#ifdef microblaze
# include <sysace_stdio.h>
int sysace_fgetc(SYSACE_FILE *stream);
int sysace_putc(int c, SYSACE_FILE *stream);
#else
# define SYSACE_FILE FILE
# define sysace_fopen fopen
# define sysace_fclose fclose
# define sysace_fread fread
# define sysace_fwrite fwrite
# define sysace_ftell ftell
# define sysace_fseek fseek
# define sysace_fgetc fgetc
# define sysace_putc putc
#endif
#endif /* STDIO_WRAPPER_INCLUDED */
The availability of sysace_fread and sysace_fwrite makes writing the sysace_fgetc and sysace_putc functions relatively easy as can be seen in actors/stdio-wrapper.c.
actors/stdio-wrapper.c
#include "stdio-wrapper.h"

#ifdef microblaze
int sysace_fgetc(SYSACE_FILE *stream)
{
    int ret;
    unsigned char c;
    ret = sysace_fread(&c, 1, 1, stream);
    if( ret != 1 )
        return EOF;
    return (int)c;
}

int sysace_putc(int c, SYSACE_FILE *stream)
{
    int ret;
    unsigned char b = c;
    ret = sysace_fwrite(&b, 1, 1, stream);
    if( ret != 1 )
        return EOF;
    return c;
}
#endif
6. Running the example with SDF3 and MAMPS
Running the tool flow is simple once the application has been prepared. The Makefile supplied with the example application shows how to call SDF3 and MAMPS in order to construct the Xilinx project.
In short, the following commands should give you a working example (assuming you have installed everything correctly and applied the patch to the Xilinx library):
$ make mapping
$ make mamps
$ make -f system.make download
