Multi-processor JPEG decoder

1. Introduction
2. Compile and run
3. Change the input image
4. Simulation results

1. Introduction

1.1. JPEG Decoding process

The JPEG decoder takes a JPEG image and converts it to an (uncompressed) bitmap using a decoding method called the baseline decoding process. A short introduction to the JPEG decoding process can be found in chapter 2 and the appendix of the report "Design and implementation of a JPEG decoder" by Sander Stuijk. The multi-processor decoder is based on a single-processor decoder that is in ./c_prog/djpeg_orig. Figure 1 shows the process steps (blue) of the JPEG decoding process:

Figure 1: JPEG decoding process

The yellow blocks describe the data that is sent between nodes (mMIPS processors). See the next paragraph for more information.
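
As an illustration of the ZZ and DQ steps in figure 1, the following is a minimal sketch and not the code from ./c_prog/djpeg_mmips. It assumes that both the coefficients coming out of the VLD and the quantization table are stored in zigzag order, as they are in a baseline JPEG stream. The function name zz_dq_block() and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard JPEG zigzag order: ZIGZAG[i] is the row-major position
 * (0..63) of the i-th coefficient in the entropy-coded stream. */
static const uint8_t ZIGZAG[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Hypothetical helper: undo the zigzag scan and dequantize one 8x8 block.
 * coef_zz  : 64 coefficients in zigzag order (output of the VLD)
 * quant_zz : 64 quantization factors, also in zigzag order
 * block    : 64 dequantized coefficients in row-major order (input of the IDCT) */
void zz_dq_block(const int16_t coef_zz[64], const uint16_t quant_zz[64],
                 int32_t block[64])
{
    assert(coef_zz != NULL && quant_zz != NULL && block != NULL);
    for (int i = 0; i < 64; i++) {
        block[ZIGZAG[i]] = (int32_t)coef_zz[i] * (int32_t)quant_zz[i];
    }
}
```

The sketch merges both steps into one loop for brevity; the actual decoder may perform them separately.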

1.2. Decoder partitioning

The JPEG decoding process has been divided over three nodes as depicted in figure 1. Each group of steps runs on a separate node of the mMIPS network: node1 is at (X,Y) = (0,0), node2 is at (1,0), node3 is at (0,1), and node (1,1) remains unused. This partition was chosen because a quick investigation of the sources of the original JPEG decoder revealed that partitioning just before and after the IDCT function was the easiest to realize. This choice also has the advantage that the Huffman decoding and dequantization tables required by the VLD and DQ units respectively do not need to be sent over the network. This partition leads to the distribution of the workload shown in table 1.

Table 1: Workload for nodes 1, 2 and 3 in the JPEG decoder.

Node  Steps + workload %                       Total workload %
1     VLD (35%) + ZZ (5%) + DQ (10%)           50%
2     IDCT (20%)                               20%
3     Color conversion (15%) + Reorder (15%)   30%
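
To get a feeling for what the distribution in table 1 means, the sketch below computes an upper bound on the speedup. It rests on two assumptions that are not stated above: the three nodes work as a pipeline on successive blocks, and communication costs are ignored. Under those assumptions the busiest node limits the throughput. The function name ideal_speedup_x100() is hypothetical; it returns the speedup multiplied by 100 so that only integer arithmetic is needed.

```c
#include <assert.h>

/* Ideal pipeline speedup times 100: total workload divided by the
 * workload of the busiest node. Communication costs are ignored. */
int ideal_speedup_x100(const int load_pct[], int n)
{
    int total = 0;
    int slowest = 0;

    assert(load_pct != 0 && n > 0);
    for (int i = 0; i < n; i++) {
        total += load_pct[i];
        if (load_pct[i] > slowest) {
            slowest = load_pct[i];
        }
    }
    return slowest > 0 ? (total * 100) / slowest : 0;
}
```

For the workloads 50, 20 and 30 of table 1 the function returns 200, so under these assumptions the partition cannot be more than twice as fast as a single node, because node 1 carries half of the work.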

1.3. File structure

Table 2 describes the files and folders in ./c_prog/djpeg_mmips.

Table 2: Overview of the files and folders in ./c_prog/djpeg_mmips.

File Contents / purpose
/dumps Contains the files mips_ram.xXyY.dump, which give the data memory contents after a simulation has ended for nodes (X,Y). These files are moved to the folder automatically if the Linux shell script dolcc is used.
/example This folder contains the simulation results of the JPEG decoder for an example image.
/node.xXyY If the Linux shell script dogcc is used, then a node generates files named output_to_0xADDR.bin. The file contains the data the node tried to send to the node with relative address ADDR. See the next chapter for more information on the dogcc script or check out the page on the C Communications library for more information on the output-to-file capability.
/test_images Contains JPEG images that can be used to test the decoder.
color.* Contains color_conversion() that converts YUV colors to their RGB equivalents.
Project files that describe the JPEG decoder in the Linux programming environment kdevelop.
dogcc Use the script dogcc to compile and run the JPEG decoder from the Linux shell using gcc. It uses Makefile to compile and renames the output files in such a way that the communications pattern depicted in figure 1 is achieved (see paragraph 1.4 of the C communications library for more information on how communication simulation via files works). The script is for the 2-by-2 mMIPS NOC.
dolcc Use the script dolcc X to compile the JPEG decoder using lcc and run a simulation for X minutes. The script is for the 2-by-2 mMIPS NOC.
dump dump is used in the script dolcc to create and initialize the node data memories.
fast_int_idct.* Inverse Discrete Cosine Transform (IDCT) using Integers.

Performs the Variable length decoding (VLD), Zigzag scan (ZZ) and Dequantization (DQ), see also figure 1.

jpeg.h Constants, preprocessor symbols and data structures common to all nodes.
Makefile Used by the script ./dogcc to compile the nodes using gcc.
mips Ready-to-use hardware simulator for a 2x2 mMIPS NOC.
mips_ram.empty.bin Empty data memory which is used by the script dolcc to initialize the data memories.
parse.* Contains mgetc(), which gives the functions in step1.c access to the input image. Also contains some other functions that retrieve specific data from the image.
stepX.* Contains the function main() for node X. See figure 1 for a description of the JPEG decoding steps performed by each node. Dumps the output of mprintf() for each node. Since the mMIPS NOC does not have a terminal, the printf() clone mprintf() is used to output to memory.
sunraster.* Extracts a viewable bitmap image from the memory dump of the last node / step in the JPEG decoding process.
tree_vld.* Creates the Huffman table.
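
The color.* entry in table 2 refers to a YUV to RGB conversion. The sketch below shows what such a conversion can look like in integer arithmetic. It is not the color_conversion() function from color.*: it assumes the JFIF YCbCr definition with 16-bit fixed-point constants, and the names clamp_u8() and ycbcr_to_rgb_sketch() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Limit a value to the range 0..255. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Convert one JFIF YCbCr pixel to RGB using 16.16 fixed-point constants:
 *   R = Y + 1.402   * (Cr - 128)
 *   G = Y - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128)
 *   B = Y + 1.772   * (Cb - 128)
 * Division (instead of a right shift) keeps the result well defined for
 * negative intermediate values; it truncates toward zero. */
void ycbcr_to_rgb_sketch(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
    int32_t d = (int32_t)cb - 128;
    int32_t e = (int32_t)cr - 128;

    assert(rgb != NULL);
    rgb[0] = clamp_u8(y + (91881 * e) / 65536);
    rgb[1] = clamp_u8(y - (22554 * d + 46802 * e) / 65536);
    rgb[2] = clamp_u8(y + (116130 * d) / 65536);
}
```

Note that each '*' in this sketch would become a software multiplication on the mMIPS, which is one reason a real implementation may prefer lookup tables or shifts.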

2. Compile and run

The sources of the multi-processor JPEG decoder are located in ./c_prog/djpeg_mmips. The steps needed to compile the JPEG decoder are comparable to those for gossip. The compilation process and run types for gossip are discussed in the application design flow. You need to make a few modifications to the C sources and shell scripts if you want to change the input image of the JPEG decoder or change the memory layout of the mMIPSes.

3. Change the input image

A change of input image involves the following changes to the source code and scripts:

  1. Function load_image() in parse.c
    The path and filename of the input image is hard coded in this function and needs to be changed.
  2. Header file jpeg.h
    If the decoder is run on a PC/i386 directly (using the script dogcc), then the preprocessor symbols JPG_IMAGE_BUFFER_SIZE and BMP_IMAGE_BUFFER_SIZE determine the memory reserved to store the input and output images respectively.
    If the decoder is run on the NOC simulator (using the script dolcc) or the actual FPGA hardware, then the preprocessor symbol JPG_IMAGE_MAX_ADDR determines the maximum address in the data memory that can contain input image data. If the decoder tries to read beyond that address, an error is printed using mprintf(), but code execution continues. See the mMIPS page for more information on its addressing modes and memory layout.
  3. The script dolcc
    The JPEG image is copied to the start of mips_ram.x0y0.bin. This file contains the data memory of node 1.
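
The behavior described for JPG_IMAGE_MAX_ADDR in step 2 (report an error, then continue) can be sketched as follows. This is not the mgetc() implementation from parse.c: the type image_reader, the function reader_getc() and the choice to return 0 after an overrun are assumptions made for illustration, and fprintf() stands in for mprintf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reader state; max_addr plays the role of JPG_IMAGE_MAX_ADDR. */
typedef struct {
    const uint8_t *data;  /* start of the image in data memory           */
    uint32_t pos;         /* offset of the next byte to read             */
    uint32_t max_addr;    /* last offset that may contain image data     */
    unsigned errors;      /* number of reads attempted beyond max_addr   */
} image_reader;

/* Return the next image byte. Beyond max_addr an error is reported and
 * 0 is returned (an assumption), so that execution continues. */
int reader_getc(image_reader *r)
{
    assert(r != NULL && r->data != NULL);
    if (r->pos > r->max_addr) {
        fprintf(stderr, "error: read beyond address 0x%lx\n",
                (unsigned long)r->max_addr);
        r->errors++;
        return 0;
    }
    return r->data[r->pos++];
}
```

Because execution continues after the error, a value of JPG_IMAGE_MAX_ADDR that is too small for the new image shows up as an error message and a corrupted output image rather than as a crash.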

4. Simulation results

The subfolders dogcc and dolcc in the folder ./c_prog/djpeg_mmips/example contain the output of a run of dogcc and dolcc respectively for a 32x24-pixel color JPEG image of a surfer (see figure 2).

Figure 2: Input image surfer.jpg (actual size)

Any output written to stdout or stderr during the execution of these scripts was saved in the file output_surfer.txt.

For dolcc, the dumps subfolder contains the contents of the data memories of the mMIPSes after completion. By default the output bitmap is in the data memory file mips_ram.x0y1.dump, beginning at address 0x0. The hex editor khexedit (which needs an X server such as Hummingbird Exceed) can be used to verify that it is exactly the same as the file output.ras that dogcc generates. The syntax of the hex editor is: khexedit <filename>

For dogcc, the node.xXyY subfolders contain any data that was sent, and output.ras is the resulting bitmap (in Sun raster format). If you have an X server such as Hummingbird Exceed, you can view the output file with eog (Eye of GNOME) using the command: eog output.ras &

The decoding of surfer.jpg on a 2x2 mMIPS NOC using the hardware simulator took 29 hours on a Pentium III 1 GHz processor running GNU/Linux 2.4.20 with 2048 MB of RAM. The same decoding took 604 milliseconds on the actual FPGA implementation. This simulation speed may be too low for some situations. The two main reasons for the low simulation speed are the cycle-accurate RTL-level simulation and the lack of a multiply instruction. Fortunately, there are a number of things we can do to improve the performance of the simulator:

  1. Raise the abstraction level of the simulator from cycle-accurate RTL level simulation to non-synthesizable cycle-accurate or instruction set simulation.
  2. Increase the speed of the multiplication. The absence of a multiply instruction means that all multiplications need to be done using a software call (see the section soft ops on the mMIPS page). Any multiplication written with the symbol '*' in C is translated into a call to the function __mul (the assembly implementation of this function is in ./lcc/lib/emu_asm.s). This implementation is time-consuming and can be sped up: multiplications can be inlined instead of invoked using a function call, negative operands can be negated before multiplying, and the operands can be sorted according to their value. Alternatively, we could implement the multiplication in hardware.
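
The software optimizations mentioned in item 2 can be sketched in C as a shift-and-add multiplication. This is not the __mul routine from ./lcc/lib/emu_asm.s; the function name soft_mul() is hypothetical. The sketch multiplies the magnitudes of the operands and loops over the smaller one, so the loop stops as soon as that operand has no bits left. A negative operand in two's complement has its high bits set and would otherwise force all 32 iterations.

```c
#include <stdint.h>

/* Shift-and-add multiplication without using '*'.
 * The result wraps around modulo 2^32, like a 32-bit multiply instruction.
 * The final conversion to int32_t assumes two's complement. */
int32_t soft_mul(int32_t a, int32_t b)
{
    /* Work on magnitudes in unsigned arithmetic (safe for INT32_MIN). */
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    int negative = (a < 0) != (b < 0);
    uint32_t acc = 0;

    /* Sort the operands: iterate over the bits of the smaller one. */
    if (ua < ub) {
        uint32_t tmp = ua;
        ua = ub;
        ub = tmp;
    }

    while (ub != 0) {
        if (ub & 1u) {
            acc += ua;      /* add the shifted multiplicand */
        }
        ua <<= 1;
        ub >>= 1;
    }

    if (negative) {
        acc = 0u - acc;     /* restore the sign */
    }
    return (int32_t)acc;
}
```

With sorted magnitudes, a multiplication such as soft_mul(-3, 100000) needs two loop iterations instead of 32.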

A faster simulator is one of the goals for the future.