Parallel Implementation of Arbitrary-Shaped MPEG-4 Decoder for Multiprocessor Systems

MPEG-4 is the first standard that combines synthetic objects, such as 2D/3D graphics objects, with natural rectangular and non-rectangular video objects. Independent access to individual synthetic video objects for further manipulation opens a large space for future applications. This paper addresses the optimization of such complex multimedia algorithms for implementation on multiprocessor platforms. We show that choosing an appropriate processing granularity to enhance parallelism, together with splitting time-critical tasks, yields a substantial improvement in processing efficiency. We focus on the decoder for non-rectangular (also called arbitrary-shaped) video objects. In previous work, we motivated the use of a multiprocessor System-on-Chip (SoC) setup that satisfies the requirements on overall computation capacity. Here we propose optimizations of the MPEG-4 decoding algorithm that increase the decoding throughput and use the multiprocessor architecture more efficiently. First, we present a modification of the Repetitive Padding that increases pipelining at block level. We identified the part of the padding algorithm that can be executed in parallel with the DCT-coefficient decoding and restructured the original algorithm into two communicating tasks. Second, we introduce a synchronization mechanism that allows the Extended Padding and the postprocessing (Deblocking and Deringing) filters to be processed at block level. The first optimization reduces the computational requirements of the original Repetitive-Padding task by about 58%. By additionally applying the previously proposed data-level parallelism and exploiting the inherent parallelism between the separate color components (Y, Cr, Cb), the computational savings reach about 72% on average. Moreover, the proposed optimizations reduce the processing latency from the order of a frame to the order of a slice.
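For readers unfamiliar with Repetitive Padding, the following is a minimal sketch of the baseline operation on a single boundary block, written from the general description of MPEG-4 padding rather than from the implementation discussed here. It does not show the proposed two-task split, which the abstract does not detail. The function names `pad_line` and `repetitive_pad`, the list-of-lists block layout, and the round-half-up averaging are assumptions made for illustration.

```python
def pad_line(values, inside):
    """Pad one row or column of samples.

    values: list of int samples.
    inside: list of bool, True where the sample is already defined.
    Returns (padded_values, any_defined).
    """
    n = len(values)
    if not any(inside):
        # Nothing to replicate from in this line.
        return list(values), False

    # Nearest defined sample to the left of (or at) each position.
    left, last = [None] * n, None
    for i in range(n):
        if inside[i]:
            last = values[i]
        left[i] = last

    # Nearest defined sample to the right of (or at) each position.
    right, last = [None] * n, None
    for i in range(n - 1, -1, -1):
        if inside[i]:
            last = values[i]
        right[i] = last

    out = list(values)
    for i in range(n):
        if inside[i]:
            continue
        a, b = left[i], right[i]
        if a is not None and b is not None:
            out[i] = (a + b + 1) // 2  # assumed rounding: half rounds up
        else:
            out[i] = a if a is not None else b
    return out, True


def repetitive_pad(block, mask):
    """Horizontal pass over rows, then vertical pass over still-empty rows."""
    rows, cols = len(block), len(block[0])
    out, row_done = [], []
    for r in range(rows):
        padded, done = pad_line(block[r], mask[r])
        out.append(padded)
        row_done.append(done)

    if not any(row_done):
        # Fully transparent block: not handled by this sketch.
        return out

    # Rows filled by the horizontal pass are treated as defined.
    for c in range(cols):
        column = [out[r][c] for r in range(rows)]
        padded, _ = pad_line(column, row_done)
        for r in range(rows):
            out[r][c] = padded[r]
    return out


F, T = False, True
block = [[0, 0, 0, 0], [10, 0, 0, 20], [0, 0, 0, 0], [0, 30, 0, 0]]
mask = [[F, F, F, F], [T, F, F, T], [F, F, F, F], [F, T, F, F]]
result = repetitive_pad(block, mask)
```

In this small example, the second and fourth rows are filled by the horizontal pass, and the first and third rows are then filled column by column from those rows by the vertical pass.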