

computing@computingonline.net www.computingonline.net ISSN 1727-6209 International Journal of Computing

# A FLEXIBLE IMAGE PROCESSING DESIGN BASED ON 2D DCT/IDCT FOR A SYSTEM ON A PROGRAMMABLE CHIP

#### Mohamed Atri, Wajdi Elhamzi, Rached Tourki

Faculty of Sciences Monastir, 5000, Tunisia, Mohamed.Atri@fsm.rnu.tn, elhamziwajdi@yahoo.fr, Rached.Tourki@fsm.rnu.tn

**Abstract:** Many multimedia applications require a flexible image processing architecture. In this paper, we present a hardware acceleration module (Discrete Cosine Transform (DCT) and Inverse DCT (IDCT)) coupled with a software partition running on the PowerPC processor of a Xilinx FPGA. We thereby combine the flexibility of the software partition on the PowerPC with the acceleration provided by the remaining logic of the same FPGA. This implementation can be used in the context of video coding, object recognition, etc. The experimental results show the reduction in processing time that hardware acceleration offers over a software implementation.

Keywords: System on Programmable Chip, 2D-DCT/IDCT, Xilinx FPGA, Embedded Processor, Image Processing.

#### **1. INTRODUCTION**

Most video and image compression algorithms require high-speed data processing in real-time, online applications such as digital video broadcasting and recording. Many compression standards, including JPEG, MPEG, and H.26X, use the 2-D DCT and its inverse, which can be described by:

$$DCT(i, j) = \frac{1}{\sqrt{2N}} C(i)C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} Pixel(x, y) \cos\left[\frac{(2x+1)i\pi}{2N}\right] \cos\left[\frac{(2y+1)j\pi}{2N}\right]$$

$$Pixel(x, y) = \frac{1}{\sqrt{2N}} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} C(i)C(j)DCT(i, j) \cos\left[\frac{(2x+1)i\pi}{2N}\right] \cos\left[\frac{(2y+1)j\pi}{2N}\right]$$

$$C(k) = \begin{cases} 1/\sqrt{2} & , if \quad k = 0\\ 1 & , if \quad k > 0 \end{cases}$$
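As an illustration, the definitions above can be evaluated term by term. The following is a minimal Python sketch, not the implementation used in this work; the function names `c`, `dct_2d` and `idct_2d` and the synthetic test block are assumptions introduced here. It fixes $N = 8$, the block size used by the core, for which the factor $1/\sqrt{2N}$ equals $2/N = 1/4$, so that the inverse transform recovers the input.

```python
import math

N = 8  # block size; for N = 8 the factor 1/sqrt(2N) equals 2/N = 1/4


def c(k):
    """Coefficient C(k) from the definition above."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def basis(n, k):
    """Cosine term cos((2n + 1) k pi / (2N))."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_2d(pixel):
    """Forward 2-D DCT of an N x N block, evaluated directly from the sum."""
    scale = 1.0 / math.sqrt(2.0 * N)
    out = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += pixel[x][y] * basis(x, i) * basis(y, j)
            out[i][j] = scale * c(i) * c(j) * acc
    return out


def idct_2d(coeff):
    """Inverse 2-D DCT of an N x N block, evaluated directly from the sum."""
    scale = 1.0 / math.sqrt(2.0 * N)
    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            acc = 0.0
            for i in range(N):
                for j in range(N):
                    acc += c(i) * c(j) * coeff[i][j] * basis(x, i) * basis(y, j)
            out[x][y] = scale * acc
    return out


# Round trip on a synthetic ramp block (illustrative data, not from the paper).
block = [[8 * x + y for y in range(N)] for x in range(N)]
coeff = dct_2d(block)
restored = idct_2d(coeff)
max_err = max(abs(restored[x][y] - block[x][y])
              for x in range(N) for y in range(N))
print(f"DC coefficient: {coeff[0][0]:.3f}, max round-trip error: {max_err:.2e}")
```

The four nested loops make the cost of the direct form visible: each of the 64 coefficients needs 64 multiply-accumulate steps, which is why a few blocks of this kind dominate processing time in a pure software implementation.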

Several standards, such as JPEG, JPEG2000 and MPEG, have been implemented in software (C, C++, Java, ...). The major problem is that a few blocks dominate the processing time, which limits performance. Hardware/software co-design addresses this challenge. The principal idea is task partitioning: a hardware block and a software block together carry out the same function as a pure software block. This partitioning yields a circuit that is optimized in both silicon area and processing time.

In this paper, the embedded PowerPC processor [1] is used to carry out the software partition of the image processing system. The design is based on IBM CoreConnect [2], an on-chip bus communication link developed by IBM that enables chip cores from multiple sources to be interconnected to create entire new chips. First, we implemented the forward 2-D Discrete Cosine Transform (2-D DCT) using a 32x32 grayscale Lena image. Second, we implemented the 2-D IDCT in order to obtain the result of the inverse transform. Finally, we calculated the PSNR (peak signal-to-noise ratio) using Matlab.

The system is implemented using the XPS tool (Xilinx Platform Studio) provided by the EDK environment (Embedded Development Kit) [3], together with the Core Generator tool included in ISE (Integrated Software Environment) [4]. This lets us use a Xilinx DCT IP core [5] and so meet the time-to-market constraint.

The remainder of this paper is organized as follows. Related work is presented in Section 2. Section 3 gives an architectural and functional description of the Xilinx forward and inverse DCT [5]. Section 4 briefly describes the platform used. Section 5 presents the overall System on a Programmable Chip (SoPC), in which the combined speed- and area-optimized version of the DCT and IDCT peripheral core is described. Section 6 presents experimental results and compares the software and hardware implementations. Section 7 draws some concluding remarks.

#### 2. RELATED WORKS

FPGAs are becoming more complex and now integrate Block RAMs (BRAMs) and multipliers. Furthermore, PowerPC processors are embedded in the Xilinx Virtex-II Pro family [6]. These embedded processors can be used to perform a software task combined with a hardware partition implemented in the logic part of the FPGA. For the purposes of evolvable hardware, Vašíček and Sekanina [7] explore the idea of Glette and Torresen in a real-world application, image filter evolution. The objective of that work is to design, implement and evaluate a system for image filter evolution on a single Virtex-II Pro FPGA chip.

On the other hand, several works have developed architectures and algorithms for the discrete cosine transform. In [8], an efficient architecture for implementing the scaled DCT with distributed arithmetic is proposed; it requires even less area by exploiting the fact that a scaled version of the DCT outputs is adequate for most DCT applications. The main objectives in [9] are to match the system throughput so as to maximize hardware utilization and to minimize chip area. To this end, a fast DCT algorithm is mapped onto a hardware structure that consists of $\log_2 N$ modules. Based on a recursive fast algorithm, the proposed hardware structure interfaces directly with the sequentially presented data stream.

Given the importance of this transform for image and video compression, several works aim to optimize execution time and chip size. In [10], the proposed architecture implements the DCT and quantization blocks as a hardware accelerator for the H.264 standard. In the same context, but for the H.263 standard, the design and implementation of a DCT/IDCT coprocessor in [11] improves execution time. That hardware/software DCT implementation, based on the Altera Stratix II platform, shows the benefit of hardware acceleration over the software version.

#### 3. XILINX DCT/IDCT

#### **3.1 ARCHITECTURE**

The 2-D DCT IP core [5], shown in Figure 1, is composed of two 1-D DCT blocks linked by a memory. This "ping-pong" 2-D DCT core implements the basic 2-D DCT structure through row-column decomposition: the 1-D DCT is first applied to the data row-wise, the row-DCT results are transposed in memory, and the same 1-D structure is then applied row-wise to the transposed data, which amounts to a column-wise pass over the row-DCT results. The results are emitted sequentially on the output port of the core.

The block diagram of the functional structure is shown in Figure 1. The ping-pong memory improves throughput, because one buffer can be filled with row results while the other is read for the second pass. The results are available at the output of the core column-wise.
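The row-column decomposition can be sketched in software as two 1-D passes separated by a transpose. The Python fragment below is only an illustrative model under stated assumptions: the names `dct_1d`, `transpose`, `dct_2d_row_column` and `dct_2d_direct` are introduced here, the transpose function stands in for the transpose memory, and the internal arithmetic of the Xilinx core (word lengths, rounding) is not modelled. Each 1-D pass is scaled by $\sqrt{2/N}$, so that two passes give the 2-D factor $1/\sqrt{2N}$ for $N = 8$.

```python
import math

N = 8  # block size handled by the core


def basis(n, k):
    """Cosine term cos((2n + 1) k pi / (2N))."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_1d(vec):
    """1-D DCT of a length-N vector, scaled by sqrt(2/N) * C(k)."""
    out = []
    for k in range(N):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(vec[n] * basis(n, k) for n in range(N))
        out.append(math.sqrt(2.0 / N) * ck * acc)
    return out


def transpose(matrix):
    """Stand-in for the transpose memory between the two 1-D stages."""
    return [list(col) for col in zip(*matrix)]


def dct_2d_row_column(block):
    """2-D DCT as row pass, transpose, second row pass."""
    row_pass = [dct_1d(row) for row in block]      # first 1-D DCT, row-wise
    second_pass = [dct_1d(row) for row in transpose(row_pass)]
    # second_pass[j][i] holds DCT(i, j), i.e. the result in column order;
    # transpose once more to index it as [i][j].
    return transpose(second_pass)


def dct_2d_direct(block):
    """Reference: the double sum with the 1/sqrt(2N) factor, for checking."""
    out = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            ci = 1.0 / math.sqrt(2.0) if i == 0 else 1.0
            cj = 1.0 / math.sqrt(2.0) if j == 0 else 1.0
            acc = sum(block[x][y] * basis(x, i) * basis(y, j)
                      for x in range(N) for y in range(N))
            out[i][j] = ci * cj * acc / math.sqrt(2.0 * N)
    return out


# Synthetic test block (illustrative data, not from the paper).
block = [[(3 * x * x + 5 * y) % 256 for y in range(N)] for x in range(N)]
fast = dct_2d_row_column(block)
ref = dct_2d_direct(block)
max_diff = max(abs(fast[i][j] - ref[i][j]) for i in range(N) for j in range(N))
print(f"max difference between the two forms: {max_diff:.2e}")
```

The decomposition replaces 64 multiply-accumulate steps per coefficient with 8 per pass, which is what makes a pair of small 1-D blocks and a memory sufficient in hardware.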



Fig. 1 – 2D-DCT block diagram

The core offers an 8x8 forward 2-D DCT and an 8x8 inverse 2-D DCT. Both the forward and inverse DCT offer various design parameters to customize the core for different applications.

## **3.2 FUNCTIONAL DESCRIPTION**

The 2-D DCT [5] uses a three-signal handshake control interface: RFD, ND and RDY. Handshaking provides flow control through the core. The Ready for Data (RFD) output signal, when asserted, indicates that the core can accept new input data. RFD is used to qualify the New Data (ND) input signal: when RFD is asserted, asserting ND causes the core to sample the valid data on the DIN port. Once the final input sample is received, processing begins. Output results on the DOUT port are validated by the assertion of the RDY signal for a single clock cycle per output value. Figure 2 shows some examples of how the control signals interact.
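The handshake rules above can be made concrete with a small behavioural model. The Python sketch below is a toy model and not a description of the Xilinx core: the class `DctCoreModel`, its `clock` method and the placeholder transform are hypothetical, the processing latency is taken as zero, and RFD is simply held low while results drain, which is an assumption made to keep the model short. It only shows that data is sampled when RFD and ND are both asserted, and that each output value is flagged by RDY for one cycle.

```python
class DctCoreModel:
    """Toy model of the RFD/ND/RDY handshake (hypothetical, not cycle-accurate)."""

    BLOCK = 64  # samples in one 8x8 block

    def __init__(self, transform):
        self.transform = transform  # callable: 64 inputs -> 64 outputs
        self.inputs = []
        self.outputs = []
        self.rfd = True  # Ready for Data

    def clock(self, nd, din):
        """Advance one clock cycle and return (rfd, rdy, dout)."""
        rdy, dout = False, None
        if self.outputs:
            # One result per cycle, each validated by RDY for that cycle.
            dout = self.outputs.pop(0)
            rdy = True
            if not self.outputs:
                self.rfd = True  # assumption: accept input again once drained
        elif self.rfd and nd:
            # ND is only honoured while RFD is asserted.
            self.inputs.append(din)
            if len(self.inputs) == self.BLOCK:
                # Final sample received: processing begins (zero latency here).
                self.outputs = list(self.transform(self.inputs))
                self.inputs = []
                self.rfd = False
        return self.rfd, rdy, dout


# Placeholder transform so the handshake can be exercised without a real DCT.
core = DctCoreModel(lambda samples: [2 * s for s in samples])
pending = list(range(64))
received = []
rfd = core.rfd
cycles = 0
while len(received) < 64:
    nd = rfd and bool(pending)          # drive ND only when RFD was seen high
    din = pending.pop(0) if nd else None
    rfd, rdy, dout = core.clock(nd, din)
    if rdy:
        received.append(dout)
    cycles += 1
print(f"{len(received)} results collected in {cycles} model cycles")
```

The cycle count printed by this model has no relation to the timing of the real core; the timing diagram in Figure 2 remains the reference for actual behaviour.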



Fig. 2 – Timing diagram for 2D DCT

## 4. XILINX BOARD

The Memec Virtex II Pro (P7-FF672) Development Kit [12] provides a complete development platform for designing and verifying applications based on the Xilinx Virtex-II Pro FPGA family. This kit enables designers to implement embedded processor based applications with extreme flexibility using IP cores and customized modules.

## 4.1 VIRTEX II PRO FPGA

The Virtex-II Pro FPGA [13] was the first FPGA introduced by Xilinx that contained the embedded PowerPC 405 processor core. In addition to the more than 11,000 logic cells, over 792 Kb of BRAM, 44 multiplier blocks and the PowerPC processor available in the FPGA, the Memec board provides RocketIO Multi-Gigabit Transceivers (MGT), which make it possible to develop highly flexible, high-speed serial transceiver applications.



Fig. 3 - Memec Virtex 2 Pro Development Board

An integrated System ACE Compact Flash controller is deployed to perform board bring-up and to load applications from the Compact Flash card.

#### 4.2 POWERPC HARD PROCESSOR

The Virtex-II Pro FPGAs provide up to two PowerPC 405 processors [3], which are 32-bit RISC processor cores, in a single device.

These industry-standard processors offer high performance and a broad range of third-party support. The IBM PowerPC 405 core is integrated into the Virtex-II Pro device using the IP-Immersion architecture which allows hard IP cores to be diffused at any location deep inside the FPGA fabric. The processor core operates at a maximum frequency of 400 MHz.

As shown in Figure 4, the PowerPC 405 processor contains the following elements:

- A five-stage pipeline consisting of fetch, decode, execute, write-back, and load writeback stages.
- A virtual-memory-management unit that supports multiple page sizes and a variety of storage-protection attributes and access-control options.
- Separate 16 kB instruction-cache and data-cache units.
- Three programmable timers.
- An On-Chip Memory (OCM) controller.
- A variety of interfaces, including the Processor Local Bus (PLB) interface, the Device Control Register (DCR) interface, the clock and power management interface and the JTAG port interface.



Fig. 4 – PowerPC hard processor block Diagram

## **5. SYSTEM DESIGN**

#### **5.1 ARCHITECTURE**

One of the biggest challenges of this work was to obtain a working System on a Programmable Chip (SoPC) with forward and inverse discrete cosine transforms. This means implementing both software and hardware components. A Memec Virtex-II Pro board was chosen as the target device because it integrates hardware and software co-design into one flow. The basic structure of the system is shown in Figure 5. The hard-core PowerPC (PPC) is used in stand-alone mode to run a software program (written in C) that is loaded into BRAM.



Fig. 5 – OPB integration of DCT/IDCT

This software program communicates with an external PC through the UART (RS232). It serves two purposes. First, it receives data through the UART and performs the transform using the DCT/IDCT peripheral core. Second, it transmits the result of the transform so that it can be displayed in HyperTerminal.

#### 5.2 DCT AND IDCT INTEGRATION

To simplify the connection of a user logic module to the OPB CoreConnect bus, an interface called the IPIF (IP InterFace) [13] manages the bus signals, the communication protocol and, more generally, all the characteristics of the bus. The IPIF also exposes an interface to the user logic called the IPIC (IP InterConnect). User logic designed against the IPIC is portable and can easily be reused on another bus by changing only the IPIF. The VHDL files that contain the IPIF provide the code needed for users to add their own modules, which simplifies the task of connecting the user logic.

For our system, we used two FIFOs for software access. The first receives the pixels of the 32x32 grayscale Lena image. Each pixel is encoded on 8 bits, and the FIFO size is 256x32 bits. The second is used to retrieve the results and transmit them one by one to the OPB bus.
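The FIFO dimensions match the image size: 32 x 32 pixels of 8 bits give 8192 bits, which is exactly 256 words of 32 bits. The following Python sketch illustrates one way such an image could fill the input FIFO. The packing of four pixels per word and the byte order (first pixel in the most significant byte) are assumptions made for illustration, since the layout is not specified here, and the helper names `pack_pixels` and `unpack_words` are hypothetical.

```python
def pack_pixels(pixels):
    """Pack 8-bit pixels into 32-bit words, four per word.

    Assumed layout: the first pixel of each group sits in the top byte.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("pixel count must be a multiple of 4")
    words = []
    for k in range(0, len(pixels), 4):
        p0, p1, p2, p3 = pixels[k:k + 4]
        words.append((p0 << 24) | (p1 << 16) | (p2 << 8) | p3)
    return words


def unpack_words(words):
    """Inverse of pack_pixels: split each 32-bit word into four pixels."""
    pixels = []
    for w in words:
        pixels.extend([(w >> 24) & 0xFF, (w >> 16) & 0xFF,
                       (w >> 8) & 0xFF, w & 0xFF])
    return pixels


# Synthetic 32x32 8-bit image in raster order (illustrative data, not Lena).
image = [(7 * x + 3 * y) % 256 for x in range(32) for y in range(32)]
words = pack_pixels(image)
print(f"{len(image)} pixels -> {len(words)} words of 32 bits")
```

Under this assumption, one full image occupies the FIFO exactly, so the software can load a complete image before starting the transform.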

#### 6. IMPLEMENTATION RESULTS

The 2-D DCT and 2-D IDCT architectures were synthesized for a Xilinx Virtex-II Pro family FPGA [13]. The complete synthesis results are presented in Table 1.

These results were obtained by implementing the 2-D DCT and 2-D IDCT accelerator cores separately. The entire 2-D DCT uses 54% of the slices, 4% of the BRAM, 5% of the IOBs and 6% of the global clocks, which leaves sufficient free space for other applications. The whole design works with a 115 MHz system clock. The 2-D IDCT, however, occupies most of the slices; this increase in area is due to the high precision supported by this core.

Table 1. Resources utilization of hardware accelerator

| Logic utilization | Used (DCT) | Used (IDCT) | Available |
|-------------------|------------|-------------|-----------|
| Slices            | 2681       | 4337        | 4928      |
| BRAM              | 3          | 2           | 44        |
| IOBs              | 28         | 23          | 396       |
| GCLKs             | 1          | 1           | 16        |

The same 2-D DCT and 2-D IDCT algorithm was implemented in software and run on the PPC processor. It takes about $360 \ \mu s$ to compute an 8x8 macro-block on a PPC running at 100 MHz. In contrast, the DCT core, which also runs at 100 MHz, computes the same result in about 0.7 $\mu$s.
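The two timings can be converted into clock cycles and a speedup factor with simple arithmetic. The short Python sketch below uses only the figures quoted above (360 µs, 0.7 µs, 100 MHz); the variable names are introduced here for illustration.

```python
CLOCK_HZ = 100e6      # both the PPC and the DCT core run at 100 MHz
SW_TIME_S = 360e-6    # software time for one 8x8 macro-block
HW_TIME_S = 0.7e-6    # hardware core time for the same macro-block

# Time multiplied by clock frequency gives the number of clock cycles.
sw_cycles = round(SW_TIME_S * CLOCK_HZ)
hw_cycles = round(HW_TIME_S * CLOCK_HZ)

# Ratio of the two times: how many times faster the hardware core is.
speedup = SW_TIME_S / HW_TIME_S

print(f"software: {sw_cycles} cycles, hardware: {hw_cycles} cycles, "
      f"speedup: about {speedup:.0f}x")
```

Both implementations run at the same clock frequency, so the speedup reflects the difference in cycles per block rather than a faster clock.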

Similar DCT implementations reported in [10] and [11] were chosen for comparison, and the comparison favors the proposed design. In terms of the clock cycles needed to transform an $8 \times 8$ macro-block, the 2-D DCT and 2-D IDCT in [12] need 720 clock cycles, whereas in our design they need 70 clock cycles. This coprocessor achieves good acceleration with low resource usage and memory bandwidth. The impact of the design is shown by the speedup obtained with the hardware implementation.

Using Matlab, we reconstructed the Lena image and calculated the PSNR (peak signal-to-noise ratio), which is equal to 41 dB; we therefore conclude that the image quality is acceptable.
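For reference, the PSNR of an 8-bit image is commonly computed as $10 \log_{10}(255^2 / \mathrm{MSE})$, where MSE is the mean squared error between the original and reconstructed pixels. The Python sketch below illustrates this standard definition; it is not the Matlab script used in this work, the function name `psnr` is introduced here, and the test data is synthetic rather than the Lena image.

```python
import math


def psnr(original, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(restored) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise
    return 10.0 * math.log10(peak ** 2 / mse)


# Synthetic 32x32 image and a copy with an error of 2 on every other pixel
# (illustrative data only; pixel values stay below 200 so nothing clips).
original = [(5 * x + y) % 200 for x in range(32) for y in range(32)]
restored = [p + 2 if k % 2 == 0 else p for k, p in enumerate(original)]
value = psnr(original, restored)
print(f"PSNR = {value:.2f} dB")
```

Higher values indicate a reconstruction closer to the original, since the mean squared error appears in the denominator.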

## 7. CONCLUSION

We demonstrated in this work the capability of an SoPC to integrate the discrete cosine transform efficiently through HW/SW co-design. The designed architecture performs the 2-D DCT calculation of a 32 x 32 pixel gray-level image in about $10\mu s$, allowing its use in a JPEG or MPEG hardware compressor.

This paper introduces a method for implementing an image processing system on an FPGA. The PowerPC processor eases the design of embedded systems. The proposed approach represents a general computing model that can be extended to many different embedded design applications.

## 8. REFERENCES

- [1] IBM. "PPC405-S Embedded Processor Core", User's Manual Version 1.0 July 19, 2007.
- [2] IBM corporation. The CoreConnect Bus Architecture, white paper. International Business Machines Corporation, 2004.
- [3] Xilinx Inc. Embedded Development Kit EDK 7.1i. Xilinx Inc., 2005.
- [4] www.xilinx.com.
- [5] Xilinx Product Specification, "2-D Discrete Cosine Transform (DCT) V2.0", March 14, 2002.
- [6] Xilinx, "Virtex-II Pro Platform FPGA User Guide", UG012 (v2.4) June 30, 2003.
- [7] Z. Vašíček and L. Sekanina. An evolvable hardware system in Xilinx Virtex II Pro FPGA. *Int. J. Innovative Computing and Applications*, (2007) Vol. 1, No. 1, pp.63–73.
- [8] S. Yu and E. E. Swartzlander Jr. DCT Implementation with Distributed Arithmetic. *IEEE Transactions on Computers* 50 (9) 2001.

- [9] T. C. Tan, Guoan Bi, Yonghong Zeng, H. N. Tan. DCT hardware structure for sequentially presented data. *Signal Processing* 81 (2001), p. 2333–2342.
- [10] R. Kordasiewicz, S. Shirani. On hardware implementations of DCT and quantization blocks for H.264/AVC. *Journal of VLSI Signal Processing Systems* 47 (2) (2007), p. 93-102.
- [11] A. B. Atitallah, P. Kadionik, F. Ghozzi, P. Nouel, N. Masmoudi, H. Levi. Hw/Sw codesign of the H.263 video coder. *International Journal of Electronics and Communications* (AEU) 2007, pp 605-620.
- [12] MEMEC DESIGN. Virtex-II Pro<sup>™</sup> FF672 Development Board User's Guide. Version 3.0 December 2003.
- [13] Xilinx Product Specification. OPB IPIF Architecture. DS414 (v1.21) July 30, 2002.



**Mohamed Atri** was born in 1971. He received his Ph.D. degree in Micro-electronics from the Science Faculty of Monastir in 2001. He is currently an Associate Professor and a member of the Laboratory of Electronics & Microelectronics. His research includes *Circuit and System Design, Image processing, Network Communication, IPs and SoCs.*



**Wajdi Elhamzi** was born in 1982. He received his Master's degree in Micro-electronics from the Science Faculty of Monastir in 2007. He is currently a PhD student. His research interests include System on Programmable Chip, Image Processing and video coding.



**Rached Tourki** was born in 1948. He received the B.S. degree in Physics (Electronics option) from Tunis University in 1970, and the M.S. and the Doctorat de 3eme cycle in Electronics from Institut d'Electronique d'Orsay, Paris-South University, in 1971 and 1973 respectively. From 1973 to 1974 he served as a microelectronics engineer at Thomson-CSF. He received the Doctorat d'etat in Physics from Nice University in 1979. Since then he has been a professor of Microelectronics and Microprocessors in the physics department, Faculte des Sciences de Monastir. His research includes IP design, Image and Video Processing, cryptography and SoCs.