## POWER ESTIMATION OF PIPELINE FFT PROCESSORS

S. Reza Talebiyan, Saied Hosseini-Khayat

Electrical department, Ferdowsi University of Mashhad, Mashhad, Iran

Abstract: Pipeline FFT processors are used in mobile communication systems and in particular in OFDM-based systems. This paper presents a method for power analysis of pipeline FFT processors. This method applies to various architectures with different radices. It also presents a method for mapping utilization rate onto clock-gating, which results in efficient power consumption.

Keywords: Pipeline FFT processor, power analysis, utilization rate, clock-gating.

# 1. INTRODUCTION

Low power consumption is an important issue in today's mobile communication system design. An FFT (Fast Fourier Transform) processing engine is at the heart of most mobile communication systems. Using a pipelined FFT processor is the best and perhaps the only choice for obtaining high-throughput and low-power solutions (Lee *et al.*,) (He *et al.*, 1998a, b). There are different architectures for pipeline FFT processors. Single-path Delay Feedback (SDF) and Multiple-path Delay switch (MDS) are two well-known examples of different kinds of pipeline FFT processors. Each architecture requires a different amount of hardware resources.

Researchers have proposed a variety of techniques for low power FFT processors (Maharatna *et al.*, 2004; Hasan *et al.*, 2003; Kristensen *et al.*, 2002). However, there is no analytical comparison of different FFT architectures from the power consumption point of view. Such comparison would enable designers to design power efficient FFT processors.

In this paper, we present an analysis of the worst case power consumption in different kinds of pipeline FFT processors. We emphasize that our aim is not estimating power consumption for a particular pipeline FFT processor, rather we aim to present a general method for power estimation in different kinds of pipeline FFT processors. Our methodology uses the average power consumption of basic processing elements.

## 2. PIPELINE FFT PROCESSOR

The general structure of a pipeline FFT processor is shown in Fig. 1. In this figure, BF denotes the butterfly unit. This unit carries out addition and subtraction operations. W is the twiddle factor that must be multiplied into the butterfly unit result. The internal structure of buffers and butterfly units can vary in different applications. Therefore, several structures of pipeline FFT processors can be specified.



Fig.1. The main structure of pipeline FFT processor.

## 3. PIPELINE FFT PROCESSOR ARCHITECTURES

The main structures of pipeline FFT processor are Single-path Delay Feedback (SDF) and Multiple-path Delay switch (MDS). Each structure is suitable for a special application. For example, the MDS structure is suitable for MIMO communications (Lee *et al.*,), and the SDF structure is suitable for reconfigurable FFT processors (Kristensen *et al.*, 2002). Here pipeline FFT processor structures are described.

## 3.1. SDF Structure

This structure is one of the well-known structures for pipeline FFT processors with suitable amount of hardware resources (He *et al.*, 1998a). Buffers in this kind of architecture are in feedback form of butterfly units. Thus, the output of each butterfly unit is stored in buffers. Buffers are in a shift-register form. Therefore, each time only one of the butterfly outputs is stored in buffer, and the other is propagated to the next stage. Different radices such as radix 2 (R2SDF) or radix 4 (R4SDF) are used in this structure (He *et al.*, 1998a). Fig.2 shows a 16-point pipeline FFT processor in R2SDF structure.



Fig.2. 16-point R2SDF pipeline FFT processor.

R2SDF structure needs  $(Log_2 N - 2)$  multipliers, and  $(Log_2 N)$  butterfly units. Each butterfly unit needs 2 adders, and each R2SDF structure needs (N-1) buffer elements. *N* denotes the number of FFT points.

## 3.2. MDS Structure

This structure is the most common implementation method for radix-2 FFT algorithm (He *et al.*, 1998a). Although it is possible to have higher radices for this structure, but they need a large amount of hardware resources. Fig. 3 shows a 16-point pipeline FFT processor in R2MDS structure.



Fig.3. 16-point R2MDS pipeline FFT processor.

Data sequence is broken into two streams with buffer elements. Buffers save data and also make an interval between data streams. SW denotes switch boxes and determines the data direction to butterfly unit or to the next buffer. The number of multipliers and adders is the same as R2SDF structure, but the number of buffers is more than R2SDF structure. (3/2N - 2) registers (for buffers) are required.

## 3.3. Hardware resources of pipeline FFT processors and their utilization rates

Hardware resources of pipeline FFT processors are dependent on their structure and FFT radix. A pipeline FFT processor is made up of three basic hardware elements. These basic elements are buffers, adders, and multipliers. To analyze pipeline FFT processor we must select the circuit for each of these elements.

Each component has a utilization rate, which is the amount of time the component is active during a given time cycle. For example, in a 16-point FFT processor, a utilization of 50% rate for adders has the meaning that in 16 clock cycles which are needed for FFT computation, the adders work effectively for only 8 clock cycles.

Each pipeline structure has a special amount of hardware resources and utilization rate. Tables 1 and 2 show hardware resources and their utilization rates for main pipeline FFT processor structures. The analysis presented here is based on the amount of the hardware resources and utilization rates that are presented in Tables 1 and 2.

Table 1 Pipeline FFT Processors' Hardware Resources.

|       | Multiplier Buffers          | Adder               |
|-------|-----------------------------|---------------------|
| R2SDF | Log <sub>2</sub> N - 2 N-1  | 2Log <sub>2</sub> N |
| R2MDS | Log <sub>2</sub> N-2 1.5N-2 | 2Log <sub>2</sub> N |

 Table 2 Pipeline FFT Processors' Hardware

 Resources' Utilization Rate.

|       | Multipl | Adder |     |
|-------|---------|-------|-----|
| R2SDF | 50%     | 100%  | 50% |
| R2MDS | 50%     | 50%   | 50% |

## 4. POWER ANALYSIS

The power consumption of a pipeline FFT processor is divided into two parts. First part is the power consumption of buffers and the other is the power consumption of processing elements. Division of the power consumption into two parts is described as Eq. (1). In this relation  $P_{buffer}$  is the power consumption of buffer elements and  $P_{PE}$  is the power consumption of processing elements.

$$(1) P_{FFT} = P_{buffer} + P_{PE}$$

### 4.1. Selection of building block circuits

In order to perform power estimation, first we need to select the circuit structures for building block elements, i.e., 1-bit full adders, and buffers. For 1-bit full adder circuit, we choose the circuit that is presented in (Goel *et al.*, 2006) and for buffer circuit, we select the regular dynamic D-flip-flop circuit. These circuits are illustrated in Fig. 4. We will use the average power consumption of building block elements to estimate the power consumption for the whole system. The power consumption of these circuits is obtained through HSPICE simulations in 90nm PTM technology (PTM site 2008).



Fig.4. (a) Selected 1bit-full adder circuit. (b) Selected d-flip-flop circuit.

#### 4.2. Selection of adder and multiplier structures

After the selection of the structure of full adders and multipliers, the next step is the choice of data word-length. Because the multiplier is composed of a number of 1-bit full adders, the power consumption of adders and multipliers will be a function of both the average power consumption and the number of 1bit-full adders. The average power consumption of 1bit-full adder is taken from HSPICE simulations. The number of 1-bit full adders depends on the data word-length or the number of bits (m). This relation is explained as Eqs. (2), (3). In these equations m is the number of bits.

(2) 
$$P_{adder} = f(m).P_{FA}$$
  
(3)  $P_{multiplier} = g(m).P_{FA}$ 

Obviously, by choosing the structure of adder and multiplier and data word-length, f(m) and g(m) will be specified. The important point for selection of adder and multiplier structures is their suitability for

deep submicron technology. On the other hand, interconnection networks are so important in deep submicron technologies (Chen *et al.*, 2003). Therefore, simple structure with low complexity routing networks must be selected. Consequently, we select carry propagation adder (CPA) for adder and Baugh-Wooley for multiplier. Therefore f, g will be simplified as Eqs. (4), (5).

(4) 
$$f(m) = m$$
  
(5)  $g(m) = m^2$ 

### 4.3. Use of the complex numbers

In FFT algorithm complex arithmetic is used. Consequently, we must consider complex computation and its changes for processing elements. For complex addition or multiplication, more than one real adder or multiplier is needed. A complex adder is composed of two real adders. One is for real part and the other for imaginary part. Then we can state the power consumption of complex adder in the worst case is as twice as the power consumption of real adder which is stated in Eq. (6).

(6) 
$$P_{complex\_adder} = 2.P_{real\_adder}$$

For complex multiplier we refer to the structure of complex multiplier in (Jiang *et al.*, 2004). With respect to this structure, each complex multiplier is composed of three real multipliers and five real adders. Then we can state power consumption of complex multiplier as Eq. (7).

(7) 
$$P_{complex\_multiplier} = 5.P_{real\_adder} + 3.P_{real\_multiplier}$$

### 4.4. Mapping utilization rate to clock-gating

Each part of pipeline FFT processor has a utilization rate. By mapping this utilization rate to clock-gating we can reduce the dynamic power consumption. Although, by considering leakage power, each part has dynamic power consumption when it is active; and has static power consumption when it is inactive. Therefore, the total power consumption of each processing element by mapping its utilization rate to clock-gating will be stated in Eq. (8).

(8) 
$$P_{system\_with\_clock\_gating} = P_e \cdot r + P_{\overline{e}} \cdot (1 - r)$$

Which  $P_e$  is the power consumption of the system when it is active (dynamic power consumption),  $P_{e^-}$ is the power consumption of the system when it is inactive (static power consumption) and r is the utilization rate.

By using HSPICE simulations in 90nm PTM technology Eq. (8) will be verified. We select the adder that is presented in (Goel et al., 2006) and add to it enabling facility and test setup as depicted in Fig. 5. We simulate this circuit by an enabling signal which is active in 85% of the cycle. The result of Eq. (8) is 3.2455  $\mu$ w. On the other hand, the result of HSPICE simulations in 90nm PTM technology is 3.2512 µw. The power consumption that is from HSPICE simulation is only from the adder circuit and the power consumption of inverters is subtracted. This simple simulation improves Eq. (8) and implies the condition that clock-gating can be efficient. The clock-gating can be efficient when the power consumption of clock-gating mode (Eq. (8)) will be less than the average power consumption in nonclock-gating mode. By obtaining the average power consumption from HSPICE simulations the upper limit of utilization rate can be obtained.



Fig.5. Full-adder circuit with enabling capability and its test setup.

### 4.5. Total power consumption processing elements

Pipeline FFT processor is composed of numbers of adders and multipliers. These numbers are dependent on the selected structure of pipeline FFT processors. Then the total power consumption of processing elements by considering the number of adders and multipliers will be as Equations (9) and (10) respectively.

$$P_{total\_adders} = N_{adder}.2m.P_{FA}$$
(9)

$$P_{total\_multipliers} = N_{multiplier}.(5m + 3m^2).P_{FA}$$
(10)

 $N_{adder}$  is the number of adders that FFT processor needs to work.  $P_{FA}$  is the power consumption of the 1-bit full adder and  $N_{multiplier}$  is denoted to the number of multipliers that FFT processor needs.

## 4.6. Power consumption of buffers

Buffer elements are composed of shift registers. Shift registers are composed of D-flip-flops. Then the power consumption of buffer elements is obtained from the power consumption of D-flip-flops and their required numbers. Therefore, the total power consumption of buffers which is composed of m-bit registers will be as Eq. (11).  $P_{DFF}$  is denoted to the average power consumption of D-flip-flops that is obtained from HSPICE simulations.

(11) 
$$P_{buffers} = N_{buffers} .m.P_{DFF}$$

By combining Eqs. (9), (10) and (11) the total power consumption of pipeline FFT processor can be obtained. Eq. (12) shows the final relation for power estimation.

(12) 
$$P_{FFT} = N_{buffers} m P_{DFF} + N_{adder} 2m P_{FA} + N_{multipliet} (5m + 3m^2) P_{FA}$$

### 5. APPLICATION

We present a mathematical method to compare the power consumption of all kinds of pipeline FFT processor with each other. Two famous structures of pipeline FFT processors (R2SDF and R2MDS) are selected for power estimation. The average power consumptions of 1-bit full adder and D-flip-flop circuits are obtained from HSPICE simulations in 90nm PTM technology. Eq. (12) is used for different FFT points. Fig. 6 shows the results.



Fig.6. Power estimation graph for R2SDF and R2MDS pipeline FFT processor structures.

This figure states that the power consumption of R2MDS is higher than R2SDF, especially in high FFT points. In fact, such thing is also true in reality, because R2MDS structure needs higher amount of hardware resources in compare to the SDF structure. This method can also be used for all kinds of pipeline FFT processors with different radices.

### 6. CONCLUSION

We presented a method for power analysis of pipeline FFT processors. Using this technique a designer can compare power consumption of any pipeline FFT processor to another pipeline FFT processor. This can be done by selecting some basic computational element such as 1-bit full adder and D-flip-flop circuits, also determining data wordlength and adder and multiplier structures. This method can also be used for power estimation of other DSP system component.

## 7. REFERENCES

- Chen, X., L. Peh (2003). Leakage power modeling and optimization in interconnection networks, Proc. of ISLPED, Seoul, Korea, August.
- Goel, S., A. Kumar, and M. A. Bayoumi (2006). Design of Robust, Energy-Efficient Full Adders for Deep-Submicrometer Design Using Hybrid-CMOS Logic Style, *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 14, no. 12, pp. 1309–1321.
- Hasan, M., T. Arslan, and J. S. Thompson (2003). *A* delay spread based low power reconfigurable *FFT* processor architecture for wireless receiver, in Proceedings of International Symposium on System-on-Chip, pp. 135-138, 19-21.
- He, S., M. Torkelson (1998a). Design and Implementation of 1024-Point Pipeline FFT Processor, Custom Integrated Circuits Conference, Processing of the IEEE, May, pp. 131-134.
- He, S. and M. Torkelson (1998b). *Designing pipeline FFT processor for OFDM (de)modulation*, in Proc. of ISSSE, vol. 2, pp. 945-950.
- Jia, L., Y. Gao, J. Isoaho, and H. Tenhunen (1998). *A New VLSI-Oriented FFT Algorithm and Implementation*, Proc. of Eleventh Annual IEEE Int'l ASIC Conference, pp. 337-341.
- Jiang, M., B. Yang, A. Jiang, X. Wang, X. Gan, B. Zhao, T. Zhang (2004). Design of FFT processor with low power complex multiplier for OFDM-based high-speed wireless applications, in Proc. of ISCIT'2004, pp. 639-641.
- Kristensen, F., P. Nilsson, A. Olsson (2002). *A flexible FFT processor*, Proc. 20<sup>th</sup> NORCHIP Conf. pp. 121-126.
- Lee, S., Y. Jung, and J. Kim, *Low complexity pipeline FFT processor for MIMO-OFDM systems*, IEICE Electronics Express, *vol.* 4, no. 23, pp. 750-754.
- Maharatna, K., E. Grass, and U. Jagdhold (2004). A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM, *IEEE Journal of Solid-State Circuit*, vol. 39, no. 3.
- PTM Site (2008). Predictive Technology Model home page, [online]: www.eas.asu.edu/~ptm/.