Hardware Design of the Discrete Wavelet Transform: an Analysis of Complexity, Accuracy and Operating Frequency

The purpose of this paper is to present a comparative analysis of hardware designs of the Discrete Wavelet Transform (DWT) in terms of three design goals: accuracy, hardware cost and operating frequency. Every design should take into account the following factors: method (non-polyphase, polyphase and lifting), topology (multiplier-based and multiplierless-based), structure (conventional or pipelined), and quantization format (floating-point, fixed-point, CSD or integer). Since the DWT is widely used in several applications (e.g. compression, filtering, coding and pattern recognition, among others), the selection of adequate parameters plays an important role in the performance of these systems.

Every choice affects the performance of the system. For example, non-polyphase schemes are easier to design than the others, but have lower throughput. Lifting schemes with non-pipelined structures have a higher path delay than non-polyphase schemes. Quantization error decreases with longer word-lengths, but the hardware cost increases. Therefore, the following design aims must be taken into account: high accuracy, high operating frequency and low hardware cost. No single design is able to optimize all of these objectives simultaneously: a good design for one aim may not be a good design for another.
The rest of the paper is organized as follows. Firstly, the background of the Discrete Wavelet Transform is presented. Secondly, a review of works in terms of complexity is shown. Thirdly, the main concepts behind accuracy and some of the most remarkable works in terms of accuracy are illustrated. Then, a discussion about pipelined-based and conventional schemes is presented.

Background of the discrete wavelet transform
The Discrete Wavelet Transform (DWT) is a multi-resolution transform in which both the time and the frequency content of the input signal are analyzed. From the point of view of filter banks, the DWT is carried out in two stages: firstly, the input signal is filtered with two half-band filters (i.e. a low-pass filter and a high-pass filter); secondly, the filtered signals are decimated by a factor of two. There are many filters that satisfy the conditions of the wavelet transform, and they are grouped in families. Within the same family, the filters differ in their length.
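The two stages described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `H0`, `H1`, `convolve` and `dwt_level` are hypothetical, and the 5/3 weights use a common normalization in which the low-pass weights sum to √2, which is an assumption rather than a value taken from this paper.

```python
import math

# Assumed 5/3 analysis filters (a common normalization, used only for illustration).
S = math.sqrt(2)
H0 = [-S / 8, S / 4, 3 * S / 4, S / 4, -S / 8]  # low-pass, weights sum to sqrt(2)
H1 = [-S / 4, S / 2, -S / 4]                    # high-pass, weights sum to 0

def convolve(x, h):
    """Full linear convolution of signal x with filter h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def dwt_level(x):
    """Stage 1: filter with both filters. Stage 2: keep every second sample."""
    approx = convolve(x, H0)[::2]
    detail = convolve(x, H1)[::2]
    return approx, detail

x = [1.0] * 8                     # constant test signal
approx, detail = dwt_level(x)
# Away from the borders the low-pass branch returns sqrt(2) times the input,
# and the high-pass branch returns (numerically) zero.
```

Note that every second filtered sample is computed and then thrown away by the `[::2]` slice.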
The easiest representation of the DWT is the convolution (non-polyphase) approach. In this case, the two stages of the DWT are clearly separated (Figure 1). If the DWT is designed as an FSM (Finite State Machine), the first state consists of calculating the filtered signals (it can use several clock cycles), and in the second state half of the data are eliminated (by the decimation process). Although its hardware implementation is less complex, its throughput is not the highest possible.
The second design method is the polyphase one. In this case, the convolution and decimation processes are carried out in the same state. The input signal is down-sampled (i.e. split) into data of even clock cycles and data of odd clock cycles. Then, data of the even cycles are filtered with the even weights of the filter (low-pass or high-pass) and data of the odd cycles are filtered with the odd weights of the filter. At the end, a sum (between even filtered data and odd filtered data of the same filter) is applied (Figure 2) [29]. Unlike the convolution approach, half of the data are not wasted. Therefore, the throughput of this kind of scheme is double that of the convolution-based schemes.
A special case of a polyphase scheme is the lifting approach. As in the polyphase structure, the input signal is down-sampled before the filtering process. Nevertheless, the approximation and detail coefficients are calculated using a P (prediction) unit and a U (updating) unit. With the data of the odd part and the result of the P unit, the detail coefficients are obtained.

Ingeniería y Ciencia
With the detail coefficients, the result of the U unit and the data of the even part, the approximation coefficients are found. The P and U functions are directly related to the selected wavelet base. Figure 3 shows a generic block diagram in which these functions are not specified. In terms of throughput, the lifting scheme performs the same as the polyphase scheme. The differences lie in hardware resources and latency, which depend on the P and U functions (and therefore on the selected wavelet base).
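The following Python sketch illustrates the polyphase and lifting ideas under stated assumptions. `dwt_polyphase` splits the input and the filter into even and odd parts and checks that the result matches filtering followed by decimation. `lifting_53_forward` uses one common integer choice of P and U for the 5/3 base (the reversible form with border samples repeated); this choice, the function names and the test signal are assumptions for illustration, not the P and U units of any design reviewed here.

```python
def convolve(x, h):
    """Full linear convolution (integer inputs give exact integer outputs)."""
    y = [0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def dwt_direct(x, h):
    """Non-polyphase: filter everything, then discard the odd-indexed outputs."""
    return convolve(x, h)[::2]

def dwt_polyphase(x, h):
    """Polyphase: even samples meet even weights, odd samples meet odd weights."""
    ye = convolve(x[0::2], h[0::2])
    yo = convolve(x[1::2], h[1::2])   # this branch is delayed by one output sample
    n_out = (len(x) + len(h)) // 2
    out = []
    for m in range(n_out):
        even_part = ye[m] if m < len(ye) else 0
        odd_part = yo[m - 1] if 1 <= m <= len(yo) else 0
        out.append(even_part + odd_part)
    return out

def lifting_53_forward(x):
    """Integer 5/3 lifting for an even-length signal (assumed P and U)."""
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # P unit: predict each odd sample from its two even neighbours.
    d = [odd[i] - ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    # U unit: update each even sample with the neighbouring detail coefficients.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    return s, d

def lifting_53_inverse(s, d):
    """Undo the U step, then the P step, and interleave the samples again."""
    n = len(d)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    odd = [d[i] + ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    x = [0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x

signal = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical integer samples
low = [-1, 2, 6, 2, -1]             # 5/3 low-pass weights scaled by 8, sqrt(2) ignored
same = dwt_polyphase(signal, low) == dwt_direct(signal, low)
s, d = lifting_53_forward(signal)
restored = lifting_53_inverse(s, d)
```

In the polyphase function no output is computed and then discarded, which is the source of the throughput gain. In the lifting functions, the inverse simply undoes the U and P steps in reverse order, so the reconstruction is exact for integer data.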

Design goal: complexity
Since the DWT uses mathematical operations (multiplication, addition and down-sampling), one parameter to take into account is the topology, which can be multiplier-based or multiplierless-based. In the first case, the convolution between the input signal and the filter weights is carried out by multiplier units; in the second case, it is calculated by right-shifts and left-shifts.
For the 5/3 wavelet base (Eq. 1 and 2), h_0 is the low-pass filter, h_1 is the high-pass filter, k is in the range [0, 4] for h_0, and k is in the range [0, 2] for h_1.
In the case of multiplier-based schemes, the design uses one multiplier for each weight of the filter (i.e. 5 multipliers for h_0 and 3 multipliers for h_1), one adder to obtain the approximation coefficients and one adder to obtain the detail coefficients. These multiplier units must be able to multiply data in floating-point or fixed-point format, with several bits at the inputs and outputs. Therefore, hardware resources are directly related to the length of the input signals (word-length): the higher the word-length of the inputs, the higher the hardware cost. Like the multiplier units, the adder unit must work with several bits, so its hardware cost is also directly related to the word-length. If the input signal is quantized to 16 bits and the filter weights are quantized to 8 bits (e.g. in fixed-point format), the multiplier units must work with 23 bits and the adder unit must work with at least 23 bits. The higher the total number of bits, the higher the delay of the multiplication process.
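The growth of the word-length at the output of a multiplier can be checked with a short Python sketch. One reading that yields the 23 bits quoted above is a sign-magnitude format (one sign bit plus magnitude bits); this reading is an assumption, and the helper names are hypothetical. For comparison, a two's-complement product needs one more bit to cover the single case in which both operands take their most negative value.

```python
def product_bits_sign_magnitude(a_bits, b_bits):
    """Product width when each operand is one sign bit plus magnitude bits."""
    return (a_bits - 1) + (b_bits - 1) + 1

def product_bits_twos_complement(a_bits, b_bits):
    """Width that holds every two's-complement product of the two operands."""
    return a_bits + b_bits

def min_twos_complement_bits(value):
    """Smallest two's-complement width able to represent the integer value."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1

width_sm = product_bits_sign_magnitude(16, 8)              # 23
width_tc = product_bits_twos_complement(16, 8)             # 24
corner = min_twos_complement_bits((-2**15) * (-2**7))      # 24, the single worst case
typical = min_twos_complement_bits((2**15 - 1) * (2**7 - 1))  # 23
```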
On the other hand, in multiplierless-based schemes the multiplier units are eliminated from the design and the mathematical operations are carried out by left-shifts or right-shifts. If the signal is left-shifted, one bit with a value of 0 is appended at the right of the signal; if the signal is right-shifted, the least significant bit of the signal is discarded. A left-shift is equivalent to multiplying the input by 2; a right-shift is equivalent to taking the integer part of the division of the input by 2. For example, if the input signal is 10111b, the result of the left-shift is 101110b and the result of the right-shift is 1011b. In decimal format, the input signal is 23, the result of the left-shift is 46 and the result of the right-shift is 11. As a consequence of the right-shift, a clipping error appears (here 11 instead of 11.5). However, the clipping error is low enough that the quantization error remains low, too.
In multiplierless-based schemes, if data are quantized in integer format, the length of the internal signals is significantly lower than in the multiplier-based topology, even if the latter uses integer quantization, too. For example, suppose that the input signal has 5 bits (e.g. 23 = 10111b) and the weight of the filter has 2 bits (e.g. 2 = 10b). With a multiplier unit the result has 7 bits (e.g. 46 = 0101110b). However, as explained in the previous paragraph, the result with one left-shift (which is equivalent to multiplying by 2) has 6 bits (e.g. 101110b). Although the input data are the same in both topologies, multiplierless-based schemes use fewer bits than multiplier-based schemes.
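The shift arithmetic and the word-lengths of this example can be reproduced with a few lines of Python. The variable names are hypothetical, and the multiplier output width is modelled simply as the sum of the operand widths, which is an assumption about the generic multiplier unit.

```python
x = 0b10111                # input signal, decimal 23, 5 bits
w = 0b10                   # filter weight, decimal 2, 2 bits

left = x << 1              # appends a 0 on the right: 0b101110 = 46
right = x >> 1             # drops the least significant bit: 0b1011 = 11
clipping_error = x / 2 - right   # 11.5 - 11 = 0.5

multiplier_bits = x.bit_length() + w.bit_length()   # 5 + 2 = 7 output bits
shift_bits = x.bit_length() + 1                     # one left-shift adds one bit: 6
same_value = (x * w == left)                        # both topologies give 46
```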
Figure 5 shows a generic block diagram for the low-pass filter of the 5/3 wavelet base, for a multiplierless-based topology with integer data. The constant √2 has been ignored; it is taken into account in a post-amplifier stage.
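A minimal Python sketch of a shift-and-add low-pass filter for the 5/3 base is given here, assuming the integer weights (-1, 2, 6, 2, -1)/8 with the constant √2 left out. The function names and the clamping of indices at the borders are assumptions for illustration and do not claim to reproduce the exact structure of Figure 5.

```python
def lowpass_53_shift_add(x, n):
    """Output sample n computed with shifts and additions only."""
    def at(i):
        # Clamp the index at the borders (an assumed border rule).
        return x[min(max(i, 0), len(x) - 1)]
    acc = (-at(n - 2)
           + (at(n - 1) << 1)              # 2 * x[n-1]
           + (at(n) << 2) + (at(n) << 1)   # 6 * x[n] = 4 * x[n] + 2 * x[n]
           + (at(n + 1) << 1)              # 2 * x[n+1]
           - at(n + 2))
    return acc >> 3                        # division by 8 as three right-shifts

def lowpass_53_reference(x, n):
    """Same filter written with multiplications, used only for comparison."""
    def at(i):
        return x[min(max(i, 0), len(x) - 1)]
    total = -at(n - 2) + 2 * at(n - 1) + 6 * at(n) + 2 * at(n + 1) - at(n + 2)
    return total // 8

ramp = [10, 20, 30, 40, 50]                # hypothetical integer input
centre = lowpass_53_shift_add(ramp, 2)     # a linear ramp is preserved: 30
```

The only rounding in this sketch comes from the final three right-shifts.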
An example of a multiplierless-based topology with integer quantization of the filter weights is found in the work of Ballesteros and Moreno [28]. In that case, they use left-shift, right-shift, delay and split units to compute the 5/3 wavelet base. In terms of hardware resources, the wavelet transform uses 99 slice registers, 130 slice LUTs, 87 LUT FF-pairs and 51 bonded IOBs. With that design the maximum delay is 3.59 ns, with a latency equal to 2. This design was used for data hiding purposes [29]. Summarizing, in terms of complexity, a design with a multiplierless-based topology and integer quantization of the filter weights is better than one with a multiplier-based topology, even if the quantization of the filter weights is also integer.

Design goal: accuracy
One of the most important requirements in several systems is accuracy. If a system satisfies this requirement, the user knows that the obtained data are very close to the theoretical data. In the case of hardware wavelet-based systems, it is desirable that the quantization error (q_e) of the filter weights is as low as possible, so that the obtained data are very similar to the real ones. If the system uses the decomposition (DWT) and the reconstruction (IDWT) stages, the total error due to quantization is known as the reconstruction error (r_e). In some applications, the system tolerates values of r_e of up to 2 %; in other cases (e.g. data hiding systems) r_e must be lower than 0.1 %. In this section we present some architectures of the DWT-IDWT and analyze them in terms of accuracy.
Since accuracy is strongly related to the quantization of the filter weights, the main point in the design is to select the most appropriate format to represent data. There are four formats: floating-point, fixed-point, Canonical Signed Digit (CSD) and integer. In floating-point format, the filter weights are represented by several bits assigned to the exponent and the mantissa. The higher the number of bits, the lower the quantization error. Nevertheless, higher precision implies higher hardware cost. In fixed-point format, the binary representation encompasses two parts: the integer part and the fractional part. As in floating-point format, the total number of bits is strongly related to the quantization error. In the third case, CSD, every digit can represent a positive or a negative power of two (e.g. 0.0(-1) = −0.25, where (-1) denotes a negative digit), and then the total number of bits needed to represent data is lower than in fixed-point format (because it does not need a sign bit). Finally, integer format is the simplest in terms of binary representation. It is useful in multiplierless topologies in which data operations (multiplication, division) are performed by right-shifts and left-shifts.
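To make the CSD format more concrete, the following Python sketch converts an integer into signed digits (-1, 0, 1) using the standard non-adjacent-form recoding. The function names are hypothetical, and a fractional weight is assumed to be scaled to an integer first (for example, −0.25 with two fractional bits becomes −1).

```python
def to_csd(n):
    """Signed digits (-1, 0, 1) of integer n, least significant digit first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d            # n is now a multiple of 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def from_csd(digits):
    """Rebuild the integer from its signed digits."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

seven = to_csd(7)           # [-1, 0, 0, 1], i.e. 8 - 1
minus_quarter = to_csd(-1)  # [-1]: a single negative digit, weight 2**-2 after scaling
nonzero_binary = bin(23).count("1")                  # 4 non-zero bits in 10111b
nonzero_csd = sum(1 for d in to_csd(23) if d != 0)   # 3 non-zero digits
```

No two adjacent digits are non-zero in this recoding, so a constant multiplier built from shifts needs fewer additions or subtractions than with plain binary weights.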
In order to illustrate the quantization error with an example, we have selected the 5/3 wavelet base. Its filter weights are shown in Eq. 1 and 2.
In fixed-point format, if the weights are represented with six bits, one bit is for the integer part and five bits are for the fractional part (e.g. |h_0(1)| = 0.00101b = 0.15625). In this example, the sign is not included within the binary data. The quantized filters for the 5/3 wavelet base are obtained by applying this representation to every weight. Since by definition Σ h_0(k) = √2 and in the current case Σ h_0quantized(k) = 1.4375, the total quantization error is 1.64 %.
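The numbers in this example can be checked with a short Python sketch. It assumes that the weight in question has magnitude √2/8 and that it is truncated to five fractional bits; both the helper names and these assumptions are illustrative. The error is computed from the quantized sum stated in the text, not re-derived from a quantization rule.

```python
import math

def truncate_fixed(value, frac_bits):
    """Truncate |value| to frac_bits fractional bits (sign kept separately)."""
    scale = 1 << frac_bits
    return math.floor(abs(value) * scale) / scale

def to_binary(value, frac_bits):
    """Binary string of a non-negative fixed-point value with one integer bit."""
    n = int(round(value * (1 << frac_bits)))
    int_part = n >> frac_bits
    frac_part = n & ((1 << frac_bits) - 1)
    return f"{int_part}.{frac_part:0{frac_bits}b}"

def relative_error_percent(exact, quantized):
    """Relative deviation of the quantized value from the exact one, in percent."""
    return abs(quantized - exact) / exact * 100

weight = math.sqrt(2) / 8            # assumed magnitude of the weight, about 0.1768
q = truncate_fixed(weight, 5)        # 0.15625
bits = to_binary(q, 5)               # '0.00101'
err = relative_error_percent(math.sqrt(2), 1.4375)   # about 1.65 (1.64 % in the text)
```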
In the case of integer format, the term √2 of this wavelet base can be factored out in a similar way to [28,29], and then the weights are represented by rational terms in which both the numerator and the denominator are integers. Now, the quantization error is only due to the division process, which is related to the right-shifts of data (i.e. 1/2^p needs p right-shifts).
The authors of [28,29] found that the quantization error is up to 0.0031 %. In data hiding schemes based on LSB (Least Significant Bit) substitution, a very low quantization error is very useful for recovering the embedded data.
As a second example of the quantization process in fixed-point format, suppose that the wavelet base db2 is selected. The decomposition filters are shown in Equations 5 and 6. With nine bits, the binary representation of the weights is, for example, |h_0(1)| = 0.00100001b = 0.12890625. The quantized filters for db2 are obtained in the same way. In the current case Σ h_0quantized(k) = 1.4140625 and then the total quantization error is 0.01 %. This result is better than the one obtained in the first example; however, in the current case the quantization process uses nine bits instead of six bits.
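The db2 figures can be reproduced with the Python sketch below, assuming the standard db2 low-pass weights (whose sum is √2) and rounding to the nearest value with eight fractional bits. The ordering of the weights, the rounding rule and the names are assumptions, since they are not specified here.

```python
import math

S2, S3 = math.sqrt(2), math.sqrt(3)
# Standard db2 low-pass weights (ordering assumed); their exact sum is sqrt(2).
H0_DB2 = [(1 + S3) / (4 * S2), (3 + S3) / (4 * S2),
          (3 - S3) / (4 * S2), (1 - S3) / (4 * S2)]

def round_fixed(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

quantized = [round_fixed(h, 8) for h in H0_DB2]   # one integer bit + 8 fractional bits
q_sum = sum(quantized)                            # 362/256 = 1.4140625
error_percent = abs(q_sum - S2) / S2 * 100        # about 0.0107
smallest = abs(quantized[3])                      # 33/256 = 0.12890625
```

Under these assumptions the sketch matches the values quoted in the text: the smallest weight becomes 0.12890625 and the quantized sum is 1.4140625.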
Table 1 compares some works on the design of the wavelet transform. The first column gives the method of the design and the proposal (Non-pol.: non-polyphase, pol.: polyphase, lif.: lifting). The second column gives the topology (M: multiplier, Ml: multiplierless). The third column gives the type of structure (C: conventional, P: pipelined). The fourth column defines the quantization format. The fifth column reports the highest quantization error and/or the total reconstruction error. Finally, the sixth column identifies the strengths and weaknesses of the proposal.

As expected, the quantization format is the most important parameter in terms of accuracy. Very low quantization error may be obtained with multiplier or multiplierless topologies. However, designs are less complex with multiplierless topologies because they need only shifts (instead of multiplier units).
If accuracy is the goal of the design, multiplierless topologies with integer quantization are suggested. Since the throughput of lifting and polyphase schemes is the same, either of them can be selected.

Design goal: operating frequency
Another important aspect to take into account in the design of the DWT and the inverse DWT (IDWT) is the operating frequency. Several applications work with high-frequency signals and therefore need a scheme with a fast response. We compare pipelined-based designs and conventional designs.
One disadvantage of the lifting schemes compared with the non-polyphase schemes is that the former have a longer delay path and therefore their operating frequency is expected to be lower. To overcome this problem, pipelined architectures are used. For example, the highest operating frequency of the DWT can increase from 117 MHz to 277 MHz with a pipelined structure [54]. In another work, it has been found that the highest operating frequency depends on the number of pipeline stages: the higher the number of pipeline stages, the higher the highest operating frequency (e.g. 60 MHz with 3 pipeline stages, 186 MHz with 18 pipeline stages [41]). However, a pipelined-based scheme does not always ensure a high operating frequency. For example, a design of the 9/7 lifting wavelet with a pipeline-based structure, fixed-point quantization of the filter weights and a multiplierless-based topology has a highest operating frequency of up to 100 MHz [55].
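The relation between delay path and operating frequency can be illustrated with a simple, idealized Python model. It assumes that the highest operating frequency is the inverse of the critical path delay, that pipelining splits the combinational delay evenly across the stages, and that each stage adds a fixed register overhead. The function names and all numeric values are hypothetical and are not taken from the cited designs.

```python
def fmax_mhz(critical_path_ns):
    """Highest operating frequency (MHz) for a given critical path delay (ns)."""
    return 1000.0 / critical_path_ns

def pipelined_fmax_mhz(comb_delay_ns, stages, reg_overhead_ns):
    """Idealized model: even split of the delay plus a per-stage register overhead."""
    return fmax_mhz(comb_delay_ns / stages + reg_overhead_ns)

# Hypothetical design with 16 ns of combinational delay and 0.5 ns of overhead.
unpipelined = pipelined_fmax_mhz(16.0, 1, 0.5)   # about 60.6 MHz
four_stages = pipelined_fmax_mhz(16.0, 4, 0.5)   # about 222.2 MHz
ceiling = fmax_mhz(0.5)                          # 2000 MHz, never reached in this model
```

In this model the gain saturates as the register overhead starts to dominate. In a real design the stages are rarely balanced, so the slowest stage sets the frequency.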
Another approach consists of using Distributed Arithmetic. For example, the db4 wavelet base is implemented with a ROM lookup table and a cascade of shift registers, in a parallel structure. In this approach, the highest operating frequency is 134 MHz [56].
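The principle of Distributed Arithmetic can be sketched in Python as follows. The sketch assumes unsigned input samples and small integer weights chosen only for illustration (they are not the db4 weights), and it models the ROM as a list indexed by one bit taken from each sample. It is a generic illustration of the technique, not the architecture of [56].

```python
def build_da_lut(coeffs):
    """ROM contents: entry a holds the sum of coeffs[k] for every bit k set in a."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << n)]

def da_inner_product(samples, lut, bits):
    """Bit-serial inner product: one ROM read and one shifted addition per bit plane."""
    acc = 0
    for b in range(bits):
        address = 0
        for k, s in enumerate(samples):
            address |= ((s >> b) & 1) << k   # bit b of sample k selects address bit k
        acc += lut[address] << b             # weight the ROM word by 2**b
    return acc

coeffs = [-1, 3, 3, -1]            # hypothetical integer filter weights
samples = [5, 200, 17, 255]        # hypothetical unsigned 8-bit samples
lut = build_da_lut(coeffs)         # 16-entry ROM
result = da_inner_product(samples, lut, 8)
expected = sum(c * s for c, s in zip(coeffs, samples))
```

No multiplier is used: the products are replaced by table lookups and shifted accumulations, at the cost of a ROM whose size doubles with each additional weight.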
On the other hand, in some works with multiplierless-based topologies and conventional structures, the highest operating frequency is 110 MHz [30], 140 MHz [31] or 166 MHz [28]. These values are lower than the one obtained in [54] but higher than the results of [55] and [41] (with 3 pipeline stages).
Summarizing, although pipelined-based structures may have a shorter delay path, the choice of this structure does not guarantee a high operating frequency. Other factors, like topology and quantization, should be taken into account, too.

Conclusion
In this paper we reviewed several works on the hardware implementation of the DWT. Proposals were analyzed in terms of three design aims: complexity, accuracy and highest operating frequency. In any design, the following parameters must be taken into account: method (convolution, polyphase, lifting), topology (multiplier-based or multiplierless-based), structure (conventional, pipelined), and quantization format (floating-point, fixed-point, CSD, integer).
Firstly, if the aim of the design is low complexity (and low hardware cost), multiplierless topologies are suggested. In addition, integer data use a lower number of equivalent blocks than the other formats. In terms of the method, there is no meaningful difference between polyphase and lifting schemes.
Secondly, if the aim is accuracy, the most important aspect in the design is the quantization format. Low error values have been found when the system works with integer data (regardless of the structure). It is worth noting that multiplierless-based schemes take advantage of integer data, and therefore this choice is also suggested.
Finally, if the aim is operating frequency, the best result was found with a pipelined structure. Nevertheless, some conventional designs obtained better results than some pipelined-based designs, so it cannot be asserted that pipelined-based structures always outperform conventional structures.

Figure 1: Generic block diagram of the non-polyphase scheme.

Figure 2: Generic block diagram of the polyphase scheme.

Figure 3: Generic block diagram of the lifting scheme.

Figure 4: Generic block diagram for the low-pass filter of the 5/3 wavelet base, for a multiplier-based topology.

Table 1: Comparison in terms of accuracy.