Eﬃcient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

In this paper we present an optimized software implementation (sFFT-4.0) of the recently developed Nearly Optimal Sparse Fast Fourier Transform (sFFT) algorithm for the noisy case. First, we developed a modiﬁed version of the Nearly Optimal sFFT algorithm for the noisy case, this modiﬁed al-gorithm solves the accuracy issues of the original version by modifying the ﬂat window and the procedures; and second, we implemented the modiﬁed algorithm on a multicore platform composed of eight cores. The experimental results on the cluster indicate that the developed implementation is faster than direct calculation using FFTW library under certain conditions of sparseness and signal size, and it improves the execution times of previous implementations like sFFT-2.0. To the best knowledge of the authors, the developed implementation is the ﬁrst one of the Nearly Optimal sFFT algorithm for the noisy case.


Introduction
The sFFT term refers to a family of algorithms which allow the estimation of the Discrete Fourier Transform (DFT) of a sparse signal, faster than the FFT algorithms found in the literature [1], [2], [3], [4]; in this case, it is assumed that the signal is sparse or approximately sparse in the DFT domain.
In the one hand, researchers from the Massachusetts Institute of Technology (MIT) presented two sFFT algorithms [2] which improve the runtime over all the previous developments [1], [5], [6], [7] including the most optimized conventional FFT algorithms like FFTW [8]; the first algorithm is intended for the noiseless case, and the second algorithm is intended for the noisy case.
On the other hand, to the best of our knowledge there are only four software implementations of the MIT sFFT algorithms reported in literature; the first one was developed by the algorithm authors for the first version of the sFFT algorithm [1]; the second one is an optimized implementation of the first version of the sFFT algorithm [9]; the third one is a GPU-based implementation of the first version of the sFFT algorithm [10]; and the fourth one is an optimized implementation of the Nearly Optimal sFFT algorithm for the noiseless case [11].Therefore, there is no software implementation reported in literature of the Nearly Optimal sFFT algorithm for the noisy case, which is of practical interest for scientific researchers.
Therefore considering the above, the main contribution of this paper is the development of a modified version of the Nearly Optimal sFFT algorithm and its optimized software implementation on a multicore environment provided by a Beowulf cluster.The modified algorithm has an improved accuracy when compared with the original version described in [2], by modifying the flat window and the procedures; to the best of our knowledge, the modified algorithm is the first implementation of the Nearly Optimal sFFT algorithm for the noisy case, and it is very suitable for hardware implementation using ASICs or FPGAs.
The rest of the paper is organized as follows: Section 2 describes the Nearly Optimal sFFT algorithm, section 3 presents the modified version of the Nearly Optimal sFFT algorithm, section 4 presents experimental results, and section 5 presents the conclusions and future work.

Nearly optimal Sparse Fast Fourier Transform algorithm
In this section, we initially present some concepts about sFFT, then we describe the Nearly Optimal sFFT algorithm, and finally we show some software simulations to verify the basic concepts.Given a discrete time signal Where N is a power of two, [N ] denotes the set of indexes {0, 1, . . ., N −1}, and ω = e −i2π/N is the N -th root of unity.In this case, the number of non-zero elements of the vector x is named the sparsity order K and it is defined in Eq. 2.
Where supp ( x) is the set of indexes of the non-zero elements of the vector x.Then, a time domain signal x is sparse in the DFT domain if K << N .In this context, a set of algorithms named sFFT takes advantage of the signal sparsity in the DFT domain to speed up the runtime of the Fast Fourier Transform (FFT) algorithms used to calculate the DFT [1], [2], [3].Then, taking into account the above, we performed the software implementation of the Nearly Optimal sFFT algorithm for the noisy case presented in [2] by considering the mathematical tools: pseudo-random spectral permutation [1], [7], flat filtering window [2] and hashing function [1], [2]; and using the procedures for approximate sparse recovery [6], [2] and median estimation [1], [2], [7].

Mathematical background about sFFT
2.1.1Pseudo-random spectral permutation This permutation isolates spectral components from each other [7], and it is performed as described in Eq. 3.
Where xp and xp are the permuted spectrum signals in time domain and DFT domain, respectively; π p (k, σ, N ) = σk mod N is the spectral permutation function; and σ ∈ {2c+1 | c ∈ [N/2]} and a ∈ [N ] are the spectral permutation parameters.The spectral permutation function translates the frequency bin from the k-th location to the π p (k, σ, N )-th location, in this case σ −1 mod N exists for all odd σ if N is a power of two.The sFFT algorithm chooses at random the spectral permutation parameters σ and a from a uniform distribution, then the spectral permutation with these pseudo-random parameters is related to a pseudo-random sampling scheme [1], [7].

Flat filtering window
The flat window is a mathematical tool that allows to reduce the FFT size from N to B points, this is accomplished by extending a flat passband region of width N/B around each sparse component; this approach replaces the filter bank of previous sFFT algorithms [7], [5]; additionally, the flat window avoids the use of non equispaced data FFTs [12].A flat window is defined with the time domain
It is important to mention that the practical construction of a flat window lead to a total bandwidth BW Ĝ′ in the DFT domain that satisfies Eq. 5.
Finally, the windowing process is described in Eq. 6, and it is performed in time domain after the pseudo-random spectral permutation is carried out.
Where y is the windowed signal in time domain.

Hashing function
The hashing function obtains B points from the N -point spectrum of the signal y, these points are separated by N/B bins where B is a power of two, and they are obtained by calculating the B-point DFT of the sub-sampled signal y.The vector that has the hashes of the signal y is û ∈ C B and it is calculated using Eq. 7, [1], [2].
From Eqs. 4, 6 and 7, it is possible to note that for each sparse component of the signal x there is one hash located in the offset given by Eq. 8 Where h r (j, σ, N, B) = ⌊π p (j, σ, N )B/N + 0.5⌋ is the round-hash function; and h r (j, σ, N, B) mod B is the index of an element of the vector û that hashes a single sparse component.It is important to note, that due to the condition presented in Eq. 5, the hash can be zero which could reduce the capability of the sFFT algorithm to locate a sparse component.Also, it has been noted that the use of pseudo-random spectral permutation and small support windows leads to a sub-Nyquist random sampling scheme; where, under certain conditions, the average sampling rate is below Nyquist.

Procedures and sFFT algorithm description
The sFFT algorithm, presented in [2] and described in Alg. 1, calculates the DFT estimation ẑ ∈ C N of a noisy sparse signal; the algorithm has the following input parameters: the time domain signal x, the sparsity order K, the constant ǫ for the maximum error in the sparse recovery, and the parameter δ of the flat window [2].
The sFFT algorithm calculates the DFT of a noisy sparse signal using four processing stages: the first one adjusts the flat window parameters B r , α r , and K r ; the second one locates the sparse components using the procedure LocateSignal; the third one estimates the DFT values of the located sparse components using the procedure EstimateValues; and the

|78
Ingeniería y Ciencia fourth one updates the DFT estimation ẑ by accumulating the results of the procedure EstimateValues.In this case, the above procedures use the procedure HashToBins.

Procedure HashToBins
This procedure, presented in Alg. 2, calculates the hashes-error by subtracting the hashes of ẑ from the hashes û, and it has the following input parameters: the time domain signal x; the instantaneous estimation ẑ of x; the parameter B; the spectral permutation parameters σ, and a; and the flat window parameters δ, and α.Input: // Spectral Permutation and Sub-sampling This procedure calculates the hashes-error using three processing stages: the first one simultaneously calculates in time domain the pseudo-random spectral permutation, the windowing, and the hashing process; the second one calculates the DFT domain hashes û by performing the B-point FFT of the time domain hashes u; and the third one calculates the hashes-error by subtracting the DFT domain hashes of ẑ from the hashes û [2].Input: procedure LocateSignal(x, ẑ, B, δ, α) Choose a uniformly at random a from the set [N ]; Choose σ uniformly at random from the set of odd numbers in This procedure locates the sparse components of x using four processing stages: the first one sets the initial conditions, the second one adjusts the frequency locations, the third one reduces the search region, and the fourth one inverts the spectral permutation.The setting of initial conditions is performed by pre-calculating an initial guess of value l of frequency locations in the permuted spectrum, and by pre-calculating the initial values of w, t, D max , and R loc , where w is the width of the region for searching the frequency adjustment, t is the number of candidate adjustments in w, D max is the number of adjustments, and R loc is the number of location iterations for each adjustment.The adjustment of frequency locations is performed by using D max times the procedure Lo-cateInner.The reduction of search region is performed by dividing w by a factor 1/(t ′ ) D−1 at the D-th adjustment, this reduction allows a systematic refining of the frequency location.Finally, the spectral permutation inversion is performed by calculating the function π −1 p (k, σ, N ) on each located frequency.

Procedure LocateInner
This procedure, presented in Alg. 4, performs the adjustment of the frequency locations in the permuted spec-

|80
Ingeniería y Ciencia trum as described in [2], and it has the following input parameters: the time domain signal x; the instantaneous estimation ẑ of x; the parameter B; the spectral permutation parameter σ, the flat window parameters δ, and α; the width of search region w; the number of candidate adjustments t; the number of location iterations R loc ; and the current estimation of frequency locations l.
Algorithm 4: Locate signal inner function. Input: Choose a uniformly at random from the set [N ]; Choose β uniformly at random from the set {⌊ sN t 4w ⌋, . . ., ⌊ sN t 2w ⌋}; û=HashT oBins(x, B, ẑ, σ, a, δ, α); û′ =HashT oBins(x, B, ẑ, σ, a + β, δ, α); This procedure performs the adjustment of the frequency location using five processing stages: the first one sets the initial conditions, the second one calculates the hashes-error, the third one calculates the angles of the hashes-error and candidate frequency bins, the fourth one performs the voting stage, and the fifth one locates the frequency bins.The setting of initial conditions calculates the value of the location threshold s, and clears the vote counters of the t candidate adjustments for the B candidate frequency bins.The hashes calculation stage obtains R loc couples of the form û, û′ , which are calculated from the signal x by using the procedure HashToBins with pseudo-random permutation parameters of the form (σ, a) and (σ, a + β), respectively.The angle calculation stage obtains the vector c, which is the angle differences between the hashes û and the hashes û′ .The voting stage increments the vote counter v j,q corresponding to the j-th frequency bin and q-th adjustment, if Eq. 9 is satisfied.
Where θ j,q is the angle of the j-th candidate frequency bin for the q-th adjustment, that is, θ j,q is related to the q-th candidate frequency adjustment qw/t.Additionally, the above calculation converges if β is chosen at random from the set { snt 4w , . . ., snt 2w } with small enough threshold s.Finally, the frequency location stage selects the minimum q from the set Q = {q ∈ [t] | v j,q > R loc /2}; thus the estimated j-th permuted-frequency bin is refined using Eq. 10

Procedure EstimateValues
This procedure, presented in Alg. 5, calculates the DFT estimation adjustment ŵJ , and it has the following input parameters: the time domain signal x; the instantaneous estimation ẑ of x; the parameter B; the flat window parameters δ and α; the set of located sparse components L; the number of sparse components to estimate K ′ ; and the number of estimation iterations R est .

Ingeniería y Ciencia
Algorithm 5: Estimate values function.This procedure calculates the DFT estimation adjustment ŵJ using three processing stages: the first one calculates the R est sets of hashes-error from the signal x by using the procedure HashToBins and considering different pseudo-random permutation parameters; the second one separately calculates the median of the real and imaginary parts of the calculated hashes-error by only considering the set of located sparse frequency bins in L [1], [7], and by cancelling the pseudo-random spectral permutation; and the third one saves the K ′ most energetic components ŵJ .

Modified nearly optimal Sparse Fast Fourier Transform algorithm
In this section we describe the Modified Nearly Optimal sFFT algorithm and its optimized software implementation named sFFT-4.0,this implementation presents an improved accuracy compared with the original description [2].

Description of the modified nearly optimal sFFT algorithm
The modified sFFT algorithm is accomplished by designing a different flat window and by performing several modifications to the procedures described in section 2.

Modified flat window
In this case, we designed a flat window of the form (1/B, 1/2B, δ, O(B log (N/δ))) as the one described in [1], and considering the time domain and DFT domain representations.The time domain representation of the flat window is described in Eq. 11, and this window does not depend on the parameter α and its support only depends on the parameters B and δ.
In this case, the flat window in time domain has a small support given by O(B log (N/δ)) and it is designed with an ideal low-pass filter using a Gaussian window with finite duration to truncate the impulse response.The cutoff frequency of the low-pass filter is 2C, where C is defined in Eq. 12 and the standard deviation of the Gaussian window is defined in Eq. 13.
The DFT domain representation the flat window is described in Eq. 14.

|84
Ingeniería y Ciencia Ĝ′ Where the vector Ĝ′ is the approximated flat window and ncdf(x) = erfc(−x/ √ 2)/2 is the Normal Cumulative Distribution Function [13].In this case, Ĝ′ satisfies | Ĝ′ − Ĝ| ∞ < δ, and it has a flat super-pass region, a transition band of width 2N 2 log(1/δ)/σ g , and a total bandwidth of BW Ĝ′ = N/B(1  Input: // Spectral Permutation and Sub-sampling In this case, due to the designed flat window has a doubled bandwidth, each sparse component has two hashes located in the offsets o f (j, σ, N, B) and o c (j, σ, N, B) given by the Eqs.15 and 16, respectively.
Where h f (j, σ, N, B) = ⌊π p (j, σ, N )B/N ⌋ is the floor-hash function and h c (j, σ, N, B) = ⌈π p (j, σ, N )B/N ⌉ is the ceil-hash function; h f (j, σ, N, B) mod B and h c (j, σ, N, B) mod B are indexes of the two hashes in û for a single sparse component of x.The proposed flat window improves the accuracy of the algorithm by solving the problem of the zero-value hashes, and it reduces the sampling cost by considering a smaller support.

Modified procedures LocateSignal and LocateInner
These modified procedures are presented in Alg.7 and Alg. 8, respectively.In this case, we reduced the number of modified procedure HashToBins used by the modified procedure LocateInner by fixing the parameter a of the pseudo-random spectral permutation; this allows the usage of the same hashes-error û for all the iterations of the modified procedure LocateInner, thus the sampling cost and the speed are improved.
Choose a uniformly at random a from the set [N ]; Choose σ uniformly at random from the set of odd numbers in    return { ŵJ , J} 3.1.5Modified procedure sFFT Taking into account the above improvements, the proposed modified Nearly Optimal sFFT algorithm is presented in Alg. 10, and it has the following input parameters: the time domain signal x, the sparsity order K, the parameter δ of the flat window [2], the location threshold s, and the number of estimation iterations (R est ).
The modified algorithm simplifies the calculation of the parameters B and K by halving their values during odd-numbered iterations (r mod 2 = 1), also the modified algorithm improves the accuracy of the DFT estimation ẑ.

|88
Ingeniería y Ciencia first, we developed some comparisons of sFFT-4.0implementation against the previous sFFT implementations AAFFT and sFFT-2.0 in terms of runtime and accuracy versus Signal to Noise Ratio (SNR); and second, we present the achieved improvements when the multicore architecture is used for sFFT-4.0implementation.

Comparison tests
Figure 1(a) shows simulation results of the runtime versus N for K = 50; where, red plot represents the runtime of sFFT-4.0 using a single core, the blue plot represents the runtime of sFFT-2.0[1], and the green plot represents the runtime of AAFFT [7].Even though our implementation is running on a single core, it is faster than AAFFT for all N values and faster than sFFT-2.0for N ≥ 2 18 .
Figure 1(b) shows the simulation results of the runtime versus K for N = 2 22 ; in this case, sFFT-4.0 is faster than AAFFT and sFFT-2.0for all K values.From 2 it can be seen that for values of SNR greater than −3 dB sFFT-4.0 has improved accuracy when compared with sFFT-2.0.

Multicore tests
Figure 3(a) shows simulation results of the runtime versus N for K = 50; where, red plot represents the runtime of FFTW3 [8], the blue plot represents the runtime of sFFT-4.0 when one core is used, and the the green plot represents the runtime of sFFT-4.0 when eight cores are used.In this case, sFFT-4.0 is faster than FFTW3 for N above to 2 16 in the eight cores case; however, there is no a significant speed improvement by using the multicore architecture, this is due to the low sparsity of signal that leads to for-loops with small number of iterations which do not take advantage of the multicore architecture, hence the OpenMP's overhead times have a strong influence on the overall execution time [14].
Figure 3(b) shows the simulation results of the runtime versus K for N = 2 22 ; in this case, sFFT-4.0 is faster than FFTW3 for K below to 2500 when a single core is used.From Figure 3(b), it can be noted that there is a significant speed improvement by using the multicore architecture due the increased sparsity of signal; in this case we achieved an acceleration near to 4×.Finally, the modified sFFT algorithm uses the following arithmetic operators: FFT, multiply-accumulate, trigonometric functions (sine, cosine, atan, atan2), modular inversion based on the Fermat's Little Theorem, among others; these operators can be easily mapped to hardware, thus this advantage could facilitate the implementation of the modified sFFT algorithm on an FPGA or an ASIC.

Conclusions
In this paper we present a modified Nearly Optimal sFFT algorithm for the noisy case, this algorithm reduces the sampling cost and corrects the zero-hash issue of the original algorithm by doubling the bandwidth of the flat window and by modifying the original procedures.Additionally, we developed an efficient software implementation of the modified Nearly Optimal sFFT algorithm using a multicore platform, under certain conditions this implementation is faster than the optimized FFT library FFTW3 and previous sFFT implementations.To the best knowledge of the authors, the developed implementation is the first one of the Nearly Optimal sFFT algorithm for the noisy case reported in literature.Finally, the future work

|92
Ingeniería y Ciencia will be addressed to perform an efficient hardware implementation of the modified Nearly Optimal sFFT algorithm using an FPGA.

Algorithm 2 :
Hash to bins function.

2. 2 . 2
Procedure LocateSignal This procedure, presented in Alg. 3, calculates the set L ∈ N O(B) of frequency bins corresponding to O(B) sparse components found in x; and it has the following input parameters: the time domain signal x; the instantaneous estimation ẑ of x; the parameter B; and the flat window parameters δ, and α.

3. 1 . 2
Modified procedure HashToBins This modified procedure is shown in Alg.6 by considering the designed flat window.

Algorithm 6 :
Modified hash to bins function.

3. 1 . 4
Modified procedure EstimateValues This modified procedure, presented in Alg. 9, improves the estimation of the DFT values by adding a correction factor that cancels the effect of the windowing in DFT domain.

Algorithm 9 :
Modified estimate values function.

Figure 2 Figure 2 :
Figure2shows the simulation results of the l 1 -error versus SNR for N = 2 22 and K = 50; where the blue plot represents the error for sFFT-4.0 and the red plot represents the error for sFFT-2.0, the results for AAFFT