A Computational Architecture for Inference of a Quantized-CNN for Detecting Atrial Fibrillation

Atrial Fibrillation (AF) is a common cardiac arrhythmia characterized by an abnormal heartbeat rhythm that can be life-threatening. Recently, researchers have proposed several Convolutional Neural Networks (CNNs) to detect AF. CNNs place high demands on computing and memory resources, which usually require High Performance Computing (e.g., GPUs). This high energy demand is a challenge for portable devices; therefore, efficient hardware implementations are required. We propose a computational architecture for the inference of a Quantized Convolutional Neural Network (Q-CNN) that allows the detection of AF. The architecture exploits data-level parallelism by incorporating SIMD-based vector units, and it is optimized in terms of computation and storage to perform both the convolutional and fully connected layers. The computational architecture was implemented and tested on a Xilinx Artix-7 FPGA. We present results showing an accuracy of 94% for a 22-bit quantization and a throughput of one inference every two seconds.


Introduction
Atrial fibrillation (AF) is an arrhythmia that presents irregular heartbeats, and it is associated with an increase in heart rate due to a disorder in the electrical signals that activate the atria. This type of arrhythmia occurs asymptomatically, that is, there are no symptoms until the first acute episode [1]. However, it is difficult to accurately detect AF in its early stage, and well-trained professional physicians are required to accurately determine the feature information of the ECG [2]. Therefore, it is important to develop fast and accurate algorithms for automatic AF detection.
To address this challenge, several studies have proposed convolutional neural networks (CNNs) for the detection of atrial fibrillation with high levels of accuracy [2],[3],[4],[5]. Moreover, some studies have shown that custom hardware for the inference of CNNs can surpass general-purpose processors in terms of throughput and energy consumption [6].
Quantization is an effective strategy that reduces the precision of both weights and activations. Quantizing a CNN is the first step before implementing it in custom hardware.
FPGAs have become attractive for implementing Q-CNNs because of their flexibility and high energy efficiency. These versatile integrated circuits provide programmable logic blocks and a configurable interconnect, which enable the construction of custom accelerator architectures [7]. However, many challenges remain because CNNs are known for demanding massive amounts of computational and memory resources.
Strategies to perform the inference process at the edge are currently a hot topic in hardware research. The authors in [8] propose a specific dataflow to minimize memory accesses and data movement while maximizing resource utilization. In [9], a Winograd transformation-based algorithm is proposed to optimize the convolution process, which uses a cross-layer strategy; the algorithm reduces the transfer of intermediate data by over 90%. The authors in [10] propose an accelerator that handles network layers of different scales through parameter configuration and maximizes bandwidth by using a data-stream interface. In [11], a reconfigurable CNN accelerator is proposed that reduces the number of off-chip memory accesses by combining convolution and pooling operations and using a 16-bit dynamic fixed-point format. For further details on custom hardware accelerators, readers may refer to recent surveys on this topic [12],[13]: the first focuses on custom hardware for CNNs in general, while the second focuses on FPGA-based accelerators for CNNs.
In this work, we propose a computational architecture for the inference process of a quantized version of the Castillo-Granados CNN [14]. Our goal is to design a special-purpose processor that carries out the inference process using the minimum amount of computational and memory resources while preserving the highest accuracy possible. We designed a SIMD (Single Instruction, Multiple Data) architecture with a single vector unit that is optimized to perform both the convolutional and fully connected layers. This processor performs the inference of a 22-bit Q-CNN version of [14] and achieves 94% accuracy.
This paper is organized as follows: Section II describes the CNN used. Section III describes the quantization process of the CNN. Section IV describes the design of the computational architecture. Section V summarizes the main results of this work. Finally, Section VI closes the article with the conclusions.

Convolutional neural network
A typical CNN is made up of different layers. In each layer, a certain number of connected filters extract information for subsequent layers. The input data passes through the layers to generate a feature vector. Then, a classifier is applied to the resulting feature vector to produce the classification result. There are mainly three types of layers in a CNN model: convolutional layers, pooling layers, and fully connected (FC) layers.
In this paper, the Castillo-Granados CNN [14] is implemented (Figure 1). This model was trained for the detection of AF from ECG signals. These ECG signals were registered using the Einthoven triangle method [15] and stored in a vector of 500 samples with a sampling rate of 250 [samples/s]. This CNN achieved an accuracy of 97.44% using a 64-bit double-precision floating-point format [14]. The CNN has four convolutional layers followed by three FC layers. Table 1 summarizes the characteristics of the layers. The network has a total of 9385 parameters and performs 377428 fixed-point operations (additions and multiplications). Figures 2a and 2b show the distribution percentages of the number of operations carried out and the number of parameters required in the convolutional and FC layers. Note that, on the one hand, the convolutional layers perform the highest percentage of operations (96% vs. 4%); on the other hand, the FC layers require the highest percentage of parameters (87% vs. 13%).
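To make the layer pattern concrete, the sketch below builds a 1-D CNN with the same four-convolutional/three-FC topology on a 500-sample input. The channel counts and kernel sizes are hypothetical placeholders (the actual values are those of Table 1); only the overall structure follows the paper.

import torch
import torch.nn as nn

# Hypothetical sketch: 4 conv layers + 3 FC layers on a 500-sample ECG
# segment. All channel counts and kernel sizes are illustrative, NOT the
# values from Table 1 of the paper.
model = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
    nn.Conv1d(4, 4, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
    nn.Conv1d(4, 8, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
    nn.Conv1d(8, 8, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
    nn.Flatten(),                    # 8 channels x 27 samples = 216
    nn.Linear(216, 32), nn.ReLU(),   # FC layer 1
    nn.Linear(32, 16), nn.ReLU(),    # FC layer 2
    nn.Linear(16, 1), nn.Sigmoid(),  # FC layer 3: AF / non-AF score
)

# One 2-second ECG segment: 500 samples at 250 [samples/s].
segment = torch.randn(1, 1, 500)
print(model(segment).shape)  # torch.Size([1, 1])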

Quantization process
The implementation of the inference process in custom hardware requires a quantization process. This process replaces the 64-bit floating-point format with a reduced number of bits using a fixed-point format, which considerably reduces the amount of computational and memory resources. Figure 3 shows the results of the fake-quantization process, which was carried out in Matlab. Note that with just 12 bits an accuracy of 95% is achieved, and that from 12 bits onwards there is no considerable increase in accuracy. However, in the hardware implementation there are issues related to the truncation error caused by the reduction in the number of bits, which will be analyzed in Section 5.
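As an illustration of this fake-quantization experiment (originally carried out in Matlab), the Python sketch below simulates a signed fixed-point format in floating point. The function name, interface, and the split between total and fractional bits are our own assumptions, not the paper's.

import numpy as np

def fake_quantize(x, total_bits, frac_bits):
    # Simulate signed fixed-point quantization: scale by 2**frac_bits,
    # round to the nearest integer, clip to the signed range of
    # total_bits, and rescale back to floating point.
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale

# Hypothetical weight values for illustration only.
weights = np.array([0.731, -0.052, 1.204, -0.998])
print(fake_quantize(weights, total_bits=12, frac_bits=8))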

Design of the computational architecture
The design has an operation module that computes the input data with the parameters of each layer. This module is controlled by a Finite State Machine (FSM control) that directs the computational resources that carry out the mathematical operations.
The architecture reuses a single operation module to perform both the convolutional and FC layers. This strategy demands the use of buffers to temporarily store the output results. Thus, the proposed architecture achieves a considerable reduction in the use of computational resources. However, this strategy penalizes the throughput because the reuse strategy does not allow a pipelined implementation.
A functional description of the modules in Figure 4 is given below:
• Control FSM: The state machine directs the flow of data processed in each module. It also controls the Operations Module to perform all layers and carries out the memory write/read process.
• BRAM: Memory that stores all the parameters of the CNN.
• Operations Module: Adaptive module that computes either the convolutional or the FC operations.
• Buffer: A set of two memories that store the temporary outputs of each layer, alternating between writing and reading; the read data is fed back as the input of the next layer (see the sketch after this list).
• Input ECG: Memories that store the ECG segments. In this design, there are two Input ECG memories, which allow a new segment to be read while the previous segment is being processed.
• External Hardware: The ECG signals are acquired through external hardware. This module communicates the FPGA with the ADC using the SPI protocol. It provides the data to Input ECG in groups of 500 samples at a sampling frequency of 250 [samples/s].
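The following minimal sketch illustrates the ping-pong behavior of the Buffer module: two memories alternate roles, one being written with the current layer's outputs while the other is read back as the next layer's input. The buffer depth and the toy "layers" are hypothetical; only the alternation scheme follows the description above.

# Two memories alternate write/read roles between consecutive layers.
buffers = [[0.0] * 512, [0.0] * 512]  # hypothetical buffer depth

def run_layers(input_data, layers):
    read_buf = list(input_data)
    write_sel = 0
    for layer in layers:
        out = layer(read_buf)                      # compute current layer
        buffers[write_sel][:len(out)] = out        # write one memory...
        read_buf = buffers[write_sel][:len(out)]   # ...then read it back
        write_sel ^= 1                             # swap roles for next layer
    return read_buf

# Example: two toy "layers" that just scale and offset the data.
result = run_layers([1.0, 2.0, 3.0],
                    [lambda v: [2 * x for x in v],
                     lambda v: [x + 1 for x in v]])
print(result)  # [3.0, 5.0, 7.0]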

Design of the operations module
The Operations Module has been designed based on the convolutional layers, since these contain the largest number of operations in the architecture (Figure 2).
The Operations Module uses a loop-unrolling strategy for the kernels in the convolutional layers [16]. A SIMD-based architecture carries out this strategy by means of a sliding buffer, which contains 27 multipliers, 27 adders, and 27 shift registers. This custom processor allows the hardware to be reused for all layers. Figure 5 illustrates the configuration of the logic resources used for the execution of operations. Note that the kernel and input data (Section 2) flow from left to right on each clock cycle until all 27 registers are filled. Once the first 27 data are stored in the registers, a first temporary output is obtained. Then, the input data is shifted one position to the left and a second temporary output is obtained, and so on. All temporary outputs are accumulated in a specific position of the Buffer memory.
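A behavioral sketch of this sliding-buffer scheme is given below: each loop iteration corresponds to one clock cycle in which the 27 shift registers hold the current input window and the 27 multipliers and adders produce one temporary output in parallel. This is a software model under that assumption, not a description of the actual RTL.

import numpy as np

UNROLL = 27  # number of parallel multipliers/adders/shift registers (from the text)

def sliding_conv(x, kernel):
    # One output per "clock cycle": the window models the contents of
    # the 27 shift registers; the dot product models the 27 parallel
    # multiplies followed by the adder tree.
    assert len(kernel) == UNROLL
    out = np.zeros(len(x) - UNROLL + 1)
    for i in range(len(out)):
        window = x[i:i + UNROLL]
        out[i] = np.dot(window, kernel)
    return out

x = np.random.randn(500)       # one ECG segment (illustrative data)
k = np.random.randn(UNROLL)    # one kernel (illustrative data)
print(sliding_conv(x, k)[:5])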
It is important to note that the dimensions change from one layer to another, so the bias is added in the last tensor dimension.
The FSM controls the data flow by modifying the control signals. The design can be configured to compute both convolutional and FC layers. This strategy saves logic resources that would otherwise be needed to describe layers that execute different operations. If a better latency is required, more parallelism can be applied (more than 27 operations per clock cycle), processing more than one kernel at a time.

Hard-limit transfer function
The original design of the neural network used the Sigmoid activation function [17] (Figure 6), which is applied after the last FC layer. We replaced the Sigmoid function with a Hard-limit function to reduce computational resources. This function was implemented using a single NOT gate. Our results suggest that the use of a Hard-limit function does not affect the accuracy of the network.
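Since sigmoid(x) >= 0.5 exactly when x >= 0, thresholding the Sigmoid output at 0.5 is equivalent to checking the sign of the last FC layer's accumulator, and in two's complement that sign is the most significant bit, which is why a single NOT gate suffices. A minimal sketch follows; the 22-bit width matches the adopted quantization, while the mapping of classes is illustrative.

BITS = 22  # width of the adopted fixed-point quantization

def hard_limit(acc_word):
    # Hard-limit decision on a raw 22-bit two's-complement accumulator:
    # the class is the inverted sign bit (the single NOT gate mentioned
    # in the text). Non-negative -> 1, negative -> 0.
    sign_bit = (acc_word >> (BITS - 1)) & 1
    return 1 - sign_bit

print(hard_limit(5 & ((1 << BITS) - 1)))   # positive accumulator -> 1
print(hard_limit(-3 & ((1 << BITS) - 1)))  # negative accumulator -> 0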

Results
The computational architecture was implemented on the Basys 3 development board, which is based on the Xilinx Artix-7 FPGA. The synthesis, simulation, and debugging were carried out using the Xilinx Vivado Design Suite software, version 2019.1.
The design was tested with a set of 1000 ECG signals from the MIT-BIH Atrial Fibrillation database [18]. These signals were quantized from 12 to 32 bits using Matlab (Section 3). The percentage truncation error (E_t) is generated by the reduction in the number of bits and is calculated by Equation 1:

E_t = (|CHR - SR| / |SR|) x 100    (1)

where CHR is the Custom Hardware Result and SR is the Software Result (Matlab). E_t depends on the number of bits: the error increases when the number of bits is reduced. Besides, this error is propagated through all layers; thus, the biggest E_t is found in the last layer. For example, for 22 bits, the E_t in the last layer was around 0.79%.
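For illustration, the sketch below computes E_t as in Equation 1; the example values are hypothetical, not measurements from the paper.

def truncation_error_pct(chr_value, sr_value):
    # Percentage truncation error E_t between the Custom Hardware
    # Result (CHR) and the Software Result (SR), as in Equation 1.
    return abs(chr_value - sr_value) / abs(sr_value) * 100.0

# Hypothetical layer outputs for illustration only.
print(truncation_error_pct(chr_value=0.9921, sr_value=1.0))  # 0.79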

Accuracy regarding the number of bits
A test was performed using the set of 1000 ECG signals, of which 500 correspond to fibrillation signals and the other 500 to non-fibrillation signals. Note that for 12 bits there is an important reduction in the accuracy, which is due to the truncation error. Taking into account the accuracy and the amount of resources required, a 22-bit quantization was adopted.

Performance
A clock frequency of 34.6 [kHz] was implemented, satisfying the required throughput of one inference every two seconds (each 500-sample ECG segment spans two seconds at 250 [samples/s]). Table 3 summarizes the main results to obtain maximum performance on the FPGA. Figure 8 shows the breakdown of the execution time to compute each CNN layer. Note that the convolution operations are the ones that consume the most time; therefore, if a better latency is required, parallelism techniques can be applied.

Conclusions
A computational architecture was proposed to carry out the inference process of a Q-CNN, which allows the detection of Atrial Fibrillation.
The design is a SIMD-based architecture with a vector unit and a sliding buffer that is optimized for both convolutional and FC layers. The design aims to reduce the amount of computational and memory resources. The architecture has a throughput of one inference every two seconds, i.e., it works at 34.6 [kHz]. However, the design can achieve a throughput of 736 [inferences/s] at its maximum design frequency (25.5 [MHz]). The tests show an inference accuracy of 94% for 22-bit quantization, which is approximately 2.97% below the 64-bit software inference. Future work focuses on the use of quantization-aware strategies, which can improve accuracy while using a lower number of bits [19],[20]. We will also test different approximation strategies, which have also proved to improve accuracy [21]. We aim to use this design in the implementation of a Q-CNN-based portable device for the automatic detection of AF.