A Low-Cost Raspberry Pi-based System for Facial Recognition

Deep learning has become increasingly popular and widely applied to computer vision systems. Over the years, researchers have developed various deep learning architectures to solve different kinds of problems. However, these networks are power-hungry and require high-performance computing (i.e., GPU, TPU, etc.) to run appropriately. Moving computation to the cloud may result in traffic, latency, and privacy issues. Edge computing can solve these challenges by moving the computing closer to the edge, where the data is generated. One major challenge is to fit the high resource demands of deep learning into less powerful edge computing devices. In this research, we present an implementation of an embedded facial recognition system on a low-cost Raspberry Pi, based on the FaceNet architecture. For this implementation, a custom deep learning library written in C was developed to run the model's inference on the device.


Introduction
A person's face contains physical information that can be used for security and access control applications. The main motivation for facial recognition is that it is considered a passive and non-intrusive system. Most biometric data must be collected with special hardware such as a fingerprint scanner, a palm print scanner, or a DNA analyzer [1]. Because face recognition does not require physical touch, it is less intrusive than other biometric systems.
A vast amount of work has been done to make facial recognition algorithms more reliable and accurate. In recent years, deep learning approaches have dominated the facial recognition field due to their high performance in learning discriminative features. As an example, the solution proposed in [2] achieved a precision of 99.63% on the Labeled Faces in the Wild (LFW) dataset [3] using a deep learning system called FaceNet, with almost 7.5M parameters. This architecture learns a mapping from facial images to a compact Euclidean space where distances correspond directly to a measure of facial similarity. Another deep learning solution [4] attains an accuracy of 99.52% on the same LFW dataset using a VGGNet-16 neural network architecture with 138M parameters. That work introduces a new loss function called range loss, designed to decrease intra-personal variations while increasing inter-personal differences in extremely unbalanced data. Also, the authors in [5] propose a new loss function called Additive Angular Margin Loss (ArcFace), which incorporates margins into a well-established loss function to maximize face class separability. They use the ResNet100 neural network with 65M parameters and obtain an accuracy of 99.83% on the LFW dataset.
The implementation of these state-of-the-art neural networks requires high-performance, power-hungry hardware [6]. These huge computational and memory demands impede their deployment on edge devices (e.g., microcontrollers, SoCs).
Bringing the computation closer to the location where it is needed (computation at the edge) can improve response times, save bandwidth, and minimize data transmission time. Processing data at the edge also preserves the privacy of the users, since there is no need to upload the data to the cloud: the data is processed at the source. Cameras, speakers, microphones, and multiple sensors are all located at the edge of the network, which provides a great opportunity for running deep learning algorithms there [7]. Edge devices are inexpensive, small, and flexible hardware devices, characterized by their low energy consumption and reduced cost.
In this work, we present an implementation of an embedded facial recognition system on a Raspberry Pi. The model is based on the FaceNet architecture. The system achieved an accuracy and precision of 77.38% and 81.25%, respectively. Each inference takes around 11 seconds and requires only 46 kB of RAM.
The rest of this paper is structured as follows. Section 2 describes the related work. Section 3 presents the generalities of FaceNet and the structure of a facial recognition system. In Section 4, we present the various methods used to construct the system. The metrics used and the experimental results are presented in Section 5. Finally, Section 6 draws conclusions from our work and indicates future studies.

Related work
Generally, a facial recognition system is composed of three basic steps: (1) face detection, (2) feature extraction, and (3) face recognition [8]. The face detection step locates the face that appears in the image. The feature extraction step extracts a feature vector from the detected face. This feature vector is obtained as v = f(x), where x is the image of the detected face and f(·) is the deep neural network. Finally, the face recognition step compares the extracted features with all registered faces and verifies whether the face is part of the database [9]. Annalakshmi et al. [10] introduced algorithms using the enhanced local binary pattern (SLBP) and the histogram of oriented gradients (HOG) to classify human gender with an SVM classifier. On the LFW database, their proposed hybrid method achieved an accuracy of 95.7%; they attained an accuracy of 99.1% on the FERET dataset. Xi et al. [11] introduced a new unsupervised deep learning-based technique, called local binary pattern network (LBPNet), to extract hierarchical representations of data. LBPNet maintains the same topology as a convolutional neural network (CNN). With an accuracy of 94.04% on LFW, LBPNet is comparable to other unsupervised techniques. Arigbabu et al. [12] proposed a novel face recognition system based on the Laplacian filter and the pyramid histogram of gradient (PHOG) descriptor. They reached an accuracy of 88.50% on LFW. In addition, a support vector machine (SVM) with different kernel functions was used as the recognition step of the system.
To achieve better results, computer vision has moved towards Convolutional Neural Networks (CNNs), a deep learning approach and the state of the art in computer vision. The authors in [13] introduced a new approach using texture analysis and CNNs to detect face liveness, a technique used to counter face spoofing attacks. Their enhanced architecture, based on the Inception version 4 network, obtained 100% accuracy on the NUAA Photograph Impostor dataset for face liveness detection. A pairwise differential siamese network for occluded face recognition was proposed by Song et al. [14]. The AR dataset, which contains images with natural occlusions, was used for evaluation. Their proposed method outperformed state-of-the-art algorithms with accuracies of 99.72% and 100% on the scarf and sunglasses subsets of the AR dataset, respectively. Their method also achieved an accuracy of 99.2% on the LFW dataset.

FaceNet
FaceNet is a face recognition, verification, and clustering neural network [2]. The authors presented several models under the same overarching name, FaceNet, discussing two different deep network architectures: the Zeiler&Fergus-style networks [15] and the Inception-type networks [16]. The latter is based on GoogLeNet and has 20x fewer parameters and 5x fewer FLOPS when compared to the other models proposed in [2]. We used this architecture for our study because of its reduced size.
FaceNet maps a face image to a 128-dimensional feature vector in a Euclidean space. Let x be an image; the mapping is represented by $f(x) \in \mathbb{R}^{128}$, where f is the embedding function, i.e., the GoogLeNet-based neural network. The mapping can be of any dimension, but in this system we used a length of 128, as recommended in [2]. Additionally, the embedding lies on a 128-dimensional hypersphere, i.e., $\|f(x)\|_2 = 1$, which is beneficial in the context of nearest-neighbor classification. The distance between each pair of mappings is correlated with a measure of face similarity; in other words, the distance between feature vectors can be used to determine the identity of a person.
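As an illustration of these two properties, the following C++ sketch (with illustrative names, not taken from the paper's library) normalizes an embedding onto the unit hypersphere and computes the Euclidean distance between two embeddings:

```cpp
#include <cmath>
#include <cstddef>

#define EMB_DIM 128  // embedding length used by FaceNet in this work

// Scale a raw embedding so that its L2 norm is 1 (a point on the hypersphere).
void l2_normalize(float emb[EMB_DIM]) {
    float norm = 0.0f;
    for (size_t i = 0; i < EMB_DIM; ++i) norm += emb[i] * emb[i];
    norm = std::sqrt(norm);
    if (norm > 0.0f)
        for (size_t i = 0; i < EMB_DIM; ++i) emb[i] /= norm;
}

// Euclidean distance between two embeddings; smaller means more similar faces.
float euclidean_distance(const float a[EMB_DIM], const float b[EMB_DIM]) {
    float sum = 0.0f;
    for (size_t i = 0; i < EMB_DIM; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```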
The network is trained using the triplet loss function [17],[18]. A triplet consists of 3 images: an anchor ($x_i^a$), a positive ($x_i^p$), and a negative ($x_i^n$). The anchor and the positive images correspond to the same identity; the negative image has a different identity than the anchor image. The triplet loss function, Equation 1, tries to enforce a margin between each pair of faces from one person to all other faces in the embedding space:

$$L = \sum_{i=1}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ \quad (1)$$

Equation 1 tries to bring the term $\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha$ close to zero. This means the distances between the embeddings of the anchor images and the positive images will tend to be smaller than the distances between the embeddings of the anchor images and the negative images by a margin of $\alpha$, i.e., $\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2$. The parameter $\alpha$ is a margin that is enforced between positive and negative pairs. This process happens during training and can be observed in Figure 1. In hard triplets, the negative image is very close to the anchor, and the positive image is very far from it; we used these hard triplets during the training process (a minimal code sketch of the per-triplet term is given below). Figure 2 shows the block diagram of FaceNet's architecture during the training process, which involves the triplet loss step.

CelebFaces Attributes Dataset (CelebA) [20] is a large-scale face attributes dataset. It contains 202,599 face images of 10,177 identities. The images in this dataset cover large pose variations and background clutter. The dataset can be employed as the training and test set for computer vision tasks such as face recognition, face attribute recognition, and face detection. We chose the CelebA dataset for training the neural network because each identity has a higher number of unique images than in other facial recognition datasets.
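Returning to Equation 1: as a minimal, hypothetical C++ sketch (the actual training was done in TensorFlow), the contribution of a single (anchor, positive, negative) triplet to the loss can be computed as follows:

```cpp
#include <algorithm>

// Squared L2 distance between two 128-dimensional embeddings.
float squared_distance(const float a[128], const float b[128]) {
    float sum = 0.0f;
    for (int i = 0; i < 128; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Loss contribution of one triplet (Equation 1): positive when the
// positive pair is not yet closer than the negative pair by the margin alpha.
float triplet_loss(const float anchor[128], const float positive[128],
                   const float negative[128], float alpha) {
    float term = squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + alpha;
    return std::max(term, 0.0f);  // triplets that already satisfy the margin contribute 0
}
```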
Our system uses the Viola-Jones algorithm [21] in the detection step. Retraining was done to adjust the network to these characteristics. The images were processed using the OpenCV library [22], which allows us to locate the person's face in the image and crop it out. We retrained the model with a training set of 4150 images; each identity in the training set had an average of 26 photos. Figure 3 shows a sample of the training set. We used a pretrained GoogLeNet [16] model, which was downloaded from [23]. This model receives as input an image of 96 × 96 × 3 pixels. The size of the retrained GoogLeNet model is around 15 MB. The model was trained using the triplet loss function (Equation 1), with hard triplets. We used an Adam optimizer with a learning rate of $10^{-3}$ and a batch size of 325 images; the number of epochs was 50. Table 1 shows each layer of the implemented model, along with each layer's output size and number of parameters.
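For concreteness, the detection and resizing steps just described can be sketched with OpenCV's C++ API. This is a minimal illustration, assuming a standard frontal-face Haar cascade file; it is not the paper's exact code:

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Detect a face with the Viola-Jones cascade and return it resized to the
// 96x96x3 input expected by the retrained GoogLeNet model. The cascade file
// path is an assumption; any frontal-face Haar cascade would work.
cv::Mat detect_and_crop(const cv::Mat& image) {
    static cv::CascadeClassifier cascade("haarcascade_frontalface_default.xml");

    cv::Mat gray;
    cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);

    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces);   // multi-window Viola-Jones detection
    if (faces.empty()) return cv::Mat();     // no face found

    cv::Mat face = image(faces[0]);          // crop the first detected face
    cv::Mat resized;
    cv::resize(face, resized, cv::Size(96, 96));
    return resized;
}
```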

Deep learning library
We created our own library to implement the model on the device. One motivation for creating this library is to offer the possibility of deploying deep learning models on any edge device that supports the C language. Our library accepts the parameters of the model in a header file (.h). The retrained model is first saved to a .h5 file using TensorFlow; a Python script then converts the weights from the .h5 file to a .h file (for execution in C). The number of decimals for each parameter in the network is truncated to six.
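The paper does not reproduce the generated header; as a purely hypothetical illustration, the converted .h file could look like the following, with every parameter truncated to six decimal places:

```c
// weights.h -- hypothetical layout of the auto-generated header.
// Array names and values are illustrative; the real file is produced
// by the Python conversion script from the TensorFlow .h5 weights.
#ifndef WEIGHTS_H
#define WEIGHTS_H

static const float conv1_weights[] = {
    0.012345f, -0.987654f, 0.000001f, /* ... */
};
static const float conv1_biases[] = {
    0.104233f, -0.003215f, /* ... */
};

#endif /* WEIGHTS_H */
```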
Once we have the weights, we build the neural network shown in Table 1 in C. The library developed in C is used to describe each layer of the model; this process expresses the model as a function callable from C++. The library uses the NHWC format [24] for deep learning volumes. Table 2 shows all the functions created, with their corresponding descriptions and input parameters. The library dynamically manages the RAM used in the process: the "free memory pointer" parameter in Table 2 signals the corresponding function to clear or not clear the memory it allocates, and takes the value 1 or 0. This library, along with the steps on how to use it, is available on GitHub [25].

The back-end part of the system was implemented in C++. Figure 4 shows a block diagram of our facial recognition system; each block is described in the following. First, we use a detection algorithm from OpenCV to obtain the face of the person. This algorithm uses the Viola-Jones method [21], and its output is the image of the detected face. The dimensions of this image are variable due to the multiple windows that the Viola-Jones algorithm uses. The next step is the resizing of the detected face: the input is an image of any dimension, and the output is an image of 96 × 96 × 3 pixels. After the image has been resized, it passes through the deep neural network model (Table 1), whose output is a 128-dimensional feature vector. Finally, the identification of this encoding uses a database that stores other feature vectors. The identification step compares the Euclidean distance between the generated encoding and the database's encodings; the system identifies the person when the smallest distance found is within a threshold. The identity corresponding to this encoding is the output of the system (see the sketch below).
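A minimal C++ sketch of the identification step just described (the names are illustrative, not taken from the paper's library): the query embedding is compared against every anchor embedding in the database, and a match is declared only when the smallest Euclidean distance falls within the threshold (0.55 gave the best F1 score in Section 5):

```cpp
#include <cmath>
#include <string>
#include <vector>

struct Identity {
    std::string name;
    float embedding[128];  // anchor embedding stored at enrollment
};

// Return the name of the closest registered identity, or "unknown" when the
// smallest Euclidean distance does not fall within the decision threshold.
std::string identify(const float query[128],
                     const std::vector<Identity>& database,
                     float threshold = 0.55f) {
    std::string best = "unknown";
    float best_dist = threshold;  // only distances below the threshold can match
    for (const Identity& id : database) {
        float sum = 0.0f;
        for (int i = 0; i < 128; ++i) {
            float d = query[i] - id.embedding[i];
            sum += d * d;
        }
        float dist = std::sqrt(sum);
        if (dist < best_dist) {   // keep the closest match found so far
            best_dist = dist;
            best = id.name;
        }
    }
    return best;
}
```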

Front-end
The Front-End consists of an interface between the user and the facial recognition system (Listing 4.4).
The algorithm was developed in C++ and has two processes. The first one allows the addition of a person to the database. This process starts by taking a picture of the person and storing the corresponding feature vector in the database (see Section 3). The second process allows the recognition of a face. This process also starts by taking a picture of the person; the picture passes through the system, and the generated feature vector goes through the identification step described in Section 4.3. Finally, the name of the identified person is displayed on the monitor. The database mentioned above consists of a list of feature vectors (embeddings) corresponding to the anchor images of the registered identities: running the first process takes an image (the anchor image) and adds to the database the feature vector generated by running the inference of the system on it.
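Putting the two processes together, a hypothetical C++ glue sketch could look like the following; capture_image, detect_and_crop, run_inference, and identify are illustrative stand-ins for the camera, detection, inference, and identification steps, not functions from the paper's library:

```cpp
#include <iostream>
#include <string>
#include <vector>
#include <opencv2/core.hpp>

// Hypothetical stand-ins for the system's steps (names are illustrative).
cv::Mat capture_image();                                  // take a picture with the Pi camera
cv::Mat detect_and_crop(const cv::Mat& image);            // Viola-Jones detection + 96x96 resize
void run_inference(const cv::Mat& face, float out[128]);  // GoogLeNet-based model inference

struct Identity { std::string name; float embedding[128]; };
std::string identify(const float query[128],
                     const std::vector<Identity>& database, float threshold);

// Process 1: register a person by storing the embedding of their anchor image.
void enroll(std::vector<Identity>& database, const std::string& name) {
    Identity entry;
    entry.name = name;
    run_inference(detect_and_crop(capture_image()), entry.embedding);
    database.push_back(entry);
}

// Process 2: recognize a person and display the identified name on the monitor.
void recognize(const std::vector<Identity>& database) {
    float embedding[128];
    run_inference(detect_and_crop(capture_image()), embedding);
    std::cout << identify(embedding, database, 0.55f) << std::endl;
}
```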

Results
In this work, we used the Raspberry Pi 3B+ model, which is a credit card-sized single-board computer. This model comprises 1 GB of RAM, four USB ports, and a 10/100 Ethernet port. A 5 MP Raspberry Pi camera module was used.

Test set
To evaluate the performance of the system, we collected 103 images from 25 different individuals. The test set was distributed as follows: 82 images belonged to registered users, and 21 images were from impostors (unregistered users). The 82 images belonged to 15 individuals, and each identity had between 3 and 7 images. The database of the system consisted of the feature vectors of the anchor images of these 15 individuals. The pictures were taken from a distance of around 30 cm from the camera. The dataset comprises males and females with ages ranging between 23 and 85 years.

Model performance evaluation
The performance of a facial recognition system in identification scenarios can be evaluated from the results of the identification step (see Section 4.3). Let $n$ be the number of identities and $\mathrm{sample}_i$ the number of face samples of identity $i$; the total number of samples is $\mathrm{Total} = \sum_{i=1}^{n} \mathrm{sample}_i$.

Confusion matrix
We use a confusion matrix to measure the model's performance, evaluating the system for different values of the threshold (see Section 4.3). This parameter is a distance threshold that determines whether an embedding is close enough to an anchor embedding to conclude that the embedding in question corresponds to the identity of that anchor. Table 3 shows the results found for each case. In Table 3, the row in bold corresponds to the model performance for a threshold of 0.55, the value that achieved the highest F1 score (0.793) for the system. For a threshold of 0.55, the accuracy and precision were 77.38% and 81.25%, respectively. As the threshold increased, the true positive rate (TPR) also increased, because the requirement for declaring a match became less strict. Similarly, the number of claimed matches and the false positive rate (FPR) increased with the threshold.
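For reference, the quantities reported in Table 3 follow the standard confusion-matrix definitions; a minimal C++ sketch:

```cpp
// Standard metrics computed from confusion-matrix counts
// (true/false positives and negatives), as reported in Table 3.
struct Metrics { float accuracy, precision, tpr, fpr, f1; };

Metrics compute_metrics(float tp, float fp, float tn, float fn) {
    Metrics m;
    m.accuracy  = (tp + tn) / (tp + tn + fp + fn);
    m.precision = tp / (tp + fp);
    m.tpr       = tp / (tp + fn);   // true positive rate (recall)
    m.fpr       = fp / (fp + tn);   // false positive rate
    m.f1        = 2.0f * (m.precision * m.tpr) / (m.precision + m.tpr);
    return m;
}
```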

ROC curve
The machine learning community often uses the area under the ROC curve (AUC) statistic for model comparison [26]. This practice has been questioned because AUC estimates are noisy and suffer from other problems [27]. Nonetheless, AUC remains a respected measure of classification performance.
Figure 5 shows the values of the TPR and the FPR for different values of the threshold parameter. The AUC of the ROC curve is 0.8. The blue dashed line in Figure 5 is an approximate curve fitted to the values, and each blue square represents the performance of the system for one value of the threshold. The thresholds used for Figure 5 range between 0.2 and 1. Due to the size of the dataset, the blue squares do not form a smooth curve.
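A common way to estimate the reported AUC from the (FPR, TPR) pairs of Figure 5 is trapezoidal integration; the following is a minimal sketch under that assumption, not the authors' evaluation code:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Trapezoidal estimate of the area under the ROC curve from a set of
// (FPR, TPR) points, one per threshold value.
float roc_auc(std::vector<std::pair<float, float>> points) {
    // Include the (0,0) and (1,1) endpoints of the ROC curve.
    points.push_back({0.0f, 0.0f});
    points.push_back({1.0f, 1.0f});
    std::sort(points.begin(), points.end());  // order by increasing FPR
    float auc = 0.0f;
    for (size_t i = 1; i < points.size(); ++i) {
        float width  = points[i].first - points[i - 1].first;
        float height = 0.5f * (points[i].second + points[i - 1].second);
        auc += width * height;  // area of the trapezoid for this segment
    }
    return auc;
}
```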

Processing Time of the System
We present the average time that the system spent on various tasks. Table 4 shows the average execution time of both processes (see Section 4.4) and of the inference of the GoogLeNet-based model on the Raspberry Pi. The table shows that most of the run time of both processes is spent on the inference of the model.

Other Aspects
Beyond the test set on which the system was evaluated, the system performed poorly on images of faces with glasses: the majority of these images were not identified correctly.

Conclusions
We implemented a facial recognition system based on a deep learning architecture (GoogLeNet) on a Raspberry Pi model 3B+. The neural network maps each input to a Euclidean hypersphere, where the distance between mappings correlates with a measure of face similarity. A library developed in C describes the inference of the GoogLeNet architecture.
We evaluated the system on a test set of 103 images. The pictures were collected using the Raspberry Pi camera module. The algorithm had an accuracy and precision of 77.38% and 81.25%, respectively, on a group of 15 people registered in the database.
A drawback of our system is that users are required not to wear glasses, because our results suggest that individuals with glasses had a higher probability of not being identified correctly.
The inference time of the GoogLeNet based model on the Raspberry Pi was 9.26 s.

For future studies, the number of images in the training set can be treated as a hyperparameter. Likewise, it would be interesting to study the effect of the size of the input image on the model. Furthermore, optimizing the C library is a task that would affect the performance of the system.

Figure 1: The result of triplet loss training for the FaceNet models. The distance between the anchor and positive embeddings decreases, while the distance between the anchor and negative embeddings maintains a margin determined by the parameter α. [2]

Figure 2: FaceNet's overall architecture for training. The triplet loss uses the embeddings for training. The inference of the FaceNet models does not include the triplet loss step. [2]

Figure 3: Sample from the CelebA dataset that was used for retraining. [20]

Figure 4: Diagram of the facial recognition system. The steps for face detection and image resizing are followed by the inference of the DNN, which generates a 128-dimensional embedding. This vector is finally compared with the database (embeddings of the anchor images) to conclude the identity of the input image.

Figure 5: The corresponding ROC curve of the classification results.

Table 2: Description and parameters of the functions in the deep learning library.

Table 3: Results of the classification for the facial recognition system.

Table 4: Average time for the various tasks that the facial recognition system carries out.