End-to-end Facial Recognition Deep Learning Model Specialized for Facial Angle using Gray Image
An, Jaewon¹
Choi, Sang Ho¹,²,*
- ¹ Department of Computer Engineering, Kwangwoon University, Seoul, Korea (jaewonan@gmail.com)
- ² School of Computer and Information Engineering, Kwangwoon University, Seoul, Korea (shchoi@kw.ac.kr)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Face recognition, Angle, Gray image, InceptionResnet
1. Introduction
Face-recognition technology is used to improve identity verification and to strengthen
security in various fields such as financial services and mobile phones [1-3]. Typically, feature points are extracted from a face image and then mapped to a vector.
Once a facial image has been embedded in this way, face recognition is performed by comparing
the embeddings of two identities to determine their similarity. A machine learning technique,
the K-NN method, has been used for the embedding and comparison processes [4,5]. Currently, deep learning techniques are being used to automate these processes and to scale to larger
datasets [6-8].
Extensive research has been performed in recent years on various face-recognition
systems using deep-learning techniques. Most of the research on face-recognition systems
is based on FaceNet, developed by Google in 2015 [8]. FaceNet detects faces using MTCNN, extracts high-quality features from the face images,
embeds them into a 128-dimensional vector, and is trained with a triplet loss.
FaceNet achieved an accuracy of 98.87% on the LFW dataset.
However, face recognition in FaceNet is limited with respect to facial angles because it
relies only on two-dimensional distance differences. It therefore cannot handle angles
such as the up and down angles, since it lacks the flexibility to go beyond the front,
left, and right images.
Therefore, a study was conducted to calculate a new loss value to solve this problem.
ArcFace uses a three-dimensional ArcFace loss with an added angular margin [9] and achieved an accuracy of 99.82% on the same LFW dataset. ArcFace fixes the
margin with an arc value; however, when a problem arises in the dataset, the angle must
still change to ensure flexibility, which reduces accuracy.
ElasticFace was proposed to solve this problem; it uses the arc and cosine values
according to the situation rather than fixing the margin [10]. Accordingly, an accuracy of 99.80% was obtained on the LFW dataset, where the angle
does not change significantly. For the AgeDB [11] and CPLFW [12] datasets, the accuracies were 98.35% and 93.28%, increases of 0.2% and 1.2%, respectively,
when compared to ArcFace.
Datasets with many angle changes are considerably affected by light and by the shadows
on the face. In an environment with constant illumination, models focused on the angle
produce good results for angle changes, but when the illumination changes severely,
the performance tends to deteriorate. This study aims to solve these problems by learning
the angle and converting the dataset into grayscale to reduce the effect of
illuminance.
Table 1. The number of image pairs in train and validation dataset.

|               | Train   | Validation |
|---------------|---------|------------|
| Per person    | 48      | 48         |
| Match pair    | 1015    | 113        |
| Mismatch pair | 1050    | 105        |
| Total Images  | 722,750 | 87,200     |
2. Methodology
2.1 Datasets
To develop a face-recognition model specialized for angles, we used the K-Face dataset,
which consists of images taken at various angles. K-Face is a Korean facial image dataset [13] that is divided into 20 types of angles and 30 types of illuminances, corresponding
to approximately 30,000 images. To create a model specialized for angles, we used
seven types of front angles ($\pm45^{\circ}$, $\pm30^{\circ}$, $\pm15^{\circ}$, $0^{\circ}$)
and four types of upper angles ($\pm45^{\circ}$, $\pm15^{\circ}$, $0^{\circ}$).
For the training dataset, we obtained 48 images per person from 12
angle combinations and 4 illuminance levels of 1000, 400, 200, and 150 Lux.
A total of 400 people were used for this process, with 350 people in the training
dataset and 50 people in the validation dataset. Match pairs and mismatch pairs were
constructed per person to improve face-recognition performance. To prevent overfitting
of the model, the numbers of matched and mismatched pairs were balanced at 1015
and 1050, respectively. In total, 722,750 and 87,200 image pairs were used as the
training and validation datasets, respectively.
Additionally, to verify the robustness of the model trained on the K-Face dataset in
a real environment, data were obtained directly from 21 participants using a webcam.
This study was approved by the Institutional Review Board of Kwangwoon University
(IRB No. 7001546-202300131-HR(SB)-001-01). Twelve images were taken per day, including
six photos of the front, left, right, upper, upper left, and upper right in the Lux
1000 and Lux 400 environments. The experiment was conducted over two different days,
and a total of 24 images were obtained per person. Transfer learning was applied to
the real environment data; the first day images were used for training, and the second
day images were used for testing.
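A minimal sketch of how balanced match/mismatch pairs could be constructed per person is shown below; the `images` dictionary (mapping each person ID to its list of 48 image paths) and the default pair counts, which follow Table 1, are illustrative assumptions rather than the authors' exact implementation.

```python
import itertools
import random

def build_pairs(images, n_match=1015, n_mismatch=1050, seed=0):
    """images: dict mapping person_id -> list of file paths (48 per person)."""
    rng = random.Random(seed)
    match_pairs, mismatch_pairs = [], []
    people = list(images)

    for person in people:
        # Match pairs: two different images of the same person (label 1).
        same = list(itertools.combinations(images[person], 2))
        rng.shuffle(same)
        match_pairs += [(a, b, 1) for a, b in same[:n_match]]

        # Mismatch pairs: one image of this person and one of another person (label 0).
        for _ in range(n_mismatch):
            other = rng.choice([p for p in people if p != person])
            mismatch_pairs.append((rng.choice(images[person]),
                                   rng.choice(images[other]), 0))

    return match_pairs + mismatch_pairs
```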
2.2 Loss Function
In existing deep learning models, the binary cross-entropy or mean square error
(MSE) is primarily used as the loss function, and these loss functions compare two objects.
However, in the case of face recognition, images of the same person and of another person
are learned together, so a loss function for three objects is needed. FaceNet
proposed a new loss function, called the triplet loss, to overcome this problem.
The triplet loss compares a baseline (anchor) input with a positive input and a negative
input in terms of distance. The distances are computed with the Euclidean distance function,
as shown in (1), and learning is conducted such that the distance to the positive input is minimized
and the distance to the negative input is maximized. Fig. 1 depicts the structure of the triplet loss function.
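Equation (1) does not survive in the extracted text; the standard FaceNet triplet loss, which the description above appears to follow, is

$$ L = \sum_{i=1}^{N} \Big[ \big\| f(x_i^{a}) - f(x_i^{p}) \big\|_2^{2} - \big\| f(x_i^{a}) - f(x_i^{n}) \big\|_2^{2} + \alpha \Big]_{+} \qquad (1) $$

where $f(\cdot)$ is the embedding function, $x_i^{a}$, $x_i^{p}$, and $x_i^{n}$ are the anchor, positive, and negative inputs, and $\alpha$ is the margin.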
The triplet loss function uses three negative-selection methods: easy negative, semi-hard negative, and
hard negative. The conditions for each method are given below.
In the easy-negative case (2), the anchor-positive distance plus an extra margin ($\alpha$) is
still smaller than the anchor-negative distance; this method is used when the dataset
already separates the same person and other persons well.
Condition (3), the semi-hard negative, is used when the anchor-negative distance is greater than the anchor-positive
distance but does not exceed it by more than the margin.
Condition (4), the hard negative, is used when the negative is closer to the anchor than the positive.
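Equations (2)-(4) are likewise missing from the extracted text; the standard selection conditions, assumed to match the descriptions above, are

$$ \big\| f(x^{a}) - f(x^{p}) \big\|_2^{2} + \alpha < \big\| f(x^{a}) - f(x^{n}) \big\|_2^{2} \qquad (2) $$

$$ \big\| f(x^{a}) - f(x^{p}) \big\|_2^{2} < \big\| f(x^{a}) - f(x^{n}) \big\|_2^{2} < \big\| f(x^{a}) - f(x^{p}) \big\|_2^{2} + \alpha \qquad (3) $$

$$ \big\| f(x^{a}) - f(x^{n}) \big\|_2^{2} < \big\| f(x^{a}) - f(x^{p}) \big\|_2^{2} \qquad (4) $$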
In this study, the triplet loss function was used with the K-Face dataset, as in several
other face-recognition tasks. Because each person is photographed against the same
background and with the same angles and illuminances, the difference between positive
and negative pairs is not large; therefore, we used the semi-hard negative
method.
Fig. 1. Structure of triplet loss function.
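A minimal PyTorch sketch of the triplet loss with semi-hard negative selection is shown below; the function name, margin value, and mining strategy details are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor/positive/negative: (N, D) embedding tensors from the backbone."""
    d_pos = F.pairwise_distance(anchor, positive)   # anchor-positive distance
    d_neg = F.pairwise_distance(anchor, negative)   # anchor-negative distance

    # Semi-hard condition (3): negative farther than positive, but within the margin.
    semi_hard = (d_neg > d_pos) & (d_neg < d_pos + margin)

    loss = torch.clamp(d_pos - d_neg + margin, min=0.0)  # per-sample term of Eq. (1)
    if semi_hard.any():
        loss = loss[semi_hard]                            # keep only semi-hard triplets
    return loss.mean()
```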
2.3 Facial detection
In the case of the K-Face dataset, several background factors in the images can affect
face-recognition performance. Therefore, a method of detecting only the face region
was added to address this issue. In this study, we primarily compared MTCNN [14] and Dlib [15] for detecting the face region and assessed their face-detection performance. MTCNN extracts
the face region by detecting five landmarks: the left eye, the right eye, the nose,
and the two corners of the mouth [14]. Dlib detects faces using histogram of oriented gradients (HOG) features
and returns 68 feature points using the Kazemi model [15].
The face-detection methods were evaluated using the angle images of the K-Face dataset. The
Dlib method showed lower face-detection performance than MTCNN
for data containing images with large angles. Table 2 presents the face-detection performance of MTCNN and Dlib on the training image
dataset. MTCNN failed to detect a face in only 139 of the 19,200
images, a face-detection success rate of approximately 99.3%. In the case of Dlib, however,
1826 of the 19,200 images were not detected, and the success rate was as low as 90.5%;
the failures occurred mainly on the left- and right-angle images.
Since this study focuses on angle-specific face recognition,
the MTCNN method was selected to detect the face region in the images; a minimal usage sketch follows Table 2.
Table 2. Face detection performance comparison.
|       | Total image | Facial detection | Success rate |
|-------|-------------|------------------|--------------|
| MTCNN | 19200       | 19061            | 99.3%        |
| Dlib  | 19200       | 17374            | 90.5%        |
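A minimal face-cropping sketch is shown below, assuming the facenet-pytorch implementation of MTCNN; the file name is a placeholder, and the 224 × 224 crop size follows the settings in Section 2.6.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # assuming the facenet-pytorch package

mtcnn = MTCNN(image_size=224, margin=0, keep_all=False)

img = Image.open("kface_sample.jpg").convert("RGB")
face = mtcnn(img)                   # cropped, aligned face tensor (3 x 224 x 224), or None
boxes, probs = mtcnn.detect(img)    # bounding boxes and detection confidences

if face is None:
    print("No face detected")       # counted as a detection failure, as in Table 2
```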
2.4 Proposed Model
The ResNet family of models, which typically shows good performance, has recently been used
as the deep learning backbone for face recognition [16]. The ResNet models include ResNet18, ResNet34, and InceptionResNet. Among them, we
used InceptionResNet, which exhibits good performance in the field of face recognition.
InceptionResNet facilitates faster learning by combining residual connections with
an inception model [17]. To increase the performance of a deep learning model, it is generally necessary to make
the model deeper and wider. However, as the size increases, the number of parameters
to be learned also increases, which can lead to overfitting. Additionally, as model
complexity increases, power consumption and computation become limiting factors.
The inception module addresses this by using both small convolution filters, which learn
local regions, and large filters, which provide a higher degree of abstraction. Because
using all of these filters increases the number of learnable parameters, the inception
module first applies 1 ${\times}$ 1 convolutions to reduce the channel dimension, which
reduces the number of parameters. Furthermore, ResNet uses residual learning and residual
mapping, in which the values that pass through the weight layers are summed with those
that bypass them. This improves performance by alleviating the problems of overfitting
and gradient vanishing. InceptionResNet is therefore constructed by combining the advantages
of the inception module and ResNet.
Consequently, we constructed a deep-learning model based on the InceptionResNet model
in this study. Table 3 shows the structure of the InceptionResNet model used in this study. In the proposed
model, a mixed layer is created by combining several branches using an inception module.
The existing filter is divided across several branches, which reduces the amount
of computation compared to the original filter. The repeat layer
repeats the inception module. There are several convolution layers within the mixed and
repeat layers. A total of 191 layers and approximately 16 M parameters were used; a construction sketch is given after Table 3.
Table 3. Structure of the proposed model.
| Layer    | Output size     | Parameters |
|----------|-----------------|------------|
| Conv1    | 111 x 111 x 32  | 0.8K       |
| Conv2a   | 109 x 109 x 32  | 9K         |
| Conv2b   | 109 x 109 x 64  | 18K        |
| Pool1    | 54 x 54 x 64    | 0          |
| Conv3    | 54 x 54 x 80    | 0.5K       |
| Conv4    | 52 x 52 x 192   | 138K       |
| Pool2    | 25 x 25 x 192   | 0          |
| Mixed5   | 25 x 25 x 320   | 264K       |
| Repeat1  | 25 x 25 x 320   | 120K       |
| Mixed6   | 12 x 12 x 1088  | 2.6M       |
| Repeat2  | 12 x 12 x 1088  | 1.1M       |
| Mixed7   | 5 x 5 x 2080    | 3.8M       |
| Repeat3  | 5 x 5 x 2080    | 2M         |
| Block8   | 5 x 5 x 2080    | 2M         |
| Conv8    | 5 x 5 x 1536    | 3.2M       |
| Avg_pool | 1 x 1 x 1536    | 0          |
| Linear   | 1 x 1 x 512     | 786K       |
| Total    | -               | 16M        |
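A minimal construction sketch of an InceptionResNet-based embedding model is shown below; the timm model name and the 512-dimensional linear head follow Table 3, but this is an assumed reconstruction rather than the authors' exact implementation.

```python
import timm
import torch
import torch.nn as nn

class FaceEmbedder(nn.Module):
    def __init__(self, embedding_dim=512):
        super().__init__()
        # num_classes=0 removes the classifier head and returns pooled features
        self.backbone = timm.create_model("inception_resnet_v2",
                                          pretrained=False, num_classes=0)
        self.linear = nn.Linear(self.backbone.num_features, embedding_dim)

    def forward(self, x):
        feat = self.backbone(x)                          # pooled 1536-d features (Avg_pool)
        emb = self.linear(feat)                          # 512-d embedding (Linear in Table 3)
        return nn.functional.normalize(emb, p=2, dim=1)  # L2-normalize, as in FaceNet

model = FaceEmbedder()
dummy = torch.randn(1, 3, 224, 224)                      # image size from Section 2.6
print(model(dummy).shape)                                 # torch.Size([1, 512])
```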
2.5 Gray Image
While training the face-recognition model specialized for angles, the face image
at each angle was significantly affected by illumination [18]. When comparing the front image with faces captured from the left and right angles,
the model learning was affected at large angles by shadows or excessive light on the
face, owing to the illuminance. To solve this problem, we used
a gray image rather than an RGB image to reduce the effect of illuminance. When converting
to a gray image using the usual grayscale method, the number of channels is reduced to one
and the characteristics of the face are not properly reflected, thereby reducing the
performance. To solve this problem and reduce only the effect of illuminance while
maintaining the channels, the saturation was reduced to 0 while the three RGB channels
were maintained.
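A minimal sketch of this "gray image with three channels" conversion is shown below: setting the saturation to 0 removes color while keeping the RGB layout. The file name is a placeholder for illustration.

```python
from PIL import Image
from torchvision.transforms import functional as TF

img = Image.open("kface_sample.jpg").convert("RGB")
gray3 = TF.adjust_saturation(img, 0)   # desaturated, but still an RGB image

print(gray3.mode)   # "RGB" - the three channels are preserved
```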
2.6 Experiment Environment
For the hardware experimental environment in this study, we used 128 GB of
RAM, a 24-core Ryzen Threadripper 3960X CPU, and an Nvidia RTX A6000 GPU.
PyTorch 1.12.0 was used as the experimental environment for model training and testing;
the image size, batch size, embedding dimension, and number of epochs were 224 ${\times}$ 224,
100, 512, and 100, respectively.
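A minimal training-setup sketch using these hyperparameters is shown below; it reuses the FaceEmbedder and semi_hard_triplet_loss sketches from earlier sections, and the dummy dataset, optimizer choice, and learning rate are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE, EMBED_DIM, EPOCHS, IMG_SIZE = 100, 512, 100, 224

device = "cuda" if torch.cuda.is_available() else "cpu"
model = FaceEmbedder(embedding_dim=EMBED_DIM).to(device)        # sketch from Section 2.4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)       # assumed optimizer/LR

# Dummy (anchor, positive, negative) triplets so the sketch runs end-to-end;
# in practice these would be the K-Face image pairs from Section 2.1.
dummy = torch.randn(10, 3, IMG_SIZE, IMG_SIZE)
triplet_dataset = TensorDataset(dummy, dummy.clone(), dummy.clone())
loader = DataLoader(triplet_dataset, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for a, p, n in loader:
        a, p, n = a.to(device), p.to(device), n.to(device)
        loss = semi_hard_triplet_loss(model(a), model(p), model(n))  # Section 2.2 sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```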
3. Results
This section presents the results of the experiments, including those on the real-environment data.
Fig. 2 shows a flowchart of the face-recognition process. First, the image is converted
to grayscale, and then face recognition is performed by determining the identity
similarity using the InceptionResnetV2 model.
Fig. 2. Flowchart of proposed facial recognition model.
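A minimal verification sketch of the final step (comparing two embeddings by Euclidean distance) is shown below; the threshold value is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def same_identity(model, img_a, img_b, threshold=1.0):
    """img_a, img_b: preprocessed (3 x 224 x 224) face tensors."""
    emb_a = model(img_a.unsqueeze(0))                 # (1, 512) embedding
    emb_b = model(img_b.unsqueeze(0))
    dist = F.pairwise_distance(emb_a, emb_b).item()
    return dist < threshold                           # True -> match, False -> mismatch
```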
3.1 K-face dataset
Training and validation were conducted with the InceptionResnetV2 model using the
K-Face dataset described in Section 2.1. Fig. 3 depicts the validation accuracy. The validation accuracy of the trained model was
97.4%, and the precision and recall were 95.58% and 97.99%, respectively, showing that
face-recognition learning was performed effectively.
Fig. 3. Validation accuracy for trained models.
Subsequently, the robustness of the trained model to angled images was examined according
to the combination of angled images in the test data.
An accuracy of 98% was obtained when testing only the front images, 97.68% when testing
the front, left, and right images, and 96.9% when testing the front, left, right, and
upper images, as shown in Table 4. Thus, the accuracy was reduced by only approximately 1.1% compared with testing
the front images alone. These results show that the proposed
model is robust to angled images.
Table 4. Performance results for angle-specific face recognition for the K-face dataset.
| Test Image (angle)    | Accuracy | Precision | Recall |
|-----------------------|----------|-----------|--------|
| Front                 | 98%      | 98.11%    | 97.89% |
| Front + Left + Right  | 97.68%   | 97.72%    | 97.49% |
| Front + LR + Up(LR)   | 96.9%    | 95.69%    | 97.11% |

LR: Left + Right
Subsequently, a test was conducted to evaluate the effect of illuminance on the proposed
model. A test dataset consisting only of Lux 400 images, which corresponds to average indoor
illumination, was compared with a dataset containing Lux 1000, 400, 200, and 150 images (see
Table 5).
Table 5. Performance results for illumination-specific face recognition for the K-face dataset.
| Test Image (lux)         | Accuracy | Precision | Recall |
|--------------------------|----------|-----------|--------|
| Lux 400                  | 98.67%   | 98.08%    | 98.83% |
| Lux 1000, 400, 200, 150  | 96.9%    | 95.69%    | 97.11% |
Table 6. Performance results for model-specific face recognition for the K-face dataset.
| Model          | Accuracy | Precision | Recall |
|----------------|----------|-----------|--------|
| Resnet34       | 95.7%    | 93.89%    | 96.30% |
| Proposed model | 96.9%    | 95.69%    | 97.11% |
When the test dataset consisted of only Lux 400 images, the accuracy was 98.67%, which
is approximately 1.77% higher than when images at all Lux values were tested.
Performance across all illuminances was inferior to that with indoor illuminance
only; nevertheless, these results demonstrate that the model remains flexible with respect to illuminance.
For verification, a comparative analysis was conducted against the existing ResNet model.
The ResNet34 model was used as a representative ResNet model,
and the comparison was conducted with data at all illuminances and all angles. The
ResNet model achieved an accuracy, precision, and recall of 95.7%, 93.89%, and 96.30%,
respectively, which are lower than those of the proposed model.
Thus, the proposed model exhibits better performance than the existing ResNet model used for face recognition.
3.2 Real Environment Dataset
Lastly, to confirm that the trained model remains flexible in a real environment,
a test was conducted using the dataset described in Section 2.1. First, transfer
learning was performed to ensure flexibility in various real environments.
When transfer learning was performed using images taken on the same day, the verification
results were better when training for only one or two epochs rather
than five epochs. Based on this, transfer learning was conducted for only two epochs.
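A minimal transfer-learning sketch of this two-epoch fine-tuning step is shown below; the checkpoint path, learning rate, and `day1_loader` (yielding anchor/positive/negative tensors from the first-day webcam images) are assumptions for illustration, reusing the sketches from Sections 2.2 and 2.4.

```python
import torch

model = FaceEmbedder(embedding_dim=512)                     # Section 2.4 sketch
model.load_state_dict(torch.load("kface_pretrained.pth"))   # weights trained on K-Face (assumed path)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # smaller LR for fine-tuning (assumed)

for epoch in range(2):                                       # only two epochs, as described above
    for a, p, n in day1_loader:                              # first-day webcam images (assumed loader)
        loss = semi_hard_triplet_loss(model(a), model(p), model(n))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```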
The test was conducted using data captured twice at a three-day interval to verify the robustness
of the proposed model to environmental changes. The real-environment dataset was captured
in an environment significantly different from that of the K-Face dataset, and the performance
decreased compared with the previous results owing to the influence of the illuminance
and the subject's environment. To reduce the influence of
illuminance, verification was performed after converting the RGB images to grayscale images.
The corresponding result is an evaluation of the performance in a real environment.
When using the grayscale images, the accuracy was approximately 81%, an increase
of about 5% compared with the accuracy of 76% obtained with RGB images. From these results,
the accuracy of each of the 21 participants was evaluated.
Table 7 lists the accuracy for each subject in the real-environment data. For each subject,
six grayscale images were tested: front, left, right, upper, upper-left, and upper-right.
The results show that the overall accuracy over the 21 participants was approximately 81%.
For some subjects, the recognition performance was low, which was confirmed to be due to
environmental factors such as accessories. Nevertheless, we found that the proposed model
is robust to various angles.
Table 7. Accuracy results for each subject.
| Subject | True / Total | Accuracy |
|---------|--------------|----------|
| S1      | 6 / 6        | 100%     |
| S2      | 5 / 6        | 83%      |
| S3      | 6 / 6        | 100%     |
| S4      | 0 / 6        | 0%       |
| S5      | 6 / 6        | 100%     |
| S6      | 6 / 6        | 100%     |
| S7      | 6 / 6        | 100%     |
| S8      | 2 / 6        | 33%      |
| S9      | 6 / 6        | 100%     |
| S10     | 6 / 6        | 100%     |
| S11     | 6 / 6        | 100%     |
| S12     | 6 / 6        | 100%     |
| S13     | 6 / 6        | 100%     |
| S14     | 0 / 6        | 0%       |
| S15     | 6 / 6        | 100%     |
| S16     | 6 / 6        | 100%     |
| S17     | 0 / 6        | 0%       |
| S18     | 5 / 6        | 83%      |
| S19     | 6 / 6        | 100%     |
| S20     | 6 / 6        | 100%     |
| S21     | 6 / 6        | 100%     |
| Total   | 102 / 126    | 80.95%   |
4. Discussion & Conclusion
In this study, we developed an end-to-end face-recognition system specialized for angles
using grayscale images. For angle-specific face recognition, a dataset was created
from the K-Face dataset, and a face-recognition accuracy of 96.9% was achieved.
A comparative analysis was conducted across different angles and illuminances to confirm
the flexibility of the proposed model. Additionally, the performance improved by approximately
1.2% compared with the conventional ResNet model. Lastly, 21 subjects were
recruited and tested to verify the performance of the proposed model in a real environment;
an accuracy of approximately 81% was obtained using transfer learning and gray
images. The performance in the real environment, which was measured under conditions that
differ considerably from the K-Face dataset and with a small number of subjects, was thus
lower (81%) than the K-Face dataset results.
Further research is needed to confirm the flexible performance of the proposed model
in more diverse environments and its robustness to datasets with different accessories
in real environments.
ACKNOWLEDGMENTS
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under
the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2022-00156225)
supervised by the IITP (Institute for Information & Communications Technology Planning
& Evaluation). In addition, this research was supported by the KIAT (Korea Institute
for Advancement of Technology) grant funded by the Korea Government (MOTIE: Ministry
of Trade, Industry and Energy) (P0017124, HRD Program for Industrial Innovation).
This research used datasets from the 'K-Face dataset (AI-Hub, S. Korea)'. All
data information can be accessed through 'AI-Hub (www.aihub.or.kr)'.
REFERENCES
[1] J. H. Im et al., ``Practical privacy-preserving face authentication for smartphones secure against malicious clients,'' IEEE Transactions on Information Forensics and Security, January 2020.
[2] S. Radzi et al., ``IoT based facial recognition door access control home security system using raspberry pi,'' International Journal of Power Electronics and Drive Systems, March 2020.
[3] Y. Zhong, S. Oh, and H. C. Moon, ``Service transformation under industry 4.0: Investigating acceptance of facial recognition payment through an extended technology acceptance model,'' Technology in Society, February 2021.
[4] M. S. Minu et al., ``Face Recognition System Based on Haar Cascade Classifier,'' International Journal of Advanced Science and Technology, 2020.
[5] D. K. Ds and P. V. Rao, ``Implementing and analysing FAR and FRR for face and voice recognition (multimodal) using KNN classifier,'' International Journal of Intelligent Unmanned Systems, October 2019.
[6] L. Boussaad and A. Boucetta, ``Deep-learning based descriptors in application to aging problem in face recognition,'' Journal of King Saud University - Computer and Information Sciences, June 2022.
[7] M. Mosud et al., ``Deep learning-based intelligent face recognition in IoT-cloud environment,'' Computer Communications, February 2020.
[8] F. Schroff et al., ``FaceNet: A unified embedding for face recognition and clustering,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[9] J. Deng et al., ``ArcFace: Additive angular margin loss for deep face recognition,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[10] F. Boutros et al., ``ElasticFace: Elastic margin loss for deep face recognition,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[11] S. Moschoglou et al., ``AgeDB: the first manually collected, in-the-wild age database,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] W. Deng et al., ``Cross-pose LFW: A database for studying cross-pose face recognition in unconstrained environments,'' Beijing University of Posts and Telecommunications, 2018.
[13] Y. Choi et al., ``K-FACE: A large-scale KIST face database in consideration with unconstrained environments,'' arXiv preprint arXiv:2103.02211, 2021.
[14] C. Wu and Y. Zhang, ``MTCNN and FACENET based access control system for face detection and recognition,'' Automatic Control and Computer Sciences, 2021.
[15] N. Boyko et al., ``Performance Evaluation and Comparison of Software for Face Recognition, Based on Dlib and Opencv Library,'' 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), October 2018.
[16] S. Peng et al., ``More trainable inception-ResNet for face recognition,'' Neurocomputing, October 2020.
[17] W. Moungsouy et al., ``Face recognition under mask-wearing based on residual inception networks,'' Applied Computing and Informatics, April 2022.
[18] D. Weitzner et al., ``Face Authentication From Grayscale Coded Light Field,'' 2020 IEEE International Conference on Image Processing (ICIP), October 2020.
Jaewon An received his B.S. degree in the School of Computer and Information Engineering
from Kwangwoon University, Seoul, South Korea. He is currently pursuing an M.S. degree
in the Department of Computer Engineering at Kwangwoon University, Seoul, South
Korea. His research interests include biomedical signal processing and machine learning algorithms.
Sang Ho Choi received his B.S. degree in biomechatronics and electronic and electrical
engineering from Sungkyunkwan University, Suwon, South Korea, and his Ph.D. degree
in bioengineering from Seoul National University, Seoul, South Korea. He worked as
a senior researcher in the Smart Device Team, Samsung Research, Seoul, South Korea.
He is currently an assistant professor in the School of Computer and Information Engineering
at Kwangwoon University, Seoul, South Korea. His research interests include biomedical
signal processing and artificial intelligence algorithms for biomedical applications.