Angle-Specialized Face Recognition Using Gray Images and an InceptionResNet Model

  Jaewon An (Department of Computer Engineering, Kwangwoon University, Seoul, Korea; jaewonan@gmail.com)
  Sang Ho Choi (School of Computer and Information Engineering, Kwangwoon University, Seoul, Korea; shchoi@kw.ac.kr)



Keywords: Face recognition, Angle, Gray image, InceptionResNet

1. Introduction

Face-recognition technology is used to improve identity verification and to strengthen security in various fields such as financial services and mobile phones [1-3]. Typically, feature points are extracted from a face image and mapped to an embedding vector; recognition is then performed by comparing the embeddings of two face images to determine whether they belong to the same identity. Classical machine-learning techniques such as the K-NN method have been used for the embedding and comparison steps [4,5]. More recently, deep-learning techniques have been adopted to automate feature extraction and to scale to larger datasets [6-8].

Extensive research has been performed in recent years on face-recognition systems based on deep learning. Much of this research builds on FaceNet, developed by Google in 2015 [8]. FaceNet detects faces with MTCNN, extracts high-quality features from the face images, embeds them into a 128-dimensional vector, and trains the network with a triplet loss. FaceNet achieved an accuracy of 98.87% on the LFW dataset. However, FaceNet has limitations with respect to pose because it relies only on distance differences in the embedding space; it focuses on frontal, left, and right views and lacks the flexibility to handle other poses such as upward and downward angles.

Subsequently, studies were conducted on new loss functions to address this problem. ArcFace introduces an additive angular margin into the loss (ArcFace loss) [9] and achieved an accuracy of 99.82% on the same LFW dataset. ArcFace fixes the margin to a single arc value; however, when the dataset contains samples for which the angular margin should vary to maintain flexibility, the fixed margin reduces accuracy.

ElasticFace was proposed to solve this problem; it draws the Arc and Cos margin values according to the situation rather than fixing the margin [10]. ElasticFace obtained an accuracy of 99.80% on the LFW dataset, in which the pose does not change significantly. On the AgeDB [11] and CPLFW [12] datasets, the accuracy was 98.35% and 93.28%, improvements of about 0.2% and 1.2%, respectively, over ArcFace.

Datasets with large pose variation are also strongly affected by lighting and by shadows on the face. Under constant illumination, angle-focused models handle pose changes well, but when the illumination changes severely, their performance tends to deteriorate. This study addresses these problems by training on multiple angles and by converting the dataset to grayscale to reduce the effect of illuminance.

Table 1. Number of image pairs in the training and validation datasets.

                       Train      Validation
  Per person           48         48
  Match pairs          1,015      113
  Mismatch pairs       1,050      105
  Total image pairs    722,750    87,200

2. Methodology

2.1 Datasets

To develop a face-recognition model specialized for angles, we used the K-Face dataset, which contains images taken from various angles. K-Face is a Korean facial image dataset [13] divided into 20 camera angles and 30 illuminance conditions, corresponding to approximately 30,000 images. To create a model specialized for angles, we used seven frontal (yaw) angles of ${\pm}$45$^{\circ}$, ${\pm}$30$^{\circ}$, ${\pm}$15$^{\circ}$, and 0$^{\circ}$, and five upper (pitch) angles of ${\pm}$45$^{\circ}$, ${\pm}$15$^{\circ}$, and 0$^{\circ}$. For the training dataset, 48 images were obtained per person from the 12 angle combinations and 4 illuminance levels of 1000, 400, 200, and 150 lux. A total of 400 people were used, with 350 people in the training set and 50 people in the validation set. Matched and mismatched image pairs were generated for each person to improve face-recognition performance, and, to prevent overfitting, the numbers of matched and mismatched pairs were balanced at 1,015 and 1,050 per person, respectively. In total, 722,750 and 87,200 image pairs were used as the training and validation sets, respectively (Table 1).
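As an illustration of how such pairs can be assembled, the following is a minimal Python sketch (not the authors' released code); the directory layout `<root>/<person_id>/<image>.jpg`, the per-person pair counts, and the random seed are assumptions made for illustration.

```python
# Minimal sketch of building matched / mismatched pairs per person.
# Assumes a directory layout of <root>/<person_id>/<image>.jpg (hypothetical).
import itertools
import random
from pathlib import Path

def build_pairs(root: str, n_match: int = 1015, n_mismatch: int = 1050, seed: int = 0):
    rng = random.Random(seed)
    people = {p.name: sorted(p.glob("*.jpg")) for p in Path(root).iterdir() if p.is_dir()}
    match_pairs, mismatch_pairs = [], []

    for pid, imgs in people.items():
        # Matched pairs: two different images of the same person (label 1).
        combos = list(itertools.combinations(imgs, 2))
        rng.shuffle(combos)
        match_pairs += [(a, b, 1) for a, b in combos[:n_match]]

        # Mismatched pairs: one image of this person, one of another person (label 0).
        others = [q for q in people if q != pid]
        for _ in range(n_mismatch):
            other = rng.choice(others)
            mismatch_pairs.append((rng.choice(imgs), rng.choice(people[other]), 0))

    return match_pairs, mismatch_pairs
```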

Additionally, to verify the robustness of the model trained on the K-Face dataset in a real environment, data were collected directly from 21 participants using a webcam. This study was approved by the Institutional Review Board of Kwangwoon University (IRB No. 7001546-202300131-HR(SB)-001-01). Twelve images were taken per day for each participant: six views (front, left, right, upper, upper left, and upper right) under each of the 1000-lux and 400-lux conditions. The acquisition was repeated on two different days, giving a total of 24 images per person. Transfer learning was applied to the real-environment data; the first-day images were used for training, and the second-day images were used for testing.

2.2 Loss Function

Existing deep-learning models primarily use the binary cross-entropy or mean squared error (MSE) as the loss function, and these loss functions compare two objects. In face recognition, however, images of the same person and of a different person must be learned together, so a loss function defined over three objects is needed. FaceNet proposed a new loss function, called the triplet loss, to overcome this problem. The triplet loss compares the distance from a baseline (anchor) input to a positive input (same identity) and to a negative input (different identity). The distances are computed with the Euclidean distance, as shown in (1), and training minimizes the distance to the positive input while maximizing the distance to the negative input. Fig. 1 depicts the structure of the triplet loss function.

(1)
$ \mathcal{L}\left(A,P,N\right)=\max \left(\left\| f\left(A\right)-f\left(P\right)\right\| ^{2}-\left\| f\left(A\right)-f\left(N\right)\right\| ^{2}+\alpha ,\, 0\right) $

Triplet selection distinguishes three types of negatives: easy negatives, semi-hard negatives, and hard negatives. The conditions for each type are given below:

(2)
$ \left\| f\left(A\right)-f\left(P\right)\right\| ^{2}+\alpha <\left\| f\left(A\right)-f\left(N\right)\right\| ^{2} $
(3)
$ \left\| f\left(A\right)-f\left(P\right)\right\| ^{2}<\left\| f\left(A\right)-f\left(N\right)\right\| ^{2}<\left\| f\left(A\right)-f\left(P\right)\right\| ^{2}+\alpha $
(4)
$ \left\| f\left(A\right)-f\left(N\right)\right\| ^{2}<\left\| f\left(A\right)-f\left(P\right)\right\| ^{2} $

In the easy-negative case (2), the positive distance plus the margin ($\alpha$) is already smaller than the negative distance; such triplets arise when the dataset already separates the same person from other people well. Eq. (3) describes the semi-hard case, in which the negative distance is larger than the positive distance but does not exceed it by more than the margin. Eq. (4) describes the hard-negative case, in which the negative distance is smaller than the positive distance.

The triplet loss has been used in several face-recognition tasks; in this study, it was applied to the K-Face dataset. Because each person is photographed against the same background and under the same set of angles and illuminances, the positive and negative distances often do not differ greatly; therefore, the semi-hard negative strategy was used.

Fig. 1. Structure of triplet loss function.
../../Resources/ieie/IEIESPC.2024.13.5.534/fig1.png
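As a concrete illustration of Eqs. (1) and (3), the following is a minimal PyTorch sketch of the triplet loss with semi-hard negative selection; it is an assumed implementation for illustration, not the authors' training code, and the margin value is arbitrary.

```python
# Triplet loss of Eq. (1) with the semi-hard negative condition of Eq. (3).
import torch

def semi_hard_triplet_loss(anchor, positive, negatives, margin=0.2):
    """anchor, positive: (D,) embeddings; negatives: (N, D) candidate negative embeddings."""
    d_ap = torch.sum((anchor - positive) ** 2)                        # ||f(A)-f(P)||^2
    d_an = torch.sum((anchor.unsqueeze(0) - negatives) ** 2, dim=1)   # ||f(A)-f(N)||^2 per candidate

    # Semi-hard negatives (Eq. (3)): farther than the positive but within the margin.
    semi_hard = (d_an > d_ap) & (d_an < d_ap + margin)
    d_an_sel = d_an[semi_hard].min() if semi_hard.any() else d_an.min()

    # Eq. (1): max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)
    return torch.clamp(d_ap - d_an_sel + margin, min=0.0)

# Example with random 512-dimensional embeddings.
loss = semi_hard_triplet_loss(torch.randn(512), torch.randn(512), torch.randn(8, 512))
```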

2.3 Facial Detection

In the K-Face dataset, background factors in the images can affect face-recognition performance. Therefore, a step that detects only the face region was added to address this issue. In this study, we used MTCNN [14] and Dlib [15] to detect the face region and compared their face-detection performance. MTCNN extracts the face region by detecting five landmarks: the left eye, the right eye, the nose tip, and the two mouth corners [14]. Dlib detects faces using histogram of oriented gradients (HOG) features and returns 68 facial landmarks using the Kazemi model [15].

The face-detection methods were evaluated on the angle images of the K-Face dataset. For images taken at large angles, the detection performance of Dlib was lower than that of MTCNN. Table 2 shows the face-detection performance of MTCNN and Dlib on the training images. MTCNN failed to detect a face in only 139 of 19,200 images, a detection success rate of approximately 99.3%. Dlib, in contrast, failed on 1,826 of the 19,200 images, a success rate of only 90.5%; most failures occurred in images taken at large left and right angles. Because this study focuses on angle-specific face recognition, MTCNN was selected to detect the face region.
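The face-cropping step can be reproduced with an off-the-shelf MTCNN implementation; the sketch below assumes the facenet-pytorch package and a hypothetical input path, since the paper does not state which implementation was used.

```python
# Minimal sketch of MTCNN-based face detection and cropping (facenet-pytorch assumed).
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=0)     # 224 x 224 crops, matching the model input size

img = Image.open("sample_face.jpg")          # hypothetical input image
boxes, probs = mtcnn.detect(img)             # bounding boxes and detection confidences
face = mtcnn(img)                            # cropped, normalized face tensor, or None

if face is None:
    print("no face detected")                # e.g., the 139 failure cases reported in Table 2
```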

Table 2. Face detection performance comparison.

           Total images   Faces detected   Success rate
  MTCNN    19,200         19,061           99.3%
  Dlib     19,200         17,374           90.5%

2.4 Proposed Model

The ResNet family of models, which typically shows good performance, has recently been used for face recognition [16]; examples include ResNet18, ResNet34, and InceptionResNet. Among them, we used InceptionResNet, which performs well in face recognition. InceptionResNet enables faster learning by combining residual connections with the Inception architecture [17]. To increase the performance of a deep-learning model, it generally must be made deeper and wider; however, as the model grows, the number of parameters to be learned also increases, which can lead to overfitting, and the added complexity raises power consumption and computation. The Inception module applies filters of several sizes in parallel: small filters mainly capture local regions, whereas larger filters yield a higher degree of abstraction, and both are used together. Because using all filter sizes increases the number of learnable parameters, the Inception module inserts 1 ${\times}$ 1 convolutions to reduce the channel dimension before the larger convolutions, thereby reducing the parameter count. ResNet, in turn, uses residual learning: residual mapping sums the values that pass through the weight layers with those that bypass them, which mitigates overfitting and gradient vanishing and improves performance. InceptionResNet therefore combines the advantages of the Inception module and ResNet.
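To make these two ideas concrete, the following is a simplified, illustrative Inception-ResNet-style block in PyTorch: parallel branches with 1 ${\times}$ 1 channel-reduction convolutions, followed by a residual sum with the block input. The branch widths and scaling factor are assumptions and do not reproduce the exact configuration of the proposed model.

```python
# Simplified Inception-ResNet-style block: 1x1 reductions + parallel branches + residual sum.
import torch
import torch.nn as nn

class InceptionResBlock(nn.Module):
    def __init__(self, channels: int = 320, scale: float = 0.17):
        super().__init__()
        self.scale = scale
        # Parallel branches, each starting with a 1x1 convolution that reduces channels.
        self.branch1 = nn.Conv2d(channels, 32, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),
            nn.Conv2d(32, 48, kernel_size=3, padding=1),
            nn.Conv2d(48, 64, kernel_size=3, padding=1),
        )
        # 1x1 convolution mapping the concatenated branches back to `channels`.
        self.project = nn.Conv2d(32 + 32 + 64, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        # Residual mapping: the block output is the input plus the scaled branch output.
        return self.relu(x + self.scale * self.project(branches))

# Example: a 25 x 25 x 320 feature map (the size of the Mixed5/Repeat1 stage in Table 3).
out = InceptionResBlock(320)(torch.randn(1, 320, 25, 25))
```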

Consequently, we constructed a deep-learning model based on InceptionResNet in this study. Table 3 shows the structure of the model. In the proposed model, each Mixed layer combines several branches in an Inception-style module; splitting the existing filter across several branches reduces the amount of computation compared with a single large filter. Each Repeat layer repeats the Inception module, and the Mixed and Repeat layers contain several convolution layers. In total, 191 layers and approximately 16 M parameters were used.

Table 3. Structure of the proposed model.

  Layer      Output size       Parameters
  Conv1      111 x 111 x 32    0.8K
  Conv2a     109 x 109 x 32    9K
  Conv2b     109 x 109 x 64    18K
  Pool1      54 x 54 x 64      0
  Conv3      54 x 54 x 80      0.5K
  Conv4      52 x 52 x 192     138K
  Pool2      25 x 25 x 192     0
  Mixed5     25 x 25 x 320     264K
  Repeat1    25 x 25 x 320     120K
  Mixed6     12 x 12 x 1088    2.6M
  Repeat2    12 x 12 x 1088    1.1M
  Mixed7     5 x 5 x 2080      3.8M
  Repeat3    5 x 5 x 2080      2M
  Block8     5 x 5 x 2080      2M
  Conv8      5 x 5 x 1536      3.2M
  Avg_pool   1 x 1 x 1536      0
  Linear     1 x 1 x 512       786K
  Total      -                 16M

2.5 Gray Image

While training the angle-specialized face-recognition model, the face images at each angle were strongly affected by illumination [18]. When comparing the frontal image with images taken from large left or right angles, shadows or excessive light on the face caused by the illuminance affected model training. To reduce the effect of illuminance, we used gray images rather than RGB images. However, converting to a gray image with the usual grayscale method reduces the input to a single channel, and the facial characteristics were not properly reflected, which lowered performance. To reduce only the effect of illuminance while keeping the channel structure, the saturation was instead set to 0 while the three RGB channels were maintained.
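A minimal sketch of this desaturation step is shown below, assuming OpenCV (the paper does not name the image-processing library): the image is converted to HSV, the saturation channel is set to 0, and the result is converted back, giving an achromatic image that still has three channels.

```python
# Desaturate while keeping three channels: convert to HSV, set S = 0, convert back.
import cv2

def desaturate_keep_channels(path: str):
    bgr = cv2.imread(path)                        # H x W x 3 image (BGR order)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hsv[:, :, 1] = 0                              # saturation -> 0; hue and value unchanged
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)   # gray-looking image, still 3 channels

gray3 = desaturate_keep_channels("sample_face.jpg")   # hypothetical input path
print(gray3.shape)                                     # (H, W, 3)
```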

2.6 Experiment Environment

The hardware environment used in this study consisted of 128 GB of RAM, an AMD Ryzen Threadripper 3960X CPU (24 cores), and an NVIDIA RTX A6000 GPU.

PyTorch 1.12.0 was used for model training and testing; the image size, batch size, embedding dimension, and number of epochs were 224 ${\times}$ 224, 100, 512, and 100, respectively.
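As a small illustration of these settings, the sketch below builds the 224 ${\times}$ 224 input tensor expected by the model; the resize-and-tensor transform is an assumption, and the batch and embedding sizes simply mirror the values stated above.

```python
# Input preprocessing matching the stated 224 x 224 image size (transform assumed).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),              # HWC uint8 -> CHW float in [0, 1]
])

img = Image.new("RGB", (640, 480))      # stand-in for one face image
x = preprocess(img).unsqueeze(0)        # shape (1, 3, 224, 224)

# A training batch of 100 such images has shape (100, 3, 224, 224), and the model
# maps each image to a 512-dimensional embedding, i.e., an output of shape (100, 512).
print(x.shape)
```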

3. Results

This section presents the experimental results, including those for the real-environment data. Fig. 2 shows a flowchart of the face-recognition process: the input image is first converted to a gray image, and face recognition is then performed by computing the identity similarity with the InceptionResNetV2 model.

Fig. 2. Flowchart of proposed facial recognition model.
../../Resources/ieie/IEIESPC.2024.13.5.534/fig2.png
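The similarity decision in the flowchart can be expressed as a simple distance threshold on the embeddings; the sketch below is illustrative only, and the threshold value is an assumption since the paper does not report the operating threshold.

```python
# Verification by embedding distance: same identity if the distance is below a threshold.
import torch
import torch.nn.functional as F

def same_identity(emb_a: torch.Tensor, emb_b: torch.Tensor, threshold: float = 1.0) -> bool:
    emb_a = F.normalize(emb_a, dim=0)        # L2-normalize the 512-d embeddings
    emb_b = F.normalize(emb_b, dim=0)
    distance = torch.norm(emb_a - emb_b)     # Euclidean distance, as in Eq. (1)
    return bool(distance < threshold)

print(same_identity(torch.randn(512), torch.randn(512)))
```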

3.1 K-Face Dataset

Training and validation were conducted with the InceptionResNetV2 model using the K-Face dataset described in Section 2.1. Fig. 3 shows the validation accuracy. The validation accuracy of the trained model was 97.4%, with a precision of 95.58% and a recall of 97.99%, indicating that face-recognition training was performed effectively.

Fig. 3. Validation accuracy for trained models.
../../Resources/ieie/IEIESPC.2024.13.5.534/fig3.png

Subsequently, the robustness of the trained model to angled images was examined using different combinations of angle images in the test data.

As shown in Table 4, an accuracy of 98% was obtained when testing only the frontal images, 97.68% when testing the frontal, left, and right images, and 96.9% when testing the frontal, left, right, and upper images. The accuracy therefore dropped by only about 1.1% relative to testing with frontal images alone, which shows that the proposed model is robust to angled images.

Table 4. Performance results for angle-specific face recognition on the K-Face dataset.

  Test image (angle)       Accuracy   Precision   Recall
  Front                    98%        98.11%      97.89%
  Front + Left + Right     97.68%     97.72%      97.49%
  Front + LR + Up(LR)      96.9%      95.69%      97.11%

  LR: Left + Right

Subsequently, a test was conducted to evaluate the effect of illuminance on the proposed model. A test dataset consisting only of 400-lux images, which corresponds to average indoor illumination, was compared with a dataset containing images at 1000, 400, 200, and 150 lux (see Table 5).

Table 5. Performance results for illumination-specific face recognition on the K-Face dataset.

  Test image (lux)           Accuracy   Precision   Recall
  Lux 400                    98.67%     98.08%      98.83%
  Lux 1000, 400, 200, 150    96.9%      95.69%      97.11%

Table 6. Performance results for model-specific face recognition on the K-Face dataset.

  Model            Accuracy   Precision   Recall
  ResNet34         95.7%      93.89%      96.30%
  Proposed model   96.9%      95.69%      97.11%

When the test dataset consisted only of 400-lux images, the accuracy was 98.67%, approximately 1.77% higher than when images at all lux levels were tested. Although the performance over all illuminances was lower than that under indoor illuminance alone, these results show that the model remains reasonably robust to changes in illuminance.

For verification, a comparative analysis was conducted against an existing ResNet model. ResNet34 was used as a representative ResNet model, and the comparison was performed on data covering all illuminances and all angles. The ResNet34 model achieved 95.7%, 93.89%, and 96.30% in accuracy, precision, and recall, respectively, all lower than those of the proposed model, which therefore performs better than the existing ResNet model used for face recognition.

3.2 Real Environment Dataset

Lastly, to confirm that the trained model generalizes to a real environment, a test was conducted using the webcam dataset described in Section 2.1. First, transfer learning was performed to adapt the model to the varied real-environment conditions.

When transfer learning was performed on the images taken on the first day, the verification results were better when training was conducted for only one or two epochs rather than five. Based on this, transfer learning was carried out for only two epochs.
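A minimal sketch of this two-epoch fine-tuning step is shown below; the stand-in backbone, the dummy first-day triplets, the optimizer, and the learning rate are all assumptions for illustration and do not reproduce the authors' setup.

```python
# Two-epoch transfer learning on (dummy) first-day triplets with a stand-in backbone.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

anchors, positives, negatives = (torch.randn(12, 3, 224, 224) for _ in range(3))
day1_loader = DataLoader(TensorDataset(anchors, positives, negatives), batch_size=4)

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))  # stand-in for the trained model
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

for epoch in range(2):                       # only two epochs, as described above
    for a, p, n in day1_loader:
        loss = criterion(backbone(a), backbone(p), backbone(n))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```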

The test was conducted using data captured in two sessions three days apart to verify the robustness of the proposed model to environmental changes. The real-environment data were collected under conditions that differ considerably from the K-Face dataset, so the performance decreased compared with the previous results owing to the illuminance and surroundings of the subjects. To reduce the influence of illuminance, verification was also performed after converting the RGB images to gray images. With the gray images, the accuracy was approximately 81%, an increase of about 5% over the 76% obtained with RGB images. Based on these results, the accuracy was then evaluated for each of the 21 participants.

Table 7 lists the accuracy for each subject on the real-environment data. For each subject, six gray-scale images were tested: the front, left, right, upper, upper-left, and upper-right views. Across the 21 participants, the overall accuracy was approximately 81%. For some subjects, the recognition performance was low, which was confirmed to be due to factors such as accessories. Nevertheless, the results indicate that the proposed model is robust to various angles.

Table 7. Accuracy results for each subject.

  Subject   True / Total   Accuracy
  S1        6 / 6          100%
  S2        5 / 6          83%
  S3        6 / 6          100%
  S4        0 / 6          0%
  S5        6 / 6          100%
  S6        6 / 6          100%
  S7        6 / 6          100%
  S8        2 / 6          33%
  S9        6 / 6          100%
  S10       6 / 6          100%
  S11       6 / 6          100%
  S12       6 / 6          100%
  S13       6 / 6          100%
  S14       0 / 6          0%
  S15       6 / 6          100%
  S16       6 / 6          100%
  S17       0 / 6          0%
  S18       5 / 6          83%
  S19       6 / 6          100%
  S20       6 / 6          100%
  S21       6 / 6          100%
  Total     102 / 126      80.95%

4. Discussion & Conclusion

In this study, we developed an end-to-end face-recognition system specialized for angles using gray images. For angle-specific face recognition, a dataset was built from the K-Face dataset, and a face-recognition accuracy of 96.9% was achieved. A comparative analysis across different angles and illuminances confirmed the flexibility of the proposed model, and its performance was approximately 1.2% higher than that of the conventional ResNet model. Lastly, 21 subjects were recruited to verify the performance of the proposed model in a real environment; an accuracy of approximately 81% was obtained using transfer learning and gray images. The drop to 81% relative to the K-Face results is attributed to the real-environment data being captured under substantially different conditions and from a small number of subjects. Further research is needed to confirm the performance of the proposed model in more diverse environments and its robustness to real-environment data containing different accessories.

ACKNOWLEDGMENTS

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2022-00156225) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). In addition, this research was supported by the KIAT (Korea Institute for Advancement of Technology) grant funded by the Korea Government (MOTIE: Ministry of Trade, Industry and Energy) (P0017124, HRD Program for Industrial Innovation). This study used the 'K-Face dataset (AI-Hub, S. Korea)'. All data information can be accessed through AI-Hub (www.aihub.or.kr).

REFERENCES

[1] J. H. Im et al., ``Practical privacy-preserving face authentication for smartphones secure against malicious clients,'' IEEE Transactions on Information Forensics and Security, January 2020.
[2] S. Radzi et al., ``IoT based facial recognition door access control home security system using Raspberry Pi,'' International Journal of Power Electronics and Drive Systems, March 2020.
[3] Y. Zhong, S. Oh, and H. C. Moon, ``Service transformation under Industry 4.0: Investigating acceptance of facial recognition payment through an extended technology acceptance model,'' Technology in Society, February 2021.
[4] M. S. Minu et al., ``Face recognition system based on Haar cascade classifier,'' International Journal of Advanced Science and Technology, 2020.
[5] D. K. Ds and P. V. Rao, ``Implementing and analysing FAR and FRR for face and voice recognition (multimodal) using KNN classifier,'' International Journal of Intelligent Unmanned Systems, October 2019.
[6] L. Boussaad and A. Boucetta, ``Deep-learning based descriptors in application to aging problem in face recognition,'' Journal of King Saud University - Computer and Information Sciences, June 2022.
[7] M. Mosud et al., ``Deep learning-based intelligent face recognition in IoT-cloud environment,'' Computer Communications, February 2020.
[8] F. Schroff et al., ``FaceNet: A unified embedding for face recognition and clustering,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[9] J. Deng et al., ``ArcFace: Additive angular margin loss for deep face recognition,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[10] F. Boutros et al., ``ElasticFace: Elastic margin loss for deep face recognition,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[11] S. Moschoglou et al., ``AgeDB: The first manually collected, in-the-wild age database,'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] W. Deng et al., ``Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments,'' Beijing University of Posts and Telecommunications, 2018.
[13] Y. Choi et al., ``K-Face: A large-scale KIST face database in consideration with unconstrained environments,'' arXiv preprint arXiv:2103.02211, 2021.
[14] C. Wu and Y. Zhang, ``MTCNN and FaceNet based access control system for face detection and recognition,'' Automatic Control and Computer Sciences, 2021.
[15] N. Boyko et al., ``Performance evaluation and comparison of software for face recognition, based on Dlib and OpenCV library,'' 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), October 2018.
[16] S. Peng et al., ``More trainable inception-ResNet for face recognition,'' Neurocomputing, October 2020.
[17] W. Moungsouy et al., ``Face recognition under mask-wearing based on residual inception networks,'' Applied Computing and Informatics, April 2022.
[18] D. Weitzner et al., ``Face authentication from grayscale coded light field,'' 2020 IEEE International Conference on Image Processing (ICIP), October 2020.
Jaewon An
../../Resources/ieie/IEIESPC.2024.13.5.534/au1.png

Jaewon An received his B.S. degree from the School of Computer and Information Engineering, Kwangwoon University, Seoul, South Korea. He is currently pursuing an M.S. degree in the Department of Computer Engineering at Kwangwoon University, Seoul, South Korea. His research interests include biomedical signal processing and machine-learning algorithms.

Sang Ho Choi
../../Resources/ieie/IEIESPC.2024.13.5.534/au2.png

Sang Ho Choi received his B.S. degree in biomechatronics and electronic and electrical engineering from Sungkyunkwan University, Suwon, South Korea, and his Ph.D. degree in bioengineering from Seoul National University, Seoul, South Korea. He worked as a senior researcher on the Smart Device Team at Samsung Research, Seoul, South Korea. He is currently an assistant professor in the School of Computer and Information Engineering at Kwangwoon University, Seoul, South Korea. His research interests include biomedical signal processing and artificial intelligence algorithms for biomedical applications.