Nayoung Yun¹, Sangkyu Lim¹, Seoyoung Hong², Jiwon Moon¹, Hakjun Lee¹, Sunmok Kim¹, Heung-Jae Lee¹, Ki-Baek Lee¹

¹ Department of Electrical Engineering, Kwangwoon University, Seoul 01897, Korea
({nayoung1124, khlim258, mjw426, cpfl410, nadasunmok, hjlee, kblee}@kw.ac.kr)
² Department of Electrical and Computer Engineering, New York University, NY, USA
(sh6480@nyu.edu)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
NLP, Sentence similarity, FAQ, Assist system
1. Introduction
Customer service is one of the most difficult tasks in product marketing because it
is not easy to satisfy customers while limiting costs [1]. Cheong et al. [2] showed that 53% of customers were not satisfied with customer support center service.
The results also revealed that customers wanted their problems solved quickly by customer
service representatives.
As one way to address this problem, a number of companies have constructed live chat
systems that connect customers to representatives in real time through the Internet
[3]. In addition, since costs increase with an increased number of representatives, attempts
to replace people with chatbots have also been initiated in order to keep costs down.
It is questionable, however, whether chatbots are capable of natural conversation
and of understanding exactly what the customer needs [4,5].
Another way, text classification through deep learning, can be used to preliminarily
classify customer questions before passing them to human representatives. However,
the accuracy of such classification systems is not good enough to help representatives,
and it is not easy to obtain the training data needed to improve them [6-10]. Attempts have been made to transform these classification problems into similarity-evaluation
problems based on recently proposed natural language processing (NLP) models,
such as BERT [11], BiMPM [12], and OpenAI GPT [13]. Nonetheless, although these NLP models were pre-trained to include extensive domain
information, they are not efficient enough to be used directly for customer service, and a
lot of additional data are required [14-18].
Consequently, in this paper, a novel representative assistance system is proposed
to overcome the difficulties with the previous approaches and to improve customer
service efficiency. The proposed system includes two main functions: FAQ recommendation
and automatic data acquisition. For FAQ recommendation, the system calculates a similarity
measure between an input question and every question in a well-defined customer service
FAQ list. Then, it recommends the top $\textit{k}$ FAQs to the representative.
In fact, customers frequently ask questions that have already been answered in the
FAQ list, or that are similar to those answered. Thus, the recommended FAQs can help the representative
answer more quickly and accurately by transforming a subjective problem into an
objective one. With this system, the representative chooses one of the recommendations,
and the choice is automatically saved as new data. Consequently, the system is updated
with newly collected data from the specific service domain, and its accuracy
improves incrementally.
This paper is organized as follows. Section 2 explains the proposed system. In Section
3, the experimental results are evaluated. Finally, Section 4 presents the conclusions.
2. The Proposed System
2.1 Building a Baseline NLP Model
The first main function of the proposed system is to recommend the $\textit{k}$ FAQs
from the list that are most similar to a customer's query. Fig. 1 shows the overall flow of the proposed system. First, it is necessary to train
a baseline NLP model to recommend the most similar $\textit{k}$ FAQs. Here, for the
baseline NLP model, the Quora Question Pairs (QQP) dataset [19] was used. The dataset contains roughly 400,000 sentence pairs with corresponding
labels. The original dataset is structured as shown in Fig. 2(a) and was modified for simplicity as shown in Fig. 2(b).
Fig. 1. The overall flow of the proposed system.
Fig. 2. An example of the data format.
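The simplification of the data format can be sketched as follows. This is a minimal sketch in plain Python; the field names (`question1`, `question2`, `is_duplicate`) follow the public QQP release, and the example row itself is invented for illustration.

```python
# Minimal sketch of simplifying QQP rows: the id/qid columns are dropped,
# keeping only (question1, question2, label) pairs for similarity training.
# Field names follow the public QQP release; the example row is invented.

def simplify_qqp(rows):
    """Reduce full QQP rows to sentence pairs with an integer label."""
    return [{"question1": r["question1"],
             "question2": r["question2"],
             "label": int(r["is_duplicate"])}
            for r in rows]

raw = [{"id": 0, "qid1": 1, "qid2": 2,
        "question1": "How do I reset my password?",
        "question2": "What is the way to reset a password?",
        "is_duplicate": 1}]

pairs = simplify_qqp(raw)
print(pairs[0]["label"])  # 1
```

A pair labeled 1 is a duplicate (semantically equivalent) pair; 0 marks a non-duplicate pair.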
2.2 Operating the System with the NLP Model
When a customer asks a question, the NLP model measures the similarity between the
customer’s question and every FAQ in a well-defined FAQ list. Then, the system shows
the representative the closest $\textit{k}$ FAQs. After that, the representative chooses
the one among them that best matches the customer’s question. New data are constructed
from these choices and are stored in the training dataset. Fig. 3 shows an example of the data-construction process with $\textit{k}$=3.
Fig. 3. An example of the data-construction process.
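The recommend-then-record loop described above can be sketched as follows. This is a minimal, self-contained sketch: the word-count cosine `similarity` is only a stand-in for the actual NLP model score (BiMPM, GPT, or BERT), and the FAQ texts are invented examples.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def similarity(q1, q2):
    """Stand-in scorer: cosine similarity over word counts.
    In the actual system an NLP model (BiMPM, GPT, or BERT) produces this score."""
    a, b = Counter(tokens(q1)), Counter(tokens(q2))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(query, faq_list, k):
    """Return the k FAQs most similar to the query, best first."""
    return sorted(faq_list, key=lambda f: similarity(query, f), reverse=True)[:k]

def record_choice(query, chosen_faq, dataset):
    """The representative's choice becomes a new labeled training pair."""
    dataset.append({"question1": query, "question2": chosen_faq, "label": 1})

faqs = ["How do I change my password?",   # invented example FAQs
        "How do I delete my account?",
        "Why was my post removed?"]
training_data = []
query = "I want to change the account password"
top3 = recommend(query, faqs, k=3)
record_choice(query, top3[0], training_data)
```

Each recorded choice is a positive pair in the same simplified format as the QQP training data, so the accumulated choices can be fed directly into fine-tuning.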
2.3 Fine-tuning the Model
After the process described in Subsection 2.2 has been repeated many times, and enough
data have been added to the training dataset, the model can be fine-tuned with the
newly obtained data. The fine-tuning process is as follows. First, as shown in Fig. 1, the weights of the layers are copied to the model’s next version except for the
pooling layer, which is the last layer of the model. Instead, the pooling layer of
the next version is initialized. Then, the new version of the model is trained with
the data in the training dataset. When the training process is finished, the updated
model is applied to the system. The processes in subsections 2.2 and 2.3 are repeated
until no more improvement is achieved.
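The weight-transfer step above can be sketched with the model's parameters represented as a plain name-to-weights dictionary (layer names and shapes are invented for illustration): every layer is copied to the next version except the pooling layer, which is re-initialized before training on the new data.

```python
def next_version(weights, pooling_key="pooling"):
    """Copy all layer weights to the next model version, except the pooling
    (last) layer, which is re-initialized (here to zeros for simplicity;
    a real model would use its own random initializer)."""
    return {name: ([0.0] * len(w) if name == pooling_key else list(w))
            for name, w in weights.items()}

# Invented toy weights standing in for a real model checkpoint.
v0 = {"encoder.0": [0.2, -0.1],
      "encoder.1": [0.5, 0.3],
      "pooling":   [0.9, 0.7]}

v1 = next_version(v0)
# v1["encoder.0"] and v1["encoder.1"] are copied; v1["pooling"] is fresh.
```

The copied encoder layers preserve what the previous version learned, while the fresh pooling layer is free to adapt to the newly collected domain data.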
3. Performance Evaluation
The environmental settings for the experiments are as follows. As the initial FAQ
list for customer service, we chose 40 FAQs from the Facebook website. Customer questions
were collected from the Facebook user community. We then divided the collected questions
into two sets. One set was used for training the model, and the other for testing
the performance of the system in each version. The training dataset was built through
a role-playing simulation by five participants randomly recruited from a population
of graduate students who did not know the authors personally. The participants used
the proposed system as if they were representatives, choosing responses from the recommended
$\textit{k}$ FAQs with $\textit{k}$=5 for each query. The test dataset was built
by having the participants directly match questions to FAQs. Note that the training and test datasets included
questions related to FAQs 1-20 and FAQs 1-40, respectively, which means the system
did not learn information from FAQs 21-40.
In the experiments, BiMPM, OpenAI GPT, and BERT were employed as the NLP models, and
the results were compared. Each model was pre-trained with the QQP dataset and used
as the baseline model. The service’s operation and fine-tuning scenario was set up to reflect
the real-world customer service process illustrated in Fig. 4. The scenario consisted of four operation/fine-tuning steps and one
testing step, with 5,000 data entries gathered in each operation step and the
number of FAQs in the FAQ list increased at the beginning of Step 3. In Step 5 (the
testing step), since versions 1 and 2 were trained with data from FAQs 1-10, they
were tested with the test dataset that included FAQs 1-10 and then retested with the
dataset that had FAQs 21-40. Similarly, versions 3 and 4 were tested with the test
dataset including FAQs 1-20 and then retested with the dataset using FAQs 21-40.
Fig. 4. The test scenario of the experiments.
Fig. 5 shows the test accuracies for each model and version. Top $\textit{k}$ accuracy (the
y-axes) indicates the probability that the best answer exists among the top $\textit{k}$
recommendations from the system. For every NLP model, the accuracy of the proposed
system increased after each step in the scenario. Table 1 shows the test accuracy in detail. Most importantly, for the BERT
and OpenAI GPT models, which were already pre-trained with relatively large amounts of data in their initial
states, the test accuracies increased even on the test dataset that excluded previously experienced
information. Moreover, BiMPM showed significant accuracy improvement on the test
dataset that included the experienced information; this is an advantage because additional
data for a changed FAQ list can be readily and automatically accumulated during
service with the proposed system, as shown in the test scenario. OpenAI GPT showed
the best performance with the proposed system under the test configuration.
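The top-$\textit{k}$ accuracy metric can be computed as sketched below; the recommendation lists and correct answers here are invented toy values.

```python
def top_k_accuracy(ranked_recommendations, correct_faqs, k):
    """Percentage of queries whose correct FAQ appears among the top-k
    recommendations (the metric plotted on the y-axes of Fig. 5)."""
    hits = sum(1 for recs, gold in zip(ranked_recommendations, correct_faqs)
               if gold in recs[:k])
    return 100.0 * hits / len(correct_faqs)

# Invented toy values: three queries, each with a ranked recommendation list.
recs = [["FAQ 3", "FAQ 7", "FAQ 1"],
        ["FAQ 2", "FAQ 5", "FAQ 9"],
        ["FAQ 4", "FAQ 8", "FAQ 6"]]
gold = ["FAQ 3", "FAQ 9", "FAQ 2"]

print(top_k_accuracy(recs, gold, k=1))  # only the first query hits at k=1
print(top_k_accuracy(recs, gold, k=3))  # the second query also hits at k=3
```

By construction the metric is non-decreasing in $\textit{k}$, which is why the Top 5 columns in Table 1 always dominate the Top 1 columns.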
Fig. 5. The resulting top $\textit{k}$ accuracies for each model.
Table 1. The detailed results of test accuracies for each model.
| Version | Model | FAQs 1-10 (Top 1 / 3 / 5) | FAQs 1-20 (Top 1 / 3 / 5) | FAQs 21-40 (Top 1 / 3 / 5) |
|---|---|---|---|---|
| Baseline (version 0) | BiMPM | 37.90 / 60.00 / 71.60 | 47.77 / 59.96 / 72.71 | 43.16 / 63.31 / 74.24 |
| | GPT | 39.06 / 61.58 / 68.56 | 57.99 / 68.35 / 69.86 | 33.60 / 43.82 / 63.67 |
| | BERT | 42.52 / 62.73 / 70.58 | 41.08 / 60.79 / 67.63 | 41.51 / 54.68 / 61.58 |
| Fine-tuned (version 1) | BiMPM | 61.51 / 78.05 / 83.59 | N/A | 39.06 / 63.67 / 74.89 |
| | GPT | 65.28 / 81.51 / 86.88 | N/A | 60.79 / 77.91 / 86.19 |
| | BERT | 55.04 / 82.37 / 88.78 | N/A | 60.50 / 74.17 / 81.80 |
| Fine-tuned (version 2) | BiMPM | 64.60 / 81.00 / 87.55 | N/A | 41.65 / 64.96 / 76.97 |
| | GPT | 65.61 / 81.94 / 88.25 | N/A | 60.43 / 77.05 / 86.83 |
| | BERT | 55.04 / 73.24 / 82.73 | N/A | 63.45 / 70.36 / 80.12 |
| Fine-tuned (version 3) | BiMPM | N/A | 73.89 / 86.83 / 91.22 | 38.34 / 63.74 / 76.12 |
| | GPT | N/A | 81.87 / 90.94 / 91.87 | 62.59 / 79.42 / 87.41 |
| | BERT | N/A | 69.07 / 84.43 / 89.18 | 60.94 / 79.14 / 85.68 |
| Fine-tuned (version 4) | BiMPM | N/A | 76.44 / 89.82 / 93.34 | 40.36 / 64.75 / 75.47 |
| | GPT | N/A | 85.11 / 90.58 / 92.09 | 59.93 / 81.44 / 87.55 |
| | BERT | N/A | 65.29 / 80.98 / 87.84 | 52.59 / 79.42 / 86.76 |
4. Conclusion
In this paper, we proposed a novel system to assist customer service representatives
in answering customer questions. Since the proposed system automatically accumulates
new data during service calls with a representative, it can avoid the data-shortage
problem common in various service fields. In addition, as the experimental results
show, the more data gathered, the greater the accuracy becomes. This means the accuracy
of the proposed system improves from the automatically accumulated data as time goes
by. Above all, the proposed system transforms subjective problems into objective ones
so that representatives can save time in answering, and so customers are more satisfied.
Furthermore, this system can be applied to languages other than English.
ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MSIT) (No. NRF-2019R1F1A1062979) and by the Excellent Researcher
Support Project of Kwangwoon University in 2021.
REFERENCES
[1] Novalia Agung W. A., 2018, "The Impact of Interpersonal Communication toward Customer Satisfaction: The Case of Customer Service of Sari Asih Hospital," MATEC Web of Conferences, 150, 05087.
[2] Cheong K. J., Kim J. J., So S. H., 2008, "A study of strategic call center management: Relationship between key performance indicators and customer satisfaction," Vol. 6, No. 2, pp. 268-276.
[3] Jane Lockwood, 2017, "An analysis of web-chat in an outsourced customer service account in the Philippines."
[4] Bhavika R. Ranoliya, Nidhi Raghuwanshi, Sanjay Singh, 2017, "Chatbot for university related FAQs," 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI).
[5] Chung M., Ko E., Joung H., Kim S. J., 2018, "Chatbot e-service and customer satisfaction regarding luxury brands," Journal of Business Research.
[6] Tetsuji Nakagawa, Kentaro Inui, Sadao Kurohashi, 2010, "Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables," in Proceedings of NIPS 2010.
[7] Honglun Zhang, Liqiang Xiao, Yongkun Wang, Yaohui Jin, 2017, "A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence.
[8] Baoyu Jing, Chenwei Lu, Deqing Wang, Fuzhen Zhuang, 2018, "Cross-Domain Labeled LDA for Cross-Domain Text Classification," 2018 IEEE International Conference on Data Mining (ICDM).
[9] Shang Gao, Arvind Ramanathan, Georgia Tourassi, 2018, "Hierarchical Convolutional Attention Networks for Text Classification," in Proceedings of the Third Workshop on Representation Learning for NLP, Association for Computational Linguistics, pp. 11-23.
[10] Jeremy Howard, Sebastian Ruder, 2018, "Universal Language Model Fine-tuning for Text Classification," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 328-339.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805.
[12] Zhiguo Wang, Wael Hamza, Radu Florian, 2017, "Bilateral Multi-Perspective Matching for Natural Language Sentences," arXiv:1702.03814.
[13] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, 2018, "Improving Language Understanding by Generative Pre-Training."
[14] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes, 2017, "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670-680, Copenhagen, Denmark, Association for Computational Linguistics.
[15] Bryan McCann, James Bradbury, Caiming Xiong, Richard Socher, 2017, "Learned in Translation: Contextualized Word Vectors," in NIPS, arXiv:1708.00107.
[16] Antonio Valerio Miceli Barone, Barry Haddow, Ulrich Germann, Rico Sennrich, 2017, "Regularization techniques for fine-tuning in neural machine translation," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1489-1494, Copenhagen, Denmark, September 7-11, 2017, Association for Computational Linguistics.
[17] Kanako Komiya, Hiroyuki Shinnou, 2018, "Investigating Effective Parameters for Fine-tuning of Word Embeddings Using Only a Small Corpus," in Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP, Association for Computational Linguistics, pp. 60-67.
[18] Jinhyuk Lee, Wonjin Yoon, Sungdon Kim, Donghyeon Kim, Sunkyu Kim, Chang Ho So, Jaewoo Kang, 2019, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," arXiv:1901.08746.
[19] Chen Z., Zhang H., Zhang X., Zhao L., 2018, "Quora question pairs."
Author
Nayoung Yun received her BS degree in Electrical Engineering from Kwangwoon University,
Seoul, Korea, in 2021. She has been an MS student in the Department of Electrical Engineering,
Kwangwoon University, Seoul, Korea. She is interested in computer vision and transformer-based
deep learning models.
Sangkyu Lim graduated from Kwangwoon University, majoring in Electrical Engineering. Interested
in vision and multimodal NLP.
Seoyoung Hong received her BS degree in Electrical Engineering from Kwangwoon University,
Seoul, Korea, in 2021. Since 2021, she has been an MS student in the Department of
Electrical and Computer Engineering, New York University, NY, USA. Her research interests
include signal processing and deep learning.
Jiwon Moon graduated from Kwangwoon University, majoring in Electrical Engineering. Currently
a graduate student at the Nature-Inspired Intelligence Laboratory, Department of Electrical
Engineering, Kwangwoon Graduate School. Interested in vision and multimodal NLP.
Hakjun Lee graduated from Kwangwoon University, majoring in Electrical Engineering.
Currently a graduate student in the Nature-Inspired Intelligence Laboratory in the
Department of Electrical Engineering of the Kwangwoon Graduate School; research interests
include transformer-based deep learning models.
Sunmok Kim received his BS degree in electrical engineering from Kwangwoon University,
Seoul, Korea, in 2016. Since 2016, he has been an MS student in the Department of Electrical
Engineering, Kwangwoon University, Seoul, Korea. His research interests include machine
learning.
Heung-Jae Lee received the BS, MS, and PhD degrees from Seoul National University,
in 1983, 1986, and 1990, respectively, all in electrical engineering. He was a visiting
professor at the University of Washington from 1995 to 1996. His major research interests
are expert systems, neural networks, and fuzzy-system applications to power
systems, including computer applications. He is a full professor at Kwangwoon
University.
Ki-Baek Lee received his BS, MS, and PhD degrees in electrical engineering from
the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Rep. of Korea,
in 2005, 2008 and 2014, respectively. Since 2014, he has been an assistant professor
with the Department of Electrical Engineering, College of Electronics and Information
Engineering, Kwangwoon University, Seoul, South Korea. He has researched computational
intelligence and artificial intelligence, particularly in swarm intelligence, multi-objective
evolutionary algorithms, and machine learning. His research interests also include
real‐world applications such as sign‐language recognition, object picking, and customer
service automation.