Cheng Ying
(School of Teachers College, Xianyang Vocational Technical College, Xianyang, 712000,
China; ying_cheng712@163.com)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Neural network, Visual question answering, Preschool language education, Attention mechanism
1. Introduction
Preschool is a critical period for language improvement and language cognition in
children. During this period, systematic and standardized language education plays
a vital role, providing a solid foundation for their future learning and cognition
[1]. Throughout the preschool years, language runs through every stage of childhood development, including all areas of physical and mental growth. Language activities are one of the five main teaching areas planned by early childhood education institutions; they serve as the cornerstone for other activities, appear throughout the other areas, and play a crucial role in children's development. Language is a comprehensive ability, and its development in young children is inseparable from their emotions, thinking, social participation, communication skills, knowledge, and experience.
With the gradual popularization of information technology in education, the application
of a visual question answering (VQA) system to children’s preschool language education
has become a newly achievable path [2]. Visual question answering is an emerging research area that combines computer vision and natural language processing. Similar to image description, a VQA system can interpret an input image and generate a simple textual description of it. However, the VQA task goes beyond feature extraction and image understanding: its main purpose is to help children gain a deeper understanding of an image based on a textual question, combining image features with text analysis to predict the corresponding answer. Developing a VQA model to build interactive, intelligent tutoring systems for young children, and applying it to textbook images for problem solving, therefore has broad application prospects. However, current VQA models achieve limited overall accuracy, and their predictions often fail to meet practical requirements.
With the progressive maturation of acquisition technology and the growing significance of neural networks in fostering the development of vision and language [3], the use of neural networks for visual question answering has emerged as a key research direction in this field. However, current neural-network-based VQA models are still limited by the number of questions they can handle, resulting in low overall accuracy. Therefore, based on a neural network for extracting information features, this research introduces the Hierarchical Joint Attention (HJA) model to optimize the whole process and raise the precision of VQA systems.
2. Related Work
The success of a visual question answering system as a whole will determine how well
it can be used to teach language to preschoolers. In recent years, researchers at home and abroad have extensively investigated the use of neural networks in visual question answering and have made progress. To increase the accuracy of
answers generated by VQA, Yin et al. proposed a memory-enhanced Recurrent Neural Network
(RNN) model. It utilizes a new video coding method, and enhances memory through an
emerging differentiable neural computer. The outcomes showed high precision [4]. Cao and colleagues designed a parse-tree-guided reasoning network for the collaborative reasoning problem of image understanding in interpretable VQA systems. By deriving image cues, experiments on relational datasets showed how the interpretability of the inference system can be highlighted [5]. To reduce low-level semantic loss in top-level visual representations of convolutional
neural networks, Hong et al. developed a hierarchical feature network that derives
the semantics of visual-question answers through intermediate convolutional layers.
In it, focused attention, hierarchical features, and multimodal pooling are combined,
and the results showed it has superior performance [6]. To meet the requirement that a VQA system grasp both visual content and language information, Garg et al. encoded the visual information in an image into a sequence and used a neural network to process it appropriately. The outcomes
demonstrated better image quality and processing effectiveness [7]. Yusuf et al. applied graph convolutional networks to different subtasks of visual
question answering datasets with different results, proposed a VQA framework based
on fine-tuned word representations, and evaluated the framework’s performance through
various performance measures. The results showed that it improves vision-to-language
processing efficiency [8]. For the difficult challenge of modeling visual question answering, Le et al. created a general-purpose, reusable neural unit that encodes a batch of input tensor objects. The outcomes demonstrated that the modeling process was simplified [9].
At the same time, for the improvement of visual question answering systems, related
research in the medical field is significant. Because current VQA systems have difficulty grouping instances meaningfully, Chong et al. proposed using a graph-based Convolutional Neural Network (CNN) in a VQA system, and the results showed that it obtained 94.3% accuracy [10]. Since VQA systems regularly struggle to interpret medical images, Sharma et al.
built an attention-based multimodal deep learning system that maximizes learning with
the least amount of complexity. The results showed that the attention mechanism improved the accuracy of model predictions under both objective and subjective assessment [11]. Zhang et al. designed a peer-to-peer project to address the problem that medical
information on the Internet is restricted by quality and accessibility. They proposed
a question answering system based on a memory neural network and an attention mechanism
that saves time and cost [12]. To bridge the lexical gap between questions and answers, Nie et
al. proposed an attention-based deep learning model that adopts a bidirectional long
short-term memory encoder and decoder. They trained the model with benchmark datasets,
then evaluated it, showing that it enhances VQA system tasks [13]. Shi took advantage of deep learning's strength in capturing sentence information, combined it with a self-attention mechanism to obtain semantic vectors of relevant attributes, and inserted candidate attributes into triples of the same pattern through a parameter-sharing mechanism. The results showed that single-entity query time was less than 3 s, and
the join query was no longer than 5 s, with good horizontal scalability [14].
To sum up, CNNs and RNNs offer substantial room for development in the optimization of VQA systems and have, to varying degrees, played a promoting role. In the medical field, memory neural networks and attention mechanisms have been used successfully and have achieved high accuracy in VQA systems. Therefore, this research takes the neural network as its starting point and introduces an attention mechanism to optimize the VQA system and heighten its precision.
3. Application of the Neural Network VQA System in Preschool Language Education
3.1 Data Feature Extraction
The primary goal of VQA is to develop a neural network model through a specific training
process. This model must be able to comprehend natural language questions posed during
human conversations in addition to being able to comprehend images [15]. In the VQA model training process, feature information needs to be extracted from
the data first; then, feature fusion is performed, and finally, the answer is predicted.
The data information mainly comes from a public dataset that contains a large amount
of image content; behind each picture, there are many questions and standard answers.
Feature extraction includes problem feature and picture feature mining. Image feature
extraction is mainly performed by the CNN, and problem feature extraction is usually
performed by the RNN. The Visual Geometry Group (VGG) is part of the Department of
Engineering Science at Oxford University. VGGNet uses smaller convolution kernels and a much deeper network structure. This reduces the number of network parameters while providing more nonlinear transformations, thus enhancing feature learning ability [16]. Simultaneously, VGGNet initializes the subsequent, more complex model with the weights of a shallower network in order to speed up convergence and avoid overfitting, which boosts prediction accuracy. VGGNet-16, the variant most frequently employed for image processing, has a suitable network depth and more effective feature extraction capabilities
[17]. Therefore, this research uses VGGNet-16 to extract image features. The VGGNet-16
structure is shown in Fig. 1.
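The study's experiments were implemented with TensorFlow and OpenCV (Section 4); purely as an illustration of this feature extraction step, a minimal sketch using PyTorch and torchvision to obtain the 4096-dimensional fc7 feature of an image from a pretrained VGGNet-16 could look as follows (the preprocessing values and layer slicing follow torchvision conventions and are not details taken from the study).

import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import vgg16

# Pretrained VGGNet-16; drop the final dropout and 1000-way classifier so the
# network returns the 4096-d fc7 activation instead of class scores.
model = vgg16(weights="IMAGENET1K_V1").eval()
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-2])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path: str) -> torch.Tensor:
    """Return the 4096-d fc7 feature vector for a single image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        return model(img).squeeze(0)                                # (4096,)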
The main feature of VGGNet is its use of small 3${\times}$3 convolution kernels: model performance improves as the number of network layers increases, while the number of trainable parameters does not grow sharply. This is mainly because of the unique structure of the VGG network, in which most parameters are concentrated in the fully connected layers. In addition, the advantage of small convolution kernels is that a stack of three 3${\times}$3 kernels has the same receptive field as a single 7${\times}$7 kernel but with fewer parameters (27 versus 49 weights per channel pair) and more nonlinear operations, which gives VGG a stronger feature learning ability. The text feature extraction method is then selected. Unlike image features, text is usually continuous and has certain
arrangement rules. When the RNN determines the value of the hidden layer neural unit,
it needs both the current input and the value of the previous hidden layer to make
a decision [18]. Therefore, Recurrent Neural Networks can be used for question understanding in visual
question answering systems. The construction of the RNN is shown in Fig. 2.
The output of the RNN is related to the current input and is also determined by the inputs processed at all previous steps; this is the memory ability of the RNN. This memory comes from the hidden state, which is updated by the input at each moment, while the output at each moment depends only on the current hidden state. The calculation is shown in formula (1).
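In its standard form, this recurrent update can be written with the notation defined below as:
$S_{t}=\sigma (UX_{t}+WS_{t-1}+b),\quad O_{t}=YS_{t}$ (1)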
In formula (1), $U$, $Y$, and $W$ are the weights of the linear transformations; $X_{t}$, $O_{t}$, and $S_{t}$ represent the input, output, and hidden state, respectively, at moment $t$; $b$ represents the bias term; and $\sigma $ is a nonlinear transformation function. Inside the RNN, the weights $U$, $Y$, and $W$ are shared across the time series, so for long sequences the parameters are difficult to update effectively through the backpropagation algorithm [19]. Therefore, a long short-term memory (LSTM) network, which offers better performance when training on long sequence data, is adopted; its structure is shown in Fig. 3.
The LSTM extends the RNN with three gates acting as controllers: the forget gate, the output gate, and the input gate. The calculation of the forget gate is demonstrated
in formula (2).
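In the standard LSTM formulation, the forget gate is computed as (here $W_{f}$ and $b_{f}$ denote the forget-gate weight and bias):
$f_{t}=\sigma (W_{f}\cdot [h_{t-1},x_{t}]+b_{f})$ (2)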
In formula (2), $h_{t-1}$ represents the output at the previous moment and $x_{t}$ is the current
input. Through the output of the last moment and the current input, some data can
be selectively removed to achieve forgetting. After the forget gate, the method of
updating the current state must be selected, as shown in formula (3).
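In the standard LSTM, this step combines an input gate $i_{t}$ with a candidate value vector ($W_{i}$, $b_{i}$, $W_{C}$, and $b_{C}$ denote the corresponding weights and biases):
$i_{t}=\sigma (W_{i}\cdot [h_{t-1},x_{t}]+b_{i}),\quad \overline{C_{t}}=\tanh (W_{C}\cdot [h_{t-1},x_{t}]+b_{C})$ (3)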
In formula (3), $\overline{C_{t}}$ represents the candidate value vector. The final output is shown
in formula (4).
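With $W_{o}$ and $b_{o}$ the output-gate parameters and $\odot $ the element-wise product, the standard form of this step is:
$C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot \overline{C_{t}},\quad O_{t}=\sigma (W_{o}\cdot [h_{t-1},x_{t}]+b_{o}),\quad h_{t}=O_{t}\odot \tanh (C_{t})$ (4)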
In formula (4), $C_{t}$ represents the current state, $O_{t}$ is the amount of information in the
current output, and $h_{t}$ is the final output obtained. A simplified variant of the LSTM, the Gated Recurrent Unit (GRU), keeps only two gates: an update gate and a reset gate. The reset gate and the update gate of the GRU work together to update the state over time, and the state update formula is shown in formula (5).
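With $z_{t}$ denoting the update gate, the standard GRU state update is:
$h_{t}=(1-z_{t})\odot h_{t-1}+z_{t}\odot \overline{h_{t}}$ (5)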
In formula (5), $h_{t}$ represents the new current state, $h_{t-1}$ is the state before the new sequence information is incorporated, and $\overline{h_{t}}$ is the candidate state. The expressions for the gating functions are shown in formula (6).
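In their usual form, the update gate $z_{t}$ and reset gate $r_{t}$ are computed as ($W_{z}$ and $W_{r}$ are the gate weights):
$z_{t}=\sigma (W_{z}\cdot [h_{t-1},x_{t}]),\quad r_{t}=\sigma (W_{r}\cdot [h_{t-1},x_{t}])$ (6)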
In formula (6), $x_{t}$ is the sequence vector of moment $t$, and the calculation of the candidate
state is shown in formula (7).
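The standard form of the candidate state is (with $W$ the candidate weight):
$\overline{h_{t}}=\tanh (W\cdot [r_{t}\odot h_{t-1},x_{t}])$ (7)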
In formula (7), $r_{t}$ represents the reset gate and $\overline{h_{t}}$ is the candidate state.
Fig. 1. The network structure of VGGNet-16.
Fig. 2. The structure of the RNN.
Fig. 3. The long short-term memory network structure.
3.2 Visual Question Answering from HJA
VQA involves information processing in both image and text modalities. The input question
must be encoded and extracted during question feature extraction, and it must be performed
word by word [20]. A self-attention mechanism is included to improve the accuracy of the final answer
prediction because, unlike machine translation, visual question answering only needs to attend to a small portion of the input question in order to produce the desired result.
This research proposes a Hierarchical Joint Attention model that first obtains the
required image feature vector through a CNN, and then utilizes the hierarchical multi-attention
network to get the problem features. The mechanism is composed of a bidirectional
GRU, which can realize multiple encoding of the question and can thus add attention
weights [21]. Following acquisition of the feature vector, it is passed to the joint attention
layer where the attention weights of the image and the question are updated. The two
features are linked before being sent to the response prediction layer to complete
the prediction and output. The resulting HJA model is shown in Fig. 4.
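Purely as an illustration of the question branch just described (the study's own implementation used TensorFlow, and the vocabulary size and layer dimensions below are placeholders rather than the study's settings), a minimal PyTorch sketch of a bidirectional-GRU question encoder with attention-weighted pooling is:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionEncoder(nn.Module):
    """Sketch: word embeddings -> bidirectional GRU -> attention pooling."""

    def __init__(self, vocab_size: int = 10000, embed_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)        # plays the role of W_w, b_w below
        self.context = nn.Linear(2 * hidden, 1, bias=False)  # word-level context vector u_w

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer word indices
        h, _ = self.bigru(self.embed(tokens))            # word annotations: (batch, seq, 2*hidden)
        u = torch.tanh(self.proj(h))                     # hidden representation of each word
        a = F.softmax(self.context(u), dim=1)            # attention weights over the words
        return (a * h).sum(dim=1)                        # attention-weighted question vector

# Example: encode a batch of two 8-word questions into 1024-d vectors.
encoder = QuestionEncoder()
print(encoder(torch.randint(0, 10000, (2, 8))).shape)    # torch.Size([2, 1024])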
To acquire the question vector in visual question answering, an attention method is
proposed that extracts the keywords that best capture the entire question, and then
aggregates these informative word representations. The attention mechanism uses a multi-layer perceptron to obtain a hidden representation of each word annotation and thereby calculate the influence weight of each word in the question. It then measures the correlation between this hidden representation and a word-level context vector, and uses the Softmax function to obtain normalized importance weights [22]. The sentence vector is then computed as the weighted sum of the word annotations.
During training, the word-level context vector is initialized and then learned jointly with the rest of the model. The word-level attention representation is shown in formula (8).
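A typical formulation of this word-level attention, with $u_{w}$ denoting the word-level context vector, is:
$u_{it}=\tanh (W_{w}h_{it}+b_{w}),\quad a_{it}=\frac{\exp (u_{it}^{T}u_{w})}{\sum _{k}\exp (u_{ik}^{T}u_{w})},\quad q_{N}^{p}=\sum _{t}a_{it}h_{it}$ (8)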
In formula (8), $W_{w}$ and $b_{w}$ are the parameters of the fully connected layer, $u_{it}$ and $h_{it}$ are the state representations in the hidden layer, $a_{it}$ is the normalized weight from the Softmax function, $q_{N}^{w}$ is the question vector, and $q_{N}^{p}$ represents the phrase vector. After obtaining a series of phrase vectors, the phrases
are encoded by a bidirectional GRU to obtain feature vectors [23]. The forward and backward vectors are then concatenated, thereby obtaining a context-dependent vector. After concatenating the forward and reverse hidden states, sentence annotations are obtained, as shown in formula (9).
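Written as a concatenation of the two directional states, this is:
$h_{i}=[\overset{\rightarrow }{h_{i}},\overset{\leftarrow }{h_{i}}]$ (9)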
In formula (9), $\overset{\rightarrow }{h_{i}}$ represents the forward state, $\overset{\leftarrow
}{h_{i}}$ is the reverse state, while $h_{i}$ is the sentence annotation. Then, the
attention mechanism is also utilized at the phrase layer, and the importance of the
sentence is measured by its proportion, as shown in formula (10).
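By analogy with the word level, with $W_{s}$, $b_{s}$, and a phrase-level context vector $u_{s}$ (names used here only for illustration), the phrase-level attention can be written as:
$u_{i}=\tanh (W_{s}h_{i}+b_{s}),\quad a_{i}=\frac{\exp (u_{i}^{T}u_{s})}{\sum _{j}\exp (u_{j}^{T}u_{s})}$ (10)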
In formula (10), $a_{i}$ represents the proportion of the phrase in the question. Finally, the Softmax
layer is used to classify the problem features, and the obtained probability value
is the final feature, as shown in formula (11).
In formula (11), $p$ represents the normalized output. In the question-level mechanism, to ensure
mapping between pictures and questions, a joint attention mechanism is introduced.
The joint attention mechanism is able to simultaneously generate attention for questions
and images, and is more capable of processing direct correlations in multimodal information
than single attention [24]. Based on the original image and question features, the joint attention mechanism constructs an affinity matrix from the original features, takes the maximum over the feature weights, and then attends to the question and image features at all positions. The joint attention model is shown in Fig. 5.
When associating a question with an image, one task is to locate the image feature in image space, and another is to obtain the similarity of the question feature to the corresponding question vector. The affinity matrix is obtained
from the representation of the problem and the image feature map, as shown in formula
(12).
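In the standard parallel co-attention formulation, the affinity matrix $C$ is:
$C=\tanh (Q^{T}W_{b}V)$ (12)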
In formula (12), $W_{b}$ is the weight parameter, $Q$ is the question representation, $V$ represents the feature map of the image, and the superscript $T$ denotes transposition. The affinity matrix is then maximized over positions, as shown in formula (13).
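Taking the maximum of the affinity matrix over the positions of the other modality gives:
$a^{v}[n]=\max _{t}(C[t,n]),\quad a^{q}[t]=\max _{n}(C[t,n])$ (13)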
In formula (13), $a^{v}[n]$ is the maximum of the affinity matrix for image position $n$, and $a^{q}[t]$ is the maximum for question position $t$. The affinity matrix is then used as a feature parameter, and the attention maps of the question and image are predicted to increase the precision of the model. A weighted sum over the attention weights of the question and the image is then applied to obtain the final features, as shown in formula (14).
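Treating $a^{v}$ and $a^{q}$ as normalized attention weights over the image regions $v_{n}$ and question positions $q_{t}$, the attended features are:
$\overset{\wedge }{v}=\sum _{n}a^{v}[n]v_{n},\quad \overset{\wedge }{q}=\sum _{t}a^{q}[t]q_{t}$ (14)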
In formula (14), $\overset{\wedge }{v}$ represents the image feature, and $\overset{\wedge }{q}$
represents the problem feature. After the features are processed with joint attention,
the accuracy of answer prediction is improved because it can better fuse the two feature
vectors [25]. The prediction of the answer is a multi-classification task; that is, a series of
answers will be obtained, classified, and output through the multi-layer perceptron.
The candidate answers are the five with the highest probabilities in the output, and among them, the candidate with the highest probability is taken as the predicted answer, as shown in formula (15).
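One standard form of such a prediction layer, concatenating the two attended features, is:
$p=\mathrm{softmax}(W\tanh (W_{w}[\overset{\wedge }{q},\overset{\wedge }{v}]))$ (15)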
In formula (15), $W_{w}$ and $W$ are weight parameters, both located in the fully connected layer, and $p$ represents the probability distribution over answers. The resulting VQA system is then applied to children’s preschool language education. Language education questions are first retrieved from textbook images, and a VQA-model-based interactive intelligent tutoring system is then offered to the children. The
visual question answering system will first analyze an image, then deduce the answer
to the language question, and finally, automatically respond to a preschooler’s language
inquiry.
Fig. 4. The Hierarchical Joint Attention (HJA) model.
Fig. 5. The joint attention model.
4. Application Effect Analysis
First, the effect of the HJA in the study was tested and compared with three models:
the Deep and Cross Network (DCN), Bottom-Up and Top-Down Attention (BUTD), and MuRel.
Using the same dataset, the training results of the four models with an expanded number
of iterations are displayed in Fig. 6.
In Fig. 6, among the four models, the proposed HJA model reached the target accuracy first
after 125 iterations, whereas BUTD approached the target accuracy after nearly 300
iterations. The number of iterations needed for the DCN and MuRel models to approach
the target accuracy was similar: 220 and 240, respectively. This shows that the HJA
achieves the best convergence performance. Then, the four models were tested for accuracy
in answering visual questions. Before the experiment started, data preprocessing was
required to obtain problem annotations of the verification data and test data, and
the JSON format file was downloaded through Python, thereby converting the images
and questions into a new file format. The experiments were carried out on a Linux
system and using the Python language. The machine learning framework was TensorFlow,
and image preprocessing was performed through OpenCV. The specific parameters of the
experiment are shown in Table 1.
The four models used the VQAv1 dataset in the experiments. Each picture in the VQA
dataset corresponds to more than three questions, and each question corresponds to
10 correct answers and three possibly correct answers. The number of images in this
dataset is 204,721, and the number of questions is 614,163. There are various types
of questions: yes/no, open-ended questions answered with only one word or phrase,
and multiple-choice questions with 18 alternative answers. Ground-truth answers are provided officially; a predicted answer is judged correct if at least three annotators gave the same answer. The results obtained are shown in Fig. 7.
Fig. 7 shows the VQA accuracies of the four models on the VQAv1 dataset. As the number of iterations increased, the accuracy of each model first rose and then stabilized. Among them, the accuracy of BUTD was stable at 78%, and
when DCN reached 30 iterations, the precision began to stabilize and fluctuated around
82%. After the MuRel model reached 20 iterations, the accuracy rate stayed at about
83%, whereas the precision rate increased rapidly with HJA until it reached 18 iterations,
after which it began to stabilize at 88%, which is better than the other three models.
The VQAv2 dataset was then utilized to compare the results of the four models. The
kinds of questions in this dataset are numerical, yes/no questions, and other questions,
accordingly denoted Num, Y/N, and Other. Overall performance is denoted All. The results
obtained are shown in Fig. 8.
Figs. 8(a) and (d) are the accuracy results for Num and All question types, respectively, and
Figs. 8(b) and (c) are the accuracy rates of the Y/N and Other question types, respectively.
In Fig. 8(a), the accuracy of the four models for Num questions gradually decreased with additional
questions. Among the four models, the lowest accuracy rates for HJA, DCN, BUTD, and
MuRel were 52%, 50%, 48%, and 49%, respectively, and the highest rates were 58%, 57%,
56%, and 56.5% respectively. The HJA accuracy rates were higher than the other three
models. In Fig. 8(b), the accuracy rates of HJA, BUTD, DCN, and MuRel for the Y/N problem type were the
lowest at 88%, 82%, 84%, and 85%, respectively, and the highest accuracy rate for
HJA was 92%. In Fig. 8(c), the lowest accuracy from HJA, BUTD, MuRel, and DCN models for the other problem
type were 56%, 52%, 53%, and 55%, respectively. Fig. 8(d) shows that BUTD, DCN, and MuRel had the lowest overall accuracies, at 63%, 65%, and 67%, respectively (highest at 68%, 69%, and 71%, respectively), whereas HJA reached an overall accuracy of 77% (lowest at 71%) and outperformed the other three models. Therefore,
the HJA model showed high VQA accuracy with all four question types and superior overall performance. Finally, the visual question answering system established
by the four models was applied to the language education of preschool children. This
research selected 10 well-known kindergartens in Shanghai, and integrated the language
education data of these 10 kindergartens into a new preschool language database. There
were 128,654 questions in the database, separated into three question types: numerical,
yes/no, and Other. All is again used to denote the overall accuracy rate. Results
from the four models are shown in Fig. 9.
Fig. 9 shows the accuracy outcomes of the VQA system established by the four models in the
selected preschool language learning database. Among them, Fig. 9(a) shows the accuracy with the different problem types. We can see that the accuracy
of DCN and BUTD was below 85%, and the accuracy of MuRel was above 85%. The highest
was 87%, while HJA’s accuracy rates were all above 90%, and the Y/N question type
had the highest accuracy rate at 94%. Fig. 9(b) shows the accuracy results of the four models when the number of questions gradually
increased. It is clear that increasing the number of questions reduced precision from
the four models. The MuRel, BUTD, and DCN accuracy rates decreased to around 85%,
although HJA remained at or above 90%.
Table 1. Specific parameters of the experiment.
Name | Numerical value
Epochs | 105
Run size | 512
Number of GRU layers | 2
Word embedding size | 512
Learning rate | 0.001
Number of samples selected each time | 200
Dropout after embedding | 0.5
Word dropout after embedding | 0.5
Fc7 image feature dimension | 4096
Dataset version | 1
Fig. 6. Training results of the four models on the same dataset.
Fig. 7. Accuracies with the VQAv1 dataset.
Fig. 8. VQA accuracy results of the four models on the VQAv2 dataset.
Fig. 9. VQA results of the four models when applied to the actual Preschool Language Education Database.
5. Conclusion
Language instruction for young children is a primary subject in contemporary education and is crucial for their overall development. The effectiveness of language learning for preschoolers is limited by the low accuracy of the VQA models in existing interactive intelligent tutoring systems. This research used a neural network to extract information features, proposed the Hierarchical Joint Attention model, and applied it to a VQA system. The outcomes indicate that the HJA model approached the target accuracy at 125 iterations,
and had better convergence performance. With the VQAv1 dataset, the accuracy of the
MuRel model remained at around 83% after 20 iterations. After stabilizing, BUTD's
accuracy stayed at 78%, DCN's accuracy stayed at 82%, and the HJA model's accuracy
stabilized at 88% after 18 iterations. With the VQAv2 dataset, HJA's lowest accuracy rates for Num, Y/N, and Other questions were 52%, 88%, and 56%, respectively, but all were higher than the corresponding rates of the other three models.
The accuracy rate of the HJA model's VQA system in the example verification was above
90%, with the highest at 94%. This high accuracy rate shows that the HJA VQA system
performs well in preschool language teaching and can offer useful support for that
education. However, in the process of instance validation, the selected data volume
was still small, and there was a lack of data screening, which may lead to low accuracy
and credibility in the obtained results. Therefore, it is essential to continue expanding
high-quality and larger datasets to support the model's ability to recognize new patterns
and to raise prediction accuracy and generalizability.
6. Funding
The research was supported by a special project of the teaching reform and development
of basic education in Shaanxi Province - The Research on Curriculum Construction and
Teaching Practice of Infant Life Education Based on Picture Book Reading under the
New Outline (No. JYTYB2022-08).
REFERENCES
A. Partika, A. D. Johnson, D. A. Phillips, et al. “Dual language supports for dual
language learners? Exploring preschool classroom instructional supports for DLLs’
early learning outcomes”. Early Childhood Research Quarterly, vol. 56, pp. 124-138,
2021.
S. M. Satagalieva. “The trends for modern libraries and building the strategy of library
and information education in the Republic of Kazakhstan”. Scientific and Technical
Libraries, vol. 3, pp. 58-70, 2021.
M. Zhang, M. Zhang, G. Tian, et al. “A Home Service-Oriented Question Answering System
with High Accuracy and Stability”. IEEE Access, pp. 1-3, 2019.
C. Yin, J. Tang, Z. Xu, et al. “Memory Augmented Deep Recurrent Neural Network for
Video Question Answering.” IEEE Transactions on Neural Networks and Learning Systems,
vol. 99, pp. 1-9, 2019.
Q. Cao, X. Liang, B. Li, et al. “Interpretable Visual Question Answering by Reasoning
on Dependency Trees”. IEEE transactions on pattern analysis and machine intelligence,
vol. 43(3), pp. 887-901, 2021.
J. Hong, J. Fu, Y. Uh, et al. “Exploiting hierarchical visual features for visual
question answering”. Neurocomputing, vol. 351, pp. 187-195, 2019.
S. Garg, R. Srivastava. “Object sequences: encoding categorical and spatial information
for a yes/no visual question answering task”. Computer Vision, IET, vol. 12(8), pp.
1141-1150, 2018.
A. A. Yusuf, F. Chong, M. Xianling, “Evaluation of graph convolutional networks performance
for visual question answering on reasoning datasets”. Multimedia Tools and Applications,
pp. 1-10, 2022.
T. M. Le, V. Le, S. Venkatesh, et al. “Hierarchical Conditional Relation Networks
for Multimodal Video Question Answering”. International Journal of Computer Vision,
vol. 8, pp. 1-24, 2021.
F. Chong, A. A. Yusuf, M. Xianling. “An analysis of graph convolutional networks and
recent datasets for visual question answering”. Artificial Intelligence Review, pp.
1-24, 2022.
D. Sharma, S. Purushotham, C. K. Reddy. “MedFuseNet: An attention-based multimodal
deep learning model for visual question answering in the medical domain”. Scientific
Reports, vol. 11(1), pp. 1-18, 2021.
L. Zhang, X. Yang, S. Li, et al. “Answering medical questions in Chinese using automatically
mined knowledge and deep neural networks: an end-to-end solution”. BMC Bioinformatics,
vol. 23(1), pp. 1-32, 2022.
Y. P. Nie, Y. Han, J. M. Huang, et al. “Attention-based encoder-decoder model for
answer selection in question answering”. Frontiers of Information Technology & Electronic
Engineering, vol. 18(4), pp. 535-544, 2019.
M. Shi, “Knowledge Graph Question and Answer System for Mechanical Intelligent Manufacturing
Based on Deep Learning”. Mathematical Problems in Engineering, vol. 2, pp. 1-8, 2021.
A. Al-Sadi, M. Al-Ayyoub, Y. Jararweh, et al. “Visual Question Answering in the Medical
Domain Based on Deep Learning Approaches: A Comprehensive Study”. Pattern Recognition
Letters, vol. 150(2), pp. 1-4, 2021.
M. Jangra, S. K. Dhull, K. K. Singh. “ECG arrhythmia classification using modified
visual geometry group network (mVGGNet)”. Journal of Intelligent and Fuzzy Systems,
vol. 38(5), pp. 1-15, 2020.
Y. Chen, Y. Mai, J. Xiao, et al. “Improving the Antinoise Ability of DNNs via a Bio-Inspired
Noise Adaptive Activation Function Rand Softplus”. Neural Computation, vol. 31(6),
pp. 1215-1233, 2019.
Y. Zhang, B. Mu, H. Zheng. “Link Between and Comparison and Combination of Zhang Neural
Network and Quasi-Newton BFGS Method for Time-Varying Quadratic Minimization”. IEEE
Transactions on Cybernetics, vol. 43(2), pp. 490-503, 2018.
Y. Fu, Z. Liang, S. You. “Bidirectional 3D Quasi-Recurrent Neural Network for Hyperspectral
Image Super-Resolution”. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, vol. 99, pp. 1-7, 2021.
S. Park, J. Jang, S. Kim, et al. “Memory-Augmented Neural Networks on FPGA for Real-Time
and Energy-Efficient Question Answering”. IEEE Transactions on Very Large-Scale Integration
(VLSI) Systems, vol. 99, pp. 1-14, 2020.
P. Y. Wang, C. T. Chen, J. W. Su, et al. “Deep Learning Model for House Price Prediction
Using Heterogeneous Data Analysis Along with Joint Self-Attention Mechanism”. IEEE
Access, vol. 99, pp. 1-9, 2021.
X. Li, J. Song, L. Gao, et al. “Beyond RNNs: Positional Self-Attention with Co-Attention
for Video Question Answering”. Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 33, pp. 8658-8665, 2019.
Y. Cheng, L. Yao, G. Xiang, et al. “Text Sentiment Orientation Analysis Based on Multi-Channel
CNN and Bidirectional GRU With Attention Mechanism”. IEEE Access, vol. 8, pp. 134964-134975,
2020.
S. Pendurkar, S. Kolpekwar, S. Dhoot, et al. “Attention Based Multi-Modal Fusion Architecture
for Open-Ended Video Question Answering Systems”. Procedia Computer Science, vol.
171, pp. 446-455, 2020.
C. Zhao, S. Wang, D. Li, et al. “Cross-domain sentiment classification via parameter
transferring and attention sharing mechanism”. Information Sciences, vol. 578, pp.
281-296, 2021.
Author
Ying Cheng obtained her master’s degree in Education (2008) from Shaanxi Normal University. Presently, she is working as an associate professor in the Teachers’ College of Xianyang Vocational Technical College. She has published articles in more than 10 reputable national peer-reviewed journals. Her areas of interest include language education for preschool children, children’s literature, and early language education of young children.