Cheng Ying
(School of Teachers College, Xianyang Vocational Technical College, Xianyang, 712000,
China; ying_cheng712@163.com)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Neural network, Visual question answering, Preschool language education, Attention mechanism
1. Introduction
Preschool is a critical period for language improvement and language cognition in
children. During this period, systematic and standardized language education plays
a vital role, providing a solid foundation for their future learning and cognition
[1]. Throughout the preschool years, language runs through every stage of childhood development, including all areas of physical and mental growth. Language activities are one of the five main teaching areas planned by early childhood education institutions; they serve as the cornerstone for other activities, appear throughout the other areas, and play a crucial role in children's development. Language is a comprehensive ability, and its development in young children is inseparable from their emotions, thinking, social participation, communication skills, knowledge, and experience.
With the gradual popularization of information technology in education, the application
of a visual question answering (VQA) system to children’s preschool language education
has become a newly achievable path [2]. Visual question answering is an emerging research area that combines computer vision and natural language processing. Similar to image description, a VQA system can interpret an input image and generate a simple textual description of it. However, the VQA task goes beyond feature extraction and image understanding: its main purpose is to help children gain a deeper understanding of an image based on a textual question, combining image features with text analysis to predict the corresponding answer. Developing a VQA model to build interactive, intelligent tutoring systems for young children, and applying it to textbook images for problem solving, therefore has broad application prospects. However, current VQA models achieve limited overall accuracy, and their predictions often fail to meet practical requirements.
With the progressive maturation of acquisition technology and the growing significance of neural networks in fostering the development of vision and language [3], the use of neural networks for visual question answering has emerged as a key research direction in this field. However, current neural-network-based VQA models are still limited by the number of questions they can handle, resulting in low overall accuracy. Therefore, based on a neural network for extracting information features, this research introduces the Hierarchical Joint Attention (HJA) model to optimize the whole process and raise the precision of VQA systems.
2. Related Work
The success of a visual question answering system as a whole will determine how well
it can be used to teach language to preschoolers. In recent years, researchers at home and abroad have extensively investigated the use of neural networks in visual question answering and have made progress. To increase the accuracy of
answers generated by VQA, Yin et al. proposed a memory-enhanced Recurrent Neural Network
(RNN) model. It utilizes a new video coding method, and enhances memory through an
emerging differentiable neural computer. The outcomes showed high precision [4]. Cao and colleagues designed a parse-tree-guided reasoning network for the collaborative reasoning problem of image understanding in interpretable VQA systems. By deriving image cues, experiments on relational datasets showed how the interpretability of the inference system can be highlighted [5]. To reduce low-level semantic loss in top-level visual representations of convolutional
neural networks, Hong et al. developed a hierarchical feature network that derives
the semantics of visual-question answers through intermediate convolutional layers.
In it, focused attention, hierarchical features, and multimodal pooling are combined,
and the results showed it has superior performance [6]. To meet the requirement that a VQA system grasp both visual content and language information, Garg et al. encoded the visual information in an image into a sequence and used a neural network to process it appropriately. The outcomes
demonstrated better image quality and processing effectiveness [7]. Yusuf et al. applied graph convolutional networks to different subtasks of visual
question answering datasets with different results, proposed a VQA framework based
on fine-tuned word representations, and evaluated the framework’s performance through
various performance measures. The results showed that it improves vision-to-language
processing efficiency [8]. For the difficult challenge of modeling visual question answering, Le et al. created a general-purpose, reusable neural unit that encodes a batch of input tensor objects. The outcomes demonstrated that the modeling process was simplified [9].
At the same time, for the improvement of visual question answering systems, related
research in the medical field is significant. Because current VQA systems have difficulty grouping instances meaningfully, Chong et al. proposed using a graph-based Convolutional Neural Network (CNN) in a VQA system, and the results showed that it obtained 94.3% accuracy [10]. Since VQA systems regularly struggle to interpret medical images, Sharma et al.
built an attention-based multimodal deep learning system that maximizes learning with
the least amount of complexity. The results showed that the attention mechanism improved the accuracy of model predictions under both objective and subjective assessment [11]. Zhang et al. designed a peer-to-peer project to address the problem that medical
information on the Internet is restricted by quality and accessibility. They proposed
a question answering system based on a memory neural network and an attention mechanism
that saves time and cost [12]. To bridge the lexical gap between questions and answers, Nie et
al. proposed an attention-based deep learning model that adopts a bidirectional long
short-term memory encoder and decoder. They trained the model with benchmark datasets,
then evaluated it, showing that it enhances VQA system tasks [13]. Shi took advantage of deep learning's strength in capturing sentence information, combined it with a self-attention mechanism to obtain semantic vectors of relevant attributes, and inserted candidate attributes into triples of the same pattern through a parameter-sharing mechanism. The results showed that single-entity query time was less than 3 s, and
the join query was no longer than 5 s, with good horizontal scalability [14].
To sum up, CNNs and RNNs offer substantial room for development in the optimization of VQA systems and have, to varying degrees, played a promoting role. In the medical field, memory neural networks and attention mechanisms have been used successfully and have achieved high accuracy in VQA systems. Therefore, this research takes the neural network as its starting point and introduces an attention mechanism to optimize the VQA system and heighten its precision.
3. Application of the Neural Network VQA System in Preschool Language Education
3.1 Data Feature Extraction
The primary goal of VQA is to develop a neural network model through a specific training
process. This model must be able to comprehend natural language questions posed during
human conversations in addition to being able to comprehend images [15]. In the VQA model training process, feature information needs to be extracted from
the data first; then, feature fusion is performed, and finally, the answer is predicted.
The data information mainly comes from a public dataset that contains a large amount
of image content; behind each picture, there are many questions and standard answers.
Feature extraction includes problem feature and picture feature mining. Image feature
extraction is mainly performed by the CNN, and problem feature extraction is usually
performed by the RNN. The Visual Geometry Group (VGG) is part of the Department of
Engineering Science at Oxford University. VGGNet uses smaller convolution kernels and a much deeper network structure. This reduces the number of network parameters while providing more nonlinear transformations, thus enhancing feature learning ability [16]. Simultaneously, VGGNet initializes the subsequent, more complex model with the weights of a shallower network in order to speed up convergence and avoid overfitting, which boosts prediction accuracy. VGGNet-16, the variant most frequently employed for image processing, has a suitable network depth and more effective feature extraction capabilities
[17]. Therefore, this research uses VGGNet-16 to extract image features. The VGGNet-16
structure is shown in Fig. 1.
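The study's experiments were implemented with TensorFlow and OpenCV (Section 4); purely as an illustration of this feature extraction step, a minimal sketch using PyTorch and torchvision to obtain the 4096-dimensional fc7 feature of an image from a pretrained VGGNet-16 could look as follows (the preprocessing values and layer slicing follow torchvision conventions and are not details taken from the study).

import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import vgg16

# Pretrained VGGNet-16; drop the final dropout and 1000-way classifier so the
# network returns the 4096-d fc7 activation instead of class scores.
model = vgg16(weights="IMAGENET1K_V1").eval()
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-2])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path: str) -> torch.Tensor:
    """Return the 4096-d fc7 feature vector for a single image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        return model(img).squeeze(0)                                # (4096,)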
The main feature of VGGNet is its use of small 3${\times}$3 convolution kernels: model performance improves as the number of network layers increases, while the number of trainable parameters does not grow sharply. This is mainly because of the unique structure of the VGG network, in which most parameters are concentrated in the fully connected layers. In addition, the advantage of small convolution kernels is that a stack of three 3${\times}$3 kernels has the same receptive field as a single 7${\times}$7 kernel but with fewer parameters (27 versus 49 weights per channel pair) and more nonlinear operations, which gives VGG a stronger feature learning ability. The text feature extraction method is then selected. Unlike image features, text is usually continuous and has certain
arrangement rules. When the RNN determines the value of the hidden layer neural unit,
it needs both the current input and the value of the previous hidden layer to make
a decision [18]. Therefore, Recurrent Neural Networks can be used for question understanding in visual
question answering systems. The construction of the RNN is shown in Fig. 2.
The output of the RNN is related to the current input and is also determined by the inputs processed at all previous steps; this is the memory ability of the RNN. This memory comes from the hidden state, which is updated by the input at each moment, while the output at each moment depends only on the current hidden state. The calculation is shown in formula (1).
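In its standard form, this recurrent update can be written with the notation defined below as:
$S_{t}=\sigma (UX_{t}+WS_{t-1}+b),\quad O_{t}=YS_{t}$ (1)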
In formula (1), $U$, $Y$, and $W$ are the weights of the linear transformations; $X_{t}$, $O_{t}$, and $S_{t}$ represent the input, output, and hidden state, respectively, at moment $t$; $b$ represents the bias term; and $\sigma $ is a nonlinear transformation function. Inside the RNN, the weights $U$, $Y$, and $W$ are shared across the time series, so for long sequences the parameters are difficult to update effectively through the backpropagation algorithm [19]. Therefore, a long short-term memory (LSTM) network, which offers better performance when training on long sequence data, is adopted; its structure is shown in Fig. 3.
The LSTM extends the RNN with three gates acting as controllers: the forget gate, the output gate, and the input gate. The calculation of the forget gate is demonstrated
in formula (2).
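In the standard LSTM formulation, the forget gate is computed as (here $W_{f}$ and $b_{f}$ denote the forget-gate weight and bias):
$f_{t}=\sigma (W_{f}\cdot [h_{t-1},x_{t}]+b_{f})$ (2)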
In formula (2), $h_{t-1}$ represents the output at the previous moment and $x_{t}$ is the current
input. Through the output of the last moment and the current input, some data can
be selectively removed to achieve forgetting. After the forget gate, the method of
updating the current state must be selected, as shown in formula (3).
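In the standard LSTM, this step combines an input gate $i_{t}$ with a candidate value vector ($W_{i}$, $b_{i}$, $W_{C}$, and $b_{C}$ denote the corresponding weights and biases):
$i_{t}=\sigma (W_{i}\cdot [h_{t-1},x_{t}]+b_{i}),\quad \overline{C_{t}}=\tanh (W_{C}\cdot [h_{t-1},x_{t}]+b_{C})$ (3)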
In formula (3), $\overline{C_{t}}$ represents the candidate value vector. The final output is shown
in formula (4).
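With $W_{o}$ and $b_{o}$ the output-gate parameters and $\odot $ the element-wise product, the standard form of this step is:
$C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot \overline{C_{t}},\quad O_{t}=\sigma (W_{o}\cdot [h_{t-1},x_{t}]+b_{o}),\quad h_{t}=O_{t}\odot \tanh (C_{t})$ (4)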
In formula (4), $C_{t}$ represents the current state, $O_{t}$ is the amount of information in the
current output, and $h_{t}$ is the final output obtained. A simplified variant of the LSTM, the Gated Recurrent Unit (GRU), keeps only two gates: an update gate and a reset gate. The reset gate and the update gate of the GRU work together to update the state over time, and the state update formula is shown in formula (5).
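With $z_{t}$ denoting the update gate, the standard GRU state update is:
$h_{t}=(1-z_{t})\odot h_{t-1}+z_{t}\odot \overline{h_{t}}$ (5)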
In formula (5), $h_{t}$ represents the new current state, $h_{t-1}$ is the state before the new sequence information is incorporated, and $\overline{h_{t}}$ is the candidate state. The expressions for the gating functions are shown in formula (6).
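In their usual form, the update gate $z_{t}$ and reset gate $r_{t}$ are computed as ($W_{z}$ and $W_{r}$ are the gate weights):
$z_{t}=\sigma (W_{z}\cdot [h_{t-1},x_{t}]),\quad r_{t}=\sigma (W_{r}\cdot [h_{t-1},x_{t}])$ (6)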
In formula (6), $x_{t}$ is the sequence vector of moment $t$, and the calculation of the candidate
state is shown in formula (7).
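The standard form of the candidate state is (with $W$ the candidate weight):
$\overline{h_{t}}=\tanh (W\cdot [r_{t}\odot h_{t-1},x_{t}])$ (7)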
In formula (7), $r_{t}$ represents the reset gate and $\overline{h_{t}}$ is the candidate state.
Fig. 1. The network structure of VGGNet-16.
Fig. 2. The structure of the RNN.
Fig. 3. The long short-term memory network structure.
3.2 Visual Question Answering from HJA
VQA involves information processing in both image and text modalities. The input question
must be encoded and extracted during question feature extraction, and it must be performed
word by word [20]. A self-attention mechanism is included to improve the accuracy of the final answer
prediction because, unlike machine translation, visual question answering only needs to attend to a small portion of the input question in order to produce the desired result.
This research proposes a Hierarchical Joint Attention model that first obtains the
required image feature vector through a CNN, and then utilizes the hierarchical multi-attention
network to get the problem features. The mechanism is composed of a bidirectional
GRU, which can realize multiple encoding of the question and can thus add attention
weights [21]. Following acquisition of the feature vector, it is passed to the joint attention
layer where the attention weights of the image and the question are updated. The two
features are linked before being sent to the response prediction layer to complete
the prediction and output. The resulting HJA model is shown in Fig. 4.
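Purely as an illustration of the question branch just described (the study's own implementation used TensorFlow, and the vocabulary size and layer dimensions below are placeholders rather than the study's settings), a minimal PyTorch sketch of a bidirectional-GRU question encoder with attention-weighted pooling is:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionEncoder(nn.Module):
    """Sketch: word embeddings -> bidirectional GRU -> attention pooling."""

    def __init__(self, vocab_size: int = 10000, embed_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)        # plays the role of W_w, b_w below
        self.context = nn.Linear(2 * hidden, 1, bias=False)  # word-level context vector u_w

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer word indices
        h, _ = self.bigru(self.embed(tokens))            # word annotations: (batch, seq, 2*hidden)
        u = torch.tanh(self.proj(h))                     # hidden representation of each word
        a = F.softmax(self.context(u), dim=1)            # attention weights over the words
        return (a * h).sum(dim=1)                        # attention-weighted question vector

# Example: encode a batch of two 8-word questions into 1024-d vectors.
encoder = QuestionEncoder()
print(encoder(torch.randint(0, 10000, (2, 8))).shape)    # torch.Size([2, 1024])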
To acquire the question vector in visual question answering, an attention method is
proposed that extracts the keywords that best capture the entire question, and then
aggregates these informative word representations. The attention mechanism uses a multi-layer perceptron to obtain a hidden representation of each word annotation and thereby calculate the influence weight of each word in the question. It then measures the correlation between this hidden representation and a word-level context vector, and uses the Softmax function to obtain normalized importance weights [22]. The sentence vector is then computed as the weighted sum of the word annotations.
During training, the word-level context vector is initialized and then learned jointly with the rest of the model. The word-level attention representation is shown in formula (8).
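A typical formulation of this word-level attention, with $u_{w}$ denoting the word-level context vector, is:
$u_{it}=\tanh (W_{w}h_{it}+b_{w}),\quad a_{it}=\frac{\exp (u_{it}^{T}u_{w})}{\sum _{k}\exp (u_{ik}^{T}u_{w})},\quad q_{N}^{p}=\sum _{t}a_{it}h_{it}$ (8)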
In formula (8), $W_{w}$ and $b_{w}$ are the parameters of the fully connected layer, $u_{it}$ and $h_{it}$ are the state representations in the hidden layer, $a_{it}$ is the normalized weight from the Softmax function, $q_{N}^{w}$ is the question vector, and $q_{N}^{p}$ represents the phrase vector. After obtaining a series of phrase vectors, the phrases
are encoded by a bidirectional GRU to obtain feature vectors [23]. The forward and backward vectors are then concatenated, thereby obtaining a context-dependent vector. After concatenating the forward and reverse hidden states, sentence annotations are obtained, as shown in formula (9).
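Written as a concatenation of the two directional states, this is:
$h_{i}=[\overset{\rightarrow }{h_{i}},\overset{\leftarrow }{h_{i}}]$ (9)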
In formula (9), $\overset{\rightarrow }{h_{i}}$ represents the forward state, $\overset{\leftarrow
}{h_{i}}$ is the reverse state, while $h_{i}$ is the sentence annotation. Then, the
attention mechanism is also utilized at the phrase layer, and the importance of the
sentence is measured by its proportion, as shown in formula (10).
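By analogy with the word level, with $W_{s}$, $b_{s}$, and a phrase-level context vector $u_{s}$ (names used here only for illustration), the phrase-level attention can be written as:
$u_{i}=\tanh (W_{s}h_{i}+b_{s}),\quad a_{i}=\frac{\exp (u_{i}^{T}u_{s})}{\sum _{j}\exp (u_{j}^{T}u_{s})}$ (10)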
In formula (10), $a_{i}$ represents the proportion of the phrase in the question. Finally, the Softmax
layer is used to classify the problem features, and the obtained probability value
is the final feature, as shown in formula (11).
In formula (11), $p$ represents the normalized output. In the question-level mechanism, to ensure
mapping between pictures and questions, a joint attention mechanism is introduced.
The joint attention mechanism is able to simultaneously generate attention for questions
and images, and is more capable of processing direct correlations in multimodal information
than single attention [24]. Based on the original image and question features, the joint attention mechanism constructs an affinity matrix from the original features, takes the maximum over the feature weights, and then attends to the question and image features at all positions. The joint attention model is shown in Fig. 5.
When associating a question with an image, one task is to locate the image feature in image space, and another is to obtain the similarity of the question feature to the corresponding question vector. The affinity matrix is obtained
from the representation of the problem and the image feature map, as shown in formula
(12).
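In the standard parallel co-attention formulation, the affinity matrix $C$ is:
$C=\tanh (Q^{T}W_{b}V)$ (12)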
In formula (12), $W_{b}$ is the weight parameter, $Q$ is the question representation, $V$ represents the feature map of the image, and the superscript $T$ denotes transposition. The affinity matrix is then maximized over positions, as shown in formula (13).
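Taking the maximum of the affinity matrix over the positions of the other modality gives:
$a^{v}[n]=\max _{t}(C[t,n]),\quad a^{q}[t]=\max _{n}(C[t,n])$ (13)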
In formula (13), $a^{v}[n]$ is the maximum of the affinity matrix for image position $n$, and $a^{q}[t]$ is the maximum for question position $t$. The affinity matrix is then used as a feature parameter, and the attention maps of the question and image are predicted to increase the precision of the model. A weighted sum over the attention weights of the question and the image is then applied to obtain the final features, as shown in formula (14).
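Treating $a^{v}$ and $a^{q}$ as normalized attention weights over the image regions $v_{n}$ and question positions $q_{t}$, the attended features are:
$\overset{\wedge }{v}=\sum _{n}a^{v}[n]v_{n},\quad \overset{\wedge }{q}=\sum _{t}a^{q}[t]q_{t}$ (14)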
In formula (14), $\overset{\wedge }{v}$ represents the image feature, and $\overset{\wedge }{q}$
represents the problem feature. After the features are processed with joint attention,
the accuracy of answer prediction is improved because it can better fuse the two feature
vectors [25]. The prediction of the answer is a multi-classification task; that is, a series of
answers will be obtained, classified, and output through the multi-layer perceptron.
The candidate answers are the five with the highest probabilities in the output, and among them, the candidate with the highest probability is taken as the predicted answer, as shown in formula (15).
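One standard form of such a prediction layer, concatenating the two attended features, is:
$p=\mathrm{softmax}(W\tanh (W_{w}[\overset{\wedge }{q},\overset{\wedge }{v}]))$ (15)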
In formula (15), $W_{w}$ and $W$ are weight parameters, both located in the fully connected layer, and $p$ represents the probability distribution over answers. The resulting VQA system is then applied to children’s preschool language education. Language education questions are first retrieved from textbook images, and a VQA-model-based interactive intelligent tutoring system is then offered to the children. The
visual question answering system will first analyze an image, then deduce the answer
to the language question, and finally, automatically respond to a preschooler’s language
inquiry.
Fig. 4. The Hierarchical Joint Attention (HJA) model.
Fig. 5. The joint attention model.
4. Application Effect Analysis
First, the effect of the HJA in the study was tested and compared with three models:
the Deep and Cross Network (DCN), Bottom-Up and Top-Down Attention (BUTD), and MuRel.
Using the same dataset, the training results of the four models with an expanded number
of iterations are displayed in Fig. 6.
In Fig. 6, among the four models, the proposed HJA model reached the target accuracy first
after 125 iterations, whereas BUTD approached the target accuracy after nearly 300
iterations. The number of iterations needed for the DCN and MuRel models to approach
the target accuracy was similar: 220 and 240, respectively. This shows that the HJA
achieves the best convergence performance. Then, the four models were tested for accuracy
in answering visual questions. Before the experiment started, data preprocessing was
required to obtain problem annotations of the verification data and test data, and
the JSON format file was downloaded through Python, thereby converting the images
and questions into a new file format. The experiments were carried out on a Linux
system and using the Python language. The machine learning framework was TensorFlow,
and image preprocessing was performed through OpenCV. The specific parameters of the
experiment are shown in Table 1.
The four models used the VQAv1 dataset in the experiments. Each picture in the VQA
dataset corresponds to more than three questions, and each question corresponds to
10 correct answers and three possibly correct answers. The number of images in this
dataset is 204,721, and the number of questions is 614,163. There are various types
of questions: yes/no, open-ended questions answered with only one word or phrase,
and multiple-choice questions with 18 alternative answers. Ground-truth answers are provided officially; a predicted answer is judged correct if at least three annotators gave the same answer. The results obtained are shown in Fig. 7.
Fig. 7 shows the VQA accuracies of the four models on the VQAv1 dataset. As the number of iterations increased, the accuracy of each model first rose and then stabilized. Among them, the accuracy of BUTD was stable at 78%, and
when DCN reached 30 iterations, the precision began to stabilize and fluctuated around
82%. After the MuRel model reached 20 iterations, the accuracy rate stayed at about
83%, whereas the precision rate increased rapidly with HJA until it reached 18 iterations,
after which it began to stabilize at 88%, which is better than the other three models.
The VQAv2 dataset was then utilized to compare the results of the four models. The
kinds of questions in this dataset are numerical, yes/no questions, and other questions,
accordingly denoted Num, Y/N, and Other. Overall performance is denoted All. The results
obtained are shown in Fig. 8.
Figs. 8(a) and (d) are the accuracy results for Num and All question types, respectively, and
Figs. 8(b) and (c) are the accuracy rates of the Y/N and Other question types, respectively.
In Fig. 8(a), the accuracy of the four models for Num questions gradually decreased with additional
questions. Among the four models, the lowest accuracy rates for HJA, DCN, BUTD, and
MuRel were 52%, 50%, 48%, and 49%, respectively, and the highest rates were 58%, 57%,
56%, and 56.5% respectively. The HJA accuracy rates were higher than the other three
models. In Fig. 8(b), the accuracy rates of HJA, BUTD, DCN, and MuRel for the Y/N problem type were the
lowest at 88%, 82%, 84%, and 85%, respectively, and the highest accuracy rate for
HJA was 92%. In Fig. 8(c), the lowest accuracy from HJA, BUTD, MuRel, and DCN models for the other problem
type were 56%, 52%, 53%, and 55%, respectively. Fig. 8(d) shows that BUTD, DCN, and MuRel had the lowest overall accuracies, at 63%, 65%, and 67%, respectively (highest at 68%, 69%, and 71%, respectively), whereas HJA reached an overall accuracy of 77% (lowest at 71%) and outperformed the other three models. Therefore,
the HJA model showed high VQA accuracy with all four question types and superior overall performance. Finally, the visual question answering system established
by the four models was applied to the language education of preschool children. This
research selected 10 well-known kindergartens in Shanghai, and integrated the language
education data of these 10 kindergartens into a new preschool language database. There
were 128,654 questions in the database, separated into three question types: numerical,
yes/no, and Other. All is again used to denote the overall accuracy rate. Results
from the four models are shown in Fig. 9.
Fig. 9 shows the accuracy outcomes of the VQA system established by the four models in the
selected preschool language learning database. Among them, Fig. 9(a) shows the accuracy with the different problem types. We can see that the accuracy
of DCN and BUTD was below 85%, and the accuracy of MuRel was above 85%. The highest
was 87%, while HJA’s accuracy rates were all above 90%, and the Y/N question type
had the highest accuracy rate at 94%. Fig. 9(b) shows the accuracy results of the four models when the number of questions gradually
increased. It is clear that increasing the number of questions reduced precision from
the four models. The MuRel, BUTD, and DCN accuracy rates decreased to around 85%,
although HJA remained at or above 90%.
Table 1. Specific parameters of the experiment.
Name | Numerical value
Epochs | 105
Run size | 512
Number of GRU layers | 2
Word embedding size | 512
Learning rate | 0.001
Number of samples selected each time | 200
Dropout after embedding | 0.5
Word dropout after embedding | 0.5
Fc7 image feature dimension | 4096
Dataset version | 1
Fig. 6. Training results of the four models on the same dataset.
Fig. 7. Accuracies with the VQAv1 dataset.
Fig. 8. VQA accuracy results of the four models on the VQAv2 dataset.
Fig. 9. VQA results of the four models when applied to the actual Preschool Language Education Database.
5. Conclusion
Language instruction for young children is a primary subject in contemporary education and is crucial for their overall development. The effectiveness of language learning for preschoolers is limited by the low accuracy of the VQA models in existing interactive intelligent tutoring systems. This research used a neural network to extract information features, proposed the Hierarchical Joint Attention model, and applied it to a VQA system. The outcomes indicate that the HJA model approached the target accuracy at 125 iterations,
and had better convergence performance. With the VQAv1 dataset, the accuracy of the
MuRel model remained at around 83% after 20 iterations. After stabilizing, BUTD's
accuracy stayed at 78%, DCN's accuracy stayed at 82%, and the HJA model's accuracy
stabilized at 88% after 18 iterations. With the VQAv2 dataset, HJA's lowest accuracy rates for Num, Y/N, and Other questions were 52%, 88%, and 56%, respectively, but all were higher than the corresponding rates of the other three models.
The accuracy rate of the HJA model's VQA system in the example verification was above
90%, with the highest at 94%. This high accuracy rate shows that the HJA VQA system
performs well in preschool language teaching and can offer useful support for that
education. However, in the process of instance validation, the selected data volume
was still small, and there was a lack of data screening, which may lead to low accuracy
and credibility in the obtained results. Therefore, it is essential to continue expanding
high-quality and larger datasets to support the model's ability to recognize new patterns
and to raise prediction accuracy and generalizability.
6. Funding
The research was supported by a special project of the teaching reform and development
of basic education in Shaanxi Province - The Research on Curriculum Construction and
Teaching Practice of Infant Life Education Based on Picture Book Reading under the
New Outline (No. JYTYB2022-08).
REFERENCES
A. Partika, A. D. Johnson, D. A. Phillips, et al. “Dual language supports for dual
language learners? Exploring preschool classroom instructional supports for DLLs’
early learning outcomes”. Early Childhood Research Quarterly, vol. 56, pp. 124-138,
2021.
S. M. Satagalieva. “The trends for modern libraries and building the strategy of library
and information education in the Republic of Kazakhstan”. Scientific and Technical
Libraries, vol. 3, pp. 58-70, 2021.
M. Zhang, M. Zhang, G. Tian, et al. “A Home Service-Oriented Question Answering System
with High Accuracy and Stability”. IEEE Access, pp. 1-3, 2019.
C. Yin, J. Tang, Z. Xu, et al. “Memory Augmented Deep Recurrent Neural Network for
Video Question Answering.” IEEE Transactions on Neural Networks and Learning Systems,
vol. 99, pp. 1-9, 2019.
Q. Cao, X. Liang, B. Li, et al. “Interpretable Visual Question Answering by Reasoning
on Dependency Trees”. IEEE transactions on pattern analysis and machine intelligence,
vol. 43(3), pp. 887-901, 2021.
J. Hong, J. Fu, Y. Uh, et al. “Exploiting hierarchical visual features for visual
question answering”. Neurocomputing, vol. 351, pp. 187-195, 2019.
S. Garg, R. Srivastava. “Object sequences: encoding categorical and spatial information
for a yes/no visual question answering task”. Computer Vision, IET, vol. 12(8), pp.
1141-1150, 2018.
A. A. Yusuf, F. Chong, M. Xianling, “Evaluation of graph convolutional networks performance
for visual question answering on reasoning datasets”. Multimedia Tools and Applications,
pp. 1-10, 2022.
T. M. Le, V. Le, S. Venkatesh, et al. “Hierarchical Conditional Relation Networks
for Multimodal Video Question Answering”. International Journal of Computer Vision,
vol. 8, pp. 1-24, 2021.
F. Chong, A. A. Yusuf, M. Xianling. “An analysis of graph convolutional networks and
recent datasets for visual question answering”. Artificial Intelligence Review, pp.
1-24, 2022.
D. Sharma, S. Purushotham, C. K. Reddy. “MedFuseNet: An attention-based multimodal
deep learning model for visual question answering in the medical domain”. Scientific
Reports, vol. 11(1), pp. 1-18, 2021.
L. Zhang, X. Yang, S. Li, et al. “Answering medical questions in Chinese using automatically
mined knowledge and deep neural networks: an end-to-end solution”. BMC Bioinformatics,
vol. 23(1), pp. 1-32, 2022.
Y. P. Nie, Y. Han, J. M. Huang, et al. “Attention-based encoder-decoder model for
answer selection in question answering”. Frontiers of Information Technology & Electronic
Engineering, vol. 18(4), pp. 535-544, 2019.
M. Shi, “Knowledge Graph Question and Answer System for Mechanical Intelligent Manufacturing
Based on Deep Learning”. Mathematical Problems in Engineering, vol. 2, pp. 1-8, 2021.
A. Al-Sadi, M. Al-Ayyoub, Y. Jararweh, et al. “Visual Question Answering in the Medical
Domain Based on Deep Learning Approaches: A Comprehensive Study”. Pattern Recognition
Letters, vol. 150(2), pp. 1-4, 2021.
M. Jangra, S. K. Dhull, K. K. Singh. “ECG arrhythmia classification using modified
visual geometry group network (mVGGNet)”. Journal of Intelligent and Fuzzy Systems,
vol. 38(5), pp. 1-15, 2020.
Y. Chen, Y. Mai, J. Xiao, et al. “Improving the Antinoise Ability of DNNs via a Bio-Inspired
Noise Adaptive Activation Function Rand Softplus”. Neural Computation, vol. 31(6),
pp. 1215-1233, 2019.
Y. Zhang, B. Mu, H. Zheng. “Link Between and Comparison and Combination of Zhang Neural
Network and Quasi-Newton BFGS Method for Time-Varying Quadratic Minimization”. IEEE
Transactions on Cybernetics, vol. 43(2), pp. 490-503, 2018.
Y. Fu, Z. Liang, S. You. “Bidirectional 3D Quasi-Recurrent Neural Network for Hyperspectral
Image Super-Resolution”. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, vol. 99, pp. 1-7, 2021.
S. Park, J. Jang, S. Kim, et al. “Memory-Augmented Neural Networks on FPGA for Real-Time
and Energy-Efficient Question Answering”. IEEE Transactions on Very Large-Scale Integration
(VLSI) Systems, vol. 99, pp. 1-14, 2020.
P. Y. Wang, C. T. Chen, J. W. Su, et al. “Deep Learning Model for House Price Prediction
Using Heterogeneous Data Analysis Along with Joint Self-Attention Mechanism”. IEEE
Access, vol. 99, pp. 1-9, 2021.
X. Li, J. Song, L. Gao, et al. “Beyond RNNs: Positional Self-Attention with Co-Attention
for Video Question Answering”. Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 33, pp. 8658-8665, 2019.
Y. Cheng, L. Yao, G. Xiang, et al. “Text Sentiment Orientation Analysis Based on Multi-Channel
CNN and Bidirectional GRU With Attention Mechanism”. IEEE Access, vol. 8, pp. 134964-134975,
2020.
S. Pendurkar, S. Kolpekwar, S. Dhoot, et al. “Attention Based Multi-Modal Fusion Architecture
for Open-Ended Video Question Answering Systems”. Procedia Computer Science, vol.
171, pp. 446-455, 2020.
C. Zhao, S. Wang, D. Li, et al. “Cross-domain sentiment classification via parameter
transferring and attention sharing mechanism”. Information Sciences, vol. 578, pp.
281-296, 2021.
Author
Ying Cheng obtained her master’s degree in Education (2008) from Shaanxi Normal University. Presently, she is working as an associate professor in the Teachers’ College of Xianyang Vocational Technical College. She has published articles in more than 10 reputable national peer-reviewed journals. Her areas of interest include language education for preschool children, children’s literature, and early language education of young children.