Xuyun Gong
(Economic Management College, Yiwu Industrial & Commercial College, Yiwu 322000, China)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Multimodal learning, Live streaming with goods, Goods with goods, Image recognition, Recognition system
1. Introduction
Multimodal learning is a fusion recognition technology that improves target recognition by deep training on the visual, auditory, and other multimodal elements in the target scene. In an image recognition system for live streaming with goods, a multimodal retrieval scheme allows comprehensive training on product images, text information, and speech, while also improving the recognition of visual elements such as image color and contour, thereby strengthening the recognition of product information. Jiang Daying [1] found that improving the architecture and model of a convolutional neural network, in particular optimizing the design of time-consuming training and learning stages, can effectively improve the computation speed and recognition accuracy of a commodity image recognition system and satisfy the business demand for efficient settlement in intelligent retail operations. Jiao Libin [2] applied multimodal deep learning to traffic classification, exploiting the complementarity between modalities and eliminating inter-modal redundancy to learn a better representation of traffic features. Different modal inputs of the same traffic unit are trained with convolutional neural networks and long short-term memory networks respectively, so that the interdependence of inter- and intra-modal information is fully learned; this overcomes the limitations of existing unimodal classifiers and supports more complex modern network application scenarios. Chao Zhang [3] proposed a multimodal multi-label sentiment recognition algorithm based on label embedding, which captures inter-label dependencies through trained label embedding vectors and adds constraints on the modal features to reduce the semantic gap between modalities; experiments show that the algorithm significantly improves accuracy and Hamming loss over existing methods on the multimodal multi-label sentiment recognition task.
There are still some shortcomings in current research: (1) weak interaction capability for commodity recognition information; (2) a low recognition rate for image segmentation information; (3) low accuracy in commodity feature classification. In this regard, this paper designs a multimodal deep learning recognition system for products sold through live streaming, combined with an SVM classifier for optimization, making full use of the complementarity between different modal data and broadening the perceptual recognition domain [4]. Through deep learning and computer vision, the system achieves accurate parsing of video frames and accurate detection of commodity targets. Combined with speech recognition, the system extracts useful speech information from the live video, further enriching the data sources for merchandise recognition. Finally, using a multimodal fusion algorithm and an SVM classifier, the system jointly processes visual, auditory, and textual information to improve the accuracy and reliability of merchandise recognition.
2. Multimodal Learning-Based Merchandise Recognition Model Architecture
Based on multimodal learning, this paper constructs a recognition model for products sold through live streaming, takes the multimodal map of goods as its core, and mines the correlations among text, voice, and visual information. Compared with other image recognition methods, multimodal learning offers real-time performance and scalability: multiple sensors can perceive simultaneously, which yields higher recognition accuracy, and computer vision enables automated processing and analysis of large numbers of commodity images, which significantly improves recognition efficiency.
Visual information. The product image is divided into multiple layers for feature extraction, and the feature parameters are extracted for linear mapping.
Text information. The text information is recognized directly; keywords and other feature elements in the text are extracted and linearly mapped.
Voice information. The voice of the anchors in the broadcast room is recognized, converted into a text file, and mapped with shared parameters according to the sequence of feature elements.
A grayscale model is selected to process the visual, text, and voice information; the position of each recognized element is marked by token position embedding, and the corresponding modalities are added as model parameters input to the modeling program. Multimodal deep learning is then used to connect the multimodal features of live-streamed goods, and the multimodal mapping relationship image is generated as follows:
Fig. 1. Multimodal mapping relationship image for commodity recognition.
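As a rough illustration only (not the authors' implementation), the following Python sketch shows how per-modality features could be linearly mapped to a shared width and combined with token-position and modality-type embeddings before fusion; all dimensions, parameter names, and initializations are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256                                               # shared embedding width (assumed)
MODALITY_EMB = rng.normal(scale=0.02, size=(3, D))    # 0: image, 1: text, 2: speech

def embed_modality(features, modality_id):
    """Project one modality's tokens to the shared width, then add
    token-position and modality-type embeddings (hypothetical parameters)."""
    n_tokens, feat_dim = features.shape
    W = rng.normal(scale=0.02, size=(feat_dim, D))    # linear mapping layer
    pos = rng.normal(scale=0.02, size=(n_tokens, D))  # token position embedding
    return features @ W + pos + MODALITY_EMB[modality_id]

# toy inputs: image patches, text keywords, transcribed speech tokens
tokens = np.concatenate([
    embed_modality(rng.random((4, 512)), 0),   # visual
    embed_modality(rng.random((6, 300)), 1),   # text
    embed_modality(rng.random((3, 128)), 2),   # speech
], axis=0)
print(tokens.shape)   # (13, 256) -> input to the fusion model
```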
The product multimodal data are embedded with spatial position codes, and modal information retrieval generates the correlated modal relationship analysis results. The mapping layers of the three modules are fused, and modal parameter samples are extracted for commodity feature recognition training. To train the model on sample information with a higher degree of matching, three training strategies are used.
ITC contrastive training. Image, voice, and text data are imported into the corresponding mapping layers and compared against the original data; image-text contrastive learning maintains historical data samples and compares them online with randomly predicted data to obtain the training samples [8]. A minimal sketch of this step is given below.
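This hypothetical sketch uses in-batch negatives for the image-text contrastive comparison; the batch size, embedding width, and temperature are illustrative assumptions, not values from the paper.

```python
import numpy as np

def itc_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Sketch of image-text contrastive (ITC) training: matched image/text
    pairs are positives, all other pairs in the batch act as negatives."""
    # L2-normalise both sides so the dot product is a cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(img))                # diagonal = positive pairs

    def xent(l):
        # softmax cross-entropy against the diagonal targets
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # averaged over both directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))

batch_img = np.random.rand(8, 256)
batch_txt = np.random.rand(8, 256)
print(itc_contrastive_loss(batch_img, batch_txt))
```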
ITM interactive training. The image and text information parameters are output to the training cross-layer: the text information is input from the bottom while the image information is input to each layer, so the multiple modal information streams interact. A binary linear classifier determines whether the image and text information match, and the training mean is taken as the key element for linear prediction and matching of modal features.
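The matching decision itself can be illustrated with a small hedged sketch: a binary linear classifier (here a single sigmoid unit with hypothetical weights) scores whether an image-text pair matches.

```python
import numpy as np

def itm_match_score(joint_feature, W, b):
    """Sketch of the ITM decision: a binary linear classifier scores whether
    an image/text pair actually matches (1) or was mismatched (0)."""
    logit = joint_feature @ W + b
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid probability of "match"

rng = np.random.default_rng(1)
joint = rng.random(512)                    # cross-layer output for one pair (assumed width)
W, b = rng.normal(size=512), 0.0
print(itm_match_score(joint, W, b))
```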
ITG weighted training. Positive samples of multimodal comparison information are input into the commodity identification layer, negative samples of multimodal noise are extracted for cross-training, and the bilateral similarity is calculated using the KL divergence [9].
In the formula, $a$ is the ITC-weighted training weight; $q^{\rm itc}(I)$ and $p^{\rm itc}(I)$ denote the positive and negative visual information samples; $q^{\rm itc}(T)$ and $p^{\rm itc}(T)$ denote the positive and negative textual information samples; and ${\rm KL}$ is the KL divergence of the two-label cross-training.
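Since the ITG formula itself is not reproduced here, the sketch below only illustrates one plausible reading of the description above: a weighted, bilateral KL divergence between positive and negative sample distributions on the visual and textual sides. The weight value and the toy distributions are illustrative assumptions.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def itg_weighted_loss(p_img, q_img, p_txt, q_txt, a=0.4):
    """Hypothetical ITG sketch: bilateral similarity between negative (p) and
    positive (q) sample distributions on the visual and textual sides,
    combined with the weight a described in the text."""
    return a * 0.5 * (kl_div(p_img, q_img) + kl_div(p_txt, q_txt))

print(itg_weighted_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
                        [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```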
The multi-modal sample information is trained and encoded, and based on the initial
characteristics of independent modes, the identification optimization model representing
the attribute characteristics of live-streamed commodities is constructed. The model
structure is shown in Fig. 2.
Fig. 2. Multimodal learning-based merchandise recognition model architecture.
The recognition model based on multimodal learning has a clear structure and hierarchy, which simplifies the product identification process and reduces the amount of computation.
3. Hardware Design of the Live Streaming Merchandise Recognition System Based on Multimodal Learning
3.1. SDI+HDMI Dual Interface Encoder
In view of the characteristics of multimodal commodity identification processing, an SDI+HDMI dual interface encoder is selected for anti-interference signal acquisition and encoding. Compared with a single-interface encoder, the dual-interface encoder is more flexible and can meet the connection requirements of HDMI, VGA, and other device types. The main components of the encoder hardware include a 12 V/1 A input power supply, timer, counter, LED indicator, temperature monitor, voice intercom, and expansion hard disk. The communication carrier is a gigabit wired network, which supports real-time storage to the NAS network [5]. The input information passes through the encoder's audio/video encoding, which supports image superposition, speech recognition, custom text, and other functions, as well as remote operation and control. The configuration structure of the SDI+HDMI dual interface encoder is shown in Fig. 3.
Fig. 3. Configuration structure of SDI+HDMI dual interface encoder.
The SDI+HDMI encoder supports resolutions up to 4Kp30 with multi-channel video and audio collaborative encoding, and supports synchronous transmission over multiple protocols. The video input and loop-out interfaces are 1 × 3G-SDI and 1 × HDMI 1.4a, respectively. The network interface is a 1 × 10/100/1000M RJ45 port with PoE (802.3af) support, and 2 × USB 2.0 Type-C and 1 × USB 2.0 Type-A interfaces can be configured. Media communication supports the SIP/GB-T28181 protocols, the encoding formats meet the H.265/HEVC and H.264/AVC standards, and voice intercom, Tally indication, and PoE access are supported simultaneously. By capturing the pulse signal with a counter, the phase period of the target recognition code is obtained, and the encoder operating phase is read to determine the clock signal of the video frequency signal [6].
3.2. TRIO PC-MCAT-2 Controller
The TRIO PC-MCAT-2 controller integrates visual, audio, and textual information control to achieve multimodal information integration. It supports communication protocols such as Telnet, Modbus TCP, and EtherNet/IP; uses EtherCAT to expand high-speed interfaces; and supports 128-axis servo control, 22 concurrent motion-control processes, and a minimum servo cycle of 125 $\mu$s [20]. The controller adopts dual-system physical isolation with split-core operation of a Windows system and the RTX64 real-time system. This solves the stuttering and blue-screen problems of traditional PC+PLC/motion control and ensures that the motion control program keeps running even if the Windows system goes down, giving strong stability. It also has strong computing power and an extremely fast response, achieving microsecond-level memory data exchange that improves execution time by 2000 to 3000 $\mu$s over traditional single controllers, effectively resolving delay and low efficiency in multimodal information communication [7].
3.3. IC Recognizer
The IC recognizer mainly consists of the control-core card reader CPU P89LPC932, a non-contact communication card reader IC, the real-time clock chip PCF8563, an RS232 interface for PC communication, and AT45DB021 memory [19]. RF chips are used to extend the control circuit range, and the supply voltage is set to 3.3 V, reducing the power consumption of the recognizer, effectively extending the system's recognition control range, and enabling reset requests for multiple instructions. The core card reader is a highly integrated microcontroller whose command execution speed is six times that of a standard 80C51 reader. It can perform anti-collision operations within the communication range, accurately identify the type of card information being read and written, and perform function decoding and modification on the multimodal information produced by the encoder, with high compatibility and operational stability [6].
Fig. 4. Functional block diagram of IC recognizer configuration.
4. Software Design of Live Streaming Product Identification System Based on Multimodal
Learning
4.1. Encoding of Live Streaming Sales Information Based on Multimodal Learning
The SDI+HDMI dual interface encoder is used to encode the live streaming product information. The product image information is input into the multimodal learning recognition model and cut into three-channel feature vectors; assuming each image feature vector is $x_{i}$, a 10-dimensional projection direction $Y_{i}$ is obtained through linear transformation:
In the formula, $x_{i}$, $r_{i}$, and $k_{i}$ represent the visual, speech, and text information encodings used for multimodal training, respectively, and $Z_{q}$, $Z_{k}$, $Z_{r}$ are the weight matrices of the three modal transformations encoded by the model [10]. Encoding similarity is measured through a dot product to obtain the SoftMax dot-product value.
In the formula, $d_{i}$ is the encoded dot-product value. The fusion vector is calculated on this basis, and a scaling factor is introduced to reduce the instability of the encoding vector calculation:
In the formula, $v_{i}$ is the multimodal parameter fusion vector. Fully connected weights and residual weight matrices $Z_{1}$, $Z_{2}$ are introduced, with corresponding interlayer biases $b_{1}$, $b_{2}$, and the multimodal commodity information vectors are nonlinearly transformed through the commodity information activation function. Using the dual interface encoder interface function, the product information is converted into a multimodal encoding, with the calculation formula
In the formula, $SH(x)$ is the multimodal information encoding of live streaming products, $Sublayer(x)$ represents the output of the sub-module information in the multimodal recognition model, and $LayerNorm$ is the normalization calculation.
After the above calculation, the multimodal information encoding data of live streaming
goods are obtained [11-14].
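The encoding steps of this subsection (projection with $Z_{q}$, $Z_{k}$, $Z_{r}$, scaled dot-product similarity, and the residual $SH(x)$ layer) can be sketched as follows. This is an assumed transformer-style reading of the text with illustrative dimensions, not the authors' exact formulas.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encode_block(x, Zq, Zk, Zr, Z1, Z2, b1, b2):
    """Sketch of the Section 4.1 encoding step: project the modal tokens with
    the three weight matrices, measure similarity with a scaled dot product,
    fuse, then apply the residual sub-layer SH(x) = LayerNorm(x + Sublayer(x))."""
    q, k, v = x @ Zq, x @ Zk, x @ Zr
    d = softmax(q @ k.T / np.sqrt(k.shape[-1]))      # scaled dot-product weights
    fused = layer_norm(x + d @ v)                    # residual + normalisation
    ffn = np.maximum(fused @ Z1 + b1, 0) @ Z2 + b2   # nonlinear transformation
    return layer_norm(fused + ffn)

rng = np.random.default_rng(2)
D = 64
x = rng.random((13, D))                              # multimodal tokens
params = [rng.normal(scale=0.05, size=(D, D)) for _ in range(5)]
out = encode_block(x, *params[:3], params[3], params[4], np.zeros(D), np.zeros(D))
print(out.shape)   # (13, 64)
```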
4.2. Feature Extraction
The key features such as commodity images, text, and speech are extracted using the recognizer, and the text and image information are represented as initial vectors with different attributes. An SVM classifier is used for image segmentation feature extraction and performs kernel-function classification of the local features:
where $S$ is the original training set of the support vector machine and $N$ is the total number of sample data. The solver coefficient $\alpha_{i}$ is introduced for dual training, and the image feature classification formula is obtained as
The key classification features of commodity images are extracted and encoded, the images are divided into $224 \times 224$ image domains of uniform size, attention weights $l_{i}$ are introduced, and a feature guidance vector is computed for each image domain.
In the formula, $g\left(attr_{i} \right)$ represents the original encoding vector of the product image domain. The attention weights satisfy the following condition:
Multimodal feature-coded images are extracted for multi-attribute feature-code matching, and the image feature vectors in the model are fused with the text features at the knowledge level to obtain the multimodal feature knowledge of the commodity [15-17].
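As a hedged sketch of this subsection's two steps (kernel-SVM classification of local segmentation features, then attention weights $l_{i}$ over the image domains used to build a guidance vector), the following example uses toy data; the feature dimensions, kernel choice, class count, and number of domains are assumptions rather than values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Step 1: kernel-SVM classification of local segmentation features (toy data).
X = rng.random((200, 128))            # N = 200 local feature vectors from training set S
y = rng.integers(0, 7, size=200)      # 7 commodity categories
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)                          # clf.dual_coef_ holds the solved alpha_i coefficients
print(clf.predict(X[:5]))

# Step 2: attention-weighted guidance vector over uniform image domains.
domain_codes = rng.random((9, 256))    # g(attr_i): original encoding of each image domain
scores = rng.random(9)                 # raw attention scores
l = np.exp(scores - scores.max())
l /= l.sum()                           # attention weights l_i sum to 1
guidance = l @ domain_codes            # weighted feature guidance vector
print(guidance.shape)                  # (256,)
```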
4.3. Product Identification Based on Multimodal Interaction
The above multimodal feature vectors are deeply guided to obtain multimodal reconstruction vectors using the linear relationship among the multimodal sample data.
In the formula, $G_{m}$ represents the reconstructed vector after multimodal data interaction, $a_{m}$ represents the guidance weight for multimodal interaction, $X_{m}^{i}$ represents the random vector of modality $m$, and $L$ represents the vector length. Because different modal vectors carry different weights, the interaction guidance weights must satisfy:
The multimodal feature interaction guidance vector is input into the model fusion layer and substituted into the fixed-length form of the modal feature vector mapping $l'_{m}$. The guidance weight formula for the modal fusion features is then calculated, and a weighted average yields the final product feature fusion vector $f_{m}$:
In the formula, $m$ represents any mode of the live-streamed merchandise, including the text mode $token$ and the image mode $region$; the target product feature interaction fusion vector is ultimately calculated from these. Following the steps above, residual recognition is performed on the live commodity image information, and multimodal interaction training yields the commodity feature set, while the auxiliary monitoring module performs interaction training on the text information and uses a transfer matrix for feature matching. This yields more accurate commodity image, text, and other information and improves the recognition and response performance of the live streaming with goods system [18]. The overall flow of commodity interaction recognition is shown in Fig. 5:
Fig. 5. Product multimodal interaction identification process.
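The final fusion step can be illustrated with a short sketch that normalizes the interaction guidance weights and takes the weighted average of the token- and region-mode vectors to form $f_{m}$; the weights and vector width are illustrative assumptions.

```python
import numpy as np

def fuse_modalities(vectors, weights):
    """Sketch of the Section 4.3 fusion step: guidance weights a_m for the
    token (text) and region (image) modes are normalised to sum to 1, then
    the final fusion vector f_m is their weighted average."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                                   # interaction weights sum to 1
    return sum(wi * vi for wi, vi in zip(w, vectors))

rng = np.random.default_rng(5)
token_vec = rng.random(256)     # reconstructed text-mode vector (toy)
region_vec = rng.random(256)    # reconstructed image-mode vector (toy)
f_m = fuse_modalities([token_vec, region_vec], weights=[0.6, 0.4])
print(f_m.shape)   # (256,)
```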
5. Experimental Research
To verify the practical application effect of the design in this article, a comparative
experiment was designed using the Windows 10 operating system, and the training operating
platform was Hadoop0.20.2. A virtualization program was used to simulate multiple
hosts running nodes in the server, capture the live bandwagon merchandise information
dataset and utilize Python with high performance for custom crawler crawling. Integrating
multimodal product categories, a total of 11400 products covering 7 major categories
were obtained. Extract feature information from all product samples for partitioning,
and conduct multimodal learning product recognition experiments. The training set
consists of 6523 samples, the validation set consists of 2438 samples, and the testing
set consists of 2439 samples. The specific number of positive and negative samples
is shown in the Table 1.
Table 1. Experimental categories of product identification for live streaming products.

| Data set | Training set | Test set | Validation set |
| Clothing | 1174 | 428 | 424 |
| Footwear luggage | 558 | 228 | 230 |
| Cosmetics | 1056 | 417 | 426 |
| Delicacy | 1230 | 461 | 463 |
| Daily necessities | 1085 | 368 | 364 |
| Jewelry | 530 | 209 | 198 |
| Other products | 890 | 328 | 333 |
| Positive sample | 4255 | 2117 | 2132 |
| Negative sample | 2636 | 322 | 895 |
| Total | 6523 | 2439 | 2438 |
Based on the sample categories in Table 1, when multimodal learning is conducted on different categories of goods, the live streaming product recognition system based on multimodal learning studied in this article achieves a recognition rate of 95.81% for single-modal image information, 90.29% for bimodal image-and-text information, and 88.10% for multimodal recognition fusing image, text, and speech information, which is at least 6.5% higher than the success rate of traditional product information recognition methods. The success rate of the proposed method for identifying multimodal goods is 8.4% higher than that of the convolutional neural network recognition method. Meanwhile, the proposed method needs 86 ms for multimodal information fusion and product recognition, the embedded label recognition method needs 98 ms, and the convolutional neural network recognition method needs 152 ms. It can thus be seen that the multimodal learning recognition method performs complex multimodal fusion recognition in a shorter time with a higher recognition success rate. The above samples are trained, and the product recognition experimental results are shown in Figs. 6 and 7.
Fig. 6. Success rate of product multimodal information fusion recognition.
Fig. 7. Multimodal product information exchange rate.
As shown in Fig. 7, in the 10,000-item recognition experiment, when the modal missing degree of the product modal information is 20%, the information exchange rate of multimodal learning is 78%, while that of the embedded label recognition method is 59%. When the missing degree is 40%, the information exchange rate of multimodal recognition is 71%. When the missing degree is 80%, the information exchange rate of multimodal learning is 56%, while that of the convolutional neural network is only 18%. The recognition method studied in this paper therefore has a significantly higher multimodal information interaction rate than traditional methods and stronger information interaction capability. The degree of modal interaction affects the robustness of model recognition: the lower the modal information interaction rate, the lower the accuracy of the product identification information and the worse the effect of multimodal fusion recognition.
The statistical results for the accuracy rate of live-streaming product identification are shown in Table 2.
Table 2. Accuracy rate of live streaming product identification.

| System | Sample size of goods/piece | Training set recognition accuracy/% | Test set recognition accuracy/% | Verification set recognition accuracy/% |
| Multimodal system | 0-2000 | 98.21 | 97.43 | 97.36 |
| Multimodal system | 2000-5000 | 96.44 | 96.81 | 96.67 |
| Multimodal system | 5000-10000 | 95.93 | 95.32 | 95.24 |
| Label recognition system | 0-2000 | 93.66 | 92.18 | 90.98 |
| Label recognition system | 2000-5000 | 91.09 | 85.62 | 88.76 |
| Label recognition system | 5000-10000 | 88.34 | 81.93 | 85.32 |
| Convolutional neural recognition system | 0-2000 | 93.67 | 89.04 | 81.74 |
| Convolutional neural recognition system | 2000-5000 | 90.25 | 83.29 | 79.64 |
| Convolutional neural recognition system | 5000-10000 | 86.44 | 81.45 | 76.98 |
According to Table 2, the recognition accuracy of the multimodal learning product recognition method proposed in this article remains above 95%. Through multimodal learning, visual, text, and speech information can be integrated for more comprehensive multimodal information fusion recognition, with stronger information interaction ability and higher recognition accuracy.
6. Conclusion
A multimodal learning-based live-streaming product identification system is proposed
to address the efficiency issue of live-streaming product identification. The system
is optimized from both hardware structure and software design, and the following research
conclusions are drawn:
Product visual, text, and speech information is extracted, the correlations among multimodal information features are mined, and multimodal deep learning is used to construct a product recognition model for sample training. The hardware design of the live streaming product identification system is optimized with the SDI+HDMI dual interface encoder, TRIO PC-MCAT-2 controller, and IC recognizer to enhance operational stability and compatibility. A multimodal deep learning algorithm encodes the commodity information, the SVM classifier extracts the key features, and multimodal information fusion interaction recognition of commodities is realized through model-guided interaction operations.
The experiments show that the product recognition system designed in this article can complete multimodal information interaction fusion recognition in a shorter time, and its recognition rate is significantly higher than that of traditional recognition systems. However, the system still has some shortcomings, mainly in the following aspects.
In the face of massive live streaming product information, there may be risks such as information omission and virus intrusion during product information collection, which affects the security of product information. The speech recognition capability is also limited: when speech is unclear, it is difficult to accurately recognize the keywords and sentences expressed, which affects the efficiency of product recognition. Subsequent research should further improve the speech recognition and feature extraction functions and enhance the security of product information for live-streaming sales.
REFERENCES
D. Jiang, ``Systems design of commodity image recognition based on convolution neural
network,'' Journal of Beijing Polytechnic College, vol. 20, no. 3, pp. 4-6, 2021.

L. Jiao, M. Wang, and Y. Huo, ``A traffic classification and recognition method based
on multimodal deep learning,'' Radio Communications Technology, vol. 2, no. 2, pp.
13-17, 2021.

C. Zhang and X. Zhang, ``Label embedding based multimodal multi-label emotion recognition,''
Cyber Security and Data Governance, vol. 7, pp. 41-44, 2022.

P. Li, X. Wan, and S. Li, ``Image caption of space science experiment based on multi-modal
learning,'' Optics and Precision Engineering, vol. 29, no. 12, pp. 12-16, 2021.

S. Sun, B. Guo, and X. Yang, ``Embedding consensus autoencoder for cross-modal semantic
analysis,'' Computer Science, vol. 48, no. 7, pp. 93-98, 2021.

W. Liu and S. Jiang, ``Product identification method based on unlabeled semi-supervised
learning,'' Computer Applications and Software, vol. 2022, no. 7, pp. 39-44, 2022.

Y. Wang, ``E-commerce commodity entity recognition algorithm based on big data,''
Microcomputer Applications, vol. 37, no. 6, pp. 80-83, 2021.

M. J. A. Patwary, W. Cao, Z.-Z. Wang, and M. A. Haque, ``Fuzziness based semi-supervised
multimodal learning for patient’s activity recognition using RGBDT videos,'' Applied
Soft Computing, vol. 120, pp. 120-129, 2022.

S. Praharaj, M. Scheffel, H. Drachsler, and M. Specht, ``Literature review on co-located
collaboration modeling using multimodal learning analytics—Can we go the whole nine
yards?'' IEEE Transactions on Learning Technologies, vol. 14, no. 3, pp. 367-385,
2021.

L. Yu, C. Liu, J. Y. H. Yang, and P. Yang, ``Ensemble deep learning of embeddings
for clustering multimodal single-cell omics data,'' Bioinformatics, vol. 39, no. 6,
pp. 10-18, 2023.

C. Mi, T. Wang, and X. Yang, ``An efficient hybrid reliability analysis method based
on active learning Kriging model and multimodal-optimization-based importance sampling,''
International Journal for Numerical Methods in Engineering, vol. 122, no. 24, pp.
7664-7682, 2021.

A. Rahate, R. Walambe, S. Ramanna, and K. Kotecha, ``Multimodal co-learning: Challenges,
applications with datasets, recent advances and future directions,'' Information Fusion,
vol. 81, pp. 203-239, 2022.

B. Bardak and M. Tan, ``Improving clinical outcome predictions using convolution over
medical entities with multimodal learning,'' Artificial Intelligence in Medicine,
vol. 117, pp. 102-112, 2021.

J. Xiong, F. Li, and X. Zhang, ``Re: Xiong et al.: Multimodal machine learning using
visual fields and peripapillary circular OCT scans in detection of glaucomatous optic
neuropathy (Ophthalmology. 2022; 129:171-180) reply,'' Ophthalmology, vol. 129, no.
4, pp. 129-139, 2022.

E. A. Smith, N. T. Hill, T. Gelb, et al., ``Identification of natural product modulators of Merkel cell carcinoma cell growth and survival,'' Scientific Reports, vol. 11, no. 1, 13597, 2021.

Y. Pan, A. Braun, and I. Brilakis, ``Enriching geometric digital twins of buildings
with small objects by fusing laser scanning and AI-based image recognition,'' Automation
in Construction, vol. 140, 106633, 2022.

J. Qin, C. Wang, X. Ran, S. Yang, and B. Chen, ``A robust framework combined saliency
detection and image recognition for garbage classification,'' Waste Management, vol.
140, pp. 193-203, 2022.

F. Long, ``Simulation of English text recognition model based on ant colony algorithm
and genetic algorithm,'' Journal of Intelligent and Fuzzy Systems, vol. 40, no. 4,
pp. 1-12, 2021.

B. Lu and Z. Chen, ``Live streaming commerce and consumers’ purchase intention: An
uncertainty reduction perspective,'' Information & Management, vol. 58, 103509, 2021.

C.-D. Chen, Q. Zhao, and J.-L. Wang, ``How livestreaming increases product sales: Role of trust transfer and elaboration likelihood model,'' Behaviour & Information Technology, vol. 41, no. 3, pp. 558-573, 2022.

Author
Xuyun Gong was born in 1982 and obtained a bachelor's degree in 2004. From 2004 to 2006, she studied at Northeast Normal University, obtaining a master's degree in economics in July 2006. She has published 4 domestic core papers, including 1 CSSCI paper. Her main research interests are digital finance, small and micro enterprise financing, and the money market.