Xuyun Gong
(Economic Management College, Yiwu Industrial & Commercial College, Yiwu 322000, China)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Multimodal learning, Live streaming with goods, Goods with goods, Image recognition, Recognition system
1. Introduction
Multimodal learning is a fusion recognition technology that improves target recognition by deep training on the visual, auditory, and other multimodal elements in the target scene. In an image recognition system for live streaming with goods, a multimodal retrieval scheme allows comprehensive training on product images, text information, and speech, while also improving the recognition of visual elements such as image color and contour, thereby strengthening the recognition of product information. Jiang Daying [1] found that improving the architecture and model of a convolutional neural network, in particular optimizing the design of time-consuming training and learning stages, can effectively improve the computation speed and recognition accuracy of a commodity image recognition system and satisfy the business demand for efficient settlement in intelligent retail operations. Jiao Libin [2] applied multimodal deep learning to traffic classification, exploiting the complementarity between modalities and eliminating inter-modal redundancy to learn a better representation of traffic features. Different modal inputs of the same traffic unit are trained with convolutional neural networks and long short-term memory networks respectively, so that the interdependence of inter- and intra-modal information is fully learned; this overcomes the limitations of existing unimodal classifiers and supports more complex modern network application scenarios. Chao Zhang [3] proposed a multimodal multi-label sentiment recognition algorithm based on label embedding, which captures inter-label dependencies through trained label embedding vectors and adds constraints on the modal features to reduce the semantic gap between modalities; experiments show that the algorithm significantly improves accuracy and Hamming loss over existing methods on the multimodal multi-label sentiment recognition task.
There are still some shortcomings in current research: (1) weak interaction capability for commodity recognition information; (2) a low recognition rate for image segmentation information; (3) low accuracy in commodity feature classification. In this regard, this paper designs a multimodal deep learning recognition system for products sold through live streaming, combined with an SVM classifier for optimization, making full use of the complementarity between different modal data and broadening the perceptual recognition domain [4]. Through deep learning and computer vision, the system achieves accurate parsing of video frames and accurate detection of commodity targets. Combined with speech recognition, the system extracts useful speech information from the live video, further enriching the data sources for merchandise recognition. Finally, using a multimodal fusion algorithm and an SVM classifier, the system jointly processes visual, auditory, and textual information to improve the accuracy and reliability of merchandise recognition.
2. Multimodal Learning-Based Merchandise Recognition Model Architecture
Based on multimodal learning, this paper constructs a recognition model for products sold through live streaming, takes the multimodal map of goods as its core, and mines the correlations among text, voice, and visual information. Compared with other image recognition methods, multimodal learning offers real-time performance and scalability: multiple sensors can perceive simultaneously, which yields higher recognition accuracy, and computer vision enables automated processing and analysis of large numbers of commodity images, which significantly improves recognition efficiency.
Visual information. The product image is divided into multiple layers for feature extraction, and the feature parameters are extracted for linear mapping.
Text information. The text information is recognized directly; keywords and other feature elements in the text are extracted and linearly mapped.
Voice information. The voice of the anchors in the broadcast room is recognized, converted into a text file, and mapped with shared parameters according to the sequence of feature elements.
A grayscale model is selected to process the visual, text, and voice information; the position of each recognized element is marked by token position embedding, and the corresponding modalities are added as model parameters input to the modeling program. Multimodal deep learning is then used to connect the multimodal features of live-streamed goods, and the multimodal mapping relationship image is generated as follows:
Fig. 1. Multimodal mapping relationship image for commodity recognition.
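As a rough illustration only (not the authors' implementation), the following Python sketch shows how per-modality features could be linearly mapped to a shared width and combined with token-position and modality-type embeddings before fusion; all dimensions, parameter names, and initializations are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256                                               # shared embedding width (assumed)
MODALITY_EMB = rng.normal(scale=0.02, size=(3, D))    # 0: image, 1: text, 2: speech

def embed_modality(features, modality_id):
    """Project one modality's tokens to the shared width, then add
    token-position and modality-type embeddings (hypothetical parameters)."""
    n_tokens, feat_dim = features.shape
    W = rng.normal(scale=0.02, size=(feat_dim, D))    # linear mapping layer
    pos = rng.normal(scale=0.02, size=(n_tokens, D))  # token position embedding
    return features @ W + pos + MODALITY_EMB[modality_id]

# toy inputs: image patches, text keywords, transcribed speech tokens
tokens = np.concatenate([
    embed_modality(rng.random((4, 512)), 0),   # visual
    embed_modality(rng.random((6, 300)), 1),   # text
    embed_modality(rng.random((3, 128)), 2),   # speech
], axis=0)
print(tokens.shape)   # (13, 256) -> input to the fusion model
```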
The product multimodal data are embedded with spatial position codes, and modal information retrieval generates the correlated modal relationship analysis results. The mapping layers of the three modules are fused, and modal parameter samples are extracted for commodity feature recognition training. To train the model on sample information with a higher degree of matching, three training strategies are used.
ITC contrastive training. Image, voice, and text data are imported into the corresponding mapping layers and compared against the original data; image-text contrastive learning maintains historical data samples and compares them online with randomly predicted data to obtain the training samples [8]. A minimal sketch of this step is given below.
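This hypothetical sketch uses in-batch negatives for the image-text contrastive comparison; the batch size, embedding width, and temperature are illustrative assumptions, not values from the paper.

```python
import numpy as np

def itc_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Sketch of image-text contrastive (ITC) training: matched image/text
    pairs are positives, all other pairs in the batch act as negatives."""
    # L2-normalise both sides so the dot product is a cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(img))                # diagonal = positive pairs

    def xent(l):
        # softmax cross-entropy against the diagonal targets
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # averaged over both directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))

batch_img = np.random.rand(8, 256)
batch_txt = np.random.rand(8, 256)
print(itc_contrastive_loss(batch_img, batch_txt))
```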
ITM interactive training. The image and text information parameters are output to the training cross-layer: the text information is input from the bottom while the image information is input to each layer, so the multiple modal information streams interact. A binary linear classifier determines whether the image and text information match, and the training mean is taken as the key element for linear prediction and matching of modal features.
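The matching decision itself can be illustrated with a small hedged sketch: a binary linear classifier (here a single sigmoid unit with hypothetical weights) scores whether an image-text pair matches.

```python
import numpy as np

def itm_match_score(joint_feature, W, b):
    """Sketch of the ITM decision: a binary linear classifier scores whether
    an image/text pair actually matches (1) or was mismatched (0)."""
    logit = joint_feature @ W + b
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid probability of "match"

rng = np.random.default_rng(1)
joint = rng.random(512)                    # cross-layer output for one pair (assumed width)
W, b = rng.normal(size=512), 0.0
print(itm_match_score(joint, W, b))
```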
ITG weighted training. Positive samples of multimodal comparison information are input into the commodity identification layer, negative samples of multimodal noise are extracted for cross-training, and the bilateral similarity is calculated using the KL divergence [9].
In the formula, $a$ is the ITC-weighted training weight; $q^{\rm itc}(I)$ and $p^{\rm itc}(I)$ denote the positive and negative visual information samples; $q^{\rm itc}(T)$ and $p^{\rm itc}(T)$ denote the positive and negative textual information samples; and ${\rm KL}$ is the KL divergence of the two-label cross-training.
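Since the ITG formula itself is not reproduced here, the sketch below only illustrates one plausible reading of the description above: a weighted, bilateral KL divergence between positive and negative sample distributions on the visual and textual sides. The weight value and the toy distributions are illustrative assumptions.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def itg_weighted_loss(p_img, q_img, p_txt, q_txt, a=0.4):
    """Hypothetical ITG sketch: bilateral similarity between negative (p) and
    positive (q) sample distributions on the visual and textual sides,
    combined with the weight a described in the text."""
    return a * 0.5 * (kl_div(p_img, q_img) + kl_div(p_txt, q_txt))

print(itg_weighted_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
                        [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```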
The multi-modal sample information is trained and encoded, and based on the initial
characteristics of independent modes, the identification optimization model representing
the attribute characteristics of live-streamed commodities is constructed. The model
structure is shown in Fig. 2.
Fig. 2. Multimodal learning-based merchandise recognition model architecture.
The recognition model based on multimodal learning has a clear structure and hierarchy, which simplifies the product identification process and reduces the amount of computation.
3. Hardware Design of the Live Streaming Merchandise Recognition System Based on Multimodal Learning
3.1. SDI+HDMI Dual Interface Encoder
In view of the characteristics of multimodal commodity identification processing, an SDI+HDMI dual interface encoder is selected for anti-interference signal acquisition and encoding. Compared with a single-interface encoder, the dual-interface encoder is more flexible and can meet the connection requirements of HDMI, VGA, and other device types. The main components of the encoder hardware include a 12 V/1 A input power supply, timer, counter, LED indicator, temperature monitor, voice intercom, and expansion hard disk. The communication carrier is a gigabit wired network, which supports real-time storage to the NAS network [5]. The input information passes through the encoder's audio/video encoding, which supports image superposition, speech recognition, custom text, and other functions, as well as remote operation and control. The configuration structure of the SDI+HDMI dual interface encoder is shown in Fig. 3.
Fig. 3. Configuration structure of SDI+HDMI dual interface encoder.
The SDI+HDMI encoder supports resolutions up to 4Kp30 with multi-channel video and audio collaborative encoding, and supports synchronous transmission over multiple protocols. The video input and loop-out interfaces are 1 × 3G-SDI and 1 × HDMI 1.4a, respectively. The network interface is a 1 × 10/100/1000M RJ45 port with PoE (802.3af) support, and 2 × USB 2.0 Type-C and 1 × USB 2.0 Type-A interfaces can be configured. Media communication supports the SIP/GB-T28181 protocols, the encoding formats meet the H.265/HEVC and H.264/AVC standards, and voice intercom, Tally indication, and PoE access are supported simultaneously. By capturing the pulse signal with a counter, the phase period of the target recognition code is obtained, and the encoder operating phase is read to determine the clock signal of the video frequency signal [6].
3.2. TRIO PC-MCAT-2 Controller
The TRIO PC-MCAT-2 controller integrates visual, audio, and textual information control to achieve multimodal information integration. It supports communication protocols such as Telnet, Modbus TCP, and EtherNet/IP; uses EtherCAT to expand high-speed interfaces; and supports 128-axis servo control, 22 concurrent motion-control processes, and a minimum servo cycle of 125 $\mu$s [20]. The controller adopts dual-system physical isolation with split-core operation of a Windows system and the RTX64 real-time system. This solves the stuttering and blue-screen problems of traditional PC+PLC/motion control and ensures that the motion control program keeps running even if the Windows system goes down, giving strong stability. It also has strong computing power and an extremely fast response, achieving microsecond-level memory data exchange that improves execution time by 2000 to 3000 $\mu$s over traditional single controllers, effectively resolving delay and low efficiency in multimodal information communication [7].
3.3. IC Recognizer
The IC recognizer mainly consists of the control-core card reader CPU P89LPC932, a non-contact communication card reader IC, the real-time clock chip PCF8563, an RS232 interface for PC communication, and AT45DB021 memory [19]. RF chips are used to extend the control circuit range, and the supply voltage is set to 3.3 V, reducing the power consumption of the recognizer, effectively extending the system's recognition control range, and enabling reset requests for multiple instructions. The core card reader is a highly integrated microcontroller whose command execution speed is six times that of a standard 80C51 reader. It can perform anti-collision operations within the communication range, accurately identify the type of card information being read and written, and perform function decoding and modification on the multimodal information produced by the encoder, with high compatibility and operational stability [6].
Fig. 4. Functional block diagram of IC recognizer configuration.
4. Software Design of Live Streaming Product Identification System Based on Multimodal
Learning
4.1. Encoding of Live Streaming Sales Information Based on Multimodal Learning
The SDI+HDMI dual interface encoder is used to encode the live streaming product information. The product image information is input into the multimodal learning recognition model and cut into three-channel feature vectors; assuming each image feature vector is $x_{i}$, a 10-dimensional projection direction $Y_{i}$ is obtained through linear transformation:
In the formula, $x_{i}$, $r_{i}$, and $k_{i}$ represent the visual, speech, and text information encodings used for multimodal training, respectively, and $Z_{q}$, $Z_{k}$, $Z_{r}$ are the weight matrices of the three modal transformations encoded by the model [10]. Encoding similarity is measured through a dot product to obtain the SoftMax dot-product value.
In the formula, $d_{i}$ is the encoded dot-product value. The fusion vector is calculated on this basis, and a scaling factor is introduced to reduce the instability of the encoding vector calculation:
In the formula, $v_{i}$ is the multimodal parameter fusion vector. Fully connected weights and residual weight matrices $Z_{1}$, $Z_{2}$ are introduced, with corresponding interlayer biases $b_{1}$, $b_{2}$, and the multimodal commodity information vectors are nonlinearly transformed through the commodity information activation function. Using the dual interface encoder interface function, the product information is converted into a multimodal encoding, with the calculation formula
In the formula, $SH(x)$ is the multimodal information encoding of live streaming products, $Sublayer(x)$ represents the output of the sub-module information in the multimodal recognition model, and $LayerNorm$ is the normalization calculation.
After the above calculation, the multimodal information encoding data of live streaming
goods are obtained [11-14].
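The encoding steps of this subsection (projection with $Z_{q}$, $Z_{k}$, $Z_{r}$, scaled dot-product similarity, and the residual $SH(x)$ layer) can be sketched as follows. This is an assumed transformer-style reading of the text with illustrative dimensions, not the authors' exact formulas.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encode_block(x, Zq, Zk, Zr, Z1, Z2, b1, b2):
    """Sketch of the Section 4.1 encoding step: project the modal tokens with
    the three weight matrices, measure similarity with a scaled dot product,
    fuse, then apply the residual sub-layer SH(x) = LayerNorm(x + Sublayer(x))."""
    q, k, v = x @ Zq, x @ Zk, x @ Zr
    d = softmax(q @ k.T / np.sqrt(k.shape[-1]))      # scaled dot-product weights
    fused = layer_norm(x + d @ v)                    # residual + normalisation
    ffn = np.maximum(fused @ Z1 + b1, 0) @ Z2 + b2   # nonlinear transformation
    return layer_norm(fused + ffn)

rng = np.random.default_rng(2)
D = 64
x = rng.random((13, D))                              # multimodal tokens
params = [rng.normal(scale=0.05, size=(D, D)) for _ in range(5)]
out = encode_block(x, *params[:3], params[3], params[4], np.zeros(D), np.zeros(D))
print(out.shape)   # (13, 64)
```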
4.2. Feature Extraction
The key features such as commodity images, text, and speech are extracted using the recognizer, and the text and image information are represented as initial vectors with different attributes. An SVM classifier is used for image segmentation feature extraction and performs kernel-function classification of the local features:
where $S$ is the original training set of the support vector machine and $N$ is the total number of sample data. The solver coefficient $\alpha_{i}$ is introduced for dual training, and the image feature classification formula is obtained as
The key classification features of commodity images are extracted and encoded, the images are divided into $224 \times 224$ image domains of uniform size, attention weights $l_{i}$ are introduced, and a feature guidance vector is computed for each image domain.
In the formula, $g\left(attr_{i} \right)$ represents the original encoding vector of the product image domain. The attention weights satisfy the following condition:
Multimodal feature-coded images are extracted for multi-attribute feature-code matching, and the image feature vectors in the model are fused with the text features at the knowledge level to obtain the multimodal feature knowledge of the commodity [15-17].
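As a hedged sketch of this subsection's two steps (kernel-SVM classification of local segmentation features, then attention weights $l_{i}$ over the image domains used to build a guidance vector), the following example uses toy data; the feature dimensions, kernel choice, class count, and number of domains are assumptions rather than values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Step 1: kernel-SVM classification of local segmentation features (toy data).
X = rng.random((200, 128))            # N = 200 local feature vectors from training set S
y = rng.integers(0, 7, size=200)      # 7 commodity categories
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)                          # clf.dual_coef_ holds the solved alpha_i coefficients
print(clf.predict(X[:5]))

# Step 2: attention-weighted guidance vector over uniform image domains.
domain_codes = rng.random((9, 256))    # g(attr_i): original encoding of each image domain
scores = rng.random(9)                 # raw attention scores
l = np.exp(scores - scores.max())
l /= l.sum()                           # attention weights l_i sum to 1
guidance = l @ domain_codes            # weighted feature guidance vector
print(guidance.shape)                  # (256,)
```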
4.3. Product Identification Based on Multimodal Interaction
The above multimodal feature vectors are deeply guided to obtain multimodal reconstruction vectors using the linear relationship among the multimodal sample data.
In the formula, $G_{m}$ represents the reconstructed vector after multimodal data interaction, $a_{m}$ represents the guidance weight for multimodal interaction, $X_{m}^{i}$ represents the random vector of modality $m$, and $L$ represents the vector length. Because different modal vectors carry different weights, the interaction guidance weights must satisfy:
The multimodal feature interaction guidance vector is input into the model fusion layer and substituted into the fixed-length form of the modal feature vector mapping $l'_{m}$. The guidance weight formula for the modal fusion features is then calculated, and a weighted average yields the final product feature fusion vector $f_{m}$:
In the formula, $m$ represents any mode of the live-streamed merchandise, including the text mode $token$ and the image mode $region$; the target product feature interaction fusion vector is ultimately calculated from these. Following the steps above, residual recognition is performed on the live commodity image information, and multimodal interaction training yields the commodity feature set, while the auxiliary monitoring module performs interaction training on the text information and uses a transfer matrix for feature matching. This yields more accurate commodity image, text, and other information and improves the recognition and response performance of the live streaming with goods system [18]. The overall flow of commodity interaction recognition is shown in Fig. 5:
Fig. 5. Product multimodal interaction identification process.
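The final fusion step can be illustrated with a short sketch that normalizes the interaction guidance weights and takes the weighted average of the token- and region-mode vectors to form $f_{m}$; the weights and vector width are illustrative assumptions.

```python
import numpy as np

def fuse_modalities(vectors, weights):
    """Sketch of the Section 4.3 fusion step: guidance weights a_m for the
    token (text) and region (image) modes are normalised to sum to 1, then
    the final fusion vector f_m is their weighted average."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                                   # interaction weights sum to 1
    return sum(wi * vi for wi, vi in zip(w, vectors))

rng = np.random.default_rng(5)
token_vec = rng.random(256)     # reconstructed text-mode vector (toy)
region_vec = rng.random(256)    # reconstructed image-mode vector (toy)
f_m = fuse_modalities([token_vec, region_vec], weights=[0.6, 0.4])
print(f_m.shape)   # (256,)
```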
5. Experimental Research
To verify the practical application effect of the design in this article, a comparative
experiment was designed using the Windows 10 operating system, and the training operating
platform was Hadoop0.20.2. A virtualization program was used to simulate multiple
hosts running nodes in the server, capture the live bandwagon merchandise information
dataset and utilize Python with high performance for custom crawler crawling. Integrating
multimodal product categories, a total of 11400 products covering 7 major categories
were obtained. Extract feature information from all product samples for partitioning,
and conduct multimodal learning product recognition experiments. The training set
consists of 6523 samples, the validation set consists of 2438 samples, and the testing
set consists of 2439 samples. The specific number of positive and negative samples
is shown in the Table 1.
Table 1. Experimental categories of product identification for live streaming products.

| Data set | Training set | Test set | Validation set |
| Clothing | 1174 | 428 | 424 |
| Footwear luggage | 558 | 228 | 230 |
| Cosmetics | 1056 | 417 | 426 |
| Delicacy | 1230 | 461 | 463 |
| Daily necessities | 1085 | 368 | 364 |
| Jewelry | 530 | 209 | 198 |
| Other products | 890 | 328 | 333 |
| Positive sample | 4255 | 2117 | 2132 |
| Negative sample | 2636 | 322 | 895 |
| Total | 6523 | 2439 | 2438 |
Based on the sample categories in Table 1, when multimodal learning is conducted on different categories of goods, the live streaming product recognition system based on multimodal learning studied in this article achieves a recognition rate of 95.81% for single-modal image information, 90.29% for bimodal image-and-text information, and 88.10% for multimodal recognition fusing image, text, and speech information, which is at least 6.5% higher than the success rate of traditional product information recognition methods. The success rate of the proposed method for identifying multimodal goods is 8.4% higher than that of the convolutional neural network recognition method. Meanwhile, the proposed method needs 86 ms for multimodal information fusion and product recognition, the embedded label recognition method needs 98 ms, and the convolutional neural network recognition method needs 152 ms. It can thus be seen that the multimodal learning recognition method performs complex multimodal fusion recognition in a shorter time with a higher recognition success rate. The above samples are trained, and the product recognition experimental results are shown in Figs. 6 and 7.
Fig. 6. Success rate of product multimodal information fusion recognition.
Fig. 7. Multimodal product information exchange rate.
As shown in Fig. 7, in the 10,000-item recognition experiment, when the modal missing degree of the product modal information is 20%, the information exchange rate of multimodal learning is 78%, while that of the embedded label recognition method is 59%. When the missing degree is 40%, the information exchange rate of multimodal recognition is 71%. When the missing degree is 80%, the information exchange rate of multimodal learning is 56%, while that of the convolutional neural network is only 18%. The recognition method studied in this paper therefore has a significantly higher multimodal information interaction rate than traditional methods and stronger information interaction capability. The degree of modal interaction affects the robustness of model recognition: the lower the modal information interaction rate, the lower the accuracy of the product identification information and the worse the effect of multimodal fusion recognition.
The statistical results for the accuracy rate of live-streaming product identification are shown in Table 2.
Table 2. Accuracy rate of live streaming product identification.

| System | Sample size of goods/piece | Training set recognition accuracy/% | Test set recognition accuracy/% | Verification set recognition accuracy/% |
| Multimodal system | 0-2000 | 98.21 | 97.43 | 97.36 |
| Multimodal system | 2000-5000 | 96.44 | 96.81 | 96.67 |
| Multimodal system | 5000-10000 | 95.93 | 95.32 | 95.24 |
| Label recognition system | 0-2000 | 93.66 | 92.18 | 90.98 |
| Label recognition system | 2000-5000 | 91.09 | 85.62 | 88.76 |
| Label recognition system | 5000-10000 | 88.34 | 81.93 | 85.32 |
| Convolutional neural recognition system | 0-2000 | 93.67 | 89.04 | 81.74 |
| Convolutional neural recognition system | 2000-5000 | 90.25 | 83.29 | 79.64 |
| Convolutional neural recognition system | 5000-10000 | 86.44 | 81.45 | 76.98 |
According to Table 2, the recognition accuracy of the multimodal learning product recognition method proposed in this article remains above 95%. Through multimodal learning, visual, text, and speech information can be integrated for more comprehensive multimodal information fusion recognition, with stronger information interaction ability and higher recognition accuracy.
6. Conclusion
A multimodal learning-based live-streaming product identification system is proposed
to address the efficiency issue of live-streaming product identification. The system
is optimized from both hardware structure and software design, and the following research
conclusions are drawn:
Product visual, text, and speech information is extracted, the correlations among multimodal information features are mined, and multimodal deep learning is used to construct a product recognition model for sample training. The hardware design of the live streaming product identification system is optimized with the SDI+HDMI dual interface encoder, TRIO PC-MCAT-2 controller, and IC recognizer to enhance operational stability and compatibility. A multimodal deep learning algorithm encodes the commodity information, the SVM classifier extracts the key features, and multimodal information fusion interaction recognition of commodities is realized through model-guided interaction operations.
The experiments show that the product recognition system designed in this article can complete multimodal information interaction fusion recognition in a shorter time, and its recognition rate is significantly higher than that of traditional recognition systems. However, the system still has some shortcomings, mainly in the following aspects.
In the face of massive live streaming product information, there may be risks such as information omission and virus intrusion during product information collection, which affects the security of product information. The speech recognition capability is also limited: when speech is unclear, it is difficult to accurately recognize the keywords and sentences expressed, which affects the efficiency of product recognition. Subsequent research should further improve the speech recognition and feature extraction functions and enhance the security of product information for live-streaming sales.
REFERENCES
D. Jiang, ``Systems design of commodity image recognition based on convolution neural
network,'' Journal of Beijing Polytechnic College, vol. 20, no. 3, pp. 4-6, 2021.

L. Jiao, M. Wang, and Y. Huo, ``A traffic classification and recognition method based
on multimodal deep learning,'' Radio Communications Technology, vol. 2, no. 2, pp.
13-17, 2021.

C. Zhang and X. Zhang, ``Label embedding based multimodal multi-label emotion recognition,''
Cyber Security and Data Governance, vol. 7, pp. 41-44, 2022.

P. Li, X. Wan, and S. Li, ``Image caption of space science experiment based on multi-modal
learning,'' Optics and Precision Engineering, vol. 29, no. 12, pp. 12-16, 2021.

S. Sun, B. Guo, and X. Yang, ``Embedding consensus autoencoder for cross-modal semantic
analysis,'' Computer Science, vol. 48, no. 7, pp. 93-98, 2021.

W. Liu and S. Jiang, ``Product identification method based on unlabeled semi-supervised
learning,'' Computer Applications and Software, vol. 2022, no. 7, pp. 39-44, 2022.

Y. Wang, ``E-commerce commodity entity recognition algorithm based on big data,''
Microcomputer Applications, vol. 37, no. 6, pp. 80-83, 2021.

M. J. A. Patwary, W. Cao, Z.-Z. Wang, and M. A. Haque, ``Fuzziness based semi-supervised
multimodal learning for patient’s activity recognition using RGBDT videos,'' Applied
Soft Computing, vol. 120, pp. 120-129, 2022.

S. Praharaj, M. Scheffel, H. Drachsler, and M. Specht, ``Literature review on co-located
collaboration modeling using multimodal learning analytics—Can we go the whole nine
yards?'' IEEE Transactions on Learning Technologies, vol. 14, no. 3, pp. 367-385,
2021.

L. Yu, C. Liu, J. Y. H. Yang, and P. Yang, ``Ensemble deep learning of embeddings
for clustering multimodal single-cell omics data,'' Bioinformatics, vol. 39, no. 6,
pp. 10-18, 2023.

C. Mi, T. Wang, and X. Yang, ``An efficient hybrid reliability analysis method based
on active learning Kriging model and multimodal-optimization-based importance sampling,''
International Journal for Numerical Methods in Engineering, vol. 122, no. 24, pp.
7664-7682, 2021.

A. Rahate, R. Walambe, S. Ramanna, and K. Kotecha, ``Multimodal co-learning: Challenges,
applications with datasets, recent advances and future directions,'' Information Fusion,
vol. 81, pp. 203-239, 2022.

B. Bardak and M. Tan, ``Improving clinical outcome predictions using convolution over
medical entities with multimodal learning,'' Artificial Intelligence in Medicine,
vol. 117, pp. 102-112, 2021.

J. Xiong, F. Li, and X. Zhang, ``Re: Xiong et al.: Multimodal machine learning using
visual fields and peripapillary circular OCT scans in detection of glaucomatous optic
neuropathy (Ophthalmology. 2022; 129:171-180) reply,'' Ophthalmology, vol. 129, no.
4, pp. 129-139, 2022.

E. A. Smith, N. T. Hill, T. Gelb, et al., ``Identification of natural product modulators of Merkel cell carcinoma cell growth and survival,'' Scientific Reports, vol. 11, no. 1, 13597, 2021.

Y. Pan, A. Braun, and I. Brilakis, ``Enriching geometric digital twins of buildings
with small objects by fusing laser scanning and AI-based image recognition,'' Automation
in Construction, vol. 140, 106633, 2022.

J. Qin, C. Wang, X. Ran, S. Yang, and B. Chen, ``A robust framework combined saliency
detection and image recognition for garbage classification,'' Waste Management, vol.
140, pp. 193-203, 2022.

F. Long, ``Simulation of English text recognition model based on ant colony algorithm
and genetic algorithm,'' Journal of Intelligent and Fuzzy Systems, vol. 40, no. 4,
pp. 1-12, 2021.

B. Lu and Z. Chen, ``Live streaming commerce and consumers’ purchase intention: An
uncertainty reduction perspective,'' Information & Management, vol. 58, 103509, 2021.

C.-D. Chen, Q. Zhao, and J.-L. Wang, ``How livestreaming increases product sales: Role of trust transfer and elaboration likelihood model,'' Behaviour & Information Technology, vol. 41, no. 3, pp. 558-573, 2022.

Author
Xuyun Gong was born in 1982 and obtained a bachelor's degree in 2004. From 2004 to 2006, she studied at Northeast Normal University, obtaining a master's degree in economics in July 2006. She has published 4 domestic core papers, including 1 CSSCI paper. Her main research interests are digital finance, small and micro enterprise financing, and the money market.