Bak Huiyong1 and Lee Sangmin1
1 Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Korea
(22211253inha.edu@inha.edu, sanglee@inha.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Violent scene discrimination, Wav2vec 2.0, Audio signal processing
1. Introduction
The movie industry produces thousands of movies every year; however, movies with violent
content are not suitable for children. Watching violent scenes in movies tends to
make children more aggressive and leads to unhealthy attitudes. Thus, it is imperative
to have a violent scene discrimination system (VSDS) to protect children from viewing
violence in movies. Moreover, these systems can be useful for child-suitability ratings
for movies [1,2].
Since most violent scenes involve the behavior of objects, visual information is typically
used to discriminate them. However, visual information does not capture audio cues
such as screams and offensive language, and it can miss violent events that last less
than a second, such as gunshots. Audio information captures both kinds of cue, so it is
beneficial to utilize audio information in violent scene discrimination.
Previous studies have implemented audio-based violent scene discrimination as follows.
Mu et al. built a VSDS using 2D convolutional neural networks (CNNs) [3]. Sarman and Sert
built a VSDS using a support vector machine (SVM), random forest, and bagging [4].
Potharaju et al. also built a VSDS using an SVM [5]. Gu et al. proposed a violent scene
detection system using a mel spectrogram and the CNN-based VGGish [6]. Among these
studies, the one using the mel spectrogram and VGGish showed good performance.
However, it had two limitations. First, the mel spectrogram can extract the unique
features of an audio signal, but it cannot extract the mutual information that audio
data have in common. Second, VGGish was pre-trained on audio unrelated to violent
scenes, such as sports and games.
To address these limitations, a new system is proposed that discriminates violent scenes
in movies using audio signals. The proposed system extracts audio features with
Wav2vec 2.0, which can extract mutual information from audio data. The audio features
are then input to a 1D CNN and long short-term memory (LSTM), which can effectively
discriminate audio data, and violent scenes are discriminated through fully connected
and softmax layers.
Section 2 describes the techniques in the proposed system, which is presented in Section
3. Section 4 describes the experiment conducted, how the proposed system was used
in it, and the performance evaluation and results. Section 5 concludes the paper.
2. Technologies of the Proposed System
2.1 Wav2vec 2.0
As shown in Fig. 1, the speech input to Wav2vec 2.0 is converted into vectors of a
specific length by a 1D CNN. The transformed vectors, called latent speech
representations, are input to a transformer encoder, which creates contextualized
representations that restore the masked parts using the surrounding information.
Wav2vec 2.0 is trained so that the contextualized representations and the quantized
representations are similar [7]. A Wav2vec 2.0 model trained in this way has the
advantage of extracting the mutual information common to audio data. Therefore, the
proposed system extracts audio features using a pre-trained Wav2vec 2.0, a model that
learns its characteristics through self-supervised learning on unlabeled human speech.
Fig. 1. The structure of Wav2vec 2.0.
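As an illustration of how such features can be obtained in practice, the following is a
minimal sketch using the Hugging Face transformers library; the checkpoint name and the
16-kHz mono input are our assumptions, not details taken from the paper.

```python
# Minimal sketch: extracting Wav2vec 2.0 contextualized representations with
# the Hugging Face transformers library. The checkpoint name and the 16-kHz
# mono input are illustrative assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()  # the pre-trained model is used as-is (no fine-tuning)

waveform = torch.randn(16000)  # placeholder for 1 s of 16-kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state

print(features.shape)  # (batch, frames, 768); roughly 49 frames per second
```

The last_hidden_state tensor would correspond to the contextualized representations
described above.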
2.2 The CNN and LSTM
The CNN creates a feature map containing the spatial characteristics of the data through
a convolution layer. The feature map is reduced in size through pooling, which compresses
the features. After repeating this process, the data are classified using fully connected
and softmax layers [8]. Because the CNN can extract spatial features of the input data
through its convolution layers, the proposed system uses a CNN to extract spatial features
from the contextualized representations of Wav2vec 2.0.
A recurrent neural network (RNN) structure is used to handle time series data, such as
audio signals. The RNN trains on time series data by feeding the previous hidden state
into the next step of the network. The RNN is limited in that the gradient required for
backpropagation decreases or increases exponentially, depending on the length of the
time series. To overcome this limitation, the LSTM architecture adds a cell state to the
RNN hidden state. Backpropagation through the cell state does not pass through nonlinear
functions such as tanh, so the LSTM prevents the gradient vanishing and exploding seen in
the RNN [9]. Therefore, the proposed system uses an LSTM to extract temporal
characteristics from the contextualized representations of Wav2vec 2.0.
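As a toy illustration of how these two components complement each other, the PyTorch
sketch below (with arbitrary layer sizes of our choosing) runs a feature sequence through
a 1D convolution for local spatial patterns and then through an LSTM for temporal
dependencies.

```python
# Toy PyTorch sketch: a 1D convolution extracts local (spatial) patterns from
# a feature sequence, and an LSTM then models how those patterns evolve over
# time. All sizes here are arbitrary illustration values.
import torch
import torch.nn as nn

features = torch.randn(8, 768, 49)  # (batch, feature channels, time frames)

conv = nn.Conv1d(in_channels=768, out_channels=64, kernel_size=5, padding=2)
lstm = nn.LSTM(input_size=64, hidden_size=32, batch_first=True)

spatial = torch.relu(conv(features))         # (8, 64, 49): per-frame local patterns
temporal, _ = lstm(spatial.transpose(1, 2))  # (8, 49, 32): sequence-aware features
print(temporal.shape)
```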
3. Violent Scene Discrimination
3.1 Proposed System Overview
As shown in Fig. 2, the proposed system inputs the audio signal into the backbone
network, which uses pre-trained Wav2vec 2.0 to extract features containing mutual
information. The transfer network is then trained using the extracted features. The
proposed system discriminates violent scenes using the trained transfer network together
with the backbone network.
Fig. 2. The proposed system.
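A minimal sketch of this two-stage flow is given below, assuming a PyTorch-style setup in
which the pre-trained backbone is kept frozen (it is used without fine-tuning; see Section
4.2.1) and only the transfer network is trained; the names backbone and transfer_net are
placeholders of ours.

```python
# Sketch of the overall flow: a frozen, pre-trained backbone produces audio
# features, and only the transfer network is trained on them.
# backbone and transfer_net are illustrative placeholders.
import torch

def discriminate(audio, backbone, transfer_net):
    backbone.eval()                 # the backbone is used without fine-tuning
    with torch.no_grad():           # so no gradients flow into it
        features = backbone(audio)  # audio features with mutual information
    return transfer_net(features)   # violent / non-violent scores
```

During training, only transfer_net.parameters() would be passed to the optimizer, keeping
the backbone weights fixed.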
3.2 Backbone of the Proposed System
The backbone network converts the input audio signal into audio features using
Wav2vec 2.0. Because the Wav2vec 2.0 model is trained on unlabeled audio through
self-supervised learning, it can extract the mutual information from audio signals.
3.3 Transfer Network in the System
The transfer network utilizes a CNN and an LSTM. The CNN can capture spatial features
using a convolutional layer, and because the LSTM receives the previous hidden state as
input, it can capture temporal characteristics. Because the transfer network uses both
models, it has the advantage of considering spatial and temporal characteristics
simultaneously.
Because the backbone network uses a 1D CNN, the 1D CNN is also used in the transfer
network to preserve nonlinear information in the backbone network. A 1D CNN is suitable
for audio because it can convolve 1D data [10].
The LSTM exhibits good performance in time series prediction tasks. It increases
prediction accuracy by reducing the importance of data far from the prediction point and
increasing the importance of data near it. Therefore, an LSTM, with its high prediction
accuracy for time series data, is used in the transfer network.
4. Experiment
4.1 Dataset used in the Experiment
The dataset used in this paper, called the Violent Movie Scenes Dataset (VMD), was
generated to discriminate violent scenes. Because the concept of a violent scene is
subjective and difficult to characterize, each audio clip was manually labeled using the
violent scene criteria of a previous study, as shown in Table 1 [5].
The details of the dataset are presented in Table 2. Violent and non-violent scenes were
extracted from 69 movies: scenes from 34 movies were used for training, scenes from 15
movies for validation, and scenes from 20 movies for testing. In total, 2,400 scenes were
extracted from the 69 selected movies. The training and validation sets were used for
training, and the testing set was used for evaluation.
Table 1. Criteria for classification of violent scenes [5].
Violent scene category | Detail
Person-related sound | Angry voice, Scream
Weapon-related sound | Gunshot, Bomb
Vehicle-related sound | Accident
Fight-related sound | Fight
Environment sound | Sharp
Table 2. Dataset used in the study.
Scene type | Training (34 movies) | Validation (15 movies) | Testing (20 movies)
Violent | 800 | 200 | 200
Non-violent | 800 | 200 | 200
Total | 1,600 | 400 | 400
4.2 Implementation Details
4.2.1 Backbone Network in the System
The backbone network used the Wav2vec 2.0 base model without fine-tuning [7]. To reduce
computation, the backbone network adopted the base model trained with 960 h of speech.
As shown in Fig. 3, when audio is input to the backbone network, the network generates
audio features sized 100×768×49, which are passed to the transfer network. The total
number of parameters in the backbone network is 95 M.
Fig. 3. Processing the backbone network.
4.2.2 Transfer Network used in the Proposed System
As shown in Fig. 4, the transfer network transforms the audio features sized 100×768×49
into a feature map sized 16×112×720 containing spatial features through the 1D CNN. The
kernel size and the number of output channels of the 1D CNN were both set to 25. The
transformed feature map is input to the LSTM and converted into an LSTM feature sized
16×112×48 that captures the characteristics of the data as they change over time. In the
LSTM, the hidden dimension was 48 and the number of layers was 2. Subsequently, the LSTM
features were classified as violent or non-violent by fully connected and softmax layers.
The total number of parameters in the transfer network was 0.3 M.
Fig. 4. Processing of the transfer network.
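A possible PyTorch sketch of such a transfer network, using the stated hyperparameters
(1D convolution with kernel size 25 and 25 output channels; LSTM with hidden dimension 48
and two layers), is shown below; the input layout, the use of the last time step, and the
classifier width are our assumptions, since the exact layer-by-layer configuration is not
spelled out.

```python
# Sketch of a transfer network with the stated hyperparameters (1D conv:
# 25 output channels, kernel size 25; LSTM: hidden dim 48, 2 layers).
# Input layout and final pooling are illustrative assumptions.
import torch
import torch.nn as nn

class TransferNetwork(nn.Module):
    def __init__(self, feature_dim=768, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(feature_dim, 25, kernel_size=25)  # spatial features
        self.lstm = nn.LSTM(input_size=25, hidden_size=48,
                            num_layers=2, batch_first=True)     # temporal features
        self.fc = nn.Linear(48, num_classes)                    # violent / non-violent

    def forward(self, x):                      # x: (batch, frames, feature_dim)
        x = self.conv(x.transpose(1, 2))       # -> (batch, 25, frames')
        x, _ = self.lstm(x.transpose(1, 2))    # -> (batch, frames', 48)
        logits = self.fc(x[:, -1, :])          # use the last time step
        return torch.softmax(logits, dim=-1)   # softmax layer, as in the paper

net = TransferNetwork()
scores = net(torch.randn(4, 49, 768))  # e.g., four clips of 49 Wav2vec 2.0 frames
print(scores.shape)                    # (4, 2)
```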
4.3 Methods for Performance Evaluation
Accuracy (Eq. (1)) and average precision (Eq. (2)) were used to evaluate the performance
of the proposed system. The basis for these metrics is the confusion matrix, which is
presented in Table 3 [11]. In Eq. (2), P is the number of ground-truth violent scenes, N
is the total number of data items, and L_i = 1 when the i-th item is violent; otherwise,
L_i = 0.
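Since the original equations could not be recovered from the source text, the following
is our reconstruction under the standard definitions of the two metrics, consistent with
the variables described above; Prec(i) denotes the precision computed over the top-i
ranked scenes.

```latex
% Our reconstruction of Eqs. (1) and (2) under the standard metric definitions.
\begin{equation}
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}
\end{equation}
\begin{equation}
\text{Average Precision} = \frac{1}{P} \sum_{i=1}^{N} \mathrm{Prec}(i)\, L_i \tag{2}
\end{equation}
```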
Table 3. Confusion matrix of the experiment results.
 | True class: Violent | True class: Non-violent
Predicted: Violent | 199 (TP) | 14 (FP)
Predicted: Non-violent | 1 (FN) | 186 (TN)
4.4 Results
The proposed system was trained with the training and validation data. The result of
evaluating the trained model on the testing data is the confusion matrix in Table 3.
Table 4 compares the performance obtained using Eqs. (1) and (2) with that of previous
studies.
In order to evaluate the performance of the algorithm proposed in this paper, it was
compared with that of Gu et al. [6], which was the most recent of the previous studies
and showed high performance. The VCD dataset used by Gu et al. was not disclosed, but
MediaEval 2015 was; therefore, MediaEval 2015 was also applied to the proposed system,
and the performance of the proposed system was confirmed to be 4.5% higher. Additionally,
the algorithm proposed by Gu et al. was applied to the VMD dataset used in this paper,
and the algorithm proposed in this paper again achieved higher performance. The reason
for the higher accuracy is that the proposed system extracts mutual information from
audio using Wav2vec 2.0 and utilizes a 1D CNN and LSTM, which can effectively
discriminate audio data.
In Table 4, the datasets used in previous studies are MediaEval 2014, MediaEval 2015,
the violent video dataset (VSD), and the violent scenes dataset (VCD). Among them, VSD
and VCD are datasets in which the authors of those papers directly extracted violent
scenes from movies and YouTube. On the other hand, MediaEval 2014 and MediaEval 2015 are
the most widely used public datasets for discriminating violent scenes and were extracted
from hundreds of movies [5, 6, 13, 14].
Table 4. Experimental results and comparison.
Researcher | Algorithm | Dataset | Accuracy | Average Precision
Sarman and Sert [4] | Random forest | MediaEval 2014 | - | 68.80%
Potharaju et al. [5] | SVM | VSD | 78.22% | -
Gu et al. [6] | Mel spectrogram + VGGish | VCD | 80.55% | -
Gu et al. [6] | Mel spectrogram + VGGish | MediaEval 2015* | - | 14.16%
Our group | Proposed system | MediaEval 2015* | - | 18.69%
Our group | Mel spectrogram + VGGish | VMD | 89.75% | 79.5%
Our group | Proposed system | VMD | 96.25% | 99.50%
* Since the MediaEval data are unbalanced, average precision is used as the evaluation metric.
5. Conclusion
Automatic identification of violent scenes is required to protect users from unwanted
violent media. In this study, a system was proposed to discriminate violent movie scenes
based on audio signals. The proposed system uses Wav2vec 2.0 for audio feature extraction
and a 1D CNN-LSTM combination to classify the extracted audio features into violent and
non-violent scenes. The proposed system discriminated violent scenes with an accuracy of
96.25% on VMD, which is superior to the results of previous studies. This study
considered only audio features to discriminate movie scenes as violent or non-violent.
Although it is generally more effective to discriminate violent scenes using visual
information together with audio signals, the proposed approach is expected to be
particularly effective for media with limited visual information, such as radio.
ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the
National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1A2C2004624
and NRF-2018R1A6A1A03025523).
REFERENCES
D. A. Gentile, "Media Violence as a Risk Factor for Children: A Longitudinal Study," American Psychological Society 16th Annual Convention, Chicago, IL, 2004.
M. Shafaei, N. S. Samghabadi, S. Kar, and T. Solorio, "Rating for Parents: Predicting Children Suitability Rating for Movies Based on Language of the Movies," arXiv preprint arXiv:1908.07819, 2019.
G. Mu, H. Cao, and Q. Jin, "Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features," Chinese Conference on Pattern Recognition, Springer, Singapore, 2016.
S. Sarman and M. Sert, "Audio Based Violent Scene Classification Using Ensemble Learning," 2018 6th International Symposium on Digital Forensic and Security (ISDFS), IEEE, 2018.
Y. Potharaju, M. Kamsali, and C. R. Kesavari, "Classification of Ontological Violence Content Detection through Audio Features and Supervised Learning," International Journal of Intelligent Engineering and Systems, vol. 12, no. 3, pp. 20-230, 2019.
C. Gu, X. Wu, and S. Wang, "Violent Video Detection Based on Semantic Correspondence," IEEE Access, vol. 8, pp. 85958-85967, 2020.
A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," Jun. 2020.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, pp. 1097-1105, 2012.
S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, "1D Convolutional Neural Networks and Applications: A Survey," arXiv preprint arXiv:1905.03554, 2019.
D. L. Olson and D. Delen, Advanced Data Mining Techniques, Springer Science & Business Media, 2008.
M. Schedl et al., "VSD2014: A Dataset for Violent Scenes Detection in Hollywood Movies and Web Videos," 2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1-6, 2015.
M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, and L. Chen, "The MediaEval 2015 Affective Impact of Movies Task," MediaEval 2015 Workshop, 2015.
Author
Huiyong Bak received his B.S. from the Department of Mechatronics Engineering,
Inha University, Incheon, Republic of Korea in 2021. He is currently pursuing an M.S.
in the Department of Electrical and Computer Engineering, Inha University. His research
interests include deep learning using audio signals.
Sangmin Lee received a B.S., an M.S., and a Ph.D. from Inha University, all in
electronic engineering, in 1987, 1989, and 2000, respectively. He is currently a Professor
with the School of Electronic Engineering, Inha University, Korea. His research interests
include bio-signal processing and psycho-acoustics.