
  1. (Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Korea)

Violent scene discrimination, Wav2vec 2.0, Audio signal processing

1. Introduction

The movie industry produces thousands of movies every year; however, movies with violent content are not suitable for children. Watching violent scenes in movies tends to make children more aggressive and leads to unhealthy attitudes. Thus, it is imperative to have a violent scene discrimination system (VSDS) to protect children from viewing violence in movies. Moreover, these systems can be useful for child-suitability ratings for movies [1,2].

Since most violent scenes are related to the behavior of objects, visual information is commonly utilized to discriminate violent scenes. However, visual information does not capture audio cues such as screams, profanities, and other offensive language. Audio can also capture violent events that last less than a second, such as gunshots. Thus, it is beneficial to utilize audio information in violent scene discrimination.

Previous studies that have implemented audio-based violent scene discrimination are as follows. Mu et al. built a VSDS using 2D convolutional neural networks (CNNs) [3]. Sarman and Sert built a VSDS using the support vector machine (SVM), random forest, and bagging [4]. Potharaju et al. also built a VSDS using an SVM [5]. Gu et al. proposed a violent scene detection system using a mel spectrogram and the CNN-based VGGNet [6]. Among the previous VSDS studies, the study using the mel spectrogram and the CNN-based VGGish showed good performance, but it had two limitations. First, the mel spectrogram can extract unique features of audio signals, but it cannot extract the mutual information that audio data have in common. Second, VGGish was pre-trained using audio that is not related to violent scenes, such as sports and games.

To improve on the limitations in previous studies, a new system is proposed that discriminates violent scenes in movies by using audio signals. The proposed system extracts audio features with Wav2vec 2.0, which can extract mutual information in audio data. Audio features are then used as the input for a 1D CNN and long short-term memory (LSTM), which can effectively discriminate audio data, and violent scenes are discriminated through fully connected and softmax layers.

Section 2 describes the techniques in the proposed system, which is presented in Section 3. Section 4 describes the experiment conducted, how the proposed system was used in it, and the performance evaluation and results. Section 5 concludes the paper.

2. Technologies of the Proposed System

2.1 Wav2vec 2.0

As shown in Fig. 1, the speech input to Wav2vec 2.0 is converted into vectors of a specific length by a 1D CNN. These vectors, called latent speech representations, are the input for the transformer encoder, which creates contextualized representations that restore the masked parts using the surrounding information. Wav2vec 2.0 is trained so that the contextualized representations and the quantized representations become similar [7]. A Wav2vec 2.0 model trained in this way has the advantage of extracting mutual information common to audio data. Therefore, in the proposed system, audio features are extracted using a pre-trained Wav2vec 2.0, a model that learns its representations from unlabeled human speech through self-supervised learning.
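The training objective described above can be illustrated with a minimal sketch of a contrastive loss for a single masked time step. It assumes cosine similarity and a temperature of 0.1, and it covers only the contrastive term, not the full Wav2vec 2.0 objective; the function names and the three-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Contrastive loss for one masked time step.

    context     -- contextualized representation from the transformer
    target      -- quantized representation of the same masked step
    distractors -- quantized representations taken from other steps
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # The target sits at index 0, so the loss is -log softmax(logits)[0].
    return log_denominator - logits[0]

# Toy vectors (hypothetical): the first context points toward the target,
# the second points toward one of the distractors.
target = [1.0, 0.0, 0.0]
distractors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_aligned = contrastive_loss([0.9, 0.1, 0.0], target, distractors)
loss_misaligned = contrastive_loss([0.0, 0.1, 0.9], target, distractors)
```

The loss is small when the contextualized representation is close to the true quantized representation and large when it is closer to a distractor, which is the sense in which training makes the two representations similar.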

Fig. 1. The structure of Wav2vec 2.0.

2.2 The CNN and LSTM

The CNN creates a feature map that captures the spatial characteristics of the data through the convolution layer. The feature map is reduced in size through pooling, which compresses the features. After repeating this process, the data are classified using fully connected and softmax layers [8]. Therefore, the proposed system extracts spatial features from the contextualized representations of Wav2vec 2.0 with the CNN.
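The convolution and pooling steps can be sketched in one dimension as follows. This is a toy illustration, not the network used in the paper: the signal, the peak-detecting kernel, and the pooling size are hypothetical values.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation, as in most deep learning libraries)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]

def max_pool1d(feature_map, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]

signal = [0, 0, 1, 4, 1, 0, 0, 3, 0]
kernel = [-1, 2, -1]  # hypothetical filter that responds to sharp peaks

feature_map = conv1d(signal, kernel)   # length 9 - 3 + 1 = 7
pooled = max_pool1d(feature_map, 2)    # length 3, the last element is dropped
```

The feature map responds strongly where the signal has a sharp peak, and pooling keeps the strongest response in each window while shortening the sequence.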

A recurrent neural network (RNN) structure is used to handle time series data, such as audio signals. The RNN learns from time series data by feeding the previous hidden state into the next step of the network. The RNN is limited in that the gradient required for backpropagation decreases or increases exponentially, depending on the length of the time series data. To overcome this limitation, the LSTM architecture adds a cell state to the RNN hidden state. Because backpropagation through the cell state does not pass through nonlinear functions such as tanh, the LSTM can prevent the vanishing and exploding gradients of the RNN [9]. Therefore, the proposed system extracts temporal characteristics from the contextualized representations of Wav2vec 2.0 with the LSTM.
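The role of the cell state can be illustrated with a minimal single-unit LSTM step. This is a sketch with scalar states and hypothetical weights (chosen so that the forget gate stays close to 1), not the configuration used in the proposed system.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM with scalar input, hidden and cell state.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))
    i = sigmoid(pre_activation("input"))
    o = sigmoid(pre_activation("output"))
    g = math.tanh(pre_activation("candidate"))

    # Additive update: the path from c_prev to c is only a multiplication
    # by the forget gate and does not pass through tanh.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights for illustration only.
params = {
    "forget": (0.0, 0.0, 4.0),     # forget gate close to 1: keep the memory
    "input": (2.0, 0.0, -1.0),
    "output": (0.0, 0.0, 1.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
cell_states = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:  # one input event followed by silence
    h, c = lstm_step(x, h, c, params)
    cell_states.append(c)
```

With these weights, the value written into the cell state at the first step decays only slowly over the following steps, which illustrates how information can be carried across a long sequence.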

3. Violent Scene Discrimination

3.1 Proposed System Overview

As shown in Fig. 2, the proposed system inputs the audio signal into the backbone network, which uses the pre-trained Wav2vec 2.0 to extract features with mutual information. The transfer network is trained using the extracted features. The proposed system discriminates violent scenes using the trained transfer network and the backbone network.

Fig. 2. The proposed system.

3.2 Backbone of the Proposed System

The backbone network converts the input audio signal into audio features using Wav2vec 2.0. Because the Wav2vec 2.0 model is trained with unlabeled audio through self-supervised learning, it can extract the mutual information from audio signals.

3.3 Transfer Network in the System

The transfer network utilizes the CNN and LSTM. The CNN can consider spatial features using a convolutional layer. Because LSTM receives the previous hidden state as input, temporal characteristics can be considered. Because the transfer network uses both CNN and LSTM models, it has the advantage of simultaneously considering spatial and temporal characteristics.

Because the backbone network uses a 1D CNN, the 1D CNN is also used in the transfer network to preserve nonlinear information in the backbone network. A 1D CNN is suitable for audio because it can convolve 1D data [10].

LSTM exhibits good performance for time series data-prediction tasks. An LSTM increases the prediction accuracy of time series data by reducing the importance of the data at a point far from the prediction point, and increasing the importance of the data at points near the prediction point. Therefore, an LSTM with high prediction accuracy for time series data is used for the transfer network.

4. Experiment

4.1 Dataset used in the Experiment

The dataset used in this paper, called the Violent Movie Scenes Dataset (VMD), was generated to discriminate violent scenes. Because the concept of a violent scene is subjective and difficult to characterize, each audio clip from the movies was manually labeled by using the violent scene criteria from a previous study, as shown in Table 1 [5].

The details of the dataset used in this study are presented in Table 2. Violent and non-violent scenes were extracted from 69 movies. Of those movies, scenes from 34 were used for training, scenes from 15 movies were used for validation, and scenes from 20 movies were used for testing. In total, 2400 scenes were extracted from the 69 movies selected. Training and validation sets were used for training, and the testing set was used for evaluation.
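A split of this kind, in which all scenes of a movie fall into the same set, can be sketched as follows. The source does not state how movies were assigned to the three sets; the random assignment, the seed, the function name, and the movie identifiers below are hypothetical.

```python
import random

def split_by_movie(movie_ids, n_train, n_val, n_test, seed=0):
    """Assign whole movies to train/validation/test so no movie is shared."""
    movies = sorted(set(movie_ids))
    if len(movies) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of movies")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(movies)
    train = set(movies[:n_train])
    val = set(movies[n_train:n_train + n_val])
    test = set(movies[n_train + n_val:])
    return train, val, test

# Hypothetical identifiers for the 69 movies.
movie_ids = [f"movie_{k:02d}" for k in range(69)]
train_movies, val_movies, test_movies = split_by_movie(movie_ids, 34, 15, 20)
```

Splitting at the movie level, rather than at the scene level, keeps scenes from the same movie out of both the training and testing sets.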

Table 1. Criteria for classification of violent scenes [5].

Violent scenes
Person-related sound: Angry voice, Scream
Weapon-related sound: Gunshot, Bomb
Vehicle-related sound:
Fight-related sound:
Environment sound:


Table 2. Dataset used in the study.

Scene type | Training (34 Movies) | Validation (15 Movies) | Testing (20 Movies)

4.2 Implementation Details

4.2.1 Backbone Network in the System

The backbone network used the Wav2vec 2.0 base model without fine-tuning [7]. To reduce computation, the backbone network adopted the base model trained with 960 h of speech. As shown in Fig. 3, when audio is input to the backbone network, the network generates the audio features. The total number of parameters in the backbone network is 95 M.
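The number of feature frames that the backbone produces for a clip can be estimated from the kernel widths and strides of the seven convolutional layers in the Wav2vec 2.0 feature encoder, as given in the Wav2vec 2.0 paper [7]. The sketch below assumes 16 kHz audio and a one-second clip; the clip length and sampling rate are assumptions, since they are not stated in this section.

```python
# (kernel width, stride) of the seven convolutional layers in the
# Wav2vec 2.0 feature encoder, as described in the Wav2vec 2.0 paper [7].
FEATURE_ENCODER_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_output_frames(num_samples):
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in FEATURE_ENCODER_LAYERS:
        # Output length of a 1D convolution without padding.
        length = (length - kernel) // stride + 1
    return length

SAMPLE_RATE = 16_000  # assumption: 16 kHz audio
frames_per_second = num_output_frames(SAMPLE_RATE)  # 49 frames for one second
```

Under these assumptions, one second of audio yields 49 frames, which agrees with the last dimension of the audio features reported in Section 4.2.2; the dimension of 768 corresponds to the hidden size of the base model.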

Fig. 3. Processing the backbone network.

4.2.2 Transfer Network used in the Proposed System

As shown in Fig. 4, the transfer network transforms audio features sized $100 \times 768 \times 49$ into a feature map sized $16 \times 112 \times 720$ with spatial features through the 1D CNN.

The kernel size and the number of output channels of the 1D CNN were both set to 25. The transformed feature map is the input for the LSTM and is converted into an LSTM feature sized $16 \times 112 \times 48$ that captures how the data change over time. In the LSTM, the hidden dimension was 48 and the number of layers was 2. Subsequently, the LSTM features were classified as violent or non-violent by the fully connected and softmax layers.
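The final classification step can be sketched as a fully connected layer followed by a softmax over the two classes. The four-dimensional feature vector below is a hypothetical stand-in for the 48-dimensional LSTM feature, and the weights are made up for illustration rather than learned.

```python
import math

def linear(features, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["non-violent", "violent"]

# Hypothetical values for illustration only.
features = [0.2, -0.4, 0.7, 0.1]
weights = [[-0.5, 0.3, -1.0, 0.0],   # row for "non-violent"
           [0.5, -0.3, 1.0, 0.0]]    # row for "violent"
biases = [0.0, 0.0]

probabilities = softmax(linear(features, weights, biases))
prediction = LABELS[probabilities.index(max(probabilities))]
```

The softmax turns the two outputs of the fully connected layer into probabilities that sum to 1, and the class with the larger probability is taken as the prediction.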

The total number of parameters in the transfer network was 0.3 M.

Fig. 4. Processing of the transfer network.

4.3 Methods for Performance Evaluation

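Eqs. (1) and (2) were used to evaluate the performance of the proposed system. The basis for these metrics is the confusion matrix, which is presented in Table 3 [11]. In Eq. (2), $P$ is the number of ground-truth violent scenes, $N$ is the total number of data items, and $L_{i}=1$ when the $i$-th item is violent; otherwise, $L_{i}=0$. Following the standard definition of average precision, $P_{i}$ denotes the number of violent items among the first $i$ items when the items are ranked by predicted score.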

$ \mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \qquad (1) $
$ \mathrm{Average\ Precision}=\frac{1}{P}\sum_{i=1}^{N}L_{i}\frac{P_{i}}{i} \qquad (2) $
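Eqs. (1) and (2) can be computed as in the following minimal sketch. It assumes that the items are ranked by predicted violence score in descending order and that $P_{i}$ counts the violent items among the first $i$ ranked items; the confusion-matrix counts, scores, and labels are hypothetical and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (1): fraction of correctly classified scenes."""
    return (tp + tn) / (tp + tn + fp + fn)

def average_precision(scores, labels):
    """Eq. (2): labels are 1 for violent and 0 for non-violent."""
    # Rank items by predicted score, highest first.
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    total_positives = sum(ranked)          # P in Eq. (2)
    if total_positives == 0:
        raise ValueError("average precision needs at least one violent item")
    hits = 0                               # P_i: violent items among the top i
    precision_sum = 0.0
    for rank, label in enumerate(ranked, start=1):
        hits += label
        precision_sum += label * hits / rank   # L_i * P_i / i
    return precision_sum / total_positives

# Hypothetical values for illustration only.
example_accuracy = accuracy(tp=45, tn=40, fp=10, fn=5)
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
example_ap = average_precision(scores, labels)
```

In this toy example, the violent items sit at ranks 1, 3, and 4, so the average precision is the mean of 1/1, 2/3, and 3/4.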
Table 3. Confusion matrix of the experiment results.

Axes: True class, Predicted class

4.4 Results

The proposed system was trained with training and validation data. The result of evaluating the performance of the trained model with the testing data is the confusion matrix in Table 3. Table 4 displays the results from comparing the performance obtained using Eqs. (1) and (2) with those of previous studies.

To evaluate the performance of the algorithm proposed in this paper, it was compared with that of Gu et al. [6], which was the most recent of the previous studies and showed high performance. The VCD dataset used by Gu et al. was not disclosed, but MediaEval 2015 was. Therefore, the proposed system was applied to MediaEval 2015, and its performance was confirmed to be 4.5% higher. Additionally, the algorithm proposed by Gu et al. was applied to the VMD dataset used in this paper, and the performance of the proposed algorithm was again confirmed to be higher. The higher accuracy is attributed to extracting mutual information from audio using Wav2vec 2.0 and to utilizing a 1D CNN and LSTM, which can effectively discriminate audio data.

In Table 4, the datasets used in previous studies are MediaEval 2014, MediaEval 2015, the Violent video dataset (VSD), and the Violent scenes dataset (VCD). Among them, VSD and VCD are datasets for which the authors of the papers directly extracted violent scenes from movies and YouTube. On the other hand, MediaEval 2014 and MediaEval 2015 are the most widely used public datasets for discriminating violent scenes, and were extracted from hundreds of movies [5, 6, 13, 14].

Table 4. Experimental results and comparison.

Studies: Sarman and Sert [4]; Potharaju et al. [5]; Gu et al. [6]; Our group
Features/systems: Mel Spectrogram; Proposed System
Metric: Average Precision

* Since the data in MediaEval are unbalanced, average precision is used as the evaluation metric.

5. Conclusion

Automatic identification of violent scenes is required to protect users from unwanted and violent media. In this study, a system was proposed to discriminate violent movie scenes based on audio signals. The proposed system uses Wav2vec 2.0 for audio feature extraction, and a 1D CNN-LSTM combination to classify the extracted audio features into violent and non-violent scenes. The proposed system discriminated violent scenes with an accuracy of 96.25% when using VMD, which is superior to the results of previous studies. This study considered only audio features to discriminate movie scenes as violent or non-violent. Although it is generally more effective to discriminate violent scenes using visual information along with audio signals, the proposed approach is expected to be more effective for media with limited visual information, such as radio.


This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1A2C2004624 and NRF-2018R1A6A1A03025523).


[1] Gentile D. A., 2004, Media Violence as a Risk Factor for Children: A Longitudinal Study, American Psychological Society 16th Annual Convention, Chicago, IL
[2] Shafaei M., Samghabadi N. S., Kar S., Solorio T., 2019, Rating for Parents: Predicting Children Suitability Rating for Movies Based on Language of the Movies, arXiv preprint arXiv:1908.07819
[3] Mu G., Cao H., Jin Q., 2016, Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features, Chinese Conference on Pattern Recognition, Springer, Singapore
[4] Sarman S., Sert M., 2018, Audio Based Violent Scene Classification Using Ensemble Learning, 2018 6th International Symposium on Digital Forensic and Security (ISDFS), IEEE
[5] Potharaju Y., Kamsali M., Kesavari C. R., 2019, Classification of Ontological Violence Content Detection through Audio Features and Supervised Learning, International Journal of Intelligent Engineering and Systems, Vol. 12, No. 3, pp. 20-230
[6] Gu C., Wu X., Wang S., 2020, Violent Video Detection Based on Semantic Correspondence, IEEE Access, Vol. 8, pp. 85958-85967
[7] Baevski A., Zhou H., Mohamed A., Auli M., Jun. 2020, Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
[8] Krizhevsky A., Sutskever I., Hinton G. E., 2012, ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems, Vol. 25, pp. 1097-1105
[9] Hochreiter S., Schmidhuber J., 1997, Long Short-Term Memory, Neural Computation, Vol. 9, No. 8, pp. 1735-1780
[10] Kiranyaz S., Avci O., Abdeljaber O., Ince T., Gabbouj M., Inman D. J., 2019, 1D Convolutional Neural Networks and Applications: A Survey, arXiv preprint arXiv:1905.03554
[11] Olson D. L., Delen D., 2008, Advanced Data Mining Techniques, Springer Science & Business Media
[12] Schedl M., et al., 2015, VSD2014: A Dataset for Violent Scenes Detection in Hollywood Movies and Web Videos, 2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1-6
[13] Sjöberg M., Baveye Y., Wang H., Quang V. L., Ionescu B., Dellandréa E., Chen L., The MediaEval 2015 Affective Impact of Movies Task, MediaEval 2015 Workshop


Huiyong Bak

Huiyong Bak received his B.S. from the Department of Mechatronics Engineering, Inha University, Incheon, Republic of Korea in 2021. He is currently pursuing an M.S. in the Department of Electrical and Computer Engineering, Inha University. His research interests include deep learning using audio signals.

Sangmin Lee

Sangmin Lee received a B.S., an M.S., and a Ph.D. from Inha University, all in electronic engineering, in 1987, 1989, and 2000, respectively. He is currently a Professor with the School of Electronic Engineering, Inha University, Korea. His research interests include bio-signal processing and psycho-acoustics.