
  1. (College of Music, Guangxi Arts University, Nanning, 530022, China)
  2. (Fielding School of Public Health, UCLA, California, 90024, US)



Keywords: Mobile terminal, music visualization, emotion recognition, convolutional neural network

1. Introduction

Music is an indispensable part of human activity; it links rhythm with expressive content and generates emotion [1]. Emotional expression is one of music’s creative purposes, and its visualization is an important branch of information visualization, so the visual design of emotional expression in music art has become a new research hotspot [2]. At the same time, with the rapid development of electronic and information technology, mobile terminal devices are used ever more widely. These terminals can not only carry different mobile applications but are also becoming more intelligent and diversified as communication technology and the mobile Internet develop. This makes combining music emotion visualization with mobile terminals a promising development trend.

Recognizing the emotion in music is a prerequisite for realizing the visual design of music emotional expression. However, current research often classifies music emotion with typical machine learning methods such as support vector machines, and the lack of unified standards for emotion recognition systems, together with the difficulty of music feature analysis, hinders the further development of music emotion recognition technology and visual design [3]. Therefore, this research presents a method of music visual design on mobile terminals and innovatively applies a deep neural network to the recognition of music emotion. The goal is to improve the accuracy of music emotion recognition and provide further technical support for realizing music visual design.

2. Related Works

Music emotion recognition technology is a key element in the design of visualization of emotion expressions in music art, and by studying music emotion recognition models, we can lay the necessary technical foundation for visualization design [4]. In current music emotion recognition, better extraction of emotion features in music and improving the performance of emotion recognition classifiers are the main research aspects. Deep learning has good feature extraction and recognition capabilities, and research on neural networks provides a strong reference for the improvement of music emotion recognition technology.

Liu et al. designed an interference recognition framework consisting of convolutional neural networks and feature fusion. They obtained recognition input through preprocessing, introduced a residual neural network to extract deep features, and used a fully connected layer to output the recognition content of interference signals. The results showed that it effectively reduced the loss of potential features and improved the generalization ability to deal with uncertainty [5]. Xing’s team proposed a convolutional neural network-based recognition framework in the field of fraudulent phone call recognition. A deep learning method based on convolutional neural networks that learns call behavior and phone number features was shown to improve classification accuracy [6].

Luo and other professionals combined an extreme learning machine with a deep convolutional neural network when dealing with poor generalization performance and low accuracy in finger vein recognition. They removed the fully connected layer from the deep convolutional network and added an extreme learning layer to recognize the extracted feature vectors. The results show that it can automatically extract finger vein features and reduce the loss of valid information with high accuracy and generalization ability [7].

Pustokhina et al. proposed a recognition method based on optimal K-means and convolutional neural networks for intelligent license plate recognition in traffic processes. This method divides the license plate recognition process into three stages: license plate detection, license plate image segmentation, and license plate number recognition. The simulation results show that this method has high operational efficiency [8]. Nandankar and other scholars proposed a prediction model based on long short-term memory (LSTM) networks for pneumonia-related disease in order to achieve statistical analysis of disease data, using different hidden layers to process the data. The results show that the finely tuned LSTM model can accurately predict the relevant results of pneumonia data [9].

Recurrent neural networks in various fields of recognition provide reference ideas for musical emotion recognition. Taeseung’s team developed a skeleton-less gesture signal detection algorithm for traffic control gesture recognition. It used recurrent neural networks to process gesture time length variations mixed with noise and random pauses as a way to recognize six gesture signals. The results showed that its accuracy was as high as 91% [10].

Wu et al. combined a progressive scale expansion network and a convolutional neural network to detect and identify video image content and obtain the image serial number. The results showed that the recognition accuracy was 96% [11]. Bah and other researchers designed an end-to-end recognition system that used deep residual networks for emotion and facial expression. The test results on the FERGIT dataset showed accuracies of 75% and 97% in classifying facial emotions [12]. Yang et al. developed a weighted hybrid deep neural network for automatic facial expression recognition, in which the outputs of two channels are fused in a weighted manner and the final recognition results are calculated with the SoftMax function. The test results showed that it was able to recognize six basic facial expressions with a high average accuracy of 92.3% [13].

Wang’s research team improved an artificial neural network with three layers. They used solar radiation and temperature as inputs and five physical parameters of a single diode model as outputs. The results showed that it was able to accurately recognize current changes [14]. Yu et al. proposed a student emotion classification model based on a neural network to reduce the difficulty of understanding sentence text expression. A regularization method is introduced into the LSTM so that the output at each time step has a different correlation with the output of the previous step. The results indicate that this model has better overall performance than traditional models and can correctly classify student emotions [15].

In summary, in practical applications of convolutional and recurrent neural networks to fields such as image feature extraction, face recognition, and signal recognition, most researchers have improved the networks accordingly and achieved high accuracy. However, there is less research on the extraction and recognition of musical emotion features. Therefore, this research improves deep neural networks for recognizing musical emotion expressions, establishes a better design model for visualizing musical emotion expression, and further develops music visualization design on mobile terminals.

3. Visual Design of Emotional Expressions of Music Art on Mobile Terminals

3.1 Music Visualization Design and Emotion Modeling on Mobile Terminals

To realize music visualization design on mobile terminals, we first analyzed its characteristics. Music visualization design presents music in visual form on the mobile terminal, transforming auditory information into visual information so that the emotional and physical properties of music (its two main aspects) convey the corresponding music information. In practical applications, the visual design of music is expressed in various ways; according to the type of visual element, these include graphics, text, pictures, and video [16]. To achieve better design results, we took functional objectives and design objectives as the main service objects of visual design and took harmony, accuracy, readability, and beauty as the macro design objectives.

Music visualization design on mobile terminals has similarities with conventional data visualization, but music information has certain particularity, so it is more special in terms of methods and processes [17]. Music visualization design should build conversion rules according to functional objectives and specify the types of data to be converted. Music physical data types generally include timbre, pitch, and intensity, while music emotional data sets a fixed matching mode according to music content.

Based on the selected music data type, we match it with the appropriate visual representation and formulate the final visual conversion rules. Then, we need to cooperate with the interface and interaction design to embed the visualization into the whole application and achieve a high degree of cooperation with the product in the two aspects of interaction function and visual interface. Finally, it is necessary to make relevant instructions for visual design to reduce the learning cost of users [18]. The proposed music visualization design strategy for mobile terminals is shown in Fig. 1.
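To make the strategy concrete, the sketch below illustrates what such conversion rules might look like in code. Every mapping in it (emotion names, colors, motifs, and value ranges) is a hypothetical example for illustration, not a rule proposed in this paper.

```python
# Hypothetical sketch of music-to-visual conversion rules; all mappings
# (emotion names, colors, motifs, ranges) are illustrative assumptions.

EMOTION_TO_PALETTE = {           # emotional data -> fixed matching mode
    "happy":   ("#FFC43D", "bouncing circles"),
    "anxious": ("#D7263D", "jagged lines"),
    "sad":     ("#1B4965", "slow waves"),
    "relaxed": ("#7FB069", "soft gradients"),
}

def physical_to_visual(pitch_hz: float, intensity_db: float) -> dict:
    """Map physical music data (pitch, intensity) to visual attributes in [0, 1]."""
    hue = min(max((pitch_hz - 27.5) / (4186.0 - 27.5), 0.0), 1.0)   # piano range -> hue
    size = min(max((intensity_db + 60.0) / 60.0, 0.0), 1.0)         # -60..0 dBFS -> size
    return {"hue": hue, "size": size}

def build_frame(emotion: str, pitch_hz: float, intensity_db: float) -> dict:
    """Combine emotional and physical data into one visual frame description."""
    color, motif = EMOTION_TO_PALETTE[emotion]
    return {"color": color, "motif": motif, **physical_to_visual(pitch_hz, intensity_db)}
```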

Fig. 1. Music visualization design strategy on mobile terminals.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig1.png
Fig. 2. V-A emotion model diagram.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig2.png

The visual design of music on mobile terminals poses significant challenges in practice. Many music visualizations are treated as mere works of art and lack systematic, scientific study. In visualizing musical information, the first task is to convert the information into a visual form that facilitates organization and presentation. Information that lacks an inherent visual image requires constructing an artificial one to establish a link between the visual image and the musical message.

The visualization of emotional expressions in music requires an understanding of the definition of and access to emotion in music [19]. The first step is to establish a common, systematic model of emotion in music by defining and consistently quantifying the different emotional factors. The second step is to collect relevant music data based on the established emotion model, analyze the emotion characteristics, and finally output the emotion information results (i.e., music emotion recognition) [20].

The valence-arousal (V-A) music emotion model proposed by Russell was used, which represents emotional states as points in a two-dimensional space of activation (arousal) and valence. The horizontal axis represents valence, and the vertical axis represents activation. Valence reflects the degree of negative or positive emotion, with smaller values indicating more negative musical emotion, and vice versa. The different discrete points in the V-A emotion model are obtained according to the relationship between a specific emotion and valence (horizontal axis) and activation (vertical axis). For example, the activation dimension, ranging from calm to energetic, indicates the degree of excitement of an emotion (that is, its intensity): positive activation corresponds to excited emotions, negative activation to calm emotions, and the vertical distance from the emotion to the origin represents the degree of calm or excitement. The V-A emotional model is shown in Fig. 2.

The V-A two-dimensional space is mapped into four discrete categories: (-V+A), (+V+A), (+V-A), and (-V-A). These four categories correspond to four typical emotions in the emotion model, which serve as the musical emotion categories. The relationships corresponding to the four musical emotions are shown in Table 1.

Table 1. Correspondence of four categories of music emotion.

Category                     Emotion        V-A value
The first kind of emotion    Happy          +V+A
The second kind of emotion   Anxious        -V+A
The third kind of emotion    Sentimental    -V-A
The fourth kind of emotion   Relaxed        +V-A
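The quadrant-to-category correspondence in Table 1 can be written directly in code. The minimal sketch below assumes valence and arousal are signed values relative to the origin of the V-A model; how points lying exactly on an axis are handled is an arbitrary choice.

```python
def va_to_category(valence: float, arousal: float) -> str:
    """Map a point in V-A space to one of the four emotion categories of Table 1.
    Valence and arousal are signed offsets from the origin of the model."""
    if valence >= 0 and arousal >= 0:
        return "happy"        # +V+A
    if valence < 0 and arousal >= 0:
        return "anxious"      # -V+A
    if valence < 0 and arousal < 0:
        return "sentimental"  # -V-A
    return "relaxed"          # +V-A
```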

After confirming the music emotion model, the mel sound spectrum was chosen as the spectral feature. Feature extraction first requires a short-time Fourier transform of the music’s sound signal [21]. Then, the frequencies of the amplitude spectrum are mapped onto the mel scale, and the amplitudes are transformed by the mel filterbank to obtain a mel-spectral representation of each frame. Finally, the spectra within the analysis window are spliced to obtain the corresponding mel sound spectrum.
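A minimal sketch of this extraction pipeline using librosa is given below; the sampling rate, frame length, hop length, and number of mel bands are illustrative choices, not the settings used in this study.

```python
import librosa
import numpy as np

def mel_spectrogram(path: str, sr: int = 22050, n_fft: int = 2048,
                    hop_length: int = 512, n_mels: int = 96) -> np.ndarray:
    """Short-time Fourier transform -> mel filterbank -> log amplitude.
    Frame and filterbank sizes are illustrative, not the paper's settings."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape (n_mels, n_frames), dB scale
```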

To improve the efficiency of mining the emotional features of the music, a weighted combination of the residual phase (RP) and mel frequency cepstral coefficients (MFCC) is introduced. The RP is the cosine of the phase function of the resolved (analytic) signal derived from the linear prediction residual of the music signal. The music sample at moment $t$ can be estimated as a linear combination of multiple past samples [22]. The predicted music sample is shown in Eq. (1).

(1)
$ \overset{\cdot }{s}\left(t\right)=\sum _{k=1}^{p}a_{k}s\left(t-k\right) $

In Eq. (1), $p$ represents the prediction order, $\overset{\cdot }{s}(t)$ is the predicted music sample, $s(t)$ represents the actual value, and $a_{k}$ represents the set of linear prediction coefficients. The prediction error is shown in Eq. (2).

(2)
$ e\left(t\right)=s\left(t\right)-\overset{\cdot }{s}\left(t\right) $

In Eq. (2), $e(t)$ represents the prediction error. The prediction error is minimized to obtain the linear prediction coefficients, and the resulting error signal is the linear prediction residual of the music signal. From this, the resolved signal is calculated as shown in Eq. (3).

(3)
$ \left\{\begin{array}{l} r_{a}\left(t\right)=r\left(t\right)+jr_{h}\left(t\right)\\ r_{h}\left(t\right)=IFT\left[R_{h}\left(w\right)\right] \end{array}\right. $

In Eq. (3), $r(t)$ represents the linear prediction residual of the music signal, $r_{a}(t)$ is the resolved signal, $R_{h}(w)$ represents the Fourier transform of $r(t)$, $r_{h}(t)$ represents the Hilbert transform of $r(t)$, and $IFT$ is the inverse Fourier transform. The envelope of the resolved signal is expressed as shown in Eq. (4).

(4)
$ h_{e}\left(t\right)=\left| r_{a}\left(t\right)\right| =\sqrt{r_{h}^{2}\left(t\right)+r^{2}\left(t\right)} $

There is much information related to musical emotion in the linear prediction residuals, and it is beneficial to extract emotion-specific information from the musical signal by calculating the residual phase, which is also known as the cosine of the phase of the resolved signal, as shown in Eq. (5).

(5)
$ \cos \left(\theta \left(t\right)\right)=\frac{R_{e}\left[r_{a}\left(t\right)\right]}{\left| r_{a}\left(t\right)\right| }=\frac{r\left(t\right)}{h_{e}\left(t\right)} $

In Eq. (5), $\cos (\theta (t))$ represents the cosine of the phase of the resolved signal. The RP features and MFCC features are combined with weights to form the final output, thus improving the model’s ability to extract emotion features.
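A minimal sketch of Eqs. (1)-(5) and the weighted RP-MFCC fusion is given below, assuming librosa and SciPy. The LPC order, frame sizes, fusion weights, and the frame-averaging used to align the sample-level RP with the MFCC frames are illustrative assumptions, not the paper’s exact procedure.

```python
import numpy as np
import librosa
from scipy.signal import hilbert, lfilter

def rp_mfcc_features(y: np.ndarray, sr: int, lpc_order: int = 12,
                     n_mfcc: int = 13, w_rp: float = 0.4, w_mfcc: float = 0.6):
    """Weighted combination of residual-phase (RP) and MFCC features.
    LPC order, frame sizes, and weights are illustrative assumptions."""
    # Linear prediction: residual e(t) = s(t) - sum_k a_k s(t-k)   (Eqs. 1-2)
    a = librosa.lpc(y, order=lpc_order)
    residual = lfilter(a, [1.0], y)

    # Analytic ("resolved") signal via Hilbert transform (Eq. 3) and its envelope (Eq. 4)
    analytic = hilbert(residual)
    envelope = np.abs(analytic) + 1e-12

    # Residual phase = cosine of the phase of the analytic signal (Eq. 5)
    rp = np.real(analytic) / envelope

    # Align RP with MFCC frames by averaging RP over each analysis frame
    frame_len, hop = 2048, 512
    rp_frames = librosa.util.frame(rp, frame_length=frame_len,
                                   hop_length=hop).mean(axis=0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)

    # One simple way to fuse the two feature streams with fixed weights
    n = min(rp_frames.shape[0], mfcc.shape[1])
    return w_mfcc * mfcc[:, :n] + w_rp * rp_frames[None, :n]
```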

3.2 Deep Neural Network-based Music Emotion Recognition

After using the sound spectrogram as the feature input, deep learning is used for music emotion recognition. Deep learning can learn the relationship between high-level concepts and underlying features from audio data and thus bridge the gap between the emotional semantics of the music and the features of the audio signal for the purpose of emotion recognition [23]. Among deep learning networks, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have shown good ability in extracting temporal sequences and synthesizing image features, respectively [24]. A CNN can effectively identify the underlying patterns in the data and obtain more abstract features by stacking convolution kernels, while one-dimensional convolution is often used for the analysis of sensor data or time series and is suitable for audio signal data [25]. The original audio signal is converted to a sound spectrum, which is represented as a grey-scale picture, whose convolution is calculated as shown in Eq. (6).

(6)
$ a_{i,j}=h\left(\sum _{m=0}^{f_{w}-1}\sum _{n=0}^{f_{h}-1}w_{m,n}x_{i+m,j+n}+b\right) $

In Eq. (6), $a_{i,j}$ represents the element of the feature map at row $i$ and column $j$ (indexing its height and width), $h$ is the activation function of the convolution layer, $f_{h}$ is the height of the convolution kernel, $b$ is the bias of the convolution, $f_{w}$ is the width of the convolution kernel, $w$ is the weight matrix of the convolution kernel, and $x$ is the data input to the convolution kernel. The height of the convolution kernel is equal to the frequency range of the sound spectrum, and the convolution operation is shown in Eq. (7).

(7)
$ conv\left(X,W\right)=\sum _{m=0}^{f_{w}-1}\sum _{n=0}^{L-1}w_{m,n}x_{i+m,j+n} $

The output of the convolution kernel is represented by $R$. The convolution operation is simplified as shown in Eq. (8).

(8)
$ R=conv\left(X,W\right)+B $

In Eq. (8), $B$ represents the bias matrix. The width of the output of the convolution kernel is shown in Eq. (9).

(9)
$ R_{w}=\frac{c-f_{w}+2q}{v}+1 $

In Eq. (9), $c$ represents the width of the sound spectrum, $q$ is the size of the padding, $v$ is the stride, and $R_{w}$ is the width of $R$. Since the one-dimensional convolution only translates in the time dimension of the sound spectrum, the frequency dimension of the sound spectrum becomes 1 after convolution. A gated linear unit is then added, and the expression is shown in Eq. (10).

(10)
$ I=Conv1D_{1}\left(L\right)\otimes \sigma \left(Conv1D_{2}\left(L\right)\right) $

In Eq. (10), $Conv1D_{1}$ and $Conv1D_{2}$ represent one-dimensional convolutions that are structurally identical but do not share weights, $\sigma $ represents the sigmoid activation function, $I$ represents the output, and $L$ represents the sound spectrum sequence to be processed. The gated linear unit and the convolution form a one-dimensional gated convolution unit, and a residual structure is then introduced to cope with the vanishing gradient problem, as shown in Eq. (11).

(11)
$ x_{l+1}=x_{l}+F\left(x_{l},W_{l}\right) $

In Eq. (11), $x_{l+1}$ represents the output of layer $l$ of the network, $x_{l}$ is its input, and $F(x_{l},W_{l})$ is the network mapping (the convolution operation in the case of convolutional networks). The residual gated convolution unit is shown in Eq. (12).

(12)
$ I=L+Conv1D_{1}\left(L\right)\otimes \sigma \left(Conv1D_{2}\left(L\right)\right) $

The resulting residual gating unit also enables information to be transmitted over multiple channels, as shown in Eq. (13).

(13)
$ \left\{\begin{array}{l} I=L\otimes \left(1-\sigma \right)+Conv1D_{1}\left(L\right)\otimes \sigma \\ \sigma =\sigma \left(Conv1D_{2}\left(L\right)\right) \end{array}\right. $
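A minimal PyTorch sketch of the residual gated convolution unit in the multi-channel form of Eq. (13) is shown below; the channel count and kernel size are assumed values, and padding is chosen only so that the time dimension is preserved.

```python
import torch
import torch.nn as nn

class ResidualGatedConv1d(nn.Module):
    """Residual gated convolution unit in the form of Eq. (13):
    I = L * (1 - sigma) + Conv1D_1(L) * sigma, with sigma = sigmoid(Conv1D_2(L)).
    The two convolutions are structurally identical but do not share weights."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2                 # keep the time dimension unchanged
        self.conv_a = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.conv_b = nn.Conv1d(channels, channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, channels, time)
        gate = torch.sigmoid(self.conv_b(x))
        return x * (1.0 - gate) + self.conv_a(x) * gate
```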
Fig. 3. Basic network structure diagram of RNN.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig3.png

The stacking of convolutional layers enables the extraction of more abstract spectral features, but the music signal is ultimately temporal information and, even after conversion to a mel sound spectrum, retains a sequential nature in the time dimension. Therefore, the convolutional network is combined with a recurrent neural network. In a recurrent neural network, the output state of the hidden layer depends both on the input at the current moment and on the state of the hidden layer at the previous moment, giving the network memory-like properties [25]. The basic network structure is shown in Fig. 3.

The state of the hidden layer at step $i$ is expressed as $H_{i}$ and is calculated as shown in Eq. (14).

(14)
$ H_{i}=f\left(WX_{i}+VH_{i-1}+b\right) $

In Eq. (14), $H_{i-1}$ represents the state of the hidden layer at the previous moment, $b$ represents the bias term, $f(\cdot )$ represents the non-linear activation function (usually $\tanh $), and $X_{i}$ represents the input of step $i$. The output of the network at step $i$ is represented by $O_{i}$ and is calculated as shown in Eq. (15).

(15)
$ O_{i}=UH_{i}+d $

In Eq. (15), $U$ is the connection matrix, and $d$ is the bias term. When the sequence is too long, it is difficult for the RNN to carry information from earlier time steps to later ones, and the RNN suffers from vanishing gradients during backpropagation, so a long short-term memory (LSTM) network was added. The LSTM structure is shown in Fig. 4.
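For reference, a single recurrence step of Eqs. (14) and (15) can be written directly; the sketch below assumes compatible matrix dimensions and uses tanh as the activation.

```python
import numpy as np

def rnn_step(x_i: np.ndarray, h_prev: np.ndarray,
             W: np.ndarray, V: np.ndarray, U: np.ndarray,
             b: np.ndarray, d: np.ndarray):
    """One recurrence step of Eqs. (14)-(15):
    H_i = tanh(W x_i + V H_{i-1} + b),   O_i = U H_i + d."""
    h_i = np.tanh(W @ x_i + V @ h_prev + b)   # hidden state at step i (Eq. 14)
    o_i = U @ h_i + d                         # network output at step i (Eq. 15)
    return h_i, o_i
```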

Fig. 4. LSTM network structure unit diagram.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig4.png
Fig. 5. Convolutional cyclic neural network music emotion recognition model diagram.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig5.png

To better extract the multi-directional dependencies in the musical feature sequences and closely match the way the brain perceives musical emotions, a bidirectional recurrent neural network (BRNN) was used for the classification process of temporal features. The BRNN takes into account both preceding and following inputs, and the final output of the network is the sum of the reverse and forward at each step, as shown in Eq. (16).

(16)
$ O_{i}=U\overset{\rightarrow }{H_{i}}+U'\overset{\leftarrow }{H_{i}} $

In Eq. (16), $\overset{\rightarrow }{H_{i}}$ and $\overset{\leftarrow }{H_{i}}$ represent the forward and backward hidden-layer states of the bidirectional RNN. A convolutional bidirectional recurrent neural network (CBRNN) was formed by combining a convolutional network based on the residual gated convolutional structure with a bidirectional RNN, and a music emotion recognition model was established. In this process, the sound spectrum is first passed through the convolutional layers to obtain a feature map containing high-level abstract features; the feature map is then unrolled along time to obtain a convolutional feature sequence, which is fed into the BRNN to extract time-series features and perform the final classification. The overall structure of the model is shown in Fig. 5.
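A minimal PyTorch sketch of this CBRNN pipeline is given below. The layer sizes and depth are assumed values, the bidirectional LSTM concatenates rather than sums the forward and backward states, and classification uses the state at the last time step; these are simplifications of the model described above, not its exact implementation.

```python
import torch
import torch.nn as nn

class ResidualGatedConv1d(nn.Module):
    """Residual gated 1-D convolution (Eq. 13), repeated here for completeness."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        p = kernel_size // 2
        self.conv_a = nn.Conv1d(channels, channels, kernel_size, padding=p)
        self.conv_b = nn.Conv1d(channels, channels, kernel_size, padding=p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.conv_b(x))
        return x * (1.0 - gate) + self.conv_a(x) * gate

class CBRNN(nn.Module):
    """Sketch of the convolutional bidirectional recurrent model: stacked residual
    gated convolutions over the mel spectrogram, a bidirectional LSTM over the
    resulting feature sequence, and a 4-class emotion classifier (assumed sizes)."""
    def __init__(self, n_mels: int = 96, hidden: int = 128,
                 n_classes: int = 4, depth: int = 3):
        super().__init__()
        self.conv = nn.Sequential(*[ResidualGatedConv1d(n_mels) for _ in range(depth)])
        self.rnn = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:   # spec: (batch, n_mels, time)
        feats = self.conv(spec)              # high-level abstract spectral features
        seq = feats.permute(0, 2, 1)         # unroll along time: (batch, time, channels)
        out, _ = self.rnn(seq)               # forward/backward states, concatenated
        return self.fc(out[:, -1, :])        # logits over the four emotion categories
```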

4. Analysis of the Effectiveness of the Application of the Music Emotion Recognition Model

To validate the proposed CBRNN method, it was compared with four commonly used music emotion recognition models: K-nearest neighbor (KNN), support vector machine (SVM), ensemble learning (EL), and acoustic emotion Gaussians (AEG). Information on the experimental environment is shown in Table 2. The five methods were first used for experiments on two publicly available datasets, the Sound-track dataset and Song’s dataset. The random identification experiments were repeated 10 times on each dataset, and the results obtained are shown in Fig. 6.

Table 2. Information of experimental environment.

Index                      Performance parameter
Operating system version   Android 11
System digits              64-bit
Internal storage           8.00 GB
Processor                  Snapdragon 865 CPU, 2.84 GHz
Experimental platform      AI Benchmark v5.0.0
Time efficiency            82%~85%

Fig. 6(a) shows the music emotion recognition results of SVM, KNN, EL, AEG, and the proposed method on the Sound-track dataset, and Fig. 6(b) shows the emotion recognition results of the five methods on Song’s dataset. The results are presented as accuracy. From Fig. 6(a), it can be seen that the results of the 10 random tests fluctuate for all five methods on the Sound-track dataset. The SVM algorithm reached its highest accuracy of 79% in the 8th run and its lowest of 74% in the first. The KNN algorithm’s accuracy ranged from 73% to 78%, the EL algorithm’s highest accuracy was 80%, and the AEG algorithm’s accuracy was stable at around 84%. The accuracy of the method proposed in this study remained above 88%, reaching a maximum of 90%.

Fig. 6. Experimental results of five methods in two open datasets.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig6.png

As can be seen from Fig. 6(b), the accuracies of the SVM and KNN methods were relatively similar, both reaching up to 79%. The EL and AEG algorithms reached 83% and 87%, respectively, while the proposed CBRNN stabilized above 90% with the highest accuracy and the least fluctuation. To fully evaluate the recognition performance of CBRNN, the precision, recall, and F1 values were then measured on the Sound-track dataset and Song’s dataset and compared with the other four methods. The results are shown in Fig. 7.
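These metrics can be computed with scikit-learn as sketched below; the labels are hypothetical placeholders, with 0-3 indexing the four emotion categories, and macro averaging is one possible choice for aggregating across classes.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for illustration only; 0-3 index the four emotion categories.
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 3, 0, 2, 2, 3, 0, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```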

Fig. 7. Comparison results of precision, recall, and f1 values of five methods.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig7.png

In Fig. 7, for the Sound-track dataset, the performance of the KNN and SVM algorithms is similar, and the three metrics of EL and AEG differ little, while the accuracy of CBRNN shows the largest improvement, up to 21.31% over the other methods. On Song’s dataset, SVM had the worst performance, while CBRNN still had the highest accuracy of the five methods with better overall performance. The five methods were then tested on the AMG1608 dataset, which contains 1608 music clips and is the largest music database with continuous emotion labels in the V-A emotion space. The recognition error rates of the five methods on the AMG1608 dataset are shown in Fig. 8.

Fig. 8. Error rate of music emotion recognition using five methods in the AMG1608 dataset.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig8.png
Fig. 9. Accuracy of five methods for different music emotion recognition.
../../Resources/ieie/IEIESPC.2024.13.5.480/fig9.png

In Fig. 8, the error rate of the SVM algorithm fluctuates to different degrees as the number of music data samples increases but eventually reaches a relatively stable value. The error rate of the KNN algorithm fluctuates more than that of SVM, peaking at 35% before the sample size reaches 800 and dropping to 15-20% at its lowest over the course of the experiment. The error rates of the EL and AEG methods are relatively similar and eventually stabilize at around 15%.

The proposed CBRNN showed a clear decreasing trend in error rate before the sample size reached 800. After 800 samples, the error rate stabilized at around 10%, which was better than the other four methods. Finally, for practical validation, 4000 songs were selected from a well-known music platform in China, of which 79% were Chinese songs and 21% were English songs. The songs were classified into four emotion types (happy, sad, angry, and relaxed), and the five methods were used to identify their emotions. The results obtained are shown in Fig. 9.

Fig. 9 shows the emotion recognition results obtained by running each of the five methods three times on the selected music dataset. Figs. 9(a)-(d) correspond to the recognition accuracy results for the four emotion types of happy, sad, angry, and relaxed, respectively. From Fig. 9(a), it can be seen that for the recognition of happy emotion, the accuracy of SVM and KNN is below 90%, but both are higher than 84%. AEG and EL are slightly higher than the first two methods, but CBRNN has the highest accuracy of 96%.

For the recognition of sad emotion, SVM showed the lowest accuracy of 62%, while CBRNN still reached up to 90%. In recognizing the angry emotion, the difference in accuracy among the five methods across the three runs was small, with the proposed CBRNN obtaining the highest accuracy of 88%, a maximum improvement of 13% over the other four methods. In recognizing the relaxed emotion, CBRNN obtained the highest accuracy of 95%, again higher than the other four methods. In summary, the proposed CBRNN method has higher accuracy and better performance in recognizing the four emotion types and can be better used for music emotion recognition.

5. Conclusions

The development of information visualization has made the design of music visualization on mobile terminals a current research hotspot in this field. To realize the visual design of music emotion, it is crucial to establish a fast and accurate music emotion recognition model. This research was based on the valence-arousal music emotion model with a weighted combination of MFCC and RP for emotion feature extraction, and an optimized convolutional neural network was combined with a recurrent neural network and applied to emotion recognition. The experimental results show that the method achieves an accuracy of up to 92% in 10 random recognitions in the Sound-track dataset and Song’s dataset.

In the Sound-track dataset, the method achieved an accuracy improvement of up to 21.31%. In the AMG1608 music dataset, the error rate of the method started to plateau after the sample size increased to 800 and remained around 10%. In the selected dataset of 4000 songs, the method was able to effectively identify the four emotion types of relaxation, sadness, happiness, and anger with an accuracy of up to 96%, providing superior performance. However, the improved convolutional neural network did not incorporate an attention mechanism in this study, which limited further performance gains, so this area needs to be explored further.

REFERENCES

1 
Ma J, Du K, Zheng F, et al. A recognition method for cucumber diseases using leaf symptom images based on deep convolutional neural network. Computers and Electronics in Agriculture, 2018(154): 154-158.DOI
2 
Masayuki, Satoh. Cognitive and emotional processing in the brain of music. Japanese Journal of Neuropsychology, 2018, 34(4): 274-288.URL
3 
Liu G, Abolhasani M, Hang H. Disentangling effects of subjective and objective characteristics of advertising music. European Journal of Marketing, 2022, 56(4): 1153-1183.URL
4 
Ma Y. Research on the Arrangement and Visual Design of Aerobics under the New Situation. International Core Journal of Engineering, 2019, 5(9): 170-173.DOI
5 
Liu S, Zhu C. Jamming Recognition Based on Feature Fusion and Convolutional Neural Network. Journal of Beijing Institute of Technology, 2022, 31(2): 169-177.DOI
6 
Xing J,Wang, Shupeng D, Yu. Fraudulent phone call recognition method based on convolutional neural network. High Technology Letters, 2020, v.26(04): 21-25.URL
7 
Luo R, Zhang K. Research on Finger Vein Recognition Based on Improved Convolutional Neural Network. International Journal of Social Science and Education Research, 2020, 3(4): 107-114.URL
8 
Pustokhina I V, Pustokhin D A, Rodrigues J, Gupta D, Khanna A & Shankar K. Automatic Vehicle License Plate Recognition Using Optimal K-Means with Convolutional Neural Network for Intelligent Transportation Systems. IEEE Access, 2020, 8(12): 92907-92917.DOI
9 
Nandankar P V, Nalla A R, Gaddam R R, Gampala V, Kathiravan M & Karunakaran S. Early prediction and analysis of corona pandemic outbreak using deep learning technique. World Journal of Engineering, 2022, 19(4): 559-569.DOI
10 
Taeseung B, Yong-Gu L. Traffic control hand signal recognition using convolution and recurrent neural networks. Journal of Computational Design and Engineering, 2022(2): 2-5.DOI
11 
Wu, Xing G, Yuxi Z, Qingfeng C & Liming. Text Recognition of Barcode Images under Harsh Lighting Conditions. Wuhan University Journal of Natural Sciences, 2020, v.25; No.134(06): 60-66.URL
12 
Bah I, Yu X. Facial expression recognition using adapted residual based deep neural network. Intelligence & Robotics, 2022, 2(1): 72-88.URL
13 
Yang B, Cao J, Ni R & Zhang Y. Facial Expression Recognition Using Weighted Mixture Deep Neural Network Based on Double-Channel Facial Images. IEEE Access, 2018, 6:4630-4640.DOI
14 
Wang S, Zhang Y, Zhang C & Yang M. Improved artificial neural network method for predicting photovoltaic output performance. Global Energy Interconnection, 2021, 3(6): 553-561.DOI
15 
Yu H, Ji Y, Li Q. Student sentiment classification model based on GRU neural network and TF-IDF algorithm. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 2021(2): 40-45.URL
16 
Malandrino D, Pirozzi D, Zaccagnino R. Visualization and music harmony: Design, implementation, and evaluation[C]//2018 22nd International Conference Information Visualization (IV). IEEE, 2018: 498-503.DOI
17 
Wu K, Rege M. Hibiki: A Graph Visualization of Asian Music[C]//2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI). IEEE, 2019: 291-294.DOI
18 
Alvarado K G. Accessibility of music festivals: a British perspective. International Journal of Event and Festival Management, 2022, 13(2): 203-218.DOI
19 
Kim H R. Development of the Artwork Using Music Visualization Based on Sentiment Analysis of Lyrics. The Journal of the Korea Contents Association, 2020, 20(10): 89-99.DOI
20 
Hizlisoy S, Yildirim S, Tufekci Z. Music emotion recognition using convolutional long short term memory deep neural networks. Engineering Science and Technology, an International Journal, 2021, 24(3): 760-767.DOI
21 
Mirzazadeh Z S, Hassan J B, Mansoori A. Assignment model with multi-objective linear programming for allocating choice ranking using recurrent neural network. RAIRO - Operations Research, 2021, 55(5): 3107-3119.DOI
22 
Chen T P, Lin C L, Fan K C, Lin W Y & Kao C W. Radar Automatic Target Recognition Based on Real-Life HRRP of Ship Target by Using Convolutional Neural Network. Journal of information science and engineering: JISE, 2021(4): 37-39.DOI
23 
Jindal N, Kaur H. Graphics Forgery Recognition using Deep Convolutional Neural Network in Video for Trustworthiness. International journal of software innovation, 2020(4): 8-11.DOI
24 
Leonan E, Falqueto, José A & S. Oil Rig Recognition Using Convolutional Neural Network on Sentinel-1 SAR Images. Geoscience and Remote Sensing Letters, IEEE, 2019, 16(8): 1329-1333.DOI
25 
Xu J, Lv H, Zhuang Z, Lu Z, Zou D & Qin W. Control Chart Pattern Recognition Method Based on Improved One-dimensional Convolutional Neural Network-ScienceDirect. IFAC-PapersOnLine, 2019, 52(13): 1537-1542.DOI
Yihao Hou
../../Resources/ieie/IEIESPC.2024.13.5.480/au1.png

Yihao Hou earned a Bachelor's Degree in Keyboard Performance from Guangxi Arts Institute in 1992. She has worked at the institute since 1994 and is currently an Associate Professor and Head of the Piano Department. She has published an academic monograph, authored a core journal paper, and led five research projects. Her focus is on piano performance and teaching.

Zongzhe Lin
../../Resources/ieie/IEIESPC.2024.13.5.480/au2.png

Zongzhe Lin completed a Bachelor's degree at George Mason University in 2020 after studying there from 2015 to 2020. He went on to earn a Master's degree in Computer Science from the same institution in 2022. Currently, he is pursuing studies in Data Science at the University of California, Los Angeles, as of October 2024. Zongzhe has published two academic papers, with his research focusing on advanced predictive analytics in public health.