3.1 Research on the Motion Feature Screening Model Based on MDN
Classic motion data comprises motion capture data and key-frame-based motion data.
Both types require a large amount of manual processing, so they are costly and difficult
to produce [17]. As computer animation technology develops, virtual characters are
becoming more common and the demand for real human motion data keeps growing. Traditional
motion capture and manual production alone can no longer meet current needs, so deep
learning algorithms have been introduced into motion generation.
The deep neural network has a strong learning ability, which can overcome the limitations
of traditional machine learning algorithms in data acquisition to a certain extent.
Therefore, the experiment adopts a sequence generation model based on deep learning [18].
Deep neural networks can discover hidden structural features in data and adjust their
internal parameters accordingly, but they have limitations in processing continuous
sequence samples. Because dance movement samples are continuous sequences, the ideal
experimental effect cannot be achieved with deep neural networks alone. LSTM and
MDN can predict and generate action sequences of unknown length, so the experiment
combines these two algorithms to process the action data. LSTM is a recurrent neural
network with a chain structure composed of a memory cell, an input gate, a forget gate,
and an output gate. Gate control can solve the general sequence problem to a certain
extent. Fig. 1 presents the basic structure.
Fig. 1. Basic structure of LSTM.
A sigmoid neural network layer and a pointwise multiplication operation constitute an
LSTM gate. The neural network layer outputs a number between 0 and 1 to indicate how
much information passes: 0 means no information is retained, and 1 means all
information is passed. The forget gate decides which state information to retain or
discard, and its function expression is as follows.
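Eq. (1) can be written in the standard LSTM form, consistent with the symbol definitions below (the bias term $b_{f}$ is an assumed addition):

```latex
f_{t} = \gamma\left(W_{f}\cdot\left[h_{t-1},\, x_{t}\right] + b_{f}\right)
\tag{1}
```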
Eq. (1) reflects the degree to which the forget gate retains the information from the previous
moment, where $W_{f}$ is the weight of the forget gate, $h_{t-1}$ is the output at
the previous time, $x_{t}$ is the input value at the current time, and $\gamma$ is the
sigmoid neural network layer. The value obtained is converted to a number between
0 and 1 through the activation function, which determines the degree of retention. This
number is then multiplied by the cell state $C_{t-1}$ at the previous time to obtain
the proportion of previous information that is retained. The input gate determines
how much of the new state computed from the current input is stored in the cell state,
and its function can be expressed as follows.
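The standard input-gate equations, consistent with the symbol definitions below (bias terms $b_{i}$ and $b_{C}$ are assumed additions), are:

```latex
i_{t} = \gamma\left(W_{i}\cdot\left[h_{t-1},\, x_{t}\right] + b_{i}\right), \qquad
\widetilde{C_{t}} = \tanh\left(W_{C}\cdot\left[h_{t-1},\, x_{t}\right] + b_{C}\right)
\tag{2}
```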
where $i_{t}$ represents the value to be updated, and $\widetilde{C_{t}}$ is the new candidate
value vector calculated by the Tanh layer and added to the cell state.
The weight matrices of the input gate and the candidate values are $W_{i}$ and $W_{C}$,
respectively. The input at the current time, the cell state, and the output at the
previous time jointly determine the output at the current time, which can be
expressed as follows:
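In the standard formulation, the cell state is first updated from the forget and input gates, and an output gate (weight $W_{o}$ and bias $b_{o}$ are assumed) then produces $h_{t}$:

```latex
C_{t} = f_{t}\odot C_{t-1} + i_{t}\odot\widetilde{C_{t}}, \qquad
o_{t} = \gamma\left(W_{o}\cdot\left[h_{t-1},\, x_{t}\right] + b_{o}\right), \qquad
h_{t} = o_{t}\odot\tanh\left(C_{t}\right)
\tag{3}
```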
where $h_{t}$ represents the output information. The gating mechanism enables LSTM
to learn the dance movement characteristics of the human body and obtain the constraint
relationship between its bones and the transformation rules of various action postures.
LSTM can fully use its advantages to generate dance movements of arbitrary length,
especially when the input and target output data are discrete. However,
dance data are continuous rather than discrete, and in this case the output of LSTM
has no controlled probability distribution. Therefore, an MDN is introduced to
refine the generation of dance movements. The MDN uses the output of the neural network
to parameterize the distributions of multiple mixture components, so the probability
density of each dimension of the overall network output is a mixture rather than a
single distribution. Once the MDN is applied to the LSTM, the output distribution
is conditioned on both the current input and the previous historical inputs. The
linear combination of the mixture components constitutes the probability density
of the target data, and its functional expression is as follows:
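In the standard mixture density network formulation, consistent with the symbol definitions below, Eq. (4) can be written as:

```latex
p\left(t_{a}\mid x\right) = \sum_{i=1}^{m} \alpha_{i}\left(x\right)\,\varphi_{i}\left(t_{a}\mid x\right)
\tag{4}
```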
Eq. (4) expresses the probability of the target vector $t_{a}$ given $x$, where $m$
is the number of mixture components, $\alpha _{i}$ is the mixing coefficient of
the $i^{\mathrm{th}}$ mixture component for $x$, and $\varphi _{i}$ is the conditional
density of the target vector for the $i^{\mathrm{th}}$ kernel. The Gaussian kernel function
is expressed as
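With the symbols defined below, the standard isotropic-Gaussian kernel takes the form:

```latex
\varphi_{i}\left(t_{a}\mid x\right) =
\frac{1}{\left(2\pi\right)^{c/2}\,\sigma_{i}^{\,c}\left(x\right)}
\exp\left(-\frac{\left\lVert t_{a}-\mu_{i}\left(x\right)\right\rVert^{2}}
{2\,\sigma_{i}^{2}\left(x\right)}\right)
\tag{5}
```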
where $c$ is the dimension of the model output data, and $\mu _{i}$ and $\sigma _{i}$ are
the mean and variance used to parameterize each mixture component, respectively. Thus,
the number of MDN output variables is $m\left(c+2\right)$, and its tensor can be expressed
as
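A form consistent with the stated count of $m\left(c+2\right)$ outputs ($m$ mixing coefficients, $m\cdot c$ mean components, and $m$ variances) is:

```latex
z = \left(\alpha_{1},\ldots,\alpha_{m},\;
\mu_{1},\ldots,\mu_{m},\;
\sigma_{1},\ldots,\sigma_{m}\right),
\qquad \mu_{i}\in\mathbb{R}^{c}
\tag{6}
```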
Eq. (6) covers all the parameters required to construct the mixed-density network, where
the number of mixed components $m$ is arbitrary. In the experiment, when dance movement
data is used to train the MDN model, the movements can be represented by the spatial
coordinates of each human skeleton. Hence, the trained MDN model can predict the probability
distribution of each human skeleton position at the next moment and then generate
dance movements.
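As a sketch of this prediction step, assuming an isotropic Gaussian mixture and hypothetical parameter arrays produced by the network (not the paper's implementation), the next skeleton pose can be sampled from the MDN output as follows:

```python
import numpy as np

def sample_next_pose(alpha, mu, sigma, rng=None):
    """Sample one next-frame skeleton pose from MDN parameters.

    alpha : (m,)   mixing coefficients, non-negative, summing to 1
    mu    : (m, c) mean pose of each mixture component
    sigma : (m,)   isotropic standard deviation of each component
    """
    rng = rng or np.random.default_rng()
    i = rng.choice(len(alpha), p=alpha)   # pick a mixture component
    return rng.normal(mu[i], sigma[i])    # draw a pose from that Gaussian

# Toy example: m = 2 components, c = 3 skeleton coordinates.
alpha = np.array([0.7, 0.3])
mu = np.array([[0.0, 1.0, 2.0], [5.0, 5.0, 5.0]])
sigma = np.array([0.1, 0.2])
pose = sample_next_pose(alpha, mu, sigma)
print(pose.shape)  # (3,)
```

Repeating this step, feeding each sampled pose back as the next input, yields a generated movement sequence of arbitrary length.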
3.2 Construction of a Choreography Model Integrating a Mixture Density Network and Dance
Movement Features
Computer-automated music choreography has existed for some time; its main purpose
is to use intelligent computing technology to minimize human intervention in the
choreography process [19,20]. Three main problems must be solved to achieve the ideal
experimental result: (1) acquiring real dance movements intelligently and efficiently;
(2) selecting highly correlated music and action features that can better express the
characteristics of music and action data; and (3) establishing the mapping relationship
between music features and action features. For problem (1), the mixture density network
algorithm described above can be used to obtain and generate dance movements. Fig. 2 shows the structure of the dance movement generation model based on MDN. The MDN
model comprises two parts: a neural network and a mixture density model. The neural
network predicts dance movements, and the back-end mixture density model uses the network
output as a parameter vector, from which the mean, variance,
and weight of each mixture component are determined.
Fig. 2. Structure of the MDN-based dance action generation model.
The characteristics of the music and its matching with dance movements should be considered
when extracting music features. Different styles of music have different features.
The Constant-Q Transform (CQT) is a spectrum analysis algorithm well suited to
music signals: it is more consistent with musical characteristics when processing
music signals and is widely applied in music signal analysis. The
Q factor in the CQT is a constant, and its function expression is as follows.
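The standard constant-Q relation, consistent with the definitions below, is:

```latex
Q = \frac{f_{k}}{f_{k+1}-f_{k}} = \frac{1}{2^{1/b}-1}
\tag{7}
```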
where $f_{k}$ stands for the central frequency of the $k^{\mathrm{th}}$ semitone above the initial
semitone, and $b$ is the number of semitones into which an octave is divided, usually
12 or 24. The frequency amplitude of the $k^{\mathrm{th}}$ semitone can be obtained
by applying the CQT to a music signal of finite length, and its function
expression is
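The standard CQT of a finite-length signal is:

```latex
X\left(k\right) = \frac{1}{N_{k}}\sum_{n=0}^{N_{k}-1}
w_{N_{k}}\left(n\right)\, x\left(n\right)\,
e^{-j 2\pi Q n / N_{k}}
\tag{8}
```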
where $x\left(n\right)$ is the music signal and $w_{{N_{k}}}\left(n\right)$ is a window
function of length $N_{k}$. The window size $N_{k}$ can be expressed as
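In the standard CQT, the window length is:

```latex
N_{k} = \frac{f_{s}}{f_{k}}\, Q
\tag{9}
```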
where $f_{s}$ is the sampling frequency of the input audio signal. According to Eq.
(9), the window size is inversely related to the center frequency $f_{k}$. Music
feature extraction aims to match the action characteristics of dance. Intelligent
choreography requires accurately analyzing the correlation between music and movement
segments and then finding the motion clips that best match the music characteristics,
so that music of different styles is paired with corresponding dance movements and
different styles of choreography are produced. Therefore, the extraction of music
features and movement features is a critical link, and features can be extracted
according to rhythm and intensity. The extracted action features include low-level
features, such as changes in motion speed, acceleration, motion direction, and action
morphology, as well as high-level features, such as feeling and style. Among the
low-level features, the bone velocity feature can be expressed as:
Eq. (10) expresses the average speed of the arm, where $L_{Motion}$ is the length of
the action clip $N_{i}$; $f$ is the serial number of a frame in $N_{i}$; and $p_{f}^{Arm}$
is the key position of the arm in frame $f$. Eq. (10) can likewise compute the average speed of the bones of other joints in the human body.
In intelligent choreography, the model can match action segments of different speeds
by varying the speed range of the local bones of the human body, realizing choreography
with different movement characteristics. Dance movements also have spatial characteristics
that affect the synthesis effect of the dance.
The spatial measurement of the action segment $N_{i}$ can be expressed as Eq. (11):
where $f$ represents the sequence number of a frame in the action clip; $L_{Motion}$
is the length of the action clip; and $x_{f}^{Root}$ and $y_{f}^{Root}$ are the $x$ and $y$
coordinates of the root node in frame $f$, respectively. In the intelligent choreography
of the model, the spatial characteristics of the movements can be set to match action
segments of different spatial extents, diversifying the choreography. The rhythm-matching
process of music and dance movements can be realized using Eq. (12):
where $L_{music}$ and $L_{motion}$ are the lengths of $M_{i}$ and $N_{i}$, respectively;
$f_{0}$ is the translation offset, and $s$ is the scaling coefficient. When Eq. (12)
attains its maximum value, the first $s\cdot L_{music}$ frames of the translated action
clip form the action segment with the highest matching degree. The connectability of
the selected dance movements must then be analyzed to keep the synthesized dance
looking true and natural. A window of $k$ frames is taken at the end of the previous
action clip and at the start of the next action segment, and the sum of the distances
of the $k$ frame pairs in the window is obtained:
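A plausible form of this distance sum, reconstructed from the description (the per-frame distance $d\left(\cdot,\cdot\right)$ and the pairing of frames are assumptions):

```latex
D\left(f_{i}, f_{j}\right) = \sum_{l=0}^{k-1} d\left(f_{i-k+1+l},\, f_{j+l}\right)
\tag{13}
```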
where $f_{i}$ is the end of the previous action segment, and $f_{j}$ is the beginning
of the next action segment. A threshold $\varepsilon$ is set on this basis: when
Eq. (13) is less than $\varepsilon$, $f_{i}$ and $f_{j}$ are similar enough to be
connected. Intensity matching can then be performed between the target music sequence
and each candidate connectable action sequence to obtain the action sequence that
best matches the music sequence. The matching formula can be expressed as
Eq. (14) represents the intensity matching formula for music segment $M_{i}$ and action segment
$N_{i}$, where $L_{music}$ is the length of the music segment and $L_{motion}$ is
the length of the action segment. Thus, all three problems mentioned above can be
solved, and intelligent choreography based on computer algorithms can be realized.
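The connectability test above (Eq. (13) against the threshold $\varepsilon$) can be sketched as follows, assuming each frame is a flat vector of joint coordinates and using a Euclidean frame distance (both are assumptions, not the paper's exact metric):

```python
import numpy as np

def connectable(prev_clip, next_clip, k, eps):
    """Decide whether two action clips can be joined.

    prev_clip, next_clip : (L, d) arrays of per-frame pose vectors.
    k   : window length in frames at the junction.
    eps : threshold on the summed frame-pair distance.
    """
    tail = prev_clip[-k:]                # last k frames of previous clip
    head = next_clip[:k]                 # first k frames of next clip
    # Sum of Euclidean distances over the k frame pairs.
    dist = np.linalg.norm(tail - head, axis=1).sum()
    return dist < eps

# Toy example: two clips whose junction poses nearly coincide.
a = np.zeros((10, 6))
b = np.full((10, 6), 0.01)
print(connectable(a, b, k=3, eps=0.5))  # True
```

Only pairs of clips that pass this test are kept as candidates for the subsequent intensity matching.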
Fig. 3 shows the overall choreography process based on the mixture density network and dance
movement characteristics.
Fig. 3. Overall flow of music choreography based on MDN.