Enhanced Control of Human Motion Generation using Action-conditioned Transformer VAE
with Low-rank Factorization
Hyunsung Kim¹,†, Kyeongbo Kong²,†, Joseph Kihoon Kim³,†, James Lee³,†, Geonho Cha⁴, Ho-Deok Jang⁴, Dongyoon Wee⁴, Suk-Ju Kang⁵
¹ LG Electronics, Seoul, Korea (hs9767.kim@lge.com)
² Department of Electrical & Electronics Engineering, Pusan National University, Busan, Korea (kbkong@pusan.ac.kr)
³ Samsung Electronics, Gyeonggi-do, Korea ({joseph.kim, jims.lee}@samsung.com)
⁴ Naver, Gyeonggi-do, Korea ({geonho.cha, hodeok.jang, dongyoon.wee}@navercorp.com)
⁵ Department of Electronics Engineering, Sogang University, Seoul, Korea (sjkang@sogang.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Disentangled control, 3D human mesh generation, Latent space
1. Introduction
Recently, generative models that can be used to create human motions have been studied extensively [1,2]. Like other generative models, these synthesize realistic data from learned distributions. Since such models are probabilistic, controlling the generated human motions is a very challenging task. Some studies have attempted to control motions based on text [3,4], but it remains difficult to precisely control specific body parts.
This paper presents a novel approach to directly control human motions in the latent space of a generative model. We employ a transformer-based variational autoencoder (VAE) [2] to learn the sequence-level latent space of human motions. In the image domain, there are reported studies on controlling the content generated by generative models [5-10]; these use either supervised or unsupervised methods to find directions that change the content in semantically meaningful ways. Discovering meaningful directions with supervised methods has advantages, since it can drive the output of a generative model toward the intended direction. However, since human motions have high degrees of freedom, it is challenging to supervise learning accurately. Therefore, this paper focuses on an unsupervised method to control human motions.
There are three key challenges in the task of controlling human motions generated by the baseline model [2]: (1) entangled attribute vectors, (2) posterior collapse, and (3) a lack of diversity of motions in the dataset. The first challenge originates from the use of an unsupervised method. Since it is not feasible to explicitly define the attribute vectors to be modified, it is necessary to choose a semantically meaningful direction from the set of discovered directions. However, due to biases in the dataset, most attribute vectors are entangled, resulting in changes to unintended as well as intended body parts. Inspired by [10], low-rank decomposition with projection onto an orthogonal complement space is employed to encourage partwise control of human motions. By applying this method, it is possible to reflect specific intentions and control a particular body part independently of the other body parts.
The second problem derives from the employed generative model (i.e., the transformer VAE) [11]. Despite its good performance, posterior collapse, wherein the generative model exploits only a subset of the latent space [12], occurs owing to the transformer's complex structure. This posterior collapse induces a critical problem in low-rank factorization, which we bypass through simple scheduling schemes for the KL-term, i.e., the distance between a prior and its approximate posterior.
Lastly, it is observed that human action datasets such as UESTC and HumanAct12 comprise similar motions for each body part, which degrades the controllability of human actions. To increase the diversity of motions in human action datasets, a novel data augmentation method applicable to human motion data is proposed, which utilizes the differences between motion frames for each joint.
To demonstrate the effectiveness of the proposed solution, extensive experiments are
conducted on the UESTC and HumanAct12 datasets to assess the controllability of human
motions. Additionally, to explore the applicability of our proposed method in different
contexts, we applied it to the most similar existing model [13]. This comparison allows us to evaluate the performance and versatility of our approach
in a broader range of scenarios. For evaluation, metrics from [1] are adopted to measure the naturalness and diversity of the generated actions. Additionally,
a novel metric called the SC score is introduced to effectively measure specific control
of human body parts.
2. Related Work
We briefly review the background literature on the skinned multi-person linear (SMPL) model and the action-conditioned transformer VAE.
2.1 SMPL
The SMPL model [14] is a skinned, vertex-based model that describes various human body shapes and poses. The model is simple in the sense that it is computationally efficient while remaining visually realistic: it exhibits fewer artifacts and expresses more plausible poses than other models. It is also action-oriented, as it is designed for compatibility with existing 3D animation software. A specific task related to the SMPL model is human pose and shape estimation [15-17], which aims to estimate the two types of SMPL parameters, shape and pose, from a given image or video. In that task, finding the exact values of both shape and pose parameters is very important. However, since our task is to control a specific body part in a human action, we ignore the shape parameters $\boldsymbol{\beta}$ and use only the pose parameters $\boldsymbol{\theta}$ to generate human motion in this study.
The pose parameters from the original work [14] represent the rotation of each joint with respect to its parent in the kinematic tree. Each rotation is an axis-angle rotation represented in 3D. Twenty-three joints plus one global rotation give each person 72 pose parameters in total ($R_{t} \in \mathbb{R}^{24\times 3}$). In addition, the translation of the root joint, $D_{t} \in \mathbb{R}^{1\times 3}$, is added to capture the semantics of a given action, such as translations along the $x$, $y$, and $z$ axes. Recently, rotations have been represented in the 6D rotation representation [18]; following this work, we translate $R_{t}$ and $D_{t}$ into 6D. To summarize, the rotation matrix $R_{t} \in \mathbb{R}^{24\times 6}$ consists of the joint and global rotations of a human in the 6D representation, and $D_{t} \in \mathbb{R}^{1\times 6}$ is the translation in 6D. We refer to the combination of $R_{t}$ and $D_{t}$ as $P_{t} \in \mathbb{R}^{25\times 6}$. Since we assume 60 frames of $P_{t}$ per action ($\mathcal{A}$), an action can be expressed as $\mathcal{A} \in \mathbb{R}^{25\times 6\times 60}$.
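As a rough illustration of this parameterization, the following sketch (our own, using SciPy's rotation utilities; the paper does not specify an implementation) converts axis-angle joint rotations to the 6D representation of [18] and recovers the rotation matrices via Gram-Schmidt:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def axis_angle_to_6d(rotvecs):
    """Axis-angle (J, 3) -> 6D representation (J, 6) of [18]: the first two
    columns of each rotation matrix, flattened. The third column is implied
    by orthonormality, so no information is lost."""
    mats = Rotation.from_rotvec(rotvecs).as_matrix()        # (J, 3, 3)
    return mats[:, :, :2].reshape(-1, 6)

def sixd_to_matrix(d6):
    """6D (J, 6) -> rotation matrices (J, 3, 3) via Gram-Schmidt."""
    cols = d6.reshape(-1, 3, 2)
    b1 = cols[:, :, 0] / np.linalg.norm(cols[:, :, 0], axis=1, keepdims=True)
    a2 = cols[:, :, 1]
    b2 = a2 - np.sum(b1 * a2, axis=1, keepdims=True) * b1   # orthogonalize
    b2 /= np.linalg.norm(b2, axis=1, keepdims=True)
    b3 = np.cross(b1, b2)                                   # complete the right-handed basis
    return np.stack([b1, b2, b3], axis=-1)

# One frame: 23 joint rotations plus one global rotation, i.e. R_t as (24, 3).
pose_aa = 0.3 * np.random.randn(24, 3)
pose_6d = axis_angle_to_6d(pose_aa)                         # (24, 6)
assert np.allclose(sixd_to_matrix(pose_6d),
                   Rotation.from_rotvec(pose_aa).as_matrix(), atol=1e-6)
```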
2.2 Human Motion Generation
Previously, studies related to human motion generation were limited to the image domain, generating human body parts in an image [19] or transferring the appearance of a source person to a target pose [20]. Recently, several studies [2,21-25] have addressed human motion generation, whose purpose is to generate a range of plausible output motion sequences. We can summarize the corresponding research from two perspectives: architecture and multimodality.
Most human motion generation methods employ generative adversarial networks (GANs) and VAE models, whose basic architectures have been developed in several forms. A recurrent neural network (RNN) combined with a VAE model was used to generate high-quality human motions, as demonstrated by [22]. The creation of multiple frames of motion was achieved using a GAN architecture with a hierarchical generator that combines a convolutional neural network (CNN) and an RNN, as demonstrated in [21]. Previous works dealing with human motion in the image domain [23,24] also leverage GAN architectures: [23] proposes a GAN-based framework to transfer human motion from a source image to a target image, and [24] utilizes a GAN to generate plausible intermediate frames given start and end frames as input. The transformer [26], which is based on the attention mechanism, is a powerful approach for encoding sequences with long-range dependencies. Since the motion in each frame is highly correlated, it is important to take the temporal aspects of a motion sequence into account when generating plausible and natural human motions. Therefore, in the present study, we use a transformer-based VAE, as in [2].
There have been many efforts to generate and control human motions using various types of conditions, such as text and audio. Some studies [3,4] effectively produced diverse 3D human movements that were highly relevant to the textual descriptions. A decomposition-to-composition framework for synthesizing dancing movements from input music was proposed by [27]. A previous study has also generated a different form of output from the input: in [25], continuous motion capture data are received as input, and a different modality, Labanotation, is output. In the present work, we do not use any modalities except for the action classes for human motion control. Even without such modalities, we achieve a certain degree of control over the generated human motions.
2.3 Action-conditioned Transformer VAE
Recently, human motion synthesis [1,2] has drawn attention from several researchers. In particular, [2] employed a conditional VAE model to generate human actions. The overall process is described in Fig. 1; using the transformer architecture as the encoder, the model takes arbitrary-length sequences of poses and an action label as inputs to produce the distribution parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ of the motion latent space. Using the reparameterization trick, the model can sample a latent vector $\boldsymbol{z}$ from the attained distribution. Given the single latent vector $\boldsymbol{z}$ and an action label $a$, the decoder then synthesizes human motions from the learned distribution.
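The sampling step can be summarized by the standard reparameterization trick; the sketch below is a minimal illustration (the latent dimensionality of 256 is a placeholder of ours, not a detail reported here):

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps keeps the sampling step
    differentiable, so the encoder can be trained through it."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# The encoder maps a whole pose sequence plus an action label to a single
# sequence-level distribution (mu, Sigma).
mu, logvar = torch.zeros(1, 256), torch.zeros(1, 256)   # d_z = 256 is an assumption
z = sample_latent(mu, logvar)       # one latent vector for the whole clip
# decoder(z, action_label) would then synthesize the motion sequence.
```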
Fig. 1. Action-conditioned transformer VAE [2]: In the training stage, given a sequence of body poses $P_1, \ldots, P_T$, both the encoder and the decoder of the transformer-structured VAE are exploited. In the test stage, the transformer-VAE decoder generates the sequence $\hat{P}_1, \ldots, \hat{P}_T$ using the class token and latent vector with positional encodings (PE).
Fig. 2. Overall process of human motion control: To control the target body part (left arm), we first calculate the Jacobian matrices of the left arm and the remaining body parts before performing low-rank factorization and SVD, respectively. Next, we select an attribute vector $a_i$ among the target attribute vectors $A_r$ and project it onto the null space of the remaining body parts' attribute vectors $B_n$. Using the resulting attribute vector $n$, we can control the left arm while preserving the movements of the other parts of the human body.
2.4 Latent Space in Generative Models
Interpretation of the latent spaces of trained generative models is drawing attention not only for theoretical purposes but also for manipulating images (or any type of generated output). Searching for the interpretable directions of a latent space can be categorized into two groups based on the approach: supervised and unsupervised. Supervised methods [5,8] use off-the-shelf classifiers [5] or predefined edited images [8] as guidance for finding semantically meaningful directions. However, defining complex attributes in a given image is not a trivial task. For example, a simple transformation of an object in an image, such as horizontal shifting along the x-axis, is easy, whereas changing relatively complex attributes (e.g., gender, smile, and age) is not. This is the motivation for unsupervised methods [9,10,28-30], which approach the problem through statistical and mathematical analyses. In this study, we adopt an unsupervised method for finding the attribute vectors that allow us to control specific joints in the human body.
3. Preliminary
Before introducing our methods, we refer to the approach outlined in [10] to describe how the attribute vector $\boldsymbol{n}$ is identified, as detailed
below.
3.1 Low-rank Factorization
To achieve semantic manipulation of the synthesized sample $\boldsymbol{G}\left(\boldsymbol{z}\right)$, prior studies [5,8,31] have linearly shifted the latent code $\boldsymbol{z}\in \mathbb{R}^{d_{z}}$ in the direction of an attribute vector $\boldsymbol{n}\in \mathbb{R}^{d_{z}}$ as follows:

$\boldsymbol{z}'=\boldsymbol{z}+\alpha \boldsymbol{n}, \quad (1)$

where $\alpha$ is the editing strength. To find the attribute vector that significantly alters $\boldsymbol{G}\left(\boldsymbol{z}\right)$, we set up an optimization problem that maximizes the variance of the difference as follows:

$\boldsymbol{n}^{*}=\underset{\boldsymbol{n}:\,\boldsymbol{n}^{T}\boldsymbol{n}=1}{\arg \max }\left\| \boldsymbol{G}\left(\boldsymbol{z}+\alpha \boldsymbol{n}\right)-\boldsymbol{G}\left(\boldsymbol{z}\right)\right\| _{2}^{2}, \quad (2)$

where $\boldsymbol{G}\left(\boldsymbol{z}+\alpha \boldsymbol{n}\right)=\boldsymbol{G}\left(\boldsymbol{z}\right)+\alpha \boldsymbol{J}_{\boldsymbol{z}}\boldsymbol{n}+o\left(\alpha \right)$ by the first-order Taylor series approximation, and $\boldsymbol{J}_{\boldsymbol{z}}$ is the Jacobian matrix of the generator $\boldsymbol{G}\left(\cdot \right)$ with respect to $\boldsymbol{z}$. We can find $\boldsymbol{n}$ by solving (2) in closed form: the solution is the eigenvector of the matrix $\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}$ with the largest eigenvalue.
A previous study [10] has suggested that $\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}$ is a degenerate matrix perturbed with noise, so it can be decomposed into a low-rank matrix and a sparse matrix. This motivates the following low-rank factorization:

$\min _{\boldsymbol{L},\boldsymbol{S}}\left\| \boldsymbol{L}\right\| _{*}+\lambda \left\| \boldsymbol{S}\right\| _{1},\quad \mathrm{s.t.}\quad \boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}=\boldsymbol{L}+\boldsymbol{S}, \quad (3)$

where $\left\| \boldsymbol{M}\right\| _{*}=\sum _{i}\sigma _{i}\left(\boldsymbol{M}\right)$ is the nuclear norm of a matrix $\boldsymbol{M}$, defined as the sum of all of its singular values, $\left\| \boldsymbol{M}\right\| _{1}=\sum _{ij}\left| \boldsymbol{M}_{ij}\right|$, and $\lambda$ is a parameter that balances the low-rank matrix $\boldsymbol{L}$ and the sparse matrix $\boldsymbol{S}$. This low-rank factorization can be solved by the alternating directions method of multipliers [32]. Then, $\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}$ can be decomposed as follows:

$\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}=\boldsymbol{L}^{*}+\boldsymbol{S}^{*}. \quad (4)$

Using the singular value decomposition (SVD) of the matrix $\boldsymbol{L}^{*}$, we obtain the attribute vectors as the right singular vectors $\boldsymbol{V}$ of the SVD.
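For illustration, the decomposition of Eq. (3) can be computed with a standard robust-PCA-style ADMM loop; the following is a simplified sketch of ours (the defaults for $\lambda$ and the step size $\mu$ follow common robust PCA practice, and the Jacobian is a random placeholder):

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding, used for the sparse term S."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding, used for the low-rank term L."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def low_rank_factorize(M, lam=None, mu=None, n_iter=200):
    """Solve Eq. (3), M = L + S, with an inexact ALM / ADMM loop [32]."""
    lam = lam if lam is not None else 1.0 / np.sqrt(max(M.shape))
    mu = mu if mu is not None else M.size / (4.0 * np.abs(M).sum())
    L, S, Y = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)       # update the low-rank part
        S = shrink(M - L + Y / mu, lam / mu)    # update the sparse part
        Y = Y + mu * (M - L - S)                # dual ascent on the constraint
    return L, S

d_z = 256                                       # assumed latent dimensionality
J_part = np.random.randn(6 * 60, d_z)           # placeholder Jacobian of one body part
L_star, S_star = low_rank_factorize(J_part.T @ J_part)
_, _, Vt = np.linalg.svd(L_star)
attribute_vectors = Vt.T                        # columns a_1, ..., a_{d_z} from Eq. (4)
```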
4. Proposed Method
In this section, we present solutions that primarily address the following three key challenges arising in the control of human motions via the latent space of a conditional VAE: (1) entangled attribute vectors, (2) posterior collapse, and (3) a lack of diversity of motions in the datasets.
4.1 Separate Control of The Body Parts
Our objective is to control one part of the human body (e.g., an arm) while preserving the movements of the other parts (e.g., a leg). To do this, we utilize null space projection [10]. By projecting the attribute vector of a target body part onto the null space of the remaining body parts' attribute vectors, we can move the intended body part independently of the others, a concept referred to as disentangled control. This can be formulated as follows.
From Eq. (4), let $V_{target}$ and $V_{other}$ be the right singular matrices resulting from the SVD of the low-rank matrices $L_{target}$ and $L_{other}$ of the target and the other body parts, respectively. The matrices $L_{target}$ and $L_{other}$ can be calculated from the Jacobian matrices obtained by differentiating only the outputs associated with each body part. In addition, $r_{target}$ and $r_{other}$ denote the ranks of $L_{target}$ and $L_{other}$, respectively. Then, we can express $V_{target}=\left[A_{r},\,A_{n}\right]$, where $A_{r}=\left[a_{1},\ldots ,a_{r_{target}}\right]$ and $A_{n}=\left[a_{r_{target}+1},\ldots ,a_{d_{z}}\right]$, and $V_{other}=\left[B_{r},\,B_{n}\right]$, where $B_{r}=\left[b_{1},\ldots ,b_{r_{other}}\right]$ and $B_{n}=\left[b_{r_{other}+1},\ldots ,b_{d_{z}}\right]$. $B_{n}$ can be interpreted as a set of vectors that have little effect on the other body parts. Hence, we can achieve partwise controllability by projecting one attribute vector $a_{i}$ onto the orthogonal complement of $B_{r}$, which is formulated as

$\boldsymbol{n}=\left(\boldsymbol{I}-B_{r}B_{r}^{T}\right)a_{i}. \quad (5)$
We can easily see that $B_{n}$ spans the null space of $L_{other}$. Since each joint in the SMPL model parameters can be handled by indexing, we can compute the Jacobians of specific body parts. This enables us to compute the low-rank matrices and attribute vectors for the target and remaining parts without extra effort, for instance, by using binary masks.
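A minimal sketch of this projection, assuming $L_{other}$ is available as a $d_{z}\times d_{z}$ matrix and identifying $B_{n}$ by the near-zero singular values of its SVD (the tolerance is an assumption of ours):

```python
import numpy as np

def project_to_null_space(a_i, L_other, rel_tol=1e-6):
    """Project a target attribute vector onto the null space of L_other, as in Eq. (5).

    B_n, the right singular vectors of L_other with (near-)zero singular values,
    spans directions that barely move the other body parts; keeping only the
    component of a_i inside span(B_n) therefore edits the target part while
    leaving the rest of the body almost unchanged. Assumes L_other is
    rank-deficient, so that the null space is non-trivial.
    """
    _, s, Vt = np.linalg.svd(L_other)
    B_n = Vt[s < rel_tol * s.max()].T       # null-space basis, (d_z, d_z - r_other)
    n = B_n @ (B_n.T @ a_i)                 # equivalently (I - B_r B_r^T) a_i
    return n / np.linalg.norm(n)
```

The edited latent code is then $z + \alpha n$, as in Eq. (1).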
4.2 Posterior Collapse and Low-rank Factorization
Although the transformer-based VAE shows excellent performance in human motion synthesis,
because of its flexible structure, it is vulnerable to posterior collapse, where the
generative model only exploits a subset of the latent variables [12]. When the latent space is reduced, the capacity of the generative model is reduced,
thereby degrading its performance. However, posterior collapse induces a more critical
problem in low-rank factorization. Since the trained latent vector is sparse, some
information included in the Jacobian matrix may have sparse form; this information
may then be wrongly included in matrix $S$ employed to extract sparse noise during
low-rank factorization; hence the discrepancy of information between the low-rank
matrix $L$ and $J_{z}^{T}J_{z}$ increases. Therefore, it is essential to mitigate
posterior collapse for effective action control.
There are numerous strategies to circumvent posterior collapse. We can categorize them into three groups: modifying the variational inference objective [33], limiting the capacity of the decoder [34-36], and designing an optimization scheme for training the VAE [37-39]. We adopt sigmoid and cyclical annealing schedules [40,41], widely used optimization schemes that schedule the KL-term weighting hyperparameter.
The VAE objective, which is called the evidence lower bound (ELBO), can be expressed as follows:

$\mathcal{L}(\theta ,\phi ;x)=\mathbb{E}_{q_{\phi }(z|x)}\left[\log p_{\theta }(x|z)\right]-\beta \,D_{KL}\left(q_{\phi }(z|x)\,\|\,p(z)\right). \quad (6)$

The KL-term in Eq. (6) is the distance between a prior and its approximate posterior. To maximize the ELBO, the model attempts to minimize the KL-term. Therefore, if $\beta$ is large in the early stage of training, when the decoder is immature and the KL-term is relatively large, the model ignores the posterior, which leads to posterior collapse. Thus, a small $\beta$ can be used at the beginning of training to ensure diversity in generation and then gradually increased according to the scheduling scheme.
Generally, the number of active units (AU) is one indicator of the effect of KL-term scheduling; it tends to be small when posterior collapse occurs. We describe the effects of KL-term scheduling in the experimental section.
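In implementation terms, the scheduled objective simply weights the KL term of the negative ELBO in Eq. (6) by the current $\beta$; the sketch below is a minimal illustration (the MSE reconstruction term is an assumption, chosen only for concreteness):

```python
import torch
import torch.nn.functional as F

def beta_elbo_loss(x, x_hat, mu, logvar, beta):
    """Negative ELBO of Eq. (6) with a scheduled KL weight beta.

    Keeping beta small early in training prevents the KL term from dominating
    while the decoder is still immature; beta then grows toward its upper
    bound according to the annealing schedule."""
    rec = F.mse_loss(x_hat, x)                                      # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x) || N(0, I))
    return rec + beta * kl
```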
4.3 Data Augmentation
We use data augmentation to achieve higher controllability over the human actions, based on observations of the datasets. After inspecting the human action dataset UESTC [42], we observed that many motions of each body part in the dataset are similar to those of the others, which implies that the angles between the joints are bounded by the homogeneity of the dataset. We also observed that most of the motions are concentrated in the arms rather than the other body parts; this is because the 40 classes of the UESTC dataset are aerobic exercises, which essentially involve arm movements. Based on these observations, data augmentation was applied to the dataset by increasing the change rate of the motions. We obtain the change rate by calculating the differences between parameters in adjacent frames. Then, by multiplying by a number between 1 and 1.5, we obtain a motion instance with an increased range of movement. Fig. 3 shows that the range of motions of each part increased after augmentation of the UESTC dataset.
The detailed implementation of the data augmentation is as follows. First, we convert the parameters given in the 6D representation of [18] to Euler angles; the reason for this conversion is to ensure the interpretability of the parameters. After conversion, since we have 60 sets of parameters $\left(\theta \right)$, which can be represented as $\mathcal{A}'\in \mathbb{R}^{25\times 3\times 60}$, we subtract the parameters of each frame from those of the following frame. This gives the angle differences between frames, which can be considered the motion between the frames. After subtraction, we multiply the acquired angle differences by a number between 1 and 1.5; this step is the main point of the data augmentation, allowing more active motions. In the final step, by converting the Euler representations back to the 6D representation, we can handle the data in its original form. For more information on the 6D representation, please refer to [18]. Fig. 4 illustrates the proposed data augmentation method, and a minimal sketch is given below.
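The following is a minimal sketch of the augmentation step, assuming the clip has already been converted to Euler angles as $\mathcal{A}'\in \mathbb{R}^{25\times 3\times 60}$ (the 6D-to-Euler conversion itself is omitted here):

```python
import numpy as np

def augment_motion(A_euler, rng=None):
    """Exaggerate one motion clip A' of shape (25, 3, 60) in Euler-angle form.

    Frame-to-frame angle differences approximate the motion between frames;
    scaling them by alpha ~ U(1, 1.5) and re-integrating with a cumulative sum
    yields a clip with a wider range of movement, anchored at the first frame.
    """
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(1.0, 1.5)
    diffs = np.diff(A_euler, axis=-1)                   # (25, 3, 59) inter-frame motion
    start = A_euler[..., :1]                            # keep the first frame fixed
    rest = start + np.cumsum(alpha * diffs, axis=-1)    # re-integrate the scaled motion
    return np.concatenate([start, rest], axis=-1)       # back to (25, 3, 60)

A_prime = 0.1 * np.random.randn(25, 3, 60)              # placeholder clip
A_aug = augment_motion(A_prime)
```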
Fig. 3. Effects of data augmentation: The graphs show the output differences of each joint when the latent vectors move along the eigenvectors of $J^TJ$. We use the eigenvectors corresponding to the top-$k$ eigenvalues. The solid line is the mean value of the difference across the $k$ values, and the colored region spans ±1σ ($k$=7). The range of motions of each part is observed to increase after applying data augmentation.
Fig. 4. Data augmentation applied for more active actions: The figure illustrates data augmentation applied to human actions to capture more active actions than those available in the dataset. After calculating the differences between adjacent frames, we multiply by α, a value uniformly distributed between 1 and 1.5, to generate more exaggerated actions. $\mathcal{A}'$ stands for the human action in the Euler angle representation, and $\mathcal{A}'_{aug}$ is its augmented counterpart.
5. Experiments
In this section, the datasets, implementation details, and performance measures for
the experiments are introduced (Section 5.1). Next, an ablation study is presented
(Section 5.2), and visualization is provided (Section 5.3). Finally, the performance
of the method in the unconditional setting is shown (Section 5.4).
5.1 Experimental Setup
a: Dataset
Experiments are performed on two datasets, UESTC [42] and HumanAct12 [1], which are postprocessed as in [2] for the 3D human motion generation task. The UESTC dataset is a large-scale RGB-D action dataset that covers the entire 360° viewing angle and consists of 40 action categories.
Among the 25K video sequences of UESTC, following [2], about 10,650 sequences were used for training and the remaining 13,350 sequences for testing, as per the official cross-subject protocol of [42]. The HumanAct12 dataset is adapted from the existing PHSPD dataset [43,44] and is composed of 1,191 videos with 12 action categories.
b: Implementation details
For a fair comparison, training and testing were conducted under the same configuration as [2], which does not schedule the weight of the KL term ($\beta$ in Eq. (6)) during training. Sigmoid annealing, as in [40], defines the weighting factor of the KL term as $\beta_{n}=u\cdot \left(1/\left(1+e^{-kn+b}\right)\right)$, where $u$ is an upper bound for the KL-term weight, $n$ is the training step, and $k$ and $b$ are parameters that control the rate of weight change. We set $k$ and $b$ to 5e-6 and 13.5, respectively, and $u$ was set to 2e-6. Cyclical annealing, as conducted in [41], defines the KL-term weight as

$\beta_{t}=\begin{cases}u\cdot f\left(\tau \right), & \tau \leq R\\ u, & \tau >R\end{cases},\quad \tau =\frac{\mathrm{mod}\left(t-1,\,\lceil T/M\rceil \right)}{T/M}, \quad (7)$

where $t$ is the iteration number, $T$ is the total number of training iterations, $M$ is the number of cycles, $R$ is the proportion of each cycle used to increase the weight, $f$ is a monotonically increasing function, and $u$ is again the upper bound. We set $M$ and $R$ to 4 and 0.5, respectively, and $u$ was set to 1e-6.
Experiments were performed to select the appropriate hyperparameters for sigmoid annealing [40]. Three hyperparameters are used in this annealing scheme: $k$, $b$, and $u$. Here, $k$ and $b$ control the rate of the KL-term weight change, and $u$ is the upper bound of the weight value. Since the final performance of the network was mainly affected by the value of $u$, experiments were conducted by varying only $u$. The results are shown in Table 2; the values of $k$ and $b$ were set so that the KL-term weight reaches half of the maximum value $u$ at the midpoint of the training process, as shown in Fig. 5. For the value of $u$, we experimented with two values, 5e-6 and 1e-6. As seen, the Fréchet inception distance (FID) score is best when 5e-6 is used, whereas the AU is largest when 1e-6 is used. Although the model using 1e-6 could be regarded as the best at circumventing posterior collapse, its unacceptable FID score indicates that it does not guarantee the quality of the generated human motion, which is a fundamental requirement for a generative model. Accordingly, the value of 5e-6 is adopted in this work.
Fig. 5. KL-term weight after applying sigmoid annealing.
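For concreteness, both schedules can be written as small functions of the training step; the cyclical variant below encodes our reading of Eq. (7) with a linear ramp $f$, and the default arguments mirror the hyperparameters reported above:

```python
import numpy as np

def sigmoid_beta(n, k=5e-6, b=13.5, u=5e-6):
    """Sigmoid annealing [40]: beta_n = u / (1 + exp(-k*n + b)) at training step n."""
    return u / (1.0 + np.exp(-k * n + b))

def cyclical_beta(t, T, M=4, R=0.5, u=1e-6):
    """Cyclical annealing [41]: within each of M cycles, beta ramps linearly
    from 0 to u over the first R fraction of the cycle, then stays at u."""
    cycle_len = np.ceil(T / M)
    tau = ((t - 1) % cycle_len) / cycle_len
    return u * min(tau / R, 1.0)
```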
c: Evaluation metrics
In accordance with previous studies [1,2], the FID score is adopted on the training and test datasets to measure the quality and diversity of the motion generation. To extract the motion features, we used the action recognition models from [1]. As discussed in Section 4.2, the AU [45] is employed to estimate the degree of posterior collapse. The activity of a latent dimension is measured as $A_{z}=\mathrm{Cov}\left(\mathbb{E}_{z\sim q\left(z|x\right)}\left[z\right]\right)$, and a dimension is regarded as active if $A_{z}>0.001$ in this study.
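Computing the AU then reduces to thresholding the per-dimension variance of the posterior means over a batch of inputs; a minimal sketch (the batch shape is an assumption):

```python
import numpy as np

def active_units(mu_batch, threshold=1e-3):
    """Count the active latent dimensions (AU) [45].

    mu_batch holds the posterior means E_{z~q(z|x)}[z] for a batch of inputs,
    shape (N, d_z). A dimension is active if its mean varies across inputs,
    i.e. if A_z exceeds the threshold (0.001 in this paper)."""
    A_z = np.var(mu_batch, axis=0)      # per-dimension variance over inputs
    return int(np.sum(A_z > threshold))
```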
The term $J_{z}^{T}J_{z}$ refers to the product of the Jacobian matrix of the generator with respect to $z$ and its transpose. The matrix $L$ can be interpreted as the noise-removed, low-rank representation of $J_{z}^{T}J_{z}$, as derived in Eq. (4). Given an attribute vector $n$ that targets a specific body part, the value of $n^{T}L_{target}n$ should be large, whereas the value of $n^{T}L_{other}n$ should be small due to the null space projection.
Furthermore, the attribute vector $n$, derived from $L$, should induce a change in $J_{z}^{T}J_{z}$ toward the desired attribute direction. This assumption holds if $L$ is a properly decomposed, noise-removed representation of $J_{z}^{T}J_{z}$. Consequently, as illustrated in Eqs. (9) and (10), the numerators encapsulate the ratios representing the separate control of the human body with respect to $J_{z}^{T}J_{z}$ and $L$, respectively.
By employing the SC score, it is possible to systematically evaluate the partwise control of human motions, providing a robust framework for analyzing the effectiveness of the method.
The difference between $SC_{J_{z}^{T}{J_{z}}}$ and $SC_{L}$ can be interpreted as the discrepancy between $J^{T}J$ and $L$ when moving along the target attribute vector $n$ obtained by applying SVD to $L$. In summary, both the intra-relationship (the numerator of each equation), which gives an intuition for the separate control of a body part, and the inter-relationship, which gives an intuition for the relevance between $J^{T}J$ and the low-rank matrix $L$, should be evaluated.
5.2 Ablation Study
a: KL-term scheduling
This section validates the effectiveness of the KL-term annealing schemes used herein. As discussed in Section 4.2, we applied sigmoid annealing [40] and cyclical annealing [41] to the KL term to avoid posterior collapse, and we employed the AU to observe the number of active dimensions among the total number of dimensions in the latent vector. As seen in Table 1, when cyclical annealing is applied to the UESTC dataset, the AU increases to 3.3 times that of the original; this means that, when trained with cyclical annealing, the generation model uses 3.3 times more latent variables. On the other hand, when sigmoid annealing is applied, the AU increases to 4.1 times that of the original, meaning that sigmoid annealing is more effective than cyclical annealing, as is clear in Fig. 6. The three graphs in Fig. 6 show how active the dimensions of the latent variables are in ACTOR [2], ACTORcycle, and ACTORsigmoid. There are more active units in ACTORsigmoid than in ACTORcycle, which means that sigmoid scheduling is more effective for mitigating posterior collapse.
Fig. 6. Active unit variance: Each graph describes $A_z$ for all dimensions of the latent space in three ACTOR [2] variants: ACTOR, ACTORcycle, and ACTORsigmoid. Both sigmoid and cyclical annealing are effective for mitigating posterior collapse.
The results in Table 1 show that both types of SC scores ($SC_{{J^{T}}J}$ and $SC_{L}$) increase under either scheduling scheme compared to the baseline ACTOR model. This means that the partwise disentangled controllability increases. The baseline model, ACTOR†, even exhibits a negative value for $SC_{{J^{T}}J}$, which means that the discovered attribute vector $n$ is highly entangled with the Jacobian matrix $J_{other}$ of the other body parts. In addition, note that the difference between $SC_{{J^{T}}J}$ and $SC_{L}$, which we call the inter-relationship, decreases as the AU increases. This indicates that posterior collapse induces more critical problems in low-rank factorization.
For comparison, sigmoid annealing [40], which was shown on the UESTC dataset to have positive effects on mitigating posterior collapse, was also applied to the KL term on HumanAct12. These results are shown in Table 3. As in the case of UESTC, the AU of HumanAct12 increased to 5.4 times that of the baseline model, which means that the number of active dimensions in the latent space increased. In Section 5.1, we suggested new metrics, the SC scores ($SC_{{J^{T}}J}$ and $SC_{L}$). Since $SC_{{J^{T}}J}$ and $SC_{L}$ increased and the gap between them was reduced, we conclude that the control of a body part is separated to a greater extent and that the relevance between $J^{T}J$ and $L$ increases. Therefore, our method allows us to control actions in a partwise manner more easily than the na\"{i}ve method on the HumanAct12 [1] and UESTC [42] datasets.
b: Data augmentation
This subsection presents the quantitative results of our method. First, we measured the quality of the generated motion sequences using the FID score between the feature distributions of generated and real motions. Table 1 shows the correlation between data augmentation and the realistic quality of the synthesized motions. Comparing the baseline model (ACTOR) with and without data augmentation, we conclude that more realistic motion sequences can be generated with data augmentation. In addition, the model can perceive a wider range of motions compared to na\"{i}ve training.
Table 1. Comparison of ACTOR variants on the UESTC dataset. † is quoted from [2].

| Method | FIDtr↓ | FIDtest↓ | Acc.↑ | AU↑ | SC$_{J^{T}J}$↑ | SC$_L$↑ | Multimod.→ |
|---|---|---|---|---|---|---|---|
| Real† | 2.93±0.26 | 2.79±0.29 | 98.8±0.1 | - | - | - | 14.16±0.06 |
| ACTOR† | 0.12±0.00 | 2.79±0.29 | 2.79±0.29 | 10 | 0.23 | -0.35 | 14.66±0.03 |
| ACTORaug | 2.79±0.29 | 2.79±0.29 | - | 16 | - | - | - |
| ACTORsigmoid | 14.32±1.19 | 18.25±1.38 | 92.0±0.55 | 41 | 0.30 | 0.39 | 15.08±0.10 |
| ACTORcycle | 23.38±2.14 | 26.04±3.45 | 80.0±0.96 | 33 | 0.29 | 0.00 | 17.73±0.10 |
Table 2. Experiment for the hyper-parameter of KL-term annealing on the UESTC dataset. † is quoted from [2].

| Method | FIDtr↓ | FIDtest↓ | AU↑ |
|---|---|---|---|
| ACTOR† | 20.49±2.31 | 23.43±2.20 | 10 |
| ACTORsigmoid (5e-6) | 14.32±1.19 | 18.25±1.38 | 41 |
| ACTORsigmoid (1e-6) | 20.49±2.31 | 20.49±2.31 | 76 |
Table 3. Comparison of ACTOR variants on the HumanAct12 dataset. † is quoted from [2].

| Method | FIDtr↓ | AU↑ | SC$_{J^{T}J}$↑ | SC$_L$↑ |
|---|---|---|---|---|
| Real | 0.02±0.00 | - | - | - |
| ACTOR | 0.12±0.00 | 14 | 0.00 | -0.18 |
| ACTORsigmoid | 0.16±0.00 | 75 | 0.30 | 0.39 |
5.3 Conditional Generation
Fig. 7 depicts the effectiveness of our method in terms of motion control. For two different classes, the method was applied to control an arm part and a leg part, respectively. The first two rows of Fig. 7 show the results of arm control, and the last two rows show the results of leg control. For arm control, the elbow joints of the controlled output are more stretched than those of the original output; this occurs not only in the red-circled frames but also in the other frames of Fig. 7. For leg control, the right knee joint of the controlled output is raised higher than that of the original output. Most importantly, even after each target body part is manipulated, the other body parts maintain their original motions.
More qualitative results on the UESTC and HumanAct12 datasets are also shown. Figs. 10 and 11 show examples of controlling arm parts and leg parts, respectively, on the UESTC dataset. Figs. 12 and 13 show examples of controlling arm parts and leg parts, respectively, on the HumanAct12 dataset.
Figs. 10 and 12 show that an arm can be controlled by finding the related latent vector $n_{arm}$. In the first two rows of Fig. 10, when $n_{arm}$ is applied to action class 9, the elbow of the human figure is bent compared to the original class 16 human figure. Likewise, by applying the latent direction vector for the elbow, the same effect as in class 9 is applied to class 35. For classes 10 and 14 in the subsequent rows, the human body model raises its arms higher than in the original action class. Fig. 12 shows the effect written below each class name applied to the HumanAct12 dataset. Figs. 11 and 13 show the effects of the latent vector related to the legs of the human body model on the UESTC and HumanAct12 datasets, respectively.
To further assess the generalization capability of the proposed method, we applied it to another generative model, described in [13]. This evaluation aims to verify the robustness and adaptability of the proposed method across different generative frameworks. Fig. 8 shows the vector $n_{arm}$ applied to the jumping class.
Fig. 7. Qualitative results for the conditional setting: The result of controlling only an arm in the action of the first row is shown in the second row. Likewise, the result of controlling only a leg in the action of the third row is shown in the fourth row.
Fig. 8. Qualitative results in the conditional setting: These actions are generated under the conditional setting of the work [13]. The first row is the original class, jumping. The second row shows the result of applying the arm-moving vector to the first row. The format of the human model follows the work [13].
Fig. 9. Qualitative results in the unconditional setting: These actions are generated under the unconditional setting. The first two rows are generated from the inputs of two different classes. The action in the third row is generated from the average latent vector of the first- and second-row actions. Only the arm part of the third action is controlled, as shown in the fourth row.
Fig. 10. Qualitative results on the UESTC dataset (arm-control).
Fig. 11. Qualitative results on the UESTC dataset (leg-control).
Fig. 12. Qualitative results on the HumanAct12 dataset (arm-control).
Fig. 13. Qualitative results on the HumanAct12 dataset (leg-control).
Fig. 14. Qualitative results on the UESTC dataset (Unconditional).
5.4 Unconditional Generation
The strongest advantage of generating 3D human motions lies in generating diverse and human-like motion sequences. To this end, the methodology was expanded from a class-conditional setting to an unconditional setting to generate new actions between two classes and to observe how these actions can be controlled. This is visualized in Fig. 9. The first two rows of the figure show motion sequences from two different classes. The third row represents the action generated from the latent vector interpolated between the latent vectors of classes 3 and 18. Since the generated motion in the third row includes both the sitting motion of class 3 (first row) and the arm-circling motion of class 18 (second row), we conclude that semantic interpolation in the latent space is possible. Further, arm control was implemented using the interpolated latent vector, and the result is shown in the fourth row of Fig. 9. Unlike the action in the third row, the action in the fourth row bends its elbows more while restraining the other joints from moving. This implies that arm control can also be applied to the interpolated latent vector. Thus, our method can generate far more diverse actions under the unconditional setting. Fig. 14 shows more results from unconditional generation.
Since generated actions are composed of frames, static images alone are insufficient to show their quality; hence, please refer to our project page for videos: https://josephkkim.github.io/Motion_Control/
6. Conclusion
This study demonstrated methods for directly controlling motions in the latent space
of the human action generative model. To improve controllability, we employed three
techniques: (1) attribute vector projection, (2) mitigating posterior collapse, and
(3) data augmentation. As a result, we achieved control of a target body part while
preserving the movements of the remaining body parts. In particular, we discovered
that as the number of activated dimensions in the latent vector increased, the controllability
also increased, meaning that posterior collapse and controllability are closely related.
In the class-conditional VAE, various controls were difficult because the exploited latent space was too small. While previous research has concentrated on generating diverse representations
of human motion, this paper addresses the problem of motion generation through the
exploration of the latent space, allowing for the modification of outputs from generative
models. In recent studies, for text and image generative models, prompt engineering
has been employed to produce desirable outputs. However, our approach is novel in
that it directly explores and exploits the latent space. In our future work, we intend
to explore further action control of unconditional generative models with large-scale
datasets. Furthermore, the ability to control outputs through direct access to the
latent space is not only significant for human action generative models but also holds
important implications for users of generative models across various modalities. This
approach enables users to gain more precise control over the outputs of diverse generative
models.
ACKNOWLEDGMENTS
This work was supported by the MSIT (Ministry of Science and ICT), Korea, under the Graduate School of Metaverse Convergence support program (IITP-RS-2022-00156318) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00414230).
REFERENCES
C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, ``Action2Motion: Conditioned generation of 3D human motions,'' in Proc. ACM Int'l Conf. Multimedia, 2020, pp. 2021-2029.
M. Petrovich, M. J. Black, and G. Varol, ``Action-conditioned 3D human motion synthesis with transformer VAE,'' in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 10985-10995.
G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, ``MotionCLIP: Exposing human motion generation to CLIP space,'' 2022, arXiv:2203.08063.
M. Petrovich, M. J. Black, and G. Varol, ``TEMOS: Generating diverse human motions from textual descriptions,'' 2022, arXiv:2204.14109.
L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola, ``GANalyze: Toward visual definitions of cognitive image properties,'' in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5744-5753.
Z. Chen and N. Chen, ``Children's football action recognition based on LSTM and a V-DBN,'' IEIE Transactions on Smart Processing & Computing, vol. 12, no. 4, pp. 312-322, 2023.
Y. Shi, ``Image recognition of skeletal action for online physical education class based on convolutional neural network,'' IEIE Transactions on Smart Processing & Computing, vol. 12, no. 1, pp. 55-63, 2023.
A. Jahanian, L. Chai, and P. Isola, ``On the `steerability' of generative adversarial networks,'' in Proc. Int. Conf. Learn. Represent., 2019.
Y. Shen and B. Zhou, ``Closed-form factorization of latent semantics in GANs,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1532-1540.
J. Zhu, R. Feng, Y. Shen, D. Zhao, Z.-J. Zha, J. Zhou, and Q. Chen, ``Low-rank subspaces in GANs,'' in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 16648-16658.
D. P. Kingma and M. Welling, ``Auto-encoding variational Bayes,'' 2013, arXiv:1312.6114.
B. Dai, Z. Wang, and D. Wipf, ``The usual suspects? Reassessing blame for VAE posterior collapse,'' in International Conference on Machine Learning. PMLR, 2020, pp. 2313-2322.
Q. Lu, Y. Zhang, M. Lu, and V. Roychowdhury, ``Action-conditioned on-demand motion generation,'' in Proc. ACM Int'l Conf. Multimedia, 2022, pp. 2249-2257.
M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, ``SMPL: A skinned multi-person linear model,'' in ACM Transactions on Graphics, 2015, pp. 1-16.
Q. Sun, Y. Xiao, J. Zhang, S. Zhou, C.-S. Leung, and X. Su, ``A local correspondence-aware hybrid CNN-GCN model for single-image human body reconstruction,'' IEEE Transactions on Multimedia, 2022.
Y. Sun, L. Xu, Q. Bao, W. Liu, W. Gao, and Y. Fu, ``Learning monocular regression
of 3d people in crowds via scene-aware blending and deocclusion,'' IEEE Transactions
on Multimedia, 2023.
H. Zhang, Y. Meng, Y. Zhao, X. Qian, Y. Qiao, X. Yang, and Y. Zheng, ``3D human pose and shape reconstruction from videos via confidence-aware temporal feature aggregation,'' IEEE Transactions on Multimedia, 2022.
Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, ``On the continuity of rotation representations
in neural networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp.
5745-5753.
H. Tang and N. Sebe, ``Total Generate: Cycle in cycle generative adversarial networks for generating human faces, hands, bodies, and natural scenes,'' IEEE Transactions on Multimedia, vol. 24, pp. 2963-2974, 2021.
L. Ma, K. Huang, D. Wei, Z.-Y. Ming, and H. Shen, ``FDA-GAN: Flow-based dual attention GAN for human pose transfer,'' IEEE Transactions on Multimedia, 2021.
X. Lin and M. R. Amer, ``Human motion modeling using DVGANs,'' 2018, arXiv:1804.10652.
I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura, ``A recurrent variational autoencoder for human motion synthesis,'' in Proc. British Mach. Vis. Conf., 2017.
F. Ma, G. Xia, and Q. Liu, ``Spatial consistency constrained GAN for human motion transfer,'' IEEE Trans. Circuits Syst. Video Technol., 2021.
S. Wen, W. Liu, Y. Yang, T. Huang, and Z. Zeng, ``Generating realistic videos from keyframes with concatenated GANs,'' IEEE Trans. Circuits Syst. Video Technol., 2018.
N. Xie, Z. Miao, X.-P. Zhang, W. Xu, M. Li, and J. Wang, ``Sequential gesture learning for continuous Labanotation generation based on the fusion of graph neural networks,'' IEEE Trans. Circuits Syst. Video Technol., 2021.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \L{}. Kaiser,
and I. Polosukhin, ``Attention is all you need,'' Proc. Adv. Neural Inf. Process.
Syst., 2017.
H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, and J. Kautz, ``Dancing to music,'' Proc. Adv. Neural Inf. Process. Syst., 2019.
E. H\"{a}rk\"{o}nen, A. Hertzmann, J. Lehtinen, and S. Paris, ``GANSpace: Discovering interpretable GAN controls,'' Proc. Adv. Neural Inf. Process. Syst., pp. 9841-9850, 2020.
Y. Wei, Y. Shi, X. Liu, Z. Ji, Y. Gao, Z. Wu, and W. Zuo, ``Orthogonal Jacobian regularization for unsupervised disentanglement in image generation,'' in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 6721-6730.
V. Abrol, P. Sharma, and A. Patra, ``Improving generative modelling in VAEs using multimodal prior,'' IEEE Transactions on Multimedia, vol. 23, pp. 2153-2161, 2020.
Y. Shen, J. Gu, X. Tang, and B. Zhou, ``Interpreting the latent space of GANs for semantic face editing,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9243-9252.
Z. Lin, M. Chen, and Y. Ma, ``The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,'' 2010, arXiv:1009.5055.
A. Razavi, A. van den Oord, B. Poole, and O. Vinyals, ``Preventing posterior collapse with delta-VAEs,'' 2019, arXiv:1901.03416.
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, ``Generating sentences from a continuous space,'' 2015, arXiv:1511.06349.
I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, ``PixelVAE: A latent variable model for natural images,'' 2016, arXiv:1611.05013.
Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick, ``Improved variational autoencoders for text modeling using dilated convolutions,'' in International Conference on Machine Learning. PMLR, 2017, pp. 3881-3890.
J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick, ``Lagging inference networks and posterior collapse in variational autoencoders,'' 2019, arXiv:1901.05534.
Y. Kim, S. Wiseman, A. Miller, D. Sontag, and A. Rush, ``Semi-amortized variational autoencoders,'' in International Conference on Machine Learning. PMLR, 2018, pp. 2678-2687.
B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang, ``A surprisingly effective fix for deep latent variable modeling of text,'' 2019, arXiv:1909.00868.
D. Liu and G. Liu, ``A transformer-based variational autoencoder for sentence generation,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN). IEEE, 2019, pp. 1-7.
H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, ``Cyclical annealing schedule: A simple approach to mitigating KL vanishing,'' in Proc. Hum. Lang. Technol., Annu. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2019, pp. 240-250.
Y. Ji, F. Xu, Y. Yang, F. Shen, H. T. Shen, and W.-S. Zheng, ``A large-scale RGB-D database for arbitrary-view human action recognition,'' in Proc. ACM Int'l Conf. Multimedia, 2018, pp. 1510-1518.
S. Zou, X. Zuo, Y. Qian, S. Wang, C. Xu, M. Gong, and L. Cheng, ``3D human shape reconstruction from a polarization image,'' in Proc. Eur. Conf. Comput. Vis., 2020, pp. 351-368.
S. Zou, X. Zuo, Y. Qian, S. Wang, C. Guo, C. Xu, M. Gong, and L. Cheng, ``Polarization human shape and pose dataset,'' 2020, arXiv:2004.14899.
Y. Burda, R. Grosse, and R. Salakhutdinov, ``Importance weighted autoencoders,'' 2015, arXiv:1509.00519.
Hyunsung Kim received the B.S. and M.S. degrees in electronics engineering from Sogang University, Seoul, South Korea, in 2020 and 2022, respectively. He is currently an Associate Researcher with LG Electronics, Seoul, South Korea. His current research interests include image processing, computer vision, human pose estimation, and deep learning.
Kyeongbo Kong received the B.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2015, and the M.S. and Ph.D. degrees in electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2017 and 2020, respectively. From 2020 to 2021, he worked as a Postdoctoral Fellow with the Department of Electrical Engineering, POSTECH, Pohang, South Korea. From 2021 to 2023, he was an Assistant Professor in the Media School at Pukyong National University, Busan. He is currently an Assistant Professor of Electronics Engineering at Pusan National University. His current research interests include image processing, computer vision, machine learning, and deep learning.
Joseph Kihoon Kim received the B.S. degree from Handong Global University, South Korea, in 2021, and the M.S. degree from Sogang University, South Korea, in 2023. He is currently an Associate Researcher with Samsung Electronics, Suwon, South Korea. His current research interests include computer vision and generative models.
James Lee received the B.S. degree in electronics engineering from Kookmin University, Seoul, South Korea, in 2021, and the M.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2023. He is currently an Associate Researcher with Samsung Electronics, Hwaseong, South Korea. His current research interests include computer vision and machine learning.
Geonho Cha received the B.S. and Ph.D. degrees from the School of Electrical Engineering
and Computer Science, Seoul National University, Korea, in 2013 and 2019, respectively.
In 2019 and 2020, he was a Postdoctoral Researcher in the same school. Currently,
he is an AI Researcher at NAVER Cloud, Seongnam, Korea. His research interests include
neural 3D representations, deformable models, computer vision, deep learning, pattern
recognition, and their applications.
Ho-Deok Jang received a B.S. degree in Electrical Engineering from Hongik University in 2017 and an M.S. degree in Electrical Engineering (Division of Future Vehicle) from the Korea Advanced Institute of Science and Technology (KAIST) in 2019. He is currently a Research Engineer of CLOVA AI at NAVER Cloud Corp., Seongnam. His research interests include object recognition, image/video processing, and deep learning.
Dongyoon Wee received his Bachelor of Science in 2008 from the School of Electrical Engineering and Computer Science at Seoul National University in Seoul, South Korea, and completed his Master's degree at the same institution in 2011. From 2011 to 2017, he worked as a research engineer at LS Cable and System, LG CNS, and Buzzvil, successively. In 2017, he began his role as a research engineer at Naver, Seongnam, South Korea, where he now leads a research team focused on advancing the fields of video understanding and 3D computer vision.
Suk-Ju Kang (Member, IEEE) received a B.S. degree in electronic engineering from Sogang
University, South Korea, in 2006, and a Ph.D. degree in electrical and computer engineering
from the Pohang University of Science and Technology, in 2011. From 2011 to 2012,
he was a Senior Researcher with LG Display, where he was a project leader for resolution
enhancement and multi-view 3D system projects. From 2012 to 2015, he was an Assistant
Professor of Electrical Engineering at Dong-A University, Busan. He is currently a
Professor of Electronic Engineering at Sogang University. He was a recipient of the
IEIE/IEEE Joint Award for Young IT Engineer of the Year, in 2019. His current research
interests include image analysis and enhancement, video processing, multimedia signal
processing, circuit design for display systems, and deep learning systems.