Enhanced Control of Human Motion Generation using Action-conditioned Transformer VAE
with Low-rank Factorization
Hyunsung Kim¹,†, Kyeongbo Kong²,†, Joseph Kihoon Kim³,†, James Lee³,†, Geonho Cha⁴, Ho-Deok Jang⁴, Dongyoon Wee⁴, Suk-Ju Kang⁵
¹ LG Electronics, Seoul, Korea (hs9767.kim@lge.com)
² Department of Electrical & Electronics Engineering, Pusan National University, Busan, Korea (kbkong@pusan.ac.kr)
³ Samsung Electronics, Gyeonggi-do, Korea ({joseph.kim, jims.lee}@samsung.com)
⁴ Naver, Gyeonggi-do, Korea ({geonho.cha, hodeok.jang, dongyoon.wee}@navercorp.com)
⁵ Department of Electronics Engineering, Sogang University, Seoul, Korea (sjkang@sogang.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Disentangled control, 3D human mesh generation, Latent space
1. Introduction
Recently, generative models that can be used to create human motions have been studied extensively [1,2]. Like other generative models, these synthesize realistic data from learned distributions. Since such models are probabilistic, controlling the generated human motions is a very challenging task. Some studies have attempted to control motions based on text [3,4], but it remains difficult to precisely control specific body parts.
This paper presents a novel approach to directly control human motions in the latent space of a generative model. We employ a transformer-based variational autoencoder (VAE) [2] to learn the sequence-level latent space of human motions. In the image domain, there are reported studies on controlling the content generated by generative models [5-10]; these use either supervised or unsupervised methods to find directions that change the content in semantically meaningful ways. Discovering meaningful directions with supervised methods has advantages, since it can drive the output of a generative model toward the intended direction. However, since human motions have high degrees of freedom, it is challenging to supervise learning accurately. Therefore, this paper focuses on an unsupervised method to control human motions.
There are three key challenges in the task of controlling human motions generated by the baseline model [2]: (1) entangled attribute vectors, (2) posterior collapse, and (3) a lack of diversity of motions in the dataset. The first challenge originates from the use of an unsupervised method. Since it is not feasible to explicitly define the attribute vectors to be modified, it is necessary to choose a semantically meaningful direction from the set of discovered directions. However, due to biases in the dataset, most attribute vectors are entangled, resulting in changes to unintended as well as intended body parts. Inspired by [10], low-rank decomposition with projection onto an orthogonal complement space is employed to encourage partwise control of human motions. By applying this method, it is possible to reflect specific intentions and control a particular body part independently of the other body parts.
The second problem derives from the employed generative model (i.e., the transformer VAE) [11]. Despite its good performance, posterior collapse, wherein the generative model exploits only a subset of the latent space [12], occurs owing to the transformer's complex structure. This posterior collapse induces a critical problem in low-rank factorization, which we bypass through simple scheduling schemes for the KL-term, i.e., the distance between a prior and its approximate posterior.
Lastly, it is observed that human action datasets such as UESTC and HumanAct12 comprise similar motions for each body part, which degrades the controllability of human actions. To increase the diversity of motions in human action datasets, a novel data augmentation method applicable to human motion data is proposed, which utilizes the differences between motion frames for each joint.
To demonstrate the effectiveness of the proposed solution, extensive experiments are
conducted on the UESTC and HumanAct12 datasets to assess the controllability of human
motions. Additionally, to explore the applicability of our proposed method in different
contexts, we applied it to the most similar existing model [13]. This comparison allows us to evaluate the performance and versatility of our approach
in a broader range of scenarios. For evaluation, metrics from [1] are adopted to measure the naturalness and diversity of the generated actions. Additionally,
a novel metric called the SC score is introduced to effectively measure specific control
of human body parts.
2. Related Work
We briefly review the background literature on the skinned multi-person linear (SMPL) model and the action-conditioned transformer VAE.
2.1 SMPL
The SMPL model [14] is a skinned, vertex-based model that describes various human body shapes and poses. The model is simple in the sense that it is computationally efficient while remaining visually realistic: it exhibits fewer artifacts and expresses more plausible poses than other models. It is also action-oriented, as it is designed for compatibility with existing 3D animation software. A specific task related to the SMPL model is human pose and shape estimation [15-17], which aims to estimate the two types of SMPL parameters, shape and pose, from a given image or video. In that task, finding the exact values of both shape and pose parameters is very important. However, since our task is to control a specific body part in a human action, we ignore the shape parameters $\boldsymbol{\beta}$ and use only the pose parameters $\boldsymbol{\theta}$ to generate human motion in this study.
The pose parameters from the original work [14] represent the rotation of each joint with respect to its parent in the kinematic tree. Each rotation is an axis-angle rotation represented in 3D. Twenty-three joints plus one global rotation give each person 72 pose parameters in total ($R_{t} \in \mathbb{R}^{24\times 3}$). In addition, the translation of the root joint, $D_{t} \in \mathbb{R}^{1\times 3}$, is added to capture the semantics of a given action, such as translations along the $x$, $y$, and $z$ axes. Recently, rotations have been represented in the 6D rotation representation [18]; following this work, we translate $R_{t}$ and $D_{t}$ into 6D. To summarize, the rotation matrix $R_{t} \in \mathbb{R}^{24\times 6}$ consists of the joint and global rotations of a human in the 6D representation, and $D_{t} \in \mathbb{R}^{1\times 6}$ is the translation in 6D. We refer to the combination of $R_{t}$ and $D_{t}$ as $P_{t} \in \mathbb{R}^{25\times 6}$. Since we assume 60 frames of $P_{t}$ per action ($\mathcal{A}$), an action can be expressed as $\mathcal{A} \in \mathbb{R}^{25\times 6\times 60}$.
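As a rough illustration of this parameterization, the following sketch (our own, using SciPy's rotation utilities; the paper does not specify an implementation) converts axis-angle joint rotations to the 6D representation of [18] and recovers the rotation matrices via Gram-Schmidt:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def axis_angle_to_6d(rotvecs):
    """Axis-angle (J, 3) -> 6D representation (J, 6) of [18]: the first two
    columns of each rotation matrix, flattened. The third column is implied
    by orthonormality, so no information is lost."""
    mats = Rotation.from_rotvec(rotvecs).as_matrix()        # (J, 3, 3)
    return mats[:, :, :2].reshape(-1, 6)

def sixd_to_matrix(d6):
    """6D (J, 6) -> rotation matrices (J, 3, 3) via Gram-Schmidt."""
    cols = d6.reshape(-1, 3, 2)
    b1 = cols[:, :, 0] / np.linalg.norm(cols[:, :, 0], axis=1, keepdims=True)
    a2 = cols[:, :, 1]
    b2 = a2 - np.sum(b1 * a2, axis=1, keepdims=True) * b1   # orthogonalize
    b2 /= np.linalg.norm(b2, axis=1, keepdims=True)
    b3 = np.cross(b1, b2)                                   # complete the right-handed basis
    return np.stack([b1, b2, b3], axis=-1)

# One frame: 23 joint rotations plus one global rotation, i.e. R_t as (24, 3).
pose_aa = 0.3 * np.random.randn(24, 3)
pose_6d = axis_angle_to_6d(pose_aa)                         # (24, 6)
assert np.allclose(sixd_to_matrix(pose_6d),
                   Rotation.from_rotvec(pose_aa).as_matrix(), atol=1e-6)
```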
2.2 Human Motion Generation
Previously, studies related to human motion generation were limited to the image domain, generating human body parts in an image [19] or transferring the appearance of a source person to a target pose [20]. Recently, several studies [2,21-25] have addressed human motion generation, whose purpose is to generate a range of plausible output motion sequences. We can summarize the corresponding research from two perspectives: architecture and multimodality.
Most human motion generation methods employ generative adversarial networks (GANs) and VAE models, whose basic architectures have been developed in several forms. A recurrent neural network (RNN) combined with a VAE model was used to generate high-quality human motions, as demonstrated by [22]. The creation of multiple frames of motion was achieved using a GAN architecture with a hierarchical generator that combines a convolutional neural network (CNN) and an RNN, as demonstrated in [21]. Previous works dealing with human motion in the image domain [23,24] also leverage GAN architectures: [23] proposes a GAN-based framework to transfer human motion from a source image to a target image, and [24] utilizes a GAN to generate plausible intermediate frames given start and end frames as input. The transformer [26], which is based on the attention mechanism, is a powerful approach for encoding sequences with long-range dependencies. Since the motion in each frame is highly correlated, it is important to take the temporal aspects of a motion sequence into account when generating plausible and natural human motions. Therefore, in the present study, we use a transformer-based VAE, as in [2].
There have been many efforts to generate and control human motions using various types of conditions, such as text and audio. Some studies [3,4] effectively produced diverse 3D human movements that were highly relevant to the textual descriptions. A decomposition-to-composition framework for synthesizing dancing movements from input music was proposed by [27]. A previous study has also generated a different form of output from the input: in [25], continuous motion capture data are received as input, and a different modality, Labanotation, is output. In the present work, we do not use any modalities except for the action classes for human motion control. Even without such modalities, we achieve a certain degree of control over the generated human motions.
2.3 Action-conditioned Transformer VAE
Recently, human motion synthesis [1,2] has drawn attention from several researchers. In particular, [2] employed a conditional VAE model to generate human actions. The overall process is described in Fig. 1; using the transformer architecture as the encoder, the model takes arbitrary-length sequences of poses and an action label as inputs to produce the distribution parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ of the motion latent space. Using the reparameterization trick, the model can sample a latent vector $\boldsymbol{z}$ from the attained distribution. Given the single latent vector $\boldsymbol{z}$ and an action label $a$, the decoder then synthesizes human motions from the learned distribution.
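The sampling step can be summarized by the standard reparameterization trick; the sketch below is a minimal illustration (the latent dimensionality of 256 is a placeholder of ours, not a detail reported here):

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps keeps the sampling step
    differentiable, so the encoder can be trained through it."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# The encoder maps a whole pose sequence plus an action label to a single
# sequence-level distribution (mu, Sigma).
mu, logvar = torch.zeros(1, 256), torch.zeros(1, 256)   # d_z = 256 is an assumption
z = sample_latent(mu, logvar)       # one latent vector for the whole clip
# decoder(z, action_label) would then synthesize the motion sequence.
```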
Fig. 1. Action-conditioned transformer VAE [2]: In the training stage, given a sequence of body poses $P_1, \ldots, P_T$, both the encoder and the decoder of the transformer-structured VAE are exploited. In the test stage, the transformer-VAE decoder generates the sequence $\hat{P}_1, \ldots, \hat{P}_T$ using the class token and latent vector with positional encodings (PE).
Fig. 2. Overall process of human motion control: To control the target body part (left arm), we first calculate the Jacobian matrices of the left arm and the remaining body parts before performing low-rank factorization and SVD, respectively. Next, we select an attribute vector $a_i$ among the target attribute vectors $A_r$ and project it onto the null space of the remaining body parts' attribute vectors $B_n$. Using the resulting attribute vector $n$, we can control the left arm while preserving the movements of the other parts of the human body.
2.4 Latent Space in Generative Models
Interpretation of the latent spaces of trained generative models is drawing attention not only for theoretical purposes but also for manipulating images (or any type of generated output). Searching for the interpretable directions of a latent space can be categorized into two groups based on the approach: supervised and unsupervised. Supervised methods [5,8] use off-the-shelf classifiers [5] or predefined edited images [8] as guidance for finding semantically meaningful directions. However, defining complex attributes in a given image is not a trivial task. For example, a simple transformation of an object in an image, such as horizontal shifting along the x-axis, is easy, whereas changing relatively complex attributes (e.g., gender, smile, and age) is not. This is the motivation for unsupervised methods [9,10,28-30], which approach the problem through statistical and mathematical analyses. In this study, we adopt an unsupervised method for finding the attribute vectors that allow us to control specific joints in the human body.
3. Preliminary
Before introducing our methods, we refer to the approach outlined in [10] to describe how the attribute vector $\boldsymbol{n}$ is identified, as detailed
below.
3.1 Low-rank Factorization
To achieve semantic manipulation of the synthesized sample $\boldsymbol{G}\left(\boldsymbol{z}\right)$, prior studies [5,8,31] have linearly shifted the latent code $\boldsymbol{z}\in \mathbb{R}^{d_{z}}$ in the direction of an attribute vector $\boldsymbol{n}\in \mathbb{R}^{d_{z}}$ as follows:

$\boldsymbol{z}'=\boldsymbol{z}+\alpha \boldsymbol{n}, \quad (1)$

where $\alpha$ is the editing strength. To find the attribute vector that significantly alters $\boldsymbol{G}\left(\boldsymbol{z}\right)$, we set up an optimization problem that maximizes the variance of the difference as follows:

$\boldsymbol{n}^{*}=\underset{\boldsymbol{n}:\,\boldsymbol{n}^{T}\boldsymbol{n}=1}{\arg \max }\left\| \boldsymbol{G}\left(\boldsymbol{z}+\alpha \boldsymbol{n}\right)-\boldsymbol{G}\left(\boldsymbol{z}\right)\right\| _{2}^{2}, \quad (2)$

where $\boldsymbol{G}\left(\boldsymbol{z}+\alpha \boldsymbol{n}\right)=\boldsymbol{G}\left(\boldsymbol{z}\right)+\alpha \boldsymbol{J}_{\boldsymbol{z}}\boldsymbol{n}+o\left(\alpha \right)$ by the first-order Taylor series approximation, and $\boldsymbol{J}_{\boldsymbol{z}}$ is the Jacobian matrix of the generator $\boldsymbol{G}\left(\cdot \right)$ with respect to $\boldsymbol{z}$. We can find $\boldsymbol{n}$ by solving (2) in closed form: the solution is the eigenvector of the matrix $\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}$ with the largest eigenvalue.
A previous study [10] has suggested that $\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}$ is a degenerate matrix perturbed with noise, so it can be decomposed into a low-rank matrix and a sparse matrix. This motivates the following low-rank factorization:

$\min _{\boldsymbol{L},\boldsymbol{S}}\left\| \boldsymbol{L}\right\| _{*}+\lambda \left\| \boldsymbol{S}\right\| _{1},\quad \mathrm{s.t.}\quad \boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}=\boldsymbol{L}+\boldsymbol{S}, \quad (3)$

where $\left\| \boldsymbol{M}\right\| _{*}=\sum _{i}\sigma _{i}\left(\boldsymbol{M}\right)$ is the nuclear norm of a matrix $\boldsymbol{M}$, defined as the sum of all of its singular values, $\left\| \boldsymbol{M}\right\| _{1}=\sum _{ij}\left| \boldsymbol{M}_{ij}\right|$, and $\lambda$ is a parameter that balances the low-rank matrix $\boldsymbol{L}$ and the sparse matrix $\boldsymbol{S}$. This low-rank factorization can be solved by the alternating directions method of multipliers [32]. Then, $\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}$ can be decomposed as follows:

$\boldsymbol{J}_{\boldsymbol{z}}^{T}\boldsymbol{J}_{\boldsymbol{z}}=\boldsymbol{L}^{*}+\boldsymbol{S}^{*}. \quad (4)$

Using the singular value decomposition (SVD) of the matrix $\boldsymbol{L}^{*}$, we obtain the attribute vectors as the right singular vectors $\boldsymbol{V}$ of the SVD.
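For illustration, the decomposition of Eq. (3) can be computed with a standard robust-PCA-style ADMM loop; the following is a simplified sketch of ours (the defaults for $\lambda$ and the step size $\mu$ follow common robust PCA practice, and the Jacobian is a random placeholder):

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding, used for the sparse term S."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding, used for the low-rank term L."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def low_rank_factorize(M, lam=None, mu=None, n_iter=200):
    """Solve Eq. (3), M = L + S, with an inexact ALM / ADMM loop [32]."""
    lam = lam if lam is not None else 1.0 / np.sqrt(max(M.shape))
    mu = mu if mu is not None else M.size / (4.0 * np.abs(M).sum())
    L, S, Y = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)       # update the low-rank part
        S = shrink(M - L + Y / mu, lam / mu)    # update the sparse part
        Y = Y + mu * (M - L - S)                # dual ascent on the constraint
    return L, S

d_z = 256                                       # assumed latent dimensionality
J_part = np.random.randn(6 * 60, d_z)           # placeholder Jacobian of one body part
L_star, S_star = low_rank_factorize(J_part.T @ J_part)
_, _, Vt = np.linalg.svd(L_star)
attribute_vectors = Vt.T                        # columns a_1, ..., a_{d_z} from Eq. (4)
```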
4. Proposed Method
In this section, we present solutions that primarily address the following three key challenges arising in the control of human motions via the latent space of a conditional VAE: (1) entangled attribute vectors, (2) posterior collapse, and (3) a lack of diversity of motions in the datasets.
4.1 Separate Control of The Body Parts
Our objective is to control one part of the human body (e.g., an arm) while preserving the movements of the other parts (e.g., a leg). To do this, we utilize null space projection [10]. By projecting the attribute vector of a target body part onto the null space of the remaining body parts' attribute vectors, we can move the intended body part independently of the others, a concept referred to as disentangled control. This can be formulated as follows.
From Eq. (4), let $V_{target}$ and $V_{other}$ be the right singular matrices resulting from the SVD of the low-rank matrices $L_{target}$ and $L_{other}$ of the target and the other body parts, respectively. The matrices $L_{target}$ and $L_{other}$ can be calculated from the Jacobian matrices obtained by differentiating only the outputs associated with each body part. In addition, $r_{target}$ and $r_{other}$ denote the ranks of $L_{target}$ and $L_{other}$, respectively. Then, we can express $V_{target}=\left[A_{r},\,A_{n}\right]$, where $A_{r}=\left[a_{1},\ldots ,a_{r_{target}}\right]$ and $A_{n}=\left[a_{r_{target}+1},\ldots ,a_{d_{z}}\right]$, and $V_{other}=\left[B_{r},\,B_{n}\right]$, where $B_{r}=\left[b_{1},\ldots ,b_{r_{other}}\right]$ and $B_{n}=\left[b_{r_{other}+1},\ldots ,b_{d_{z}}\right]$. $B_{n}$ can be interpreted as a set of vectors that have little effect on the other body parts. Hence, we can achieve partwise controllability by projecting one attribute vector $a_{i}$ onto the orthogonal complement of $B_{r}$, which is formulated as

$\boldsymbol{n}=\left(\boldsymbol{I}-B_{r}B_{r}^{T}\right)a_{i}. \quad (5)$
We can easily see that $B_{n}$ spans the null space of $L_{other}$. Since each joint in the SMPL model parameters can be handled by indexing, we can compute the Jacobians of specific body parts. This enables us to compute the low-rank matrices and attribute vectors for the target and remaining parts without extra effort, for instance, by using binary masks.
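A minimal sketch of this projection, assuming $L_{other}$ is available as a $d_{z}\times d_{z}$ matrix and identifying $B_{n}$ by the near-zero singular values of its SVD (the tolerance is an assumption of ours):

```python
import numpy as np

def project_to_null_space(a_i, L_other, rel_tol=1e-6):
    """Project a target attribute vector onto the null space of L_other, as in Eq. (5).

    B_n, the right singular vectors of L_other with (near-)zero singular values,
    spans directions that barely move the other body parts; keeping only the
    component of a_i inside span(B_n) therefore edits the target part while
    leaving the rest of the body almost unchanged. Assumes L_other is
    rank-deficient, so that the null space is non-trivial.
    """
    _, s, Vt = np.linalg.svd(L_other)
    B_n = Vt[s < rel_tol * s.max()].T       # null-space basis, (d_z, d_z - r_other)
    n = B_n @ (B_n.T @ a_i)                 # equivalently (I - B_r B_r^T) a_i
    return n / np.linalg.norm(n)
```

The edited latent code is then $z + \alpha n$, as in Eq. (1).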
4.2 Posterior Collapse and Low-rank Factorization
Although the transformer-based VAE shows excellent performance in human motion synthesis,
because of its flexible structure, it is vulnerable to posterior collapse, where the
generative model only exploits a subset of the latent variables [12]. When the latent space is reduced, the capacity of the generative model is reduced,
thereby degrading its performance. However, posterior collapse induces a more critical
problem in low-rank factorization. Since the trained latent vector is sparse, some
information included in the Jacobian matrix may have sparse form; this information
may then be wrongly included in matrix $S$ employed to extract sparse noise during
low-rank factorization; hence the discrepancy of information between the low-rank
matrix $L$ and $J_{z}^{T}J_{z}$ increases. Therefore, it is essential to mitigate
posterior collapse for effective action control.
There are numerous strategies to circumvent posterior collapse. We can categorize them into three groups: modifying the variational inference objective [33], limiting the capacity of the decoder [34-36], and designing an optimization scheme for training the VAE [37-39]. We adopt sigmoid and cyclical annealing schedules [40,41], widely used optimization schemes that schedule the KL-term weighting hyperparameter.
The VAE objective, which is called the evidence lower bound (ELBO), can be expressed as follows:

$\mathcal{L}(\theta ,\phi ;x)=\mathbb{E}_{q_{\phi }(z|x)}\left[\log p_{\theta }(x|z)\right]-\beta \,D_{KL}\left(q_{\phi }(z|x)\,\|\,p(z)\right). \quad (6)$

The KL-term in Eq. (6) is the distance between a prior and its approximate posterior. To maximize the ELBO, the model attempts to minimize the KL-term. Therefore, if $\beta$ is large in the early stage of training, when the decoder is immature and the KL-term is relatively large, the model ignores the posterior, which leads to posterior collapse. Thus, a small $\beta$ can be used at the beginning of training to ensure diversity in generation and then gradually increased according to the scheduling scheme.
Generally, the number of active units (AU) is one indicator of the effect of KL-term scheduling; it tends to be small when posterior collapse occurs. We describe the effects of KL-term scheduling in the experimental section.
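In implementation terms, the scheduled objective simply weights the KL term of the negative ELBO in Eq. (6) by the current $\beta$; the sketch below is a minimal illustration (the MSE reconstruction term is an assumption, chosen only for concreteness):

```python
import torch
import torch.nn.functional as F

def beta_elbo_loss(x, x_hat, mu, logvar, beta):
    """Negative ELBO of Eq. (6) with a scheduled KL weight beta.

    Keeping beta small early in training prevents the KL term from dominating
    while the decoder is still immature; beta then grows toward its upper
    bound according to the annealing schedule."""
    rec = F.mse_loss(x_hat, x)                                      # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x) || N(0, I))
    return rec + beta * kl
```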
4.3 Data Augmentation
We use data augmentation to achieve higher controllability over the human actions, based on observations of the datasets. After inspecting the human action dataset UESTC [42], we observed that many motions of each body part in the dataset are similar to those of the others, which implies that the angles between the joints are bounded by the homogeneity of the dataset. We also observed that most of the motions are concentrated in the arms rather than the other body parts; this is because the 40 classes of the UESTC dataset are aerobic exercises, which essentially involve arm movements. Based on these observations, data augmentation was applied to the dataset by increasing the change rate of the motions. We obtain the change rate by calculating the differences between parameters in adjacent frames. Then, by multiplying by a number between 1 and 1.5, we obtain a motion instance with an increased range of movement. Fig. 3 shows that the range of motions of each part increased after augmentation of the UESTC dataset.
The detailed implementation of the data augmentation is as follows. First, we convert the parameters given in the 6D representation of [18] to Euler angles; the reason for this conversion is to ensure the interpretability of the parameters. After conversion, since we have 60 sets of parameters $\left(\theta \right)$, which can be represented as $\mathcal{A}'\in \mathbb{R}^{25\times 3\times 60}$, we subtract the parameters of each frame from those of the following frame. This gives the angle differences between frames, which can be considered the motion between the frames. After subtraction, we multiply the acquired angle differences by a number between 1 and 1.5; this step is the main point of the data augmentation, allowing more active motions. In the final step, by converting the Euler representations back to the 6D representation, we can handle the data in its original form. For more information on the 6D representation, please refer to [18]. Fig. 4 illustrates the proposed data augmentation method, and a minimal sketch is given below.
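The following is a minimal sketch of the augmentation step, assuming the clip has already been converted to Euler angles as $\mathcal{A}'\in \mathbb{R}^{25\times 3\times 60}$ (the 6D-to-Euler conversion itself is omitted here):

```python
import numpy as np

def augment_motion(A_euler, rng=None):
    """Exaggerate one motion clip A' of shape (25, 3, 60) in Euler-angle form.

    Frame-to-frame angle differences approximate the motion between frames;
    scaling them by alpha ~ U(1, 1.5) and re-integrating with a cumulative sum
    yields a clip with a wider range of movement, anchored at the first frame.
    """
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(1.0, 1.5)
    diffs = np.diff(A_euler, axis=-1)                   # (25, 3, 59) inter-frame motion
    start = A_euler[..., :1]                            # keep the first frame fixed
    rest = start + np.cumsum(alpha * diffs, axis=-1)    # re-integrate the scaled motion
    return np.concatenate([start, rest], axis=-1)       # back to (25, 3, 60)

A_prime = 0.1 * np.random.randn(25, 3, 60)              # placeholder clip
A_aug = augment_motion(A_prime)
```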
Fig. 3. Effects of data augmentation: The graphs show the output differences of each joint when the latent vectors move along the eigenvectors of $J^TJ$. We use the eigenvectors corresponding to the top-$k$ eigenvalues. The solid line is the mean value of the difference across the $k$ values, and the colored region spans ±1σ ($k$=7). The range of motions of each part is observed to increase after applying data augmentation.
Fig. 4. Data augmentation applied for more active actions: The figure illustrates data augmentation applied to human actions to capture more active actions than those available in the dataset. After calculating the differences between adjacent frames, we multiply by α, a value uniformly distributed between 1 and 1.5, to generate more exaggerated actions. $\mathcal{A}'$ stands for the human action in the Euler angle representation, and $\mathcal{A}'_{aug}$ is its augmented counterpart.
5. Experiments
In this section, the datasets, implementation details, and performance measures for
the experiments are introduced (Section 5.1). Next, an ablation study is presented
(Section 5.2), and visualization is provided (Section 5.3). Finally, the performance
of the method in the unconditional setting is shown (Section 5.4).
5.1 Experimental Setup
a: Dataset
Experiments are performed on two datasets, UESTC [42] and HumanAct12 [1], which are postprocessed as in [2] for the 3D human motion generation task. The UESTC dataset is a large-scale RGB-D action dataset that covers the entire 360° viewing angle and consists of 40 action categories.
Among the 25K video sequences of UESTC, following [2], about 10,650 sequences were used for training and the remaining 13,350 sequences for testing, as per the official cross-subject protocol of [42]. The HumanAct12 dataset is adapted from the existing PHSPD dataset [43,44] and is composed of 1,191 videos with 12 action categories.
b: Implementation details
For a fair comparison, training and testing were conducted under the same configuration as [2], which does not schedule the weight of the KL term ($\beta$ in Eq. (6)) during training. Sigmoid annealing, as in [40], defines the weighting factor of the KL term as $\beta_{n}=u\cdot \left(1/\left(1+e^{-kn+b}\right)\right)$, where $u$ is an upper bound for the KL-term weight, $n$ is the training step, and $k$ and $b$ are parameters that control the rate of weight change. We set $k$ and $b$ to 5e-6 and 13.5, respectively, and $u$ was set to 2e-6. Cyclical annealing, as conducted in [41], defines the KL-term weight as

$\beta_{t}=\begin{cases}u\cdot f\left(\tau \right), & \tau \leq R\\ u, & \tau >R\end{cases},\quad \tau =\frac{\mathrm{mod}\left(t-1,\,\lceil T/M\rceil \right)}{T/M}, \quad (7)$

where $t$ is the iteration number, $T$ is the total number of training iterations, $M$ is the number of cycles, $R$ is the proportion of each cycle used to increase the weight, $f$ is a monotonically increasing function, and $u$ is again the upper bound. We set $M$ and $R$ to 4 and 0.5, respectively, and $u$ was set to 1e-6.
Experiments were performed to select the appropriate hyperparameters for sigmoid annealing [40]. Three hyperparameters are used in this annealing scheme: $k$, $b$, and $u$. Here, $k$ and $b$ control the rate of the KL-term weight change, and $u$ is the upper bound of the weight value. Since the final performance of the network was mainly affected by the value of $u$, experiments were conducted by varying only $u$. The results are shown in Table 2; the values of $k$ and $b$ were set so that the KL-term weight reaches half of the maximum value $u$ at the midpoint of the training process, as shown in Fig. 5. For the value of $u$, we experimented with two values, 5e-6 and 1e-6. As seen, the Fréchet inception distance (FID) score is best when 5e-6 is used, whereas the AU is largest when 1e-6 is used. Although the model using 1e-6 could be regarded as the best at circumventing posterior collapse, its unacceptable FID score indicates that it does not guarantee the quality of the generated human motion, which is a fundamental requirement for a generative model. Accordingly, the value of 5e-6 is adopted in this work.
Fig. 5. KL-term weight after applying sigmoid annealing.
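For concreteness, both schedules can be written as small functions of the training step; the cyclical variant below encodes our reading of Eq. (7) with a linear ramp $f$, and the default arguments mirror the hyperparameters reported above:

```python
import numpy as np

def sigmoid_beta(n, k=5e-6, b=13.5, u=5e-6):
    """Sigmoid annealing [40]: beta_n = u / (1 + exp(-k*n + b)) at training step n."""
    return u / (1.0 + np.exp(-k * n + b))

def cyclical_beta(t, T, M=4, R=0.5, u=1e-6):
    """Cyclical annealing [41]: within each of M cycles, beta ramps linearly
    from 0 to u over the first R fraction of the cycle, then stays at u."""
    cycle_len = np.ceil(T / M)
    tau = ((t - 1) % cycle_len) / cycle_len
    return u * min(tau / R, 1.0)
```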
c: Evaluation metrics
In accordance with previous studies [1,2], the FID score is adopted on the training and test datasets to measure the quality and diversity of the motion generation. To extract the motion features, we used the action recognition models from [1]. As discussed in Section 4.2, the AU [45] is employed to estimate the degree of posterior collapse. The activity of a latent dimension is measured as $A_{z}=\mathrm{Cov}\left(\mathbb{E}_{z\sim q\left(z|x\right)}\left[z\right]\right)$, and a dimension is regarded as active if $A_{z}>0.001$ in this study.
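Computing the AU then reduces to thresholding the per-dimension variance of the posterior means over a batch of inputs; a minimal sketch (the batch shape is an assumption):

```python
import numpy as np

def active_units(mu_batch, threshold=1e-3):
    """Count the active latent dimensions (AU) [45].

    mu_batch holds the posterior means E_{z~q(z|x)}[z] for a batch of inputs,
    shape (N, d_z). A dimension is active if its mean varies across inputs,
    i.e. if A_z exceeds the threshold (0.001 in this paper)."""
    A_z = np.var(mu_batch, axis=0)      # per-dimension variance over inputs
    return int(np.sum(A_z > threshold))
```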
The term $J_{z}^{T}J_{z}$ refers to the product of the Jacobian matrix of the generator with respect to $z$ and its transpose. The matrix $L$ can be interpreted as the noise-removed, low-rank representation of $J_{z}^{T}J_{z}$, as derived in Eq. (4). Given an attribute vector $n$ that targets a specific body part, the value of $n^{T}L_{target}n$ should be large, whereas the value of $n^{T}L_{other}n$ should be small due to the null space projection.
Furthermore, the attribute vector $n$, derived from $L$, should induce a change in $J_{z}^{T}J_{z}$ toward the desired attribute direction. This assumption holds if $L$ is a properly decomposed, noise-removed representation of $J_{z}^{T}J_{z}$. Consequently, as illustrated in Eqs. (9) and (10), the numerators encapsulate the ratios representing the separate control of the human body with respect to $J_{z}^{T}J_{z}$ and $L$, respectively.
By employing the SC score, it is possible to systematically evaluate the partwise control of human motions, providing a robust framework for analyzing the effectiveness of the method.
The difference between $SC_{J_{z}^{T}{J_{z}}}$ and $SC_{L}$ can be interpreted as the discrepancy between $J^{T}J$ and $L$ when moving along the target attribute vector $n$ obtained by applying SVD to $L$. In summary, both the intra-relationship (the numerator of each equation), which gives an intuition for the separate control of a body part, and the inter-relationship, which gives an intuition for the relevance between $J^{T}J$ and the low-rank matrix $L$, should be evaluated.
5.2 Ablation Study
a: KL-term scheduling
This section validates the effectiveness of the KL-term annealing schemes used herein. As discussed in Section 4.2, we applied sigmoid annealing [40] and cyclical annealing [41] to the KL term to avoid posterior collapse, and we employed the AU to observe the number of active dimensions among the total number of dimensions in the latent vector. As seen in Table 1, when cyclical annealing is applied to the UESTC dataset, the AU increases to 3.3 times that of the original; this means that, when trained with cyclical annealing, the generation model uses 3.3 times more latent variables. On the other hand, when sigmoid annealing is applied, the AU increases to 4.1 times that of the original, meaning that sigmoid annealing is more effective than cyclical annealing, as is clear in Fig. 6. The three graphs in Fig. 6 show how active the dimensions of the latent variables are in ACTOR [2], ACTORcycle, and ACTORsigmoid. There are more active units in ACTORsigmoid than in ACTORcycle, which means that sigmoid scheduling is more effective for mitigating posterior collapse.
Fig. 6. Active unit variance: Each graph describes $A_z$ for all dimensions of the latent space in three ACTOR [2] variants: ACTOR, ACTORcycle, and ACTORsigmoid. Both sigmoid and cyclical annealing are effective for mitigating posterior collapse.
The results in Table 1 show that both types of SC scores ($SC_{{J^{T}}J}$ and $SC_{L}$) increase under either scheduling scheme compared to the baseline ACTOR model. This means that the partwise disentangled controllability increases. The baseline model, ACTOR†, even exhibits a negative value for $SC_{{J^{T}}J}$, which means that the discovered attribute vector $n$ is highly entangled with the Jacobian matrix $J_{other}$ of the other body parts. In addition, note that the difference between $SC_{{J^{T}}J}$ and $SC_{L}$, which we call the inter-relationship, decreases as the AU increases. This indicates that posterior collapse induces more critical problems in low-rank factorization.
For comparison, sigmoid annealing [40], which was shown on the UESTC dataset to have positive effects on mitigating posterior collapse, was also applied to the KL term on HumanAct12. These results are shown in Table 3. As in the case of UESTC, the AU of HumanAct12 increased to 5.4 times that of the baseline model, which means that the number of active dimensions in the latent space increased. In Section 5.1, we suggested new metrics, the SC scores ($SC_{{J^{T}}J}$ and $SC_{L}$). Since $SC_{{J^{T}}J}$ and $SC_{L}$ increased and the gap between them was reduced, we conclude that the control of a body part is separated to a greater extent and that the relevance between $J^{T}J$ and $L$ increases. Therefore, our method allows us to control actions in a partwise manner more easily than the na\"{i}ve method on the HumanAct12 [1] and UESTC [42] datasets.
b: Data augmentation
This subsection presents the quantitative results of our method. First, we measured the quality of the generated motion sequences using the FID score between the feature distributions of generated and real motions. Table 1 shows the correlation between data augmentation and the realistic quality of the synthesized motions. Comparing the baseline model (ACTOR) with and without data augmentation, we conclude that more realistic motion sequences can be generated with data augmentation. In addition, the model can perceive a wider range of motions compared to na\"{i}ve training.
Table 1. Comparison of ACTOR variants on the UESTC dataset. † is quoted from [2].

| Method | FIDtr↓ | FIDtest↓ | Acc.↑ | AU↑ | SC$_{J^{T}J}$↑ | SC$_L$↑ | Multimod.→ |
|---|---|---|---|---|---|---|---|
| Real† | 2.93±0.26 | 2.79±0.29 | 98.8±0.1 | - | - | - | 14.16±0.06 |
| ACTOR† | 0.12±0.00 | 2.79±0.29 | 2.79±0.29 | 10 | 0.23 | -0.35 | 14.66±0.03 |
| ACTORaug | 2.79±0.29 | 2.79±0.29 | - | 16 | - | - | - |
| ACTORsigmoid | 14.32±1.19 | 18.25±1.38 | 92.0±0.55 | 41 | 0.30 | 0.39 | 15.08±0.10 |
| ACTORcycle | 23.38±2.14 | 26.04±3.45 | 80.0±0.96 | 33 | 0.29 | 0.00 | 17.73±0.10 |
Table 2. Experiment for the hyper-parameter of KL-term annealing on the UESTC dataset. † is quoted from [2].

| Method | FIDtr↓ | FIDtest↓ | AU↑ |
|---|---|---|---|
| ACTOR† | 20.49±2.31 | 23.43±2.20 | 10 |
| ACTORsigmoid (5e-6) | 14.32±1.19 | 18.25±1.38 | 41 |
| ACTORsigmoid (1e-6) | 20.49±2.31 | 20.49±2.31 | 76 |
Table 3. Comparison of ACTOR variants on the HumanAct12 dataset. † is quoted from [2].

| Method | FIDtr↓ | AU↑ | SC$_{J^{T}J}$↑ | SC$_L$↑ |
|---|---|---|---|---|
| Real | 0.02±0.00 | - | - | - |
| ACTOR | 0.12±0.00 | 14 | 0.00 | -0.18 |
| ACTORsigmoid | 0.16±0.00 | 75 | 0.30 | 0.39 |
5.3 Conditional Generation
Fig. 7 depicts the effectiveness of our method in terms of motion control. For two different classes, the method was applied to control an arm part and a leg part, respectively. The first two rows of Fig. 7 show the results of arm control, and the last two rows show the results of leg control. For arm control, the elbow joints of the controlled output are more stretched than those of the original output; this occurs not only in the red-circled frames but also in the other frames of Fig. 7. For leg control, the right knee joint of the controlled output is raised higher than that of the original output. Most importantly, even after each target body part is manipulated, the other body parts maintain their original motions.
More qualitative results on the UESTC and HumanAct12 datasets are also shown. Figs. 10 and 11 show examples of controlling arm parts and leg parts, respectively, on the UESTC dataset. Figs. 12 and 13 show examples of controlling arm parts and leg parts, respectively, on the HumanAct12 dataset.
Figs. 10 and 12 show that an arm can be controlled by finding the related latent vector $n_{arm}$. In the first two rows of Fig. 10, when $n_{arm}$ is applied to action class 9, the elbow of the human figure is bent compared to the original class 16 human figure. Likewise, by applying the latent direction vector for the elbow, the same effect as in class 9 is applied to class 35. For classes 10 and 14 in the subsequent rows, the human body model raises its arms higher than in the original action class. Fig. 12 shows the effect written below each class name applied to the HumanAct12 dataset. Figs. 11 and 13 show the effects of the latent vector related to the legs of the human body model on the UESTC and HumanAct12 datasets, respectively.
To further assess the generalization capability of the proposed method, we applied it to another generative model, described in [13]. This evaluation aims to verify the robustness and adaptability of the proposed method across different generative frameworks. Fig. 8 shows the vector $n_{arm}$ applied to the jumping class.
Fig. 7. Qualitative results for the conditional setting: The result of controlling only an arm in the action of the first row is shown in the second row. Likewise, the result of controlling only a leg in the action of the third row is shown in the fourth row.
Fig. 8. Qualitative results in the conditional setting: These actions are generated under the conditional setting of the work [13]. The first row is the original class, jumping. The second row shows the result of applying the arm-moving vector to the first row. The format of the human model follows the work [13].
Fig. 9. Qualitative results in the unconditional setting: These actions are generated under the unconditional setting. The first two rows are generated from the inputs of two different classes. The action in the third row is generated from the average latent vector of the first- and second-row actions. Only the arm part of the third action is controlled, as shown in the fourth row.
Fig. 10. Qualitative results on the UESTC dataset (arm-control).
Fig. 11. Qualitative results on the UESTC dataset (leg-control).
Fig. 12. Qualitative results on the HumanAct12 dataset (arm-control).
Fig. 13. Qualitative results on the HumanAct12 dataset (leg-control).
Fig. 14. Qualitative results on the UESTC dataset (Unconditional).
5.4 Unconditional Generation
The strongest advantage of generating 3D human motions lies in generating diverse and human-like motion sequences. To this end, the methodology was expanded from a class-conditional setting to an unconditional setting to generate new actions between two classes and to observe how these actions can be controlled. This is visualized in Fig. 9. The first two rows of the figure show motion sequences from two different classes. The third row represents the action generated from the latent vector interpolated between the latent vectors of classes 3 and 18. Since the generated motion in the third row includes both the sitting motion of class 3 (first row) and the arm-circling motion of class 18 (second row), we conclude that semantic interpolation in the latent space is possible. Further, arm control was implemented using the interpolated latent vector, and the result is shown in the fourth row of Fig. 9. Unlike the action in the third row, the action in the fourth row bends its elbows more while restraining the other joints from moving. This implies that arm control can also be applied to the interpolated latent vector. Thus, our method can generate far more diverse actions under the unconditional setting. Fig. 14 shows more results from unconditional generation.
Since generated actions are composed of frames, static images alone are insufficient to show their quality; hence, please refer to our project page for videos: https://josephkkim.github.io/Motion_Control/
6. Conclusion
This study demonstrated methods for directly controlling motions in the latent space
of the human action generative model. To improve controllability, we employed three
techniques: (1) attribute vector projection, (2) mitigating posterior collapse, and
(3) data augmentation. As a result, we achieved control of a target body part while
preserving the movements of the remaining body parts. In particular, we discovered
that as the number of activated dimensions in the latent vector increased, the controllability
also increased, meaning that posterior collapse and controllability are closely related.
In the class-conditional VAE, various controls were difficult because the exploited latent space was too small. While previous research has concentrated on generating diverse representations
of human motion, this paper addresses the problem of motion generation through the
exploration of the latent space, allowing for the modification of outputs from generative
models. In recent studies, for text and image generative models, prompt engineering
has been employed to produce desirable outputs. However, our approach is novel in
that it directly explores and exploits the latent space. In our future work, we intend
to explore further action control of unconditional generative models with large-scale
datasets. Furthermore, the ability to control outputs through direct access to the
latent space is not only significant for human action generative models but also holds
important implications for users of generative models across various modalities. This
approach enables users to gain more precise control over the outputs of diverse generative
models.
ACKNOWLEDGMENTS
This work was supported by the MSIT (Ministry of Science and ICT), Korea, under the Graduate School of Metaverse Convergence support program (IITP-RS-2022-00156318) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00414230).
REFERENCES
C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, ``Action2Motion: Conditioned generation of 3D human motions,'' in Proc. ACM Int'l Conf. Multimedia, 2020, pp. 2021-2029.
M. Petrovich, M. J. Black, and G. Varol, ``Action-conditioned 3D human motion synthesis with transformer VAE,'' in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 10985-10995.
G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, ``MotionCLIP: Exposing human motion generation to CLIP space,'' 2022, arXiv:2203.08063.
M. Petrovich, M. J. Black, and G. Varol, ``TEMOS: Generating diverse human motions from textual descriptions,'' 2022, arXiv:2204.14109.
L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola, ``GANalyze: Toward visual definitions of cognitive image properties,'' in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5744-5753.
Z. Chen and N. Chen, ``Children's football action recognition based on LSTM and a V-DBN,'' IEIE Transactions on Smart Processing & Computing, vol. 12, no. 4, pp. 312-322, 2023.
Y. Shi, ``Image recognition of skeletal action for online physical education class based on convolutional neural network,'' IEIE Transactions on Smart Processing & Computing, vol. 12, no. 1, pp. 55-63, 2023.
A. Jahanian, L. Chai, and P. Isola, ``On the `steerability' of generative adversarial networks,'' in Proc. Int. Conf. Learn. Represent., 2019.
Y. Shen and B. Zhou, ``Closed-form factorization of latent semantics in GANs,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1532-1540.
J. Zhu, R. Feng, Y. Shen, D. Zhao, Z.-J. Zha, J. Zhou, and Q. Chen, ``Low-rank subspaces in GANs,'' in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 16648-16658.
D. P. Kingma and M. Welling, ``Auto-encoding variational Bayes,'' 2013, arXiv:1312.6114.
B. Dai, Z. Wang, and D. Wipf, ``The usual suspects? Reassessing blame for VAE posterior collapse,'' in International Conference on Machine Learning. PMLR, 2020, pp. 2313-2322.
Q. Lu, Y. Zhang, M. Lu, and V. Roychowdhury, ``Action-conditioned on-demand motion generation,'' in Proc. ACM Int'l Conf. Multimedia, 2022, pp. 2249-2257.
M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, ``SMPL: A skinned multi-person linear model,'' in ACM Transactions on Graphics, 2015, pp. 1-16.
Q. Sun, Y. Xiao, J. Zhang, S. Zhou, C.-S. Leung, and X. Su, ``A local correspondence-aware hybrid CNN-GCN model for single-image human body reconstruction,'' IEEE Transactions on Multimedia, 2022.
Y. Sun, L. Xu, Q. Bao, W. Liu, W. Gao, and Y. Fu, ``Learning monocular regression
of 3d people in crowds via scene-aware blending and deocclusion,'' IEEE Transactions
on Multimedia, 2023.
H. Zhang, Y. Meng, Y. Zhao, X. Qian, Y. Qiao, X. Yang, and Y. Zheng, ``3D human pose and shape reconstruction from videos via confidence-aware temporal feature aggregation,'' IEEE Transactions on Multimedia, 2022.
Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, ``On the continuity of rotation representations
in neural networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp.
5745-5753.
H. Tang and N. Sebe, ``Total Generate: Cycle in cycle generative adversarial networks for generating human faces, hands, bodies, and natural scenes,'' IEEE Transactions on Multimedia, vol. 24, pp. 2963-2974, 2021.
L. Ma, K. Huang, D. Wei, Z.-Y. Ming, and H. Shen, ``FDA-GAN: Flow-based dual attention GAN for human pose transfer,'' IEEE Transactions on Multimedia, 2021.
X. Lin and M. R. Amer, ``Human motion modeling using DVGANs,'' 2018, arXiv:1804.10652.
I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura, ``A recurrent variational autoencoder for human motion synthesis,'' in Proc. British Mach. Vis. Conf., 2017.
F. Ma, G. Xia, and Q. Liu, ``Spatial consistency constrained GAN for human motion transfer,'' IEEE Trans. Circuits Syst. Video Technol., 2021.
S. Wen, W. Liu, Y. Yang, T. Huang, and Z. Zeng, ``Generating realistic videos from keyframes with concatenated GANs,'' IEEE Trans. Circuits Syst. Video Technol., 2018.
N. Xie, Z. Miao, X.-P. Zhang, W. Xu, M. Li, and J. Wang, ``Sequential gesture learning for continuous Labanotation generation based on the fusion of graph neural networks,'' IEEE Trans. Circuits Syst. Video Technol., 2021.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \L{}. Kaiser,
and I. Polosukhin, ``Attention is all you need,'' Proc. Adv. Neural Inf. Process.
Syst., 2017.
H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, and J. Kautz, ``Dancing to music,'' Proc. Adv. Neural Inf. Process. Syst., 2019.
E. H\"{a}rk\"{o}nen, A. Hertzmann, J. Lehtinen, and S. Paris, ``GANSpace: Discovering interpretable GAN controls,'' Proc. Adv. Neural Inf. Process. Syst., pp. 9841-9850, 2020.
Y. Wei, Y. Shi, X. Liu, Z. Ji, Y. Gao, Z. Wu, and W. Zuo, ``Orthogonal Jacobian regularization for unsupervised disentanglement in image generation,'' in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 6721-6730.
V. Abrol, P. Sharma, and A. Patra, ``Improving generative modelling in VAEs using multimodal prior,'' IEEE Transactions on Multimedia, vol. 23, pp. 2153-2161, 2020.
Y. Shen, J. Gu, X. Tang, and B. Zhou, ``Interpreting the latent space of GANs for semantic face editing,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9243-9252.
Z. Lin, M. Chen, and Y. Ma, ``The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,'' 2010, arXiv:1009.5055.
A. Razavi, A. van den Oord, B. Poole, and O. Vinyals, ``Preventing posterior collapse with delta-VAEs,'' 2019, arXiv:1901.03416.
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, ``Generating sentences from a continuous space,'' 2015, arXiv:1511.06349.
I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, ``PixelVAE: A latent variable model for natural images,'' 2016, arXiv:1611.05013.
Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick, ``Improved variational autoencoders for text modeling using dilated convolutions,'' in International Conference on Machine Learning. PMLR, 2017, pp. 3881-3890.
J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick, ``Lagging inference networks and posterior collapse in variational autoencoders,'' 2019, arXiv:1901.05534.
Y. Kim, S. Wiseman, A. Miller, D. Sontag, and A. Rush, ``Semi-amortized variational autoencoders,'' in International Conference on Machine Learning. PMLR, 2018, pp. 2678-2687.
B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang, ``A surprisingly effective fix for deep latent variable modeling of text,'' 2019, arXiv:1909.00868.
D. Liu and G. Liu, ``A transformer-based variational autoencoder for sentence generation,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN). IEEE, 2019, pp. 1-7.
H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, ``Cyclical annealing schedule: A simple approach to mitigating KL vanishing,'' in Proc. Hum. Lang. Technol., Annu. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2019, pp. 240-250.
Y. Ji, F. Xu, Y. Yang, F. Shen, H. T. Shen, and W.-S. Zheng, ``A large-scale RGB-D database for arbitrary-view human action recognition,'' in Proc. ACM Int'l Conf. Multimedia, 2018, pp. 1510-1518.
S. Zou, X. Zuo, Y. Qian, S. Wang, C. Xu, M. Gong, and L. Cheng, ``3D human shape reconstruction from a polarization image,'' in Proc. Eur. Conf. Comput. Vis., 2020, pp. 351-368.
S. Zou, X. Zuo, Y. Qian, S. Wang, C. Guo, C. Xu, M. Gong, and L. Cheng, ``Polarization human shape and pose dataset,'' 2020, arXiv:2004.14899.
Y. Burda, R. Grosse, and R. Salakhutdinov, ``Importance weighted autoencoders,'' 2015, arXiv:1509.00519.
Hyunsung Kim received the B.S. and M.S. degrees in electronics engineering from Sogang University, Seoul, South Korea, in 2020 and 2022, respectively. He is currently an Associate Researcher with LG Electronics, Seoul, South Korea. His current research interests include image processing, computer vision, human pose estimation, and deep learning.
Kyeongbo Kong received the B.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2015, and the M.S. and Ph.D. degrees in electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2017 and 2020, respectively. From 2020 to 2021, he worked as a Postdoctoral Fellow with the Department of Electrical Engineering, POSTECH, Pohang, South Korea. From 2021 to 2023, he was an Assistant Professor in the Media School at Pukyong National University, Busan. He is currently an Assistant Professor of Electronics Engineering at Pusan National University. His current research interests include image processing, computer vision, machine learning, and deep learning.
Joseph Kihoon Kim received the B.S. degree from Handong Global University, South Korea, in 2021, and the M.S. degree from Sogang University, South Korea, in 2023. He is currently an Associate Researcher with Samsung Electronics, Suwon, South Korea. His current research interests include computer vision and generative models.
James Lee received the B.S. degree in electronics engineering from Kookmin University, Seoul, South Korea, in 2021, and the M.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2023. He is currently an Associate Researcher with Samsung Electronics, Hwaseong, South Korea. His current research interests include computer vision and machine learning.
Geonho Cha received the B.S. and Ph.D. degrees from the School of Electrical Engineering
and Computer Science, Seoul National University, Korea, in 2013 and 2019, respectively.
In 2019 and 2020, he was a Postdoctoral Researcher in the same school. Currently,
he is an AI Researcher at NAVER Cloud, Seongnam, Korea. His research interests include
neural 3D representations, deformable models, computer vision, deep learning, pattern
recognition, and their applications.
Ho-Deok Jang received a B.S. degree in Electrical Engineering from Hongik University in 2017 and an M.S. degree in Electrical Engineering (Division of Future Vehicle) from the Korea Advanced Institute of Science and Technology (KAIST) in 2019. He is currently a Research Engineer of CLOVA AI at NAVER Cloud Corp., Seongnam. His research interests include object recognition, image/video processing, and deep learning.
Dongyoon Wee received his Bachelor of Science in 2008 from the School of Electrical Engineering and Computer Science at Seoul National University in Seoul, South Korea, and completed his Master's degree at the same institution in 2011. From 2011 to 2017, he worked as a research engineer at LS Cable and System, LG CNS, and Buzzvil, successively. In 2017, he began his role as a research engineer at Naver, Seongnam, South Korea, where he now leads a research team focused on advancing the fields of video understanding and 3D computer vision.
Suk-Ju Kang (Member, IEEE) received a B.S. degree in electronic engineering from Sogang
University, South Korea, in 2006, and a Ph.D. degree in electrical and computer engineering
from the Pohang University of Science and Technology, in 2011. From 2011 to 2012,
he was a Senior Researcher with LG Display, where he was a project leader for resolution
enhancement and multi-view 3D system projects. From 2012 to 2015, he was an Assistant
Professor of Electrical Engineering at Dong-A University, Busan. He is currently a
Professor of Electronic Engineering at Sogang University. He was a recipient of the
IEIE/IEEE Joint Award for Young IT Engineer of the Year, in 2019. His current research
interests include image analysis and enhancement, video processing, multimedia signal
processing, circuit design for display systems, and deep learning systems.