  1. (Interdisciplinary Program on Creative Engineering, KOREATECH, Cheonan-si 31253, Korea)
  2. (Department of Electrical, Electronics and Communication Engineering, KOREATECH, Cheonan-si 31253, Korea)

Pose estimation, Filtering, Teleoperation

1. Introduction

Various kinds of mobile robots have been developed and used in different applications [1-5]. Although some research is devoted to the automatic control of mobile robots [1], most practical robots are still operated by human operators with the help of video images acquired by cameras mounted on the robots [2-5]. Humanoid robots are designed to have joint structures like those of humans, so they can mimic human body movements and perform various tasks that may be dangerous for humans. However, to control these robots successfully, accurate pose data for manipulating each joint must be generated and processed with a very short latency period.

Various methods have been proposed to control different kinds of humanoid robots by estimating the poses of human operators in remote sites [6-12]. Two main approaches are the use of dedicated positioning controllers and the estimation of the operator’s pose from videos. A master arm was used as a dedicated controller to manipulate a slave arm [6]. Both the arms were designed with the same specifications to move in the same way. A slave arm at a remote location was controlled using a signal generated from electromyograph (EMG) sensors attached to an operator’s arm [7]. The EMG sensors measure the electrical activity of the operator’s muscles and generate analog output signals that are fed to a micro-controller to control the robot. This approach has been proven to provide more accurate data than other methods, but it requires multiple special sensors to be attached to the operator’s body before the robot is manipulated.

Another constraint to this approach is anthropometrical differences: when the lengths of the operator’s arm are different from those of the slave arm, it becomes difficult to control the robot. If the dedicated controller provides only the positions of a hand-like end-effector, inverse kinematics (IK)-based solutions can yield a different robotic pose from the human pose. This pose inconsistency can result in the robot colliding with its surroundings and being damaged.

Recently, virtual reality (VR) has been used for teleoperation systems to provide operators with immersive visual feedback about a robot and its environment [11,12]. However, such systems suffer from poor maneuverability without any haptic feedback or physical constructions. Vision-based human pose estimation has been investigated extensively in the field of computer vision and image processing [8-10]. Important features are extracted from an image that can include people. The features are analyzed to generate the corresponding poses, which can be used in many applications, including robot manipulation [13,14] and the creation of animated movies and games [10].

Robot manipulation using human pose estimation algorithms does not require operators to attach sensors or other electronic devices to their bodies. The estimated joint locations are usually represented as 2-D positions within the image, but 3-D pose estimation has recently been researched [9, 13, 14]. In one method, 2-D joint positions were first obtained from images, and 3-D joint locations were then inferred from human anthropometric information [9]. Another method [13] measures a person’s pose using an RGB-D sensor. The data are analyzed with a depth map, and the operator’s pose is numerically calculated and used to control a humanoid robot. In a third method, 2-D joint positions were extracted from the RGB image using the OpenPose algorithm [10] and coupled with the corresponding depth information to estimate a 3-D pose [14].

However, the 3-D pose obtained using such methods cannot be applied directly to robot manipulation due to its lack of accuracy. Refining inaccurate poses requires a huge amount of computation and results in very slow operation (about 5 seconds) [13]. The accuracy of the robotic control is usually considered a top priority at the expense of manipulating speed, so the inaccurate poses are corrected by an anthropometric difference mapping model called HUMROB.

Both the pose inconsistency and the slow manipulating speed prevent humanoid teleoperation from being widely used. Therefore, improvements in both aspects are necessary. We propose a humanoid teleoperation system to manipulate both of a robot’s arms, as shown in Fig. 1. The 3-D pose data for actuating the joint motors of the robot are generated using a vision-based 3-D human pose estimation (HPE) algorithm without dedicated controllers.

For accuracy, two poses obtained from stereoscopic images are fused and refined using simple but effective multiple filtering. Some joints in a pose can be incorrectly located, but more appropriate positions for the joints can be estimated from another pose extracted from a slightly different point of view.

The rest of the paper is organized as follows. In the following section, related works are presented. In Section 3, the proposed teleoperation system is explained in detail. In Section 4, experimental results are presented. Finally, conclusions are presented in Section 5.

Fig. 1. Overview of the proposed humanoid teleoperation system. The humanoid robot is naturally manipulated by visually estimating the 3-D pose of the operator. An accurate pose can be determined by fusing two poses estimated from stereoscopic images.

2. Related Work

2.1 Teleoperation Systems using RGB-D Sensors

RGB-D sensors are widely used for the teleoperation of humanoid robots. The sensors are used to estimate the operator’s joint positions, which are then fitted to the robot’s joints. The teleoperation system for the CENTAURO robot uses RGB-D sensors to estimate a human pose [14]. The OpenPose algorithm [10] is used to obtain the 2-D joint positions of an operator from the 2-D image, and depth information is used to estimate each joint’s 3-D position. This method needs a huge amount of computation to correct erroneous or unavailable depth values and to handle the hand orientation. Therefore, the corrected data are delivered at only about 7 Hz, even though the RGB-D sensor provides data at 30 frames per second (fps).

Another teleoperation method using RGB-D sensors [13] uses the HUMROB model, which maps each operator’s anthropometric proportions onto the robot’s limbs. The anthropometric adaptation takes about three seconds, so the robot cannot be operated in real time. Initial poses can be obtained easily with RGB-D sensors in these systems, but the precision of the systems is limited by the low accuracy of RGB-D sensors, and the refinement process for accurate poses usually imposes a huge computational load, which prevents real-time operation.

2.2 Filtering-based Pose Estimation

A Kalman filter that estimates latent states from noisy data has also been used in human pose estimation [15,16]. A quaternion-based Kalman filter was presented to track the human pose in real time [15]. This method obtains a human limb pose from small inertial/magnetic sensor modules and uses preprocessing to produce a quaternion using the QUEST algorithm. A quaternion representing rotation is filtered through a quaternion-based Kalman filter to track a person's movements. This method can obtain accurate results, but many dedicated sensors must be attached to the body to obtain data.

The pose information of the arms and hands of a person was obtained through five Leap Motion sensors and a GEAK Watch. The information was then corrected through a Kalman filter followed by a particle filter [16]. In this method, the state of the center position of the palm is estimated by the Kalman filter. To estimate the orientation of the hand, the factored quaternion algorithm is used with the particle filter. This method is basically reliable because it uses a number of Leap Motion sensors and a GEAK Watch to reduce possible measurement error through a hybrid filter algorithm using a Kalman filter and particle filter.

In contrast, our system uses a single stereo camera to generate two 3-D candidate poses at a high rate. Two poses with some positional errors are fused by a simple hybrid filter to effectively determine an appropriate pose. Notably, only the positional information is filtered in the proposed system.

3. The Proposed Humanoid Teleoperation System

A control system estimates an operator’s pose and sends data to operate a humanoid robot at a remote site using the Robot Operating System (ROS) [17]. Fig. 2 illustrates the process of the proposed teleoperation system, which uses a conventional 3-D HPE method [18]. The proposed system consists of three threads: two for HPE on the left and right stereo images and one for pose refinement. Notably, the two HPE threads run asynchronously at around 15 Hz, while the pose refinement thread runs the hybrid filtering in regular cycles at 60 Hz. In this way, any new pose determined by HPE is used immediately in the pose refinement without delay. As a result, the effective rate of the pose update is slightly higher than the rate of either HPE thread.
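The handoff between the asynchronous HPE threads and the fixed-rate refinement thread can be sketched as a latest-value slot. This is only an illustration under assumptions: the class `LatestPose`, the function `refinement_tick`, and the dictionary pose layout are hypothetical names, not part of the described system.

```python
import threading


class LatestPose:
    """Thread-safe slot that keeps only the most recent pose from one HPE thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pose = None
        self._frame_index = -1

    def publish(self, frame_index, pose):
        # Called by an HPE thread whenever a new pose is ready (about 15 Hz).
        with self._lock:
            self._frame_index = frame_index
            self._pose = pose

    def read(self):
        # Called by the refinement thread on every cycle; it never waits for HPE.
        with self._lock:
            return self._frame_index, self._pose


# One slot per stereo view (hypothetical names).
left_slot, right_slot = LatestPose(), LatestPose()


def refinement_tick():
    """One refinement cycle: take whatever each HPE thread produced most recently."""
    i_left, pose_left = left_slot.read()
    i_right, pose_right = right_slot.read()
    return (i_left, pose_left), (i_right, pose_right)
```

Because the reader only ever takes the newest value, a slow HPE frame causes the same pose to be reused across several refinement cycles instead of stalling the 60 Hz loop.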

In the projective image formation, pose ambiguity can happen, which leads to inaccurate pose estimation. Therefore, two versions of the operator’s pose are obtained from a pair of stereo images in the proposed control system. Some joints in a pose can be incorrectly located, but more appropriate positions for the joints can be estimated from another pose extracted from a different point of view.

A cascade of simple yet effective multiple filters fuses the two poses and refines the pose reliably and stably in real time. In the teleoperation-monitoring system, the determined pose is converted to joint angle data to drive the motors of the robot. Finally, using ROS-Serial, the joint angle data are transformed to a ROS message that can be sent to the humanoid robot over a network.

Fig. 2. Flow chart of the proposed teleoperation system. The three threads indicated in dashed boxes run asynchronously.

3.1 Human Pose Refinement

Two initial 3-D operator poses denoted by $\mathbf{P}_{i}^{s}$, $s\in \{L,R\}$, are obtained using the HPE method [18] with input stereo images acquired at frame index $i$. $\mathbf{P}_{i}^{s}=\left\{\mathbf{p}_{i}^{s,j}\right\}$ consists of the 3-D positions of eight joints in the upper body, where $j\in \{neck,top,lwri,lelb,lsho,rwri,relb,rsho\}$ indicates the joint index annotated in Fig. 3(a). Fig. 3(b) shows the corresponding motor positions and the base position in the humanoid robot LIMS. For each arm of the robot, four motors $\phi _{i}$, $i\in \{1,\ldots ,4\}$, can be actuated. $\phi _{1},\phi _{2}$, and $\phi _{3}$ are located on the shoulder, and $\phi _{4}$ actuates the elbow.

Each joint position is fused effectively by selecting an appropriate one among two corresponding candidate positions in the initially obtained poses and the previous position as follows:

$ \tilde{\mathbf{p}}_{k}^{j}=\mathrm{MED}\left[\tilde{\mathbf{p}}_{k-1}^{j},\mathbf{p}_{i^{L}}^{L,j},\mathbf{p}_{i^{R}}^{R,j}\right], $

where $\mathrm{MED}[\cdot ]$ represents a median filter that chooses the median value for each component of the input positions. The frame indices $i^{L}\in \left\{i,i-1\right\}$ and $i^{R}\in \left\{i,i-1\right\}$ of the two poses can differ because the HPE threads run asynchronously, but at least one of $i^{L}$ and $i^{R}$ equals $i$. $\tilde{\mathbf{p}}_{k}^{j}$ is produced in regular cycles at 60 Hz. The learning-based HPE method sometimes produces irrelevant poses due to pose ambiguity. The median filter removes such outliers, so these noisy poses do not appear in the robotic movement.
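The fusion rule above can be sketched in a few lines. This is a minimal illustration, assuming each joint position is a plain `(x, y, z)` tuple; the function name `fuse_joint` and the sample coordinates are hypothetical.

```python
from statistics import median


def fuse_joint(prev, left, right):
    """Component-wise median of three 3-D positions given as (x, y, z) tuples."""
    return tuple(median(components) for components in zip(prev, left, right))


# Previous fused position and the two new candidates for one joint.
prev = (0.10, 0.50, 1.20)
left = (0.11, 0.52, 1.21)
right = (0.90, -0.30, 1.19)  # right-view estimate is an outlier in x and y

fused = fuse_joint(prev, left, right)
print(fused)  # (0.11, 0.5, 1.2)
```

Each coordinate is chosen independently, so the fused position may combine components from different inputs. In this example the outlying x and y values of the right-view candidate are discarded, while its z value still takes part in the median.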

A Kalman filter is then used for each joint to smooth the movement:

$ \hat{\mathbf{p}}_{k}^{j}=\text{Kalman}\left(\left\{\tilde{\mathbf{p}}_{\Theta }^{j}\right\}\right), \\ \Theta =\left\{k,k-1,\cdots ,k-K\right\}, $

where $K$ is set to 50 in our experiments. Even though the median filter corrects inaccurate poses, the median results may include repeated poses or rapid pose changes due to irregularities in HPE processing time. The Kalman filter alleviates these glitch effects by smoothing the movement.
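The text does not specify the state model or the noise parameters of the Kalman filter, so the following is only a sketch under assumptions: a scalar constant-position model applied to one coordinate of one joint, with hypothetical noise variances `q` and `r`. It is written in the usual recursive form, which summarizes past samples in its state rather than storing an explicit window of $K$ poses.

```python
class ScalarKalman:
    """Constant-position Kalman filter for one coordinate of one joint (assumed model)."""

    def __init__(self, q=1e-3, r=1e-2):
        self.q = q      # process noise variance (assumed value)
        self.r = r      # measurement noise variance (assumed value)
        self.x = None   # current state estimate
        self.p = 1.0    # variance of the estimate

    def update(self, z):
        if self.x is None:
            # Initialize the state with the first measurement.
            self.x = z
            return self.x
        self.p += self.q                      # predict: uncertainty grows
        gain = self.p / (self.p + self.r)     # Kalman gain
        self.x += gain * (z - self.x)         # correct toward the measurement
        self.p *= 1.0 - gain                  # uncertainty shrinks after correction
        return self.x


# A repeated value followed by an abrupt jump, as produced by irregular HPE timing.
fused_x = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
kf = ScalarKalman()
smoothed_x = [kf.update(z) for z in fused_x]
```

With these assumed variances the jump from 0 to 1 is spread over several cycles instead of appearing in a single one. A larger `q` relative to `r` would track the fused pose more tightly at the cost of less smoothing.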

Fig. 3. Correspondences between the skeletal posture and the humanoid LIMS (a) The pose configuration of eight joints, (b) The motors located in LIMS and the base position indicated as a green dot.

3.2 Motor Angle Generation

To actuate a motor $\phi _{i}$, the positional information for each joint is converted into a motor angle $q_{\phi _{i}}^{s}$ that will be sent to the robot. This motor angle is easily calculated as:

$ \mathbf{Q}_{k}=J2Q\left(\hat{\mathbf{P}}_{k}\right), \\ \mathbf{Q}_{k}=\left[q_{\phi _{1}}^{L},q_{\phi _{2}}^{L},q_{\phi _{3}}^{L},q_{\phi _{4}}^{L},q_{\phi _{1}}^{R},q_{\phi _{2}}^{R},q_{\phi _{3}}^{R},q_{\phi _{4}}^{R}\right], $

where $J2Q\left(\cdot \right)$ represents a humanoid-specific $Q$-calculation function [19].
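The actual $J2Q$ mapping for LIMS is not given here, so it cannot be reproduced. As a loosely related illustration of turning joint positions into an angle, the sketch below computes an elbow flexion angle from the shoulder, elbow, and wrist positions. The function name `elbow_flexion` and the convention that a straight arm gives 0 rad are assumptions, not the method of [19].

```python
import math


def elbow_flexion(sho, elb, wri):
    """Angle in radians between the upper arm and the forearm; 0 for a straight arm."""
    upper = [e - s for s, e in zip(sho, elb)]   # shoulder -> elbow
    fore = [w - e for e, w in zip(elb, wri)]    # elbow -> wrist
    dot = sum(u * f for u, f in zip(upper, fore))
    norm_u = math.sqrt(sum(u * u for u in upper))
    norm_f = math.sqrt(sum(f * f for f in fore))
    if norm_u == 0.0 or norm_f == 0.0:
        raise ValueError("degenerate limb segment")
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_f)))
    return math.acos(cosine)


# Forearm perpendicular to the upper arm (hypothetical coordinates in meters).
angle = elbow_flexion((0.0, 0.0, 0.0), (0.0, -0.3, 0.0), (0.3, -0.3, 0.0))
```

The three shoulder angles depend on the robot's kinematic structure and joint limits, which is why a humanoid-specific function is needed for the full vector $\mathbf{Q}_{k}$.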

3.3 Infinite Impulse Response Filter

Most glitch effects are removed by the pose refinement, but the motor angles $\mathbf{Q}_{k}$ computed from the output of the median and Kalman filters can still be regarded as a staircase-like signal considering the high frequency of the robot’s controller. To generate more continuous signals, various interpolation methods can be applied. In the proposed system, an infinite impulse response (IIR) filter is used to produce continuous stabilized signals with very low complexity:

$ \mathbf{Q}_{k}^{\mathrm{IIR}}=\alpha \mathbf{Q}_{k}+\left(1-\alpha \right)\mathbf{Q}_{k-1}^{\mathrm{IIR}}, $

where the constant damping coefficient $\alpha $ is set to 0.05 in our experiments.
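The first-order IIR update above can be sketched as follows. The function names `iir_step` and `iir_run` and the single-motor step input are hypothetical, and the filter state is initialized with the first sample, which is an assumption the text does not state.

```python
def iir_step(q_new, q_prev_iir, alpha=0.05):
    """One IIR update applied element-wise to a vector of motor angles."""
    return [alpha * q + (1.0 - alpha) * p for q, p in zip(q_new, q_prev_iir)]


def iir_run(q_sequence, alpha=0.05):
    """Filter a sequence of angle vectors, starting from the first vector."""
    state = list(q_sequence[0])
    out = [state]
    for q in q_sequence[1:]:
        state = iir_step(q, state, alpha)
        out.append(state)
    return out


# Step input on a single motor angle: 0 rad, then 1 rad held for 60 updates.
step = [[0.0]] + [[1.0]] * 60
trace = iir_run(step)
```

For a held target, the output after $n$ updates has covered $1-(1-\alpha )^{n}$ of the step, which is about 95% after 60 updates for $\alpha =0.05$. A small $\alpha$ therefore yields a smooth signal at the cost of added lag.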

4. Experimental Results

4.1 Experimental Environments

The teleoperation system was run on a PC with an Intel-i9, a TITAN V, and Windows 10. The stereo camera in the teleoperation system acquires images with a resolution of 1280 x 720 at 25 fps. The two initial poses are obtained asynchronously at about 15 Hz.

The humanoid robot was controlled by a PC with an Intel-i7, 32GB of RAM, a TITAN V, and Ubuntu 16.0. The stereo camera connected to the humanoid robot is identical to that in the teleoperation-monitoring system. The stereo images are encoded and streamed to the HMD for the operator through an HTTP server. The humanoid robot LIMS has a positioning repeatability of 0.153 mm, and the joint positions of both of its arms are updated every 12 ms [19].

4.2.1 System Operating Rate

We measured the runtime speed of our system and compared it to a system using the CENTAURO robot [14], as shown in Table 1. In that system, the RGB-D sensor outputs color images and point clouds at approximately 30 Hz. This drops to 8.07 Hz at the OpenPose node when tracking both the human body and hands, which matches the OpenPose benchmark. Finally, the pose information after refinement is delivered to the robot every 140 ms.

In the proposed method, the HPE module estimates two versions of the operator’s pose from stereo images at about 15 Hz. Notably, the two poses are determined in 3-D rather than the 2-D pose obtained by OpenPose [14]. The rate of the filter-based pose-refinement process in our system is 60 Hz, which is 8.5 times faster than the previous system [14] and improves stability by providing more detailed information about where the robot must reach.

Table 1. A summary of the frequency (Hz) at which the data messages are published from ROS.

4.2.2 Filtering-based Pose Refinement

Fig. 6 illustrates that the pose fusion module selects appropriate joint positions even in the presence of inaccurate estimates. Fig. 6(a) and (b) show three consecutive poses estimated using HPE with the left and right frames of the stereoscopic video. The poses in Fig. 6(b) show an abrupt change since the pose at $k_{0}+1$ was inaccurately estimated due to pose ambiguity. The median filter produced a pose at $k_{0}+1$, as shown in Fig. 6(c), where the poses indicated by red arrows are the input. The resultant pose was properly corrected by the median filter.

Fig. 7 illustrates the pose smoothing by the Kalman filter. The HPE, which runs in separate threads, occasionally takes longer than usual to estimate a new pose. In this case, the pose fusion module applied at 60 Hz can use identical poses as input several times and produce the same resultant poses as in Fig. 7(a), where an identical pose is repeated from $k_{0}+1$ to $k_{0}+3$. In addition, this situation generates a rapid pose change between $k_{0}$ and $k_{0}+1$. The Kalman filter alleviates the glitch effect by smoothing the movement.

Fig. 4. Pose fusion of the Y position of the right shoulder by the median filter. In the red box, the two poses obtained from stereo images indicate very different directions. In this case, the previous position can be maintained through the median filter.

Fig. 5. Example of pose refinement with respect to the right wrist’s X position. The Kalman filtering increases the stability of joint motion by smoothing out the rapid change of the fused pose in the green box.

Fig. 6. Pose correction in the pose fusion module. Each subfigure represents three consecutive poses estimated using HPE (a)-(b) Poses estimated using the left images and the right images of the stereoscopic video, respectively, (c) Poses obtained using the pose fusion module of Eq. (1). The inaccurate pose at $k_{0}+1$ in (b) was effectively corrected by the pose fusion.

Fig. 7. Pose smoothing by Kalman filtering (a) The resultant poses of the pose fusion, (b) The output of the Kalman filter.

5. Conclusion

We have proposed a robot teleoperation system in which the operator's pose candidates are initially obtained using stereo images, and an accurate pose can be estimated using a simple yet effective cascade filter. Unlike conventional systems that slowly control the robots with a focus on motion accuracy, the proposed system can accurately and remotely manipulate a robot in real time. The experiments confirmed that the proposed system teleoperated the robot in real time efficiently and generated accurate motion compared to the conventional systems.


References

[1] Lee S. U., Choi Y., Jeong K., Apr. 2019, Domestic recent works on robotic system for safety of nuclear power plants, Journal of the Korean Society for Precision Engineering, Vol. 36, No. 4, pp. 323-329.
[2] Noh J.-H., Shin S., Park J.-H., May 2013, A study on the robot for mining of underground resources, Journal of the Korean Society of Marine Engineering, Vol. 37, No. 4, pp. 399-403.
[3] Lee S.-Y., Kim J.-Y., Cho S.-H., Shin C., Mar. 2019, Educational indoor autonomous mobile robot system using a LiDAR and a RGB-D camera, Journal of Institute of Korean Electrical and Electronics Engineers, Vol. 23, No. 1, pp. 44-52.
[4] Jung M.-J., Kim D., 1998, Review of remote robot technology, The Magazine of the IEEK, Vol. 25, No. 2, pp. 182-190.
[5] Nourbakhsh I. R., Siegwart R., 2011, Introduction to Autonomous Mobile Robots, The MIT Press.
[6] Ji D. H., Jeon J. H., Kang H. S., Choi H. S., 2015, Design and control of the master arm for control of industrial robot arm, Journal of Korean Society for Precision Engineering, Vol. 32, No. 12, pp. 1055-1063.
[7] Lee J., Jung K. K., Lee H.-K., Eom K.-H., Mar. 2003, A virtual robot arm control by EMG pattern recognition of fuzzy-SOFM method, Institute of Korean Electrical and Electronics Engineers - Computer and Information, Vol. 40, No. 2, pp. 9-16.
[8] Andriluka M., Pishchulin L., Gehler P., Schiele B., 2014, 2D human pose estimation: New benchmark and state of the art analysis, IEEE Conference on Computer Vision and Pattern Recognition.
[9] Akhter I., Black M. J., 2015, Pose-conditioned joint angle limits for 3D human pose reconstruction, IEEE Conference on Computer Vision and Pattern Recognition.
[10] Cao Z., Simon T., Wei S.-E., Sheikh Y., 2017, Realtime multi-person 2D pose estimation using part affinity fields, IEEE Conference on Computer Vision and Pattern Recognition.
[11] Lipton J. I., Fay A. J., Rus D., 2018, Baxter’s homunculus: Virtual reality spaces for teleoperation in manufacturing, IEEE Robotics and Automation Letters, Vol. 3, No. 1, pp. 179-186.
[12] Whitney D., Rosen E., Phillips E., Konidaris G., Tellex S., 2020, Comparing Robot Grasping Teleoperation across Desktop and Virtual Reality with ROS Reality, In International Symposium on Robotics Research, pp. 335-350.
[13] Wang S., Zuo X., Wang R., Cheng F., Yang R., 2017, A generative human-robot motion retargeting approach using a single depth sensor, IEEE International Conference on Robotics and Automation.
[14] Rolley-Parnell E.-J., Kanoulas D., Laurenzi A., Delhaisse B., Rozo L., Caldwell D. G., Tsagarakis N. G., 2018, Bi-manual articulated robot teleoperation using an external RGB-D range sensor, International Conference on Control, Automation, Robotics and Vision.
[15] Yun X., Bachmann E. R., Dec. 2006, Design, implementation, and experimental results of a quaternion-based Kalman filter for human body motion tracking, IEEE Transactions on Robotics, Vol. 22, No. 6, pp. 1216-1227.
[16] Liang Y., Du G., Li F., Zhang P., Aug. 2019, Markerless human-manipulator interface with vibration feedback using multi-sensors, IEEE International Conference on Real-time Computing and Robotics.
[17] Quigley M., Conley K., Gerkey B., Faust J., Jan. 2009, ROS: an open-source robot operating system, IEEE International Conference on Robotics and Automation.
[18] Kanazawa A., Black M. J., Jacobs D. W., Malik J., Jun. 2018, End-to-end recovery of human shape and pose, IEEE Conference on Computer Vision and Pattern Recognition.
[19] Kim Y.-J., Dec. 2017, Anthropomorphic low-inertia high-stiffness manipulator for high-speed safe interaction, IEEE Transactions on Robotics, Vol. 33, pp. 1358-1374.


Jae-Min Sa

Jae-Min Sa received a B.S. degree in Electrical Engineering from KOREATECH, Korea. Currently, he is a graduate student in the Interdisciplinary Program in Creative Engineering at KOREATECH, Korea. His research interests include computer vision, pose estimation, robot vision, and machine learning.

Kang-Sun Choi

Kang-Sun Choi received a Ph.D. degree in nonlinear filter design in 2003, an M.S. in 1999, and a B.S. in 1997 in electronic engineering from Korea University. In 2011, he joined the School of Electrical, Electronics & Communication Engineering at Korea University of Technology and Education, where he is currently an assistant professor. From 2008 to 2010, he was a research professor in the Department of Electronic Engineering at Korea University. From 2005 to 2008, he worked at Samsung Electronics, Korea, as a Senior Software Engineer. From 2003 to 2005, he was a visiting scholar at the University of Southern California. His research interests are in the areas of multimedia compression, video processing, and computational photography. He is the recipient of an IEEE International Conference on Consumer Electronics Special Merit Award (2012).