Jae-Min Sa (Interdisciplinary Program on Creative Engineering, KOREATECH, Cheonan-si 31253, Korea, cia2002@koreatech.ac.kr)
Kang-Sun Choi (Department of Electrical, Electronics and Communication Engineering, KOREATECH, Cheonan-si 31253, Korea, ks.choi@koreatech.ac.kr)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Pose estimation, Filtering, Teleoperation
1. Introduction
Various kinds of mobile robots have been developed and used in different applications
[1-5]. Although some research is devoted to the automatic control of mobile robots [1], most practical robots are still operated by human operators with the help of video
images acquired by cameras mounted on the robots [2-5]. Humanoid robots are designed to have joint structures like those of humans, so they
can mimic human body movements and perform various tasks that may be dangerous for
humans. However, to control these robots successfully, accurate pose data for manipulating
each joint must be generated and processed with a very short latency period.
Various methods have been proposed to control different kinds of humanoid robots
by estimating the poses of human operators in remote sites [6-12]. Two main approaches are the use of dedicated positioning controllers and the estimation
of the operator’s pose from videos. A master arm was used as a dedicated controller
to manipulate a slave arm [6]. Both arms were designed to the same specifications so that they move identically.
A slave arm at a remote location was controlled using a signal generated from electromyograph
(EMG) sensors attached to an operator’s arm [7]. The EMG sensors measure the electrical activity of the operator’s muscles and generate
analog output signals that are fed to a micro-controller to control the robot. This
approach has been proven to provide more accurate data than other methods, but it
requires multiple special sensors to be attached to the operator’s body before the
robot is manipulated.
Another constraint of this approach is anthropometric difference: when the link lengths of the operator's arm differ from those of the slave arm, it becomes difficult to control the robot. If the dedicated controller provides only the positions
of a hand-like end-effector, inverse kinematics (IK)-based solutions can yield a different
robotic pose from the human pose. This pose inconsistency can result in the robot
colliding with its surroundings and being damaged.
Recently, virtual reality (VR) has been used for teleoperation systems to provide
operators with immersive visual feedback about a robot and its environment [11,12]. However, such systems suffer from poor maneuverability because they provide neither haptic feedback nor physical constraints. Vision-based human pose estimation has been investigated
extensively in the field of computer vision and image processing [8-10]. Important features are extracted from an image that can include people. The features
are analyzed to generate the corresponding poses, which can be used in many applications,
including robot manipulation [13,14] and the creation of animated movies and games [10].
Robot manipulation using human pose estimation algorithms does not require operators
to attach sensors or other electronic devices to their bodies. The estimated joint
locations are usually represented as 2-D positions within the image, but recently,
3-D pose estimation has also been researched [9,13,14]. In one method, 2-D joint positions are first obtained from images, and 3-D joint locations are then inferred from human anthropometric information [9]. Another method [13] measures a person's pose using an RGB-D sensor. The data are analyzed
with a depth map, and the operator’s pose is numerically calculated and used to control
a humanoid robot. 2-D joint positions were extracted from the RGB image using the
OpenPose algorithm [10] and coupled with corresponding depth information to estimate a 3-D pose [14].
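Depth-coupled 3-D lifting of this kind can be illustrated with a standard pinhole back-projection. The following sketch is not code from the cited system [14]; the intrinsics (fx, fy, cx, cy) and the example pixel are placeholder values:

```python
import numpy as np

def backproject_joint(u, v, depth_m, fx, fy, cx, cy):
    """Lift a 2-D joint (u, v) with its depth value (meters) into a 3-D
    camera-frame position using the pinhole camera model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: a wrist detected at pixel (400, 260), 1.5 m from the camera,
# with assumed intrinsics for a 1280x720 image.
p = backproject_joint(400, 260, 1.5, fx=600.0, fy=600.0, cx=640.0, cy=360.0)
```

In practice, erroneous or missing depth values at the detected pixel must be corrected first, which is the costly step reported in [14].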
However, the 3-D pose obtained using such methods cannot be applied directly to
robot manipulation due to its lack of accuracy. Refining inaccurate poses requires
a huge amount of computation and results in very slow operation (about 5 seconds)
[13]. The accuracy of the robotic control is usually given top priority at the expense of manipulation speed, so the inaccurate poses are corrected by an anthropometric difference mapping model called HUMROB.
Both the pose inconsistency and the slow manipulating speed prevent humanoid teleoperation
from being widely used. Therefore, improvements in both aspects are necessary. We
propose a humanoid teleoperation system to manipulate both of a robot’s arms, as shown
in Fig. 1. The 3-D pose data for actuating the joint motors of the robot are generated using
a vision-based 3-D human pose estimation (HPE) algorithm without dedicated controllers.
For accuracy, two poses obtained from stereoscopic images are fused and refined
using simple but effective multiple filtering. Some joints in a pose can be incorrectly
located, but more appropriate positions for the joints can be estimated from another
pose extracted from a slightly different point of view.
The rest of the paper is organized as follows. In the following section, related
works are presented. In Section 3, the proposed teleoperation system is explained
in detail. In Section 4, experimental results are presented. Finally, conclusions
are presented in Section 5.
Fig. 1. Overview of the proposed humanoid teleoperation system. The humanoid robot is naturally manipulated by visually estimating the 3-D pose of the operator. An accurate pose can be determined by fusing two poses estimated from stereoscopic images.
2. Related Work
2.1 Teleoperation Systems using RGB-D Sensors
Methods using RGB-D sensors are mainly used for the teleoperation of humanoid
robots. The sensors are used to estimate the operator's joint positions and to fit them to the robot's joints. The teleoperation system for the CENTAURO
robot uses RGB-D sensors to estimate a human pose [14]. The OpenPose algorithm [10] is used to obtain 2D joint positions of an operator from the 2D image, and depth
information is used to estimate the joint’s three-dimensional position. This method
needs a huge amount of computation to correct erroneous or unavailable depth values
and to handle the hand orientation. Therefore, the corrected data are delivered at a much lower rate of about 7 Hz, even though the RGB-D sensor provides data at 30 frames per second (fps).
Another teleoperation method using RGB-D sensors [13] uses the HUMROB model, which adapts each operator's anthropometric differences to the robot's limbs. The anthropometric adaptation takes about three seconds,
so it is impossible to operate the robot in real time. Initial poses can be obtained
easily with RGB-D sensors in these systems, but the precision of the systems is subject
to the low accuracy of RGB-D sensors, and the refinement process for accurate poses
usually requires a huge computational load, which prevents real-time operation.
2.2 Filtering-based Pose Estimation
A Kalman filter that estimates latent states from noisy data has also been used
in human pose estimation [15,16]. A quaternion-based Kalman filter was presented to track the human pose in real time
[15]. This method obtains a human limb pose from small inertial/magnetic sensor modules
and uses preprocessing to produce a quaternion using the QUEST algorithm. A quaternion
representing rotation is filtered through a quaternion-based Kalman filter to track
a person's movements. This method can obtain accurate results, but many dedicated
sensors must be attached to the body to obtain data.
The pose information of the arms and hands of a person was obtained through five
Leap Motion sensors and a GEAK Watch. The information was then corrected through a
Kalman filter followed by a particle filter [16]. In this method, the state of the center position of the palm is estimated by the
Kalman filter. To estimate the orientation of the hand, the factored quaternion algorithm
is used with the particle filter. This method is basically reliable because it uses
a number of Leap Motion sensors and a GEAK Watch to reduce possible measurement error
through a hybrid filter algorithm using a Kalman filter and particle filter.
In contrast, our system uses a single stereo camera to generate two 3-D candidate
poses at a high rate. Two poses with some positional errors are fused by a simple
hybrid filter to effectively determine an appropriate pose. Notably, only the positional
information is filtered in the proposed system.
3. The Proposed Humanoid Teleoperation System
A control system estimates an operator’s pose and sends data to operate a humanoid
robot at a remote site using the Robot Operating System (ROS) [17]. Fig. 2 illustrates the process of the proposed teleoperation system, which uses a conventional
3-D HPE method [18]. The proposed system consists of three threads: two for HPE on the stereo images and one for pose refinement. Notably, the two HPE threads run asynchronously at around 15 Hz, while the pose-refinement thread runs the hybrid filtering in regular cycles at 60 Hz. By doing this, any new pose determined from HPE is used immediately in the
pose refinement without a delay. As a result, the effective rate of the pose update
is slightly higher than each HPE rate.
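The thread layout described above can be sketched as follows. This is an illustrative skeleton under assumed names (hpe_worker, refine_loop), with dummy poses standing in for the HPE output; the refinement loop always reads the newest pose from each side:

```python
import threading, time

latest = {"L": None, "R": None}          # newest pose from each HPE thread
lock = threading.Lock()
stop = threading.Event()

def hpe_worker(side, period_s):
    """Stand-in for an HPE thread: publishes a new (dummy) pose each cycle."""
    i = 0
    while not stop.is_set():
        with lock:
            latest[side] = (side, i)      # placeholder for an 8-joint pose
        i += 1
        time.sleep(period_s)

def refine_loop(ticks, period_s):
    """Refinement loop ticking at a fixed rate (60 Hz in the paper),
    consuming whatever poses are currently available."""
    refined = []
    for _ in range(ticks):
        with lock:
            pair = (latest["L"], latest["R"])
        if None not in pair:
            refined.append(pair)          # filtering would happen here
        time.sleep(period_s)
    return refined

for s in ("L", "R"):
    threading.Thread(target=hpe_worker, args=(s, 1 / 15), daemon=True).start()
out = refine_loop(ticks=30, period_s=1 / 60)  # ~0.5 s of 60 Hz refinement
stop.set()
```

Because the refinement ticks faster than either HPE thread, the same pose pair may be consumed several times; the filtering described in Section 3.1 handles these repetitions.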
In the projective image formation, pose ambiguity can happen, which leads to inaccurate
pose estimation. Therefore, two versions of the operator’s pose are obtained from
a pair of stereo images in the proposed control system. Some joints in a pose can
be incorrectly located, but more appropriate positions for the joints can be estimated
from another pose extracted from a different point of view.
A cascade of simple yet effective multiple filters fuses the two poses and refines
the pose reliably and stably in real time. In the teleoperation-monitoring system,
the determined pose is converted to joint angle data to drive the motors of the robot.
Finally, using ROS-Serial, the joint angle data are transformed to a ROS message that
can be sent to the humanoid robot over a network.
Fig. 2. Flow chart of the proposed teleoperation system. The three threads indicated in dashed boxes run asynchronously.
3.1 Human Pose Refinement
Two initial 3-D operator poses denoted by $\mathbf{P}_{i}^{s}$, $s\in \{L,R\}$, are obtained using the HPE method [18] with input stereo images acquired at frame index $i$. $\mathbf{P}_{i}^{s}=\left\{\mathbf{p}_{i}^{s,j}\right\}$ consists of the 3-D positions of eight joints in the upper body, where $j\in \{neck,top,lwri,lelb,lsho,rwri,relb,rsho\}$ indicates the joint index annotated in Fig. 3(a). Fig. 3(b) shows the corresponding motor positions and the base position in the humanoid robot
LIMS. For the arm of the robot, four motors $\phi _{i},i\in \left[1\cdots 4\right]$,
can be actuated. $\phi _{1},\phi _{2}$, and $\phi _{3}$ are located on the shoulder,
and $\phi _{4}$ actuates the elbow.
Each joint position is fused effectively by selecting an appropriate one among the two corresponding candidate positions in the initially obtained poses and the previous position as follows:

$$\tilde{\mathbf{p}}_{k}^{j}=\mathrm{MED}\left[\mathbf{p}_{i^{L}}^{L,j},\,\mathbf{p}_{i^{R}}^{R,j},\,\tilde{\mathbf{p}}_{k-1}^{j}\right],\tag{1}$$

where $\mathrm{MED}[\cdot]$ represents a median filter that chooses a median value for each component of the input positions. The frame indices $i^{L}\in\{i,i-1\}$ and $i^{R}\in\{i,i-1\}$ for the two poses can differ because of the asynchronous execution of the HPE; at least one of $i^{L}$ and $i^{R}$ equals $i$. $\tilde{\mathbf{p}}_{k}^{j}$ is produced in regular cycles at 60 Hz. The learning-based HPE method sometimes produces irrelevant poses due to pose ambiguity. The median filtering removes such outliers, thereby preventing noisy poses from reaching the robotic movement.
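The per-component median fusion can be sketched as follows; the joint positions are illustrative values, not measurements from the system:

```python
import numpy as np

def fuse_joint(p_left, p_right, p_prev):
    """Component-wise median of the two candidate positions and the
    previously fused position, as in Eq. (1)."""
    return np.median(np.stack([p_left, p_right, p_prev]), axis=0)

# An outlier in one view is rejected because the median falls back on the
# other view and the previous frame.
left  = np.array([0.30, 1.10, 0.52])
right = np.array([0.31, 2.90, 0.50])   # y grossly wrong due to pose ambiguity
prev  = np.array([0.29, 1.08, 0.51])
fused = fuse_joint(left, right, prev)  # y stays near 1.10, not 2.90
```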
A Kalman filter is then used for each joint to smooth the movement, where the parameter $K$ is set to 50 in our experiments. Even though the median filter corrects inaccurate poses, the median results may include repeated poses or rapid pose changes due to irregularities in the HPE processing time. The Kalman filter alleviates these glitch effects by smoothing the movement.
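A minimal per-coordinate Kalman filter of the kind described here can be sketched with a constant-velocity model; the noise settings q and r below are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

class JointKalman1D:
    """Constant-velocity Kalman filter for one coordinate of one joint.
    The process/measurement noise values are illustrative only."""
    def __init__(self, dt=1 / 60, q=1e-3, r=1e-2):
        self.x = np.zeros(2)                        # [position, velocity]
        self.P = np.eye(2)                          # state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
        self.H = np.array([[1.0, 0.0]])             # we observe position only
        self.Q = q * np.eye(2)
        self.R = np.array([[r]])

    def step(self, z):
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the fused position measurement z
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + (K @ (np.array([z]) - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]

kf = JointKalman1D()
smoothed = [kf.step(z) for z in [0.0, 0.0, 0.5, 0.5, 0.5]]  # a sudden jump
```

The filter lags the sudden jump instead of reproducing it, which is exactly the smoothing effect wanted for repeated or abruptly changing median outputs.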
Fig. 3. Correspondences between the skeletal posture and the humanoid LIMS (a) The pose configuration of eight joints, (b) The motors located in LIMS and the base position indicated as a green dot.
3.2 Motor Angle Generation
To actuate a motor $\phi _{i}$, the positional information for each joint is
converted into a motor angle $q_{\phi _{i}}^{s}$ that will be sent to the robot. This
motor angle is calculated with a humanoid-specific $Q$-calculation function $J2Q\left(\cdot \right)$ [19], which maps the refined joint positions to the motor angles of LIMS.
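The mapping $J2Q(\cdot)$ is defined for LIMS in [19], but one of its outputs can be illustrated generically: the elbow angle $\phi_{4}$ follows from the shoulder, elbow, and wrist positions. The function below is a hypothetical stand-in for that single output, not the LIMS-specific mapping:

```python
import numpy as np

def elbow_angle(p_sho, p_elb, p_wri):
    """Elbow flexion angle (radians) from three 3-D joint positions,
    as the angle between the upper-arm and forearm vectors."""
    u = p_sho - p_elb
    v = p_wri - p_elb
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_t, -1.0, 1.0))

q4 = elbow_angle(np.array([0.0, 0.0, 0.0]),
                 np.array([0.0, -0.3, 0.0]),
                 np.array([0.25, -0.3, 0.0]))   # a right angle at the elbow
```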
3.3 Infinite Impulse Response Filter
Most glitch effects are removed by the pose refinement, but the result of the median and Kalman filters, $\mathbf{Q}_{k}$, can still be regarded as a staircase-like signal considering the high frequency of the robot's controller. To generate more continuous signals, various interpolation methods can be applied. In the proposed system, an infinite impulse response (IIR) filter is used to produce continuous, stabilized signals with very low complexity:

$$\hat{\mathbf{Q}}_{k}=\alpha\,\mathbf{Q}_{k}+\left(1-\alpha\right)\hat{\mathbf{Q}}_{k-1},$$

where the constant damping coefficient $\alpha$ is set to 0.05 in our experiments.
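A first-order IIR smoother of this form can be sketched as follows (a sketch assuming exponential smoothing with the stated $\alpha$, applied here to a scalar signal):

```python
def iir_smooth(samples, alpha=0.05, init=0.0):
    """First-order IIR smoothing: y_k = alpha*x_k + (1 - alpha)*y_{k-1}."""
    out, y = [], init
    for x in samples:
        y = alpha * x + (1.0 - alpha) * y
        out.append(y)
    return out

# A staircase-like motor-angle signal becomes a gradual, continuous ramp.
staircase = [0.0] * 5 + [1.0] * 20
ramp = iir_smooth(staircase)
```

With $\alpha = 0.05$ the output moves only 5% of the remaining gap per cycle, trading a little responsiveness for a smooth, continuous command signal.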
4. Experimental Results
4.1 Experimental Environments
The teleoperation system was run on a PC with an Intel-i9, a TITAN V, and Windows
10. The stereo camera in the teleoperation system acquires images with a resolution
of 1280 x 720 at 25 fps. The two initial poses are obtained asynchronously at about
15 Hz.
The humanoid robot was controlled by a PC with an Intel-i7, 32GB of RAM, a TITAN
V, and Ubuntu 16.04. The stereo camera connected to the humanoid robot is identical
to that in the teleoperation-monitoring system. The stereo images are encoded and
streamed to the HMD for the operator through an HTTP server. The humanoid robot LIMS
has a positioning repeatability of 0.153 mm, and the joint positions of both of its
arms are updated every 12 ms [19].
4.2.1 System Operating Rate
We measured the runtime speed of our system and compared it to a system using
the CENTAURO robot [14], as shown in Table 1. The RGB-D sensor publishes its color image and point cloud at approximately 30 Hz. This drops to 8.07 Hz at the OpenPose node when tracking both the human body
and hands, which matches the OpenPose benchmark. Finally, the pose information after
refinement is delivered to the robot every 140 ms.
In the proposed method, the HPE module estimates two versions of the operator’s
pose from stereo images at about 15 Hz. Notably, the two poses are determined in 3-D
rather than the 2-D pose obtained by OpenPose [14]. The rate of the filter-based pose-refinement process in our system is 60 Hz, which
is 8.5 times faster than the previous system [14] and improves stability by providing more detailed information about where the robot must reach.
Table 1. A summary of the frequency at which the data messages are published from ROS.

Method   Stage       Frequency (Hz)
[14]     OpenNI2     29.82
[14]     OpenPose    8.07
[14]     Pt2Xbot     7.13
Ours     HPE         14.97
Ours     Filtering   60.00
4.2.2 Filtering-based Pose Refinement
Fig. 6 illustrates that the pose fusion module selectively collects appropriate joint positions
even in the presence of inaccurate estimates. Fig. 6(a) and (b) show three consecutive poses estimated using HPE with the left and right frames of
the stereoscopic video. The poses in Fig. 6(b) show an abrupt change since the pose at $k_{0}+1$ was inaccurately estimated due
to pose ambiguity. The median filter produced a pose at $k_{0}+1$, as shown in Fig. 6(c), where the poses indicated by red arrows are the input. The resultant pose was properly
corrected through the median filter.
Fig. 7 illustrates the pose smoothing by the Kalman filter. The HPE, performed in a separate thread, irregularly takes longer to estimate a new pose. In this case, the pose
fusion module applied at 60 Hz can use identical poses as input several times and
produce the same resultant poses as in Fig. 7(a), where an identical pose is repeated from $k_{0}+1$ to $k_{0}+3$. In addition, this
situation generates a rapid pose change between $k_{0}$ and $k_{0}+1$. The Kalman
filter alleviates the glitch effect by smoothing the movement.
Fig. 4. Pose fusion of Y position of the right shoulder by median filter. In the red box, the two poses obtained from stereo images indicate very different directions. In this case, the previous position can be maintained through the median filter.
Fig. 5. Example of pose refinement with respect to the right wrist’s X position. The Kalman filtering increases the stability of joint motion by smoothing out the rapid change of the fused pose in the green box.
Fig. 6. Pose correction in the pose fusion module. Each subfigure represents three consecutive poses estimated using HPE (a)-(b) Poses estimated using the left images and the right images of the stereoscopic video, respectively, (c) Poses obtained using the pose fusion module of Eq. (1). The inaccurate pose at $k_{0}+1$ in (b) was effectively corrected by the pose fusion.
Fig. 7. Pose smoothing by Kalman filtering (a) The resultant poses of the pose fusion, (b) The output of the Kalman filter.
5. Conclusion
We have proposed a robot teleoperation system in which the operator's pose candidates
are initially obtained using stereo images, and an accurate pose can be estimated
using a simple yet effective cascade filter. Unlike conventional systems that slowly
control the robots with a focus on motion accuracy, the proposed system can accurately
and remotely manipulate a robot in real time. The experiments confirmed that the proposed
system teleoperated the robot in real time efficiently and generated accurate motion
compared to the conventional systems.
REFERENCES
Lee S. U., Choi Y., Jeong K., Apr. 2019, Domestic recent works on robotic system for
safety of nuclear power plants, Journal of the Korean Society for Precision Engineering,
Vol. 36, No. 4, pp. 323-329
Noh J.-H., Shin S., Park J.-H., May. 2013, A study on the robot for mining of underground
resources, Journal of the Korean Society of Marine Engineering, Vol. 37, No. 4, pp.
399-403
Lee S.-Y., Kim J.-Y., Cho S.-H., Shin C., Mar. 2019, Educational indoor autonomous
mobile robot system using a LiDAR and a RGB-D camera, Journal of Institute of Korean
Electrical and Electronics Engineers, Vol. 23, No. 1, pp. 44-52
Jung M.-J., Kim D., 1998, Review of remote robot technology, The Magazine of the IEEK,
Vol. 25, No. 2, pp. 182-190
Nourbakhsh I. R., Siegwart R., 2011, Introduction to Autonomous Mobile Robots, The
MIT Press
Ji D. H., Jeon J. H., Kang H. S., Choi H. S., 2015, Design and control of the master
arm for control of industrial robot arm, Journal of Korean Society for Precision Engineering,
Vol. 32, No. 12, pp. 1055-1063
Lee J., Jung K. K., Lee H.-K., Eom K.-H., Mar. 2003, A virtual robot arm control by
EMG pattern recognition of fuzzy-SOFM method, Institute of Korean Electrical and Electronics
Engineers - Computer and Information, Vol. 40, No. 2, pp. 9-16
Andriluka M., Pishchulin L., Gehler P., Schiele B., 2014, 2D human pose estimation:
New benchmark and state of the art analysis, IEEE Conference on Computer Vision and
Pattern Recognition
Akhter I., Black M. J., 2015, Pose-conditioned joint angle limits for 3D human pose
reconstruction, IEEE Conference on Computer Vision and Pattern Recognition
Cao Z., Simon T., Wei S.-E., Sheikh Y., 2017, Realtime multi-person 2D pose estimation
using part affinity fields, IEEE Conference on Computer Vision and Pattern Recognition
Lipton J. I., Fay A. J., Rus D., 2018, Baxter’s homunculus: Virtual reality spaces
for teleoperation in manufacturing, IEEE Robotics and Automation Letters, Vol. 3,
No. 1, pp. 179-186
Whitney D., Rosen E., Phillips E., Konidaris G., Tellex S., 2020, Comparing Robot
Grasping Teleoperation across Desktop and Virtual Reality with ROS Reality, In International
Symposium on Robotics Research, pp. 335-350
Wang S., Zuo X., Wang R., Cheng F., Yang R., 2017, A generative human-robot motion
retargeting approach using a single depth sensor, IEEE International Conference on
Robotics and Automation
Rolley-Parnell E.-J., Kanoulas D., Laurenzi A., Delhaisse B., Rozo L., Caldwell D.
G., Tsagarakis N. G., 2018, Bi-manual articulated robot teleoperation using an external
RGB-D range sensor, International Conference on Control, Automation, Robotics and
Vision
Yun X., Bachmann E. R., Dec. 2006, Design, implementation, and experimental results
of a quaternion-based Kalman filter for human body motion tracking, IEEE Transactions
on Robotics, Vol. 22, No. 6, pp. 1216-1227
Liang Y., Du G., Li F., Zhang P., Aug 2019, Markerless human-manipulator interface
with vibration feedback using multi-sensors, IEEE International Conference on Real-time
Computing and Robotics
Quigley M., Conley K., Gerkey B., Faust J., Jan 2009, ROS: an open-source robot operating
system, IEEE International Conference on Robotics and Automation
Kanazawa A., Black M. J., Jacobs D. W., Malik J., Jun. 2018, End-to-end recovery of
human shape and pose, IEEE Conference on Computer Vision and Pattern Recognition
Kim Y.-J., Dec. 2017, Anthropomorphic low-inertia high-stiffness manipulator for high-speed
safe interaction, IEEE Transactions on Robotics, Vol. 33, pp. 1358-1374
Author
Jae-Min Sa received a B.S. degree in Electrical Engineering from KOREATECH, Korea. Currently, he is a graduate student in the Interdisciplinary Program in Creative Engineering at KOREATECH, Korea. His research interests include computer vision, pose estimation, robot vision, and machine learning.
Kang-Sun Choi received a Ph.D. degree in nonlinear filter design in 2003, an M.S.
in 1999, and a B.S. in 1997 in electronic engineering from Korea University. In 2011,
he joined the School of Electrical, Electronics & Communication Engineering at Korea
University of Technology and Education, where he is currently an assistant professor.
From 2008 to 2010, he was a research professor in the Department of Electronic Engineering
at Korea University. From 2005 to 2008, he worked at Samsung Electronics, Korea, as
a Senior Software Engineer. From 2003 to 2005, he was a visiting scholar at the University
of Southern California. His research interests are in the areas of multimedia compression,
video processing, and computational photography. He is the recipient of an IEEE International
Conference on Consumer Electronics Special Merit Award (2012).