This section addresses the shortcomings of the SLAM algorithm in low-texture scenes by designing a corresponding solution from the perspective of point-line-surface (PLS) multi-feature fusion (MFF). To address the tracking loss that the MFF algorithm still suffers in complex scenes such as robot steering, the study further designs a corresponding solution from the perspective of multi-sensor fusion.
3.1. Improved Design of ORB-SLAM2 Algorithm Based on Fusion of PLS Features
When improving WR autonomous mobility, traditional SLAM systems suffer from accuracy degradation and tracking loss in low-texture scenes and from tracking failure during robot steering. To solve the accuracy degradation in low-texture scenarios, the study starts from the RGB-D camera and selects ORB-SLAM2 for improvement [15]; the specific improvement measure is to perform PLS MFF. To address robot tracking loss in steering scenarios, the research develops an enhanced algorithm that utilizes multi-sensor fusion. Fig. 1 depicts the main flow of the improved ORB-SLAM2 algorithm.
Fig. 1. The main process of improving the ORB-SLAM2 algorithm.
From Fig. 1, it can be seen that the improved ORB-SLAM2 algorithm mainly consists of three parts,
namely image input, visual odometry, and backend optimization. The image input includes
color and depth maps, while visual odometry covers the extraction and processing of
point, line, and surface features, pose estimation, and keyframe selection. Meanwhile,
backend optimization includes Manhattan Frame (MF) extraction and processing and the
filtering of keyframes. In improving the ORB-SLAM2 algorithm,
the role of the visual odometry module is to determine whether to insert the current
frame as a keyframe into the backend. In this process, uninitialized features need
to first construct an initial map, while initialized features need to estimate pose
through nonlinear optimization. Subsequently, whether to insert a keyframe is determined
based on tracking quality and backend optimization. Backend optimization is responsible
for processing keyframes, reconstructing new map points and lines through triangulation,
and estimating state variables such as keyframe pose, point, line, and face positions
in local maps using nonlinear optimization methods. At the same time, the study uses
the alignment characteristics of the MF to correct potential errors in triangulation
reconstruction and thereby improve its accuracy. By improving the performance of the
ORB-SLAM2 algorithm on RGB-D cameras, the tracking accuracy of the robot can be enhanced.
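The keyframe decision based on tracking quality mentioned above can be illustrated with a minimal check; the thresholds and the overlap criterion below are illustrative assumptions for the sketch, not the paper's actual rule.

```python
def should_insert_keyframe(num_tracked, num_in_last_keyframe,
                           min_tracked=50, max_overlap=0.9):
    """Keyframe decision based on tracking quality (illustrative sketch).

    Insert a keyframe when too few features are tracked in the current frame,
    or when the current frame shares too small a fraction of observations
    with the last keyframe. Threshold values are assumptions, not taken
    from the paper.
    """
    if num_tracked < min_tracked:
        return True
    overlap = num_tracked / max(num_in_last_keyframe, 1)
    return overlap < max_overlap
```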
Image features generally contain two parts: key regions and descriptors, where the key
regions involve key points, lines, and surfaces [16]. The Agglomerative Hierarchical Clustering (AHC) surface features are improved by
adjusting the mean filter window size according to the depth value and performing adaptive
mean filtering on the depth map. In visual odometry, point feature extraction treats
the pixel coordinates (PC) as the observation, as shown in Eq. (1).
In Eq. (1), $u$ and $v$ represent the horizontal and vertical coordinates of the pixel, respectively.
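Under the usual convention, such a point observation is simply the pixel vector $p = [u, v]^{\mathrm{T}}$.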
Eq. (2) displays the observed values of the line features.
In Eq. (2), $\Pi(\cdot)$ denotes taking the homogeneous coordinates, $\|\cdot\|$ is the
modulus (length) of the direction vector, and $p_s$ and $p_e$ are the PC of the start
and end points of the line segment. The filtered depth map is shown in Eq. (3).
In Eq. (3), $I(x, y)$ and $S(x, y)$ are the pixel values at coordinate $(x, y)$ in the filtered
depth map and in the integral image, respectively, $r$ is the product of the pixel value
at that coordinate in the depth map and the camera constant, and $S$ is the integral image.
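As a concrete illustration of the adaptive mean filtering described above, the sketch below builds an integral image and averages each pixel over a window whose radius grows with the depth value; the camera-constant parameter `c` and the radius cap are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adaptive_mean_filter(depth, c=0.002, r_max=15):
    """Adaptive mean filtering of a depth map using an integral image.

    The half-window size r is proportional to the depth at each pixel
    (r ~ depth * camera constant), so noisier far-range measurements are
    smoothed more strongly. Parameter values are illustrative only.
    """
    h, w = depth.shape
    # Integral image S with a zero row/column padded on the top/left.
    S = np.zeros((h + 1, w + 1), dtype=np.float64)
    S[1:, 1:] = np.cumsum(np.cumsum(depth, axis=0), axis=1)

    out = np.empty_like(depth, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            r = int(np.clip(depth[y, x] * c, 1, r_max))  # depth-dependent radius
            y0, y1 = max(y - r, 0), min(y + r, h - 1)
            x0, x1 = max(x - r, 0), min(x + r, w - 1)
            # Sum over the window from the integral image, then average.
            box = (S[y1 + 1, x1 + 1] - S[y0, x1 + 1]
                   - S[y1 + 1, x0] + S[y0, x0])
            out[y, x] = box / ((y1 - y0 + 1) * (x1 - x0 + 1))
    return out
```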
The improved AHC surface features are shown in Eq. (4). In Eq. (4), $d_\Upsilon$ represents
the distance between the surface and the origin of the camera coordinate system of the
current frame, and $N$ is the normal vector, which is expressed
as $N = [N_x, N_y, N_z]^T$. Eq. (5) illustrates how the location of the $h$th map point in the world coordinate system
(WCS) is determined during initialization.
In Eq. (5), $d_h$ and $p_h$ represent the depth and the PC of the $h$th feature point, respectively,
and $\Pi^{-1}$ is the inverse projection function.
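For reference, a minimal back-projection consistent with the roles of $\Pi^{-1}$ and $d_h$ described above can be sketched as follows; the pinhole model and the intrinsic-matrix layout are assumptions, since the calibration model is not spelled out here.

```python
import numpy as np

def back_project(p, d, K):
    """Back-project pixel p = (u, v) with depth d through intrinsics K.

    Computes d * inverse_projection(p) for a standard pinhole model with
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]; a sketch of the operation
    behind Eq. (5), under the stated pinhole assumption.
    """
    K = np.asarray(K, dtype=np.float64)
    u, v = p
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])  # normalized camera ray
    return d * ray                                       # 3D point in the camera frame
```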
For pose estimation, the study uses nonlinear optimization methods to estimate the current
frame pose. The commonly used nonlinear optimization methods include the Levenberg-Marquardt
(LM) method and the Gauss-Newton method [17]. Fig. 2 illustrates the process of applying LM in SLAM to solve the least squares problem.
Fig. 2. The process of using LM to solve the least squares problem in SLAM.
Fig. 3. Points and lines triangulation.
From Fig. 2, it can be seen that the key to using LM to solve the least squares problem is to compute
the Jacobian matrix and the increment, and to update the multidimensional state variables.
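A minimal damped Gauss-Newton (LM) loop of the kind summarised in Fig. 2 is sketched below; the residual and Jacobian callables stand in for the reprojection terms of points, lines, and surfaces, and the damping schedule is a common textbook choice rather than the paper's.

```python
import numpy as np

def levenberg_marquardt(residual_fn, jacobian_fn, x0, iters=20, lam=1e-3):
    """Minimal Levenberg-Marquardt loop for a least-squares problem.

    residual_fn(x) returns the stacked residual vector, jacobian_fn(x) its
    Jacobian; x is the multidimensional state (e.g., a parameterised pose).
    An illustrative sketch, not the solver used in the paper.
    """
    x = np.asarray(x0, dtype=np.float64)
    cost = np.sum(residual_fn(x) ** 2)
    for _ in range(iters):
        r = residual_fn(x)
        J = jacobian_fn(x)
        H = J.T @ J                       # Gauss-Newton approximation of the Hessian
        g = J.T @ r
        # Damped normal equations: (H + lam * I) dx = -g
        dx = np.linalg.solve(H + lam * np.eye(H.shape[0]), -g)
        new_cost = np.sum(residual_fn(x + dx) ** 2)
        if new_cost < cost:               # accept the step, relax damping
            x, cost, lam = x + dx, new_cost, lam * 0.5
        else:                             # reject the step, increase damping
            lam *= 10.0
        if np.linalg.norm(dx) < 1e-8:
            break
    return x
```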
In the backend optimization, MFs are extracted mainly from the 3D lines and surfaces
in the camera coordinate system of the current frame. In updating the local map,
the study adopts the triangulation of 2D points and the triangulation of 2D lines.
The triangulation of points and lines is shown schematically in Fig. 3.
In Fig. 3, $p_e^{c_n}$ and $p_s^{c_n}$ are the 2D endpoints of the line in frame $n$, while
$O_n$ and $O_k$ denote the optical centers of the two cameras. $p_e^{c_k}$ and $p_s^{c_k}$
are the 2D endpoints of the line in the current frame, and $\pi'$ is the plane. $p^{c_k}$
and $p^{c_n}$ are the 2D point of the current frame and the matching 2D point of frame $n$,
respectively. The local map optimization of points, lines, planes, and MFs also adopts the
LM method. Keyframe screening removes redundant keyframes, and a keyframe is judged redundant
when 90% of its PC can be observed by other keyframes.
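The 90% redundancy rule can be expressed directly; in the sketch below each keyframe is assumed to carry a set of observed map-point ids (`map_points`), which is an illustrative data layout rather than the actual structure.

```python
def is_redundant(keyframe, other_keyframes, ratio=0.9):
    """Mark a keyframe as redundant when at least `ratio` of the map points
    it observes are also observed by other keyframes (the 90% rule above).
    `map_points` is assumed to be a set of map-point ids per keyframe."""
    if not keyframe.map_points:
        return False
    seen_elsewhere = set()
    for kf in other_keyframes:
        if kf is not keyframe:
            seen_elsewhere |= kf.map_points
    shared = len(keyframe.map_points & seen_elsewhere)
    return shared >= ratio * len(keyframe.map_points)
```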
3.2. Design of Improved VINS-Mono Algorithm Based on Multi-sensor Fusion
In the previous section, the improved ORB-SLAM2 algorithm based on the fusion of multiple
PLS features was designed for the SLAM algorithm that suffers from accuracy degradation
in low-texture scenes. However, that algorithm still suffers from tracking loss in complex
scenes, such as scenes with sparse features and robot steering. To solve this problem,
the study designed an improved algorithm based on multi-sensor fusion. The algorithm fuses
a monocular camera (Mono), an IMU, and an encoder; the Mono and IMU form the Monocular
Visual Inertial Navigation System (Mono-VINS), so the improved algorithm is referred to
as the improved Mono-VINS algorithm. The flow of the algorithm is shown in Fig. 4.
Fig. 4. Process of improved Mono-VINS algorithm based on multi-sensor fusion.
From Fig. 4, it can be seen that the improved Mono-VINS algorithm mainly includes sensor information
acquisition, sensor preprocessing, vision-IMU-encoder joint initialization, vision-IMU-encoder
sliding window optimization, and vision-IMU-encoder closed-loop optimization. Among
them, sensor preprocessing includes point feature extraction and processing, IMU pre-integration,
and so on. Vision-IMU-encoder closed-loop optimization mainly includes
closed-loop detection, closed-loop optimization, and keyframe database. In the improved
Mono-VINS algorithm, sensor preprocessing is responsible for processing the raw data
of monocular cameras, IMUs, and encoders. The joint initialization module of vision-IMU-encoder
needs to calculate the initial sliding window state variable value for sliding window
optimization. Vision-IMU-encoder sliding window optimization requires inserting keyframes
into closed-loop optimization, and this process requires estimating the state variables
of each frame image in the sliding window at different times. Vision-IMU-encoder closed-loop
optimization requires matching the current keyframe with the keyframe database. Eq.
(6) illustrates the measurement model under discrete time in IMU preprocessing [18].
In Eq. (6), $\partial'_{m+q/n}$ is the measured acceleration of the airframe and $\omega'_{m+q/n}$
is the measured angular velocity of the airframe, where $m$ and $q$ represent frame $m$ and
frame $q$, respectively, and $g^w$ represents the gravity vector in the WCS. $R_w^{b_{m+q/n}}$
is the rotation matrix from the WCS to the body coordinate system at the moment of the IMU
data in frame $q$. Additionally, $\partial_{m+q/n}$ and $\omega_{m+q/n}$ stand for the actual
values of acceleration and angular velocity, respectively. The Gaussian measurement noise
for acceleration and angular velocity is represented by $Z_\partial$ and $Z_\omega$,
respectively, while the acceleration and angular velocity biases between the $m$th and
$(m+1)$th image frames are represented by $b_{\partial_m}$ and $b_{\omega_m}$, respectively.
The IMU observations are obtained by integrating the IMU measurements, and position estimation
is performed by calculating the body position at the $(m+1)$th frame.
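What "integrating the IMU measurements" amounts to can be sketched as a single propagation step; the Euler integration, the small-angle rotation update, and the gravity sign convention below are textbook assumptions rather than details taken from Eq. (6).

```python
import numpy as np

def propagate_imu(p, v, R_wb, acc, gyro, b_a, b_g, g_w, dt):
    """One Euler propagation step of the body state from raw IMU samples.

    p, v: position and velocity in the world frame (numpy arrays); R_wb:
    world-from-body rotation matrix; acc, gyro: raw measurements; b_a, b_g:
    biases; g_w: gravity vector in the WCS (sign convention is an assumption).
    """
    a_w = R_wb @ (acc - b_a) - g_w            # bias-corrected acceleration in the world frame
    p_new = p + v * dt + 0.5 * a_w * dt ** 2  # position update
    v_new = v + a_w * dt                      # velocity update
    omega = (gyro - b_g) * dt                 # bias-corrected rotation increment
    # Small-angle update of the rotation matrix via the skew-symmetric matrix.
    W = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    R_new = R_wb @ (np.eye(3) + W)
    return p_new, v_new, R_new
```

The airframe velocity is calculated as shown in Eq. (7) [19].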
In Eq. (7), $v_{e_m}$ is the linear velocity and $\omega_{e_m}$ represents the angular velocity.
$v_{right}$ and $v_{left}$ are the speeds of the right and left wheels, respectively,
and $D$ is the distance between the left and right wheels. The encoder speed $v_{e_m}^{e_m}$
at the moment of the $m$th frame is shown in Eq. (8).
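Assuming the standard differential-drive model, which is consistent with the symbols defined for Eq. (7), the body velocities follow from the wheel speeds as in the short sketch below; this is the textbook kinematic relation rather than the paper's exact expression.

```python
def body_velocity_from_wheels(v_left, v_right, D):
    """Linear and angular velocity of a differential-drive body.

    v_left, v_right: left/right wheel speeds; D: distance between the wheels.
    A textbook kinematic relation given only to illustrate Eq. (7).
    """
    v = 0.5 * (v_left + v_right)      # forward (linear) velocity of the body
    omega = (v_right - v_left) / D    # yaw rate about the vertical axis
    return v, omega
```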
The IMU velocity observation $v_{b_m}^{b_m}$ is shown in Eq. (9).
In Eq. (9), $R_e^b$ represents the rotation of the encoder with respect to the IMU, and $(p_e^b)_\alpha$
and $(p_e^b)_\beta$ denote the $\alpha$ and $\beta$ components of the encoder displacement
$p_e^b$ with respect to the IMU, respectively. The slip factor
is calculated as shown in Eq. (10)
[20].
In Eq. (10), $\omega_{b_m}$ represents the instantaneous angular velocity of the airframe, and $\delta$
represents the adjustment parameter, which serves as the threshold value. The smaller
the value of the slip factor, the more serious the data distortion. The main steps
of the joint vision-IMU-encoder initialization are shown in Fig. 5.
Fig. 5. The main steps of visual IMU encoder joint initialization.
Fig. 6. The main process of visual IMU encoder sliding window optimization.
In Fig. 5, the first step of the joint vision-IMU-encoder initialization is the computation
of the camera position, the 2D point features, the 3D positions of the points and
the map point positions. The second step is to compute the optimal solution of IMU
angular velocity bias for all image frames. The third step is to determine whether
or not there has been a significant change in the optimal solution. If so, the IMU
measurements are re-integrated; if not, the IMU observations are updated. The fourth
step involves computing the airframe velocity,
scaling factor, and gravity vector. The fifth step is to determine whether the modulus
of the gravity vector is close to the acceleration of gravity. If so, the modulus of
the gravity vector is set to the acceleration of gravity; otherwise, the process ends.
The sixth step is to define the WCS origin to coincide with
the reference coordinate system (RCS) origin to complete the initialization and end
the process. The IMU angular velocity bias $b_\omega$ optimal solution is calculated
as shown in Eq. (11).
In Eq. (11), $q_{b_m}^{c_0}$ represents the attitude of the body in the RCS, $\otimes$ is the
quaternion multiplication, and $M$ is the total number of images used for initialization. $\gamma_{b_{m+1}}^{b_m}$
is the relative rotation between the two images, and $q_{b_{m+1}}^{c_0}$ is the attitude
of the body in the RCS at the moment of the $m+1$ image. $\|\cdot\|^2$ is the square
of the Euclidean distance. The airframe velocity, gravity vector and scale factor
are calculated as shown in Eq. (12).
In Eq. (12), $v_{b_m}^{b_m}$ represents the velocity of the body in the body coordinate system
at the moment of the $m$th image frame, and $g^{c_0}$ and $S$ are the gravity vector
in the RCS and the scale factor, respectively. Fig. 6 depicts the primary workflow of the vision-IMU-encoder sliding window optimization.
The definition of the state variables and the definition of the objective function
are the first and second steps, respectively, in the vision-IMU-encoder sliding window
optimization shown in Fig. 6. Obtaining the IMU measurement error term and the visual measurement error term is
the third step, and the fourth step is to determine whether the encoder measurement
error term is needed. The fifth step is to obtain the Jacobian matrices of the visual,
IMU, and encoder measurement error terms, and the sixth step is to obtain the state
variables. The seventh step is to determine whether to use the current frame as a
keyframe; if so, the process proceeds to the subsequent closed-loop optimization,
otherwise it returns to sensor initialization. Eq. (13) displays the body state expression.
In Eq. (13), $(q_{b_m}^w, p_{b_m}^w)$ are the initial values of the body attitude and position, and $v_{b_m}^w$
is the initial value of the body velocity in the WCS. The encoder measurement error term is displayed in Eq. (14).
In Eq. (14), $R_w^{b_m}$ represents the rotation matrix from the WCS to the airframe coordinate
system at the moment of the IMU data in the $m$th frame. In the closed-loop optimization,
the study optimizes the keyframe heading angle and the airframe position. The state
variables at the time of optimization are defined as shown in Eq. (15).
In Eq. (15), $p_{b_m}^w$ represents the airframe position in the WCS in frame $m$. $\psi_{b_m}^w$
is the airframe heading angle in the WCS of frame $m$.
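Since the closed-loop optimization keeps only the position and heading angle of each keyframe as variables, a relative-pose residual for one loop edge can be sketched as below; the 4-DoF parameterisation follows the text, while the exact residual form is an assumption.

```python
import numpy as np

def loop_edge_residual(p_i, psi_i, p_j, psi_j, p_ij_meas, psi_ij_meas):
    """Residual of one edge in the closed-loop (pose-graph) optimization.

    Each keyframe m is parameterised by its position p_{b_m}^w and heading
    angle psi_{b_m}^w, as in Eq. (15). The relative measurement
    (p_ij_meas, psi_ij_meas) comes from odometry or a detected loop.
    The residual form is illustrative, not the paper's exact expression.
    """
    c, s = np.cos(psi_i), np.sin(psi_i)
    # Yaw-only rotation of keyframe i (world-from-body, heading angle only).
    R_wi = np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])
    r_p = R_wi.T @ (p_j - p_i) - p_ij_meas   # relative-position error in frame i
    r_psi = (psi_j - psi_i) - psi_ij_meas    # relative-heading error
    return np.hstack([r_p, [r_psi]])
```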