Yeejin Lee (Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea; yeejinlee@seoultech.ac.kr)
Byeongkeun Kang (Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea; byeongkeun.kang@seoultech.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Visual attention estimation, Intelligent transportation system, Convolutional neural networks, Saliency estimation, Video-based
1. Introduction
While driving, some parts of the road scene are often more crucial for safety than others. Most human drivers have been well trained by experience and knowledge to pay more attention to those regions. Accordingly, even with the limited capacity of human perception, humans use their capability efficiently and effectively to maximize safety.
Recently, with the advancement of deep neural networks and GPUs, the development of
autonomous driving vehicles and robots has been accelerated. Considering that these
vehicles and robots also have a limited capacity for processing data from sensors,
giving more attention to relatively crucial areas can improve efficiency and safety.
Hence, we investigated a framework that finds relatively critical regions given data from an imaging sensor. For human drivers, the framework can also be utilized to identify whether a driver has perceived the important regions by comparing the predicted regions to the driver’s gaze information. Driver-assistance systems could then provide information about where the driver might need to pay more attention. Moreover, the framework can also be utilized to help student drivers practice on roads or in virtual environments.
Hence, the framework could be valuable for both autonomous driving systems and human
drivers.
As mentioned earlier, humans in general have successfully navigated complex driving
environments and are proficient at identifying crucial regions. Hence, the aim of
the framework is to mimic the gaze behavior of human drivers and especially to locate
the areas where human drivers would give more attention. To determine the regions,
humans often consider the movement of objects/people as well as the type/location
of things/stuff. For instance, vehicles that are coming closer are generally more
important than vehicles that are going farther away.
Also, vehicles on roads are more important than chairs on sidewalks. Moreover, when driving at high speed, distant things may approach quickly, so drivers might need to pay attention to them. However, far objects appear tiny in images, which often makes it hard to extract meaningful information such as their type.
Hence, to achieve robust and accurate attention prediction, we utilized visual fixation
data that was collected from human drivers on actual roads. Given the dataset, we
propose a framework that utilizes both multi-scale color images and motion information
to predict where drivers would look for safe driving, as shown in Fig. 1.
The contributions of this paper are as follows:
· We present a framework that utilizes both color images and motion information to
consider both the movements of objects/people and the type/location of things/stuff;
· We propose using multiscale images and motion maps to better extract features from both tiny/far objects and large/close things;
· We experimentally demonstrate that the proposed framework achieves state-of-the-art
quantitative results on an actual driving dataset.
Fig. 1. Overview of the proposed multiscale color and motion-based attention prediction network.
2. Related Works
Deep neural network-based supervised learning approaches usually require huge datasets.
Accordingly, research on prediction of drivers’ visual attention has been initiated
by collecting an appropriate dataset. Alletto et al. collected an actual driving dataset
that includes drivers’ gaze-fixation information to learn drivers’ behavior from the
dataset [1]. Tawari et al. utilized that dataset and presented a CNN-based framework to estimate
drivers’ visual attention [2].
Palazzi et al. proposed coarse-to-fine CNNs that consist of a coarse module and a
fine module [3]. The coarse module predicts an initial saliency map using 16 consecutive frames.
Then, the fine module estimates the final saliency map using the output of the coarse
module and the corresponding RGB frame. Later, Palazzi et al. extended the network
to a multi-branch network that consists of a raw video stream, a motion information
stream, and a scene semantics stream [4].
Xia et al. presented a framework that extracts features using convolution layers and
predicts a final gaze map using a convolutional LSTM network [5]. Fang et al. [6] proposed the semantic context-induced attentive fusion network (SCAFNet), which utilizes
RGB frames and corresponding semantic segmentation maps [7]. Li et al. proposed DualNet, which consists of a static branch and a dynamic branch
where the static branch and the dynamic branch take a single frame and two consecutive
frames as input, respectively [8]. Ning et al. explicitly extracted motion information and presented a Y-shaped
network that uses both image and motion information [9]. Kang et al. presented a CNN framework that fuses feature representations at multiple
resolutions from RGB images [10]. Later, they also presented another model that uses motion information [11].
Recently, Nakazawa et al. proposed a method that incorporates multiple center biases
to improve accuracy [12]. Lv et al. utilized reinforcement learning to avoid scattered attention prediction
[13].
Other datasets were also proposed for drivers’ attention estimation. Xia et al. presented
another drivers’ attention dataset, the Berkeley DeepDrive Attention (BDD-A) dataset,
which consists of 1,232 online videos [5]. While the gaze information in the DR(eye)VE dataset [1] was collected during actual driving, that in the BDD-A dataset was collected in a
lab environment. Fang et al. also presented another drivers’ attention dataset, DADA-2000,
to predict accidents by using drivers’ attention [6]. They collected 2,000 video clips (658,476 frames) online and categorized them into
54 accident categories (e.g., pedestrians, vehicles, and cyclists). They annotated
crashing objects spatially, accident occurrence temporally, and the attention map
at each frame.
This study utilizes an actual driving dataset [1] for experimental demonstration. It is also related to the methods that use both color
and motion information explicitly. We experimentally demonstrate that the proposed
framework achieves state-of-the-art performance on the dataset.
3. Proposed Method
To determine regions that are more crucial for safe driving, humans often consider
the movement of objects/people as well as the type/location of things/stuff. For
instance, vehicles coming closer are generally more important than vehicles going
farther away. Also, vehicles on roads are more important than chairs on sidewalks.
Hence, we propose a framework that utilizes both color images and estimated motion
information to predict where drivers would look for safe driving.
Moreover, when driving at high speed, distant things may approach quickly, so drivers might need to pay attention to them. However, as far objects appear tiny in images, it is often hard to extract meaningful information such as their type. Hence, to better deal with things of diverse sizes and at various distances, the proposed framework utilizes multiscale color images and motion information.
Given a sequence of images, pixel-wise motion information is estimated by using an
optical flow estimation method. Then, a pyramid of color images and a pyramid of optical
flow maps are constructed as multiscale inputs (see Fig. 2). Color images are processed by a color-based attention prediction (CAP)
stream, and optical flow maps are processed by a motion-based attention prediction
(MAP) stream. Then, to obtain the final prediction, outputs of both streams are processed
by the final attention prediction (FAP) stream.
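To make the data flow concrete, the sketch below builds the two input pyramids and applies the shared-weight CAP and MAP streams to each scale. This is a minimal sketch assuming PyTorch; the three scales (1, 1/2, and 1/4) and the function names are illustrative assumptions, not the authors’ released code.

```python
import torch.nn.functional as F

def build_pyramid(x, scales=(1.0, 0.5, 0.25)):
    """Return a list of tensors resampled to each scale (x: B x C x H x W)."""
    return [x if s == 1.0 else
            F.interpolate(x, scale_factor=s, mode='bilinear', align_corners=False)
            for s in scales]

def run_streams(rgb, flow, cap_stream, map_stream):
    """Apply the CAP stream to every color scale and the MAP stream to every
    flow scale; within a stream, weights are shared across scales, but the two
    streams do not share weights with each other."""
    cap_outputs = [cap_stream(x) for x in build_pyramid(rgb)]   # color branch
    map_outputs = [map_stream(x) for x in build_pyramid(flow)]  # motion branch
    return cap_outputs, map_outputs
```

The per-scale outputs of the two streams are then passed to the FAP stream described below.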
Fig. 2. Proposed multiscale color and motion-based attention prediction network. CAP streams and MAP streams utilize shared weights for inputs with varying resolutions, while the weights between the two streams are not shared.
3.1 MCMNet
To achieve accurate and robust attention prediction, we propose the multiscale color and motion-based (MCM) attention prediction network (MCMNet). The network can be divided into three main parts: the multiscale CAP stream, the multiscale MAP stream, and the FAP part, as shown in Fig. 1.
In the multiscale color-based attention prediction stream, a pyramid of color images
is constructed from images with varying resolutions (see Fig. 2). Then, convolutional neural networks are used to extract features that are useful
for attention prediction. Similarly, the multiscale motion-based attention prediction
stream constructs another pyramid of optical flow maps with diverse resolutions and
extracts features using separate convolutional neural networks. The final attention
prediction part concatenates the outputs of the two streams and processes the concatenated
map to obtain the final attention map.
Regarding the FAP stream, given the outputs of each pyramid level from the two streams, the outputs from the reduced-resolution pyramid levels are upsampled to the original resolution. Then, the outputs are concatenated and normalized by channel-wise min-max normalization. Given the normalized output, the final attention prediction is obtained by point-wise convolution.
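A minimal sketch of this fusion step is given below, again assuming PyTorch; the class name FAPHead and the channel counts are illustrative, and only the operations named above (upsampling, concatenation, channel-wise min-max normalization, and a point-wise convolution) are taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAPHead(nn.Module):
    """Fuse the per-scale CAP/MAP outputs into the final attention map."""
    def __init__(self, in_channels, eps=1e-6):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels, 1, kernel_size=1)  # point-wise conv
        self.eps = eps

    def forward(self, stream_outputs, out_size):
        # upsample every reduced-resolution output back to the original size
        ups = [F.interpolate(o, size=out_size, mode='bilinear', align_corners=False)
               for o in stream_outputs]
        x = torch.cat(ups, dim=1)
        # channel-wise min-max normalization over the spatial dimensions
        x_min = x.amin(dim=(2, 3), keepdim=True)
        x_max = x.amax(dim=(2, 3), keepdim=True)
        x = (x - x_min) / (x_max - x_min + self.eps)
        return self.fuse(x)  # final attention prediction
```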
The details of the convolutional neural networks in the CAP stream are shown in Fig. 3. The single CAP stream that takes the original input image is equivalent to the network architecture in [16]. The CAP stream consists of stem layers, multi-resolution feature extraction through four stages, and final layers. The stem layers consist of two strided convolution layers that reduce the resolution of the feature maps. The stages of the multi-resolution feature extraction part are denoted by $M$ with a superscript indicating the stage index.
The first stage $M^{1}$ processes features at a single resolution, while the second, third, and fourth stages ($M^{2}$, $M^{3}$, and $M^{4}$) process features at two, three, and four resolutions, respectively. Subscripts denote the branch indices, where the resolution of the $i$-th branch is $2^{1-i}$ times the resolution of the output of $M^{1}$.
As shown in Fig. 3, the outputs of the modules at the same step are fused to merge the extracted information
at varying resolutions. In the end, the extracted features are upsampled to the highest
resolution of the branches and are concatenated. This is followed by the final layers,
which consist of two pointwise convolution layers.
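The stem and final layers can be sketched as follows. This is a rough sketch assuming PyTorch; channel widths, kernel sizes, and the use of batch normalization are assumptions made for illustration, since only the number of strided and point-wise convolutions is specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_stem(in_ch=3, width=64):
    """Two strided 3x3 convolutions that reduce the feature-map resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, width, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(width), nn.ReLU(inplace=True),
        nn.Conv2d(width, width, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(width), nn.ReLU(inplace=True))

class FinalLayers(nn.Module):
    """Upsample all branch outputs to the highest branch resolution,
    concatenate them, and apply two point-wise convolutions."""
    def __init__(self, total_channels, hidden=128, out_ch=1):
        super().__init__()
        self.conv1 = nn.Conv2d(total_channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, out_ch, kernel_size=1)

    def forward(self, branch_outputs):
        size = branch_outputs[0].shape[-2:]  # highest-resolution branch
        ups = [F.interpolate(o, size=size, mode='bilinear', align_corners=False)
               for o in branch_outputs]
        return self.conv2(F.relu(self.conv1(torch.cat(ups, dim=1))))
```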
More details of the modules in the CAP stream are shown in Fig. 4. The module in the first stage consists of four residual blocks where each block
consists of three stacked convolution layers and one skip connection. The modules
in the other stages consist of four blocks where each block contains two stacked convolution
layers and one skip connection. For both module types, if the numbers of input and output channels differ, the input is processed by a pointwise convolution to make them equal. This projection is denoted by a dot-dash line and enables the residual connection.
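The two block types can be sketched as follows, assuming PyTorch; the three-convolution block is written here in a bottleneck style and the two-convolution block in a basic style, with channel widths and normalization layers chosen for illustration rather than taken from the paper.

```python
import torch.nn as nn

class TwoConvBlock(nn.Module):
    """Block used in stages 2-4: two stacked convolutions plus a skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        # point-wise projection (dot-dash line) so the skip path matches channels
        self.proj = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

class ThreeConvBlock(nn.Module):
    """Block used in the first stage: three stacked convolutions plus a skip connection."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.proj = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))
```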
The structure of the MAP stream is the same as that of the CAP stream except for the
first strided convolution layer in the stem layers. As motion information is encoded
in a map with two channels rather than three channels, the kernel dimension of the
first layer is modified accordingly. While the other layers are identical in structure, their parameters are not shared between the two streams.
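The modification amounts to changing only the input channel count of the first convolution, as in the short sketch below; the kernel size, stride, and width are illustrative assumptions.

```python
import torch.nn as nn

# CAP stream: the first stem convolution takes a 3-channel RGB image
cap_first_conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False)
# MAP stream: the first stem convolution takes a 2-channel (dx, dy) flow map
map_first_conv = nn.Conv2d(2, 64, kernel_size=3, stride=2, padding=1, bias=False)
# the remaining layers are architecturally identical, but the two streams are
# trained with separate parameters (no weight sharing between CAP and MAP)
```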
Fig. 3. Color-based attention prediction stream in the MCMNet.
Fig. 4. Details of the blocks in the color-based attention prediction stream in the MCMNet.
3.2 Motion Estimation
Pixel-wise motion between consecutive frames is estimated by first converting RGB frames to grayscale images and then applying an optical flow estimation algorithm [17]. The optical flow map at the $t$-th frame is estimated using the two consecutive grayscale images at the $t$-th and ($t-1$)-th frames. For the first frame of each sequence, as no previous frame exists, the corresponding optical flow map is computed using the next frame and the current frame, assuming motion is temporally smooth.
Given two grayscale images, we first construct a pyramid of images for each frame, where each pyramid contains three images: the original frame, the image at half resolution, and the image at quarter resolution. Then, iterative displacement estimation is performed to obtain more accurate motion estimates, with three iterations per pyramid level. Also, to avoid noisy estimates, a $5\times 5$ neighboring region is used for each pixel under the assumption of a slowly varying displacement field. Fig. 5 shows the colorized optical flow map, the corresponding RGB frame, and the previous RGB frame from the top to the bottom row.
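This procedure maps closely onto OpenCV’s implementation of the Farneback algorithm [17]. A sketch is shown below, where the pyramid scale, number of levels, and iterations follow the text, while the polynomial-expansion parameters (and the exact mapping of the 5x5 neighborhood onto the window-size argument) are assumptions.

```python
import cv2

def estimate_flow(prev_bgr, curr_bgr):
    """Return an (H, W, 2) optical flow map from two consecutive color frames."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        0.5,   # pyr_scale: each level is half the previous resolution
        3,     # levels: original, half, and quarter resolution
        5,     # winsize: 5x5 neighborhood, assuming slowly varying displacement
        3,     # iterations per pyramid level
        5,     # poly_n (assumed)
        1.1,   # poly_sigma (assumed)
        0)     # flags

def flow_sequence(frames):
    """Per-frame flow maps; the first frame reuses the flow computed between
    the first two frames, assuming temporally smooth motion."""
    flows = [estimate_flow(frames[t - 1], frames[t]) for t in range(1, len(frames))]
    return [flows[0]] + flows
```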
Fig. 5. Visualization of the result of motion estimation.
4. Experiments and Results
4.1 Dataset
The DR(eye)VE dataset [1] was utilized to demonstrate the effectiveness of the proposed framework by comparing
it to other previous methods. The dataset consists of 74 videos (555,000 frames) and
was collected from eight drivers under varying weather and lighting conditions. To
obtain gaze information, eye-tracking glasses (SMI ETG 2w) were used. The collected
RGB frames have a resolution of $1920\times 1080$ and were collected at 25 FPS. Example
data is shown in Fig. 6.
We follow the same video sequence split as in [1] for training and evaluation. We excluded the frames that are marked as errors in [1]. We also excluded the data that was collected while the vehicle was not moving, because at those moments, drivers’ gaze behavior can be unrelated to driving safely on roads.
Fig. 6. Qualitative comparison of the proposed framework and other previous methods.
4.2 Results
For quantitative comparison, the average of the correlation coefficients was used. For each frame, the correlation coefficient was computed as follows:

$$CC^{f}=\frac{\sum_{h,w}\left(P_{hw}^{f}-\overline{P}^{f}\right)\left(G_{hw}^{f}-\overline{G}^{f}\right)}{\sqrt{\sum_{h,w}\left(P_{hw}^{f}-\overline{P}^{f}\right)^{2}\sum_{h,w}\left(G_{hw}^{f}-\overline{G}^{f}\right)^{2}}},$$

where $P_{hw}^{f}$ and $G_{hw}^{f}$ denote the final attention prediction and the ground truth attention at the $(h,w)$ pixel of the $f$-th frame, and $\overline{P}^{f}$ and $\overline{G}^{f}$ represent the average of the attention prediction and that of the ground truth attention of the $f$-th frame.
Then, the average of the correlation coefficients is computed over all the frames in the evaluation split as follows:

$$\overline{CC}=\frac{1}{N}\sum_{f=1}^{N}CC^{f},$$

where $N$ denotes the total number of frames in the evaluation split.
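For reference, a small NumPy sketch of this metric (the per-frame Pearson correlation coefficient averaged over the evaluation split) is given below; the function names are illustrative.

```python
import numpy as np

def frame_cc(pred, gt):
    """Correlation coefficient CC^f between two H x W attention maps."""
    p = pred - pred.mean()
    g = gt - gt.mean()
    return (p * g).sum() / np.sqrt((p ** 2).sum() * (g ** 2).sum())

def average_cc(preds, gts):
    """Average of CC^f over all N frames in the evaluation split."""
    return float(np.mean([frame_cc(p, g) for p, g in zip(preds, gts)]))
```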
Table 1 shows a quantitative comparison of the proposed MCMNet and other previous methods.
The comparison demonstrates that the proposed method outperforms all previous methods
by utilizing the multiscale color and motion-based attention prediction framework.
The "baseline" method denotes using the average of ground truth annotations for attention
prediction. The "Itti" [14] and "GBVS" [15] methods are traditional saliency detection methods that are not based on deep neural
networks. All other methods are deep neural network-based approaches.
Fig. 6 shows the qualitative comparisons. Each column shows the results for a different frame.
From the first row to the bottom row, the images show input images, the results of
GBVS [15], those of Multi-branch [4], those of HR Attention Net [10], those of Motion HR Attention Net [11], those of the proposed MCMNet, and the ground truth fixation map, respectively. The
qualitative results demonstrate that the proposed method achieves more accurate and
smoother attention prediction.
Table 2 shows the results of the ablation study using either only a single CAP stream or
a single MAP stream. The CAP stream uses an original RGB frame, and the MAP stream
uses a motion map at the original resolution. Accordingly, the two results do not
have explicit multiscaling and color/motion fusion. The results show that motion information is nearly as informative as color information, though slightly less so. The results also demonstrate that utilizing explicit multiscaling and color/motion information fusion improves performance.
Table 1. Quantitative Comparison using the Average of Correlation Coefficients.

Method | Correlation Coefficient
Baseline | 0.47
Itti [14] | 0.16
GBVS [15] | 0.20
Tawari [2] | 0.51
HWS [5] | 0.55
Multi-Branch [4] | 0.56
Motion HR Attention Net [11] | 0.58
HR Attention Net [10] | 0.60
Proposed MCMNet | 0.61
Table 2. Ablation Study of the Components of the MCMNet.

Method | Correlation Coefficient
Single CAP stream | 0.59
Single MAP stream | 0.57
Proposed MCMNet | 0.61
5. Conclusion
We presented a novel method that utilizes both color and motion information to achieve
robust and accurate attention prediction. The proposed MCMNet consists of three components:
the CAP stream, MAP stream, and FAP part. The first two streams extract attention-related
features from multiscale color images and multiscale motion information, respectively,
and the FAP part merges them and predicts the final attention map. To demonstrate
the effectiveness of the proposed method, we experimented with an actual driving dataset.
Experimental results showed that the proposed framework achieves state-of-the-art performance.
ACKNOWLEDGMENTS
This study was financially supported by the Seoul National University of Science
and Technology.
REFERENCES
[1] Alletto S., Palazzi A., Solera F., Calderara S., Cucchiara R., 2016, DR(eye)VE: A Dataset for Attention-Based Tasks with Applications to Autonomous and Assisted Driving, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 54-60.
[2] Tawari A., Kang B., 2017, A Computational Framework for Driver's Visual Attention Using a Fully Convolutional Architecture, 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 887-894.
[3] Palazzi A., Solera F., Calderara S., Alletto S., Cucchiara R., 2017, Learning Where to Attend Like a Human Driver, 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 920-925.
[4] Palazzi A., Abati D., Calderara S., Solera F., Cucchiara R., July 2019, Predicting the Driver's Focus of Attention: The DR(eye)VE Project, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, No. 7, pp. 1720-1733.
[5] Xia Y., Zhang D., Kim J., Nakayama K., Zipser K., Whitney D., 2019, Predicting Driver Attention in Critical Situations, Computer Vision - ACCV 2018, Lecture Notes in Computer Science, Vol. 11365.
[6] Fang J., Yan D., Qiao J., Xue J., Wang H., Li S., 2019, DADA-2000: Can Driving Accident be Predicted by Driver Attention? Analyzed by A Benchmark, 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 4303-4309.
[7] Fang J., Yan D., Qiao J., Xue J., Yu H., 2021, DADA: Driver Attention Prediction in Driving Accident Scenarios, IEEE Transactions on Intelligent Transportation Systems.
[8] Li A., 2017, Learning Driver Gaze, M.Eng. thesis.
[9] Ning M., Lu C., Gong J., 2019, An Efficient Model for Driving Focus of Attention Prediction using Deep Learning, 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 1192-1197.
[10] Kang B., Lee Y., Apr. 2020, High-Resolution Neural Network for Driver Visual Attention Prediction, Sensors, Vol. 20, No. 7, pp. 2030.
[11] Kang B., Lee Y., May 2021, A Driver's Visual Attention Prediction Using Optical Flow, Sensors, Vol. 21, No. 11, pp. 3722.
[12] Nakazawa S., Nakada Y., 2020, Improvement of Mixture-of-Experts-Type Model to Construct Dynamic Saliency Maps for Predicting Drivers' Attention, 2020 IEEE Symposium Series on Computational Intelligence (SSCI).
[13] Lv K., Sheng H., Xiong Z., Li W., Zheng L., 2021, Improving Driver Gaze Prediction with Reinforced Attention, IEEE Transactions on Multimedia.
[14] Itti L., Koch C., Niebur E., Nov. 1998, A Model of Saliency-Based Visual Attention for Rapid Scene Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, pp. 1254-1259.
[15] Harel J., Koch C., Perona P., 2007, Graph-based Visual Saliency, Advances in Neural Information Processing Systems 19, MIT Press, pp. 545-552.
[16] Wang J., Sun K., Cheng T., Jiang B., Deng C., Zhao Y., Liu D., Mu Y., Tan M., Wang X., Liu W., Xiao B., Oct. 2021, Deep High-Resolution Representation Learning for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 10, pp. 3349-3364.
[17] Farnebäck G., 2003, Two-Frame Motion Estimation Based on Polynomial Expansion, Scandinavian Conference on Image Analysis, Lecture Notes in Computer Science, Springer.
Author
Yeejin Lee received a Ph.D. degree in electrical and computer engineering from
the University of California at San Diego, La Jolla, CA, USA, in 2017. She was a Postdoctoral
Fellow in radiology with the University of California at Los Angeles, Los Angeles,
CA, USA, from 2017 to 2018. She is currently an assistant professor with Seoul National
University of Science and Technology, Seoul, Republic of Korea. Her current research
interests include computer vision, color image processing, and machine learning.
Byeongkeun Kang is currently an assistant professor at Seoul National University
of Science and Technology, Seoul. He was a Postdoctoral Fellow at the Robotics Institute,
Carnegie Mellon University, Pittsburgh, PA, USA, from 2018 to 2019. He received a
B.S. degree in electrical and electronic engineering from Yonsei University, Seoul,
Republic of Korea, in 2013, and M.S. and Ph.D. degrees in electrical and computer
engineering from the University of California at San Diego, La Jolla, CA, USA, in
2015 and 2018, respectively. His current research interests include semantic segmentation,
object detection, and human–machine interaction.