
1. (Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea; yeejinlee@seoultech.ac.kr)
2. (Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea; byeongkeun.kang@seoultech.ac.kr)

Keywords: Visual attention estimation, Intelligent transportation system, Convolutional neural networks, Saliency estimation, Video-based

## 1. Introduction

While driving, some parts of the road scene are more crucial for safety than others. Most human drivers have been well trained by experience and knowledge to give more attention to those regions. Accordingly, even with the limited capacity of human perception, humans use their capability efficiently and effectively to maximize safety.

Recently, with the advancement of deep neural networks and GPUs, the development of autonomous driving vehicles and robots has been accelerated. Considering that these vehicles and robots also have a limited capacity for processing data from sensors, giving more attention to relatively crucial areas can improve efficiency and safety. Hence, we investigated a framework to find relatively critical regions given the data from an imaging sensor. For human drivers, the framework can also be utilized to identify whether a driver has perceived the important regions by comparing the predicted regions to the driver's gaze information. Then, driver-assistance systems could provide information about where the driver might need to give more attention. Moreover, the framework can also be utilized to help student drivers practice on roads or in virtual environments. Hence, the framework could be valuable for both autonomous driving systems and human drivers.

As mentioned earlier, humans in general have successfully navigated complex driving environments and are proficient at identifying crucial regions. Hence, the aim of the framework is to mimic the gaze behavior of human drivers and especially to locate the areas where human drivers would give more attention. To determine the regions, humans often consider the movement of objects/people as well as the type/location of things/stuff. For instance, vehicles that are coming closer are generally more important than vehicles that are going farther away.

Also, vehicles on roads are more important than chairs on sidewalks. Moreover, while driving at high speed, distant things might come closer shortly. Consequently, drivers might need to give them attention. However, far objects are tiny in images, so it is often hard to extract meaningful information such as the types of objects.

Hence, to achieve robust and accurate attention prediction, we utilized visual fixation data that was collected from human drivers on actual roads. Given the dataset, we propose a framework that utilizes both multi-scale color images and motion information to predict where drivers would look for safe driving, as shown in Fig. 1.

The contributions of this paper are as follows:

· We present a framework that utilizes both color images and motion information to consider both the movements of objects/people and the type/location of things/stuff;

· We propose using multiscale images and motion maps to better extract features from both tiny/far objects and large/close things;

· We experimentally demonstrate that the proposed framework achieves state-of-the-art quantitative results on an actual driving dataset.

## 2. Related Works

Deep neural network-based supervised learning approaches usually require huge datasets. Accordingly, research on the prediction of drivers’ visual attention was initiated by collecting an appropriate dataset. Alletto et al. collected an actual driving dataset that includes drivers’ gaze-fixation information to learn drivers’ behavior from the dataset [1]. Tawari et al. utilized that dataset and presented a CNN-based framework to estimate drivers’ visual attention [2].

Palazzi et al. proposed coarse-to-fine CNNs that consist of a coarse module and a fine module [3]. The coarse module predicts an initial saliency map using 16 consecutive frames. Then, the fine module estimates the final saliency map using the output of the coarse module and the corresponding RGB frame. Later, Palazzi et al. extended the network to a multi-branch network that consists of a raw video stream, a motion information stream, and a scene semantics stream [4].

Xia et al. presented a framework that extracts features using convolution layers and predicts a final gaze map using a convolutional LSTM network [5]. Fang et al. proposed the semantic context-induced attentive fusion network (SCAFNet), which utilizes RGB frames and corresponding semantic segmentation maps [7]. Li et al. proposed DualNet, which consists of a static branch and a dynamic branch, where the static branch and the dynamic branch take a single frame and two consecutive frames as input, respectively [8]. Ning et al. explicitly extracted motion information and presented a Y-shape structured network that uses both image and motion information [9]. Kang et al. presented a CNN framework that fuses feature representations at multiple resolutions from RGB images [10]. Later, they also presented another model that uses motion information [11].

Recently, Nakazawa et al. proposed a method that incorporates multiple center biases to improve accuracy [12]. Lv et al. utilized reinforcement learning to avoid scattered attention prediction [13].

Other datasets were also proposed for drivers’ attention estimation. Xia et al. presented another drivers’ attention dataset, the Berkeley DeepDrive Attention (BDD-A) dataset, which consists of 1,232 online videos [5]. While the gaze information in another dataset [1] was collected during actual driving, that in the BDD-A dataset was collected in a lab environment. Fang et al. also presented another drivers’ attention dataset, DADA-2000, to predict accidents by using drivers’ attention [6]. They collected 2,000 video clips (658,476 frames) online and categorized them into 54 accident categories (e.g., pedestrians, vehicles, and cyclists). They annotated crashing objects spatially, accident occurrence temporally, and the attention map at each frame.

This study utilizes an actual driving dataset [1] for experimental demonstration. It is also related to the methods that use both color and motion information explicitly. We experimentally demonstrate that the proposed framework achieves state-of-the-art performance on the dataset.

## 3. Proposed Method

To determine regions that are more crucial for safe driving, humans often consider the movement of objects/ people as well as the type/location of things/stuff. For instance, vehicles coming closer are generally more important than vehicles going farther away. Also, vehicles on roads are more important than chairs on sidewalks. Hence, we propose a framework that utilizes both color images and estimated motion information to predict where drivers would look for safe driving.

Moreover, while driving at high speed, distant things might come closer shortly. Consequently, drivers might need to give them attention. However, as far objects are tiny on images, it is often hard to extract meaningful information such as the types of objects. Hence, to better deal with things of diverse sizes and at various distances, the proposed framework utilizes multi-scaled color images and motion information.

Given a sequence of images, pixel-wise motion information is estimated by using an optical flow estimation method. Then, a pyramid of color images and a pyramid of optical flow maps are constructed for multi-scaled inputs (see Fig. 2). Color images are forward processed by a color-based attention prediction (CAP) stream, and optical flow maps are processed by a motion-based attention prediction (MAP) stream. Then, to obtain the final prediction, outputs of both streams are processed by the final attention prediction (FAP) stream.

### 3.1 MCMNet

To achieve accurate and robust attention prediction, we propose the multiscale color- and motion-based (MCM) attention prediction network, MCMNet. The network can be divided into three main parts: the multiscale CAP stream, the multiscale MAP stream, and the FAP part, as shown in Fig. 1.

In the multiscale color-based attention prediction stream, a pyramid of color images is constructed from images with varying resolutions (see Fig. 2). Then, convolutional neural networks are used to extract features that are useful for attention prediction. Similarly, the multiscale motion-based attention prediction stream constructs another pyramid of optical flow maps with diverse resolutions and extracts features using separate convolutional neural networks. The final attention prediction part concatenates the outputs of the two streams and processes the concatenated map to obtain the final attention map.
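As an illustration, a pyramid of inputs with varying resolutions can be built by repeated downsampling. The sketch below uses NumPy and 2x2 average pooling with three levels; the exact downsampling operator and the number of levels are assumptions for illustration, not details fixed by the text.

```python
import numpy as np

def build_pyramid(image, levels=3):
    """Build a multiscale pyramid by repeated 2x downsampling.

    `image` is an (H, W, C) array; each coarser level halves the spatial
    resolution by averaging 2x2 blocks (an assumed operator).
    """
    pyramid = [image]
    for _ in range(levels - 1):
        h, w = image.shape[:2]
        # Crop to even dimensions, then average each 2x2 block.
        image = image[: h - h % 2, : w - w % 2]
        image = image.reshape(image.shape[0] // 2, 2,
                              image.shape[1] // 2, 2, -1).mean(axis=(1, 3))
        pyramid.append(image)
    return pyramid
```

The same routine can be applied to a color frame and to a two-channel optical flow map to feed the CAP and MAP streams, respectively.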

Regarding the FAP stream, given the outputs of each pyramid level from the two streams, the outputs from pyramid levels with reduced resolutions are upsampled to the original resolution. Then, the outputs are concatenated and normalized by channel-wise min-max normalization. Given the normalized output, the final attention prediction is obtained by point-wise convolution.
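The FAP steps described above (upsample, concatenate, channel-wise min-max normalization, point-wise convolution) can be sketched in NumPy. Nearest-neighbour upsampling and the scalar `weights`, which stand in for the learned 1x1 convolution, are illustrative assumptions.

```python
import numpy as np

def fuse_attention(stream_outputs, weights):
    """Minimal sketch of the FAP stream.

    `stream_outputs`: per-level maps of shape (H_l, W_l) from both streams;
    `weights`: one scalar per channel, standing in for the learned
    point-wise (1x1) convolution.
    """
    H, W = stream_outputs[0].shape
    channels = []
    for out in stream_outputs:
        # Nearest-neighbour upsampling to the original resolution
        # (the interpolation method is an assumption).
        fy, fx = H // out.shape[0], W // out.shape[1]
        up = np.repeat(np.repeat(out, fy, axis=0), fx, axis=1)
        # Channel-wise min-max normalization to [0, 1].
        up = (up - up.min()) / (up.max() - up.min() + 1e-8)
        channels.append(up)
    stacked = np.stack(channels, axis=-1)   # concatenate: (H, W, C)
    return stacked @ np.asarray(weights)    # point-wise convolution
```

In the actual network, the 1x1 convolution weights are learned jointly with both streams rather than fixed.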

The details of the convolutional neural networks in the CAP stream are shown in Fig. 3. The single CAP stream that takes the original input image is equivalent to another network architecture [16]. The CAP stream consists of stem layers, multi-resolution feature extraction through four stages, and final layers. The stem layers consist of two strided convolution layers that reduce the resolution of the feature maps. The stages of the multi-resolution feature extraction part are denoted by $M$ with a superscript stage index.

The first stage $M^{1}$ processes at a single resolution, while the second, third, and fourth stages ($M^{2}$, $M^{3}$, and $M^{4}$) process at two, three, and four resolutions, respectively. The subscript represents the index $i$ of a branch, where the resolution of the $i$-th branch is $2^{1-i}$ times the resolution of the output of $M^{1}$. As shown in Fig. 3, the outputs of the modules at the same step are fused to merge the information extracted at varying resolutions. In the end, the extracted features are upsampled to the highest resolution among the branches and are concatenated. This is followed by the final layers, which consist of two pointwise convolution layers.

More details of the modules in the CAP stream are shown in Fig. 4. The module in the first stage consists of four residual blocks, where each block consists of three stacked convolution layers and one skip connection. The modules in the other stages consist of four blocks, where each block contains two stacked convolution layers and one skip connection. For both modules, if the numbers of channels of the input and output differ, the input is processed by a pointwise convolution to make them equal; this path is denoted by a dot-dash line and enables the residual connection.
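A block with the projection shortcut described above might look like the following PyTorch sketch. Only the two stacked convolutions, the skip connection, and the point-wise projection for mismatched channel counts follow the text; the activation choice and the absence of normalization layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions with a skip connection, as in the
    blocks of the later stages (layer details are illustrative).

    When input and output channel counts differ, the skip path applies a
    point-wise (1x1) convolution, i.e., the dot-dash line in Fig. 4.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # Point-wise projection only when channel counts differ.
        self.proj = (nn.Conv2d(in_ch, out_ch, 1)
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        return torch.relu(self.body(x) + self.proj(x))
```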

The structure of the MAP stream is the same as that of the CAP stream except for the first strided convolution layer in the stem layers. As motion information is encoded in a map with two channels rather than three channels, the kernel dimension of the first layer is modified accordingly. While other layers are the same, the parameters are of course not shared.

### 3.2 Motion Estimation

Pixel-wise motion between consecutive frames is estimated by first converting the RGB frames to grayscale images and then applying an optical flow estimation algorithm [17]. The optical flow map at the $t$-th frame is estimated using the two consecutive grayscale images at the $t$-th frame and the $(t-1)$-th frame. For the first frame of each sequence, as no previous frame exists, the corresponding optical flow map is computed using the next frame and the current frame, assuming motion is temporally smooth.

Given two grayscale images, we first construct a pyramid of images for each frame, where each pyramid contains three images: the original frame, the image at half resolution, and the image at quarter resolution. Then, iterative displacement estimation is performed to obtain more accurate motion estimates, with three iterations processed at each pyramid level. Also, to avoid noisy estimates, assuming slowly varying displacement, a $5\times 5$ neighboring region is used for each pixel. Fig. 5 shows the colorized optical flow map, the corresponding RGB frame, and the previous RGB image from the top to the bottom row.

## 4. Experiments and Results

### 4.1 Dataset

The DR(eye)VE dataset [1] was utilized to demonstrate the effectiveness of the proposed framework by comparing it to other previous methods. The dataset consists of 74 videos (555,000 frames) and was collected from eight drivers under varying weather and lighting conditions. To obtain gaze information, eye-tracking glasses (SMI ETG 2w) were used. The collected RGB frames have a resolution of $1920\times 1080$ and were collected at 25 FPS. Example data is shown in Fig. 6.

We follow the same video sequence split in [1] for training and evaluation. We excluded the frames that are marked as errors by [1]. We also excluded the data that was collected while vehicles were not moving. This was done because at those moments, drivers’ gaze behavior can be unrelated to driving safely on roads.

### 4.2 Results

For quantitative comparison, the average of correlation coefficients was used. For each frame, correlation coefficients were computed as follows:

##### (1)
$\mathrm{CC}^{f}=\frac{\sum _{h=1}^{H}\sum _{w=1}^{W}\left(P_{hw}^{f}-\overline{P}^{f}\right)\left(G_{hw}^{f}-\overline{G}^{f}\right)}{\sqrt{\sum _{h=1}^{H}\sum _{w=1}^{W}\left(P_{hw}^{f}-\overline{P}^{f}\right)^{2}}\sqrt{\sum _{h=1}^{H}\sum _{w=1}^{W}\left(G_{hw}^{f}-\overline{G}^{f}\right)^{2}}}$

where $P_{hw}^{f}$ and $G_{hw}^{f}$ denote the final attention prediction and the ground truth attention at the ($\mathrm{h},\mathrm{w})$ pixel of the $\mathrm{f}$-th frame, and $\overline{P}^{f}$ and $\overline{G}^{f}$ represent the average of the attention prediction and that of the ground truth attention of the $\mathrm{f}$-th frame.

Then, the average of correlation coefficients is computed over all the frames in the evaluation split as follows:

##### (2)
$\mathrm{CC}=\frac{1}{N}\sum _{f=1}^{N}\mathrm{CC}^{f}$

where $N$ denotes the total number of frames in the evaluation split.
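A direct NumPy implementation of Eqs. (1) and (2):

```python
import numpy as np

def correlation_coefficient(P, G):
    """Per-frame correlation coefficient CC^f of Eq. (1): the normalized
    covariance between the predicted map P and the ground truth map G,
    both of shape (H, W)."""
    p = P - P.mean()
    g = G - G.mean()
    return (p * g).sum() / (np.sqrt((p ** 2).sum()) * np.sqrt((g ** 2).sum()))

def average_cc(preds, gts):
    """Eq. (2): average CC over all N frames in the evaluation split."""
    return float(np.mean([correlation_coefficient(p, g)
                          for p, g in zip(preds, gts)]))
```

CC lies in $[-1, 1]$ and is invariant to positive affine rescaling of either map, so it compares the spatial patterns of the predicted and ground truth attention rather than their absolute magnitudes.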

Table 1 shows a quantitative comparison of the proposed MCMNet and other previous methods. The comparison demonstrates that the proposed method outperforms all previous methods by utilizing the multiscale color and motion-based attention prediction framework. The "baseline" method denotes using the average of ground truth annotations for attention prediction. The "Itti" [14] and "GBVS" [15] methods are traditional saliency detection methods that are not based on deep neural networks. All other methods are deep neural network-based approaches.

Fig. 6 shows the qualitative comparisons. Each column shows the results of varying frames. From the first row to the bottom row, the images show input images, the results of GBVS [15], those of Multi-branch [4], those of HR Attention Net [10], those of Motion HR Attention Net [11], those of the proposed MCMNet, and the ground truth fixation map, respectively. The qualitative results demonstrate that the proposed method achieves more accurate and smoother attention prediction.

Table 2 shows the results of the ablation study using either only a single CAP stream or only a single MAP stream. The CAP stream uses an original RGB frame, and the MAP stream uses a motion map at the original resolution. Accordingly, these two configurations have neither explicit multiscaling nor color/motion fusion. The results show that motion information is nearly as informative as color information, though slightly less so. They also demonstrate that utilizing explicit multiscaling and color/motion information fusion improves performance.

##### Table 1. Quantitative Comparison using the Average of Correlation Coefficient.
| Method | Correlation Coefficient |
|---|---|
| Baseline | 0.47 |
| Itti [14] | 0.16 |
| GBVS [15] | 0.20 |
| Tawari [2] | 0.51 |
| HWS [5] | 0.55 |
| Multi-Branch [4] | 0.56 |
| Motion HR Attention Net [11] | 0.58 |
| HR Attention Net [10] | 0.60 |
| Proposed MCMNet | 0.61 |
##### Table 2. Ablation Study of the Components of the MCMNet.
| Method | Correlation Coefficient |
|---|---|
| Single CAP stream | 0.59 |
| Single MAP stream | 0.57 |
| Proposed MCMNet | 0.61 |

## 5. Conclusion

We presented a novel method that utilizes both color and motion information to achieve robust and accurate attention prediction. The proposed MCMNet consists of three components: the CAP stream, the MAP stream, and the FAP part. The first two streams extract attention-related features from multiscale color images and multiscale motion information, respectively, and the FAP part merges them and predicts the final attention map. To demonstrate the effectiveness of the proposed method, we experimented with an actual driving dataset. Experimental results showed that the proposed framework achieves state-of-the-art performance.

### ACKNOWLEDGMENTS

This study was financially supported by the Seoul National University of Science and Technology.

### REFERENCES

1
Alletto S., Palazzi A., Solera F., Calderara S., Cucchiara R., 2016, DR(eye)VE: A Dataset for Attention-Based Tasks with Applications to Autonomous and Assisted Driving, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 54-60
2
Tawari A., Kang B., 2017, A Computational Framework for Driver's Visual Attention Using a Fully Convolutional Architecture, 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 887-894
3
Palazzi A., Solera F., Calderara S., Alletto S., Cucchiara R., 2017, Learning Where to Attend Like a Human Driver, 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 920-925
4
Palazzi A., Abati D., Calderara S., Solera F., Cucchiara R., Jul. 2019, Predicting the Driver's Focus of Attention: The DR(eye)VE Project, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, No. 7, pp. 1720-1733
5
Xia Y., Zhang D., Kim J., Nakayama K., Zipser K., Whitney D., 2019, Predicting Driver Attention in Critical Situations, Computer Vision - ACCV 2018, Lecture Notes in Computer Science, Vol. 11365
6
Fang J., Yan D., Qiao J., Xue J., Wang H., Li S., 2019, DADA-2000: Can Driving Accident Be Predicted by Driver Attention? Analyzed by a Benchmark, 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 4303-4309
7
Fang J., Yan D., Qiao J., Xue J., Yu H., 2021, DADA: Driver Attention Prediction in Driving Accident Scenarios, IEEE Transactions on Intelligent Transportation Systems
8
Li A., 2017, Learning Driver Gaze, M.Eng. thesis
9
Ning M., Lu C., Gong J., 2019, An Efficient Model for Driving Focus of Attention Prediction Using Deep Learning, 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 1192-1197
10
Kang B., Lee Y., Apr. 2020, High-Resolution Neural Network for Driver Visual Attention Prediction, Sensors, Vol. 20, No. 7, p. 2030
11
Kang B., Lee Y., May 2021, A Driver's Visual Attention Prediction Using Optical Flow, Sensors, Vol. 21, No. 11, p. 3722
12
Nakazawa S., Nakada Y., 2020, Improvement of Mixture-of-Experts-Type Model to Construct Dynamic Saliency Maps for Predicting Drivers' Attention, 2020 IEEE Symposium Series on Computational Intelligence (SSCI)
13
Lv K., Sheng H., Xiong Z., Li W., Zheng L., 2021, Improving Driver Gaze Prediction with Reinforced Attention, IEEE Transactions on Multimedia
14
Itti L., Koch C., Niebur E., Nov. 1998, A Model of Saliency-Based Visual Attention for Rapid Scene Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, pp. 1254-1259
15
Harel J., Koch C., Perona P., 2007, Graph-Based Visual Saliency, Advances in Neural Information Processing Systems 19, MIT Press, pp. 545-552
16
Wang J., Sun K., Cheng T., Jiang B., Deng C., Zhao Y., Liu D., Mu Y., Tan M., Wang X., Liu W., Xiao B., Oct. 2021, Deep High-Resolution Representation Learning for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 10, pp. 3349-3364
17
Farnebäck G., 2003, Two-Frame Motion Estimation Based on Polynomial Expansion, Scandinavian Conference on Image Analysis, Lecture Notes in Computer Science, Springer

## Author

##### Yeejin Lee

Yeejin Lee received a Ph.D. degree in electrical and computer engineering from the University of California at San Diego, La Jolla, CA, USA, in 2017. She was a Postdoctoral Fellow in radiology with the University of California at Los Angeles, Los Angeles, CA, USA, from 2017 to 2018. She is currently an assistant professor with Seoul National University of Science and Technology, Seoul, Republic of Korea. Her current research interests include computer vision, color image processing, and machine learning.

##### Byeongkeun Kang

Byeongkeun Kang is currently an assistant professor at Seoul National University of Science and Technology, Seoul. He was a Postdoctoral Fellow at the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, from 2018 to 2019. He received a B.S. degree in electrical and electronic engineering from Yonsei University, Seoul, Republic of Korea, in 2013, and M.S. and Ph.D. degrees in electrical and computer engineering from the University of California at San Diego, La Jolla, CA, USA, in 2015 and 2018, respectively. His current research interests include semantic segmentation, object detection, and human–machine interaction.