Bilal Ahmed Lodhi 1
Rehmat Ullah 2
Sajida Imran 3
Muhammad Imran 4
Byung-Seo Kim 5

1 School of Computing, Ulster University, Belfast Campus, BT15 1AP, UK (b.lodhi@ulster.ac.uk)
2 Cardiff School of Technologies, Cardiff Metropolitan University, Cardiff, CF5 2YB, UK (rullah@cardiffmet.ac.uk)
3 Department of Computer Engineering, King Faisal University, Al-Ahsa, 31982, Saudi Arabia (skamran@kfu.edu.sa)
4,5 Department of Software and Communications Engineering, Hongik University, Sejong City 30016, Korea (royimranpk@gmail.com, jsnbs@hongik.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Image segmentation, Semantic segmentation, Deep neural network, Convolutional networks, Artificial intelligence (AI), 6G, Virtual reality, Surveillance cameras, Autonomous cars
1. Introduction
Deep learning has achieved compelling results in image segmentation [1,2], and a Convolutional Neural Network (CNN) exhibits higher accuracy and efficiency
than other approaches [3]. A CNN uses feedforward layers with small receptive fields at each layer to learn
complex features. CNNs have been used in image recognition [4], recommender systems [5], object detection [6], and semantic segmentation [7], owing to their high performance. Recently, CNNs with deeper architectures have been
applied to various use cases, including intelligent security, virtual reality, surveillance,
and autonomous or self-driving vehicles [8]. Autonomous vehicles are equipped with a variety of sensor systems to detect obstacles,
lanes, and parking places [1,2,9]. This field is quite promising, with numerous potential benefits such as increased
safety, lower costs, more comfortable travel, increased mobility, and a smaller environmental
footprint [1]. In autonomous vehicles, vehicle movements are controlled without human
intervention. Studies in this area have focused on making navigation
and driving systems fully capable in an unpredictable environment. The
navigation system will have to rely on sensor data to make its own decisions under
various conditions, similar to a real driver. Semantic segmentation takes an important
step towards traffic scene understanding by parsing an image into different regions
with specific semantic categories, such as roads, vehicles, and pedestrians. The predicted
images can be used to plan the vehicle behavior and avoid collisions [10,11].
Region-based models, weakly supervised models, and fully convolutional networks have
been used for semantic image segmentation. This paper focuses on fully convolutional
networks. A fully convolutional network (FCN) [7] learns discriminative features in successive convolutional layers (lower
resolution) and maps the features back to pixel space (high resolution). An FCN is considered
one of the pioneering algorithms in semantic segmentation. In the original FCN model,
the encoder (contracting network) uses convolutional layers and pooling layers. These
pooling layers are replaced by up-sampling layers in the decoder path. Unlike other segmentation approaches, an FCN can process inputs of any size. However, the feature resolution of the output is reduced by the successive convolutional and pooling layers, so the FCN model generates a coarse label output. Most FCN models attempt
to recover information loss in the decoder path by adding skip connections between
the encoder and decoder layers.
U-Net [12] used learnable weights and convolutional layers instead of a simple interpolation
in the decoder path. In U-Net, multiple up-sampling layers are added, followed by
convolutional layers after each up-sampling layer. U-Net achieved benchmark accuracy
on medical image datasets. U-Net added a concatenating skip connection to propagate
context information to high-resolution layers to deal with the information loss in
the up-sampling path. SegNet [13] further improved the decoder path by transferring the max-pooling indices to the
corresponding decoder. Transferring max-pooling indices helps SegNet reduce the number
of parameters in its network and achieve good accuracy. The U-Net-based algorithms
still suffer from information loss in the decoder path because of the limited information
available in the decoder.
Recent studies have shown that convolutional networks can be substantially deeper,
more accurate, and more efficient to train if they contain shorter connections between
layers closer to the input and those closer to the output [4]. A fully convolutional densely connected network (FC-DenseNet) [14], which is an extension of DenseNet, was proposed to address the semantic segmentation
problem. Unlike other FCN variants, FC-DenseNet does not require pre-training on a
large-scale image dataset. FC-DenseNet achieved state-of-the-art results on the urban
scene benchmark dataset Camvid [15]. Despite the high accuracy of FC-DenseNet, the skip connection increases the number
of parameters [16] in the decoder path exponentially, which increases the training time and computational
complexity. Furthermore, the decoder path of FC-DenseNet shows smaller weight updates
than its skip connections, suggesting that FC-DenseNet relies on the high-resolution layers
of its encoder.
U-Net usually performs multiple down-sampling operations before concatenation, causing a resolution loss that the U-Net architecture cannot easily recover from, because this information is unavailable in the decoder path. The DenseNet architecture keeps the input close to the output at any given layer and can therefore mitigate the information/resolution loss problem of U-Net. On the other hand, DenseNet's memory growth is enormous. Therefore, to address the aforementioned issues, this study developed
an architecture combining dense block and skip connections, which exploits the fact
that there should be shorter connections between layers close to the input and those
close to the output. The decoder path of the proposed architecture, SenseNet, also
contains connections closer to the input. SenseNet improves training and decreases
inference time by reducing the skip connection overhead and avoiding exponential parameter
growth. SenseNet has no parameter explosion problem, resulting in low memory requirements.
Moreover, SenseNet increases the dependence on deeper layer features instead of earlier
layer features in the decoder path.
The main contributions of this paper are as follows.
1. A model that uses a dense block with a DenseNet-BC layer structure and bottleneck
skip connection is proposed to avoid the feature map explosion and information loss
in the decoder path.
2. SenseNet outperforms current state-of-the-art models on the standard CamVid dataset
without pre-trained parameters or any further post-processing.
The remainder of the paper is structured as follows: Section 2 reports a literature
review. Section 3 outlines the methodology with all the building blocks of the proposed
architecture. Section 4 provides the experimental setup details, including the dataset
and training. Section 5 analyzes the results obtained, and Section 6 concludes the
paper.
2. Related Work
This section discusses the related work in the context of CNNs for semantic segmentation.
Before CNNs, the models relied on efficient hand-crafted features for pixel-wise classification
of images. The authors in FCN [7] combined an up-sampling network with the contracting network to predict each pixel
in the image. A considerable amount of contextual information is lost in the deeper
layer of contracting networks because of pooling operations. FCN uses a simple bi-linear
interpolation approach to up-sample images and recover lost contextual information.
Multiple FCN variants, such as deconvolution networks, were proposed to improve the up-sampling
path of the original FCN, where the resolution was degraded during the pooling operations.
The decoder path helps increase the resolution of feature maps. The output of the
original FCN model [7] was much coarser because the FCN model used only one up-sampling layer for increasing
the resolution of a 16 ${\times}$ 16 image.
U-Net is an extended version of the FCN that adds learnable weights during up-sampling.
U-Net adds skip connections to propagate contextual information to the high-resolution
layers. SegNet [13] proposed increasing the number of skip connections to address their insufficient number
in FCN. Moreover, instead of passing the feature maps through skip connections,
the max-pooling indices were transferred from the encoder to the decoder, which made
SegNet much more memory efficient than FCN.
In the U-Net architecture, an encoder is a convolution operation with spatial reductions
to extract features. The decoder attempts to recover the resolution by some operation
(interpolation or transposed convolution). The encoder can be pretrained on ImageNet
for better performance, but it can also be trained from scratch. In the family of
encoder-decoder architecture, FANet [17] achieves good accuracy with fast attention modules and extra downsampling throughout
the network. Furthermore, SFNet [18] introduces a Flow Alignment Module (FAM) to align the feature maps of adjacent levels
for better fusion.
A pyramid scene parsing network [19] aggregates context information using a pyramid pooling module. Pyramid pooling is
based on dilated convolutions [20] and ResNet [21]. In the pyramid scene parsing network, the pyramid pooling module pools the
input using multiple kernels covering the entire image, half of it, or smaller parts, up-samples the results, and concatenates
this information with the ResNet feature maps. FCN models are trained in an end-to-end
manner, which is the major advantage over region-based network models. FCN-based models
rely on complex methods to recover information loss in the pooling operation. Furthermore,
this loss of information also affects the end-to-end training of the encoder network
as the structure information of objects is lost until the classification layer. Therefore,
many FCN models must be pre-trained on large-scale datasets for higher performance.
3. Methodology
This section explains all the building blocks of SenseNet architecture as follows.
3.1 DenseNet-BC
DenseNet connects a layer directly to all subsequent layers, as shown in Fig. 1. The input of a layer in DenseNet is the concatenated output of all preceding layers
in a dense block. For example, the input of the n$^{\mathrm{th}}$ layer in an m-layer dense
block can be defined as follows (the Haskell operator symbol ``++'' is used for concatenation
to simplify the notation):

$x_{n}=F(x_{0}++x_{1}++x_{2}++\ldots ++x_{n-1})$

where $(x_{0}++x_{1}++x_{2}++\ldots ++x_{n-1})$ is the concatenated output of the
preceding layers, and F is a composite function [4]. Further information on the dense block is reported elsewhere [4].
Fig. 1. Dense block as described in [4]. A layer in DenseNet receives input from all preceding layers in the network and concatenates its output with the input.
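For concreteness, the following is a minimal Keras sketch of a dense block with the layer structure described above. It is illustrative only: the use of tf.keras, the layer names, and the 3x3 composite function (cf. Table 2) are assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def composite_function(x, growth_rate, name):
    # F: BN -> ReLU -> 3x3 convolution -> dropout (cf. Table 2)
    x = layers.BatchNormalization(name=name + "_bn")(x)
    x = layers.ReLU(name=name + "_relu")(x)
    x = layers.Conv2D(growth_rate, 3, padding="same", name=name + "_conv")(x)
    return layers.Dropout(0.2, name=name + "_drop")(x)

def dense_block(x, num_layers, growth_rate, name="db"):
    # Layer n receives x_0 ++ x_1 ++ ... ++ x_{n-1} as input; the block
    # output concatenates the block input with all layer outputs.
    features = [x]
    for n in range(num_layers):
        inp = features[0] if n == 0 else layers.Concatenate(name=f"{name}_cat{n}")(features)
        features.append(composite_function(inp, growth_rate, f"{name}_l{n}"))
    return layers.Concatenate(name=name + "_out")(features)
```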
3.2 Dense Skip Connection
This paper proposes the use of a Dense skip connection to bring the information in
the decoder path close to the input of the encoder path. A straightforward way is
to connect a concatenated output of the encoder layer to the decoder layer, i.e.,
a skip connection between high-resolution dense blocks and low-resolution dense blocks.
On the other hand, these skip connections require considerable computation and memory.
The dense block output is the concatenated output of all the layers in the dense block,
including the input of the dense block. Suppose the dense block output is used as
an input to another dense block that converts the two dense blocks into a single dense
block with an increased number of layers, resulting in higher memory and computation
requirements. The output of a dense block that has a skip connection with another
dense block can be obtained as follows:

$B_{i}=B_{i-1}++x_{0}++x_{1}++\ldots ++x_{n}$

where $B_{i}$ is the output of block i; $x_{0}++x_{1}++\ldots ++x_{n}$ is the concatenated
output of the layers of block i, and $B_{i-1}$ is the concatenated output of
the previous dense block.
Fig. 2 shows how a skip connection between two dense blocks allows them to act as a single
dense block, which requires more computation and memory resources. Fig. 2 (left) shows two dense blocks with two layers each. The output of the first dense block is the
concatenated output of all its layers (L1, L2) and is denoted as
y. This output is fed to the second dense block, and every layer (S1, S2) in that
block receives y as an input; Y in Fig. 2 (left) denotes the output of this second block. Intuitively, the graph in Fig. 2 (left) can be reconstructed as Fig. 2 (right), where Xo = x and the output Yo of the network is the same as Y. Moreover,
skip connections between the dense blocks are more complex than the skip connections
between layers. The L1 norm of the trained weights (normalized by the input connections)
of FC-DenseNet-103 was plotted to study the effect of skip connections between dense
blocks (Fig. 3). The effects of the transition-up layer on every layer of the subsequent dense block
were also plotted (Fig. 3, first row of the plot).
Fig. 2. Illustration showing how a concatenated input converts two disjointed dense blocks into one big dense block. x and y are the input and output of dense block L1, respectively, which has two convolutional layers. The output is again inputted to dense block L2, which has two convolutional layers. Xo is the input of a single dense block with four convolutional layers, and Yo is the output of this dense block. The output of two disjointed dense blocks with skip connections will be identical to that of a single dense block.
Fig. 3 shows the average heatmaps of the first (left) and last (right) dense block weights
in the decoder path. The first dense block has 12 layers, while the skip connection
carries the output of 12 layers and the block input. Each plot in Fig. 3 contains the average weights of the transition-up layer (first row only), skip connection
and current dense block. A red pixel in Fig. 3 indicates that the target layer uses (on average) the source layer output. Fig. 3 (left) shows that the middle dense block is not used by the first dense block in
the decoder path and does not show any change. Therefore, the FC-DenseNet architecture
depends mostly on the feature maps of the corresponding encoder (the skip connection
rows in Fig. 3 (left) show the weight changes). The first row in Fig. 3 (right) shows the dependence of the dense block on the transition-up layer as the
weight change can be observed in the first row. On the other hand, the overall (average)
weight of the layers of skip connection is much higher than that of the transition-up
layer. Furthermore, the first column of the plots in Fig. 3 shows that the first layer of the skip connection's dense block carries most of
the weight for this dense block. The other layers of this dense block do not receive
a large share of the overall weight, which indicates the correlation of low-level features
in the decoder path. The number of weights for the skip connection is much higher
than for the transition-up layer in the decoder path.
Fig. 3. L1 norm of the weights of a dense block with its skip connections. The rows of the plot contain a deconvolutional layer (first row), a skip connection (the encoder’s dense block), and the current dense block (the decoder’s dense block).
A bottleneck-dense skip connection was proposed to address the aforementioned issue.
The last layer of the dense block of the encoder processes the feature maps of all
preceding layers, and the output of the last layer of the dense block of the encoder
can be used as a bottleneck skip connection. The deconvolutional feature outputs
are concatenated with the bottleneck skip connection features. The bottleneck skip connection
carries k feature maps, where k is the growth rate of the dense block.
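The saving can be made concrete with simple channel bookkeeping. The sketch below is illustrative; c_in, m, and k are example values, not figures from the paper.

```python
def skip_channels(c_in: int, m: int, k: int, bottleneck: bool) -> int:
    # A full dense skip from an m-layer encoder block with growth rate k and
    # c_in input channels carries c_in + m*k feature maps; the bottleneck
    # skip carries only the k maps produced by the block's last layer.
    return k if bottleneck else c_in + m * k

print(skip_channels(48, 12, 12, bottleneck=False))  # 192 channels per skip
print(skip_channels(48, 12, 12, bottleneck=True))   # 12 channels per skip
```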
3.3 DenseNet to SenseNet
This paper proposes a new semantic image segmentation architecture that combines all
the building blocks explained in the previous subsections (Fig. 4). The SenseNet architecture consists of the following: 1) the encoder path that learns
the latent features using successive convolution dense blocks and transitions down
layers, 2) the decoder path that has learnable dense blocks and transition up layers,
and 3) bottleneck dense skip connections that carry context information from the encoder
(high-resolution dense blocks) to the decoder (low-resolution dense blocks) and prevent
the network from expanding its parameters. The transition-down layer has a 1 ${\times}$
1 convolutional layer followed by a max-pooling layer so that the dimension of the
input feature maps is reduced (Table 1). A 1 ${\times}$ 1 convolution can be viewed as a single fully connected
neuron across channels: it connects to all values in the input.
Table 1. Layer structure of the Transition down and Transition up layers.
| Transition Down | Transition Up |
|---|---|
| Batch Normalization | Batch Normalization |
| ReLU | ReLU |
| 1x1 Convolution | 3x3 Transposed Convolution (stride 2) |
| Dropout (p = 0.2) | |
| 2x2 Max Pooling | |
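A hedged Keras sketch of Table 1 follows; tf.keras layer choices and names are assumptions, and the filter counts are left as parameters.

```python
from tensorflow.keras import layers

def transition_down(x, filters, name="td"):
    # Table 1: BN -> ReLU -> 1x1 convolution -> dropout(0.2) -> 2x2 max pooling
    x = layers.BatchNormalization(name=name + "_bn")(x)
    x = layers.ReLU(name=name + "_relu")(x)
    x = layers.Conv2D(filters, 1, name=name + "_conv1x1")(x)
    x = layers.Dropout(0.2, name=name + "_drop")(x)
    return layers.MaxPooling2D(2, name=name + "_pool")(x)

def transition_up(x, filters, name="tu"):
    # Table 1: BN -> ReLU -> 3x3 transposed convolution with stride 2
    x = layers.BatchNormalization(name=name + "_bn")(x)
    x = layers.ReLU(name=name + "_relu")(x)
    return layers.Conv2DTranspose(filters, 3, strides=2, padding="same", name=name + "_deconv")(x)
```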
Fig. 4. Proposed SenseNet model with two dense blocks that takes an RGB image (three channels) as input and performs a convolutional operation using successive dense blocks. The encoder path of the proposed model has two blocks (both labeled as DB). There is a dense block between the encoder and the decoder. The last two blocks (both labeled as DB) constitute the decoder path. The decoder path first up-samples the feature maps using deconvolution operation and then concatenates the feature maps with the up-sampled output through skip connections. The bottleneck skip connections are shown by a dotted line.
After weighting, summation, and activation, it yields a single value per position. In the SenseNet
case, it pools the feature maps of all dense skip connections and
generates a new output that combines the inputs of all layers. The transition
up layer uses the deconvolutional layer that gradually reverses the effects of convolution.
The bottleneck skip connection output is concatenated with the transition-up layer
output to obtain various input features for the subsequent dense block and to pass
the context information from the encoder (high-resolution layers) to the decoder (low-resolution
layers). The output of a deconvolutional dense block is obtained as follows:
$B_{i}=x_{0}++x_{1}++\ldots ++x_{n-1}$

where $B_{i}$ is the output of the i$^{\mathrm{th}}$ dense block, and $x_{0}$, $x_{1}$,
..., $x_{n-1}$ are the outputs of the n layers in the i$^{\mathrm{th}}$ dense block.
Note that $B_{i-1}$ is not concatenated with the output of the dense block, although
every layer in the i$^{\mathrm{th}}$ block receives the output of the (i-1)$^{\mathrm{th}}$ block
($B_{i-1}$) as an input. Table 1 presents the architecture of the transition-down and transition-up layers. Table 2 lists the building blocks of SenseNet and shows the architecture of the bottleneck
layer and composite function.
Table 2. Building blocks of the SenseNet: Layer architecture of Bottleneck layer and composite function.
| Bottleneck Layer | Composite Function (3x3 Convolution) |
|---|---|
| Batch Normalization | Batch Normalization |
| ReLU | ReLU |
| 1x1 Convolution | Convolution with a given size |
| Dropout (p = 0.2) | Dropout (p = 0.2) |
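To illustrate the decoder behavior described above, here is a sketch of one decoder stage in the same assumed Keras style as the earlier snippets (composite_function and transition_up refer to the sketches in Sections 3.1 and 3.3): the transition-up output is concatenated with the bottleneck skip, and the decoder dense block output omits its input $B_{i-1}$.

```python
from tensorflow.keras import layers

def decoder_dense_block(x, num_layers, growth_rate, name="ddb"):
    # Every layer still sees the block input, but the block output
    # concatenates only x_0 ++ ... ++ x_{n-1}, not B_{i-1} itself.
    features, outputs = [x], []
    for n in range(num_layers):
        inp = features[0] if n == 0 else layers.Concatenate(name=f"{name}_cat{n}")(features)
        out = composite_function(inp, growth_rate, f"{name}_l{n}")  # Sec. 3.1 sketch
        features.append(out)
        outputs.append(out)
    return outputs[0] if num_layers == 1 else layers.Concatenate(name=name + "_out")(outputs)

def decoder_stage(decoder_in, bottleneck_skip, num_layers, growth_rate, filters, name="dec"):
    up = transition_up(decoder_in, filters, name=name + "_tu")  # Sec. 3.3 sketch
    merged = layers.Concatenate(name=name + "_skip_cat")([up, bottleneck_skip])
    return decoder_dense_block(merged, num_layers, growth_rate, name=name + "_db")
```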
4. Experiment
This study evaluated the performance of SenseNet on the CamVid dataset [15]. Various block sizes and growth rates were used in these experiments. The results
in terms of IoU (Intersection over Union) and global accuracy (pixel-wise accuracy)
are provided. The IoU is used widely as an evaluation metric for object detection.
For any class c, IoU can be calculated as follows:

$\mathrm{IoU}_{c}=\dfrac{\mathrm{overlap}_{c}}{\mathrm{union}_{c}}$

All the input pixels were looped over to calculate the overlap and union areas. In set
theory, with A denoting the set of pixels predicted as class c and B the set of ground-truth pixels of class c, IoU can be defined as follows:

$\mathrm{IoU}=\dfrac{|A\cap B|}{|A\cup B|}$
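As a concrete rendering of this metric (an assumed NumPy sketch, not the paper's evaluation code), the following loops over classes and counts overlap and union pixels; the nanmean of the per-class values gives the mean IoU.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    # pred, target: integer label maps of identical shape
    ious = []
    for c in range(num_classes):
        overlap = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(overlap / union if union > 0 else np.nan)
    return ious

# Mean IoU over the 11 CamVid classes, ignoring classes absent from both maps:
# miou = np.nanmean(per_class_iou(pred, target, 11))
```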
4.1 Experimental Setting
Dataset: CamVid [15] is a segmented video dataset for understanding urban scenes. The extracted frames
from the CamVid dataset were used [13]. The frames were divided as follows: 367 frames for training, 101 frames for validation,
and 233 frames for testing. Each frame was 360 ${\times}$ 480 in size, and its pixels
were categorized into 11 semantic classes. SenseNet was trained on frames from CamVid,
which were cropped randomly to 224 ${\times}$ 224. The images were normalized to the
mean and standard deviation of the data. Data augmentation was used to generate a
variety of inputs on which SenseNet was trained; it also reduced the need for regularization
in the models. The images were flipped horizontally with a probability of 0.5.
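A minimal tf.data-style sketch of this preprocessing is shown below; MEAN and STD are hypothetical placeholders for the precomputed dataset statistics, and the label map is cropped and flipped jointly with the image so the two stay aligned.

```python
import tensorflow as tf

MEAN, STD = 0.0, 1.0  # assumed per-dataset statistics (placeholders)

def preprocess(image, label):
    # Joint random 224x224 crop of the RGB image and its integer label map
    stacked = tf.concat([image, tf.cast(label[..., None], tf.float32)], axis=-1)
    stacked = tf.image.random_crop(stacked, size=[224, 224, 4])
    image = (stacked[..., :3] - MEAN) / STD   # normalize to dataset mean/std
    label = tf.cast(stacked[..., 3:], tf.int32)
    # Horizontal flip with probability 0.5, applied to image and label together
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_left_right(image)
        label = tf.image.flip_left_right(label)
    return image, label[..., 0]
```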
4.2 Training and Inference Details
As previously explained, SenseNet architecture was proposed for image segmentation.
The architecture was implemented in Tensorflow with one GPU. The hyperparameters of
the architecture are given as follows. SenseNet was trained using the RMSProp optimizer.
The learning rate of RMSProp was initialized to 0.001 (initial_lr$_{\mathrm{i}}$) and reduced stepwise
during training: the rate is lowered every z epochs, where x denotes the current epoch.
i and z were set to 1${\times}$10$^{-3}$ and 3, respectively.
A dropout rate of 0.2 was used. The l2 norm and a weight decay of 1${\times}$10$^{-4}$
were used to regularize the model. A batch size of 2 was used. Batch normalization
with a moving mean and variance was applied. Standard geometric transformations such
as image flipping, cropping, and scaling were used as image augmentation for training
the model.
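The following is a hedged sketch of this training setup. Only i = 1e-3, z = 3, the weight decay of 1e-4, and the batch size of 2 come from the text; the step-decay factor is not given, so DROP = 0.5 is a placeholder assumption.

```python
import tensorflow as tf

INITIAL_LR, Z, DROP = 1e-3, 3, 0.5  # i and z from the paper; DROP is assumed

def step_decay(epoch, lr):
    # Reduce the learning rate every Z epochs from the initial value.
    return INITIAL_LR * DROP ** (epoch // Z)

lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=INITIAL_LR)
# L2 regularization (weight decay 1e-4) can be attached per layer, e.g.
# kernel_regularizer=tf.keras.regularizers.l2(1e-4); the batch size is 2.
```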
4.3 Results and Analysis
Table 3 lists the main results of the SenseNet. Different configurations of layers can be
applied for other tasks.
The evaluation results show that SenseNet outperformed the conventional models in
terms of IoU. Table 3 lists the IoU and global accuracy scores of SenseNet. The performance of the SenseNet
increased as the number of parameters increased. For the evaluation, SenseNet was
trained and tested on images cropped randomly to 224 ${\times}$ 224 and on overlapping
images cropped to 224 ${\times}$ 224. This section reports the results of three configurations
of the SenseNet model: 1) SenseNet-78 (a growth rate of 12), 2) SenseNet-108 (a growth
rate of 12), and 3) SenseNet-abc (a growth rate of 16). SenseNet-78 and SenseNet-108
have four dense blocks, while SenseNet-abc has five dense blocks. A maximum of four
images of 224 ${\times}$ 224 were cropped from one image from the dataset, and the
SenseNet model was fine-tuned on these cropped images. The accuracy of SenseNet increases
with the size of the dataset (Table 3).
Table 3. Performance comparison of the SenseNet model with other models. The symbol * indicates that this study implemented FC-DenseNet 56; FC-DenseNet 67 and 103 could not be trained due to resource constraints. The results of FC-DenseNet and SenseNet, which were trained on the same class weights for a fair comparison, are also listed. ** indicates the results obtained after training on overlapping images of 224 × 224 cropped from the original image of 360×480.
| Model | Pretrained | Parameters (M) | Building | Tree | Sky | Car | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Cyclist | Mean IoU | Global accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SegNet | ✓ | 29.5 | 68.7 | 52.0 | 87.0 | 58.5 | 13.4 | 86.2 | 25.3 | 17.9 | 16.0 | 60.5 | 24.8 | 46.4 | 62.5 |
| Bayesian SegNet | ✓ | 29.5 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 63.1 | 86.9 |
| DeconvNet | ✓ | 252.0 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 48.9 | 85.9 |
| FCN8 | ✓ | 134.5 | 77.8 | 71.0 | 88.7 | 7.1 | 32.7 | 91.2 | 41.7 | 24.4 | 19.9 | 72.7 | 31.0 | 57.0 | 88.0 |
| DeepLab-LFOV | ✓ | 37.3 | 81.5 | 74.6 | 89.0 | 82.2 | 42.3 | 92.2 | 48.4 | 27.2 | 14.3 | 75.4 | 50.1 | 61.6 | - |
| FC-DenseNet 56 (k=12)* | | 1.5 | 90.05 | 65.87 | 7.14 | 83.9 | 64.18 | 58.04 | 4.50 | 1.59 | 50.55 | 3.19 | 0 | 39.0 | 80.76 |
| FC-DenseNet 56 (k=12)** | | 1.5 | 90.67 | 64.42 | 14.90 | 87.15 | 69.84 | 58.95 | 14.16 | 11.89 | 64.90 | 18.22 | 4.65 | 45.43 | 81.18 |
| SenseNet 78 (k=12) | | 2.3 | 87.44 | 64.68 | 21.37 | 84.86 | 60.14 | 51.54 | 21.26 | 4.74 | 58.03 | 25.62 | 14.02 | 44.89 | 86.42 |
| SenseNet 108 (k=12) | | 3.3 | 89.16 | 77.55 | 31.57 | 93.88 | 60.43 | 67.33 | 33.22 | 43.71 | 81.90 | 27.12 | 22.84 | 57.16 | 86.42 |
| SenseNet 108 (k=12)** | | 3.3 | 91.61 | 85.04 | 35.30 | 96.92 | 77.78 | 74.89 | 44.87 | 59.94 | 87.65 | 32.57 | 37.87 | 65.86 | 90.83 |
Fig. 5 shows the error rate reduction and increase in accuracy of the multiple configurations
of SenseNet. The decrease in the error rate demonstrates the high performance of SenseNet.
The results on the CamVid dataset clearly show that SenseNet achieves state-of-the-art
performance in terms of IoU. SenseNet does not require pre-training. Many of the semantic
segmentation models listed in Table 3 were pretrained on larger datasets, such as ImageNet [22], to perform segmentation. The segmentation results on unrepresented classes in the
dataset can be improved by addressing the class imbalance issue. SenseNet uses fewer
parameters (Table 3) and outperforms FC-DenseNet. Bottleneck skip connections reduce the memory requirements
and computational complexity. SenseNet outperforms FC-DenseNet in terms of IoU and
achieves a similar accuracy while using only 2.3M (million) parameters,
compared with the 9.4M (million) parameters of FC-DenseNet-103.
Fig. 5. Changes in the validation loss and validation accuracy of two SenseNet models (SenseNet-78 and SenseNet-108). The accuracy is indicated by a solid line, while the loss is indicated by a dotted line.
Multipath-DenseNet [16] showed that DenseNet makes several shorter paths with the dense block input in very
deep neural networks because of the higher number of low-level feature maps. Longer
skip connections make it more difficult for a network to learn in the decoder path.
Moreover, the classes in the dataset are very imbalanced. Addressing the class imbalance
of the data in CamVid would improve the performance of SenseNet. The results of this
study's implementation of FC-DenseNet 56 are presented; the larger configurations
of FC-DenseNet (67 and 103) were not implemented due to computational resource constraints.
The FC-DenseNet results in Table 3 show that classes with higher numbers of instances and pixels
have very high accuracy, whereas lower accuracy is observed in underrepresented
classes. This study did not find information about handling the class imbalance reported
elsewhere [14].
5. Conclusion
This paper proposed SenseNet, which is based on DenseNet for semantic segmentation.
The SenseNet uses a dense block to build the encoder and decoder paths. Unlike DenseNet,
SenseNet does not concatenate the input with the output of the dense block in the
decoder path. Feeding the output of a dense block to another dense block requires
more computation and memory. Therefore, this paper proposes bottleneck skip connections
whose features will be concatenated with the transition-up layer features in the decoder
path. The experimental results show that the SenseNet model outperforms the baseline
models regarding IoU.
ACKNOWLEDGMENTS
This research was supported by the Strategic Networking & Development Program funded by
the Ministry of Science and ICT through the National Research Foundation of Korea (RS-2023-00277267).
REFERENCES
T. Nguyen and M. Yoo, ``Fusing LIDAR sensor and RGB camera for object detection in
autonomous vehicle with fuzzy logic approach,'' in International Conference on Information
Networking, IEEE Computer Society, Jan. 2021, pp. 788-791.
ICTC 2019: The 10th International Conference on ICT Convergence: ``ICT Convergence
Leading the Autonomous Future,'' Jeju Island, Korea, Oct. 16-18, 2019.
Y. LeCun, ``LeNet-5, convolutional neural networks.''
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ``Densely Connected Convolutional
Networks.'' [Online]. Available: https://github.com/liuzhuang13/DenseNet.
A. Van Den Oord, S. Dieleman, and B. Schrauwen, ``Deep content-based music recommendation.''
R. Girshick, ``Fast R-CNN.'' [Online]. Available: https://github.com/rbgirshick
J. Long, E. Shelhamer, and T. Darrell, ``Fully Convolutional Networks for Semantic
Segmentation.''
W. Bouzidi, S. Bouaafia, M. A. Hajjaji, and L. M. Bergasa, ``Enhanced U-Net Approach:
Semantic Segmentation for Self-Driving Cars Applications.''
H. Pan, Y. Hong, W. Sun, and Y. Jia, ``Deep Dual-Resolution Networks for Real-Time
and Accurate Semantic Segmentation of Traffic Scenes,'' IEEE Transactions on Intelligent
Transportation Systems, vol. 24, no. 3, pp. 3448-3460, Mar. 2023, doi: 10.1109/TITS.2022.3228042.
L. Bartolomei, L. Teixeira, and M. Chli, ``Perception-aware path planning for UAVs
using semantic segmentation,'' in IEEE International Conference on Intelligent Robots
and Systems, Institute of Electrical and Electronics Engineers Inc., Oct. 2020, pp.
5808-5815. doi: 10.1109/IROS45743.2020.9341347.
M. Hua, Y. Nan, and S. Lian, ``Small Obstacle Avoidance Based on RGB-D Semantic Segmentation.''
O. Ronneberger, P. Fischer, and T. Brox, ``U-Net: Convolutional Networks for Biomedical
Image Segmentation,'' May 2015, [Online]. Available: http://arxiv.org/abs/1505.04597
V. Badrinarayanan, A. Kendall, and R. Cipolla, ``SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation,'' IEEE Trans Pattern Anal Mach Intell, vol. 39,
no. 12, pp. 2481-2495, Dec. 2017, doi: 10.1109/TPAMI.2016.2644615.
S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, ``The One Hundred
Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation.'' [Online].
Available: https://github.com/SimJeg/FC-DenseNet
G. J. Brostow, J. Fauqueur, and R. Cipolla, ``Semantic object classes in video: A
high-definition ground truth database,'' Pattern Recognit Lett, vol. 30, no. 2, pp.
88-97, Jan. 2009, doi: 10.1016/j.patrec.2008.04.005.
B. Lodhi and J. Kang, ``Multipath-DenseNet: A Supervised ensemble architecture of
densely connected convolutional networks,'' Inf Sci (N Y), vol. 482, pp. 63-72, May
2019, doi: 10.1016/j.ins.2019.01.012.
P. Hu et al., ``Real-Time Semantic Segmentation with Fast Attention,'' IEEE Robot
Autom Lett, vol. 6, no. 1, pp. 263-270, Jan. 2021, doi: 10.1109/LRA.2020.3039744.
X. Li et al., ``Semantic Flow for Fast and Accurate Scene Parsing.'' [Online]. Available:
https://github.com/lxtGH/SFSegNets.
H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ``Pyramid Scene Parsing Network.''
F. Yu and V. Koltun, ``Multi-Scale Context Aggregation by Dilated Convolutions,''
Nov. 2015,
K. He, X. Zhang, S. Ren, and J. Sun, ``Deep Residual Learning for Image Recognition.''
[Online]. Available: http://image-net.org/challenges/LSVRC/2015
L. Deng and X. Li, ``Machine learning paradigms for speech recognition: An overview,''
IEEE Trans Audio Speech Lang Process, vol. 21, no. 5, pp. 1060-1089, 2013, doi: 10.1109/TASL.2013.2244083.
Author
Bilal Ahmed Lodhi received his B.S. degree in computer science from Baqai Medical University,
Karachi, Pakistan, in 2004, the M.S. degree in computer science from the National
University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2009, and the
Ph.D. degree in computer science from Korea University, Seoul, South Korea, in 2019.
He worked as an Ultrasound Research Engineer with Alpinion Medical Systems, Seoul.
From 2020 to 2022, he was a Research Fellow with the School of Electronics, Electrical
Engineering and Computer Science, Queen’s University Belfast (QUB), Belfast, U.K.
Prior to joining QUB, he was a Research Fellow with the University of Seoul, Republic
of Korea. He is currently an Assistant Professor at School of Computing, Ulster University.
Rehmat Ullah is an Assistant Professor at the School of Technologies, Cardiff Metropolitan
University, UK. He received his Ph.D. in Electronics and Computer Engineering from
Hongik University, South Korea. Previously, he worked as an Assistant Professor at
Gachon University, South Korea, and as a Post-Doctorate Research Fellow at the University
of St Andrews, UK, and Queen’s University Belfast, UK. His research focuses on the
broader areas of network and distributed systems, particularly the development of
architectures, algorithms, and protocols for emerging paradigms such as edge computing,
IoT, ICN/NDN, and distributed machine learning for edge computing systems. This includes
the design, measurement studies, prototyping, testbed development, and performance
evaluations. He served as a general chair, TPC member, keynote speaker, and session
chair for several flagship conferences, such as ACM ICN 2022, ACM IMC 2018, and ICC
2023 and ICC2024. His research has been published in premier conferences, journals,
and patents including UCC, ACM ICN, HotMobile, IEEE Communications Magazine, IEEE
Transactions on Parallel and Distributed Systems, IEEE Transactions on Network Science
and Engineering, IEEE Internet of Things Journal, IEEE Wireless Communications Magazine,
IEEE Network Magazine, Journal of Network and Computer Applications, and Future Generation
Computer Systems. He currently holds six patents. In 2022, Dr. Rehmat was recognized
as a Global Talent by the Royal Academy of Engineering, UK. More information is available
from https://rehmatkhan.com/
Sajida Imran received her Ph.D. degree from Ajou University in 2018. She worked as
an Assistant Professor at the Department of Computer Engineering, University of Lahore,
Pakistan. She is currently working as an Assistant Professor with the Department of
Computer Engineering at King Faisal University, Saudi Arabia. She has authored several
international journal articles. Her research interests include wireless internet technologies
for the localization, detection, and tracking of objects and applications of the Internet
of Things using various machine learning techniques.
Muhammad Imran received his B.S. in computer science from COMSATS University Pakistan
in 2015 and his M.S. in computer science from the Virtual University of Pakistan in
2019. He is pursuing a Ph.D. in software and communication engineering with the Department
of Electronics and Computer Engineering at Hongik University, South Korea. His research
interests include Cloud/edge computing, the Internet of Things, information-centric
networking, and named data networking. He worked as an educator in the school education
department in Punjab, Pakistan, from 2016 to 2021.
Byung-Seo Kim received his B.S. degree in Electrical Engineering from In-Ha University,
In-Chon, Korea, in 1998 and his M.S. and Ph.D. in Electrical and Computer Engineering
from the University of Florida in 2001 and 2004, respectively. Dr. Yuguang Fang supervised
his Ph.D. study. Between 1997 and 1999, he worked for Motorola Korea Ltd., PaJu, Korea,
as a CIM Engineer in ATR&D. From January 2005 to August 2007, he worked for Motorola
Inc., Schaumburg, Illinois, as a Senior Software Engineer in Networks and Enterprises
for designing the protocol and network architecture of wireless broadband mission-critical
communications. He is a professor in the Department of Software and Communications
Engineering at Hongik University, Korea. He is an IEEE Senior Member and an Associate
Editor of IEEE Access, Telecommunication Systems, and the Journal of the Institute of
Electronics and Information Engineers. His studies have appeared in approximately 260
publications and 32 patents. His research interests include designing and developing
efficient wireless/wired networks, link-adaptable/cross-layer-based protocols, multi-protocol
structures, wireless CCNs/NDNs, Mobile Edge Computing, physical layer design for broadband
PLC, and resource allocation algorithms for wireless networks.