Bilal Ahmed Lodhi 1
Rehmat Ullah 2
Sajida Imran 3
Muhammad Imran 4
Byung-Seo Kim 5

1 School of Computing, Ulster University, Belfast Campus, BT15 1AP, UK (b.lodhi@ulster.ac.uk)
2 Cardiff School of Technologies, Cardiff Metropolitan University, Cardiff, CF5 2YB, UK (rullah@cardiffmet.ac.uk)
3 Department of Computer Engineering, King Faisal University, Al-Ahsa, 31982, Saudi Arabia (skamran@kfu.edu.sa)
4,5 Department of Software and Communications Engineering, Hongik University, Sejong City 30016, Korea (royimranpk@gmail.com, jsnbs@hongik.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Image segmentation, Semantic segmentation, Deep neural network, Convolutional networks, Artificial intelligence (AI), 6G, Virtual reality, Surveillance cameras, Autonomous cars
1. Introduction
Deep learning has achieved compelling results in image segmentation [1,2], and a Convolutional Neural Network (CNN) exhibits higher accuracy and efficiency
than other approaches [3]. A CNN uses feedforward layers with small receptive fields at each layer to learn
complex features. CNNs have been used in image recognition [4], recommender systems [5], object detection [6], and semantic segmentation [7], owing to their high performance. Recently, CNNs with deeper architectures have been
applied to various use cases, including intelligent security, virtual reality, surveillance,
and autonomous or self-driving vehicles [8]. Autonomous vehicles are equipped with a variety of sensor systems to detect obstacles,
lanes, and parking places [1,2,9]. This field is quite promising, with numerous potential benefits such as increased
safety, lower costs, more comfortable travel, increased mobility, and a smaller environmental
footprint [1]. In autonomous vehicles, vehicle movements are controlled without human
intervention. Studies in this area have focused on making navigation
and driving systems fully capable in an unpredictable environment. The
navigation system will have to rely on sensor data to make its own decisions under
various conditions, similar to a real driver. Semantic segmentation takes an important
step towards traffic scene understanding by parsing an image into different regions
with specific semantic categories, such as roads, vehicles, and pedestrians. The predicted
images can be used to plan the vehicle behavior and avoid collisions [10,11].
Region-based models, weakly supervised models, and fully convolutional networks have
been used for semantic image segmentation. This paper focuses on fully convolutional
networks. A fully convolutional network (FCN) [7] learns discriminative features in successive convolutional layers (lower
resolution) and maps the features back to pixel space (high resolution). An FCN is considered
one of the pioneering algorithms in semantic segmentation. In the original FCN model,
the encoder (contracting network) uses convolutional layers and pooling layers. These
pooling layers are replaced by up-sampling layers in the decoder path. Unlike other segmentation approaches, an FCN can process inputs of any size. However, the feature resolution of the output is reduced by the successive convolutional and pooling layers, so the FCN model generates a coarse label output. Most FCN models attempt
to recover information loss in the decoder path by adding skip connections between
the encoder and decoder layers.
U-Net [12] used learnable weights and convolutional layers instead of a simple interpolation
in the decoder path. In U-Net, multiple up-sampling layers are added, followed by
convolutional layers after each up-sampling layer. U-Net achieved benchmark accuracy
on medical image datasets. U-Net added a concatenating skip connection to propagate
context information to high-resolution layers to deal with the information loss in
the up-sampling path. SegNet [13] further improved the decoder path by transferring the max-pooling indices to the
corresponding decoder. Transferring max-pooling indices helps SegNet reduce the number
of parameters in its network and achieve good accuracy. The U-Net-based algorithms
still suffer from information loss in the decoder path because of the limited information
available in the decoder.
Recent studies have shown that convolutional networks can be substantially deeper,
more accurate, and more efficient to train if they contain shorter connections between
layers closer to the input and those closer to the output [4]. A fully convolutional densely connected network (FC-DenseNet) [14], which is an extension of DenseNet, was proposed to address the semantic segmentation
problem. Unlike other FCN variants, FC-DenseNet does not require pre-training on a
large-scale image dataset. FC-DenseNet achieved state-of-the-art results on the urban
scene benchmark dataset Camvid [15]. Despite the high accuracy of FC-DenseNet, the skip connection increases the number
of parameters [16] in the decoder path exponentially, which increases the training time and computational
complexity. Furthermore, the decoder path of FC-DenseNet shows smaller weight updates
than its skip connections, suggesting that FC-DenseNet relies on the high-resolution layers
of its encoder.
U-Net usually performs multiple down-sampling operations before concatenation, causing a resolution loss that the U-Net architecture cannot easily recover from, because this information is unavailable in the decoder path. The DenseNet architecture keeps the input close to the output at any given layer and can therefore mitigate the information/resolution loss problem of U-Net. On the other hand, DenseNet's memory growth is enormous. Therefore, to address the aforementioned issues, this study developed
an architecture combining dense block and skip connections, which exploits the fact
that there should be shorter connections between layers close to the input and those
close to the output. The decoder path of the proposed architecture, SenseNet, also
contains connections closer to the input. SenseNet improves training and decreases
inference time by reducing the skip connection overhead and avoiding exponential parameter
growth. SenseNet has no parameter explosion problem, resulting in low memory requirements.
Moreover, SenseNet increases the dependence on deeper layer features instead of earlier
layer features in the decoder path.
The main contributions of this paper are as follows.
1. A model that uses a dense block with a DenseNet-BC layer structure and bottleneck
skip connection is proposed to avoid the feature map explosion and information loss
in the decoder path.
2. SenseNet outperforms current state-of-the-art models on the standard CamVid dataset
without pre-trained parameters or any further post-processing.
The remainder of the paper is structured as follows: Section 2 reports a literature
review. Section 3 outlines the methodology with all the building blocks of the proposed
architecture. Section 4 provides the experimental setup details, including the dataset
and training. Section 5 analyzes the results obtained, and Section 6 concludes the
paper.
2. Related Work
This section discusses the related work in the context of CNNs for semantic segmentation.
Before CNNs, the models relied on efficient hand-crafted features for pixel-wise classification
of images. The authors in FCN [7] combined an up-sampling network with the contracting network to predict each pixel
in the image. A considerable amount of contextual information is lost in the deeper
layer of contracting networks because of pooling operations. FCN uses a simple bi-linear
interpolation approach to up-sample images and recover lost contextual information.
Multiple FCN variants, such as deconvolution networks, were proposed to improve the up-sampling
path of the original FCN, where the resolution was degraded during the pooling operations.
The decoder path helps increase the resolution of feature maps. The output of the
original FCN model [7] was much coarser because the FCN model used only one up-sampling layer for increasing
the resolution of a 16 ${\times}$ 16 image.
U-Net is an extended version of the FCN that adds learnable weights during up-sampling.
U-Net adds skip connections to propagate contextual information to the high-resolution
layers. SegNet [13] proposed increasing the number of skip connections to address their insufficient number
in FCN. Moreover, instead of passing the feature maps through skip connections,
the max-pooling indices were transferred from the encoder to the decoder, which made
SegNet much more memory efficient than FCN.
In the U-Net architecture, an encoder is a convolution operation with spatial reductions
to extract features. The decoder attempts to recover the resolution by some operation
(interpolation or transposed convolution). The encoder can be pretrained on ImageNet
for better performance, but it can also be trained from scratch. In the family of
encoder-decoder architecture, FANet [17] achieves good accuracy with fast attention modules and extra downsampling throughout
the network. Furthermore, SFNet [18] introduces a Flow Alignment Module (FAM) to align the feature maps of adjacent levels
for better fusion.
A pyramid scene parsing network [19] aggregates context information using a pyramid pooling module. Pyramid pooling is
based on dilated convolutions [20] and ResNet [21]. In the pyramid scene parsing network, the pyramid pooling module pools the
input using multiple kernels covering the entire image, half of it, or smaller parts, up-samples the results, and concatenates
this information with the ResNet feature maps. FCN models are trained in an end-to-end
manner, which is the major advantage over region-based network models. FCN-based models
rely on complex methods to recover information loss in the pooling operation. Furthermore,
this loss of information also affects the end-to-end training of the encoder network
as the structure information of objects is lost until the classification layer. Therefore,
many FCN models must be pre-trained on large-scale datasets for higher performance.
3. Methodology
This section explains all the building blocks of SenseNet architecture as follows.
3.1 DenseNet-BC
DenseNet connects a layer directly to all subsequent layers, as shown in Fig. 1. The input of a layer in DenseNet is the concatenated output of all preceding layers
in a dense block. For example, the input of the n$^{\mathrm{th}}$ layer in an m-layer dense
block can be defined as follows (the Haskell operator symbol ``++'' is used for concatenation
to simplify the notation):

$x_{n}=F(x_{0}++x_{1}++x_{2}++\ldots ++x_{n-1})$

where $(x_{0}++x_{1}++x_{2}++\ldots ++x_{n-1})$ is the concatenated output of the
preceding layers, and F is a composite function [4]. Further information on the dense block is reported elsewhere [4].
Fig. 1. Dense block as described in [4]. A layer in DenseNet receives input from all preceding layers in the network and concatenates its output with the input.
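For concreteness, the following is a minimal Keras sketch of a dense block with the layer structure described above. It is illustrative only: the use of tf.keras, the layer names, and the 3x3 composite function (cf. Table 2) are assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def composite_function(x, growth_rate, name):
    # F: BN -> ReLU -> 3x3 convolution -> dropout (cf. Table 2)
    x = layers.BatchNormalization(name=name + "_bn")(x)
    x = layers.ReLU(name=name + "_relu")(x)
    x = layers.Conv2D(growth_rate, 3, padding="same", name=name + "_conv")(x)
    return layers.Dropout(0.2, name=name + "_drop")(x)

def dense_block(x, num_layers, growth_rate, name="db"):
    # Layer n receives x_0 ++ x_1 ++ ... ++ x_{n-1} as input; the block
    # output concatenates the block input with all layer outputs.
    features = [x]
    for n in range(num_layers):
        inp = features[0] if n == 0 else layers.Concatenate(name=f"{name}_cat{n}")(features)
        features.append(composite_function(inp, growth_rate, f"{name}_l{n}"))
    return layers.Concatenate(name=name + "_out")(features)
```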
3.2 Dense Skip Connection
This paper proposes the use of a Dense skip connection to bring the information in
the decoder path close to the input of the encoder path. A straightforward way is
to connect a concatenated output of the encoder layer to the decoder layer, i.e.,
a skip connection between high-resolution dense blocks and low-resolution dense blocks.
On the other hand, these skip connections require considerable computation and memory.
The dense block output is the concatenated output of all the layers in the dense block,
including the input of the dense block. Suppose the dense block output is used as
an input to another dense block that converts the two dense blocks into a single dense
block with an increased number of layers, resulting in higher memory and computation
requirements. The output of a dense block that has a skip connection with another
dense block can be obtained as follows:

$B_{i}=B_{i-1}++x_{0}++x_{1}++\ldots ++x_{n}$

where $B_{i}$ is the output of block i; $x_{0}++x_{1}++\ldots ++x_{n}$ is the concatenated
output of the layers of block i, and $B_{i-1}$ is the concatenated output of
the previous dense block.
Fig. 2 shows how a skip connection between two dense blocks allows them to act as a single
dense block, which requires more computation and memory resources. Fig. 2 (left) shows two dense blocks with two layers each. The output of the first dense block is the
concatenated output of all its layers (L1, L2) and is denoted as
y. This output is fed to the second dense block, and every layer (S1, S2) in that
block receives y as an input; Y in Fig. 2 (left) denotes the output of this second block. Intuitively, the graph in Fig. 2 (left) can be reconstructed as Fig. 2 (right), where Xo = x and the output Yo of the network is the same as Y. Moreover,
skip connections between the dense blocks are more complex than the skip connections
between layers. The L1 norm of the trained weights (normalized by the input connections)
of FC-DenseNet-103 was plotted to study the effect of skip connections between dense
blocks (Fig. 3). The effects of the transition-up layer on every layer of the subsequent dense block
were also plotted (Fig. 3, first row of the plot).
Fig. 2. Illustration showing how a concatenated input converts two disjointed dense blocks into one big dense block. x and y are the input and output of dense block L1, respectively, which has two convolutional layers. The output is again inputted to dense block L2, which has two convolutional layers. Xo is the input of a single dense block with four convolutional layers, and Yo is the output of this dense block. The output of two disjointed dense blocks with skip connections will be identical to that of a single dense block.
Fig. 3 shows the average heatmaps of the first (left) and last (right) dense block weights
in the decoder path. The first dense block has 12 layers, while the skip connection
carries the output of 12 layers and the block input. Each plot in Fig. 3 contains the average weights of the transition-up layer (first row only), skip connection
and current dense block. A red pixel in Fig. 3 indicates that the target layer uses (on average) the source layer output. Fig. 3 (left) shows that the middle dense block is not used by the first dense block in
the decoder path and does not show any change. Therefore, the FC-DenseNet architecture
depends mostly on the feature maps of the corresponding encoder (the skip connection
rows in Fig. 3 (left) show the weight changes). The first row in Fig. 3 (right) shows the dependence of the dense block on the transition-up layer as the
weight change can be observed in the first row. On the other hand, the overall (average)
weight of the layers of skip connection is much higher than that of the transition-up
layer. Furthermore, the first column of the plots in Fig. 3 shows that the first layer of the skip connection's dense block carries most of
the weight for this dense block. The other layers of this dense block do not receive
a large share of the overall weight, which indicates the correlation of low-level features
in the decoder path. The number of weights for the skip connection is much higher
than for the transition-up layer in the decoder path.
Fig. 3. L1 norm of the weights of a dense block with its skip connections. The rows of the plot contain a deconvolutional layer (first row), a skip connection (the encoder’s dense block), and the current dense block (the decoder’s dense block).
A bottleneck-dense skip connection was proposed to address the aforementioned issue.
The last layer of the dense block of the encoder processes the feature maps of all
preceding layers, and the output of the last layer of the dense block of the encoder
can be used as a bottleneck skip connection. The deconvolutional feature outputs
are concatenated with the bottleneck skip connection features. The bottleneck skip connection
carries k feature maps, where k is the growth rate of the dense block.
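The saving can be made concrete with simple channel bookkeeping. The sketch below is illustrative; c_in, m, and k are example values, not figures from the paper.

```python
def skip_channels(c_in: int, m: int, k: int, bottleneck: bool) -> int:
    # A full dense skip from an m-layer encoder block with growth rate k and
    # c_in input channels carries c_in + m*k feature maps; the bottleneck
    # skip carries only the k maps produced by the block's last layer.
    return k if bottleneck else c_in + m * k

print(skip_channels(48, 12, 12, bottleneck=False))  # 192 channels per skip
print(skip_channels(48, 12, 12, bottleneck=True))   # 12 channels per skip
```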
3.3 DenseNet to SenseNet
This paper proposes a new semantic image segmentation architecture that combines all
the building blocks explained in the previous subsections (Fig. 4). The SenseNet architecture consists of the following: 1) the encoder path that learns
the latent features using successive convolution dense blocks and transitions down
layers, 2) the decoder path that has learnable dense blocks and transition up layers,
and 3) bottleneck dense skip connections that carry context information from the encoder
(high-resolution dense blocks) to the decoder (low-resolution dense blocks) and prevent
the network from expanding its parameters. The transition-down layer has a 1 ${\times}$
1 convolutional layer followed by a max-pooling layer so that the dimension of the
input feature maps is reduced (Table 1). A 1 ${\times}$ 1 convolution can be viewed as a single fully connected
neuron across channels: it connects to all values in the input.
Table 1. Layer structure of the Transition down and Transition up layers.
| Transition Down | Transition Up |
|---|---|
| Batch Normalization | Batch Normalization |
| ReLU | ReLU |
| 1x1 Convolution | 3x3 Transposed Convolution (stride 2) |
| Dropout (p = 0.2) | |
| 2x2 Max Pooling | |
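A hedged Keras sketch of Table 1 follows; tf.keras layer choices and names are assumptions, and the filter counts are left as parameters.

```python
from tensorflow.keras import layers

def transition_down(x, filters, name="td"):
    # Table 1: BN -> ReLU -> 1x1 convolution -> dropout(0.2) -> 2x2 max pooling
    x = layers.BatchNormalization(name=name + "_bn")(x)
    x = layers.ReLU(name=name + "_relu")(x)
    x = layers.Conv2D(filters, 1, name=name + "_conv1x1")(x)
    x = layers.Dropout(0.2, name=name + "_drop")(x)
    return layers.MaxPooling2D(2, name=name + "_pool")(x)

def transition_up(x, filters, name="tu"):
    # Table 1: BN -> ReLU -> 3x3 transposed convolution with stride 2
    x = layers.BatchNormalization(name=name + "_bn")(x)
    x = layers.ReLU(name=name + "_relu")(x)
    return layers.Conv2DTranspose(filters, 3, strides=2, padding="same", name=name + "_deconv")(x)
```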
Fig. 4. Proposed SenseNet model with two dense blocks that takes an RGB image (three channels) as input and performs a convolutional operation using successive dense blocks. The encoder path of the proposed model has two blocks (both labeled as DB). There is a dense block between the encoder and the decoder. The last two blocks (both labeled as DB) constitute the decoder path. The decoder path first up-samples the feature maps using deconvolution operation and then concatenates the feature maps with the up-sampled output through skip connections. The bottleneck skip connections are shown by a dotted line.
After weighting, summation, and activation, it yields a single value per position. In the SenseNet
case, it pools the feature maps of all dense skip connections and
generates a new output that combines the inputs of all layers. The transition
up layer uses the deconvolutional layer that gradually reverses the effects of convolution.
The bottleneck skip connection output is concatenated with the transition-up layer
output to obtain various input features for the subsequent dense block and to pass
the context information from the encoder (high-resolution layers) to the decoder (low-resolution
layers). The output of a deconvolutional dense block is obtained as follows:
$B_{i}=x_{0}++x_{1}++\ldots ++x_{n-1}$

where $B_{i}$ is the output of the i$^{\mathrm{th}}$ dense block, and $x_{0}$, $x_{1}$,
..., $x_{n-1}$ are the outputs of the n layers in the i$^{\mathrm{th}}$ dense block.
Note that $B_{i-1}$ is not concatenated with the output of the dense block, although
every layer in the i$^{\mathrm{th}}$ block receives the output of the (i-1)$^{\mathrm{th}}$ block
($B_{i-1}$) as an input. Table 1 presents the architecture of the transition-down and transition-up layers. Table 2 lists the building blocks of SenseNet and shows the architecture of the bottleneck
layer and composite function.
Table 2. Building blocks of the SenseNet: Layer architecture of Bottleneck layer and composite function.
| Bottleneck Layer | Composite Function (3x3 Convolution) |
|---|---|
| Batch Normalization | Batch Normalization |
| ReLU | ReLU |
| 1x1 Convolution | Convolution with a given size |
| Dropout (p = 0.2) | Dropout (p = 0.2) |
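To illustrate the decoder behavior described above, here is a sketch of one decoder stage in the same assumed Keras style as the earlier snippets (composite_function and transition_up refer to the sketches in Sections 3.1 and 3.3): the transition-up output is concatenated with the bottleneck skip, and the decoder dense block output omits its input $B_{i-1}$.

```python
from tensorflow.keras import layers

def decoder_dense_block(x, num_layers, growth_rate, name="ddb"):
    # Every layer still sees the block input, but the block output
    # concatenates only x_0 ++ ... ++ x_{n-1}, not B_{i-1} itself.
    features, outputs = [x], []
    for n in range(num_layers):
        inp = features[0] if n == 0 else layers.Concatenate(name=f"{name}_cat{n}")(features)
        out = composite_function(inp, growth_rate, f"{name}_l{n}")  # Sec. 3.1 sketch
        features.append(out)
        outputs.append(out)
    return outputs[0] if num_layers == 1 else layers.Concatenate(name=name + "_out")(outputs)

def decoder_stage(decoder_in, bottleneck_skip, num_layers, growth_rate, filters, name="dec"):
    up = transition_up(decoder_in, filters, name=name + "_tu")  # Sec. 3.3 sketch
    merged = layers.Concatenate(name=name + "_skip_cat")([up, bottleneck_skip])
    return decoder_dense_block(merged, num_layers, growth_rate, name=name + "_db")
```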
4. Experiment
This study evaluated the performance of SenseNet on the CamVid dataset [15]. Various block sizes and growth rates were used in these experiments. The results
in terms of IoU (Intersection over Union) and global accuracy (pixel-wise accuracy)
are provided. The IoU is used widely as an evaluation metric for object detection.
For any class c, IoU can be calculated as follows:

$\mathrm{IoU}_{c}=\dfrac{\mathrm{overlap}_{c}}{\mathrm{union}_{c}}$

All the input pixels were looped over to calculate the overlap and union areas. In set
theory, with A denoting the set of pixels predicted as class c and B the set of ground-truth pixels of class c, IoU can be defined as follows:

$\mathrm{IoU}=\dfrac{|A\cap B|}{|A\cup B|}$
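As a concrete rendering of this metric (an assumed NumPy sketch, not the paper's evaluation code), the following loops over classes and counts overlap and union pixels; the nanmean of the per-class values gives the mean IoU.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    # pred, target: integer label maps of identical shape
    ious = []
    for c in range(num_classes):
        overlap = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(overlap / union if union > 0 else np.nan)
    return ious

# Mean IoU over the 11 CamVid classes, ignoring classes absent from both maps:
# miou = np.nanmean(per_class_iou(pred, target, 11))
```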
4.1 Experimental Setting
Dataset: CamVid [15] is a segmented video dataset for understanding urban scenes. The extracted frames
from the CamVid dataset were used [13]. The frames were divided as follows: 367 frames for training, 101 frames for validation,
and 233 frames for testing. Each frame was 360 ${\times}$ 480 in size, and its pixels
were categorized into 11 semantic classes. SenseNet was trained on frames from CamVid,
which were cropped randomly to 224 ${\times}$ 224. The images were normalized to the
mean and standard deviation of the data. Data augmentation was used to generate a
variety of inputs on which SenseNet was trained; it also reduced the need for regularization
in the models. The images were flipped horizontally with a probability of 0.5.
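A minimal tf.data-style sketch of this preprocessing is shown below; MEAN and STD are hypothetical placeholders for the precomputed dataset statistics, and the label map is cropped and flipped jointly with the image so the two stay aligned.

```python
import tensorflow as tf

MEAN, STD = 0.0, 1.0  # assumed per-dataset statistics (placeholders)

def preprocess(image, label):
    # Joint random 224x224 crop of the RGB image and its integer label map
    stacked = tf.concat([image, tf.cast(label[..., None], tf.float32)], axis=-1)
    stacked = tf.image.random_crop(stacked, size=[224, 224, 4])
    image = (stacked[..., :3] - MEAN) / STD   # normalize to dataset mean/std
    label = tf.cast(stacked[..., 3:], tf.int32)
    # Horizontal flip with probability 0.5, applied to image and label together
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_left_right(image)
        label = tf.image.flip_left_right(label)
    return image, label[..., 0]
```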
4.2 Training and Inference Details
As previously explained, SenseNet architecture was proposed for image segmentation.
The architecture was implemented in Tensorflow with one GPU. The hyperparameters of
the architecture are given as follows. SenseNet was trained using the RMSProp optimizer.
The learning rate of RMSProp was initialized to 0.001 (initial_lr$_{\mathrm{i}}$) and reduced stepwise
during training: the rate is lowered every z epochs, where x denotes the current epoch.
i and z were set to 1${\times}$10$^{-3}$ and 3, respectively.
A dropout rate of 0.2 was used. The l2 norm and a weight decay of 1${\times}$10$^{-4}$
were used to regularize the model. A batch size of 2 was used. Batch normalization
with a moving mean and variance was applied. Standard geometric transformations such
as image flipping, cropping, and scaling were used as image augmentation for training
the model.
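The following is a hedged sketch of this training setup. Only i = 1e-3, z = 3, the weight decay of 1e-4, and the batch size of 2 come from the text; the step-decay factor is not given, so DROP = 0.5 is a placeholder assumption.

```python
import tensorflow as tf

INITIAL_LR, Z, DROP = 1e-3, 3, 0.5  # i and z from the paper; DROP is assumed

def step_decay(epoch, lr):
    # Reduce the learning rate every Z epochs from the initial value.
    return INITIAL_LR * DROP ** (epoch // Z)

lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=INITIAL_LR)
# L2 regularization (weight decay 1e-4) can be attached per layer, e.g.
# kernel_regularizer=tf.keras.regularizers.l2(1e-4); the batch size is 2.
```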
4.3 Results and Analysis
Table 3 lists the main results of the SenseNet. Different configurations of layers can be
applied for other tasks.
The evaluation results show that SenseNet outperformed the conventional models in
terms of IoU. Table 3 lists the IoU and global accuracy scores of SenseNet. The performance of the SenseNet
increased as the number of parameters increased. For the evaluation, SenseNet was
trained and tested on images cropped randomly to 224 ${\times}$ 224 and on overlapping
images cropped to 224 ${\times}$ 224. This section reports the results of three configurations
of the SenseNet model: 1) SenseNet-78 (a growth rate of 12), 2) SenseNet-108 (a growth
rate of 12), and 3) SenseNet-abc (a growth rate of 16). SenseNet-78 and SenseNet-108
have four dense blocks, while SenseNet-abc has five dense blocks. A maximum of four
images of 224 ${\times}$ 224 were cropped from one image from the dataset, and the
SenseNet model was fine-tuned on these cropped images. The accuracy of SenseNet increases
with the size of the dataset (Table 3).
Table 3. Performance comparison of the SenseNet model with other models. The symbol * indicates that this study implemented FC-DenseNet 56; FC-DenseNet 67 and 103 could not be trained due to resource constraints. The results of FC-DenseNet and SenseNet, which were trained on the same class weights for a fair comparison, are also listed. ** indicates the results obtained after training on overlapping images of 224 × 224 cropped from the original image of 360×480.
| Model | Pretrained | Parameters (M) | Building | Tree | Sky | Car | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Cyclist | Mean IoU | Global accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SegNet | ✓ | 29.5 | 68.7 | 52.0 | 87.0 | 58.5 | 13.4 | 86.2 | 25.3 | 17.9 | 16.0 | 60.5 | 24.8 | 46.4 | 62.5 |
| Bayesian SegNet | ✓ | 29.5 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 63.1 | 86.9 |
| DeconvNet | ✓ | 252.0 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 48.9 | 85.9 |
| FCN8 | ✓ | 134.5 | 77.8 | 71.0 | 88.7 | 7.1 | 32.7 | 91.2 | 41.7 | 24.4 | 19.9 | 72.7 | 31.0 | 57.0 | 88.0 |
| DeepLab-LFOV | ✓ | 37.3 | 81.5 | 74.6 | 89.0 | 82.2 | 42.3 | 92.2 | 48.4 | 27.2 | 14.3 | 75.4 | 50.1 | 61.6 | - |
| FC-DenseNet 56 (k=12)* | | 1.5 | 90.05 | 65.87 | 7.14 | 83.9 | 64.18 | 58.04 | 4.50 | 1.59 | 50.55 | 3.19 | 0 | 39.0 | 80.76 |
| FC-DenseNet 56 (k=12)** | | 1.5 | 90.67 | 64.42 | 14.90 | 87.15 | 69.84 | 58.95 | 14.16 | 11.89 | 64.90 | 18.22 | 4.65 | 45.43 | 81.18 |
| SenseNet 78 (k=12) | | 2.3 | 87.44 | 64.68 | 21.37 | 84.86 | 60.14 | 51.54 | 21.26 | 4.74 | 58.03 | 25.62 | 14.02 | 44.89 | 86.42 |
| SenseNet 108 (k=12) | | 3.3 | 89.16 | 77.55 | 31.57 | 93.88 | 60.43 | 67.33 | 33.22 | 43.71 | 81.90 | 27.12 | 22.84 | 57.16 | 86.42 |
| SenseNet 108 (k=12)** | | 3.3 | 91.61 | 85.04 | 35.30 | 96.92 | 77.78 | 74.89 | 44.87 | 59.94 | 87.65 | 32.57 | 37.87 | 65.86 | 90.83 |
Fig. 5 shows the error rate reduction and increase in accuracy of the multiple configurations
of SenseNet. The decrease in the error rate demonstrates the high performance of SenseNet.
The results on the CamVid dataset clearly show that SenseNet achieves state-of-the-art
performance in terms of IoU. SenseNet does not require pre-training. Many of the semantic
segmentation models listed in Table 3 were pretrained on larger datasets, such as ImageNet [22], to perform segmentation. The segmentation results on unrepresented classes in the
dataset can be improved by addressing the class imbalance issue. SenseNet uses fewer
parameters (Table 3) and outperforms FC-DenseNet. Bottleneck skip connections reduce the memory requirements
and computational complexity. SenseNet outperforms FC-DenseNet in terms of IoU and
achieves a similar accuracy while using only 2.3M (million) parameters,
compared with the 9.4M (million) parameters of FC-DenseNet-103.
Fig. 5. Changes in the validation loss and validation accuracy of two SenseNet models (SenseNet-78 and SenseNet-108). The accuracy is indicated by a solid line, while the loss is indicated by a dotted line.
Multipath-DenseNet [16] showed that DenseNet makes several shorter paths with the dense block input in very
deep neural networks because of the higher number of low-level feature maps. Longer
skip connections make it more difficult for a network to learn in the decoder path.
Moreover, the classes in the dataset are very imbalanced. Addressing the class imbalance
of the data in CamVid would improve the performance of SenseNet. The results of this
study's implementation of FC-DenseNet 56 are presented; the larger configurations
of FC-DenseNet (67 and 103) were not implemented due to computational resource constraints.
The FC-DenseNet results in Table 3 show that classes with higher numbers of instances and pixels
have very high accuracy, whereas lower accuracy is observed in underrepresented
classes. This study did not find information about handling the class imbalance reported
elsewhere [14].
5. Conclusion
This paper proposed SenseNet, which is based on DenseNet for semantic segmentation.
The SenseNet uses a dense block to build the encoder and decoder paths. Unlike DenseNet,
SenseNet does not concatenate the input with the output of the dense block in the
decoder path. Feeding the output of a dense block to another dense block requires
more computation and memory. Therefore, this paper proposes bottleneck skip connections
whose features will be concatenated with the transition-up layer features in the decoder
path. The experimental results show that the SenseNet model outperforms the baseline
models regarding IoU.
ACKNOWLEDGMENTS
This research was supported by the Strategic Networking & Development Program funded by
the Ministry of Science and ICT through the National Research Foundation of Korea (RS-2023-00277267).
REFERENCES
T. Nguyen and M. Yoo, ``Fusing LIDAR sensor and RGB camera for object detection in
autonomous vehicle with fuzzy logic approach,'' in International Conference on Information
Networking, IEEE Computer Society, Jan. 2021, pp. 788-791.
ICTC 2019: The 10th International Conference on ICT Convergence: ``ICT Convergence
Leading the Autonomous Future,'' Jeju Island, Korea, Oct. 16-18, 2019.
Y. LeCun, ``LeNet-5, convolutional neural networks.''
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ``Densely Connected Convolutional
Networks.'' [Online]. Available: https://github.com/liuzhuang13/DenseNet.
A. Van Den Oord, S. Dieleman, and B. Schrauwen, ``Deep content-based music recommendation.''
R. Girshick, ``Fast R-CNN.'' [Online]. Available: https://github.com/rbgirshick
J. Long, E. Shelhamer, and T. Darrell, ``Fully Convolutional Networks for Semantic
Segmentation.''
W. Bouzidi, S. Bouaafia, M. A. Hajjaji, and L. M. Bergasa, ``Enhanced U-Net Approach:
Semantic Segmentation for Self-Driving Cars Applications.''
H. Pan, Y. Hong, W. Sun, and Y. Jia, ``Deep Dual-Resolution Networks for Real-Time
and Accurate Semantic Segmentation of Traffic Scenes,'' IEEE Transactions on Intelligent
Transportation Systems, vol. 24, no. 3, pp. 3448-3460, Mar. 2023, doi: 10.1109/TITS.2022.3228042.
L. Bartolomei, L. Teixeira, and M. Chli, ``Perception-aware path planning for UAVs
using semantic segmentation,'' in IEEE International Conference on Intelligent Robots
and Systems, Institute of Electrical and Electronics Engineers Inc., Oct. 2020, pp.
5808-5815. doi: 10.1109/IROS45743.2020.9341347.
M. Hua, Y. Nan, and S. Lian, ``Small Obstacle Avoidance Based on RGB-D Semantic Segmentation.''
O. Ronneberger, P. Fischer, and T. Brox, ``U-Net: Convolutional Networks for Biomedical
Image Segmentation,'' May 2015, [Online]. Available: http://arxiv.org/abs/1505.04597
V. Badrinarayanan, A. Kendall, and R. Cipolla, ``SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation,'' IEEE Trans Pattern Anal Mach Intell, vol. 39,
no. 12, pp. 2481-2495, Dec. 2017, doi: 10.1109/TPAMI.2016.2644615.
S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, ``The One Hundred
Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation.'' [Online].
Available: https://github.com/SimJeg/FC-DenseNet
G. J. Brostow, J. Fauqueur, and R. Cipolla, ``Semantic object classes in video: A
high-definition ground truth database,'' Pattern Recognit Lett, vol. 30, no. 2, pp.
88-97, Jan. 2009, doi: 10.1016/j.patrec.2008.04.005.
B. Lodhi and J. Kang, ``Multipath-DenseNet: A Supervised ensemble architecture of
densely connected convolutional networks,'' Inf Sci (N Y), vol. 482, pp. 63-72, May
2019, doi: 10.1016/j.ins.2019.01.012.
P. Hu et al., ``Real-Time Semantic Segmentation with Fast Attention,'' IEEE Robot
Autom Lett, vol. 6, no. 1, pp. 263-270, Jan. 2021, doi: 10.1109/LRA.2020.3039744.
X. Li et al., ``Semantic Flow for Fast and Accurate Scene Parsing.'' [Online]. Available:
https://github.com/lxtGH/SFSegNets.
H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ``Pyramid Scene Parsing Network.''
F. Yu and V. Koltun, ``Multi-Scale Context Aggregation by Dilated Convolutions,''
Nov. 2015,
K. He, X. Zhang, S. Ren, and J. Sun, ``Deep Residual Learning for Image Recognition.''
[Online]. Available: http://image-net.org/challenges/LSVRC/2015
L. Deng and X. Li, ``Machine learning paradigms for speech recognition: An overview,''
IEEE Trans Audio Speech Lang Process, vol. 21, no. 5, pp. 1060-1089, 2013, doi: 10.1109/TASL.2013.2244083.
Author
Bilal Ahmed Lodhi received his B.S. degree in computer science from Baqai Medical University,
Karachi, Pakistan, in 2004, the M.S. degree in computer science from the National
University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2009, and the
Ph.D. degree in computer science from Korea University, Seoul, South Korea, in 2019.
He worked as an Ultrasound Research Engineer with Alpinion Medical Systems, Seoul.
From 2020 to 2022, he was a Research Fellow with the School of Electronics, Electrical
Engineering and Computer Science, Queen’s University Belfast (QUB), Belfast, U.K.
Prior to joining QUB, he was a Research Fellow with the University of Seoul, Republic
of Korea. He is currently an Assistant Professor at School of Computing, Ulster University.
Rehmat Ullah is an Assistant Professor at the School of Technologies, Cardiff Metropolitan
University, UK. He received his Ph.D. in Electronics and Computer Engineering from
Hongik University, South Korea. Previously, he worked as an Assistant Professor at
Gachon University, South Korea, and as a Post-Doctorate Research Fellow at the University
of St Andrews, UK, and Queen’s University Belfast, UK. His research focuses on the
broader areas of network and distributed systems, particularly the development of
architectures, algorithms, and protocols for emerging paradigms such as edge computing,
IoT, ICN/NDN, and distributed machine learning for edge computing systems. This includes
the design, measurement studies, prototyping, testbed development, and performance
evaluations. He served as a general chair, TPC member, keynote speaker, and session
chair for several flagship conferences, such as ACM ICN 2022, ACM IMC 2018, and ICC
2023 and ICC2024. His research has been published in premier conferences, journals,
and patents including UCC, ACM ICN, HotMobile, IEEE Communications Magazine, IEEE
Transactions on Parallel and Distributed Systems, IEEE Transactions on Network Science
and Engineering, IEEE Internet of Things Journal, IEEE Wireless Communications Magazine,
IEEE Network Magazine, Journal of Network and Computer Applications, and Future Generation
Computer Systems. He currently holds six patents. In 2022, Dr. Rehmat was recognized
as a Global Talent by the Royal Academy of Engineering, UK. More information is available
from https://rehmatkhan.com/
Sajida Imran received her Ph.D. degree from Ajou University in 2018. She worked as
an Assistant Professor at the Department of Computer Engineering, University of Lahore,
Pakistan. She is currently working as an Assistant Professor with the Department of
Computer Engineering at King Faisal University, Saudi Arabia. She has authored several
international journal articles. Her research interests include wireless internet technologies
for the localization, detection, and tracking of objects and applications of the Internet
of Things using various machine learning techniques.
Muhammad Imran received his B.S. in computer science from COMSATS University Pakistan
in 2015 and his M.S. in computer science from the Virtual University of Pakistan in
2019. He is pursuing a Ph.D. in software and communication engineering with the Department
of Electronics and Computer Engineering at Hongik University, South Korea. His research
interests include Cloud/edge computing, the Internet of Things, information-centric
networking, and named data networking. He worked as an educator in the school education
department in Punjab, Pakistan, from 2016 to 2021.
Byung-Seo Kim received his B.S. degree in Electrical Engineering from In-Ha University,
In-Chon, Korea, in 1998 and his M.S. and Ph.D. in Electrical and Computer Engineering
from the University of Florida in 2001 and 2004, respectively. Dr. Yuguang Fang supervised
his Ph.D. study. Between 1997 and 1999, he worked for Motorola Korea Ltd., PaJu, Korea,
as a CIM Engineer in ATR&D. From January 2005 to August 2007, he worked for Motorola
Inc., Schaumburg, Illinois, as a Senior Software Engineer in Networks and Enterprises
for designing the protocol and network architecture of wireless broadband mission-critical
communications. He is a professor in the Department of Software and Communications
Engineering at Hongik University, Korea. He is an IEEE Senior Member and an Associate
Editor of IEEE Access, Telecommunication Systems, and the Journal of the Institute of
Electronics and Information Engineers. His studies have appeared in approximately 260
publications and 32 patents. His research interests include designing and developing
efficient wireless/wired networks, link-adaptable/cross-layer-based protocols, multi-protocol
structures, wireless CCNs/NDNs, Mobile Edge Computing, physical layer design for broadband
PLC, and resource allocation algorithms for wireless networks.