(School of Computer Science and Engineering, Kyungpook National University, Daegu,
{nematullo.9006, neloyou}@knu.ac.kr
Copyright © The Institute of Electronics and Information Engineers(IEIE)
LSR, Loss function, Gradient descent, Backpropagation
1. Introduction
The perceptron, an early artificial neural network (ANN), was introduced by Frank Rosenblatt in
the late 1950s, and ANNs are now used widely for the prediction and classification of complex data [1]. An ANN is composed of artificial neurons, modeled loosely on biological neurons, that are connected in layers and learn the parameters,
namely weights and biases. A typical neuron in a hidden or output layer performs the following steps:
· Neurons receive inputs from the previous layer.
· Each input is multiplied by its weight, and the products are added together. This
is called a weighted sum.
· Bias is added to adjust the weighted sum.
· The result is then passed into the activation function to map the value into
a bounded range.
· The output of the activation function, which is the outcome of a neuron, is passed
on to the next layer.
Neurons in the output layer produce the result of a network.
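The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: the sigmoid activation, the function names `sigmoid` and `neuron_output`, and the input, weight, and bias values are all assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Map a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, plus bias, passed through the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(weighted_sum + bias)


# Hypothetical values, chosen so that the weighted sum plus bias is exactly 0.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.0
print(neuron_output(inputs, weights, bias))  # sigmoid(0) = 0.5
```

The value returned by `neuron_output` is what would be passed on to the neurons of the next layer.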
The loss is calculated from the predicted value (regression or classification)
for a given input. After this is done for a set of input data, the optimal value
of each parameter is obtained by applying gradient descent (GD) to the cost function
[2], with the gradients computed by backpropagation. Understanding the loss function, GD, and backpropagation is essential
to this process. The process lets ML researchers train a model again
and again until the loss stops decreasing, ideally close to 0, so that the trained model fits the training data closely. GD
helps the researcher find the minimum of the cost function [3]. The description so far covers a perceptron with one or two inputs and one or two
outputs in a feedforward neural network.
The backpropagation algorithm extends perceptron training to networks with multiple
layers, which are often called multilayer perceptrons. Backpropagation
makes it feasible to use gradient methods for training multilayer networks by updating
the weights to reduce the error; GD or variants such as stochastic
gradient descent (SGD) are generally used [4]. ANNs have beneficial applications in many fields [1,5], including pattern recognition, face identification, signal recognition, and machine
translation, so it is essential to keep shedding light on DL algorithms such
as CNN, RNN, GNN, and GAN. Hence, this paper explains ANNs for novice DL practitioners.
The remainder of this paper is organized as follows. Section 2 gives a detailed survey
on least square regression, loss function, GD, and backpropagation applications. Section
3 provides a step-by-step straightforward guide on least square regression, loss function,
and GD with backpropagation. Finally, the conclusion section summarizes the paper.
2. Related Work
2.1 Least-square-regression
Least square regression (LSR) is vital for prediction applications, including microarray
missing value estimation, disease spread estimation, and weather condition estimation.
The authors of [6,7] proposed an algorithm called least square multi-splitting (LSMS) to solve large-scale
regression problems in parallel. They designed a partitioning technique for the
regression design based on cluster analysis. The results show that LSMS can compete with LSR as a
general-purpose algorithm and that it is a comprehensive, well-performing technique for partitioning
large-scale regression problems. For comparison, the partitions at every truncation were obtained by
cluster analysis with three linkages: single, complete, and average.
Group-wise retargeted least squares regression (GReLSR) was proposed for multicategory classification.
This work extends the earlier retargeted least square regression (ReLSR) proposed by XuYao
Lingfeng et al. The authors reformulated ReLSR with a groupwise regularization that restricts the
translation values of ReLSR, which yields the GReLSR technique. The performance of GReLSR was compared
with seven prior multicategory classification techniques. The results indicated that GReLSR could be a
state-of-the-art method that outperforms those prior techniques [8].
Drug availability is an important part of human life. One study forecasts drug availability
for the following month with the LSR algorithm. To forecast the medicine stock requirements, the authors
collected data from January to November 2017 from a Puskesmas (community health center) in
East Kalimantan, Indonesia, and predicted the future requirements of drug quantity. A shortcoming
of this work is that the authors did not compare LSR with other prediction
algorithms to establish the accuracy of their results [9].
Zhao Shuping et al. [10] proposed a discriminant and sparsity-based LSR. Unlike prior LSR techniques,
the method exploits the relationship among the training samples with L1 regularization
to jointly learn the discriminative projection matrix and the orthogonal relaxing
term. Comprehensive image classification experiments on the Yale B, LFW, and 15-Scene databases
compared the method with eight prior algorithms, and the results
showed the efficiency of the proposed approach for classification.
The COVID-19 outbreak is a global concern. R. O. Ogundokun et al. reported predictive modelling of COVID-19
based on a linear regression model. The authors analyzed the impact
of travel history and contacts on confirmed COVID-19 cases in Nigeria.
Ordinary LSR was used to fit the data, together with diagnostic checks, on datasets
obtained from the Nigeria Centre for Disease Control (NCDC) website for
April 05 to April 13, 2020. The comparative
results show that travel history and contacts increase the likelihood of being
infected with COVID-19 by 85% and 88%, respectively [11].
COVID-19 generally spreads rapidly in densely populated countries, and India is one such country.
Multiple linear regression was used to predict the active COVID-19 cases in the last two weeks of August
2020 in Odisha and in India. The authors suggested that the responsible authorities should reinforce
the containment facilities in India and Odisha to keep people
healthy and decrease the number of patients [12]. Qin Lei et al. proposed another prediction of the number of COVID-19 cases. The data,
covering categories such as dry cough, fever, chest distress, coronavirus,
and pneumonia, were derived from the social media search index from 31 December 2019
until 17 February 2020. The authors based their prediction
on five methods, namely subset selection, forward selection,
lasso regression, ridge regression, and elastic net, to prevent overfitting during prediction.
The comparative results indicated that by January 22, 2020 the numbers
of coronavirus, pneumonia, and new suspected cases would first rise sharply and then decrease
slightly [13,14]. Table 1 lists the different regression analysis techniques.
Table 1. Summary of Different Versions of Regression Analysis Techniques.
Multi class classification
Discriminant LSR
2.2 Loss Function
A loss is the difference between the actual outcome and the predicted outcome for
a given instance. It is also called the error or residual. When the loss is aggregated over a set
of instances, it is called the cost of the model. The sum of squared errors
(SSE) or cross-entropy (CE) is typically used as a cost function. The cost function has been used
to estimate parameter values for different purposes, such as sequential sampling, classification, and optimal control [15].
Sequential sampling analysis is an important computation
technique in various fields, including economics, engineering, medical science, and
statistics. K. Jampachaisri et al. proposed an Empirical Bayes (EB) prediction method
[15] for parameter estimation in a sequential sampling plan (SSP) based on the squared error
loss (SEL) and precautionary loss (PL) functions. The proposed EB approach was compared
with the prior single-sampling statistical approach. The results show that EB
in SSP with SEL and PL affords the highest probability of acceptance
and the smallest average sample number. An advantage of the proposed method is that
the squared error is convex in the parameter. Nevertheless, the
authors did not compare their method with more than one sequential
sampling technique.
DL algorithms are becoming an essential tool in everyday life. In particular, new techniques
for training on complex datasets quickly and accurately are in demand.
The authors of [16] proposed a loss function called the Reward cum Penalty loss function based
Extreme Learning Machine (RP-ELM), which penalizes the data points that do not fall on
the targeted location and rewards the data points that fall on the
desired location. The authors compared RP-ELM with prior loss functions,
and the results show that it performs better than the hinge and quadratic losses in terms of
fast and correct classification. The main purpose of an activation function, by contrast, is
to introduce nonlinearity into an otherwise linear model. S. Ma et al. proposed a nonlinear optimization
method called the variable-step beetle antennae search algorithm (GBAS) to improve the
performance of the Huber loss function [17]. DL algorithms are currently important tools in many areas [18-20].
In [21], multiple loss functions based on kernel density
estimation (KDE) were proposed to predict the probability density function (PDF) for spoofing detection. The
authors used the recent automatic speaker verification (ASV) spoof 2019 dataset as a
realistic scenario. The experimental results show that the proposed KDE-based loss
functions are superior to the conventional loss functions used so far for
spoofing detection. This result suggests
a new direction for KDE-based loss functions in the field of DL. A weakness
of the work is that the comparison with conventional loss functions
is not presented clearly enough for the reader.
Traffic classification is also a long-standing challenge in the research community. The authors of [22] proposed a new loss function called UniLoss. They
addressed the problem of imbalanced VoIP traffic datasets, in which
minority categories are classified poorly even when the overall accuracy is high. In
the experiments, UniLoss was run
with four types of deep neural network (DNN): convolutional neural
network (CNN), recurrent neural network (RNN), ResNet, and FusionNet. Two
conventional loss functions were run with the same networks for comparison.
The results showed that UniLoss achieves a higher F1 score
on single categories than the two conventional loss functions.
A drawback of the work is that the authors
experimented only with VoIP data and not with other types of data.
DL algorithms are also becoming an important prognosis tool in biomedical science. H. Seo et al.
proposed a generalized loss function (GLF) technique with functional parameters in
[23] for optimal decision making in small target segmentation. The proposed GLF gave
more precise detection and segmentation of lung and liver tumors than
prior techniques.
Several other studies have used different types of loss
functions [24,25]. Table 2 lists papers that used loss functions for different applications.
Table 2. Summary of Different Versions of Loss Function Techniques.
Loss Function Approach
For Sequential
Object Detection
Cross Entropy & ED
WC-Entropy Loss
Asymmetric Loss
Gaussian Loss Function
2.3 Gradient Descent
Gradient descent (GD) is an optimization algorithm that iteratively searches for a minimum
of the error function while training on a fixed dataset. It has diverse
applications, including image classification, entity clustering, weather condition prediction,
and disease spread prediction. D. Zou et al. [26] studied binary classification with a deep, fully connected neural network that
uses the Rectified Linear Unit (ReLU) activation function and the cross-entropy
loss function and is trained with GD. The results showed that, with proper random weight initialization,
GD can find global minima of the training loss for an over-parameterized deep ReLU
network under certain assumptions on the training data. They compared
the training performance of their method with two previously proposed techniques, and their
technique reached global minima faster than those techniques
in the experiments. An advantage of the work is that the authors
proved that Gaussian random initialization followed by GD
produces a sequence of iterates that remain inside a small perturbation region
centered at the initial weights. A disadvantage is that the
authors did not compare their method with more than two previous
techniques.
J. Flynn et al. proposed a method for inferring a multiplane image (MPI)
scene representation with learned gradient descent (LGD) [27]. They compared the MPI results with a well-known method called Soft3D and with several other
deep learning methods. The results showed that the MPI scene representation
combined with LGD outperformed the other techniques,
particularly on complicated, nonlinear inverse problems. A strength of
the work is that the authors implemented their idea on top of
LGD. Its weak points are the RAM requirements
and the training speed of LGD: training takes multiple days on more than
one GPU.
J. Lee et al. studied the learning dynamics of GD theoretically
in the parameter space of deep nonlinear networks [28]. They compared their approach with traditional SGD and CEL for classification
on the MNIST and CIFAR datasets. The results showed that the proposed
approach is more effective than SGD and CEL for training a wide DNN
with minimum error. The strength of the work is that its theoretical results
are precise; its weakness is that the authors did not evaluate
on more than two classification datasets
to demonstrate the robustness of their technique.
The DL research community continues to produce a large body of work on optimizing convex, nonconvex, and concave
functions. Two-time-scale gradient descent ascent (GDA) for solving nonconvex-concave
minimax problems was proposed in [29]. The main aim of this work is to solve the nonconvex-concave minimax problem and
compare the performance of GDA with two prior techniques, the Wasserstein robustness model
(WRM) and Gradient Descent max (GDmA). The results showed that GDA outperforms
both techniques on three
types of classification datasets. An advantage of the work is that the authors
proved a solution to the nonconvex-concave minimax problem while training on
combined classification datasets. On the other hand, the authors did not justify why
they chose only three classification datasets for the comparison
when many others are available.
M. M. Amiri et al. [30] proposed analog distributed stochastic gradient descent (A-DSGD) to reduce
the effect of channel noise under limited channel bandwidth in combination with a parameter server (PS). In
addition, a prior technique called digital distributed stochastic gradient
descent (D-DSGD) was used for a performance comparison with A-DSGD. The results
showed that A-DSGD is faster than D-DSGD because of how it uses the available channel
bandwidth. In the last decade, many studies have used different GD
types for different applications, such as classification [31], sparse linear problems [32], new task learning [33], data overfitting prevention [34], and inverse filtering [35].
2.4 Backpropagation
A combined Empirical Mode Decomposition-Variational Mode Decomposition-Genetic Algorithm-Backpropagation
(EMD-VMD-GA-BP) model was introduced in [36] for carbon price prediction. The model combines several techniques to estimate
the carbon price in the Hubei market, in particular a backpropagation (BP)
neural network optimized with a genetic algorithm (GA). The proposed model outperformed other prediction models. The advantage of
this work is that the model can meaningfully reduce the difficulty of carbon
price time series forecasting. In [37], a BP neural network technique was proposed for Multi-Input Multi-Output (MIMO) Wireless Sensor Networks (WSN),
which addresses the Cluster Head (CH) identification challenge for
MIMO sensor networks.
The main aim of the model was to address the location identification problem
of the CH, which is useful in Intelligent Transportation
Systems (ITS). The model reduces the total estimation error compared with other
techniques. ML is also used in areas such as agriculture. L. Wang et al. proposed
monitoring maize growth on the North China Plain with a hybrid genetic algorithm-based
BP neural network (GA-BPNN) [38]. The experiments used the remotely sensed leaf
area index (LAI) and vegetation temperature condition index (VTCI), derived
from Global LAnd Surface Satellite (GLASS) and Moderate-resolution Imaging Spectroradiometer
(MODIS) data, as the key indicators of maize growth.
The GA-BPNN model was designed to provide sufficient information on maize growth at
the main growth stage, and it performed satisfactorily compared
with other techniques. A combination of a genetic algorithm and backpropagation
(GA-BP) was also proposed to predict clothing pressure [39,40]. The GA-BP algorithm does
not require complex modeling, unlike prior girdle pressure prediction models
such as the General Regression Neural Network (GRNN) and grey BP. W. Yang et al. proposed
an analysis method for skeletal anthropology based on an improved BP algorithm.
The main aim of the study was to determine from skeletal measurements whether a selected skull was male or female [40].
The improved BP algorithm achieved a classification accuracy of
97.232% in the training stage with a mean square error of 0.01. Its
performance was compared with prior techniques based on the cranial sagittal
chord and the apical sagittal chord, and the improved BP algorithm
outperformed both.
3. Mathematical Calculations
3.1 Least Square Regression
The least square regression model aims to determine the relationship between the
dependent and independent variables. It can also be used to estimate large amounts
of missing data. Each value of the dependent variable changes with
the independent variable [41,42]. Fig. 1 shows how linear regression fits the data points (X-input, Y-output) given in
Table 4 using the line function f(x) = mx + b.
Fig. 1. Fitting line on the data points.
The line equation on the data points is determined by estimating (approximating)
the values of m and b. LSR proceeds in three steps: calculating the slope of the
function, computing the y-intercept, and finally substituting the input values into
the resulting model. Before calculating the slope of the function, it is essential
to know the means of the input and output values, as shown in Table 5. As the first step is to determine m (the slope) of the function, the mean of each column
is subtracted from every X and Y observation, and the differences are placed
in separate columns. Table 5 lists this process in detail. With the values calculated in Table 5, the slope (m) [42] can be found using the following formula:
$m = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^{2}}$
The next step after finding the slope is finding the y-intercept (b), the point
at which the trend line crosses the y-axis:
$b = \bar{Y} - m\bar{X}$
The values of m and b were calculated manually. They can also be computed in Excel, which can
plot the line of best fit (trend line), as shown in Fig. 2.
Fig. 2. Line of best fit.
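As a minimal sketch, the slope and intercept calculation described above can also be written in Python. The function name `least_squares_fit` and the data are hypothetical; the data are not the values of Table 4 and were chosen so that the fitted line is exactly y = 2x + 1.

```python
def least_squares_fit(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Deviations from the mean, one per observation.
    dx = [x - x_mean for x in xs]
    dy = [y - y_mean for y in ys]
    # Slope: sum of products of deviations over sum of squared X deviations.
    m = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - m * x_mean
    return m, b


# Illustrative data only (an assumption, not taken from the paper).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
m, b = least_squares_fit(xs, ys)
print(m, b)  # 2.0 1.0
```

Substituting a new input into `m * x + b` then gives the predicted output, which corresponds to the final step of the procedure.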
Table 3. Summary of the Different Versions of Gradient Descent Methods.
Inverse Filtering
Convex setting
Image representation
Catastrophic forgetting
Minimum error analysis
Table 4
Table 5.
X - X̄
Y - Ȳ
(X - X̄)²
3.2 Loss Function
The loss function measures how far the predictions are from the actual values, and training
aims to minimize it [2,43]. Table 6 provides example training data that relate working hours to the earnings
of a worker. For example, when a worker works one hour,
the salary is 13 dollars. What is the salary when more than one
hour is worked?
Table 6.
Working time (X)
Salary of Worker (Y)
What happens when more than 10 hours are worked? What if there is a very large
amount of work data for prediction? For this type of training, ML algorithms
can be used. It is important to
start from a simple equation, the ML hypothesis. Table 7 provides an example of a prediction of a worker’s salary. As the working time is
the input X and the salary of a worker is the output Y, the following ML hypothesis
is used for a simple prediction:
$\hat{Y} = X \star W$
where $\hat{Y}$ is the predicted value. The input X
is multiplied by the weight W to produce the prediction, and W is adjusted until
the output is accurate [44,45]. The loss of each instance is then computed with the following formula:
$\mathrm{Loss} = (\hat{Y} - Y)^{2}$
Tables 7-9 list the prediction of a worker's salary for various W values.
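The calculation behind these tables can be sketched in Python. The data below are an assumption: four working times of 1 to 4 hours paid at 13 dollars per hour, which is consistent with the one-hour example above and reproduces the totals 4320 / 4 = 1080 and 1920 / 4 = 480 that appear in the tables. The function names are hypothetical.

```python
def predict(xs, w):
    """Hypothesis y_hat = x * w, applied to every input."""
    return [x * w for x in xs]


def mean_squared_error(xs, ys, w):
    """Average of the squared losses (y_hat - y)^2 over all instances."""
    losses = [(y_hat - y) ** 2 for y_hat, y in zip(predict(xs, w), ys)]
    return sum(losses) / len(losses)


# Assumed data: working time in hours and salary at 13 dollars per hour.
hours = [1, 2, 3, 4]
salary = [13, 26, 39, 52]

for w in (1, 21, 13):
    print(w, mean_squared_error(hours, salary, w))
# 1 1080.0
# 21 480.0
# 13 0.0
```

Under this assumption the error vanishes only when W equals the hourly rate, and it grows with the squared distance of W from that value.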
Tables 10(a) and (b) list the mean square error, or total loss, of each training run. The loss decreases
with every iteration. This example does not reach the condition in which the loss
equals 0; it only shows the total
loss decreasing in every training run. If L ≠ 0, more training is needed to reach
the goal. Fig. 3 shows different $\hat{Y}$ lines on the data points during the training process. The graph
illustrates loss minimization with simple mathematical computations.
The first guess W$^{1}$, set here to W$^{1}$=1, should
be fixed before training starts. For example, if there are 1000 or 10000 random
guesses, the computer can automatically provide a value and compute the subsequent
guesses W$^{n}$ for updating the output [6]. The weight W therefore has an important role in neural networks.
Three guesses of W, with the simple values 1, 2, and 3, are used here for easy understanding
of the loss function [45]. Table 11 shows three values of X and Y in each column. Three more columns
describe the output of the predictions needed to minimize the loss during
training. A straightforward mathematical computation was performed at the bottom using
the same values in the prediction part. The hypothesis $\hat{y}=x\star w$ was used
to compute the prediction for each training input x [8], and $\mathrm{Cost}=\sum (\hat{y}-y)^{2}$ was used as the cost to be minimized.
For example,
As shown in the above examples, the loss in the first iteration was 140 but 35 in
the second iteration.
Eventually, the loss approaches 0. Fig. 4 shows the loss for different possible lines. The red line indicates the gap between
the true line and the predicted line. The point where L = 0 in the graph is the
minimum of the loss function [45].
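The comparison of several guesses of W can be sketched as a small sweep over candidate values. The three data points below are an assumption (inputs 1, 3, 5 with targets y = 3x); they were chosen because they reproduce the costs of 140 and 35 quoted above for the first two guesses. The function name `cost` is hypothetical.

```python
def cost(xs, ys, w):
    """Sum of squared errors for the hypothesis y_hat = x * w."""
    return sum((x * w - y) ** 2 for x, y in zip(xs, ys))


xs = [1, 3, 5]     # assumed inputs
ys = [3, 9, 15]    # assumed targets, y = 3x

# Evaluate the cost for each guess of W and keep the guess with the lowest cost.
costs = {w: cost(xs, ys, w) for w in (1, 2, 3)}
best_w = min(costs, key=costs.get)
print(costs)   # {1: 140, 2: 35, 3: 0}
print(best_w)  # 3
```

Plotting `costs` against W gives a curve with its lowest point at the guess whose line passes through every data point.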
Table 7.
Working time (X)
Salary of Worker (Y)
Ŷ (W=1)
Loss (W=1)
MSE = 4320 / 4 = 1080
Table 8
Working time (X)
Salary of Worker (Y)
Ŷ (W=13)
Loss (W=13)
MSE = 0
Table 9
Working time (X)
Salary of Worker (Y)
Ŷ (W=21)
Loss (W=21)
MSE = 1920 / 4 = 480
Fig. 3. Several random guesses of w and $\hat{y}$ to approach the true line.
Fig. 4. Loss for each line.
3.3 Gradient Descent and Backpropagation
Gradient descent is an iterative optimization algorithm for finding the minimum of
the cost function [46], as shown in Fig. 5. This section reviews the concept of gradient descent with simple calculations
on the cost function, the sum of squared errors (SSE). The equation of SSE
is given below:
$\mathrm{SSE} = \sum (\hat{y} - y)^{2}$
Table 12 explains the cost calculation on the given data.
For the first training example, the predictions $\hat{Y}$ are needed to calculate
the loss function [46].
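For the hypothesis $\hat{y}=x\star w$ and the cost $\mathrm{Cost}=\sum (\hat{y}-y)^{2}$ used in Section 3.2, the gradient that GD needs follows from the chain rule. The derivation below is a sketch under the assumption of a single weight and no bias; the index $i$ over the training examples is introduced here for illustration only.

```latex
\begin{align}
\mathrm{Cost}(w) &= \sum_{i} (\hat{y}_i - y_i)^{2} = \sum_{i} (x_i w - y_i)^{2} \\
\frac{\partial\, \mathrm{Cost}}{\partial w} &= \sum_{i} 2\,(x_i w - y_i)\,x_i
\end{align}
```

Setting this derivative to zero gives $w = \sum_i x_i y_i / \sum_i x_i^{2}$, the weight at which the cost is smallest for this one-parameter model.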
Fig. 5. Gradient Descent.
Table 10
Working time (X)
MSE = 1080
MSE = 480
MSE = 67.5
Table 11
The gradient of the cost function is calculated with respect to all the parameters
(weights and biases) to search for the minimum of the loss [47]. The update equation is given below:
$\omega \leftarrow \omega - \alpha \frac{\partial L}{\partial \omega}$
where $\frac{\partial L}{\partial \omega}$ is the derivative of the loss with respect to the weight parameter ${\omega}$, and
${\alpha}$ is the learning rate that controls the speed of convergence. The weight
values are updated by subtracting the derivative of the loss function scaled by the learning rate [48]. If there are three weight values
in the network, as shown in Fig. 6, they are updated as follows:
$\omega_{i} \leftarrow \omega_{i} - \alpha \frac{\partial L}{\partial \omega_{i}}, \quad i = 1, 2, 3$
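The weight update described above can be sketched for a one-weight model in Python. This is an illustration under stated assumptions, not the experiment of this paper: the model is $\hat{y}=x\star w$ with a sum-of-squared-errors cost, the data points are hypothetical (y = 3x), and the function names, the starting weight, and the number of steps are chosen for the example. The learning rate of 0.01 matches the value reported for the experiment.

```python
def gradient(xs, ys, w):
    """Derivative of sum((x*w - y)^2) with respect to w."""
    return sum(2 * (x * w - y) * x for x, y in zip(xs, ys))


def gradient_descent(xs, ys, w=1.0, learning_rate=0.01, steps=100):
    """Repeatedly move w against the gradient of the cost."""
    for _ in range(steps):
        w = w - learning_rate * gradient(xs, ys, w)
    return w


xs = [1, 3, 5]     # assumed inputs
ys = [3, 9, 15]    # assumed targets, y = 3x
w = gradient_descent(xs, ys)
print(round(w, 4))  # 3.0
```

With these values each step shrinks the distance between w and 3 by a constant factor, so the weight settles on the slope of the data. A learning rate that is too large would instead make the updates overshoot and diverge.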
We used 10000 images for testing and 50000 images for training over 10 epochs.
The learning rate was set to 0.01. The total number of iterations was 1000, and
the accuracy was approximately 95%, as shown in Fig. 7.
Fig. 7. Backpropagation and loss minimization.
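To make the backward pass concrete, the sketch below applies the chain rule to a very small network and then trains it with GD. It is not the network used for the experiment above: the architecture (one input, one sigmoid hidden neuron, one linear output), the initial parameters, the single training example, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the hidden activation h and the prediction y_hat."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    return h, w2 * h + b2


def loss(params, x, y):
    """Squared error of the prediction for one example."""
    return (forward(params, x)[1] - y) ** 2


def backward(params, x, y):
    """Gradients of the squared loss via the chain rule, from output to input."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_yhat = 2.0 * (y_hat - y)           # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat      # output-layer parameters
    d_z = d_yhat * w2 * h * (1.0 - h)    # propagated back through the sigmoid
    d_w1, d_b1 = d_z * x, d_z            # hidden-layer parameters
    return [d_w1, d_b1, d_w2, d_b2]


params = [0.5, 0.0, -0.3, 0.1]   # hypothetical initial weights and biases
x, y = 1.0, 1.0                  # one hypothetical training example
learning_rate = 0.1

initial_loss = loss(params, x, y)
for _ in range(200):
    grads = backward(params, x, y)
    params = [p - learning_rate * g for p, g in zip(params, grads)]
final_loss = loss(params, x, y)
print(initial_loss, final_loss)
```

Each gradient in `backward` reuses the one computed for the layer after it, which is the property that lets backpropagation scale to networks with many layers.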
The objective of this article was to explain ANN functionalities, such as the cost
function, gradient calculation, and backpropagation. An extensive explanation of ANNs
was given for novices in DL, with the aim of providing a straightforward clarification
of how they work. In future work, we will adopt the same approach to explain other
deep learning algorithms, such as CNN, RNN, DNN, and their latest versions.
This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korean government (MSIT) (No. RS-2022-00166267).
M. Majumder, “Artificial Neural Network,” pp. 49-54, 2015.

“Loss function - Wikipedia.” (accessed Mar. 29, 2023).

“Gradient descent - Wikipedia.” (accessed Mar. 30, 2023).

“Backpropagation - Wikipedia.” (accessed Mar. 30, 2023).

S. A. Kalogirou, “Applications of artificial neural-networks for energy systems,”
Appl. Energy, vol. 67, no. 1-2, pp. 17-35, 2000.

L. da Fontoura Costa and G. Travieso, “Fundamentals of neural networks,” Neurocomputing,
vol. 10, no. 2, pp. 205-207, 1996.

G. Inghelbrecht, R. Pintelon, and K. Barbe, “Large-Scale Regression: A Partition Analysis
of the Least Squares Multisplitting,” IEEE Trans. Instrum. Meas., vol. 69, no. 6,
pp. 2635-2647, 2020.

L. Wang and C. Pan, “Groupwise Retargeted Least-Squares Regression,” IEEE Trans. Neural
Networks Learn. Syst., vol. 29, no. 4, pp. 1352-1358, 2018.

N. Dengen, Haviluddin, L. Andriyani, M. Wati, E. Budiman, and F. Alameka, “Medicine
Stock Forecasting Using Least Square Method,” Proc. - 2nd East Indones. Conf. Comput.
Inf. Technol. Internet Things Ind. EIConCIT 2018, no. Ci, pp. 100-103, 2018.

S. Zhao, B. Zhang, and S. Li, “Discriminant and Sparsity Based Least Squares Regression
with l1 Regularization for Feature Representation,” ICASSP, IEEE Int. Conf. Acoust.
Speech Signal Process. - Proc., vol. 2020-May, pp. 1504-1508, 2020.

R. O. Ogundokun, A. F. Lukman, G. B. M. Kibria, J. B. Awotunde, and B. B. Aladeitan,
“Predictive modelling of COVID-19 confirmed cases in Nigeria,” Infect. Dis. Model.,
vol. 5, pp. 543-548, 2020.

S. Rath, A. Tripathy, and A. Ranjan, “Since January 2020 Elsevier has created a COVID-19
resource centre with free information in English and Mandarin on the novel coronavirus
COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company’s
public news and information,” no. January, 2020.

L. Qin et al., “Prediction of number of cases of 2019 novel coronavirus (COVID-19)
using social media search index,” Int. J. Environ. Res. Public Health, vol. 17, no.
7, 2020.

R. Gan, J. Tan, L. Mo, Y. Li, and D. Huang, “Using Partial Least Squares Regression
to Fit Small Data of H7N9 Incidence Based on the Baidu Index,” IEEE Access, vol. 8,
pp. 60392-60400, 2020.

K. Jampachaisri, K. Tinochai, S. Sukparungsee, and Y. Areepong, “Empirical bayes based
on squared error loss and precautionary loss functions in sequential sampling plan,”
IEEE Access, vol. 8, pp. 51460-51469, 2020.

P. Anand and A. Bharti, “A combined reward-penalty loss function based extreme learning
machine for binary classification,” 2019 2nd Int. Conf. Adv. Comput. Commun. Paradig.
ICACCP 2019, 2019.

S. Ma, D. Li, T. Hu, Y. Xing, Z. Yang, and W. Nai, “Huber Loss Function Based on Variable
Step Beetle Antennae Search Algorithm with Gaussian Direction,” Proc. - 2020 12th
Int. Conf. Intell. Human-Machine Syst. Cybern. IHMSC 2020, vol. 1, pp. 248-251, 2020.

B. Sung Lee, R. Phattharaphon, S. Yean, J. Liu, and M. Shakya, “Euclidean Distance
based Loss Function for Eye-Gaze Estimation,” 2020 IEEE Sensors Appl. Symp. SAS 2020
- Proc., 2020.

T. H. Phan and K. Yamamoto, “Resolving Class Imbalance in Object Detection with Weighted
Cross Entropy Losses,” arXiv, 2020.

D. Rengasamy, B. Rothwell, and G. P. Figueredo, “Asymmetric Loss Functions for Deep
Learning Early Predictions of Remaining Useful Life in Aerospace Gas Turbine Engines,”
Proc. Int. Jt. Conf. Neural Networks, 2020.

A. Gomez-Alanis, J. A. Gonzalez-Lopez, and A. M. Peinado, “A Kernel Density Estimation
Based Loss Function and its Application to ASV-Spoofing Detection,” IEEE Access, vol.
8, no. i, pp. 108530-108543, 2020.

L. Xu, X. Zhou, X. Lin, Y. Ren, Y. Qin, and J. Liu, “A New Loss Function for Traffic
Classification Task on Dramatic Imbalanced Datasets,” IEEE Int. Conf. Commun., vol.
2020-June, 2020.

H. Seo, M. Bassenne, and L. Xing, “Closing the Gap between Deep Neural Network Modeling
and Biomedical Decision-Making Metrics in Segmentation via Adaptive Loss Functions,”
IEEE Trans. Med. Imaging, vol. 40, no. 2, pp. 585-593, 2021.

N. Zhang et al., “Robust T-S Fuzzy Model Identification Approach Based on FCRM Algorithm
and L1-Norm Loss Function,” IEEE Access, vol. 8, pp. 33792-33805, 2020.

Z. Li, J. F. Cai, and K. Wei, “Towards the optimal construction of a loss function
without spurious local minima for solving quadratic equations,” arXiv, vol. 66, no.
5, pp. 3242-3260, 2018.

D. Zou, Y. Cao, D. Zhou, and Q. Gu, “Gradient descent optimizes over-parameterized
deep ReLU networks,” Mach. Learn., vol. 109, no. 3, pp. 467-492, 2020.

J. Flynn et al., “Deepview: View synthesis with learned gradient descent,” Proc. IEEE
Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 2362-2371, 2019.

J. Lee et al., “Wide neural networks of any depth evolve as linear models under gradient
descent,” J. Stat. Mech. Theory Exp., vol. 2020, no. 12, 2020.

T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave
minimax problems,” arXiv, 2019.

M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic
gradient descent over-the-air,” arXiv, vol. 68, pp. 2155-2169, 2019.

S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans, “Superpolynomial Lower
Bounds for Learning One-Layer Neural Networks using Gradient Descent,” arXiv, 2020.

E. Amid, M. K. Warmuth, J. Abernethy, and S. Agarwal, “Winnowing with Gradient Descent,”
Proc. Mach. Learn. Res., vol. 125, pp. 1-20, 2020.

M. Farajtabar, N. Azizan, A. Mott, and A. Li, “Orthogonal gradient descent for continual
learning,” arXiv, vol. 108, 2019.

M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably
robust to label noise for overparameterized neural networks,” arXiv, 2019.

C. Cheng, N. Emirov, and Q. Sun, “Preconditioned gradient descent algorithm for inverse
filtering on spatially distributed networks,” arXiv, vol. 27, pp. 1834-1838, 2020.

W. Sun and C. Huang, “A carbon price prediction model based on secondary decomposition
algorithm and optimized back propagation neural network,” J. Clean. Prod., vol. 243,
p. 118671, 2020.

A. Mukherjee, D. K. Jain, P. Goswami, Q. Xin, L. Yang, and J. J. P. C. Rodrigues,
“Back Propagation Neural Network Based Cluster Head Identification in MIMO Sensor
Networks for Intelligent Transportation Systems,” IEEE Access, vol. 8, pp. 28524-28532, 2020.

L. Wang, P. Wang, S. Liang, Y. Zhu, J. Khan, and S. Fang, “Monitoring maize growth
on the North China Plain using a hybrid genetic algorithm-based back-propagation neural
network model,” Comput. Electron. Agric., vol. 170, no. 46, p. 105238, 2020.

Z. Jie and M. Qiurui, “Establishing a Genetic Algorithm-Back Propagation model to
predict the pressure of girdles and to determine the model function,” Text. Res. J.,
vol. 90, no. 21-22, pp. 2564-2578, 2020.

W. Yang, X. Liu, K. Wang, J. Hu, G. Geng, and J. Feng, “Sex determination of three-dimensional
skull based on improved backpropagation neural network,” Comput. Math. Methods Med.,
vol. 2019, 2019.

L. P. Huelsman, for Engineers, no. November. McGraw-Hill Science/Engineering/Math,

“Analysis of the vulnerability estimation and neighbor value prediction in autonomous
systems,” Scientific Reports (accessed Mar. 30, 2023).

J. Brownlee, “Loss and Loss Functions for Training Deep Learning Neural Networks,”
Mach. Learn. Mastery, pp. 1-19, 2019.

H. D. Learning et al., “Perceptron,” pp. 1-9, 2020,

G. C. Mqef, “Mathematics for,” Quant. Lit. Why Numer. matters Sch., no. c, pp. 533-540,

S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to
algorithms. 2013.

G. Lawson, “Maxima and minima,” Edinburgh Math. Notes, vol. 32, pp. xxii-xxiii, 1940.

L. Multipliers, “Paul's Online Notes Section 3-5: Lagrange Multipliers,” pp. 1-15,

Rahmatov Nematullo is currently a postdoctoral researcher at Kyungpook National
University. He received his Ph.D. degree from the Department of Computer Science and
Engineering at Kyungpook National University, Daegu, South Korea, in 2022. His research
interests include artificial intelligence, machine and deep learning, natural language
processing, and neural machine translation. His future research plan covers sentence modeling
and machine translation based on novel deep learning algorithms. He was a recipient
of a Computer Science and Engineering award in 2019 from the Department of Computer Science
and Engineering, Kyungpook National University.

Hoki Baek received his B.S., M.S., and Ph.D. degrees from the Department of Computer Science
at Ajou University in Suwon, South Korea, in 2006, 2008, and 2014, respectively.
From March 2014 to February 2015, he served as a full-time researcher at Ajou University's
Jangwee Defense Research Institute, and from March 2015 to February 2021, he was a
Lecture Professor in the Department of Military and Digital Convergence at Ajou University.
Currently, he is an Assistant Professor in the School of Computer Science and Engineering
at Kyungpook National University. He is a life member and a director of the Korean
Institute of Communications and Information Sciences (KICS), and is an editorial board
member for the Journal of the Korean Institute of Communications and Information Sciences
(J-KICS). He is a member of the Defense Information Technology Standards (DITA) Standard
Working Group (SWG) for the Ministry of National Defense, serving from June 2020 to
May 2024. His research interests include 5G/6G communications and networks, UAV networks,
Wi-Fi, IoT, military communications and networks, and positioning and time synchronization.