YoshimuraUkyo
InoueToshiyuki
TsuchiyaAkira
KishineKeiji
-
(Department of Electronic Systems Engineering, The University of Shiga Prefecture /
2500 Hassaka, Hikone, Shiga 522-8533, Japan oi23uyoshimura@ec.usp.ac.jp, {inoue.t,
tsuchiya.a, kishine.k}@e.usp.ac.jp )
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
SoC design and applications, Neural chips, High-level synthesis, Bioinformatics, Radar and sonar signal processing
1. Introduction
Recently, systems that identify vital signs such as the heartbeat have been developed
for early detection of diseases and biometric identification. Electrocardiograms (ECGs)
are commonly used to obtain heart-rate information in medical institutions, but they
are not suitable for application to embedded systems because they are expensive, and
their use is restricted due to the electrodes and cables required. Heartrate can also
be measured without touching the human body by using a microwave sensor [1-3]. By embedding a system using microwave sensors in walls, ceilings, or seats, stress-free
health monitoring and biometric identification can be achieved.
Biometric systems require regression and classification, which are done using
machine learning. Machine learning methods are effective for regression and classification,
especially deep learning methods, which use multiple hidden layers in a neural network
(NN) [4-15]. There are mainly two types of deep learning: convolutional neural networks (CNNs)
[4-8] and recurrent neural networks (RNNs) [9-13]. CNNs extract local features of images or time-series data by filtering and processing
neighbor pixels or electric signals. Therefore, it is difficult for CNNs to deal with
a change of signals in the long term. In general, CNNs are used for image recognition
and object detection [6].
RNNs have feedback loops in an NN, and the past data affect the present data.
Therefore, RNNs are suitable for regression and classification of time-series data.
The long short-term memory (LSTM) method [12,13] is a type of RNN that improves the performance for long-term data such as heartbeats,
and classification of ECG signals using LSTM has been reported [13]. Recently, complementary NNs combining CNNs with RNNs have been developed to overcome
the drawback of CNNs [14,15]. The complementary NNs require more hardware resources compared with a simple CNN
or RNN. Therefore, a simple LSTM is suitable for the NN in embedded systems from the
viewpoint of the scale of the circuits and systems.
Machine learning methods, including LSTM, consist of training and inference phases.
In the training phase, weights and biases of an NN are determined using data for training.
The weights and bias values are stored after the training phase. In the inference
phase, inference results for unknown input data are output in accordance with the
weights and biases. Since training takes a long time, the training phase is done with
a large-scale computer before the inference phase. However, the inference process
needs to be fast so that time-series data sampled at high frequency can be processed
and the identification can be executed in real time, especially in biometric identification.
In addition, in order to embed biometric identification systems in various places,
systems with small size and low power consumption are desired.
Fig. 1. NN model with LSTM and dense layers ($\textit{M}$ = 1).
In general, CPUs and GPUs with multiple cores and high clock frequencies are used
to reduce the computation time for machine learning. However, both CPUs and GPUs are
large and typically consume high power (~100 W). A field-programmable gate array (FPGA)-based
LSTM method has the potential to reduce the inference time and power consumption compared
to CPU/GPU-based ones because a specialized circuit for LSTM that operates with low
frequency is installed inside the FPGA [16,17]. Therefore, implementing only the inference phase in an FPGA after performing the
training phase can be a better solution for embedded biometric systems.
While parallelizing a large number of calculations is effective for reducing the
computation time, more hardware resources are required for the parallelization. Fixed-point
arithmetic can reduce the computation time and hardware resource utilization in FPGA-based
LSTM [17]. However, it is not suitable for biometric identification because it causes degradation
of the computation accuracy compared with floating-point arithmetic.
In earlier work, we proposed algorithms for low-energy floating-point LSTM for
the regression of microwave-sensor signals and implemented them in a small-scale FPGA
[18]. In that report, the methodology and preliminary experimental results were described
briefly. In this paper, we present our algorithms with simplifying activation functions
and parallelized, pipelined, and unrolled processing in more detail. Experiments were
done to compare the computation time, power consumption, and hardware-resource utilization
with each algorithm. The results showed that the computation time of the proposed
algorithm combining parallel, pipelined, and unrolled processing was reduced by 95%
compared to that with the sequential algorithm. In addition, the amount of energy
consumption with the proposed algorithm was reduced by 92% and 91% compared to that
with a high-end GPU and CPU, respectively.
In Section 2, we explain why LSTM is suitable for time-series data, such as the
signals of a microwave sensor, and present its mathematical expression. Next, we describe
simplified activation functions and the improvement of the algorithms of LSTM to reduce
the resource utilization and accelerate the calculation in Section 3. Then, we present
the experimental results in Section 4. Finally, we discuss the results of the experiment
in Section 5 before concluding the paper in Section 6.
2. Mathematical Expression for LSTM
2.1 LSTM and Dense Layers
A basic NN model consisting of LSTM and dense layers is shown in Fig. 1. The notation $x_{t}(t=1,2,\ldots ,N)$ refers to the input signal at $t$, while $h_{N}$
is the output of the $N$-th cell, and $N$ is the number of input signals. The matrix
size of $h_{t}$ is ($L_{h}$, 1), where $L_{h}$ is the number of components in a cell
(hidden layer). The cell output determined by the input data is transferred to the
next cell. In other words, the output data $h_{N}$ is determined from both the past
output data $h_{1}$ to $h_{N-1}$ and the input data. Therefore, LSTM is suitable for
regression of time-series data.
The output layer is called a dense layer. The output data of the dense layer
$y_{M}$ are determined by the output data from the LSTM layer $h_{N}$ and the weights
$W_{M}^{d}$ biases and $b^{d}$ for the dense layer, where \textit{M} is the number
of the final outputs. When $M=1$, this model works for regression, and $y_{1}$ can
be expressed as:
where the matrix size of $W_{1}^{d}$ is ($L_{h}$, 1), and that of $b_{1}^{d}$
is (1, 1). When $M$${\geq}$ 2, this model works as a classifier.
2.2 LSTM Cells
The cell structure of a standard LSTM is shown in Fig. 2. The LSTM cell has four gates: forget, input, cell, and output gates. Each gate consists
of an activation function that is either a sigmoid ($\sigma $) or hyperbolic tangent
(tanh) function, two weight matrices $W$ and $U$, and a bias$~ b.$ The gate outputs
$i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$, and $o_{t}$ and the cell outputs $h_{t}$ and
$C_{t}$ can be expressed as:
Fig. 2. Architecture of LSTM cell.
Fig. 3. Comparison of calculation results of sigmoid and hard-sigmoid functions.
First, $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$, and $o_{t}$ are calculated from
Eq. (2) to (5). Next, the cell state $C_{t}$ is determined using $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$,
and $C_{t-1}$ from Eq. (6). Finally, output $h_{t}$ is determined using $C_{t}$ and $o_{t}$ from Eq. (7). The calculation of Eq. (2) to (7) is carried out by matrix calculation. The size of each matrix is determined by $L_{h}$
in each gate. Since $x_{t}$ is one-dimensional, the sizes of the weight and bias matrices
are as follows: ($L_{h}$, 1) for $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$,$o_{t}$, $C_{t}$,
and $b^{i,f,g,o}$; (1, $L_{h}$) for $W^{i,f,g,o}$; and ($L_{h}$, $L_{h}$) for $U^{i,f,g,o}$.
Each gate from Eq. (2) to (5) has two types of activation functions: a sigmoid function $\sigma $ and hyperbolic
tangent (tanh).
3. Method of Implementation for Accelerating FPGA-based LSTM
Based on the equations presented in Section 2, we reviewed the extent to which
the algorithms can decrease the computation time in an FPGA-based LSTM. We also examined
the amount of resource utilization with these algorithms. First, we explain how to
simplify the activation functions. Then, we present methods for accelerating the calculation
for inference in an FPGA-based LSTM.
3.1 Simplifying Activation Functions
The LSTM cells have two types of activation functions: sigmoid and hyperbolic
tangent functions. We simplified the activation functions to decrease the computation
time and reduce the amount of resource utilization. The normal sigmoid function $\sigma
$ can be expressed as:
There is an exponential calculation and division in the sigmoid function that
require more complex computation than addition or multiplication. As an alternative,
the Keras library for NNs in Python [19] provides a hard sigmoid function that expresses the sigmoid function as a piecewise
linear approximation. The hard sigmoid function can be carried out using only multiplication
and addition, which has the advantage of reducing the amount of resource utilization
in an FPGA. The equation for the hard-sigmoid function $\sigma _{h}$ is given as follows:
The calculation results of the sigmoid and hard sigmoid functions are shown in
Fig. 3. The hard sigmoid function has the smallest squared error with the sigmoid. Also,
since the hard sigmoid function is available from the Keras library, it is possible
to use it in the training phase. Therefore, the computation results are the same in
both inference and training.
The hyperbolic tangent can be expressed as:
Since this function has four exponential calculations, the computation time and
the amount of resource utilization are increased. Therefore, we implemented a simplified
hyperbolic tangent as follows:
Here, we set variables $a$ and $b$ to $e^{x}$ and $e^{-x}$, respectively. This
reduces the amount of exponential arithmetic because the results of the exponential
arithmetic are stored in the variables.
3.2 Sequential Algorithm
A straightforward algorithm is a sequential algorithm in which Eq. (2) to (7) are calculated sequentially, as shown in Fig. 4. The operations of Eq. (2) to (5) require calculation of a product including the two-dimensional matrix $h_{t-1}U$,
so the number of for loops in each operation is $L_{h}^{2}$. The number of loops in
calculating Eqs. (6) or (7) is composed of only a one-dimensional matrix $\mathrm{L}_{\mathrm{h}}.$
Hence, the total number of loops of the LSTM cell is $\left(4L_{h}^{2}+2L_{h}\right)$.
Fig. 4. Flow of sequential algorithm and number of loops.
Fig. 5. Flow of parallel algorithm and number of loops.
3.3 Parallel and Pipelined Algorithm
A parallel algorithm can take full advantage of the characteristics of an FPGA.
As mentioned, the number of required loops is $L_{h}^{2}$ for Eq. (2) to (5). Since the calculation of these equations is carried out independently, the operation
can be parallelized, as shown in Fig. 5. More specifically, after the product of $h_{t-1}U$ and a sum of $x_{t}W$ and $\textit{b}$
are parallelized and calculated, the calculation of each function ($\sigma $ or tanh)
is carried out in parallel. However, the calculation of Eqs. (6) and (7) cannot be
parallelized because the result of Eq. (6) is used in Eq. (7). Therefore, the total number of loops with the parallel algorithm is $\left(L_{h}^{2}+2L_{h}\right)$.
The ratio of the total number of loops with the parallel algorithm to that with
the sequential algorithm can be expressed as:
As $L_{h}$ increases, the ratio approaches 1/4. Since there is a linear relationship
between the number of loops and the computation time, reducing the number of loops
results in shorter computation time. Assuming that the time required for one loop
is the same in the operations of Eq. (2) to (7), the computation time with the parallel algorithm can become approximately 1/4 that
obtained with the sequential algorithm. In the parallel algorithm, the order of operations
is different from that in the sequential algorithm, but the operations and parameters
are not changed. Therefore, the amounts of resource utilization are almost the same.
In addition to the parallel algorithm, pipelining is effective for reducing computation
time. In pipelining, the next calculation starts before the present calculation is
completed. As shown in Fig. 6, pipelining was applied to the calculation of Eqs. (6) and (7) in parallel-pipelined
algorithm A (PP-A). In parallel-pipelined algorithm B (PP-B), pipelining the calculation
of $h_{t-1}U$ in Eq. (2) to (5) is applied to PP-A. However, pipelining increases the amount of resource utilization
because circuit blocks for both the present and next operations are required.
Fig. 6. Flow of PP-A, PP-B, and PPU and number of loops. The variables of for loops
are $i(i=0,2,\ldots ,L_{h})$ and $j(j=0,1,\ldots ,L_{h})$.
Fig. 7. Diagram of experimental system for evaluation.
3.4 Proposed Parallel-pipelined Unrolled Algorithm
Decreasing the number of loops is also effective for reducing the computation
time. Therefore, we propose a parallel-pipelined and unrolled algorithm (PPU), where
unrolling is applied to PP-B. The result of the present loop is not used in the next
loop, so the loops can be unrolled and executed simultaneously. Fig. 6 also shows the flow of PPU in the case of factor $\textit{F}$ = 2 (the number of
unrolled loops). The computation time is halved, but the resource utilization is doubled.
4. Experiment and Results
We performed an experiment to evaluate the computation time, resource utilization,
and power consumption with each algorithm in an FPGA. In this section, we present
the experimental procedure and results of the proposed algorithm on an FPGA and a
high-end GPU and CPU.
4.1 Implementation of Proposed System
An NN model using LSTM and dense layers was implemented to regress microwave-sensor
signals. The number of inputs for the time-series data was set to $N=50$, the number
of components in a cell was set to $L_{h}=128$, and the number of outputs for dense
layers was set to $M=1$. First, the NN model was constructed using the Keras library
in Python (backend: TensorFlow). After that, weights and biases were determined in
the training phase on a PC with a GPU.
As a result of training, the mean absolute error (MAE) and root mean squared
error (RMSE) for the training data (1949 sets) were 0.069 and 0.101, respectively.
The MAE and RMSE for the test data (1949 sets) were 0.100 and 0.143, respectively.
Next, we developed a C-based program of the LSTM model for the FPGA-based inference
using the proposed algorithms. All variables were set to a 32-bit floating-point type,
and the weights and biases were preserved in the BRAM. The program was then converted
into an HDL-based program using Xilinx Vivado HLS 2019.2.
Xilinx Vivado 2019.2 was used to implement the experimental system, including
the FPGA-based NN model shown in Fig. 7. The clock frequency of the FPGA was set to 100 MHz. The resource utilization and
power consumption were confirmed by the implementation report in the software. A ZYBO
Z7-20 was used in the experiments, which is an FPGA board based on ZYNQ-7020 and includes
a CPU (microcontroller) and FPGA. The CPU sends the input data of the NN model and
receives its output data with the FPGA via AXI4-Lite. We also implemented the NN model
using a ZYBO Z7-20 (ZYBO CPU), a CPU (optimization option: -O3) in C language, an
NVIDIA RTX 2080 Ti (GPU), and an Intel Core i7 9700 (Intel CPU) with the Keras library
in Python to compare the computation time and power consumption with the FPGA-based
model.
4.2 Effect of Simplifying Activation Functions
We compared the experimental results of the computation time and resource utilization
with sequential algorithms using normal and simplified activation functions. The comparison
of computation time and power consumption is shown in Table 1. The simplified type reduced the power consumption by 5% and the computation time
by 3% compared to the normal one. The amount of energy consumption (power$~ \times
~ $time) of the simplified type was also reduced by 8% compared to the normal one.
The results of the hardware-resource utilization are shown in Table 2. The resource utilization became about half as large as that of the normal type except
for BRAM. These results demonstrate that simplifying the activation function is effective
in reducing both resource consumption and computation time.
Table 1. Computation time and power consumption.
Type
|
Power [W]
|
Time [ms]
|
Energy [mJ]
|
Normal
|
1.909
|
381.4
|
728.1
|
Simplified
|
1.801
|
371.5
|
669.1
|
Table 2. Comparison of resource utilization in sequential algorithms. Usage rates
in ZYBO Z7-20 are shown.
Type
|
LUT
|
LUTRAM
|
FF
|
BRAM
|
DSP
|
Normal
|
9108
|
258
|
8408
|
71.5
|
55
|
17.1%
|
1.48%
|
7.9%
|
51.0%
|
25%
|
Simplified
|
4464
|
55
|
3853
|
68.5
|
38
|
8.4%
|
0.26%
|
3.6%
|
48.9%
|
17%
|
4.3 Comparison of Each Algorithm
Next, we compared the experimental results of the computation time and resource
utilization of the sequential, parallel, parallel pipeline (PP-A, PP-B), and parallel-pipeline
unrolled (PPU) algorithms. All the algorithms in this experiment used the simplified
activation function. The comparison of computation time and power consumption is shown
in Table 3.
The processing speed with the ZYBO CPU was 1.7 times as fast as that with the
FPGA using the sequential algorithm because the clock frequency of the CPU (667~MHz)
is higher than that of the FPGA (100 MHz), and the processing of the CPU is optimized
by the compiler. The use of the parallel algorithm reduced the computation time by
74% compared to the FPGA-based sequential algorithm. This result is in good agreement
with the theoretical value calculated from Eq. (8) in Section 3. The computation time of the parallel algorithm was reduced by 50% compared
to the ZYBO CPU-based sequential algorithm.
The amount of energy consumption (power$~ \times ~ $time) of the parallel algorithm
was reduced by 61% and 48% compared to the GPU and Intel CPU, respectively. The computation
time of PP-A was reduced by about 3% compared to the parallel algorithm because the
ratio of pipeline operations to the total number of operations is small. In PP-B,
the computation time decreased by 52% compared to PP-A. This happened because the
computation of the product of $h_{t-1}U$ occupies a large part of the calculation
for LSTM computation.
In PPU, the computation time was reduced by 50% compared to PP-B. This result
is in good agreement with the theoretical value in Section 3. By using a higher clock
frequency (118 MHz), the computation time was reduced by 16% compared to the PPU using
a 100-MHz clock. As a result, the amount of energy consumption of the PPU was reduced
by 95% through the sequential algorithm. In addition, the amount of energy consumption
of the PPU was reduced by 92% and 91% compared to the GPU and Intel CPU, respectively.
The results of the hardware-resource utilization with each algorithm are shown
in Table 4. The resource utilization of the sequential without optimizing activation function
was twice that of the sequential function with the optimizing activation function.
This indicates that the resource utilization greatly depends on how the activation
function is implemented. The resource utilization of the parallel algorithm was slightly
lower than that of the sequential algorithm. The reason for the lower resource utilization
is that the synthesis is optimized in the software.
Table 3. Computation time and power consumption for each chip.
Chip
|
Algorithm
|
Power
[W]
|
Time
[ms]
|
Energy
[mJ]
|
FPGA
|
Sequential
|
1.801
|
371.5
|
669.1
|
Parallel
|
1.849
|
95.78
|
177.1
|
PP-A
|
1.83
|
92.92
|
170.0
|
PP-B
|
1.734
|
44.37
|
76.9
|
PPU
|
1.864
|
22.24
|
41.4
|
1.932*
|
18.76*
|
36.2*
|
Chip
|
Language
|
Power
[W]
|
Time
[ms]
|
Energy
[mJ]
|
GPU
|
Python
|
66.2
|
6.91
|
457.4
|
Intel CPU
|
Python
|
63.4
|
5.36
|
339.8
|
ZYBO CPU
|
C
|
1.4
|
213
|
298.2
|
* When clock frequency is 118 MHz
Table 4. Comparison of resource utilization for each algorithm. Usage rates in ZYBO
Z7-20 are shown.
Algorithm
|
LUT
|
LUTRAM
|
FF
|
BRAM
|
DSP
|
Sequential
|
4464
|
55
|
3853
|
68.50
|
38
|
8.39 %
|
0.32 %
|
3.6 %
|
48.9 %
|
17%
|
Parallel
|
4229
|
45
|
3701
|
67.5
|
34
|
7.95%
|
0.26%
|
3.5%
|
48.2%
|
16%
|
PP-A
|
4328
|
72
|
3735
|
68
|
34
|
8.14 %
|
0.41 %
|
3.5%
|
48.6 %
|
16%
|
PP-B
|
4399
|
73
|
4033
|
68
|
34
|
8.27 %
|
0.42 %
|
3.8%
|
48.6%
|
15%
|
PPU
|
8537
|
111
|
7218
|
74.5
|
68
|
16.1%
|
0.64 %
|
6.8%
|
53.2%
|
31%
|
PPU*
|
8557
|
111
|
7218
|
74.5
|
68
|
16.1%
|
0.64 %
|
6.8%
|
53.2%
|
31%
|
* When clock frequency is 118 MHz
The utilization of PP-B increased slightly compared to the utilization of the
parallel algorithm. The utilization of PPU was about twice as high as that of PP-B.
This result is in agreement with the theoretical value in Section 3. This is equivalent
to the resource utilization of the sequential algorithm. The high utilization rates
of BRAM in all algorithms are due to containing weights ($\textit{W}$ and $\textit{U}$)
and biases ($\textit{b}$) in a BRAM region. Only nine BRAM regions were needed to
calculate the NN model.
5. Discussion
In this section, we compare the performance of our system with that of previous
works [16, 17, 20]. First, we examined the number of floating-point operations per
second (flop/s) in PPU (118 MHz) as a performance index of the computation speed.
When a dataset of 50 points was input to the proposed NN consisting of the LSTM and
dense layers, the total number of the floating-point operations was 6.758 Mflop. The
computation time of PPU (118 MHz) was 18.76 ms, as shown in Table 3. Therefore, the performance index of the computation speed was 0.360~Gflop/s.
The comparison of implementations of the LSTM models on an FPGA is shown in Table 5, where [op] is a unit for the number of fixed-point operations. We compared the proposed
system using PPU with a previous system adopting the same floating-point arithmetic
[16]. We found that the energy efficiency of the previous system was about twice as high
as that of the proposed method, but the resource consumption of the LUT was 22 times
larger, that of the FF was 25 times larger, and that of the DSP was 17 times larger.
The high resource utilization and power consumption of the previous system are due
to the fact that the experiments were conducted with three layers of LSTM with 250
hidden layers. Therefore, even when we compared the resource utilization of the proposed
system with three-layer LSTM and the previous system, we found that the previous system
used more resources than ours.
The energy efficiency of our system with the PPU (118~MHz) was about 1.24 times
higher than that of the previous system [17], even though the clock frequency was about 1.2 times higher (142 MHz) and the calculation
was carried out with fixed-point arithmetic. Furthermore, the resource utilization
of FF in our system with PPU was smaller than in the previous system, and the resource
utilization of LUT and DSP was almost the same. The reason why the resource utilization
was slightly larger than the previous system’s is that the PPU performs floating-point
operations. Since the power consumption and the number of fixed-point bits are not
specified in the previous system [20], we cannot compare its efficiency with that of our system. However, the resource
utilization and operating frequency of the proposed system are lower than those of
the previous system, so we can conclude that the power consumption is greatly reduced.
Table 5. Comparison of implementations for LSTM models on FPGA.
Chip
|
T. W.
|
[17]
|
[16]
|
[20]
|
FPGA chip
|
Zynq
XC7Z020
|
Virtex7
XCVX485t
|
Zynq
XC7Z020
|
Zynq
XC7Z020
|
Frequency [MHz]
|
118
|
150
|
142
|
200
|
Precision [bit]
|
Float 32
|
Float 32
|
Fixed 16
|
Fixed -
|
Weight storage
|
on-chip
|
off-chip
|
off-chip
|
on-chip
|
LUT utilization #
|
8557
|
189871
|
7201
|
39851
|
FF utilization #
|
7218
|
181634
|
12960
|
54488
|
DSP utilization #
|
68
|
1176
|
50
|
206
|
Performance
index
|
0.360
Gflop/s
|
7.62
Gflop/s
|
0.284
Gop/s
|
20
Gop/s
|
Power [W]
|
1.932
|
19.63
|
1.942
|
-
|
Efficiency
|
0.186
Gflop/J
|
0.388
Gflop/J
|
0.146
Gop/J
|
-
|
6. Conclusion
In this paper, we proposed algorithms to improve the energy consumption of LSTM
by modifying the inference algorithm in an FPGA. Experimental results showed that
using our parallel algorithms could reduce the computation time by 74% compared to
the FPGA-based sequential algorithm without increasing resource utilization. In addition,
the amount of energy consumption with the proposed algorithm combining parallel, pipelined,
and unrolled processing was reduced by 95% compared to the sequential algorithm. Furthermore,
the amount of energy consumption with PPU was reduced by 92% and 91% compared to that
with a GPU and Intel CPU, respectively.
REFERENCES
Sakamoto T., Imasaka R., Taki H., Sato T., Yoshioka M., Inoue K., Fukuda T., Sakai
H., 2015, Accurate heartbeat monitoring using ultra-wideband radar, IEICE Electronics
Express, Vol. 12, No. 3, pp. 20141197
Droitcour A. D., Boric-Lubecke O., Lubecke V. M., Lin J., Kovacs G. T. A., 2004, Range
correlation and I/Q performance benefits in single-chip silicon doppler radars for
noncontact cardiopulmonary monitoring, IEEE Transactions on Microwave Theory and Techniques,
Vol. 52, No. 3, pp. 838-848
Droitcour A. D., Jan. 2006., Non-contact measurement of heart and respiration rates
with single chip microwave doppler radar
Jafferson A.-J., Ponnusamy V., Jovic J., Trajanovic M., June 2021, An IoT based cloud
EEG signal analytic framework for thought to text mapping, IEIE Transactions on Smart
Processing & Computing, Vol. 10, No. 3, pp. 183-188
Lee K., Kim Y., Oct. 2019, Hardware implementation of the simplified digital spiking
neural network on FPGA, IEIE Transactions on Smart Processing & Computing, Vol. 8,
No. 5, pp. 405-414
Nguyen X.-T., Nguyen T.-N., Lee H.-J., Kim H., Dec. 2020, An accurate weight binarization
scheme for CNN object detectors with two scaling factors, IEIE Transactions on Smart
Processing & Computing, Vol. 9, No. 6, pp. 497-503
Lee H.-B., Kim G., Kim J.-H., Kang Y.-S., Park C., June 2021, Optimal design of convolutional
neural network for EEG-based authentication, IEIE Transactions on Smart Processing
& Computing, Vol. 10, No. 3, pp. 199-203
Jhong S.-Y., Tseng P.-Y., Siriphockpirom N., Hsia C.-H., Huang M.-S., Hua K.-L., Chen
Y.-Y., 2020, An automated biometric identification system using CNN-based palm vein
recognition, in 2020 International Conference on Advanced Robotics and Intelligent
Systems (ARIS), pp. 1-6
Lynn H. M., Pan S. B., Kim P., Oct. 2019, A deep bidirectional GRU network model for
biometric electrocardiogram classification based on recurrent neural networks, IEEE
Access, Vol. 7, pp. 145395-145405
Singh S., Pandey S.K., Pawar U., Janghel R.R., 2018, Classification of ECG arrhythmia
using recurrent neural networks, Procedia Computer Science, Vol. 132, pp. 1290-1297
Salloum R., Kuo C.-C. J., 2017, ECG-based biometrics using recurrent neural networks,
in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 2062-2066
Hochreiter S., Schmidhuber J., Dec. 1997, Long short-term memory, Neural Computation,
Vol. 9, pp. 1735-1780
Saadatnejad S., Oveisi M., Hashemi M., 2020, LSTM-based ECG classification for continuous
monitoring on personal wearable devices, IEEE Journal of Biomedical and Health Informatics,
Vol. 24, No. 2, pp. 515-523
Han S., Lee W., Eom H., Kim J., Park C., Aug. 2020, Detection of arrhythmia using
1D convolution neural network with LSTM model, IEIE Transactions on Smart Processing
& Computing, Vol. 9, No. 4, pp. 261-265
Yang C., Jiang W., Guo Z., Nov. 2019, Time series data classification based on dual
path CNN-RNN cascade network, IEEE Access, Vol. 7, pp. 155304-155312
Guan Y., Yuan Z., Sun G., Cong J., 2017, FPGA-based accelerator for long short-term
memory recurrent neural networks, in 2017 22nd Asia and South Pacific Design Automation
Conference (ASP-DAC), Vol. , No. , pp. 629-634
Chang A. X. M., Martini B., Culurciello E., Nov. 2015, Recurrent neural networks hardware
implementation on FPGA, arXiv e-prints, arXiv:1511.05552
Yoshimura U., Inoue T., Tsuchiya A., Kishine K., Jan. 2021, Implementation of low-energy
LSTM with parallel and pipelined algorithm in small-scale FPGA, in 2021 International
Conference on Electronics, Information, and Communication (ICEIC 2021), Jeju, South
Korea, pp. 114-117.
Chollet F., 2015, Keras
Hao Y., Quigley S., Oct. 2017, The implementation of a deep recurrent neural network
language model on a Xilinx FPGA, arXiv e-prints, arXiv:1710.10296
Author
Ukyo Yoshimura received the B.E. degree in Electronic Systems Engi-neering from
the University of Shiga Prefecture, Shiga, Japan in 2019. Since the same year, he
has enrolled a master's course Graduate school of Engineering in the University of
Shiga Prefecture. His research interests include FPGA based systems, embedded systems,
and machine learning.
Toshiyuki Inoue received the B.S., M.S. and Ph.D. degrees in Electrical Electronic
and Information Engi-neering from Osaka University, Osaka, Japan in 2010, 2012 and
2015, respectively. He joined the Department of Electronic Systems Engineering, the
University of Shiga Prefecture, in 2017, and has been an Assistant Professor since
2017. His research interests include RF circuits for wireless communication, wireless
sensor networks, radio-over-fiber technique and optoelectronics. Dr. Inoue is a member
of the IEEE, the Institute of Electronics, Information and Communication Engineers
(IEICE) of Japan, and the Japan Society of Applied Physics (JSAP). He received the
Paper Award in 2013 from IEICE.
Akira Tsuchiya received the B.E., M.E. and Ph.D. degrees in Communi-cations and
Computer Engineering from Kyoto University, Kyoto, Japan, in 2001, 2003, and 2005,
respectively. Since 2005, he has been an Assistant Professor in the Department of
Communications and Computer Engineering, Graduate School of Informatics, Kyoto university.
Since 2017, he has been an Associate Professor in the Department of Electronic Systems
Engineering, the University of Shiga Prefecture, Shiga, Japan. His research interest
includes modeling and design of on-chip passive components of high-frequency CMOS,
and high-speed analog circuit design. He is a member of the IEEE, IEICE and IPSJ.
Keiji Kishine received the B.S., and M.S. degrees in engineering science from Kyoto
University, Kyoto, Japan, and Ph.D. degree in informatics from Kyoto University, Kyoto,
Japan, in 1990, 1992, and 2006, respectively. In 1992, he joined the Electrical Communication
Laboratories, Nippon Telegraph and Telephone Corporation (NTT), Tokyo, Japan. He has
been engaged in research and design of high-speed, low-power circuits for Gb/s LSIs
using Si-bipolar transistors, with application to optical communication systems in
NTT System Electronics Laboratories, Kanagawa, Japan. From 1997, he has been worked
on research and development of over Gb/s Clock and Data Recovery IC at Network Service
Innovation Laboratory in NTT Network Innovation Laboratories, Kanagawa, Japan. He
worked at Ubiquitous Interface Laboratory in NTT Microsystems Integration Laboratories,
Kanagawa, Japan. Since 2008, he had been an Associate Professor, and now he is a Professor
with the school of engineering, the University of Shiga prefecture, Shiga, Japan.
Dr. Kishine is a member of the IEEE Solid-State Circuits Society (SSCS) and Circuits
and Systems (CAS), the Institute of Electronics, Information and Communication Engineers
(IEICE) of Japan, the Institute of Electrical Engineers (IEE) of Japan.