
1. (Department of Electronic Systems Engineering, The University of Shiga Prefecture / 2500 Hassaka, Hikone, Shiga 522-8533, Japan oi23uyoshimura@ec.usp.ac.jp, {inoue.t, tsuchiya.a, kishine.k}@e.usp.ac.jp )

SoC design and applications, Neural chips, High-level synthesis, Bioinformatics, Radar and sonar signal processing

## 1. Introduction

Recently, systems that identify vital signs such as the heartbeat have been developed for early detection of diseases and biometric identification. Electrocardiograms (ECGs) are commonly used to obtain heart-rate information in medical institutions, but they are not suitable for embedded systems because they are expensive and their use is restricted by the electrodes and cables required. Heart rate can also be measured without touching the human body by using a microwave sensor [1-3]. By embedding a system using microwave sensors in walls, ceilings, or seats, stress-free health monitoring and biometric identification can be achieved.

Biometric systems require regression and classification, which are done using machine learning. Deep learning methods, which use multiple hidden layers in a neural network (NN), are especially effective for these tasks [4-15]. The two main types of deep learning networks are convolutional neural networks (CNNs) [4-8] and recurrent neural networks (RNNs) [9-13]. CNNs extract local features of images or time-series data by filtering and processing neighboring pixels or electric signals. Therefore, it is difficult for CNNs to deal with long-term changes in signals. In general, CNNs are used for image recognition and object detection [6].

RNNs have feedback loops in an NN, and the past data affect the present data. Therefore, RNNs are suitable for regression and classification of time-series data. The long short-term memory (LSTM) method [12,13] is a type of RNN that improves the performance for long-term data such as heartbeats, and classification of ECG signals using LSTM has been reported [13]. Recently, complementary NNs combining CNNs with RNNs have been developed to overcome the drawback of CNNs [14,15]. The complementary NNs require more hardware resources compared with a simple CNN or RNN. Therefore, a simple LSTM is suitable for the NN in embedded systems from the viewpoint of the scale of the circuits and systems.

Machine learning methods, including LSTM, consist of training and inference phases. In the training phase, weights and biases of an NN are determined using data for training. The weights and bias values are stored after the training phase. In the inference phase, inference results for unknown input data are output in accordance with the weights and biases. Since training takes a long time, the training phase is done with a large-scale computer before the inference phase. However, the inference process needs to be fast so that time-series data sampled at high frequency can be processed and the identification can be executed in real time, especially in biometric identification. In addition, in order to embed biometric identification systems in various places, systems with small size and low power consumption are desired.

Fig. 1. NN model with LSTM and dense layers ($\textit{M}$ = 1).

In general, CPUs and GPUs with multiple cores and high clock frequencies are used to reduce the computation time for machine learning. However, both CPUs and GPUs are large and typically consume high power (~100 W). A field-programmable gate array (FPGA)-based LSTM method has the potential to reduce the inference time and power consumption compared to CPU/GPU-based ones because a specialized circuit for LSTM that operates with low frequency is installed inside the FPGA [16,17]. Therefore, implementing only the inference phase in an FPGA after performing the training phase can be a better solution for embedded biometric systems.

While parallelizing a large number of calculations is effective for reducing the computation time, more hardware resources are required for the parallelization. Fixed-point arithmetic can reduce the computation time and hardware resource utilization in FPGA-based LSTM [17]. However, it is not suitable for biometric identification because it causes degradation of the computation accuracy compared with floating-point arithmetic.

In earlier work, we proposed algorithms for low-energy floating-point LSTM for the regression of microwave-sensor signals and implemented them in a small-scale FPGA [18]. In that report, the methodology and preliminary experimental results were described briefly. In this paper, we present our algorithms with simplified activation functions and parallelized, pipelined, and unrolled processing in more detail. Experiments were done to compare the computation time, power consumption, and hardware-resource utilization of each algorithm. The results showed that the computation time of the proposed algorithm combining parallel, pipelined, and unrolled processing was reduced by 95% compared to that of the sequential algorithm. In addition, the amount of energy consumption with the proposed algorithm was reduced by 92% and 91% compared to that with a high-end GPU and CPU, respectively.

In Section 2, we explain why LSTM is suitable for time-series data, such as the signals of a microwave sensor, and present its mathematical expression. Next, we describe simplified activation functions and the improvement of the algorithms of LSTM to reduce the resource utilization and accelerate the calculation in Section 3. Then, we present the experimental results in Section 4. Finally, we discuss the results of the experiment in Section 5 before concluding the paper in Section 6.

## 2. Mathematical Expression for LSTM

### 2.1 LSTM and Dense Layers

A basic NN model consisting of LSTM and dense layers is shown in Fig. 1. The notation $x_{t}(t=1,2,\ldots ,N)$ refers to the input signal at $t$, while $h_{N}$ is the output of the $N$-th cell, and $N$ is the number of input signals. The matrix size of $h_{t}$ is ($L_{h}$, 1), where $L_{h}$ is the number of components in a cell (hidden layer). The cell output determined by the input data is transferred to the next cell. In other words, the output data $h_{N}$ is determined from both the past output data $h_{1}$ to $h_{N-1}$ and the input data. Therefore, LSTM is suitable for regression of time-series data.

The output layer is called a dense layer. The output data of the dense layer $y_{M}$ are determined by the output data from the LSTM layer $h_{N}$ and the weights $W_{M}^{d}$ and biases $b_{M}^{d}$ for the dense layer, where $M$ is the number of final outputs. When $M=1$, this model works for regression, and $y_{1}$ can be expressed as:

##### (1)
$y_{1}=W_{1}^{d}h_{N}+b_{1}^{d},$

where the matrix size of $W_{1}^{d}$ is ($L_{h}$, 1), and that of $b_{1}^{d}$ is (1, 1). When $M \geq 2$, this model works as a classifier.

### 2.2 LSTM Cells

The cell structure of a standard LSTM is shown in Fig. 2. The LSTM cell has four gates: forget, input, cell, and output gates. Each gate consists of an activation function that is either a sigmoid ($\sigma$) or hyperbolic tangent (tanh) function, two weight matrices $W$ and $U$, and a bias $b$. The gate outputs $i_{t}$, $f_{t}$, $\tilde{C}_{t}$, and $o_{t}$ and the cell outputs $h_{t}$ and $C_{t}$ can be expressed as:

##### (2)
$i_{t}=\sigma \left(x_{t}W^{i}+h_{t-1}U^{i}+b^{i}\right),$
##### (3)
$f_{t}=\sigma \left(x_{t}W^{f}+h_{t-1}U^{f}+b^{f}\right),$
##### (4)
$\tilde{C}_{t}=\tanh \left(x_{t}W^{g}+h_{t-1}U^{g}+b^{g}\right),$

Fig. 2. Architecture of LSTM cell.

Fig. 3. Comparison of calculation results of sigmoid and hard-sigmoid functions.

##### (5)
$o_{t}=\sigma \left(x_{t}W^{o}+h_{t-1}U^{o}+b^{o}\right),$
##### (6)
$C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot \tilde{C}_{t},$
##### (7)
$h_{t}=\tanh \left(C_{t}\right)\odot o_{t}.$

First, $i_{t}$, $f_{t}$, $\tilde{C}_{t}$, and $o_{t}$ are calculated using Eqs. (2) to (5). Next, the cell state $C_{t}$ is determined from $i_{t}$, $f_{t}$, $\tilde{C}_{t}$, and $C_{t-1}$ using Eq. (6). Finally, the output $h_{t}$ is determined from $C_{t}$ and $o_{t}$ using Eq. (7). Eqs. (2) to (7) are evaluated as matrix calculations, and the size of each matrix is determined by $L_{h}$. Since $x_{t}$ is one-dimensional, the sizes of the matrices are as follows: ($L_{h}$, 1) for $i_{t}$, $f_{t}$, $\tilde{C}_{t}$, $o_{t}$, $C_{t}$, and $b^{i,f,g,o}$; (1, $L_{h}$) for $W^{i,f,g,o}$; and ($L_{h}$, $L_{h}$) for $U^{i,f,g,o}$. Each gate in Eqs. (2) to (5) uses one of two activation functions: the sigmoid function $\sigma$ or the hyperbolic tangent (tanh).
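The order of operations in Eqs. (2) to (7) can be sketched in plain Python as follows. This is only an illustrative sketch, not the C program used for the FPGA: the function names `gate` and `lstm_step`, the `params` dictionary layout, and the choice of indexing $U$ as `U[k][j]` are assumptions made here for readability.

```python
import math

def sigmoid(v):
    """Standard sigmoid used by the input, forget, and output gates."""
    return 1.0 / (1.0 + math.exp(-v))

def gate(x_t, h_prev, W, U, b, act):
    """One gate of Eqs. (2)-(5): act(x_t * W + h_prev * U + b) for scalar x_t.

    W and b are lists of length L_h; U is an L_h x L_h nested list
    (index convention U[k][j] is an assumption of this sketch).
    """
    L_h = len(h_prev)
    out = []
    for j in range(L_h):
        acc = x_t * W[j] + b[j]
        for k in range(L_h):          # L_h * L_h multiply-accumulates per gate
            acc += h_prev[k] * U[k][j]
        out.append(act(acc))
    return out

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM cell update; params maps 'i', 'f', 'g', 'o' to (W, U, b)."""
    i_t = gate(x_t, h_prev, *params["i"], sigmoid)      # Eq. (2)
    f_t = gate(x_t, h_prev, *params["f"], sigmoid)      # Eq. (3)
    g_t = gate(x_t, h_prev, *params["g"], math.tanh)    # Eq. (4)
    o_t = gate(x_t, h_prev, *params["o"], sigmoid)      # Eq. (5)
    # Eq. (6): element-wise update of the cell state
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, g_t)]
    # Eq. (7): element-wise output
    h_t = [math.tanh(c) * o for c, o in zip(C_t, o_t)]
    return h_t, C_t
```

Calling `lstm_step` once per input sample $x_{1},\ldots ,x_{N}$, while feeding each returned pair back in as `h_prev` and `C_prev`, yields $h_{N}$ for the dense layer in Eq. (1).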

## 3. Method of Implementation for Accelerating FPGA-based LSTM

Based on the equations presented in Section 2, we reviewed the extent to which the algorithms can decrease the computation time in an FPGA-based LSTM. We also examined the amount of resource utilization with these algorithms. First, we explain how to simplify the activation functions. Then, we present methods for accelerating the calculation for inference in an FPGA-based LSTM.

### 3.1 Simplifying Activation Functions

The LSTM cells have two types of activation functions: sigmoid and hyperbolic tangent functions. We simplified the activation functions to decrease the computation time and reduce the amount of resource utilization. The normal sigmoid function $\sigma$ can be expressed as:

##### (8)
$\sigma \left(x\right)=\frac{1}{1+e^{-x}}.$

There is an exponential calculation and division in the sigmoid function that require more complex computation than addition or multiplication. As an alternative, the Keras library for NNs in Python [19] provides a hard sigmoid function that expresses the sigmoid function as a piecewise linear approximation. The hard sigmoid function can be carried out using only multiplication and addition, which has the advantage of reducing the amount of resource utilization in an FPGA. The equation for the hard-sigmoid function $\sigma _{h}$ is given as follows:

##### (9)
$\sigma _{h}\left(x\right)=\begin{cases}0, & x<-2.5\\ 0.2x+0.5, & -2.5\leq x\leq 2.5\\ 1, & 2.5<x\end{cases}$

The calculation results of the sigmoid and hard sigmoid functions are shown in Fig. 3. The hard sigmoid function approximates the sigmoid with a small squared error. Also, since the hard sigmoid function is available from the Keras library, it can be used in the training phase. Therefore, the same activation function is evaluated in both training and inference, and the computation results are the same in both phases.
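As a minimal sketch of this comparison, the following Python code evaluates Eq. (8) and Eq. (9) on a grid and reports the largest absolute difference. The grid range of $[-6, 6]$ and the variable names are assumptions chosen for illustration; the code follows Eq. (9) directly and does not call Keras.

```python
import math

def sigmoid(x):
    """Eq. (8): needs one exponential and one division."""
    return 1.0 / (1.0 + math.exp(-x))

def hard_sigmoid(x):
    """Eq. (9): piecewise linear, needs only a multiply and an add."""
    if x < -2.5:
        return 0.0
    if x > 2.5:
        return 1.0
    return 0.2 * x + 0.5

# Evaluate both functions on a grid from -6.0 to 6.0 in steps of 0.1.
xs = [k / 10.0 for k in range(-60, 61)]
max_abs_err = max(abs(sigmoid(x) - hard_sigmoid(x)) for x in xs)
```

On this grid, the largest gap occurs at the breakpoints $x=\pm 2.5$, where the hard sigmoid saturates while the sigmoid is still about 0.076 away from 0 or 1.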

The hyperbolic tangent can be expressed as:

##### (10)
$\tanh \left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}.$

Since this function has four exponential calculations, the computation time and the amount of resource utilization are increased. Therefore, we implemented a simplified hyperbolic tangent as follows:

##### (11)
$\tanh \left(x\right)=\frac{a-b}{a+b},\quad a=e^{x},\quad b=e^{-x}.$

Here, we set variables $a$ and $b$ to $e^{x}$ and $e^{-x}$, respectively. This reduces the amount of exponential arithmetic because the results of the exponential arithmetic are stored in the variables.
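The idea of Eq. (11) can be sketched in Python as follows. The function name `tanh_simplified` is hypothetical, and this sketch makes no attempt to guard against overflow of `math.exp` for very large $|x|$.

```python
import math

def tanh_simplified(x):
    """Eq. (11): evaluate each exponential once and reuse the results."""
    a = math.exp(x)    # e^x, computed once
    b = math.exp(-x)   # e^-x, computed once
    return (a - b) / (a + b)
```

Compared with a literal evaluation of Eq. (10), the number of exponential evaluations drops from four to two, while the result is mathematically identical.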

### 3.2 Sequential Algorithm

A straightforward algorithm is a sequential algorithm in which Eqs. (2) to (7) are calculated one after another, as shown in Fig. 4. Each of Eqs. (2) to (5) requires the product $h_{t-1}U$, which involves a two-dimensional matrix, so the number of for-loop iterations in each operation is $L_{h}^{2}$. Eqs. (6) and (7) involve only one-dimensional matrices, so each requires $L_{h}$ iterations. Hence, the total number of loops of the LSTM cell is $\left(4L_{h}^{2}+2L_{h}\right)$.

Fig. 4. Flow of sequential algorithm and number of loops.

Fig. 5. Flow of parallel algorithm and number of loops.

### 3.3 Parallel and Pipelined Algorithm

A parallel algorithm can take full advantage of the characteristics of an FPGA. As mentioned, the number of required loops is $L_{h}^{2}$ for each of Eqs. (2) to (5). Since these equations are independent of one another, they can be calculated in parallel, as shown in Fig. 5. More specifically, the product $h_{t-1}U$ and the sum of $x_{t}W$ and $b$ are calculated in parallel for the four gates, and then the activation functions ($\sigma$ or tanh) are evaluated in parallel. However, Eqs. (6) and (7) cannot be calculated in parallel with each other because the result of Eq. (6) is used in Eq. (7). Therefore, the total number of loops with the parallel algorithm is $\left(L_{h}^{2}+2L_{h}\right)$.

The ratio of the total number of loops with the parallel algorithm to that with the sequential algorithm can be expressed as:

##### (12)
$\frac{\text{Parallelized Loop}}{\text{Sequential Loop}}=\frac{L_{h}^{2}+2L_{h}}{4L_{h}^{2}+2L_{h}}=1-\frac{3}{4+\frac{2}{L_{h}}}\rightarrow \frac{1}{4}.$

As $L_{h}$ increases, the ratio approaches 1/4. Since there is a linear relationship between the number of loops and the computation time, reducing the number of loops results in shorter computation time. Assuming that the time required for one loop is the same in the operations of Eq. (2) to (7), the computation time with the parallel algorithm can become approximately 1/4 that obtained with the sequential algorithm. In the parallel algorithm, the order of operations is different from that in the sequential algorithm, but the operations and parameters are not changed. Therefore, the amounts of resource utilization are almost the same.
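As a small worked check of these loop counts, the following Python sketch evaluates both totals and their ratio. The function names are hypothetical, and $L_{h}=128$ is the value used later in Section 4.1.

```python
def sequential_loops(L_h):
    """Four gates at L_h^2 loops each, plus Eqs. (6) and (7) at L_h each."""
    return 4 * L_h**2 + 2 * L_h

def parallel_loops(L_h):
    """The four gates run concurrently, so only one L_h^2 term remains."""
    return L_h**2 + 2 * L_h

def loop_ratio(L_h):
    """Ratio of parallel to sequential loop counts, as in Eq. (12)."""
    return parallel_loops(L_h) / sequential_loops(L_h)

ratio_128 = loop_ratio(128)   # 16640 / 65792
```

For $L_{h}=128$, the counts are 65792 and 16640, giving a ratio of about 0.253, which is already close to the limit of 1/4.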

In addition to the parallel algorithm, pipelining is effective for reducing computation time. In pipelining, the next calculation starts before the present calculation is completed. As shown in Fig. 6, in parallel-pipelined algorithm A (PP-A), pipelining is applied to the calculation of Eqs. (6) and (7). Parallel-pipelined algorithm B (PP-B) extends PP-A by also pipelining the calculation of $h_{t-1}U$ in Eqs. (2) to (5). However, pipelining increases the amount of resource utilization because circuit blocks for both the present and next operations are required.

Fig. 6. Flow of PP-A, PP-B, and PPU and number of loops. The variables of for loops are $i(i=0,2,\ldots ,L_{h})$ and $j(j=0,1,\ldots ,L_{h})$.

Fig. 7. Diagram of experimental system for evaluation.

### 3.4 Proposed Parallel-pipelined Unrolled Algorithm

Decreasing the number of loops is also effective for reducing the computation time. Therefore, we propose a parallel-pipelined and unrolled algorithm (PPU), in which unrolling is applied to PP-B. Because the result of one loop iteration is not used in the next, the loops can be unrolled and executed simultaneously. Fig. 6 also shows the flow of PPU for an unrolling factor of $F=2$ (the number of unrolled loops). In this case, the computation time is halved, but the resource utilization is doubled.
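The loop transformation behind unrolling with $F=2$ can be illustrated in Python as follows. This is a sketch under stated assumptions: the function names, the choice of which loop is unrolled, and the requirement that $L_{h}$ be even are all illustrative, and Python still executes the two accumulations one after another. On an FPGA, the two independent accumulations would map to separate hardware, which is where the speedup and the doubled resource utilization come from.

```python
def matvec(h_prev, U):
    """Reference product: out[i] = sum_j h_prev[j] * U[j][i]."""
    L_h = len(h_prev)
    out = [0.0] * L_h
    for i in range(L_h):
        for j in range(L_h):
            out[i] += h_prev[j] * U[j][i]
    return out

def matvec_unrolled2(h_prev, U):
    """Same product with the outer loop unrolled by F = 2 (L_h must be even)."""
    L_h = len(h_prev)
    out = [0.0] * L_h
    for i in range(0, L_h, 2):        # i = 0, 2, 4, ...: half as many iterations
        acc0 = 0.0
        acc1 = 0.0
        for j in range(L_h):
            acc0 += h_prev[j] * U[j][i]       # output element i
            acc1 += h_prev[j] * U[j][i + 1]   # output element i + 1, independent of acc0
        out[i] = acc0
        out[i + 1] = acc1
    return out
```

Because each output element is accumulated in the same order in both versions, the unrolled form returns exactly the same values as the reference loop.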

## 4. Experiment and Results

We performed an experiment to evaluate the computation time, resource utilization, and power consumption with each algorithm in an FPGA. In this section, we present the experimental procedure and results of the proposed algorithm on an FPGA and a high-end GPU and CPU.

### 4.1 Implementation of Proposed System

An NN model using LSTM and dense layers was implemented to regress microwave-sensor signals. The number of inputs for the time-series data was set to $N=50$, the number of components in a cell was set to $L_{h}=128$, and the number of outputs for dense layers was set to $M=1$. First, the NN model was constructed using the Keras library in Python (backend: TensorFlow). After that, weights and biases were determined in the training phase on a PC with a GPU.

As a result of training, the mean absolute error (MAE) and root mean squared error (RMSE) for the training data (1949 sets) were 0.069 and 0.101, respectively. The MAE and RMSE for the test data (1949 sets) were 0.100 and 0.143, respectively. Next, we developed a C-based program of the LSTM model for the FPGA-based inference using the proposed algorithms. All variables were set to a 32-bit floating-point type, and the weights and biases were stored in BRAM. The program was then converted into an HDL-based program using Xilinx Vivado HLS 2019.2.

Xilinx Vivado 2019.2 was used to implement the experimental system, including the FPGA-based NN model shown in Fig. 7. The clock frequency of the FPGA was set to 100 MHz. The resource utilization and power consumption were obtained from the implementation report in the software. A ZYBO Z7-20 was used in the experiments, which is an FPGA board based on ZYNQ-7020 and includes a CPU (microcontroller) and FPGA. The CPU sends the input data of the NN model to the FPGA and receives the output data from it via AXI4-Lite. To compare the computation time and power consumption with the FPGA-based model, we also implemented the NN model on the CPU of the ZYBO Z7-20 (ZYBO CPU) in C (optimization option: -O3), and on an NVIDIA RTX 2080 Ti (GPU) and an Intel Core i7 9700 (Intel CPU) with the Keras library in Python.

### 4.2 Effect of Simplifying Activation Functions

We compared the experimental results of the computation time and resource utilization with sequential algorithms using normal and simplified activation functions. The comparison of computation time and power consumption is shown in Table 1. The simplified type reduced the power consumption by 5% and the computation time by 3% compared to the normal one. The amount of energy consumption (power $\times$ time) of the simplified type was also reduced by 8% compared to the normal one.
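The energy figures follow directly from the product of power and time. A minimal Python sketch of this arithmetic, using the power and time values reported in Table 1 (the function and variable names are hypothetical), is:

```python
def energy_mj(power_w, time_ms):
    """Energy in millijoules: watts multiplied by milliseconds."""
    return power_w * time_ms

normal = energy_mj(1.909, 381.4)       # normal activation functions
simplified = energy_mj(1.801, 371.5)   # simplified activation functions

# Relative reduction in energy when switching to the simplified type.
reduction = 1.0 - simplified / normal
```

This reproduces the tabulated energies of about 728.1 mJ and 669.1 mJ and a reduction of roughly 8%.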

The results of the hardware-resource utilization are shown in Table 2. With the simplified type, the LUT and FF utilization fell to about half that of the normal type, the LUTRAM and DSP utilization also decreased, and the BRAM utilization was almost unchanged. These results demonstrate that simplifying the activation functions is effective in reducing both resource consumption and computation time.

Table 1. Computation time and power consumption.

| Type | Power [W] | Time [ms] | Energy [mJ] |
|---|---|---|---|
| Normal | 1.909 | 381.4 | 728.1 |
| Simplified | 1.801 | 371.5 | 669.1 |

Table 2. Comparison of resource utilization in sequential algorithms. Usage rates in ZYBO Z7-20 are shown.

| Type | LUT | LUTRAM | FF | BRAM | DSP |
|---|---|---|---|---|---|
| Normal | 9108 (17.1%) | 258 (1.48%) | 8408 (7.9%) | 71.5 (51.0%) | 55 (25%) |
| Simplified | 4464 (8.4%) | 55 (0.26%) | 3853 (3.6%) | 68.5 (48.9%) | 38 (17%) |

### 4.3 Comparison of Each Algorithm

Next, we compared the experimental results of the computation time and resource utilization of the sequential, parallel, parallel pipeline (PP-A, PP-B), and parallel-pipeline unrolled (PPU) algorithms. All the algorithms in this experiment used the simplified activation function. The comparison of computation time and power consumption is shown in Table 3.

The processing speed with the ZYBO CPU was 1.7 times as fast as that with the FPGA using the sequential algorithm because the clock frequency of the CPU (667 MHz) is higher than that of the FPGA (100 MHz), and the processing of the CPU is optimized by the compiler. The use of the parallel algorithm reduced the computation time by 74% compared to the FPGA-based sequential algorithm. This result is in good agreement with the theoretical value calculated from Eq. (12) in Section 3. The computation time of the parallel algorithm was reduced by 50% compared to the ZYBO CPU-based sequential algorithm.

The amount of energy consumption (power $\times$ time) of the parallel algorithm was reduced by 61% and 48% compared to the GPU and Intel CPU, respectively. The computation time of PP-A was reduced by only about 3% compared to the parallel algorithm because the ratio of pipelined operations to the total number of operations is small. In PP-B, the computation time decreased by 52% compared to PP-A because the product $h_{t-1}U$ occupies a large part of the LSTM computation.

In PPU, the computation time was reduced by 50% compared to PP-B. This result is in good agreement with the theoretical value in Section 3. By using a higher clock frequency (118 MHz), the computation time was reduced by 16% compared to the PPU using a 100-MHz clock. As a result, the amount of energy consumption of the PPU was reduced by 95% compared to the sequential algorithm. In addition, the amount of energy consumption of the PPU was reduced by 92% and 91% compared to the GPU and Intel CPU, respectively.

The results of the hardware-resource utilization with each algorithm are shown in Table 4. The resource utilization of the sequential algorithm with the normal activation functions was about twice that of the sequential algorithm with the simplified activation functions. This indicates that the resource utilization greatly depends on how the activation functions are implemented. The resource utilization of the parallel algorithm was slightly lower than that of the sequential algorithm. The reason for the lower resource utilization is that the synthesis is optimized in the software.

Table 3. Computation time and power consumption for each chip.

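| Chip | Algorithm | Power [W] | Time [ms] | Energy [mJ] |
|---|---|---|---|---|
| FPGA | Sequential | 1.801 | 371.5 | 669.1 |
| FPGA | Parallel | 1.849 | 95.78 | 177.1 |
| FPGA | PP-A | 1.83 | 92.92 | 170.0 |
| FPGA | PP-B | 1.734 | 44.37 | 76.9 |
| FPGA | PPU | 1.864 | 22.24 | 41.4 |
| FPGA | PPU | 1.932* | 18.76* | 36.2* |

| Chip | Language | Power [W] | Time [ms] | Energy [mJ] |
|---|---|---|---|---|
| GPU | Python | 66.2 | 6.91 | 457.4 |
| Intel CPU | Python | 63.4 | 5.36 | 339.8 |
| ZYBO CPU | C | 1.4 | 213 | 298.2 |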

* When clock frequency is 118 MHz

Table 4. Comparison of resource utilization for each algorithm. Usage rates in ZYBO Z7-20 are shown.

| Algorithm | LUT | LUTRAM | FF | BRAM | DSP |
|---|---|---|---|---|---|
| Sequential | 4464 (8.39%) | 55 (0.32%) | 3853 (3.6%) | 68.50 (48.9%) | 38 (17%) |
| Parallel | 4229 (7.95%) | 45 (0.26%) | 3701 (3.5%) | 67.5 (48.2%) | 34 (16%) |
| PP-A | 4328 (8.14%) | 72 (0.41%) | 3735 (3.5%) | 68 (48.6%) | 34 (16%) |
| PP-B | 4399 (8.27%) | 73 (0.42%) | 4033 (3.8%) | 68 (48.6%) | 34 (15%) |
| PPU | 8537 (16.1%) | 111 (0.64%) | 7218 (6.8%) | 74.5 (53.2%) | 68 (31%) |
| PPU* | 8557 (16.1%) | 111 (0.64%) | 7218 (6.8%) | 74.5 (53.2%) | 68 (31%) |

* When clock frequency is 118 MHz

The utilization of PP-B increased slightly compared to that of the parallel algorithm. The utilization of PPU was about twice as high as that of PP-B, which agrees with the theoretical value in Section 3. This is comparable to the resource utilization of the sequential algorithm with the normal activation functions (Table 2). The high utilization rates of BRAM in all algorithms arise because the weights ($W$ and $U$) and biases ($b$) are stored in BRAM. Only nine BRAM regions were needed for the calculation of the NN model.

## 5. Discussion

In this section, we compare the performance of our system with that of previous works [16, 17, 20]. First, we examined the number of floating-point operations per second (flop/s) in PPU (118 MHz) as a performance index of the computation speed. When a dataset of 50 points was input to the proposed NN consisting of the LSTM and dense layers, the total number of floating-point operations was 6.758 Mflop. The computation time of PPU (118 MHz) was 18.76 ms, as shown in Table 3. Therefore, the performance index of the computation speed was 0.360 Gflop/s.
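The performance index and the energy efficiency reported for this work can be reproduced with the following Python sketch. The variable names are hypothetical; the operation count and computation time are the values stated above, and the power of 1.932 W is the PPU (118 MHz) value from Table 3.

```python
total_flop = 6.758e6      # floating-point operations for one 50-point input
time_s = 18.76e-3         # computation time of PPU at 118 MHz, in seconds
power_w = 1.932           # power of PPU at 118 MHz, in watts

# Throughput in Gflop/s: operations divided by time.
gflops = total_flop / time_s / 1e9

# Energy efficiency in Gflop/J: throughput divided by power.
gflop_per_joule = gflops / power_w
```

These evaluate to about 0.360 Gflop/s and 0.186 Gflop/J, matching the values listed for this work in Table 5.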

The comparison of implementations of the LSTM models on an FPGA is shown in Table 5, where [op] is a unit for the number of fixed-point operations. We compared the proposed system using PPU with a previous system that also uses floating-point arithmetic [16]. The energy efficiency of the previous system was about twice as high as that of the proposed method, but its LUT, FF, and DSP utilization was 22, 25, and 17 times larger, respectively. The high resource utilization and power consumption of the previous system are due to the fact that its experiments were conducted with three layers of LSTM with 250 hidden layers. Nevertheless, even when we compared the resource utilization of the proposed system scaled to a three-layer LSTM with that of the previous system, we found that the previous system used more resources than ours.

The energy efficiency of our system with the PPU (118 MHz) was about 1.24 times higher than that of the previous system [17], even though the previous system used a clock frequency about 1.2 times higher (142 MHz) and carried out the calculation with fixed-point arithmetic. Furthermore, the FF utilization of our system with PPU was smaller than that of the previous system, and the LUT and DSP utilization was almost the same. The LUT and DSP utilization was slightly larger than the previous system's because the PPU performs floating-point operations. Since the power consumption and the number of fixed-point bits are not specified for the previous system [20], we cannot compare its efficiency with that of our system. However, the resource utilization and operating frequency of the proposed system are lower than those of the previous system, so we can conclude that the power consumption is greatly reduced.

Table 5. Comparison of implementations for LSTM models on FPGA.

| Item | This work | [16] | [17] | [20] |
|---|---|---|---|---|
| FPGA chip | Zynq XC7Z020 | Virtex7 XCVX485t | Zynq XC7Z020 | Zynq XC7Z020 |
| Frequency [MHz] | 118 | 150 | 142 | 200 |
| Precision [bit] | Float 32 | Float 32 | Fixed 16 | Fixed - |
| Weight storage | on-chip | off-chip | off-chip | on-chip |
| LUT utilization # | 8557 | 189871 | 7201 | 39851 |
| FF utilization # | 7218 | 181634 | 12960 | 54488 |
| DSP utilization # | 68 | 1176 | 50 | 206 |
| Performance index | 0.360 Gflop/s | 7.62 Gflop/s | 0.284 Gop/s | 20 Gop/s |
| Power [W] | 1.932 | 19.63 | 1.942 | - |
| Efficiency | 0.186 Gflop/J | 0.388 Gflop/J | 0.146 Gop/J | - |

## 6. Conclusion

In this paper, we proposed algorithms to reduce the energy consumption of LSTM by modifying the inference algorithm in an FPGA. Experimental results showed that our parallel algorithm reduced the computation time by 74% compared to the FPGA-based sequential algorithm without increasing resource utilization. In addition, the amount of energy consumption with the proposed algorithm combining parallel, pipelined, and unrolled processing (PPU) was reduced by 95% compared to the sequential algorithm. Furthermore, the amount of energy consumption with PPU was reduced by 92% and 91% compared to that with a GPU and Intel CPU, respectively.

### REFERENCES

1
Sakamoto T., Imasaka R., Taki H., Sato T., Yoshioka M., Inoue K., Fukuda T., Sakai H., 2015, Accurate heartbeat monitoring using ultra-wideband radar, IEICE Electronics Express, Vol. 12, No. 3, pp. 20141197
2
Droitcour A. D., Boric-Lubecke O., Lubecke V. M., Lin J., Kovacs G. T. A., 2004, Range correlation and I/Q performance benefits in single-chip silicon doppler radars for noncontact cardiopulmonary monitoring, IEEE Transactions on Microwave Theory and Techniques, Vol. 52, No. 3, pp. 838-848
3
Droitcour A. D., Jan. 2006, Non-contact measurement of heart and respiration rates with single chip microwave doppler radar
4
Jafferson A.-J., Ponnusamy V., Jovic J., Trajanovic M., June 2021, An IoT based cloud EEG signal analytic framework for thought to text mapping, IEIE Transactions on Smart Processing & Computing, Vol. 10, No. 3, pp. 183-188
5
Lee K., Kim Y., Oct. 2019, Hardware implementation of the simplified digital spiking neural network on FPGA, IEIE Transactions on Smart Processing & Computing, Vol. 8, No. 5, pp. 405-414
6
Nguyen X.-T., Nguyen T.-N., Lee H.-J., Kim H., Dec. 2020, An accurate weight binarization scheme for CNN object detectors with two scaling factors, IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 6, pp. 497-503
7
Lee H.-B., Kim G., Kim J.-H., Kang Y.-S., Park C., June 2021, Optimal design of convolutional neural network for EEG-based authentication, IEIE Transactions on Smart Processing & Computing, Vol. 10, No. 3, pp. 199-203
8
Jhong S.-Y., Tseng P.-Y., Siriphockpirom N., Hsia C.-H., Huang M.-S., Hua K.-L., Chen Y.-Y., 2020, An automated biometric identification system using CNN-based palm vein recognition, in 2020 International Conference on Advanced Robotics and Intelligent Systems (ARIS), pp. 1-6
9
Lynn H. M., Pan S. B., Kim P., Oct. 2019, A deep bidirectional GRU network model for biometric electrocardiogram classification based on recurrent neural networks, IEEE Access, Vol. 7, pp. 145395-145405
10
Singh S., Pandey S.K., Pawar U., Janghel R.R., 2018, Classification of ECG arrhythmia using recurrent neural networks, Procedia Computer Science, Vol. 132, pp. 1290-1297
11
Salloum R., Kuo C.-C. J., 2017, ECG-based biometrics using recurrent neural networks, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2062-2066
12
Hochreiter S., Schmidhuber J., Dec. 1997, Long short-term memory, Neural Computation, Vol. 9, pp. 1735-1780
13
Saadatnejad S., Oveisi M., Hashemi M., 2020, LSTM-based ECG classification for continuous monitoring on personal wearable devices, IEEE Journal of Biomedical and Health Informatics, Vol. 24, No. 2, pp. 515-523
14
Han S., Lee W., Eom H., Kim J., Park C., Aug. 2020, Detection of arrhythmia using 1D convolution neural network with LSTM model, IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 4, pp. 261-265
15
Yang C., Jiang W., Guo Z., Nov. 2019, Time series data classification based on dual path CNN-RNN cascade network, IEEE Access, Vol. 7, pp. 155304-155312
16
Guan Y., Yuan Z., Sun G., Cong J., 2017, FPGA-based accelerator for long short-term memory recurrent neural networks, in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629-634
17
Chang A. X. M., Martini B., Culurciello E., Nov. 2015, Recurrent neural networks hardware implementation on FPGA, arXiv e-prints, arXiv:1511.05552
18
Yoshimura U., Inoue T., Tsuchiya A., Kishine K., Jan. 2021, Implementation of low-energy LSTM with parallel and pipelined algorithm in small-scale FPGA, in 2021 International Conference on Electronics, Information, and Communication (ICEIC 2021), Jeju, South Korea, pp. 114-117.
19
Chollet F., 2015, Keras
20
Hao Y., Quigley S., Oct. 2017, The implementation of a deep recurrent neural network language model on a Xilinx FPGA, arXiv e-prints, arXiv:1710.10296

## Author

##### Ukyo Yoshimura

Ukyo Yoshimura received the B.E. degree in Electronic Systems Engineering from the University of Shiga Prefecture, Shiga, Japan, in 2019. Since the same year, he has been enrolled in the master's course at the Graduate School of Engineering, the University of Shiga Prefecture. His research interests include FPGA-based systems, embedded systems, and machine learning.

##### Toshiyuki Inoue

Toshiyuki Inoue received the B.S., M.S., and Ph.D. degrees in Electrical, Electronic and Information Engineering from Osaka University, Osaka, Japan, in 2010, 2012, and 2015, respectively. He joined the Department of Electronic Systems Engineering, the University of Shiga Prefecture, in 2017, and has been an Assistant Professor since 2017. His research interests include RF circuits for wireless communication, wireless sensor networks, radio-over-fiber techniques, and optoelectronics. Dr. Inoue is a member of the IEEE, the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan, and the Japan Society of Applied Physics (JSAP). He received the Paper Award in 2013 from IEICE.

##### Akira Tsuchiya

Akira Tsuchiya received the B.E., M.E., and Ph.D. degrees in Communications and Computer Engineering from Kyoto University, Kyoto, Japan, in 2001, 2003, and 2005, respectively. In 2005, he became an Assistant Professor in the Department of Communications and Computer Engineering, Graduate School of Informatics, Kyoto University. Since 2017, he has been an Associate Professor in the Department of Electronic Systems Engineering, the University of Shiga Prefecture, Shiga, Japan. His research interests include modeling and design of on-chip passive components of high-frequency CMOS, and high-speed analog circuit design. He is a member of the IEEE, IEICE, and IPSJ.

##### Keiji Kishine

Keiji Kishine received the B.S. and M.S. degrees in engineering science from Kyoto University, Kyoto, Japan, and the Ph.D. degree in informatics from Kyoto University, Kyoto, Japan, in 1990, 1992, and 2006, respectively. In 1992, he joined the Electrical Communication Laboratories, Nippon Telegraph and Telephone Corporation (NTT), Tokyo, Japan. He has been engaged in research and design of high-speed, low-power circuits for Gb/s LSIs using Si-bipolar transistors, with application to optical communication systems, in NTT System Electronics Laboratories, Kanagawa, Japan. Since 1997, he has worked on research and development of over-Gb/s clock and data recovery ICs at the Network Service Innovation Laboratory in NTT Network Innovation Laboratories, Kanagawa, Japan. He also worked at the Ubiquitous Interface Laboratory in NTT Microsystems Integration Laboratories, Kanagawa, Japan. In 2008, he became an Associate Professor, and he is now a Professor with the School of Engineering, the University of Shiga Prefecture, Shiga, Japan. Dr. Kishine is a member of the IEEE Solid-State Circuits Society (SSCS) and Circuits and Systems Society (CAS), the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan, and the Institute of Electrical Engineers (IEE) of Japan.