Biometric systems require the regression and classification of biological sensing data, which are both carried out using machine learning. Long short-term memory (LSTM) is one of the most common methods used for regression and classification. We have developed and implemented a low-energy LSTM algorithm for the regression of microwave sensor signals in a small-scale FPGA. Experimental results show that the FPGA-based parallel-pipelined unrolled algorithm can reduce the computation time by 95% compared to an FPGA-based sequential algorithm. In addition, we found that the power consumption can be reduced by 92% and 91% compared to that obtained with a high-end GPU and CPU, respectively.

※ The user interface design of www.jsts.org has been recently revised and updated. Please contact inter@theieie.org for any inquiries regarding paper submission.

### Journal Search

## 1. Introduction

Recently, systems that identify vital signs such as the heartbeat have been developed
for early detection of diseases and biometric identification. Electrocardiograms (ECGs)
are commonly used to obtain heart-rate information in medical institutions, but they
are not suitable for application to embedded systems because they are expensive, and
their use is restricted due to the electrodes and cables required. Heartrate can also
be measured without touching the human body by using a microwave sensor ^{[1}-^{3]}. By embedding a system using microwave sensors in walls, ceilings, or seats, stress-free
health monitoring and biometric identification can be achieved.

Biometric systems require regression and classification, which are done using
machine learning. Machine learning methods are effective for regression and classification,
especially deep learning methods, which use multiple hidden layers in a neural network
(NN) ^{[4}-^{15]}. There are mainly two types of deep learning: convolutional neural networks (CNNs)
^{[4}-^{8]} and recurrent neural networks (RNNs) ^{[9}-^{13]}. CNNs extract local features of images or time-series data by filtering and processing
neighbor pixels or electric signals. Therefore, it is difficult for CNNs to deal with
a change of signals in the long term. In general, CNNs are used for image recognition
and object detection ^{[6]}.

RNNs have feedback loops in an NN, and the past data affect the present data.
Therefore, RNNs are suitable for regression and classification of time-series data.
The long short-term memory (LSTM) method ^{[12,}^{13]} is a type of RNN that improves the performance for long-term data such as heartbeats,
and classification of ECG signals using LSTM has been reported ^{[13]}. Recently, complementary NNs combining CNNs with RNNs have been developed to overcome
the drawback of CNNs ^{[14,}^{15]}. The complementary NNs require more hardware resources compared with a simple CNN
or RNN. Therefore, a simple LSTM is suitable for the NN in embedded systems from the
viewpoint of the scale of the circuits and systems.

Machine learning methods, including LSTM, consist of training and inference phases. In the training phase, weights and biases of an NN are determined using data for training. The weights and bias values are stored after the training phase. In the inference phase, inference results for unknown input data are output in accordance with the weights and biases. Since training takes a long time, the training phase is done with a large-scale computer before the inference phase. However, the inference process needs to be fast so that time-series data sampled at high frequency can be processed and the identification can be executed in real time, especially in biometric identification. In addition, in order to embed biometric identification systems in various places, systems with small size and low power consumption are desired.

In general, CPUs and GPUs with multiple cores and high clock frequencies are used
to reduce the computation time for machine learning. However, both CPUs and GPUs are
large and typically consume high power (~100 W). A field-programmable gate array (FPGA)-based
LSTM method has the potential to reduce the inference time and power consumption compared
to CPU/GPU-based ones because a specialized circuit for LSTM that operates with low
frequency is installed inside the FPGA ^{[16,}^{17]}. Therefore, implementing only the inference phase in an FPGA after performing the
training phase can be a better solution for embedded biometric systems.

While parallelizing a large number of calculations is effective for reducing the
computation time, more hardware resources are required for the parallelization. Fixed-point
arithmetic can reduce the computation time and hardware resource utilization in FPGA-based
LSTM ^{[17]}. However, it is not suitable for biometric identification because it causes degradation
of the computation accuracy compared with floating-point arithmetic.

In earlier work, we proposed algorithms for low-energy floating-point LSTM for
the regression of microwave-sensor signals and implemented them in a small-scale FPGA
^{[18]}. In that report, the methodology and preliminary experimental results were described
briefly. In this paper, we present our algorithms with simplifying activation functions
and parallelized, pipelined, and unrolled processing in more detail. Experiments were
done to compare the computation time, power consumption, and hardware-resource utilization
with each algorithm. The results showed that the computation time of the proposed
algorithm combining parallel, pipelined, and unrolled processing was reduced by 95%
compared to that with the sequential algorithm. In addition, the amount of energy
consumption with the proposed algorithm was reduced by 92% and 91% compared to that
with a high-end GPU and CPU, respectively.

In Section 2, we explain why LSTM is suitable for time-series data, such as the signals of a microwave sensor, and present its mathematical expression. Next, we describe simplified activation functions and the improvement of the algorithms of LSTM to reduce the resource utilization and accelerate the calculation in Section 3. Then, we present the experimental results in Section 4. Finally, we discuss the results of the experiment in Section 5 before concluding the paper in Section 6.

## 2. Mathematical Expression for LSTM

### 2.1 LSTM and Dense Layers

A basic NN model consisting of LSTM and dense layers is shown in Fig. 1. The notation $x_{t}(t=1,2,\ldots ,N)$ refers to the input signal at $t$, while $h_{N}$ is the output of the $N$-th cell, and $N$ is the number of input signals. The matrix size of $h_{t}$ is ($L_{h}$, 1), where $L_{h}$ is the number of components in a cell (hidden layer). The cell output determined by the input data is transferred to the next cell. In other words, the output data $h_{N}$ is determined from both the past output data $h_{1}$ to $h_{N-1}$ and the input data. Therefore, LSTM is suitable for regression of time-series data.

The output layer is called a dense layer. The output data of the dense layer $y_{M}$ are determined by the output data from the LSTM layer $h_{N}$ and the weights $W_{M}^{d}$ biases and $b^{d}$ for the dense layer, where \textit{M} is the number of the final outputs. When $M=1$, this model works for regression, and $y_{1}$ can be expressed as:

where the matrix size of $W_{1}^{d}$ is ($L_{h}$, 1), and that of $b_{1}^{d}$ is (1, 1). When $M$${\geq}$ 2, this model works as a classifier.

### 2.2 LSTM Cells

The cell structure of a standard LSTM is shown in Fig. 2. The LSTM cell has four gates: forget, input, cell, and output gates. Each gate consists of an activation function that is either a sigmoid ($\sigma $) or hyperbolic tangent (tanh) function, two weight matrices $W$ and $U$, and a bias$~ b.$ The gate outputs $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$, and $o_{t}$ and the cell outputs $h_{t}$ and $C_{t}$ can be expressed as:

First, $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$, and $o_{t}$ are calculated from Eq. (2) to (5). Next, the cell state $C_{t}$ is determined using $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$, and $C_{t-1}$ from Eq. (6). Finally, output $h_{t}$ is determined using $C_{t}$ and $o_{t}$ from Eq. (7). The calculation of Eq. (2) to (7) is carried out by matrix calculation. The size of each matrix is determined by $L_{h}$ in each gate. Since $x_{t}$ is one-dimensional, the sizes of the weight and bias matrices are as follows: ($L_{h}$, 1) for $i_{t}$, $f_{t}$, $\overset{˜}{C_{t}}$,$o_{t}$, $C_{t}$, and $b^{i,f,g,o}$; (1, $L_{h}$) for $W^{i,f,g,o}$; and ($L_{h}$, $L_{h}$) for $U^{i,f,g,o}$. Each gate from Eq. (2) to (5) has two types of activation functions: a sigmoid function $\sigma $ and hyperbolic tangent (tanh).

## 3. Method of Implementation for Accelerating FPGA-based LSTM

Based on the equations presented in Section 2, we reviewed the extent to which the algorithms can decrease the computation time in an FPGA-based LSTM. We also examined the amount of resource utilization with these algorithms. First, we explain how to simplify the activation functions. Then, we present methods for accelerating the calculation for inference in an FPGA-based LSTM.

### 3.1 Simplifying Activation Functions

The LSTM cells have two types of activation functions: sigmoid and hyperbolic tangent functions. We simplified the activation functions to decrease the computation time and reduce the amount of resource utilization. The normal sigmoid function $\sigma $ can be expressed as:

There is an exponential calculation and division in the sigmoid function that
require more complex computation than addition or multiplication. As an alternative,
the Keras library for NNs in Python ^{[19]} provides a hard sigmoid function that expresses the sigmoid function as a piecewise
linear approximation. The hard sigmoid function can be carried out using only multiplication
and addition, which has the advantage of reducing the amount of resource utilization
in an FPGA. The equation for the hard-sigmoid function $\sigma _{h}$ is given as follows:

The calculation results of the sigmoid and hard sigmoid functions are shown in Fig. 3. The hard sigmoid function has the smallest squared error with the sigmoid. Also, since the hard sigmoid function is available from the Keras library, it is possible to use it in the training phase. Therefore, the computation results are the same in both inference and training.

The hyperbolic tangent can be expressed as:

Since this function has four exponential calculations, the computation time and the amount of resource utilization are increased. Therefore, we implemented a simplified hyperbolic tangent as follows:

Here, we set variables $a$ and $b$ to $e^{x}$ and $e^{-x}$, respectively. This reduces the amount of exponential arithmetic because the results of the exponential arithmetic are stored in the variables.

### 3.2 Sequential Algorithm

A straightforward algorithm is a sequential algorithm in which Eq. (2) to (7) are calculated sequentially, as shown in Fig. 4. The operations of Eq. (2) to (5) require calculation of a product including the two-dimensional matrix $h_{t-1}U$, so the number of for loops in each operation is $L_{h}^{2}$. The number of loops in calculating Eqs. (6) or (7) is composed of only a one-dimensional matrix $\mathrm{L}_{\mathrm{h}}.$ Hence, the total number of loops of the LSTM cell is $\left(4L_{h}^{2}+2L_{h}\right)$.

### 3.3 Parallel and Pipelined Algorithm

A parallel algorithm can take full advantage of the characteristics of an FPGA. As mentioned, the number of required loops is $L_{h}^{2}$ for Eq. (2) to (5). Since the calculation of these equations is carried out independently, the operation can be parallelized, as shown in Fig. 5. More specifically, after the product of $h_{t-1}U$ and a sum of $x_{t}W$ and $\textit{b}$ are parallelized and calculated, the calculation of each function ($\sigma $ or tanh) is carried out in parallel. However, the calculation of Eqs. (6) and (7) cannot be parallelized because the result of Eq. (6) is used in Eq. (7). Therefore, the total number of loops with the parallel algorithm is $\left(L_{h}^{2}+2L_{h}\right)$.

The ratio of the total number of loops with the parallel algorithm to that with the sequential algorithm can be expressed as:

As $L_{h}$ increases, the ratio approaches 1/4. Since there is a linear relationship between the number of loops and the computation time, reducing the number of loops results in shorter computation time. Assuming that the time required for one loop is the same in the operations of Eq. (2) to (7), the computation time with the parallel algorithm can become approximately 1/4 that obtained with the sequential algorithm. In the parallel algorithm, the order of operations is different from that in the sequential algorithm, but the operations and parameters are not changed. Therefore, the amounts of resource utilization are almost the same.

In addition to the parallel algorithm, pipelining is effective for reducing computation time. In pipelining, the next calculation starts before the present calculation is completed. As shown in Fig. 6, pipelining was applied to the calculation of Eqs. (6) and (7) in parallel-pipelined algorithm A (PP-A). In parallel-pipelined algorithm B (PP-B), pipelining the calculation of $h_{t-1}U$ in Eq. (2) to (5) is applied to PP-A. However, pipelining increases the amount of resource utilization because circuit blocks for both the present and next operations are required.

### 3.4 Proposed Parallel-pipelined Unrolled Algorithm

Decreasing the number of loops is also effective for reducing the computation time. Therefore, we propose a parallel-pipelined and unrolled algorithm (PPU), where unrolling is applied to PP-B. The result of the present loop is not used in the next loop, so the loops can be unrolled and executed simultaneously. Fig. 6 also shows the flow of PPU in the case of factor $\textit{F}$ = 2 (the number of unrolled loops). The computation time is halved, but the resource utilization is doubled.

## 4. Experiment and Results

We performed an experiment to evaluate the computation time, resource utilization, and power consumption with each algorithm in an FPGA. In this section, we present the experimental procedure and results of the proposed algorithm on an FPGA and a high-end GPU and CPU.

### 4.1 Implementation of Proposed System

An NN model using LSTM and dense layers was implemented to regress microwave-sensor signals. The number of inputs for the time-series data was set to $N=50$, the number of components in a cell was set to $L_{h}=128$, and the number of outputs for dense layers was set to $M=1$. First, the NN model was constructed using the Keras library in Python (backend: TensorFlow). After that, weights and biases were determined in the training phase on a PC with a GPU.

As a result of training, the mean absolute error (MAE) and root mean squared error (RMSE) for the training data (1949 sets) were 0.069 and 0.101, respectively. The MAE and RMSE for the test data (1949 sets) were 0.100 and 0.143, respectively. Next, we developed a C-based program of the LSTM model for the FPGA-based inference using the proposed algorithms. All variables were set to a 32-bit floating-point type, and the weights and biases were preserved in the BRAM. The program was then converted into an HDL-based program using Xilinx Vivado HLS 2019.2.

Xilinx Vivado 2019.2 was used to implement the experimental system, including the FPGA-based NN model shown in Fig. 7. The clock frequency of the FPGA was set to 100 MHz. The resource utilization and power consumption were confirmed by the implementation report in the software. A ZYBO Z7-20 was used in the experiments, which is an FPGA board based on ZYNQ-7020 and includes a CPU (microcontroller) and FPGA. The CPU sends the input data of the NN model and receives its output data with the FPGA via AXI4-Lite. We also implemented the NN model using a ZYBO Z7-20 (ZYBO CPU), a CPU (optimization option: -O3) in C language, an NVIDIA RTX 2080 Ti (GPU), and an Intel Core i7 9700 (Intel CPU) with the Keras library in Python to compare the computation time and power consumption with the FPGA-based model.

### 4.2 Effect of Simplifying Activation Functions

We compared the experimental results of the computation time and resource utilization with sequential algorithms using normal and simplified activation functions. The comparison of computation time and power consumption is shown in Table 1. The simplified type reduced the power consumption by 5% and the computation time by 3% compared to the normal one. The amount of energy consumption (power$~ \times ~ $time) of the simplified type was also reduced by 8% compared to the normal one.

The results of the hardware-resource utilization are shown in Table 2. The resource utilization became about half as large as that of the normal type except for BRAM. These results demonstrate that simplifying the activation function is effective in reducing both resource consumption and computation time.

### 4.3 Comparison of Each Algorithm

Next, we compared the experimental results of the computation time and resource utilization of the sequential, parallel, parallel pipeline (PP-A, PP-B), and parallel-pipeline unrolled (PPU) algorithms. All the algorithms in this experiment used the simplified activation function. The comparison of computation time and power consumption is shown in Table 3.

The processing speed with the ZYBO CPU was 1.7 times as fast as that with the FPGA using the sequential algorithm because the clock frequency of the CPU (667~MHz) is higher than that of the FPGA (100 MHz), and the processing of the CPU is optimized by the compiler. The use of the parallel algorithm reduced the computation time by 74% compared to the FPGA-based sequential algorithm. This result is in good agreement with the theoretical value calculated from Eq. (8) in Section 3. The computation time of the parallel algorithm was reduced by 50% compared to the ZYBO CPU-based sequential algorithm.

The amount of energy consumption (power$~ \times ~ $time) of the parallel algorithm was reduced by 61% and 48% compared to the GPU and Intel CPU, respectively. The computation time of PP-A was reduced by about 3% compared to the parallel algorithm because the ratio of pipeline operations to the total number of operations is small. In PP-B, the computation time decreased by 52% compared to PP-A. This happened because the computation of the product of $h_{t-1}U$ occupies a large part of the calculation for LSTM computation.

In PPU, the computation time was reduced by 50% compared to PP-B. This result is in good agreement with the theoretical value in Section 3. By using a higher clock frequency (118 MHz), the computation time was reduced by 16% compared to the PPU using a 100-MHz clock. As a result, the amount of energy consumption of the PPU was reduced by 95% through the sequential algorithm. In addition, the amount of energy consumption of the PPU was reduced by 92% and 91% compared to the GPU and Intel CPU, respectively.

The results of the hardware-resource utilization with each algorithm are shown in Table 4. The resource utilization of the sequential without optimizing activation function was twice that of the sequential function with the optimizing activation function. This indicates that the resource utilization greatly depends on how the activation function is implemented. The resource utilization of the parallel algorithm was slightly lower than that of the sequential algorithm. The reason for the lower resource utilization is that the synthesis is optimized in the software.

Table 3. Computation time and power consumption for each chip.

Table 4. Comparison of resource utilization for each algorithm. Usage rates in ZYBO Z7-20 are shown.

The utilization of PP-B increased slightly compared to the utilization of the parallel algorithm. The utilization of PPU was about twice as high as that of PP-B. This result is in agreement with the theoretical value in Section 3. This is equivalent to the resource utilization of the sequential algorithm. The high utilization rates of BRAM in all algorithms are due to containing weights ($\textit{W}$ and $\textit{U}$) and biases ($\textit{b}$) in a BRAM region. Only nine BRAM regions were needed to calculate the NN model.

## 5. Discussion

In this section, we compare the performance of our system with that of previous works [16, 17, 20]. First, we examined the number of floating-point operations per second (flop/s) in PPU (118 MHz) as a performance index of the computation speed. When a dataset of 50 points was input to the proposed NN consisting of the LSTM and dense layers, the total number of the floating-point operations was 6.758 Mflop. The computation time of PPU (118 MHz) was 18.76 ms, as shown in Table 3. Therefore, the performance index of the computation speed was 0.360~Gflop/s.

The comparison of implementations of the LSTM models on an FPGA is shown in Table 5, where [op] is a unit for the number of fixed-point operations. We compared the proposed
system using PPU with a previous system adopting the same floating-point arithmetic
^{[16]}. We found that the energy efficiency of the previous system was about twice as high
as that of the proposed method, but the resource consumption of the LUT was 22 times
larger, that of the FF was 25 times larger, and that of the DSP was 17 times larger.
The high resource utilization and power consumption of the previous system are due
to the fact that the experiments were conducted with three layers of LSTM with 250
hidden layers. Therefore, even when we compared the resource utilization of the proposed
system with three-layer LSTM and the previous system, we found that the previous system
used more resources than ours.

The energy efficiency of our system with the PPU (118~MHz) was about 1.24 times
higher than that of the previous system ^{[17]}, even though the clock frequency was about 1.2 times higher (142 MHz) and the calculation
was carried out with fixed-point arithmetic. Furthermore, the resource utilization
of FF in our system with PPU was smaller than in the previous system, and the resource
utilization of LUT and DSP was almost the same. The reason why the resource utilization
was slightly larger than the previous system’s is that the PPU performs floating-point
operations. Since the power consumption and the number of fixed-point bits are not
specified in the previous system ^{[20]}, we cannot compare its efficiency with that of our system. However, the resource
utilization and operating frequency of the proposed system are lower than those of
the previous system, so we can conclude that the power consumption is greatly reduced.

Table 5. Comparison of implementations for LSTM models on FPGA.

## 6. Conclusion

In this paper, we proposed algorithms to improve the energy consumption of LSTM by modifying the inference algorithm in an FPGA. Experimental results showed that using our parallel algorithms could reduce the computation time by 74% compared to the FPGA-based sequential algorithm without increasing resource utilization. In addition, the amount of energy consumption with the proposed algorithm combining parallel, pipelined, and unrolled processing was reduced by 95% compared to the sequential algorithm. Furthermore, the amount of energy consumption with PPU was reduced by 92% and 91% compared to that with a GPU and Intel CPU, respectively.

### REFERENCES

## Author

Ukyo Yoshimura received the B.E. degree in Electronic Systems Engi-neering from the University of Shiga Prefecture, Shiga, Japan in 2019. Since the same year, he has enrolled a master's course Graduate school of Engineering in the University of Shiga Prefecture. His research interests include FPGA based systems, embedded systems, and machine learning.

Toshiyuki Inoue received the B.S., M.S. and Ph.D. degrees in Electrical Electronic and Information Engi-neering from Osaka University, Osaka, Japan in 2010, 2012 and 2015, respectively. He joined the Department of Electronic Systems Engineering, the University of Shiga Prefecture, in 2017, and has been an Assistant Professor since 2017. His research interests include RF circuits for wireless communication, wireless sensor networks, radio-over-fiber technique and optoelectronics. Dr. Inoue is a member of the IEEE, the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan, and the Japan Society of Applied Physics (JSAP). He received the Paper Award in 2013 from IEICE.

Akira Tsuchiya received the B.E., M.E. and Ph.D. degrees in Communi-cations and Computer Engineering from Kyoto University, Kyoto, Japan, in 2001, 2003, and 2005, respectively. Since 2005, he has been an Assistant Professor in the Department of Communications and Computer Engineering, Graduate School of Informatics, Kyoto university. Since 2017, he has been an Associate Professor in the Department of Electronic Systems Engineering, the University of Shiga Prefecture, Shiga, Japan. His research interest includes modeling and design of on-chip passive components of high-frequency CMOS, and high-speed analog circuit design. He is a member of the IEEE, IEICE and IPSJ.

Keiji Kishine received the B.S., and M.S. degrees in engineering science from Kyoto University, Kyoto, Japan, and Ph.D. degree in informatics from Kyoto University, Kyoto, Japan, in 1990, 1992, and 2006, respectively. In 1992, he joined the Electrical Communication Laboratories, Nippon Telegraph and Telephone Corporation (NTT), Tokyo, Japan. He has been engaged in research and design of high-speed, low-power circuits for Gb/s LSIs using Si-bipolar transistors, with application to optical communication systems in NTT System Electronics Laboratories, Kanagawa, Japan. From 1997, he has been worked on research and development of over Gb/s Clock and Data Recovery IC at Network Service Innovation Laboratory in NTT Network Innovation Laboratories, Kanagawa, Japan. He worked at Ubiquitous Interface Laboratory in NTT Microsystems Integration Laboratories, Kanagawa, Japan. Since 2008, he had been an Associate Professor, and now he is a Professor with the school of engineering, the University of Shiga prefecture, Shiga, Japan. Dr. Kishine is a member of the IEEE Solid-State Circuits Society (SSCS) and Circuits and Systems (CAS), the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan, the Institute of Electrical Engineers (IEE) of Japan.