Moonsoo Kim1, Joohan Yi1, Hyun Kim2*, and Hyuk-Jae Lee1

1 Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
{kimms213, jhyi, hyuk_jae_lee}@capp.snu.ac.kr

2 Department of Electrical and Information Engineering and Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul, Korea
hyunkim@seoultech.ac.kr
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Non-volatile memory, Phase-change memory, Read disturbance errors, On-demand scrubbing
1. Introduction
A modern computer system requires large amounts of main memory owing to its multi-core
structure and complex applications. In particular, data-intensive applications such
as big data and deep learning require a large amount of main memory to support large
amounts of data [1-3]. As a result, the need for large-capacity main memory with low power consumption
and high reliability has become important [4,5], and studies on the use of phase-change memory (PCM) as main memory have been actively
conducted [6-8]. The cell size of PCM is smaller than that of DRAM, so modules can be denser, enabling a larger memory capacity [9]. Furthermore, owing to its non-volatile characteristics, PCM is more advantageous than DRAM in terms of power efficiency and data retention time.
Despite these advantages, PCM suffers from low reliability, which needs to be
addressed in order to use PCM as main memory. One of the main causes of reliability
issues in PCM is read disturbance errors (RDEs) [10,11]. An RDE is a phenomenon whereby cells that are repeatedly read are damaged by thermal
energy. RDEs occur when the number of reads exceeds a certain threshold. A conventional
solution for RDEs is to scrub the cells in a word before the number of reads reaches
the threshold. Memory scrubbing first reads a word, corrects bit errors with error-correcting
code (ECC), and writes the corrected word back to the same location. Periodically
scrubbing a word therefore prevents RDEs in that word.
However, periodic scrubbing requires read counters, which results in significant
resource overhead because the number of reads must be counted in order to trigger
scrubbing. In this paper, on-demand memory scrubbing that does not require read counters
is proposed. Under the given RDE model with ECC, the probability distribution of the
number of errors that occur with an additional read is derived. By using the derived
probability distribution, the proposed solution suggests whether to scrub or not based
on the current number of errors. Because the proposed solution only requires the number
of errors, it does not need read counters, thereby eliminating nearly 1GB of storage overhead for a 64GB PCM. The contributions of this paper are summarized as follows.
· A probabilistic model for RDEs is mathematically derived, and the optimal on-demand
scrubbing policy is derived from the proposed model.
· Monte-Carlo (MC) simulation is conducted to verify the probabilistic model.
· The proposed on-demand scrubbing eliminates more than 1GB of storage needed for read counters,
while fixing more than 99.99% of RDEs.
The remainder of this paper is organized as follows. Section 2 introduces the
background, and Section 3 presents the proposed on-demand scrubbing method. In Section
4, experimental results are given. Finally, Section 5 concludes the paper.
2. Background
In this section, the background for RDE error models and RDE mitigating schemes
is presented.
2.1 Error Models for RDEs
Typically, a counter-based error model is used for RDEs [9]. Under this model, each cell has an RDE threshold, and when the number of reads reaches
the threshold, an RDE occurs. The RDE threshold values follow a Gaussian distribution,
$N\left(m,\sigma ^{2}\right)$. For the following discussion, $\textit{m}$ = 3,000 and various $\sigma $ values are assumed.
The word size for PCM typically ranges from 64B to 256B [7]. In this paper, we assume 128B words. For ECC, a 176-symbol Reed-Solomon code that can correct up to 21 symbols is assumed. The 1,408 cells in a word (176 symbols of 8 bits each) are assumed to have independent RDE threshold values, each modeled by the Gaussian distribution above.
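To make the assumed cell-level model concrete, the following minimal sketch (the variable names and seed are illustrative assumptions, not from the original) samples the 1,408 per-cell thresholds of one word and counts how many cells have been disturbed after a given number of reads:

```python
import numpy as np

M, SIGMA = 3000, 10        # assumed mean and standard deviation of the RDE threshold
CELLS_PER_WORD = 1408      # 176 ECC symbols x 8 bits, as assumed in the text

rng = np.random.default_rng(0)
# Each cell gets an independent Gaussian RDE threshold.
thresholds = rng.normal(M, SIGMA, size=CELLS_PER_WORD)

def disturbed_cells(k):
    """Number of cells whose RDE threshold has been reached after k reads."""
    return int(np.sum(k >= thresholds))

print(disturbed_cells(2980))   # e.g., cells already in error after 2,980 reads
```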
2.2 RDE Mitigating Schemes
To mitigate RDE occurrences, a memory scrubbing method is used [12]. In conventional methods, each word utilizes a read counter. If the counter value
reaches a certain threshold, the method reads a whole word and checks for errors via
ECC. If errors are found, they are corrected and the word is rewritten. This read-and-fix
process is called memory scrubbing. Conventional memory scrubbing, which uses a read
counter, can remove RDEs effectively as long as the scrubbing threshold value is well
chosen. However, as shown in Fig. 1, it requires a read counter per word, which means an extra 2B of storage must be
allocated per word. Taking into account that the word size in PCM is typically between
64B and 256B, the extra storage takes about 1/32 to 1/128 of the total PCM capacity.
Moreover, these read counters are updated frequently, and thus, DRAM should be used
for them. Assuming 512GB of PCM capacity and a 128B word size, nearly 8GB of DRAM
is used only for read counters, which is a significantly large overhead.
Fig. 1. Diagram of PCM and its read counters.
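For reference, a minimal sketch of the conventional counter-based trigger described above is given below; the threshold value and all names are illustrative assumptions, not taken from the original.

```python
SCRUB_THRESHOLD = 2900   # illustrative: must lie safely below the mean RDE threshold of 3,000

class CounterScrubber:
    """Conventional scheme: one 2-byte read counter per word (the overhead shown in Fig. 1)."""
    def __init__(self, num_words):
        self.read_counters = [0] * num_words      # the DRAM-resident counters this paper eliminates

    def on_read(self, word_addr, memory):
        self.read_counters[word_addr] += 1
        if self.read_counters[word_addr] >= SCRUB_THRESHOLD:
            memory.scrub(word_addr)               # read, ECC-correct, and write back the word
            self.read_counters[word_addr] = 0
```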
3. On-demand Memory Scrubbing
In this section, an on-demand memory scrubbing method that effectively eliminates
read counters is described. First, the probability distribution under the Gaussian
counter-based error model is derived, and then, an efficient on-demand scrubbing policy
under the given probability distribution is suggested.
3.1 Probability Distribution for the Number of Errors
Let $\textit{L}$ denote the number of errors, $\textit{K}$ the number of reads, and $\textit{T}$ the RDE threshold. The first probability to derive is
$e_{k}$, which is the probability that an error occurs when the number of reads is
$\textit{k}$ $\left(K=k\right)$. $e_{k}$ is derived as follows:
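Under the Gaussian threshold model, (1) can be written as

$$ e_{k}=P\left(T\leq k\right)=P\left(k\geq N\left(m,\sigma ^{2}\right)\right) \qquad (1) $$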
Because errors occur when the number of reads exceeds the RDE threshold, the
first equality in (1) holds. The second equality follows from the Gaussian modeling of $\textit{T}$. It should
be noted that $P\left(k\geq N\left(m,\sigma ^{2}\right)\right)$ can be easily calculated
from the normal distribution table.
When $K=k$, the probability of $L$ being $l$ follows the binomial distribution $B\left(l;176,e_{k}\right)$. More specifically, each of the 176 symbols is in error with probability $e_{k}$, so $L$ has a binomial distribution as follows:
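In symbols, (2) can be written as

$$ P\left(L=l\,|\,K=k\right)=B\left(l;176,e_{k}\right)=\binom{176}{l}e_{k}^{\,l}\left(1-e_{k}\right)^{176-l} \qquad (2) $$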
However, we are interested in probability $P(k|l)$, not $P(l|k)$, because the
value that can be observed is $\textit{l}$, not $\textit{k}$. By using Bayes’s rule,
$P(k|l)$ can be derived as follows:
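Applying Bayes's rule, (3) can be written as

$$ P\left(k\,|\,l\right)=\frac{P\left(l\,|\,k\right)P\left(k\right)}{\sum _{k'}P\left(l\,|\,k'\right)P\left(k'\right)}, \qquad (3) $$

where $P(k)$ is the prior over the read count; with an uninformative (flat) prior, the $P(k)$ terms cancel.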
This means that when the observed number of errors is $\textit{l}$, the hidden
number of reads, $\textit{k,}$ has the probability distribution derived from (3).
Lastly, probability $P(L'=l'|L=l)$ is derived. $L'$ represents the number of
errors when an additional read operation is performed. Therefore, probability $P(l'|l)$
is the probability that the number of errors becomes \textit{l$^{\prime}$}, starting from $\textit{l}$, after one additional read. It is derived as follows:
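Combining (3) with a binomial model for the additional errors, (4) can be written as

$$ P\left(l'\,|\,l\right)=\sum _{k}P\left(k\,|\,l\right)\,B\left(\left(l'-l\right);176-l,e_{k+1}\right) \qquad (4) $$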
$B\left(\left(l'-l\right);176-l,e_{k+1}\right)$ calculates the probability that
the number of additional errors that occur from one more read is $\left(l'-l\right)$.
Because the number of reads is now \textit{k+}1, $e_{k+1}$ is used as the error probability.
By summing the term over $\textit{k,}$ the desired $P\left(l'|l\right)$ is calculated.
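A minimal numerical sketch of the derivation in this subsection is given below. It assumes NumPy/SciPy, a flat prior over the hidden read count $\textit{k}$ (our assumption), and the parameter values from Section 2.1; function names are ours.

```python
import numpy as np
from scipy.stats import norm, binom

M, SIGMA = 3000, 10       # assumed RDE-threshold distribution
SYMBOLS = 176             # ECC symbols per word
K_MAX = 3300              # read counts considered (covers the bulk of N(3000, sigma^2))

ks = np.arange(1, K_MAX + 1)
e_k = norm.cdf(ks, loc=M, scale=SIGMA)            # (1): e_k = P(T <= k)

def p_l_given_k(l):
    # (2): probability of l erroneous symbols after k reads, for every k at once
    return binom.pmf(l, SYMBOLS, e_k)

def p_k_given_l(l):
    # (3): Bayes's rule with a flat prior over k (our assumption)
    lik = p_l_given_k(l)
    return lik / lik.sum()

def p_lp_given_l(l):
    # (4): distribution of the error count l' after one additional read
    post_k = p_k_given_l(l)
    e_next = norm.cdf(ks + 1, loc=M, scale=SIGMA)  # e_{k+1}
    lps = np.arange(l, SYMBOLS + 1)
    extra = np.array([binom.pmf(lps - l, SYMBOLS - l, e) for e in e_next])
    return lps, post_k @ extra                     # weight each k by its posterior
```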
3.2 On-demand Scrubbing Policy
So far, probability distribution $P\left(l'|l\right)$ has been calculated with (4). This distribution is used to derive the on-demand scrubbing policy. Assuming that
ECC can correct up to 21 symbols, scrubbing must be done before the number of errors
exceeds 21. This means that probability $P\left(L'>21|L=l\right)$ must be small enough
when the scrubbing is done. It can be calculated as follows:
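Summing (4) over error counts beyond the ECC limit, this probability can be written as

$$ P\left(L'>21\,|\,L=l\right)=\sum _{l'=22}^{176}P\left(l'\,|\,l\right) \qquad (5) $$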
Table 1 shows the value of $P\left(L'>21|L=l\right)$ for various values of $\sigma $. When
$\sigma =10$ and $L=6$, the probability is 6.5E-7%. This means that when the current
number of errors is 6, the probability that the number of errors becomes larger than
21 due to one more read operation is 6.5E-7%. It is obvious that the probability must
be very small to prevent an uncorrectable error (UE). Therefore, performing scrubbing
at a small value for $\textit{L}$ is always better in terms of removing UEs. However,
a small L value causes too-frequent scrubbing. Table 2 shows the expected value for the number of reads ($\textit{K}$) with respect to $\textit{L}$.
We can see that $\textit{K}$ increases as $\textit{L}$ increases. Moreover, the difference
in the value is relatively large when $\textit{L}$ is small (from 0 to 6). On the
other hand, when $\textit{L}$ is relatively large, the difference becomes smaller.
To sum up, it is best to delay scrubbing while the UE probability is small enough,
but when the $\textit{L}$ value becomes relatively large, delaying scrubbing does not give much benefit, because the difference in the $\textit{K}$ value is minor.
For example, when the probability goal is 99.999%, which means the UE probability
should be less than 0.001%, the selected $\textit{L}$ values that trigger scrubbing
are 7, 10, and 13 when $\sigma $ is 10, 20, and 50, respectively. The average reads
at the selected points are 2984, 2969, and 2920, respectively.
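Building on the sketch at the end of Section 3.1, the selection rule described above can be expressed as follows; reading the rule as "scrub at the largest error count whose one-more-read violation probability still meets the goal" is our interpretation of the text.

```python
ECC_LIMIT = 21    # maximum correctable symbols (Section 2.1)
GOAL = 1e-5       # 0.001% allowed UE probability, i.e., a 99.999% reliability goal

def violation_prob(l):
    # Probability that one more read pushes the error count past the ECC limit
    lps, dist = p_lp_given_l(l)
    return dist[lps > ECC_LIMIT].sum()

# Largest observed error count L at which one more read still meets the goal;
# scrubbing is triggered once L reaches this value.
scrub_trigger = max(l for l in range(ECC_LIMIT + 1) if violation_prob(l) <= GOAL)
print(scrub_trigger)
```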
The proposed on-demand scrubbing provides a significant hardware overhead reduction
because it does not require read counters. As mentioned above, a read counter for
a single word takes up 2B. Assuming a word size of 128B, 1/64 of the total PCM capacity
must be allocated to read counters. For example, if the PCM capacity is 64GB, the read counters require a total of 1GB of storage. This means that using on-demand
scrubbing significantly reduces storage overhead by about 1/64 of the total capacity.
Table 1. The probability of violation with respect to L.

L      σ=10        σ=20        σ=50
0-6    <6.5E-7%    <4.8E-7%    <2.4E-8%
7      2.9E-4%     6.1E-6%     1.6E-7%
8      5.0E-3%     2.0E-5%     1.1E-6%
9      0.03%       8.4E-5%     8.7E-6%
10     0.08%       2.8E-4%     2.2E-5%
11     0.31%       3.4E-3%     9.1E-5%
12     0.82%       0.02%       3.3E-4%
Table 2. The expected value of K with respect to L.

L      σ=10         σ=20         σ=50
0-6    2956-2982    2913-2963    2783-2908
7      2983         2965         2912
8      2983         2967         2915
9      2984         2968         2917
10     2984         2969         2918
11     2984         2970         2919
12     2985         2970         2920
4. Simulation Results
In this section, MC simulation results regarding the proposed on-demand memory
scrubbing are presented. For MC simulations, RDE threshold value $\textit{T}$ for
each bit in a word was randomly sampled from the Gaussian distribution, $N\left(3000,~
\sigma ^{2}\right)$. Under the generated $\textit{T}$ values in a word, the number
of errors ($\textit{L}$) was tracked as the read count ($\textit{K}$) increased.
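A minimal sketch of this MC procedure is given below. The trial count is reduced for a quick run, the mapping from cell errors to symbol errors (a symbol is counted as erroneous once any of its eight cells is disturbed) is our assumption, and all names are illustrative.

```python
import numpy as np
from collections import defaultdict

M, SIGMA = 3000, 10           # threshold distribution under test
SYMBOLS, BITS = 176, 8        # 176 symbols x 8 cells = 1,408 cells per word
ECC_LIMIT = 21                # correctable symbols
TRIALS = 100_000              # reduced from 1,000,000 for a quick run

rng = np.random.default_rng(0)
violated = defaultdict(int)   # error count L just before the violating read -> #cases
read_sum = defaultdict(int)   # L -> sum of the read counts at which violation occurred

for _ in range(TRIALS):
    # A symbol becomes erroneous once any of its cells is disturbed,
    # so its effective threshold is the minimum of its cell thresholds.
    sym_t = np.sort(rng.normal(M, SIGMA, size=(SYMBOLS, BITS)).min(axis=1))
    k_viol = int(np.ceil(sym_t[ECC_LIMIT]))                  # first read with > 21 symbol errors
    l_before = int(np.searchsorted(sym_t, k_viol - 1, side="right"))
    violated[l_before] += 1
    read_sum[l_before] += k_viol

for l in sorted(violated):
    n = violated[l]
    print(l, n, 100.0 * n / TRIALS, read_sum[l] / n)          # L, VR, Prob. of VR (%), ART
```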
Table 3 shows the MC simulation results with respect to various standard deviation values.
The number of MC trials was 1,000,000. The first column of Table 3 shows the number of errors in the current state. From the current state, if one more
read causes a violation ($\textit{i.e.}$, the number of errors exceeds 21), the case
is counted as a $\textit{violated read}$. The second, fifth, and eighth columns in
Table 3 show the number of violated reads for standard deviations of 10, 20, and 50, respectively.
For example, when the standard deviation was 10, the violated read value was 4 for
an $\textit{L}$ of 7. This means that in these four cases, the error number changed
directly from 7 to a value larger than 21. The columns labeled Probability of VR show
the ratio of violated reads with respect to the total number of trials. The column
Average Read Threshold represents the average read counts among the violations. For
example, when the standard deviation was 10, the average read threshold was 2,973
for an $\textit{L}$ of 7. This means that the average read count over four violations
was 2,973.
From the MC simulation results in Table 3, two observations can be drawn. First, the distribution of violated reads changed
with respect to the standard deviation. As the standard deviation of the underlying
Gaussian model increased, the violations occurred later. In other words, when the standard deviation was relatively large, more errors could be tolerated before scrubbing. The
other observation is that the average read threshold value remained almost the same
while $\textit{L}$ increased. This means that even if more errors are accumulated before scrubbing in order to reduce the scrubbing frequency, the actual reduction in frequency is negligible. From this observation, the optimal on-demand scrubbing point
for $\textit{L}$ is the point where a violation first occurs. Therefore, the optimal
on-demand scrubbing points for each standard deviation value for $\textit{L}$ are
7, 10, and 13, respectively.
It should be noted that the MC simulation results presented in Table 3 verify the probability distribution described in Section 3.1. The probabilities of
violations derived from the probability distributions shown in Table 1 give values similar to those in Table 3. For the average read threshold value drawn from the probability distribution in
Table 2, the results verify that the value does not increase significantly when $\textit{L}$
is greater than 6.
In addition to RDEs, there are other reliability issues in PCM, such as write disturbance errors (WDEs). For example, if the current number of errors is 8, it is possible that four of them are RDEs and four are WDEs. Under this circumstance, consider a case where the proposed on-demand scrubbing policy initiates scrubbing. The policy assumes that all eight errors were caused by RDEs, whereas in this case only four were. Therefore, the probability that an additional read
will cause an ECC violation is even less when other errors are considered. In summary,
the proposed approach, which only considers RDEs, is a conservative approach, and
other errors do not compromise the reliability of the proposed on-demand scrubbing.
Table 3. MC simulation results showing violated read and average read threshold values.
(VR: violated reads; Prob. VR: probability of VR (%); ART: average read threshold)

       N(3000, 10^2)              N(3000, 20^2)              N(3000, 50^2)
L      VR       Prob. VR   ART    VR       Prob. VR   ART    VR       Prob. VR   ART
0-6    0        0          0      0        0          0      0        0          0
7      4        0.0004     2973   0        0          0      0        0          0
8      37       0.0037     2973   0        0          0      0        0          0
9      197      0.0197     2973   0        0          0      0        0          0
10     706      0.0706     2973   3        0.0003     2952   0        0          0
11     2462     0.2462     2973   25       0.0025     2949   0        0          0
12     7604     0.7604     2973   160      0.016      2948   0        0          0
13     18692    1.8692     2973   866      0.0866     2948   3        0.0003     2881
14     40413    4.0413     2973   3669     0.3669     2949   47       0.0047     2876
15     73756    7.3756     2973   12916    1.2916     2949   432      0.0432     2875
16     115931   11.5931    2973   38812    3.8812     2949   2880     0.288      2876
17     157267   15.7267    2973   95337    9.5337     2949   17173    1.7173     2876
18     186598   18.6598    2974   187397   18.7397    2949   78581    7.8581     2876
19     198617   19.8617    2974   293767   29.3767    2949   267358   26.7358    2877
20     197716   19.7716    2974   367048   36.7048    2950   633526   63.3526    2877
5. Conclusion
To efficiently address RDEs, this paper proposes an on-demand scrubbing policy
that does not require read counters. Assuming the memory capacity of PCM is 64GB,
the per-word read counters take nearly 1GB of storage in total, which creates a significant resource overhead. In the proposed method, more than 99.99% of read disturbance errors can be fixed without this resource overhead. From a mathematically derived probability
distribution model of RDE occurrence, the optimal on-demand scrubbing policy is derived with respect to the standard deviation of the RDE threshold values. MC simulation
results also verified the derived probability distribution model. It should be noted
that the probability of preventing violations can be increased by scrubbing more often.
This means that there is a trade-off between reliability and performance, and the
user can adaptively select the configuration according to the application.
ACKNOWLEDGMENTS
This paper was supported in part by the Technology Innovation Program (10080613,
DRAM/PRAM heterogeneous memory architecture and controller IC design technology research
and development) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea)
and in part by the Basic Science Research Program through the National Research Foundation
of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A6A1A03032119.
REFERENCES
Kim B., et al., Apr. 2020, PCM: Precision-Controlled Memory System for Energy Efficient
Deep Neural Network Training, 2020 Design, Automation & Test in Europe Conference
& Exhibition (DATE)
Nguyen D. T., Hung N. H., Kim H., Lee H.-J., May 2020, An Approximate Memory Architecture for Energy Saving in Deep Learning Applications, in IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 5, pp. 1588-1601
Lee C., Lee H., Feb. 2019, Effective Parallelization of a High-Order Graph Matching
Algorithm for GPU Execution, in IEEE Transactions on Circuits and Systems for Video
Technology, Vol. 29, No. 2, pp. 560-571
Kim M., Choi J., Kim H., Lee H., 1 Oct. 2019, An Effective DRAM Address Remapping
for Mitigating Rowhammer Errors, in IEEE Transactions on Computers, Vol. 68, No. 10,
pp. 1428-1441
Kim M., Chang I., Lee H., 2019, Segmented Tag Cache: A Novel Cache Organization for
Reducing Dynamic Read Energy, in IEEE Transactions on Computers, Vol. 68, No. 10,
pp. 1546-1552
Lee H., Kim M., Kim H., Kim H., Lee H., 2019, Integration and boost of a read-modify-write
module in phase change memory system, IEEE Transactions on Computers, Vol. 68, No.
12, pp. 1772-1784
Lee B. C., Ipek E., Mutlu O., Burger D., Architecting phase change memory as a scalable
dram alternative, in Proceedings of the 36th Annual International Symposium on Computer
Architecture, ser. ISCA ’09.
Qureshi M. K., Srinivasan V., Rivers J. A., 2009, Scalable high performance main memory system using phase-change memory technology, in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09)
Wong H.-S. P., Raoux S., Kim S., Liang J., Reifenberg J. P., Rajendran B., Asheghi
M., Goodson K. E., 2010, Phase change memory, Proceedings of the IEEE, Vol. 98, No.
12, pp. 2201-2227
Nair P. J., Chou C., Rajendran B., Qureshi M. K., 2015, Reducing read latency of phase
change memory via early read and Turbo Read, 2015 IEEE 21st International Symposium
on High Performance Computer Architecture (HPCA), Burlingame, CA
Rashidi S., Jalili M., Sarbazi-Azad. H., Improving MLC PCM Performance through Relaxed
Write and Read for Intermediate Resistance Levels, ACM Trans. Archit. Code Optim.
15, 1, Article 12 (April 2018), 31 pages.
Awasthi M., Shevgoor M., Sudan K., Rajendran B., Balasubramonian R., Srinivasan V.,
2012, Efficient scrub mechanisms for error-prone emerging memories, IEEE International
Symposium on High Performance Computer Architecture (HPCA), New Orleans, LA
Author
Moonsoo Kim received B.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, Korea, in 2014 and 2020, respectively. In 2020, he joined the Inter-University Semiconductor Research Center at Seoul National University, Seoul, Korea, as a post-doctoral researcher. His research interests include SoC design for video/image applications, and low-power, reliable design of the memory hierarchy.
Joohan Yi received the B.S. degree in electrical engineering from Korea University, Seoul, Korea, in 2018. He is currently working toward the integrated M.S. and Ph.D. degrees in electrical and computer engineering at Seoul National University, Seoul, Korea. His research interests include the memory hierarchy, deep neural network processors, and image processing for robot systems.
Hyun Kim received the B.S., M.S. and Ph.D. degrees in Electrical Engineering and
Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011 and 2015,
respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development
for IT, Seoul National University, Seoul, Korea, as a Research Professor. In 2018,
he joined the Department of Electrical and Information Engineering, Seoul National
University of Science and Technology, Seoul, Korea, where he is currently working
as an Assistant Professor. His research interests are in the areas of algorithms, computer architecture, and SoC design for low-complexity multimedia applications.
Hyuk-Jae Lee received B.S. and M.S. degrees in Electronics Engineering from Seoul
National University, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in
Electrical and Computer Engineering from Purdue University at West Lafayette, Indiana,
in 1996. From 1998 to 2001, he worked at the Server and Workstation Chipset Division
of Intel Corporation in Hillsboro, Oregon as a senior component design engineer. From
1996 to 1998, he was on the faculty of the Department of Computer Science of Louisiana
Tech University at Ruston, Louisiana. In 2001, he joined the School of Electrical
Engineering and Computer Science at Seoul National University, Korea, where he is
currently working as a Professor. He is a founder of Mamurian Design, Inc., a fabless
SoC design house for multimedia applications. His research interests are in the areas
of computer architecture and SoC design for multimedia applications.