In this paper, we propose a deep learning-based die-to-die wafer inspection system, which is composed of an encoder-decoder-based twin network (Siamese network). In contrast to other deep learning-based wafer inspection methods, the proposed method takes golden and test die images as input and compares them to detect different areas as defects. In addition, we apply Bayesian learning to improve the performance of the proposed twin network. We verified the performance of the proposed method through experiments using patterned wafer images, which confirmed that the performance could be improved by applying Bayesian learning.


## 1. Introduction

The semiconductor industry has been rapidly growing, and semiconductors are widely
used in various products, such as mobile phones, computers, tablets, and automobiles
^{[1,2]}. In the semiconductor fabrication process, wafers are the core material, and integrated
circuit (IC) dies are arranged on the wafer in a grid pattern ^{[1]}. To ensure the quality of semiconductor products, it is essential to inspect the
dies of the wafer during the manufacturing process ^{[2,3]}.

Typical wafer die inspection methods are based on a machine vision-based golden
template method that exploits the repetitive pattern of the dies
on the wafer ^{[2-5]}. These methods first generate a golden die image by combining multiple die images
in various ways ^{[2-5]}. Then, they calculate the pixel-wise difference between the golden die image and
the test die image. Finally, they detect defects by applying post-processing
techniques to the difference image, such as thresholding or morphological operations
^{[2-4]}. These inspection methods are called die-to-die inspection.
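The three steps of the golden template method can be illustrated with a minimal sketch. It assumes a pixel-wise median as the way of combining die images and a single fixed threshold as the post-processing step; both choices, and the names `build_golden` and `difference_defect_map`, are illustrative assumptions rather than the procedure of any cited method.

```python
from statistics import median


def build_golden(dies):
    """Combine aligned die images into one golden die via a pixel-wise median.

    The median is an assumption for illustration; the cited methods combine
    die images in various ways.
    """
    rows, cols = len(dies[0]), len(dies[0][0])
    return [
        [median(die[r][c] for die in dies) for c in range(cols)]
        for r in range(rows)
    ]


def difference_defect_map(golden, test, threshold):
    """Mark pixels whose absolute difference from the golden die exceeds threshold."""
    return [
        [1 if abs(t - g) > threshold else 0 for g, t in zip(g_row, t_row)]
        for g_row, t_row in zip(golden, test)
    ]


# Three aligned 3x3 die images; the third one has a single bright defect pixel.
clean = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
noisy = [[11, 9, 10], [10, 51, 10], [9, 10, 11]]
defective = [[10, 10, 10], [10, 50, 90], [10, 10, 10]]

golden = build_golden([clean, noisy, defective])
defect_map = difference_defect_map(golden, defective, threshold=20)
```

Because the comparison is purely pixel-wise, any misalignment or brightness shift between the images shows up directly in the difference image.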

However, machine vision-based die-to-die inspection methods are vulnerable to
registration error and intensity-level variation between die images. In addition,
they rely heavily on an expert to extract hand-crafted features ^{[6]} and tune various parameters for an inspection system. Several deep learning-based
wafer inspection methods have been investigated to automatically extract optimal features
for wafer inspection and provide better performance ^{[6-10]}. Most deep learning-based methods use convolutional neural networks (CNNs) to classify
defect types ^{[8-10]} or detect defect locations ^{[6,7]}. However, these are not based on die-to-die inspection because they inspect the wafers
using only the test image without information from the golden image.

We believe that the die-to-die inspection using the characteristics of repetitive
die patterns can improve the performance of deep learning-based wafer inspection methods.
To the best of our knowledge, there are no studies on deep learning-based die-to-die
inspection. Therefore, we propose a deep learning-based comparison system for wafer
die-to-die inspection. The proposed method is based on a twin network (Siamese network)
composed of a convolutional encoder-decoder ^{[11]}. It takes a golden die image and a test die image as input and detects the regions
that differ between the two images as defects. To alleviate the performance degradation
problem caused by registration error, we use only one die image as the golden die
image instead of combining multiple die images. In addition, the proposed method detects
defects using optimal features automatically extracted for die-to-die inspection in
the training stage instead of directly calculating the difference between the golden
die and test die images.

Furthermore, we improve the performance of the twin network by applying a Bayesian
learning technique. Recently, many studies on Bayesian learning have been conducted
in segmentation and classification fields ^{[12-14]}. These studies showed that the performance of the networks with Bayesian learning
is better than that of conventional networks.

The idea of Bayesian learning is to interpret network weights as random variables
and compute the posterior distribution of network weights given training data ^{[15]}. The posterior distribution allows us to calculate the distribution of prediction
by marginalizing over network weights ^{[15]}. Through marginalization, Bayesian learning may prevent overfitting and improve the
performance of the network ^{[15]}. However, it is impractical to compute the posterior distribution of network weights
exactly, so techniques to approximate the posterior distribution are often used, such
as variational inference ^{[16]}.

Recently, a study developed a theoretical framework that casts dropout in machine
learning as approximate Bayesian learning ^{[15]}. The study showed that a neural network trained with dropout is an approximate Bayesian
neural network ^{[15,17]}. Moreover, some studies showed that a Bayesian neural network may measure the uncertainty
of a trained model and reduce the overconfidence problem in classification tasks ^{[15,18,19]}.

To verify the usefulness of the proposed method for die-to-die inspection, we
compare its performance with that of a conventional wafer inspection method ^{[7]}. The conventional method is based on an encoder-decoder network with a single input
image ^{[7]}, so for a fair comparison, we modified it so that the encoder-decoder network uses
golden and test images as input, as shown in Fig. 1. We call this modified method the conventional method in this paper. To the best
of our knowledge, this is the first attempt to apply a twin network and Bayesian learning
to wafer die-to-die inspection.

The remainder of this paper is organized as follows. In Section 2, we explain two important aspects of the proposed method in detail: the twin network and the Bayesian twin network. We present a detailed description of our dataset in Section 3. Experimental results and conclusions are presented in Sections 4 and 5.

## 2. Proposed Method

### 2.1 Twin Network

A twin network is a neural network that contains two or more sub-networks ^{[11]}. The network takes multiple images as input and extracts feature vectors by processing each image
separately through its own sub-network. Then, it computes the distance between the extracted feature
vectors to measure the similarity between the input images ^{[11]}. The main advantage of a twin network is weight sharing between sub-networks ^{[11]}. Because the weights are shared, a twin network can identify whether similar features
exist, so it can measure the similarity between multiple images. Recently, twin networks
with a convolutional encoder-decoder structure have been applied to various
tasks, such as video object segmentation ^{[20]} and change detection ^{[21-23]}. These methods also share the weights between encoders ^{[20-23]}.

As shown in Fig. 2, the proposed network consists of two encoders and one decoder, and the weights of
the two encoders are shared. The network takes the golden and test die images, extracts
feature maps with the two sub-encoders, and concatenates the extracted feature maps. The merged
feature maps are used as the decoder input, and the network finally outputs the regions that differ
between the golden and test die images as defects. To detect small defects accurately,
we also apply a difference skip-connection ^{[24]}, which computes the absolute difference between the feature maps of the two encoders' convolution
layers and then transfers the difference values to the decoder, as shown in Fig. 2.
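The roles of weight sharing, concatenation, and the difference skip-connection can be illustrated with a toy sketch. It replaces each encoder with a single fixed 3 x 3 filter followed by ReLU, which is an assumption made purely for illustration; the function names `conv3x3_relu` and `twin_features` are hypothetical, and the real network uses several trained layers and a decoder that are not modeled here.

```python
def conv3x3_relu(image, kernel):
    """Valid 3x3 cross-correlation followed by ReLU (one toy encoder layer)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        row = []
        for c in range(cols - 2):
            acc = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(3)
                for j in range(3)
            )
            row.append(max(0.0, acc))
        out.append(row)
    return out


def twin_features(golden, test, kernel):
    """Encode both images with the SAME kernel (weight sharing), then merge."""
    f_golden = conv3x3_relu(golden, kernel)
    f_test = conv3x3_relu(test, kernel)
    # Channel-wise concatenation: this is what the decoder would receive.
    concatenated = [f_golden, f_test]
    # Difference skip-connection: absolute difference of the two feature maps.
    diff_skip = [
        [abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(f_golden, f_test)
    ]
    return concatenated, diff_skip


kernel = [[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]]
golden = [[1.0] * 4 for _ in range(4)]
test = [row[:] for row in golden]
test[1][1] = 5.0  # a single bright defect pixel

_, diff_identical = twin_features(golden, golden, kernel)
_, diff_defect = twin_features(golden, test, kernel)
```

With shared weights, two identical inputs produce identical feature maps, so the difference passed along the skip-connection is exactly zero, while a local change in the test image yields a non-zero response at the corresponding position. This only describes the skip path; what the decoder does with the concatenated features still depends on training.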

### 2.2 Bayesian Twin Network

A Bayesian neural network models network weights as random variables instead
of fixed values and computes the posterior distribution of network weights given training
data. Using a posterior distribution, the network may predict a target value of test
data by marginalizing over network weights $\mathbf{W}$ given training data $\mathbf{X}$
and its corresponding label set $\mathbf{Y}$, which is defined as follows ^{[15]}:

##### (1)

$p\left(\boldsymbol{y}^{\boldsymbol{*}}|\boldsymbol{x}^{\boldsymbol{*}},\mathbf{X},\mathbf{Y}\right)=\int p\left(\boldsymbol{y}^{\boldsymbol{*}}|\boldsymbol{x}^{\boldsymbol{*}},\mathbf{W}\right)p\left(\mathbf{W}|\mathbf{X},\mathbf{Y}\right)d\mathbf{W}$

where $\boldsymbol{x}^{\boldsymbol{*}}$ is test data, $\boldsymbol{y}^{\boldsymbol{*}}$ is the prediction for the test data, and $p\left(\mathbf{W}|\mathbf{X},\mathbf{Y}\right)$ represents the posterior distribution of network weights. The network weights $\mathbf{W}$ are composed of $L$ layers $\left\{\mathbf{W}_{i}\right\}_{i=1}^{L}$, where $\mathbf{W}_{i}$ is a matrix of dimensions $K_{i}\times K_{i-1}$, and $K_{i}$ is the number of units in layer $i$.

In the prediction stage, the integral above is not tractable, so it is usually
approximated by Monte Carlo integration using samples drawn from the posterior distribution.
Also, it is impractical to compute the posterior distribution exactly, so one needs
to approximate the posterior distribution by applying variational inference techniques
^{[16]}. These techniques approximate the posterior distribution with a variational distribution
$q\left(\mathbf{W}\right)$, from which samples can be easily drawn. This is done by
minimizing the Kullback-Leibler (KL) divergence between $q\left(\mathbf{W}\right)$
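and the posterior distribution. The KL divergence is defined as follows ^{[15]}:

##### (2)

$\mathrm{KL}\left(q\left(\mathbf{W}\right)\|p\left(\mathbf{W}|\mathbf{X},\mathbf{Y}\right)\right)\propto -\int q\left(\mathbf{W}\right)\log p\left(\mathbf{Y}|\mathbf{X},\mathbf{W}\right)d\mathbf{W}+\mathrm{KL}\left(q\left(\mathbf{W}\right)\|p\left(\mathbf{W}\right)\right)$

A previous study showed that a model trained with dropout is an approximate Bayesian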
network by defining the variational distribution $q\left(\mathbf{W}_{i}\right)$ for
every layer $i$ with units $j$ as follows ^{[15]}:

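##### (3)

$\mathbf{W}_{i}=\mathbf{M}_{i}\cdot \mathrm{diag}\left(\left[\mathrm{b}_{i,j}\right]_{j=1}^{K_{i-1}}\right),\quad \mathrm{b}_{i,j}\sim \mathrm{Bernoulli}\left(p_{i}\right)\ \text{for}\ i=1,\ldots ,L,\ j=1,\ldots ,K_{i-1}$

where $\mathrm{b}_{i,j}$ is a Bernoulli-distributed random variable with probability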
$p_{i}$, and $\mathbf{M}_{i}$ is a variational parameter to be optimized. The integral
in Eq. (2) can be approximated by summing over Monte-Carlo samples drawn from the
variational distribution $q\left(\mathbf{W}\right)$ as follows ^{[15]}:

##### (4)

$-\int q\left(\mathbf{W}\right)\log p\left(\mathbf{Y}|\mathbf{X},\mathbf{W}\right)d\mathbf{W}\approx -\frac{1}{T}\sum _{t=1}^{T}\sum _{n=1}^{N}\log p\left(\boldsymbol{y}_{n}|\boldsymbol{x}_{n},\hat{\mathbf{W}}_{n,t}\right)$

where $N$ is the number of training data points, $T$ is the number of Monte Carlo samples, and $\hat{\mathbf{W}}_{n,t}$ is a Monte Carlo sample from the variational distribution $q\left(\mathbf{W}\right)$.

The first term approximated with a single Monte Carlo sample is identical to
the cross-entropy loss of a model trained with dropout ^{[15]}. In addition, the second term can be approximated as $L2$ regularization ^{[15]}. Therefore, minimizing a loss function composed of the cross-entropy loss and $L2$
regularization is approximately equivalent to minimizing KL divergence ^{[15]}.
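Written out with a single Monte Carlo sample per training example, the objective described above takes roughly the following form. Here $\lambda$ is an assumed weight-decay coefficient that stands in for constants depending on the dropout probability and the prior, which are not specified in this paper, and bias terms are omitted for brevity.

```latex
\mathcal{L}_{\mathrm{dropout}}
  = -\sum_{n=1}^{N}\log p\left(\boldsymbol{y}_{n}|\boldsymbol{x}_{n},\hat{\mathbf{W}}_{n}\right)
  + \lambda \sum_{i=1}^{L}\left\|\mathbf{M}_{i}\right\|_{2}^{2}
```

The first sum is the cross-entropy loss evaluated with dropout applied, and the second is the $L2$ penalty on the variational parameters $\mathbf{M}_{i}$.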

We trained the twin network with dropout to obtain the Bayesian twin network. After training, we sampled the network weights 50 times from the approximate posterior distribution by applying dropout at the test stage. We used the mean of the 50 resulting predictions as our final prediction and their variance as a measure of model uncertainty.
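The sampling procedure can be sketched on a toy model. The example below assumes a single linear unit with dropout on its inputs instead of the twin network, and the names `stochastic_forward` and `mc_dropout_predict` are hypothetical; it only illustrates how the mean and variance over repeated stochastic forward passes are obtained.

```python
import random
from statistics import mean, pvariance


def stochastic_forward(x, weights, drop_rate, rng):
    """One forward pass of a toy linear unit with dropout kept ON at test time."""
    keep = 1.0 - drop_rate
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:        # Bernoulli mask for this input unit
            total += xi * wi / keep    # inverted-dropout scaling
    return total


def mc_dropout_predict(x, weights, drop_rate=0.1, num_samples=50, seed=0):
    """Return (mean, variance) of num_samples stochastic forward passes."""
    rng = random.Random(seed)
    samples = [
        stochastic_forward(x, weights, drop_rate, rng) for _ in range(num_samples)
    ]
    return mean(samples), pvariance(samples)


x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
prediction, uncertainty = mc_dropout_predict(x, w)
```

The defaults `drop_rate=0.1` and `num_samples=50` mirror the values reported in this paper. In the segmentation setting, the same mean and variance would be computed per pixel over the softmax outputs.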

## 3. The Dataset

We used patterned wafer data acquired from the Vega facility of the ATI company for this study. We added synthetic defects to the patterned wafer data. The generated synthetic defect images contain defects of various shapes, sizes, and intensity values. Fig. 3(a) shows the patterned wafer data, Fig. 3(b) shows one of the die images in the patterned wafer data, and Fig. 3(c) shows example images of synthetic defects with red boxes indicating the synthetic defects.

Since the size of the die image in Fig. 3(b) is 10,000 $\times $ 10,000, we cropped it into 200 $\times $ 200 sub-die images. We extracted the golden and test sub-die images as follows. First, we selected two different die images (10,000 $\times $ 10,000) as the golden and test die images and obtained test sub-die images (200 $\times $ 200) by tiling the test die image. Then, we extracted the golden sub-die images through a template-matching technique using the test sub-die images as templates.
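As a quick check of the arithmetic, the sketch below enumerates tile positions under the assumption that the tiles do not overlap; the paper does not state the stride, and `tile_origins` is a hypothetical helper name.

```python
def tile_origins(image_size=10_000, tile_size=200):
    """Top-left (row, col) corners of non-overlapping tiles covering a square image."""
    steps = range(0, image_size - tile_size + 1, tile_size)
    return [(r, c) for r in steps for c in steps]


origins = tile_origins()
num_tiles = len(origins)  # 50 tiles per side, so 50 * 50 positions
```

Under this assumption, one die image yields 2,500 sub-die positions.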

We set the search area for matching using the location information used for cropping the test die image. This prior information can be used due to the characteristics of the repetitive die patterns on the wafer image, as shown in Fig. 3(a). We measured the similarity between the golden and test sub-die images using the mean absolute error (MAE) and performed template matching in sub-pixel units. We refer to each golden and test sub-die set as a pair-set.
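A minimal sketch of this matching step is given below. It assumes an exhaustive search at integer-pixel offsets inside a square window around the expected location; the sub-pixel refinement used in this work is not reproduced, and the names `match_template`, `crop`, and `radius` are hypothetical.

```python
def mean_absolute_error(a, b):
    """MAE between two equally sized 2-D images given as lists of rows."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def crop(image, top, left, size):
    """Square crop of side `size` with its top-left corner at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def match_template(golden, template, top, left, radius):
    """Search a (2*radius+1)^2 window around (top, left) for the lowest-MAE crop.

    Returns (best_mae, best_top, best_left); offsets are integer pixels only.
    """
    size = len(template)
    best = None
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = top + dr, left + dc
            if r < 0 or c < 0 or r + size > len(golden) or c + size > len(golden[0]):
                continue  # skip windows that fall outside the golden die
            score = mean_absolute_error(crop(golden, r, c, size), template)
            if best is None or score < best[0]:
                best = (score, r, c)
    return best


# Toy golden die with a unique value at every pixel.
golden_die = [[r * 8 + c for c in range(8)] for r in range(8)]
# The test sub-die actually corresponds to position (3, 2) in the golden die,
# but the cropping location suggests (2, 2) as the starting guess.
template = crop(golden_die, 3, 2, 3)
best = match_template(golden_die, template, top=2, left=2, radius=2)
```

Restricting the search to a small window around the known tile location keeps the cost low and avoids false matches with the same pattern in a neighboring repetition.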

Fig. 4 shows two pair-set examples with red boxes indicating the defects. Fig. 4(a) shows golden sub-die images, Fig. 4(b) shows test sub-die images, and Fig. 4(c) shows the ground truth images indicating the different areas between the two sub-die images. We collected 26,379 pair-set images (15,376 defective images and 11,003 defect-free images) for training and 21,168 pair-set images (12,283 defective images and 8,885 defect-free images) for testing. The total number of defects in the test set was 14,554. To verify the comparison performance of the twin network, we also collected 21,168 identical pair-set images composed of the same sub-die images, as shown in Fig. 5. The golden and test sub-die images are identical, so networks that compare the input images should detect no defects for this case, as shown in Fig. 5(c).

## 4. Experiments

We investigated two methods: twin network and Bayesian twin network methods. We verified the usefulness of the twin network for die-to-die inspection by implementing the conventional wafer inspection method, as shown in Fig. 1. We compared the performance of the twin network with that of the conventional method. In addition, we compared the performance of the Bayesian twin network and the twin network to verify that Bayesian learning could improve the network performance. We conducted experiments using the patterned wafer data and attempted to make the number of parameters of each method similar.

### 4.1 Training and Inference

The twin network had two encoders with shared weights and one decoder, as shown in Fig. 2, while the conventional method had one encoder and a corresponding decoder. The encoder networks of all methods had four convolutional layers and max-pooling layers. The decoder networks of all methods had four transposed convolutional layers and five convolutional layers.

We used batch normalization and activation layers after every convolutional layer and transposed convolutional layer. All models used the same number of filters and filter sizes. The numbers of filters in each encoder-convolutional layer were 16, 32, 64, and 128, and those in each decoder-convolutional layer were 128, 64, 32, 16, and 2. The filter size was $3\times 3$.

We implemented all methods using the PyTorch library ^{[25]} and an NVIDIA GeForce GTX 1070 GPU (Nvidia Corporation, USA). In addition, all models
were trained using the Adam optimizer with an initial learning rate of $1\times 10^{-3}$,
a batch size of 64, and a weight decay of $1\times 10^{-5}$. We also randomly selected
20% of the training data as validation data in every epoch to monitor the performance
of the models and used the validation data for early stopping. For the Bayesian twin
network, we inserted dropout after every max-pooling layer of the encoders except
for the first layer and after every convolutional layer of the decoder except for
the last two layers. We applied dropout with a rate of 0.1 and obtained 50 Monte
Carlo samples at the test stage.

### 4.2 Performance Measurement

We used precision and recall to measure the performance of the models and computed them in terms of the number of connected objects in the prediction images obtained by thresholding softmax scores. We define precision as follows:

$\mathrm{Precision}=\frac{TP}{TP+FP}$

where true positive ($TP$) indicates the number of connected objects correctly inspected as defects by comparing the input sub-die images. $FP$ is the number of connected objects incorrectly inspected as defects (false positives). We define recall as follows:

$\mathrm{Recall}=\frac{TP}{TP+FN}$

where $FN$ denotes the number of actual defects not detected (false negatives). Higher precision and recall indicate better performance.
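The object-level counting can be sketched as follows. The matching rule is an assumption made for illustration, since it is not spelled out here: a predicted object counts as a true positive if it overlaps any ground-truth pixel, and a ground-truth object counts as a false negative if no predicted pixel touches it. The use of 8-connectivity and the function names are likewise assumptions.

```python
from collections import deque


def label_objects(mask):
    """Return a list of pixel sets, one per 8-connected object in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or (r, c) in seen:
                continue
            queue = deque([(r, c)])
            seen.add((r, c))
            pixels = set()
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                pixels.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            objects.append(pixels)
    return objects


def precision_recall(pred_mask, truth_mask):
    """Object-level precision and recall under the overlap rule described above."""
    pred_objects = label_objects(pred_mask)
    truth_objects = label_objects(truth_mask)
    pred_pixels = set().union(*pred_objects) if pred_objects else set()
    truth_pixels = set().union(*truth_objects) if truth_objects else set()
    tp = sum(1 for obj in pred_objects if obj & truth_pixels)
    fp = len(pred_objects) - tp
    fn = sum(1 for obj in truth_objects if not (obj & pred_pixels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
pred = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
precision, recall = precision_recall(pred, truth)
```

In this toy case one of the two predicted objects overlaps a real defect and one of the two real defects is missed, so both precision and recall are 0.5.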

### 4.3 Experimental Results

We evaluated the performance of the conventional method and the proposed method using the patterned wafer data. To ensure a fair comparison, we set the threshold individually for each method. The recall and precision of each network are reported in Tables 1 and 2.

Table 1 summarizes the precision of each method when the recall of each method is the same. The twin network shows better performance than the conventional method in the sense that it exhibited higher precision when the recall is the same. The precision of the twin network was about 0.39 percentage points higher than that of the conventional method.

Table 1. Performance of all methods (the recall of each method is the same).

We also compared the performance of the Bayesian twin network and the twin network to confirm whether Bayesian learning could improve the performance of the twin network. The Bayesian twin network achieved the best performance, and its precision was about 1.85 percentage points higher than that of the twin network.

Table 2 summarizes the recall of each method when the precision of each method is very similar. As shown in the table, the recall of the twin network was about 0.47 percentage points higher than that of the conventional method. In addition, the recall of the Bayesian twin network was about 0.33 percentage points higher than that of the twin network.

Table 2. Performance of all methods (the precision of each method is very similar).

Tables 1 and 2 confirm that the twin network performs better than the conventional method. We attribute this to the twin network focusing more on image comparison than the conventional method does. The conventional method merges the two sub-die images and treats them as different color channels. In contrast, the twin network processes the two sub-die images separately through encoders with shared weights and then merges the two branches. We think that the shared weights allow the twin network to identify more effectively whether similar features exist in the two input images.

We also confirmed that Bayesian learning could improve the performance of the twin network. We suspect that this is because Bayesian learning may prevent overfitting through marginalization over the network weights. Moreover, Bayesian learning may measure the uncertainty of a trained model, as shown in Fig. 6. We think that providing the uncertainty of the model predictions can help increase trust in the twin network-based wafer die-to-die inspection system.

We also evaluated the performance of all methods using identical pair-set images to further verify that the structure of the twin network focuses more on image comparison. Table 3 shows the test results of each network using identical pair-set images. Identical pair-set images are composed of the same sub-die images, so comparison networks should detect no defects.

Table 3. Performance of three methods using identical pair-set images.

| Methods | Number of false positives |
|---|---|
| The conventional method | 1,722 |
| Twin network | 0 |
| Bayesian twin network | 0 |

As shown in Table 3, the twin network and the Bayesian twin network correctly inspected all identical pair-set images. On the other hand, the number of false positives of the conventional method was 1,722. Fig. 7 shows the prediction results of each method using identical pair-set images. Table 3 and Fig. 7 confirm that the structure of the twin network is more suitable for die-to-die inspection.

Fig. 7. Prediction results of the conventional method, twin network, and Bayesian twin network using identical pair-set images.

Table 4 shows the inference time of each method. This was computed using the same hardware described in Section 4.1. The inference time of the conventional method was about 8.04 milliseconds shorter than that of the twin network. We think it is because the structure of the twin network is composed of two encoders, whereas the network structure of the conventional method is composed of one encoder. As expected, the Bayesian twin network was the slowest because it requires multiple inferences at test time. Although the Bayesian twin network requires more computation than other methods, we think it is useful because it could improve the performance of the twin network and measure model uncertainty. One may conceive of a method to reduce the amount of computation while maintaining the advantages of Bayesian learning, which is a subject for future study.

## 5. Conclusion

We have proposed an encoder-decoder-based twin network for die-to-die wafer inspection. The proposed twin network is composed of two sub-encoder networks that share weights and one decoder network. In contrast to conventional deep learning-based wafer inspection methods, the proposed method takes the golden sub-die image and the test sub-die image as input and compares them to detect different areas between two input images as defects. Furthermore, we applied Bayesian learning to improve the performance of the twin network. We verified the usefulness of the proposed method in experiments using patterned wafer data with synthetic defects.

### ACKNOWLEDGMENTS

The authors are grateful to ATI Co., Ltd. in Incheon, Korea for providing the patterned wafer data. This work was supported by a grant from the ATI company and by the Ewha Womans University scholarship of 2019.

### REFERENCES

## Author

Eunjeong Choi received her B.S. and M.S. degrees in Electronic and Electrical Engineering from Ewha Womans University in Seoul, Korea, in 2016 and 2018, respectively. She is currently pursuing a Ph.D. degree in electronics engineering at Ewha Womans University in Seoul, Korea. Her research interests include digital signal processing and machine learning for machine vision.

Jeongtae Kim received his B.S. and M.S. degrees in Control and Instrumentation Engineering from Seoul National University, Seoul, Korea, in 1989 and 1991, respectively. From 1991 to 1998, he worked for Samsung Electronics in Korea, where he was engaged in the development of digital camcorders and digital TVs. He received his Ph.D. degree in Electrical Engineering and Computer Science from the University of Michigan, Ann Arbor in 2004. Since 2004, he has been with the Department of Electronic and Electrical Engineering at Ewha Womans University in Seoul, Korea, where he is currently a professor. His research interests include statistical signal processing, image restoration, image reconstruction, and machine learning.