1. (Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, Korea {eunjeong_choi, jtkim}@ewha.ac.kr )

## 1. Introduction

The semiconductor industry has been rapidly growing, and semiconductors are widely used in various products, such as mobile phones, computers, tablets, and automobiles [1,2]. In the semiconductor fabrication process, wafers are the core material, and integrated circuit (IC) dies are arranged on the wafer in a grid pattern [1]. To ensure the quality of semiconductor products, it is essential to inspect the dies of the wafer during the manufacturing process [2,3].

Typical wafer die inspection methods use a machine vision-based golden template approach that exploits the repetitive pattern of the dies on the wafer [2-5]. These methods first generate a golden die image by combining multiple die images in various ways [2-5]. Then, they calculate the pixel-wise difference between the golden die image and the test die image. Finally, they detect defects by applying post-processing techniques, such as image thresholding or morphological operations, to the difference image [2-4]. These inspection methods are called die-to-die inspection.
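The golden template pipeline described above can be illustrated with a minimal sketch. The pixel-wise median used to build the golden die and the fixed threshold are assumptions chosen for illustration; the cited methods combine die images and post-process the difference image in various other ways, and the function names are hypothetical.

```python
from statistics import median

def golden_die(dies):
    """Combine aligned die images into a golden die by a pixel-wise median
    (an assumed combination rule, used here only for illustration)."""
    rows, cols = len(dies[0]), len(dies[0][0])
    return [[median(d[r][c] for d in dies) for c in range(cols)]
            for r in range(rows)]

def detect_defects(golden, test, threshold):
    """Return a binary mask: 1 where |test - golden| exceeds the threshold."""
    return [[1 if abs(t - g) > threshold else 0
             for g, t in zip(g_row, t_row)]
            for g_row, t_row in zip(golden, test)]

# Three toy 3x3 reference dies with small intensity variation.
dies = [
    [[10, 10, 10], [10, 50, 10], [10, 10, 10]],
    [[11, 10, 9], [10, 51, 10], [10, 11, 10]],
    [[10, 9, 10], [9, 50, 11], [10, 10, 10]],
]
# Test die with one bright defect at row 2, column 2.
test_die = [[10, 10, 10], [10, 50, 10], [10, 10, 90]]

golden = golden_die(dies)
mask = detect_defects(golden, test_die, threshold=20)
print(mask)  # [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
```

Because the difference is computed pixel by pixel, any misalignment between the golden and test dies shows up directly in the mask, which is the registration sensitivity discussed next.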

However, machine vision-based die-to-die inspection methods are vulnerable to registration errors and intensity-level variations between die images. In addition, they rely heavily on an expert to extract hand-crafted features [6] and to tune the parameters of the inspection system. Several deep learning-based wafer inspection methods have been investigated to automatically extract optimal features for wafer inspection and provide better performance [6-10]. Most deep learning-based methods use convolutional neural networks (CNNs) to classify defect types [8-10] or detect defect locations [6,7]. However, these methods are not die-to-die inspection methods because they inspect the wafer using only the test image, without information from the golden image.

We believe that die-to-die inspection, which exploits the repetitive die patterns, can improve the performance of deep learning-based wafer inspection methods. To the best of our knowledge, there are no studies on deep learning-based die-to-die inspection. Therefore, we propose a deep learning-based comparison system for wafer die-to-die inspection. The proposed method is based on a twin network (Siamese network) composed of a convolutional encoder-decoder [11]. It takes golden and test die images as input and compares them to detect the regions that differ between the two input images as defects. To alleviate the performance degradation caused by registration errors, we use only one die image as the golden die image instead of combining multiple die images. In addition, instead of directly calculating the difference between the golden and test die images, the proposed method detects defects using features that are automatically learned for die-to-die inspection in the training stage.

Fig. 1. A conventional wafer inspection system.

Furthermore, we improve the performance of the twin network by applying a Bayesian learning technique. Recently, many studies on Bayesian learning have been conducted in segmentation and classification fields [12-14]. These studies showed that the performance of the networks with Bayesian learning is better than that of conventional networks.

The idea of Bayesian learning is to interpret network weights as random variables and compute the posterior distribution of network weights given training data [15]. The posterior distribution allows us to calculate the distribution of prediction by marginalizing over network weights [15]. Through marginalization, Bayesian learning may prevent overfitting and improve the performance of the network [15]. However, it is impractical to compute the posterior distribution of network weights exactly, so techniques to approximate the posterior distribution are often used, such as variational inference [16].

Recently, a study developed a theoretical framework that casts dropout in machine learning as approximate Bayesian learning [15]. The study showed that a neural network trained with dropout is an approximate Bayesian neural network [15,17]. Moreover, some studies showed that a Bayesian neural network may measure the uncertainty of a trained model and reduce the overconfidence problem in classification tasks [15,18,19].

To verify the usefulness of the proposed method for die-to-die inspection, we compare its performance with that of a conventional wafer inspection method [7]. The conventional method is based on an encoder-decoder network with a single input image [7], so for a fair comparison, we modified it so that the encoder-decoder network uses golden and test images as input, as shown in Fig. 1. We call this modified method the conventional method in this paper. To the best of our knowledge, this is the first attempt to apply a twin network and Bayesian learning for wafer die-to-die inspection.

The remainder of this paper is organized as follows. In Section 2, we explain two important aspects of the proposed method in detail: the twin network and the Bayesian twin network. We present a detailed description of our dataset in Section 3. Experimental results and conclusions are presented in Sections 4 and 5.

## 2. Proposed Method

### 2.1 Twin Network

A twin network is a neural network that contains two or more sub-networks [11]. The network takes multiple images as input and extracts feature vectors by passing each image through its own sub-network. It then computes the distance between the extracted feature vectors to measure the similarity between the input images [11]. The main advantage of a twin network is that the sub-networks share weights [11]. Because the weights are shared, a twin network can identify whether similar features exist in the inputs, so it can measure the similarity between multiple images. Recently, twin networks composed of a convolutional encoder-decoder structure have been applied to various tasks, such as video object segmentation [20] and change detection [21-23]. These methods also share the weights between encoders [20-23].

As shown in Fig. 2, the proposed network consists of two encoders and one decoder, and the weights in the two encoders are shared. This network takes the golden and test die images, extracts the feature maps from sub-encoders, and concatenates extracted feature maps. The merged feature maps are used as the decoder input, and the network finally outputs the different areas between the golden and test die images as defects. To detect small defects accurately, we also apply a difference skip-connection [24], which computes the absolute difference of the feature maps between encoder-convolution layers and then transfers the difference values to the decoder, as shown in Fig. 2.
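The weight sharing, feature concatenation, and difference skip-connection described above can be sketched on a toy 1-D signal. This is not the network of Fig. 2: the kernels, the signal length, and the helper names `conv1d_relu`, `encode`, and `twin_features` are assumptions chosen so that the sketch runs with the Python standard library only.

```python
def conv1d_relu(signal, kernel):
    """Valid 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(signal[i + j] * kernel[j] for j in range(k)))
            for i in range(len(signal) - k + 1)]

def encode(signal, kernels):
    """Apply a stack of kernels and return the feature map of every layer."""
    features = []
    for kernel in kernels:
        signal = conv1d_relu(signal, kernel)
        features.append(signal)
    return features

def twin_features(golden, test, kernels):
    """Run both inputs through the SAME kernels (shared weights)."""
    f_golden = encode(golden, kernels)
    f_test = encode(test, kernels)
    # Decoder input: concatenation of the deepest feature maps.
    merged = f_golden[-1] + f_test[-1]
    # Difference skip-connections: |golden - test| at every encoder layer.
    skips = [[abs(g - t) for g, t in zip(fg, ft)]
             for fg, ft in zip(f_golden, f_test)]
    return merged, skips

kernels = [[1.0, -1.0], [0.5, 0.5]]   # hypothetical shared encoder weights
golden = [1, 1, 1, 1, 1, 1]
test = [1, 1, 1, 4, 1, 1]             # one deviating sample at index 3

_, skips_defect = twin_features(golden, test, kernels)
_, skips_same = twin_features(golden, golden, kernels)
print(skips_defect[0])  # [0.0, 0.0, 0.0, 3.0, 0.0]
print(skips_same[0])    # [0.0, 0.0, 0.0, 0.0, 0.0]
```

With shared weights, identical inputs always produce identical feature maps, so every difference skip-connection is exactly zero. A network that stacks the two images as channels of a single encoder has no such built-in guarantee.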

### 2.2 Bayesian Twin Network

A Bayesian neural network models network weights as random variables instead of fixed values and computes the posterior distribution of network weights given training data. Using a posterior distribution, the network may predict a target value of test data by marginalizing over network weights $\mathbf{W}$ given training data $\mathbf{X}$ and its corresponding label set $\mathbf{Y}$, which is defined as follows [15]:

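##### (1)
$p\left(\boldsymbol{y}^{\boldsymbol{*}}|\boldsymbol{x}^{\boldsymbol{*}},\mathbf{X},\mathbf{Y}\right)=\int p\left(\boldsymbol{y}^{\boldsymbol{*}}|\boldsymbol{x}^{\boldsymbol{*}},\mathbf{W}\right)p\left(\mathbf{W}|\mathbf{X},\mathbf{Y}\right)d\mathbf{W}$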

Fig. 2. Twin network-based wafer die-to-die inspection system.

where $\boldsymbol{x}^{\boldsymbol{*}}$ is test data, $\boldsymbol{y}^{\boldsymbol{*}}$ is the prediction for the test data, and $p\left(\mathbf{W}|\mathbf{X},\mathbf{Y}\right)$ represents the posterior distribution of network weights. The network weights $\mathbf{W}$ are composed of $L$ layers $\left\{\mathbf{W}_{i}\right\}_{i=1}^{L}$, where $\mathbf{W}_{i}$ is a matrix of dimensions $K_{i}\times K_{i-1}$, and $K_{i}$ is the number of units for each layer $i$.

In the prediction stage, the integral above is not tractable, so it is usually approximated by Monte Carlo integration using samples drawn from the posterior distribution. Also, it is impractical to compute the posterior distribution exactly, so one needs to approximate the posterior distribution by applying variational inference techniques [16]. These techniques approximate the posterior distribution to some variational distribution $q\left(\mathbf{W}\right)$, from which samples can be easily drawn. This is done by minimizing the Kullback-Leibler (KL) divergence between $q\left(\mathbf{W}\right)$ and the posterior distribution. The KL divergence is defined as follows [15]:

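##### (2)
$D_{KL}\left(q\left(\mathbf{W}\right)||p\left(\mathbf{W}|\mathbf{X},\mathbf{Y}\right)\right)=-\int q\left(\mathbf{W}\right)\log p\left(\mathbf{Y}|\mathbf{X},\mathbf{W}\right)d\mathbf{W}+D_{KL}\left(q\left(\mathbf{W}\right)||p\left(\mathbf{W}\right)\right)$,

which holds up to the additive constant $\log p\left(\mathbf{Y}|\mathbf{X}\right)$ that does not depend on $q\left(\mathbf{W}\right)$.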

A previous study showed that a model trained with dropout is an approximate Bayesian network by defining the variational distribution $q\left(\mathbf{W}_{i}\right)$ for every layer $i$ with units $j$ as follows [15]:

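##### (3)
$\mathbf{W}_{i}=\mathbf{M}_{i}\cdot \mathrm{diag}\left(\left[\mathrm{b}_{i,j}\right]_{j=1}^{K_{i-1}}\right),\quad \mathrm{b}_{i,j}\sim \mathrm{Bernoulli}\left(p_{i}\right)\ \text{for}\ i=1,\ldots ,L,\ j=1,\ldots ,K_{i-1}$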

where $\mathrm{b}_{i,j}$ is a Bernoulli-distributed random variable with probability $p_{i}$, and $\mathbf{M}_{i}$ is a variational parameter to be optimized. The integral in Eq. (2) can be approximated by summing over Monte-Carlo samples drawn from the variational distribution $q\left(\mathbf{W}\right)$ as follows [15]:

##### (4)
$-\int q\left(\mathbf{W}\right)\log p\left(\mathbf{Y}|\mathbf{X},\mathbf{W}\right)d\mathbf{W}\approx -\frac{1}{T}\sum _{t=1}^{T}\sum _{n=1}^{N}\log p\left(\boldsymbol{y}_{n}|\boldsymbol{x}_{n},\hat{\mathbf{W}}_{n,t}\right)$,

where $N$ is the number of training samples, $T$ is the number of Monte Carlo samples, and $\hat{\mathbf{W}}_{n,t}$ is a Monte Carlo sample from the variational distribution $q\left(\mathbf{W}\right)$.

The first term of Eq. (2), approximated with a single Monte Carlo sample ($T=1$), is identical to the cross-entropy loss of a model trained with dropout [15]. In addition, the second term, the KL divergence between $q\left(\mathbf{W}\right)$ and the prior $p\left(\mathbf{W}\right)$, can be approximated as $L2$ regularization [15]. Therefore, minimizing a loss function composed of the cross-entropy loss and $L2$ regularization is approximately equivalent to minimizing the KL divergence [15].
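The correspondence between the variational objective and dropout training can be made concrete with a small sketch. It draws weights as in Eq. (3), by zeroing columns of $\mathbf{M}_{i}$ with Bernoulli variables, and evaluates a single-sample cross-entropy plus an $L2$ penalty for a one-layer softmax classifier. The toy data, the interpretation of $p_{i}$ as the probability that $\mathrm{b}_{i,j}=1$, and the names `sample_weights` and `dropout_loss` are assumptions made for illustration.

```python
import math
import random

def sample_weights(M, keep_prob, rng):
    """Draw W = M . diag(b), b_j ~ Bernoulli(keep_prob): zeroes whole columns."""
    n_in = len(M[0])
    b = [1 if rng.random() < keep_prob else 0 for _ in range(n_in)]
    return [[m * bj for m, bj in zip(row, b)] for row in M]

def softmax(z):
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def dropout_loss(M, data, keep_prob, weight_decay, rng):
    """Single-sample (T = 1) cross-entropy under sampled weights + L2 penalty."""
    nll = 0.0
    for x, y in data:
        W = sample_weights(M, keep_prob, rng)   # one weight sample per data point
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
        nll -= math.log(softmax(logits)[y])
    l2 = weight_decay * sum(m * m for row in M for m in row)
    return nll + l2

# Hypothetical variational parameters (2 classes, 3 inputs) and toy data.
M = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1]]
data = [([1.0, 0.0, 2.0], 0), ([0.0, 1.0, 1.0], 1)]
loss = dropout_loss(M, data, keep_prob=0.9, weight_decay=1e-5,
                    rng=random.Random(0))
print(loss)
```

Minimizing this quantity with respect to `M` by stochastic gradient descent is what ordinary dropout training with weight decay does, which is why no separate inference machinery is needed.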

We trained the twin network with dropout for the Bayesian twin network. After training, we sampled the network weights 50 times from the approximate posterior distribution using dropout. We used the mean of these samples as our prediction and used variance to measure model uncertainty.
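The test-time procedure can be sketched as follows: dropout stays active, the forward pass is repeated, and the mean and variance of the sampled outputs serve as the prediction and the uncertainty. The toy logistic model, the inverted-dropout scaling, and the names `stochastic_forward` and `mc_dropout_predict` are assumptions; only the dropout rate of 0.1 and the 50 samples follow the settings reported in this paper.

```python
import math
import random
from statistics import mean, pvariance

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_forward(x, weights, drop_rate, rng):
    """One forward pass of a toy logistic model with dropout on its inputs."""
    scale = 1.0 / (1.0 - drop_rate)   # inverted-dropout scaling (assumed)
    kept = [xi * scale if rng.random() >= drop_rate else 0.0 for xi in x]
    return sigmoid(sum(w * xi for w, xi in zip(weights, kept)))

def mc_dropout_predict(x, weights, drop_rate=0.1, n_samples=50, seed=0):
    """Return the mean and variance of n_samples stochastic predictions."""
    rng = random.Random(seed)
    probs = [stochastic_forward(x, weights, drop_rate, rng)
             for _ in range(n_samples)]
    return mean(probs), pvariance(probs)

x = [0.5, -1.2, 2.0]          # hypothetical input features
weights = [0.8, 0.3, 0.6]     # hypothetical trained weights

mean_prob, uncertainty = mc_dropout_predict(x, weights)
det_prob, det_var = mc_dropout_predict(x, weights, drop_rate=0.0)
print(mean_prob, uncertainty)
print(det_prob, det_var)
```

With the dropout rate set to zero, every pass is identical, so the variance collapses to zero and the mean equals the ordinary deterministic prediction. In a segmentation network the same averaging would be applied per pixel to the softmax outputs.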

## 3. The Dataset

We used patterned wafer data acquired from the Vega facility of the ATI company for this study. We added synthetic defects to the patterned wafer data. The generated synthetic defect images contain defects of various shapes, sizes, and intensity values. Fig. 3(a) shows the patterned wafer data, Fig. 3(b) shows one of the die images in the patterned wafer data, and Fig. 3(c) shows example images of synthetic defects with red boxes indicating the synthetic defects.

Fig. 3. Experimental dataset.

Fig. 4. Pair-set examples.

Since the size of the die image in Fig. 3(b) is 10,000 $\times$ 10,000, we cropped it into 200 $\times$ 200 sub-die images. We extracted the golden and test sub-die images as follows. First, we selected two different die images (10,000 $\times$ 10,000) as the golden and test die images and obtained test sub-die images (200 $\times$ 200) by tiling the test die image. Then, we extracted the golden sub-die images through a template-matching technique using the test sub-die images as templates.

We set the search area for matching using the location information used for cropping the test die image. This prior information can be used due to the characteristics of the repetitive die patterns on the wafer image, as shown in Fig. 3(a). We measured the similarity between the golden and test sub-die images using the mean absolute error (MAE) and performed template matching in sub-pixel units. We refer to each golden and test sub-die set as a pair-set.
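The matching step can be sketched as an exhaustive search over a small window around the nominal crop location, scoring each candidate by MAE. This sketch works at integer-pixel resolution only; the sub-pixel refinement used in this study is omitted, and the search radius, image sizes, and function names are assumptions.

```python
def mae(a, b):
    """Mean absolute error between two equally sized 2-D patches."""
    n = len(a) * len(a[0])
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n

def crop(image, top, left, size):
    return [row[left:left + size] for row in image[top:top + size]]

def match_template(golden_die, template, top, left, radius):
    """Search (top +/- radius, left +/- radius) for the crop with the lowest MAE.
    Returns (score, row, col) of the best match."""
    size = len(template)
    height, width = len(golden_die), len(golden_die[0])
    best = None
    for r in range(max(0, top - radius), min(height - size, top + radius) + 1):
        for c in range(max(0, left - radius), min(width - size, left + radius) + 1):
            score = mae(crop(golden_die, r, c, size), template)
            if best is None or score < best[0]:
                best = (score, r, c)
    return best

# Toy 6x6 golden die with a unique value per pixel.
golden_die = [[10 * r + c for c in range(6)] for r in range(6)]
# Test sub-die nominally cropped at (2, 2), but shifted one pixel to the right.
test_sub_die = crop(golden_die, 2, 3, 2)

best = match_template(golden_die, test_sub_die, top=2, left=2, radius=1)
print(best)  # (0.0, 2, 3)
```

Restricting the search to a window around the known crop location keeps the cost low and avoids false matches against other repetitions of the same pattern elsewhere on the die.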

Fig. 4 shows two pair-set examples with red boxes indicating the defects. Fig. 4(a) shows golden sub-die images, Fig. 4(b) shows test sub-die images, and Fig. 4(c) shows the ground truth images indicating the different areas between the two sub-die images. We collected 26,379 pair-set images (15,376 defective images and 11,003 defect-free images) for training and 21,168 pair-set images (12,283 defective images and 8,885 defect-free images) for testing. The total number of defects in the test set was 14,554. To verify the comparison performance of the twin network, we also collected 21,168 identical pair-set images composed of the same sub-die images, as shown in Fig. 5. The golden and test sub-die images are identical, so networks that compare the input images should detect no defects for this case, as shown in Fig. 5(c).

Fig. 5. Identical pair-set examples.

## 4. Experiments

We investigated two methods: twin network and Bayesian twin network methods. We verified the usefulness of the twin network for die-to-die inspection by implementing the conventional wafer inspection method, as shown in Fig. 1. We compared the performance of the twin network with that of the conventional method. In addition, we compared the performance of the Bayesian twin network and the twin network to verify that Bayesian learning could improve the network performance. We conducted experiments using the patterned wafer data and attempted to make the number of parameters of each method similar.

### 4.1 Training and Inference

The twin network had two encoders with shared weights and one decoder, as shown in Fig. 2, while the conventional method had one encoder and a corresponding decoder. The encoder networks of all methods had four convolutional layers and max-pooling layers. The decoder networks of all methods had four transposed convolutional layers and five convolutional layers.

We used batch normalization and activation layers after every convolutional layer and transposed convolutional layer. All models used the same number of filters and filter sizes. The numbers of filters in each encoder-convolutional layer were 16, 32, 64, and 128, and those in each decoder-convolutional layer were 128, 64, 32, 16, and 2. The filter size was $3\times 3$.

We implemented all methods using the PyTorch library [25] and an NVIDIA GeForce GTX 1070 GPU (Nvidia Corporation, USA). In addition, all models were trained using the Adam optimizer with an initial learning rate of $1\times 10^{-3}$, a batch size of 64, and a weight decay of $1\times 10^{-5}.$ We also randomly selected 20$\%$ of the training data as validation data in every epoch to monitor the performance of the models and used the validation data for early stopping. For the Bayesian twin network, we inserted dropout after every max-pooling layer of the encoders except for the first layer and after every convolutional layer of the decoder except for the last two layers. We applied dropout with a rate of 0.1 and obtained 50 Monte Carlo samples at the test stage.

### 4.2 Performance Measurement

We used precision and recall to measure the performance of the models and computed them in terms of the number of connected objects in the prediction images obtained by thresholding softmax scores. We define precision as follows:

##### (5)
$\textit{Precision}=\frac{TP}{TP+FP}$

where $TP$ (true positives) is the number of connected objects correctly detected as defects by comparing the input sub-die images, and $FP$ (false positives) is the number of connected objects incorrectly detected as defects. We define recall as follows:

##### (6)
$\textit{Recall}=\frac{TP}{TP+FN}$

where $FN$ denotes the number of actual defects not detected (false negatives). Higher precision and recall indicate better performance.
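As a worked example of these definitions, the sketch below counts connected objects in a thresholded score map and evaluates precision and recall from object counts. The 4-connectivity, the 0.5 threshold, and the toy score map are assumptions. The counts passed to `precision_recall` are derived from the twin-network row of Table 1 (14,390 true positives out of 14,878 detections and 14,554 actual defects).

```python
from collections import deque

def count_objects(scores, threshold=0.5):
    """Count 4-connected foreground objects after thresholding a score map."""
    rows, cols = len(scores), len(scores[0])
    mask = [[s > threshold for s in row] for row in scores]
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1                      # new object found
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:                    # breadth-first flood fill
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.7],
    [0.6, 0.1, 0.1, 0.9],
]
n_objects = count_objects(scores)
print(n_objects)  # 3

# Twin network, Table 1: FP = 14,878 - 14,390, FN = 14,554 - 14,390.
precision, recall = precision_recall(tp=14390, fp=488, fn=164)
print(round(100 * precision, 2), round(100 * recall, 2))  # 96.72 98.87
```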

### 4.3 Experimental Results

We evaluated the performance of the conventional method and the proposed method using the patterned wafer data. To ensure a fair comparison, we set the threshold individually for each method. The recall and precision of each network are reported in Tables 1 and 2.

Table 1 summarizes the precision of each method when the recall of each method is the same. The twin network shows better performance than the conventional method in the sense that it exhibited higher precision when the recall is the same. The precision of the twin network was about 0.39 percentage points higher than that of the conventional method.

Table 1. Performance of all methods (the recall of each method is the same).

| Method | Recall | Precision |
| --- | --- | --- |
| Conventional method | 98.87 % (14,390/14,554) | 96.33 % (14,390/14,938) |
| Twin network | 98.87 % (14,390/14,554) | 96.72 % (14,390/14,878) |
| Bayesian twin network | 98.87 % (14,390/14,554) | 98.57 % (14,390/14,599) |

We also compared the performance of the Bayesian twin network and the twin network to confirm whether Bayesian learning could improve the performance of the twin network. The Bayesian twin network achieved the best performance, and its precision was about 1.85 percentage points higher than that of the twin network.

Table 2 summarizes the recall of each method when the precision of each method is very similar. As shown in the table, the recall of the twin network was about 0.47 percentage points higher than that of the conventional method. In addition, the recall of the Bayesian twin network was about 0.33 percentage points higher than that of the twin network.

Table 2. Performance of all methods (the precision of each method is very similar).

| Method | Recall | Precision |
| --- | --- | --- |
| Conventional method | 98.07 % (14,273/14,554) | 98.56 % (14,273/14,481) |
| Twin network | 98.54 % (14,341/14,554) | 98.57 % (14,341/14,549) |
| Bayesian twin network | 98.87 % (14,390/14,554) | 98.57 % (14,390/14,599) |

Tables 1 and 2 confirm that the twin network performs better than the conventional method. We attribute this to the twin network focusing more on image comparison than the conventional method does. The conventional method merges the two sub-die images and treats them as different color channels. In contrast, the twin network processes the two sub-die images separately through encoders with shared weights and then merges the two branches. We think that the shared weights allow the twin network to identify more effectively whether similar features exist in the two input images.

We also confirmed that Bayesian learning could improve the performance of the twin network. We suspect that the reason is that Bayesian learning may prevent overfitting through marginalization over the network weights. Moreover, Bayesian learning may measure uncertainty of a trained model, as shown in Fig. 6. We think that providing the uncertainty information of the model predictions can be helpful to improve the trust of the twin network-based wafer die-to-die inspection system.

Fig. 6. Bayesian twin network prediction results.

We also evaluated the performance of all methods using identical pair-set images to further verify that the structure of the twin network focuses more on image comparison. Table 3 shows the test results of each network using identical pair-set images. Identical pair-set images are composed of the same sub-die images, so comparison networks should detect no defects.

Table 3. Performance of three methods using identical pair-set images.

| Method | Number of false positives |
| --- | --- |
| Conventional method | 1,722 |
| Twin network | 0 |
| Bayesian twin network | 0 |

As shown in Table 3, the twin network and the Bayesian twin network correctly inspected all identical pair-set images. On the other hand, the number of false positives of the conventional method was 1,722. Fig. 7 shows the prediction results of each method using identical pair-set images. Table 3 and Fig. 7 confirm that the structure of the twin network is more suitable for die-to-die inspection.

Fig. 7. Prediction results of the conventional method, twin network, and Bayesian twin network using identical pair-set images.

Table 4 shows the inference time of each method, measured on the hardware described in Section 4.1. The inference time of the conventional method was about 8.04 milliseconds shorter than that of the twin network. We attribute this to the twin network having two encoders, whereas the conventional method has only one. As expected, the Bayesian twin network was the slowest because it requires multiple inferences at test time. Although the Bayesian twin network requires more computation than the other methods, we think it is useful because it can improve the performance of the twin network and measure model uncertainty. Reducing the amount of computation while maintaining the advantages of Bayesian learning is a subject for future study.

Table 4. Inference time (msec).

| Method | Inference time (msec) |
| --- | --- |
| Conventional method | 19.22 |
| Twin network | 27.26 |
| Bayesian twin network | 389.44 |

## 5. Conclusion

We have proposed an encoder-decoder-based twin network for die-to-die wafer inspection. The proposed twin network is composed of two sub-encoder networks that share weights and one decoder network. In contrast to conventional deep learning-based wafer inspection methods, the proposed method takes the golden sub-die image and the test sub-die image as input and compares them to detect different areas between two input images as defects. Furthermore, we applied Bayesian learning to improve the performance of the twin network. We verified the usefulness of the proposed method in experiments using patterned wafer data with synthetic defects.

### ACKNOWLEDGMENTS

The authors are grateful to ATI Co., Ltd. in Incheon, Korea, for providing the patterned wafer data. This work was supported by a grant from ATI and by the Ewha Womans University scholarship of 2019.

### REFERENCES

1. Stan Stokowski, Vaez-Iravani Mehdi, 1998, Wafer inspection technology challenges for ULSI manufacturing, American Institute of Physics, Vol. 449, No. 1
2. Zhang Jiun-Ming, Lin Ruey-Ming, Wang Mao-Jiun J, 1999, The development of an automatic post-sawing inspection system using computer vision techniques, Computers in Industry, Vol. 40, No. 1, pp. 51-60
3. Chou Paul B., et al., 1997, Automatic defect classification for semiconductor manufacturing, Machine Vision and Applications, Vol. 9, No. 4, pp. 201-214
4. Sheng-Uei Guan, Pin Xie, Hong Li, 2003, A golden-block-based self-refining scheme for repetitive patterned wafer inspections, Machine Vision and Applications, Vol. 13, No. 5, pp. 314-321
5. Liu Hongxia, et al., 2010, Defect detection of IC wafer based on two-dimension wavelet transform, Microelectronics Journal, Vol. 41, No. 2-3, pp. 171-177
6. Chen Ssu-Han, Kang Chih-Hsiang, Perng Der-Baau, 2020, Detecting and Measuring Defects in Wafer Die Using GAN and YOLOv3, Applied Sciences, Vol. 10, No. 23
7. Nakazawa Takeshi, Kulkarni Deepak V., 2019, Anomaly detection and segmentation for wafer defect patterns using deep convolutional encoder-decoder neural network architectures in semiconductor manufacturing, IEEE Transactions on Semiconductor Manufacturing, Vol. 32, No. 2, pp. 250-256
8. Cheon Sejune, et al., 2019, Convolutional neural network for wafer surface defect classification and the detection of unknown defect class, IEEE Transactions on Semiconductor Manufacturing, Vol. 32, No. 2, pp. 163-170
9. Lin Hui, et al., 2019, Automated defect inspection of LED chip using deep convolutional neural network, Journal of Intelligent Manufacturing, Vol. 30, No. 6, pp. 2525-2534
10. Chen Xiaoyan, et al., 2020, A Light-Weighted CNN Model for Wafer Structural Defect Detection, IEEE Access, Vol. 8, pp. 24006-24018
11. Koch Gregory, Richard Zemel, Ruslan Salakhutdinov, 2015, Siamese neural networks for one-shot image recognition, ICML Deep Learning Workshop, Vol. 2
12. Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla, 2015, Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding, arXiv preprint arXiv:1511.02680
13. Isobe Shuya, Arai Shuichi, 2017, Deep convolutional encoder-decoder network with model uncertainty for semantic segmentation, 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), IEEE
14. Nair Tanya, et al., 2020, Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation, Medical Image Analysis, Vol. 59
15. Gal Yarin, Ghahramani Zoubin, 2016, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, International Conference on Machine Learning, PMLR
16. Graves Alex, 2011, Practical variational inference for neural networks, Advances in Neural Information Processing Systems
17. Srivastava Nitish, et al., 2014, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research, Vol. 15, No. 1, pp. 1929-1958
18. Kristiadi Agustinus, Matthias Hein, Philipp Hennig, 2020, Being Bayesian, even just a bit, fixes overconfidence in ReLU networks, International Conference on Machine Learning, PMLR
19. Shen Yichen, et al., 2021, Real-Time Uncertainty Estimation in Computer Vision via Uncertainty-Aware Distribution Distillation, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
20. Lu Xiankai, et al., 2019, See more, know more: Unsupervised video object segmentation with co-attention siamese networks, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
21. Varghese Ashley, et al., 2018, ChangeNet: A deep learning architecture for visual change detection, Proceedings of the European Conference on Computer Vision (ECCV) Workshops
22. Dong Hongwen, et al., 2019, PGA-Net: Pyramid feature fusion and global context attention network for automated surface defect detection, IEEE Transactions on Industrial Informatics, Vol. 16, No. 12, pp. 7448-7458
23. Chen Jie, et al., 2020, DASNet: Dual attentive fully convolutional siamese networks for change detection of high resolution satellite images, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
24. Daudt Rodrigo Caye, Saux Bertr Le, Boulch Alexandre, 2018, Fully convolutional siamese networks for change detection, 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE
25. Paszke Adam, et al., 2019, PyTorch: An imperative style, high-performance deep learning library, arXiv preprint arXiv:1912.01703

## Author

##### Eunjeong Choi

Eunjeong Choi received her B.S. and M.S. degrees in Electronic and Electrical Engineering from Ewha Womans University, Seoul, Korea, in 2016 and 2018, respectively. She is currently pursuing a Ph.D. in electronics engineering at Ewha Womans University, Seoul, Korea. Her research interests include digital signal processing and machine learning for machine vision.

##### Jeongtae Kim

Jeongtae Kim received his B.S. and M.S. degrees in Control and Instrumentation Engineering from Seoul National University, Seoul, Korea, in 1989 and 1991, respectively. From 1991 to 1998, he worked for Samsung Electronics in Korea, where he was engaged in the development of digital camcorders and digital TVs. He received his Ph.D. degree in Electrical Engineering and Computer Science from the University of Michigan, Ann Arbor, in 2004. Since 2004, he has been with the Department of Electronic and Electrical Engineering at Ewha Womans University, Seoul, Korea, where he is currently a professor. His research interests include statistical signal processing, image restoration, image reconstruction, and machine learning.