1. Introduction
In recent years, face super-resolution (FSR) has emerged as a critical area of research
in computer vision, attracting significant attention from the scientific community
[1]. FSR, alternatively known as face hallucination, aims to reconstruct high-resolution
(HR) facial images from their low-resolution (LR) counterparts. This technology plays
a crucial role in various applications, including video surveillance, facial recognition
systems, and digital image forensics [2]. The advancements in FSR have not only enhanced the visual quality of facial images
but also significantly improved the performance of downstream tasks such as face recognition,
emotion analysis, and facial landmark detection [3]. As the demand for high-quality facial imagery continues to grow across multiple
domains, the development of robust and efficient FSR methods has become increasingly
crucial.
FSR represents a specialized subset of the broader task of single image super-resolution
(SISR) [4-6], which is inherently challenging due to the ill-posed nature of reconstructing HR
details from LR inputs. Unlike SISR, which addresses images from arbitrary scenes,
FSR focuses exclusively on facial images, leveraging the unique structural characteristics
and statistical regularities of faces. This allows FSR methods to exploit strong prior
knowledge about facial configurations, facilitating the recovery of both global structures
and local details. Consequently, FSR approaches have demonstrated superior performance
over general SISR techniques, particularly at high upscaling factors (e.g., $8\times$)
[7]. Recent advancements in FSR have been largely driven by deep learning, with deep
convolutional neural networks (DCNNs) providing robust generative capabilities that
have substantially improved the quality of super-resolved facial images [8]. Several innovative FSR methods have emerged in recent years, further pushing the
boundaries of what is achievable in this domain [9].
Recent FSR approaches have increasingly incorporated advanced deep learning architectures
to enhance their performance. Notably, transformer-based models, which have demonstrated
remarkable success in various computer vision tasks, have been adapted for FSR [10,11]. These models excel at capturing long-range dependencies within images, a crucial
aspect for reconstructing facial features coherently. Concurrently, recurrent neural
networks (RNNs) have shown promise in iterative refinement processes, allowing for
progressive improvement of image quality [12,13]. However, existing methods often struggle to fully leverage the strengths of these
architectures in the context of FSR. Many approaches apply transformers or RNNs in
isolation, potentially missing out on the synergistic benefits of combining these
techniques [1]. Furthermore, the integration of global and local feature learning, critical for
preserving both overall facial structure and fine details, remains a challenge in
current FSR frameworks [9]. Additionally, while some methods have attempted to incorporate facial priors or
attention mechanisms, they often do so in a manner that does not fully exploit the
hierarchical nature of facial features [5,6]. Addressing these limitations requires a more sophisticated approach that can seamlessly
blend the global context-awareness of transformers with the iterative refinement capabilities
of RNNs, while also incorporating effective mechanisms for feature fusion and attention.
In this paper, we propose a novel Restormer-based Face Super-Resolution (RFSR) method
that addresses the aforementioned challenges by integrating the strengths of transformer
architectures and RNNs. Our approach comprises four key components: an initial feature
extraction module (G1), a Restormer module, a Recurrent Super-Resolution module (RecurrentSRModule),
and a final reconstruction module (G2). The Restormer module leverages multi-head
transposed self-attention mechanisms to capture long-range dependencies and extract
global facial features effectively [14]. This allows the network to maintain coherence across facial structures even at high
upscaling factors. The RecurrentSRModule, inspired by the iterative refinement capabilities
of RNNs, progressively enhances image details through multiple iterations [15]. This iterative process enables the network to adapt dynamically to varying levels
of input degradation, a crucial feature when dealing with extremely LR or noisy facial
images. To further improve reconstruction quality, we implement a residual connection
that adds the upsampled original input to the network output. This design allows the
main network to focus on learning high-frequency details and image enhancement while
preserving low-frequency information from the original input. Our approach differs
from previous methods by seamlessly integrating global context-aware feature extraction
with iterative local refinement, addressing the limitations of using these techniques
in isolation. Extensive experiments on benchmark datasets demonstrate that RFSR significantly
outperforms existing state-of-the-art FSR methods, particularly in challenging scenarios
involving severe degradation or extreme upscaling factors.
3. Methods
In FSR, our primary objective is to reconstruct HR facial images $I_{SR}$ from their LR counterparts $I_{LR}$, while preserving intricate facial details and maintaining identity consistency. To achieve this, we propose a novel RFSR framework that synergistically
combines the strengths of transformer architectures with recurrent refinement techniques.
The overall architecture of our proposed method is illustrated in Fig. 1.
Fig. 1. Overview of the proposed RFSR framework: (a) The main architecture, consisting
of convolutional layers, pixel shuffle, Restormer, and recurrent SR modules, with
an upsample path; (b) Detailed structure of the Restormer, composed of multiple transformer
blocks with down-sampling and up-sampling paths; (c) Internal structure of a Transformer block; (d) Multi-DConv head transposed attention (MDTA) mechanism used within the Transformer blocks.
As shown in Fig. 1, the RFSR framework consists of four key components: (1) an initial feature extraction
module (G1) that captures low-level facial features from the input LR image, (2) a
Restormer module leveraging multi-head transposed self-attention mechanisms to model
long-range dependencies and global facial structure, (3) a Recurrent Super-Resolution
module (RecurrentSRModule) that progressively refines facial details through multiple
iterations, and (4) a final reconstruction module (G2) that synthesizes the HR output.
To effectively integrate information across these modules and enhance the overall
reconstruction quality, we introduce a novel Feature Integration and Enhancement (FIE)
block, as depicted in the detailed view of Fig. 1. This block dynamically fuses features from different stages of the network, allowing
for adaptive refinement of facial details based on both local and global context.
Our method, as illustrated in Fig. 1, emphasizes the importance of preserving facial structure while enhancing fine details.
By leveraging the global context modeling capabilities of the Restormer module and
the iterative refinement process of the RecurrentSRModule, RFSR can effectively handle
various facial poses, expressions, and lighting conditions. This approach enables
the network to produce high-quality super-resolved facial images that maintain fidelity
to the input while significantly enhancing resolution and detail.
The pipeline process can be described as follows:
$x_1 = G_1\!\left(I_{LR}\right), \qquad (1)$
where $G_1$ represents the initial feature extraction, which includes convolutional layers and pixel shuffle operations to provide enriched facial information for subsequent processing. The extracted features $x_1$ are then fed into the Restormer module:
$x_2 = R\!\left(x_1\right), \qquad (2)$
where $R$ denotes the Restormer function, leveraging multi-head transposed self-attention mechanisms to model long-range dependencies and capture global facial structure. Following this, the RecurrentSRModule iteratively refines the facial details:
$x_3 = S\!\left(x_2\right), \qquad (3)$
where $S$ represents the RecurrentSRModule function. Finally, the HR face image is reconstructed using the G2 module:
$y^{\prime} = G_2\!\left(x_3\right), \qquad (4)$
where $y^{\prime}$ is the intermediate super-resolved output. To further enhance the results, we incorporate a residual connection with an upsampled version of the input:
$y = y^{\prime} + U\!\left(I_{LR}\right), \qquad (5)$
where $U$ denotes the upsampling function and $y$ is the final HR output. Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $N$ is the number of training images and $y^{(i)}$ is the ground-truth HR image corresponding to the LR image $x^{(i)}$, the loss function of the proposed RFSR is
$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\left\|\hat{y}^{(i)} - y^{(i)}\right\|_1 + \lambda\, L_p\!\left(\hat{y}^{(i)}, y^{(i)}\right)\right], \qquad (6)$
where $\theta$ denotes the network parameters, $\hat{y}^{(i)}$ is the network output $y$ obtained from the LR input $x^{(i)}$, and $\lambda$ is the trade-off between the L1 loss and the perceptual loss $L_p$. The L1 loss ensures pixel-wise fidelity, while the perceptual loss encourages visually pleasing results with fine facial details. The perceptual loss $L_p$ is computed using features extracted from a pre-trained VGG-19 network. Specifically, it measures the difference between the high-level features of the predicted image $\hat{y}^{(i)}$ and the ground truth $y^{(i)}$: $L_p\!\left(\hat{y}^{(i)}, y^{(i)}\right) = \sum_j \left\|\phi_j\!\left(\hat{y}^{(i)}\right) - \phi_j\!\left(y^{(i)}\right)\right\|_2^2$, where $\phi_j$ denotes the feature maps obtained from the $j$-th layer of the VGG-19 network.
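For illustration, the following PyTorch-style sketch mirrors Eqs. (1)-(6); the sub-module classes, helper names, and the weight $\lambda$ value are placeholders under stated assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFSR(nn.Module):
    """Minimal sketch of the RFSR pipeline in Eqs. (1)-(5). The sub-modules g1,
    restormer, recurrent_sr and g2 stand in for the modules of Sections 3.1-3.2."""
    def __init__(self, g1, restormer, recurrent_sr, g2, scale=8):
        super().__init__()
        self.g1, self.restormer = g1, restormer
        self.recurrent_sr, self.g2 = recurrent_sr, g2
        self.scale = scale

    def forward(self, i_lr):
        x1 = self.g1(i_lr)                  # Eq. (1): initial feature extraction
        x2 = self.restormer(x1)             # Eq. (2): global facial structure modelling
        x3 = self.recurrent_sr(x2)          # Eq. (3): iterative detail refinement
        y_prime = self.g2(x3)               # Eq. (4): reconstruction
        up = F.interpolate(i_lr, scale_factor=self.scale,
                           mode="bicubic", align_corners=False)
        return y_prime + up                 # Eq. (5): residual connection with the upsampled input

def rfsr_loss(y_hat, y, vgg_features, lam=0.01):
    """Eq. (6): L1 + lambda * VGG-19 perceptual loss. `vgg_features` is a placeholder
    feature extractor and lam = 0.01 is an assumed, not reported, trade-off value."""
    return F.l1_loss(y_hat, y) + lam * F.mse_loss(vgg_features(y_hat), vgg_features(y))
```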
3.1. Restormer Module
The Restormer module [14], denoted as $R$ in Eq. (2), serves as the global feature extractor of our framework. Motivated by the need to capture both local details and global facial structures, this module leverages the power of transformer
architectures while introducing novel components tailored specifically for image processing
tasks. The Restormer's design is driven by the goal of effectively modeling long-range
dependencies in facial images while maintaining computational efficiency. At its core,
the Restormer employs a U-shaped architecture, comprising an encoder, a latent space
processor, and a decoder. The process begins with an overlapped patch embedding operation $E$, which projects the input features into a high-dimensional space:
$z_e^{0} = E\!\left(z_0\right),$
where $z_0$ is the input to the Restormer module. This initial embedding allows the network to capture local spatial relationships effectively.
The encoder consists of multiple stages, each operating at progressively lower resolutions.
For each encoder stage $i$ ($i = 1$, $2$, $3$), the features are processed as
$z_e^{i} = E_i\!\left(D_i\!\left(z_e^{i-1}\right)\right),$
where $E_i$ represents the $i$-th encoder stage composed of multiple Transformer blocks, and $D_i$ is a downsampling operation. The number of Transformer blocks in each encoder
stage follows the pattern $[4$, $6$, $6]$, resulting in a total depth of 16 blocks
in the encoder. This increasing depth as spatial dimensions decrease allows the network
to capture increasingly abstract and global features.
The latent space processing, occurring at the bottleneck of the U-shaped architecture, is defined as
$z_l = L\!\left(z_e^{3}\right),$
where $L$ consists of 8 Transformer blocks. This deep processing at the lowest resolution
enables the network to capture global context and long-range dependencies across the
entire facial image, which is crucial for accurately reconstructing high-frequency
details in the super-resolved output.
The decoder mirrors the encoder's structure, progressively upsampling and refining
the features. For each decoder stage $i$ ($i = 3$, $2$, $1$), the features are processed as
$z_d^{i} = T_i\!\left(\left[\,U_i\!\left(z_d^{i+1}\right),\ z_e^{i-1}\right]\right), \qquad z_d^{4} \equiv z_l,$
where $T_i$ represents the $i$-th decoder stage, $U_i$ is an upsampling operation, and $[\cdot\,,\,\cdot]$ denotes channel-wise concatenation. This skip connection design
facilitates effective multi-scale feature integration, allowing the network to combine
low-level details with high-level semantic information.
A key innovation in the Restormer lies in its Transformer block design. Each block
incorporates two novel components: the Multi-DConv Head Transposed Self-Attention
(MDTA) and the Gated-DConv Feed-Forward Network (GDFN). The MDTA mechanism is expressed as
$\mathrm{Attention}\!\left(\hat{Q}, \hat{K}, \hat{V}\right) = \hat{V}\cdot \mathrm{Softmax}\!\left(\hat{K}\cdot \hat{Q}/\alpha\right),$
where $\alpha$ is a learnable scaling parameter, and $Q$, $K$, and $V$ are obtained through depthwise separable convolutions ($1\times 1$ pointwise followed by $3\times 3$ depthwise) and reshaped into $\hat{Q}$, $\hat{K}$, $\hat{V}$ so that the attention map is computed across channels rather than spatial positions. This design allows for efficient spatial information modeling while maintaining the
benefits of self-attention for capturing long-range dependencies.
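As a concrete reference, a minimal PyTorch sketch of this transposed (channel-wise) attention is shown below; it follows the published Restormer design, and the head count, reshaping, and bias settings are assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDTA(nn.Module):
    """Multi-DConv head transposed attention: a C x C attention map over channels."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))   # learnable alpha
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)  # pointwise projection
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3, padding=1,
                                groups=dim * 3, bias=False)            # depthwise convolution
        self.project_out = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # Flatten spatial dims so attention is computed across channels, not pixels.
        q = F.normalize(q.reshape(b, self.num_heads, c // self.num_heads, h * w), dim=-1)
        k = F.normalize(k.reshape(b, self.num_heads, c // self.num_heads, h * w), dim=-1)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        attn = (q @ k.transpose(-2, -1)) * self.temperature    # (b, heads, c/heads, c/heads)
        out = attn.softmax(dim=-1) @ v                          # (b, heads, c/heads, h*w)
        return self.project_out(out.reshape(b, c, h, w))
```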
The GDFN, following the MDTA in each Transformer block, is described as
$\mathrm{GDFN}(z) = \mathrm{GELU}\!\left(DWConv\!\left(W_1 z\right)\right)\odot DWConv\!\left(W_2 z\right),$
where $\odot$ represents element-wise multiplication, $DWConv$ is a depthwise convolution, and $W_1$, $W_2$ are linear transformations. This gating mechanism enables adaptive
feature modulation based on both channel-wise and spatial information, enhancing the
network's ability to focus on relevant facial features.
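A matching sketch of the gated feed-forward branch is given below (continuing the imports of the MDTA sketch); the channel expansion factor follows the public Restormer code and is an assumption here.

```python
class GDFN(nn.Module):
    """Gated-DConv feed-forward network: two depthwise-conv branches fused by a GELU gate."""
    def __init__(self, dim, expansion=2.66):
        super().__init__()
        hidden = int(dim * expansion)
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1, bias=False)   # W1, W2 jointly
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3, padding=1,
                                groups=hidden * 2, bias=False)                    # DWConv
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1, bias=False)

    def forward(self, x):
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        return self.project_out(F.gelu(x1) * x2)   # the GELU branch gates the other branch
```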
The Restormer also incorporates an adaptive Layer Normalization strategy, offering both bias-free ($LN_{BF}$) and with-bias ($LN_{WB}$) options:
$LN_{BF}(z) = \frac{z-\mu}{\sigma}\odot\gamma, \qquad LN_{WB}(z) = \frac{z-\mu}{\sigma}\odot\gamma + \beta,$
where $\mu$ and $\sigma$ are the mean and standard deviation computed over the channel dimension, and $\gamma$, $\beta$ are learnable affine parameters.
This flexibility in normalization contributes to the network's ability to handle diverse
input distributions and feature characteristics throughout its depth.
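Putting the pieces together, one Transformer block can be sketched as follows, reusing the MDTA and GDFN sketches above; only the bias-free normalization variant is shown, and the with-bias variant simply adds a learnable $\beta$.

```python
class BiasFreeLayerNorm(nn.Module):
    """Per-pixel LayerNorm over channels (the bias-free option LN_BF)."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):                            # x: (B, C, H, W)
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.var(dim=1, keepdim=True, unbiased=False).sqrt()
        return (x - mu) / (sigma + self.eps) * self.gamma.view(1, -1, 1, 1)

class TransformerBlock(nn.Module):
    """LN -> MDTA -> residual, then LN -> GDFN -> residual."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1, self.attn = BiasFreeLayerNorm(dim), MDTA(dim, num_heads)
        self.norm2, self.ffn = BiasFreeLayerNorm(dim), GDFN(dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
```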
The final output of the Restormer module is obtained after a refinement stage: $z_{out} = H\!\left(z_d^{1}\right)$, where $H$ consists of 4 additional Transformer blocks. This brings the total depth of the Restormer to 38 Transformer blocks, allowing for extensive feature refinement and high-quality reconstruction of facial details.
The Restormer's innovative architecture, with its deep hierarchical processing, advanced
attention mechanisms, and adaptive design choices, enables it to effectively capture
and reconstruct intricate facial details across multiple scales. This makes it particularly
well-suited for challenging FSR tasks, where preserving identity-specific features
and generating realistic high-frequency details are paramount. The combination of
local and global feature modeling, coupled with the network's significant depth, allows
the Restormer to achieve superior performance in recovering fine facial structures
and textures.
3.2. Recurrent SR module
The feedback block [15], which forms the core of the RecurrentSRModule ($S$ in Eq. (3)), is designed to iteratively refine feature representations for HR face image reconstruction. This block is inspired by
the need to progressively enhance features through recurrent connections, which is
essential for capturing intricate facial details and improving the final output quality.
As illustrated in Fig. 2, the Feedback Block uses a recurrent structure that allows for continuous refinement
and enhancement of features.
Fig. 2. The structure of the feedback block.
This block consists of alternating up-projection ($U_g$) and down-projection ($D_g$)
operations for each group $g$. The input feature map $F_{\mathrm{in}}$ is initially
compressed using a $1\times 1$ convolutional layer $C_{\mathrm{in}}$. The up-projection steps
transform LR feature maps $L_{g-1}$ into HR maps $H_g$, which are subsequently down-projected
back to LR maps $L_g$. The final output $F_{\mathrm{out}}$ is obtained by concatenating
all intermediate LR feature maps and compressing them with another $1\times 1$ convolutional
layer $C_{\mathrm{out}}$.
The Feedback Block is configured with 6 groups, 4 steps, and 48 feature channels.
This setup balances computational efficiency with the ability to capture detailed
features, which are vital for reconstructing high-quality images.
The block starts by compressing the input feature map $F_{\mathrm{in}}$ to reduce
dimensionality, making it easier for the network to process the data efficiently.
This initial compression is performed using a $1\times 1$ convolutional layer:
$L_0 = C_{\mathrm{in}}\!\left(F_{\mathrm{in}}\right),$
where $C_{\mathrm{in}}$ represents the convolution operation and $L_0$ is the initial LR feature map.
Once the features are compressed, the Feedback Block employs a series of up-projection
and down-projection operations. These operations progressively upscale and downscale
the feature maps, refining the features at multiple scales. For each group $g$, where
$g = 1$, $2$, $\dots$, $6$, the up-projection step is performed by applying the up-projection operation $U_g$ to the LR feature map from the previous step:
$H_g = U_g\!\left(L_{g-1}\right).$
After up-projection, the HR feature map undergoes down-projection through the down-projection operation $D_g$:
$L_g = D_g\!\left(H_g\right).$
To ensure that the network effectively integrates multi-scale features, each up-projection
and down-projection operation utilizes concatenated feature maps from all previous
HR and LR stages. The concatenation process for down-projection can be described as
$L_g = D_g\!\left(\mathrm{Cat}\!\left(H_1, H_2, \dots, H_g\right)\right),$
where $\mathrm{Cat}$ denotes the concatenation operation along the channel dimension.
After processing through all groups, the final output of the Feedback Block is obtained by concatenating all intermediate LR feature maps and applying a final compression operation:
$F_{\mathrm{out}} = C_{\mathrm{out}}\!\left(\mathrm{Cat}\!\left(L_1, L_2, \dots, L_6\right)\right),$
where $C_{\mathrm{out}}$ is another $1\times 1$ convolutional layer that reduces the concatenated features to the desired output dimensionality.
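The following sketch illustrates this structure with the configuration above (6 groups, 48 channels); the single strided (de)convolutions used for $U_g$ and $D_g$, the projection kernel sizes, and the recurrent input concatenation are simplifying assumptions.

```python
class FeedbackBlock(nn.Module):
    """Sketch: 1x1 compression, dense up/down projection groups, 1x1 output fusion."""
    def __init__(self, channels=48, groups=6, scale=4):
        super().__init__()
        k, s, p = {4: (8, 4, 2), 8: (12, 8, 2)}[scale]     # assumed projection kernels
        self.compress_in = nn.Conv2d(2 * channels, channels, kernel_size=1)          # C_in
        self.up_projs = nn.ModuleList(
            [nn.ConvTranspose2d(channels * (g + 1), channels, k, s, p) for g in range(groups)])
        self.down_projs = nn.ModuleList(
            [nn.Conv2d(channels * (g + 1), channels, k, s, p) for g in range(groups)])
        self.compress_out = nn.Conv2d(channels * groups, channels, kernel_size=1)    # C_out

    def forward(self, f_in, last_state):
        # Recurrent input: current features concatenated with the previous iteration's output.
        l = self.compress_in(torch.cat([f_in, last_state], dim=1))   # L_0
        lr_feats, hr_feats = [l], []
        for g in range(len(self.up_projs)):
            h = self.up_projs[g](torch.cat(lr_feats, dim=1))    # H_g = U_g(Cat(L_0..L_{g-1}))
            hr_feats.append(h)
            l = self.down_projs[g](torch.cat(hr_feats, dim=1))  # L_g = D_g(Cat(H_1..H_g))
            lr_feats.append(l)
        return self.compress_out(torch.cat(lr_feats[1:], dim=1))    # F_out from L_1..L_6
```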
The recurrent nature of the Feedback Block is key to its ability to iteratively refine
feature maps. In each iteration, the block takes the output from the previous step
and feeds it back as input for further refinement. This process of recurrent feedback
allows the network to progressively enhance feature representations, making it particularly
effective for tasks that require high-fidelity FSR. The design of the Feedback Block
is driven by the need to iteratively improve the quality of feature representations.
By alternating between up-projection and down-projection, the block captures and integrates
features across different scales, making it adept at reconstructing fine details and
complex structures in HR facial images. The concatenation of features from various
scales, followed by a compression step, ensures that the final feature representation
is both rich in detail and computationally efficient to process.
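As a usage sketch, the recurrent unrolling over the 4 refinement steps might look like the following hypothetical helper; the zero initial state is an assumption.

```python
def recurrent_refine(feat, feedback_block, steps=4):
    """Unroll the feedback block: each step's output becomes the next step's hidden state."""
    state = torch.zeros_like(feat)       # assumed zero initial state
    for _ in range(steps):               # "4 steps" from the configuration above
        state = feedback_block(feat, state)
    return state
```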
In conclusion, the Feedback Block plays a pivotal role in the model's ability to generate
high-quality super-resolution images. Its iterative structure, combining up-projection,
down-projection, and feature concatenation, allows for effective multi-scale feature
integration and refinement. This iterative refinement process, as illustrated in Fig. 2, is essential for achieving the desired level of detail and accuracy in the reconstructed
images.
4. Experiment
4.1. Dataset
We conduct our experiments using the CelebA dataset [34], a large-scale face attributes dataset widely used in face-related computer vision
tasks. For our study, we utilize the first 36,000 images for training and the subsequent
1,000 images for testing. This split provides a comprehensive training set while maintaining
an adequate test set for thorough evaluation. The training images undergo a preprocessing
stage where we coarsely crop the face regions using a face detection algorithm. These
cropped images are then resized to $128\times128$ pixels without any pre-alignment procedures; facial features are not significantly distorted or stretched by this resizing, and skipping alignment preserves the natural variations in facial pose and expression, allowing our model to learn from and adapt to a diverse range of facial orientations and expressions. Fig. 3 presents a selection of sample images from our training dataset, showcasing the variety
in facial features, expressions, and image quality present in the CelebA dataset.
Fig. 3. Sample images from the CelebA dataset used in our training process.
These images demonstrate the diversity in facial features, expressions, and image
quality. For our RFSR method, we use color images in the training process. This decision
enables our network to learn and reconstruct the full spectrum of facial details,
including subtle color variations that are crucial for realistic face reconstruction.
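A hypothetical sketch of the split and resizing described above is shown below; the file layout and the source of the face crops are assumptions.

```python
from pathlib import Path
from PIL import Image

def build_split(crop_dir="celeba_crops"):
    """First 36,000 coarsely cropped CelebA faces for training, the next 1,000 for testing."""
    files = sorted(Path(crop_dir).glob("*.jpg"))
    return files[:36000], files[36000:37000]

def load_hr(path, size=128):
    """Resize a face crop to 128x128 RGB without alignment."""
    return Image.open(path).convert("RGB").resize((size, size), Image.BICUBIC)
```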
4.2. Degradation Models
To comprehensively evaluate the effectiveness of our proposed RFSR method for various
types of image degradation, we employ three degradation models to simulate LR images.
The first model is bicubic downsampling (Bic), which we implement using the Matlab
function imresize with the bicubic option. This model is applied with scaling factors
of $4\times$ and $8\times$, simulating different levels of resolution loss. The second
degradation model, bicubic downsampling with noise (BicN), builds upon the first model
by adding Gaussian noise after the downsampling process. Specifically, after applying
bicubic downsampling with scaling factors of $4\times$ and $8\times$, we add Gaussian
noise with a noise level of 15. A noise level of $n$ indicates Gaussian noise with standard deviation $n$ on a pixel intensity range of $[0, 255]$. This model simulates scenarios where the
LR image is affected by both resolution loss and sensor noise. To create a more challenging
scenario, we introduce a third degradation model: blur, bicubic downsampling, and
noise (BBicN). In this model, we first blur the HR image using a Gaussian kernel of
size $7 \times 7$ with a standard deviation of 1.6. We then apply bicubic downsampling
with scaling factors of $4\times$ and $8\times$, followed by the addition of Gaussian
noise with a noise level of 30. This model represents a complex degradation process
involving blur, downsampling, and severe noise. These degradation models allow us
to test our RFSR method under various conditions, from simple resolution reduction
to complex scenarios involving multiple types of image degradation. By using these
diverse degradation models, we aim to demonstrate the robustness and effectiveness
of our proposed method in handling different types and levels of image quality deterioration
commonly encountered in real-world FSR tasks.
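The three degradation pipelines can be summarized by the following sketch; note that OpenCV's bicubic resizing only approximates Matlab's imresize, and the function name is illustrative.

```python
import numpy as np
import cv2

def degrade(hr, scale=8, mode="BBicN"):
    """Simulate an LR image with Bic, BicN, or BBicN degradation (values as reported above)."""
    img = hr.astype(np.float32)
    if mode == "BBicN":                                   # 7x7 Gaussian blur, sigma 1.6
        img = cv2.GaussianBlur(img, (7, 7), 1.6)
    h, w = img.shape[:2]
    lr = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    if mode in ("BicN", "BBicN"):                         # additive Gaussian noise
        sigma = 15 if mode == "BicN" else 30              # noise level on a [0, 255] range
        lr = lr + np.random.normal(0.0, sigma, lr.shape)
    return np.clip(lr, 0, 255).astype(np.uint8)
```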
4.3. Training Setting
We initialize the network parameters using the Xavier initialization method. All transformer
blocks in our Restormer-based architecture use the Gaussian Error Linear Unit (GELU)
as the activation function. Our model is implemented using the PyTorch framework and
optimized using the AdamW optimizer with cosine learning rate scheduling. The initial
learning rate is set to $3\times {10}^{-4}$ and gradually decreases to $1 \times 10^{-6}$
over the course of training. We use a batch size of 8 for our experiments. The momentum
parameters $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively. Weight decay is applied
with a factor of $1 \times {10}^{-4}$ to prevent overfitting. We train our RFSR model
for a total of 300 epochs. The training process takes approximately 24 hours on a
single NVIDIA RTX 3090 GPU. To assess the quality of the super-resolved images, we
employ two widely used objective image quality assessment metrics: Peak Signal-to-Noise
Ratio (PSNR) and Structural Similarity Index (SSIM). All metrics are calculated on
the Y-channel of the YCbCr color space of the super-resolved images. For the loss
function, we use a combination of L1 loss and perceptual loss. The perceptual loss
is computed using the features extracted from a pre-trained VGG-19 network. The total
loss is a weighted sum of these two components, with the weights empirically set to
balance the contribution of each loss term. To enhance the model's generalization
capability, we apply several data augmentation techniques during training. These include random horizontal flipping with a probability of 0.5, random rotation with angles selected from the range of $\pm10$ degrees, and random brightness and contrast adjustments with factors selected from the range of $\pm15\%$ (corresponding to a factor range of 0.85 to 1.15). These augmentations are applied sequentially, with each transformation having an independent probability of being applied to a given image.
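A sketch of this optimization and augmentation setup is shown below; `model` is a placeholder, and the independent application probability of each augmentation (0.5 here) is an assumption.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
import torchvision.transforms as T

def make_training_setup(model, epochs=300):
    """Optimizer, cosine schedule, and augmentations with the reported hyperparameters."""
    optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)
    augment = T.Compose([
        T.RandomHorizontalFlip(p=0.5),                                        # p = 0.5
        T.RandomApply([T.RandomRotation(degrees=10)], p=0.5),                 # +/- 10 degrees
        T.RandomApply([T.ColorJitter(brightness=0.15, contrast=0.15)], p=0.5),  # +/- 15%
    ])
    return optimizer, scheduler, augment
```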
4.4. Comparisons with State-of-the-Art Methods
We compare our proposed RFSR method with several state-of-the-art super-resolution
approaches, each representing different advancements in the field. For fair comparison,
all models are trained on the same CelebA dataset using the same training split and
under consistent settings. ESRGAN [24] enhances the traditional GAN-based super-resolution methods by introducing a more
robust architecture and a perceptual loss function, which focuses on high-frequency
details to generate visually pleasing images. SwinIR [30], based on Swin Transformers, leverages hierarchical feature representations and attention
mechanisms to effectively handle image restoration tasks, achieving high-quality reconstructions.
EDSR [6] employs an enhanced deep residual network that simplifies the network structure by
removing unnecessary batch normalization layers, leading to significant improvements
in image quality and computational efficiency. VQ-VAE-2 [35] utilizes vector quantization and autoencoders to generate high-fidelity images, providing
a unique approach to handling image super-resolution tasks. SRFlow [36] introduces a novel approach using normalizing flows to model the distribution of
HR images, enabling it to generate diverse and high-quality outputs. Finally, EfficientNet-SR
[37] applies the EfficientNet architecture, known for its balance between accuracy and
efficiency, to the task of image super-resolution, achieving impressive results with
relatively low computational cost. To evaluate the performance of these methods, we
use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) as metrics,
focusing on the Y-channel of the YCbCr color space.
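For reference, the Y-channel evaluation can be sketched as follows, using scikit-image implementations; the BT.601 luma conversion is an assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_y_channel(sr_rgb, hr_rgb):
    """PSNR/SSIM on the Y channel of YCbCr, as reported in Tables 1 and 3."""
    def to_y(img):                          # img: HxWx3 uint8
        r, g, b = [img[..., i].astype(np.float64) / 255.0 for i in range(3)]
        return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b   # ITU-R BT.601 luma
    y_sr, y_hr = to_y(sr_rgb), to_y(hr_rgb)
    psnr = peak_signal_noise_ratio(y_hr, y_sr, data_range=255.0)
    ssim = structural_similarity(y_hr, y_sr, data_range=255.0)
    return psnr, ssim
```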
Table 1 showcases the performance of various super-resolution methods, including our proposed
RFSR, across different degradation models (Bic, BicN, and BBicN), using PSNR and SSIM
metrics. For the Bic model, which involves simple bicubic downsampling, our RFSR method achieves the highest PSNR of 27.65 dB and SSIM of 0.7734, indicating superior reconstruction fidelity compared to other methods. In the more challenging BicN model, which adds Gaussian noise to the downsampled images, RFSR again leads with a PSNR of 25.67 dB and an SSIM of 0.7195, demonstrating robustness against noise. The BBicN model, combining blur, downsampling, and high-level Gaussian noise, presents the most complex degradation scenario. Here, RFSR attains a PSNR of 23.99 dB and an SSIM of 0.6892,
outperforming other methods and showing its capability to handle severe degradation
while maintaining structural integrity. Overall, our RFSR method consistently excels
across all degradation models, highlighting its robustness and effectiveness. The
advanced transformer-based architecture of RFSR, which effectively captures and reconstructs
fine details, sets a new benchmark in FSR, significantly enhancing both PSNR and SSIM
metrics.
We then compare our RFSR method with state-of-the-art approaches visually. As shown
in Figs. 4 and 5, our method demonstrates superior performance in reconstructing high-quality face
images under various degradation conditions.
Table 1. Benchmark results with different degradation models (PSNR in dB).

| Methods | PSNR (Bic) | SSIM (Bic) | PSNR (BicN) | SSIM (BicN) | PSNR (BBicN) | SSIM (BBicN) |
|---|---|---|---|---|---|---|
| ESRGAN [24] | 27.0503 | 0.7632 | 25.1312 | 0.7101 | 23.6529 | 0.6589 |
| SwinIR [30] | 27.2504 | 0.7671 | 25.3233 | 0.7128 | 23.7621 | 0.6612 |
| EDSR [6] | 27.3412 | 0.7689 | 25.4525 | 0.7146 | 23.8231 | 0.6628 |
| VQ-VAE-2 [35] | 27.1711 | 0.7643 | 25.2443 | 0.7110 | 23.7021 | 0.6597 |
| SRFlow [36] | 27.3103 | 0.7679 | 25.4041 | 0.7133 | 23.7802 | 0.6619 |
| EfficientNet-SR [37] | 27.4051 | 0.7701 | 25.5056 | 0.7157 | 23.8412 | 0.6635 |
| RFSR (Ours) | 27.6502 | 0.7734 | 25.6712 | 0.7195 | 23.9913 | 0.6673 |
Table 2. Face recognition evaluation on the BBicN degradation SR results from each method.

| Methods | Performance | Methods | Performance |
|---|---|---|---|
| Bicubic | 0.8058 | EDSR | 0.8530 |
| ESRGAN | 0.8412 | VQ-VAE-2 | 0.8645 |
| SwinIR | 0.8620 | SRFlow | 0.8580 |
| EfficientNet-SR | 0.8710 | RFSR (Ours) | 0.8935 |
Fig. 4 illustrates the visual comparison of different methods under the bicubic downsampling
(Bic) model. While all methods show improvements over the LR input, our RFSR method
stands out in several aspects. The facial features reconstructed by RFSR are noticeably
sharper and more defined, especially in critical areas such as the eyes, nose, and
mouth. For instance, the eye region in our result shows clearer iris details and more
natural eyelash rendering compared to other methods. The skin texture produced by
RFSR also appears more realistic, avoiding the over-smoothing effect seen in some
competing methods like ESRGAN or the slight blurriness in SwinIR results.
In Fig. 5, we present the visual results under the more challenging BicN model, which introduces
noise to the downsampled images. The impact of noise is evident across all methods,
but our RFSR demonstrates remarkable resilience. While methods like EDSR and VQ-VAE-2
struggle to maintain clear facial structures in the presence of noise, our approach
preserves the overall facial integrity and fine details. The hair texture in our result,
for example, retains more natural waviness and individual strand definition, whereas
other methods tend to produce a more smudged or artificial appearance. Notably, our
method excels in preserving the subtle contours and expressions of the face. The nasolabial
folds and slight smile lines are more accurately reconstructed in our results, contributing
to a more lifelike and expressive face image. This is particularly evident when compared
to methods like SRFlow or EfficientNet-SR, which, while effective in general super-resolution
tasks, seem to struggle with the nuanced details of facial features under noisy conditions.
Furthermore, the color fidelity in our reconstructions is superior. The skin tone
appears more natural and consistent across the face, avoiding the color distortions
or uneven patches sometimes seen in the results of other methods. This is crucial
for maintaining the overall realism and quality of the super-resolved face images.
The effectiveness of our RFSR method is particularly evident in the more complex degradation
scenarios. Even as the noise level increases, our method maintains a consistent quality
in facial reconstruction. This is in contrast to some other approaches, where the
quality degrades more noticeably with increased noise, resulting in loss of facial
details or introduction of artifacts.
In summary, the visual comparisons in Figs. 4 and 5 clearly demonstrate the superiority of our RFSR method in FSR tasks. Our approach
not only produces sharper and more detailed facial features but also shows remarkable
robustness against various types of image degradation, particularly noise. This visual
evidence, combined with the quantitative results from Table 1, strongly supports the effectiveness of our proposed method in generating high-quality,
realistic face images from LR inputs.
Fig. 4. Visual comparison of different super-resolution methods under the bicubic
downsampling (Bic) model. Images are from the CelebA test set, downsampled by a factor
of $8 \times$. From left to right: LR input, ESRGAN, SwinIR, EDSR, VQ-VAE-2, SRFlow,
EfficientNet-SR, and our RFSR (proposed).
Fig. 5. Visual comparison of different super-resolution methods under the bicubic
downsampling with noise (BicN) model. Images are from the CelebA test set, downsampled
by a factor of $8\times$ and added Gaussian noise with a noise level of 15. From left
to right: LR noisy input, ESRGAN, SwinIR, EDSR, VQ-VAE-2, SRFlow, EfficientNet-SR,
and our RFSR (proposed).
To corroborate the practical benefit of our method, we further performed face verification experiments using Additive Angular Margin Loss for Deep Face Recognition (ArcFace) [38]. We constructed 1,000 positive sample pairs and 9,000 negative sample pairs based on the SR results (BBicN) from each method. The results are shown in Table 2; our reconstruction results retain identity information better than those of the compared methods.
The face verification results clearly demonstrate the superiority of our RFSR method
in preserving identity-related features. Our method achieves the highest performance
score of 0.8935, significantly outperforming all other compared methods. This indicates
that the face images reconstructed by RFSR not only have better visual quality but
also retain more accurate and distinctive identity information. The performance gap
between RFSR and other methods is particularly notable. For instance, EfficientNet-SR,
which shows the second-best performance, achieves a score of 0.8710, while our method
surpasses it by a margin of 0.0225. This improvement is substantial in the context
of face verification tasks, where even small increases in accuracy can have significant
practical implications. It is worth noting that all learning-based methods show improvements
over the basic bicubic interpolation (0.8058), highlighting the effectiveness of deep
learning approaches in FSR. However, the varying degrees of improvement among different
methods underscore the importance of architectural choices and training strategies
in preserving identity features.
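For completeness, a hypothetical sketch of the verification protocol is shown below; `embed` stands in for an ArcFace embedding extractor, and reporting the best-threshold accuracy is an assumption about the "Performance" metric in Table 2.

```python
import numpy as np

def verification_score(pairs, labels, embed):
    """Cosine-similarity face verification over (img_a, img_b) pairs with 0/1 identity labels."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.array([cos(embed(a), embed(b)) for a, b in pairs])
    labels = np.asarray(labels, dtype=bool)
    # Sweep thresholds and report the best verification accuracy.
    thresholds = np.linspace(sims.min(), sims.max(), 200)
    return max(float(((sims >= t) == labels).mean()) for t in thresholds)
```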
4.5. Effect of Transformer Blocks and Refinement Module
To thoroughly evaluate the effectiveness of our proposed RFSR method, we conduct an
ablation study to analyze the impact of the transformer blocks and the recurrent SR
module. We design three network variants to clearly demonstrate how different components
contribute to the overall performance: RFSR (Ours) as the complete model with all
components, BasicNet V1 as RFSR without the recurrent SR module, and BasicNet V2 as
RFSR without transformer blocks. We train these variants on the FSR task using the
BicN degradation model and evaluate their performance in terms of PSNR over 50 epochs.
The results are presented in Fig. 6.
As shown in Fig. 6, the complete RFSR model consistently outperforms all other variants, achieving the
highest PSNR values throughout the training process. This demonstrates the synergistic
effect of combining transformer blocks with the recurrent SR module. The RFSR model
reaches a PSNR of approximately 25.67 dB by the end of training, setting a new benchmark
for FSR performance. BasicNet V1, which lacks the recurrent SR module, shows the second-best
performance. Its PSNR curve closely follows that of the full RFSR model, but with
a consistent gap of about 0.2 dB, settling around 25.47 dB. This observation highlights
the significant role of the recurrent SR module in enhancing fine details and improving
overall image quality. The recurrent module appears to be particularly effective in
recovering subtle facial features that may be missed by the transformer blocks alone.
BasicNet V2, which eliminates the transformer blocks entirely, shows the lowest performance
among all variants. Its PSNR values plateau around 25.13 dB, significantly lower than
the other models. The substantial gap between BasicNet V2 and the other variants,
particularly the full RFSR model (about 0.54 dB difference), underscores the critical
role of transformer blocks in achieving high-quality FSR. This result suggests that
the transformer architecture is fundamental to capturing the long-range dependencies
and complex feature interactions necessary for effective face image reconstruction.
Interestingly, all models show rapid improvement in the initial epochs, followed by
a more gradual increase in PSNR. However, the full RFSR model maintains a slightly
steeper improvement curve even in later epochs, indicating its superior learning capacity
and potential for further improvement with extended training. This observation suggests
that the combination of transformer blocks and the recurrent SR module not only yields
better results but also provides a more robust and flexible learning framework. To
further validate the effectiveness of our RFSR model, we conduct additional experiments
using different degradation models. Table 3 presents the PSNR and SSIM results for all variants across Bic, BicN, and BBicN degradation
models.
The results in Table 3 consistently demonstrate the superiority of the full RFSR model across all degradation
scenarios. The performance gap is particularly pronounced in more challenging degradation
models like BBicN, where the full model's ability to handle complex degradations becomes
evident.
Fig. 6. Ablation study on effects of transformer blocks and recurrent SR module with
BicN degradation model.
Table 3. Quantitative results of ablation study under different degradation models (PSNR in dB).

| Model | Bic (PSNR/SSIM) | BicN (PSNR/SSIM) | BBicN (PSNR/SSIM) |
|---|---|---|---|
| RFSR (Ours) | 27.65 / 0.7734 | 25.67 / 0.7195 | 23.99 / 0.6892 |
| BasicNet V1 | 27.45 / 0.7711 | 25.47 / 0.7173 | 23.84 / 0.6871 |
| BasicNet V2 | 27.10 / 0.7662 | 25.13 / 0.7124 | 23.58 / 0.6829 |