1. Introduction
In recent years, face super-resolution (FSR) has emerged as a critical area of research
in computer vision, attracting significant attention from the scientific community
[1]. FSR, alternatively known as face hallucination, aims to reconstruct high-resolution
(HR) facial images from their low-resolution (LR) counterparts. This technology plays
a crucial role in various applications, including video surveillance, facial recognition
systems, and digital image forensics [2]. The advancements in FSR have not only enhanced the visual quality of facial images
but also significantly improved the performance of downstream tasks such as face recognition,
emotion analysis, and facial landmark detection [3]. As the demand for high-quality facial imagery continues to grow across multiple
domains, the development of robust and efficient FSR methods has become increasingly
crucial.
FSR represents a specialized subset of the broader task of single image super-resolution
(SISR) [4-6], which is inherently challenging due to the ill-posed nature of reconstructing HR
details from LR inputs. Unlike SISR, which addresses images from arbitrary scenes,
FSR focuses exclusively on facial images, leveraging the unique structural characteristics
and statistical regularities of faces. This allows FSR methods to exploit strong prior
knowledge about facial configurations, facilitating the recovery of both global structures
and local details. Consequently, FSR approaches have demonstrated superior performance
over general SISR techniques, particularly at high upscaling factors (e.g., $8\times$)
[7]. Recent advancements in FSR have been largely driven by deep learning, with deep
convolutional neural networks (DCNNs) providing robust generative capabilities that
have substantially improved the quality of super-resolved facial images [8]. Several innovative FSR methods have emerged in recent years, further pushing the
boundaries of what is achievable in this domain [9].
Recent FSR approaches have increasingly incorporated advanced deep learning architectures
to enhance their performance. Notably, transformer-based models, which have demonstrated
remarkable success in various computer vision tasks, have been adapted for FSR [10,11]. These models excel at capturing long-range dependencies within images, a crucial
aspect for reconstructing facial features coherently. Concurrently, recurrent neural
networks (RNNs) have shown promise in iterative refinement processes, allowing for
progressive improvement of image quality [12,13]. However, existing methods often struggle to fully leverage the strengths of these
architectures in the context of FSR. Many approaches apply transformers or RNNs in
isolation, potentially missing out on the synergistic benefits of combining these
techniques [1]. Furthermore, the integration of global and local feature learning, critical for
preserving both overall facial structure and fine details, remains a challenge in
current FSR frameworks [9]. Additionally, while some methods have attempted to incorporate facial priors or
attention mechanisms, they often do so in a manner that does not fully exploit the
hierarchical nature of facial features [5,6]. Addressing these limitations requires a more sophisticated approach that can seamlessly
blend the global context-awareness of transformers with the iterative refinement capabilities
of RNNs, while also incorporating effective mechanisms for feature fusion and attention.
In this paper, we propose a novel Restormer-based Face Super-Resolution (RFSR) method
that addresses the aforementioned challenges by integrating the strengths of transformer
architectures and RNNs. Our approach comprises four key components: an initial feature
extraction module (G1), a Restormer module, a Recurrent Super-Resolution module (RecurrentSRModule),
and a final reconstruction module (G2). The Restormer module leverages multi-head
transposed self-attention mechanisms to capture long-range dependencies and extract
global facial features effectively [14]. This allows the network to maintain coherence across facial structures even at high
upscaling factors. The RecurrentSRModule, inspired by the iterative refinement capabilities
of RNNs, progressively enhances image details through multiple iterations [15]. This iterative process enables the network to adapt dynamically to varying levels
of input degradation, a crucial feature when dealing with extremely LR or noisy facial
images. To further improve reconstruction quality, we implement a residual connection
that adds the upsampled original input to the network output. This design allows the
main network to focus on learning high-frequency details and image enhancement while
preserving low-frequency information from the original input. Our approach differs
from previous methods by seamlessly integrating global context-aware feature extraction
with iterative local refinement, addressing the limitations of using these techniques
in isolation. Extensive experiments on benchmark datasets demonstrate that RFSR significantly
outperforms existing state-of-the-art FSR methods, particularly in challenging scenarios
involving severe degradation or extreme upscaling factors.
3. Methods
In FSR, our primary objective is to reconstruct HR facial images $I_{SR}$ from their LR counterparts $I_{LR}$, while preserving intricate facial details and maintaining identity consistency. To achieve this, we propose a novel RFSR framework that synergistically
combines the strengths of transformer architectures with recurrent refinement techniques.
The overall architecture of our proposed method is illustrated in Fig. 1.
Fig. 1. Overview of the proposed RFSR framework: (a) The main architecture, consisting
of convolutional layers, pixel shuffle, Restormer, and recurrent SR modules, with
an upsample path; (b) Detailed structure of the Restormer, composed of multiple transformer
blocks with down-sampling and up-sampling paths; (c) Internal structure of a Transformer block; (d) Multi-DConv head transposed attention (MDTA) mechanism used within the Transformer blocks.
As shown in Fig. 1, the RFSR framework consists of four key components: (1) an initial feature extraction
module (G1) that captures low-level facial features from the input LR image, (2) a
Restormer module leveraging multi-head transposed self-attention mechanisms to model
long-range dependencies and global facial structure, (3) a Recurrent Super-Resolution
module (RecurrentSRModule) that progressively refines facial details through multiple
iterations, and (4) a final reconstruction module (G2) that synthesizes the HR output.
To effectively integrate information across these modules and enhance the overall
reconstruction quality, we introduce a novel Feature Integration and Enhancement (FIE)
block, as depicted in the detailed view of Fig. 1. This block dynamically fuses features from different stages of the network, allowing
for adaptive refinement of facial details based on both local and global context.
Our method, as illustrated in Fig. 1, emphasizes the importance of preserving facial structure while enhancing fine details.
By leveraging the global context modeling capabilities of the Restormer module and
the iterative refinement process of the RecurrentSRModule, RFSR can effectively handle
various facial poses, expressions, and lighting conditions. This approach enables
the network to produce high-quality super-resolved facial images that maintain fidelity
to the input while significantly enhancing resolution and detail.
The pipeline process can be described as follows:
$x_1 = G_1\!\left(I_{LR}\right), \qquad (1)$
where $G_1$ represents the initial feature extraction, which includes convolutional layers and pixel shuffle operations to provide enriched facial information for subsequent processing. The extracted features $x_1$ are then fed into the Restormer module:
$x_2 = R\!\left(x_1\right), \qquad (2)$
where $R$ denotes the Restormer function, leveraging multi-head transposed self-attention mechanisms to model long-range dependencies and capture global facial structure. Following this, the RecurrentSRModule iteratively refines the facial details:
$x_3 = S\!\left(x_2\right), \qquad (3)$
where $S$ represents the RecurrentSRModule function. Finally, the HR face image is reconstructed using the G2 module:
$y^{\prime} = G_2\!\left(x_3\right), \qquad (4)$
where $y^{\prime}$ is the intermediate super-resolved output. To further enhance the results, we incorporate a residual connection with an upsampled version of the input:
$y = y^{\prime} + U\!\left(I_{LR}\right), \qquad (5)$
where $U$ denotes the upsampling function and $y$ is the final HR output. Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $N$ is the number of training images and $y^{(i)}$ is the ground-truth HR image corresponding to the LR image $x^{(i)}$, the loss function of the proposed RFSR is
$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\left\|\hat{y}^{(i)} - y^{(i)}\right\|_1 + \lambda\, L_p\!\left(\hat{y}^{(i)}, y^{(i)}\right)\right], \qquad (6)$
where $\theta$ denotes the network parameters, $\hat{y}^{(i)}$ is the network output $y$ obtained from the LR input $x^{(i)}$, and $\lambda$ is the trade-off between the L1 loss and the perceptual loss $L_p$. The L1 loss ensures pixel-wise fidelity, while the perceptual loss encourages visually pleasing results with fine facial details. The perceptual loss $L_p$ is computed using features extracted from a pre-trained VGG-19 network. Specifically, it measures the difference between the high-level features of the predicted image $\hat{y}^{(i)}$ and the ground truth $y^{(i)}$: $L_p\!\left(\hat{y}^{(i)}, y^{(i)}\right) = \sum_j \left\|\phi_j\!\left(\hat{y}^{(i)}\right) - \phi_j\!\left(y^{(i)}\right)\right\|_2^2$, where $\phi_j$ denotes the feature maps obtained from the $j$-th layer of the VGG-19 network.
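For illustration, the following PyTorch-style sketch mirrors Eqs. (1)-(6); the sub-module classes, helper names, and the weight $\lambda$ value are placeholders under stated assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFSR(nn.Module):
    """Minimal sketch of the RFSR pipeline in Eqs. (1)-(5). The sub-modules g1,
    restormer, recurrent_sr and g2 stand in for the modules of Sections 3.1-3.2."""
    def __init__(self, g1, restormer, recurrent_sr, g2, scale=8):
        super().__init__()
        self.g1, self.restormer = g1, restormer
        self.recurrent_sr, self.g2 = recurrent_sr, g2
        self.scale = scale

    def forward(self, i_lr):
        x1 = self.g1(i_lr)                  # Eq. (1): initial feature extraction
        x2 = self.restormer(x1)             # Eq. (2): global facial structure modelling
        x3 = self.recurrent_sr(x2)          # Eq. (3): iterative detail refinement
        y_prime = self.g2(x3)               # Eq. (4): reconstruction
        up = F.interpolate(i_lr, scale_factor=self.scale,
                           mode="bicubic", align_corners=False)
        return y_prime + up                 # Eq. (5): residual connection with the upsampled input

def rfsr_loss(y_hat, y, vgg_features, lam=0.01):
    """Eq. (6): L1 + lambda * VGG-19 perceptual loss. `vgg_features` is a placeholder
    feature extractor and lam = 0.01 is an assumed, not reported, trade-off value."""
    return F.l1_loss(y_hat, y) + lam * F.mse_loss(vgg_features(y_hat), vgg_features(y))
```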
3.1. Restormer Module
The Restormer module [14], denoted as $R$ in Eq. (2), serves as the global feature extractor of our framework. Motivated by the need to capture both local details and global facial structures, this module leverages the power of transformer
architectures while introducing novel components tailored specifically for image processing
tasks. The Restormer's design is driven by the goal of effectively modeling long-range
dependencies in facial images while maintaining computational efficiency. At its core,
the Restormer employs a U-shaped architecture, comprising an encoder, a latent space
processor, and a decoder. The process begins with an overlapped patch embedding operation $E$, which projects the input features into a high-dimensional space:
$z_e^{0} = E\!\left(z_0\right),$
where $z_0$ is the input to the Restormer module. This initial embedding allows the network to capture local spatial relationships effectively.
The encoder consists of multiple stages, each operating at progressively lower resolutions.
For each encoder stage $i$ ($i = 1$, $2$, $3$), the features are processed as
$z_e^{i} = E_i\!\left(D_i\!\left(z_e^{i-1}\right)\right),$
where $E_i$ represents the $i$-th encoder stage composed of multiple Transformer blocks, and $D_i$ is a downsampling operation. The number of Transformer blocks in each encoder
stage follows the pattern $[4$, $6$, $6]$, resulting in a total depth of 16 blocks
in the encoder. This increasing depth as spatial dimensions decrease allows the network
to capture increasingly abstract and global features.
The latent space processing, occurring at the bottleneck of the U-shaped architecture, is defined as
$z_l = L\!\left(z_e^{3}\right),$
where $L$ consists of 8 Transformer blocks. This deep processing at the lowest resolution
enables the network to capture global context and long-range dependencies across the
entire facial image, which is crucial for accurately reconstructing high-frequency
details in the super-resolved output.
The decoder mirrors the encoder's structure, progressively upsampling and refining
the features. For each decoder stage $i$ ($i = 3$, $2$, $1$), the features are processed as
$z_d^{i} = T_i\!\left(\left[\,U_i\!\left(z_d^{i+1}\right),\ z_e^{i-1}\right]\right), \qquad z_d^{4} \equiv z_l,$
where $T_i$ represents the $i$-th decoder stage, $U_i$ is an upsampling operation, and $[\cdot\,,\,\cdot]$ denotes channel-wise concatenation. This skip connection design
facilitates effective multi-scale feature integration, allowing the network to combine
low-level details with high-level semantic information.
A key innovation in the Restormer lies in its Transformer block design. Each block
incorporates two novel components: the Multi-DConv Head Transposed Self-Attention
(MDTA) and the Gated-DConv Feed-Forward Network (GDFN). The MDTA mechanism is expressed as
$\mathrm{Attention}\!\left(\hat{Q}, \hat{K}, \hat{V}\right) = \hat{V}\cdot \mathrm{Softmax}\!\left(\hat{K}\cdot \hat{Q}/\alpha\right),$
where $\alpha$ is a learnable scaling parameter, and $Q$, $K$, and $V$ are obtained through depthwise separable convolutions ($1\times 1$ pointwise followed by $3\times 3$ depthwise) and reshaped into $\hat{Q}$, $\hat{K}$, $\hat{V}$ so that the attention map is computed across channels rather than spatial positions. This design allows for efficient spatial information modeling while maintaining the
benefits of self-attention for capturing long-range dependencies.
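As a concrete reference, a minimal PyTorch sketch of this transposed (channel-wise) attention is shown below; it follows the published Restormer design, and the head count, reshaping, and bias settings are assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDTA(nn.Module):
    """Multi-DConv head transposed attention: a C x C attention map over channels."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))   # learnable alpha
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)  # pointwise projection
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3, padding=1,
                                groups=dim * 3, bias=False)            # depthwise convolution
        self.project_out = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # Flatten spatial dims so attention is computed across channels, not pixels.
        q = F.normalize(q.reshape(b, self.num_heads, c // self.num_heads, h * w), dim=-1)
        k = F.normalize(k.reshape(b, self.num_heads, c // self.num_heads, h * w), dim=-1)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        attn = (q @ k.transpose(-2, -1)) * self.temperature    # (b, heads, c/heads, c/heads)
        out = attn.softmax(dim=-1) @ v                          # (b, heads, c/heads, h*w)
        return self.project_out(out.reshape(b, c, h, w))
```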
The GDFN, following the MDTA in each Transformer block, is described as
$\mathrm{GDFN}(z) = \mathrm{GELU}\!\left(DWConv\!\left(W_1 z\right)\right)\odot DWConv\!\left(W_2 z\right),$
where $\odot$ represents element-wise multiplication, $DWConv$ is a depthwise convolution, and $W_1$, $W_2$ are linear transformations. This gating mechanism enables adaptive
feature modulation based on both channel-wise and spatial information, enhancing the
network's ability to focus on relevant facial features.
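A matching sketch of the gated feed-forward branch is given below (continuing the imports of the MDTA sketch); the channel expansion factor follows the public Restormer code and is an assumption here.

```python
class GDFN(nn.Module):
    """Gated-DConv feed-forward network: two depthwise-conv branches fused by a GELU gate."""
    def __init__(self, dim, expansion=2.66):
        super().__init__()
        hidden = int(dim * expansion)
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1, bias=False)   # W1, W2 jointly
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3, padding=1,
                                groups=hidden * 2, bias=False)                    # DWConv
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1, bias=False)

    def forward(self, x):
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        return self.project_out(F.gelu(x1) * x2)   # the GELU branch gates the other branch
```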
The Restormer also incorporates an adaptive Layer Normalization strategy, offering both bias-free ($LN_{BF}$) and with-bias ($LN_{WB}$) options:
$LN_{BF}(z) = \frac{z-\mu}{\sigma}\odot\gamma, \qquad LN_{WB}(z) = \frac{z-\mu}{\sigma}\odot\gamma + \beta,$
where $\mu$ and $\sigma$ are the mean and standard deviation computed over the channel dimension, and $\gamma$, $\beta$ are learnable affine parameters.
This flexibility in normalization contributes to the network's ability to handle diverse
input distributions and feature characteristics throughout its depth.
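Putting the pieces together, one Transformer block can be sketched as follows, reusing the MDTA and GDFN sketches above; only the bias-free normalization variant is shown, and the with-bias variant simply adds a learnable $\beta$.

```python
class BiasFreeLayerNorm(nn.Module):
    """Per-pixel LayerNorm over channels (the bias-free option LN_BF)."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):                            # x: (B, C, H, W)
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.var(dim=1, keepdim=True, unbiased=False).sqrt()
        return (x - mu) / (sigma + self.eps) * self.gamma.view(1, -1, 1, 1)

class TransformerBlock(nn.Module):
    """LN -> MDTA -> residual, then LN -> GDFN -> residual."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1, self.attn = BiasFreeLayerNorm(dim), MDTA(dim, num_heads)
        self.norm2, self.ffn = BiasFreeLayerNorm(dim), GDFN(dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
```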
The final output of the Restormer module is obtained after a refinement stage: $z_{out} = H\!\left(z_d^{1}\right)$, where $H$ consists of 4 additional Transformer blocks. This brings the total depth of the Restormer to 38 Transformer blocks, allowing for extensive feature refinement and high-quality reconstruction of facial details.
The Restormer's innovative architecture, with its deep hierarchical processing, advanced
attention mechanisms, and adaptive design choices, enables it to effectively capture
and reconstruct intricate facial details across multiple scales. This makes it particularly
well-suited for challenging FSR tasks, where preserving identity-specific features
and generating realistic high-frequency details are paramount. The combination of
local and global feature modeling, coupled with the network's significant depth, allows
the Restormer to achieve superior performance in recovering fine facial structures
and textures.
3.2. Recurrent SR module
The feedback block [15], which forms the core of the RecurrentSRModule ($S$ in Eq. (3)), is designed to iteratively refine feature representations for HR face image reconstruction. This block is inspired by
the need to progressively enhance features through recurrent connections, which is
essential for capturing intricate facial details and improving the final output quality.
As illustrated in Fig. 2, the Feedback Block uses a recurrent structure that allows for continuous refinement
and enhancement of features.
Fig. 2. The structure of the feedback block.
This block consists of alternating up-projection ($U_g$) and down-projection ($D_g$)
operations for each group $g$. The input feature map $F_{\mathrm{in}}$ is initially
compressed using a $1\times 1$ convolutional layer $C_{\mathrm{in}}$. The up-projection steps
transform LR feature maps $L_{g-1}$ into HR maps $H_g$, which are subsequently down-projected
back to LR maps $L_g$. The final output $F_{\mathrm{out}}$ is obtained by concatenating
all intermediate LR feature maps and compressing them with another $1\times 1$ convolutional
layer $C_{\mathrm{out}}$.
The Feedback Block is configured with 6 groups, 4 steps, and 48 feature channels.
This setup balances computational efficiency with the ability to capture detailed
features, which are vital for reconstructing high-quality images.
The block starts by compressing the input feature map $F_{\mathrm{in}}$ to reduce
dimensionality, making it easier for the network to process the data efficiently.
This initial compression is performed using a $1\times 1$ convolutional layer:
$L_0 = C_{\mathrm{in}}\!\left(F_{\mathrm{in}}\right),$
where $C_{\mathrm{in}}$ represents the convolution operation and $L_0$ is the initial LR feature map.
Once the features are compressed, the Feedback Block employs a series of up-projection
and down-projection operations. These operations progressively upscale and downscale
the feature maps, refining the features at multiple scales. For each group $g$, where
$g = 1$, $2$, $\dots$, $6$, the up-projection step is performed by applying the up-projection operation $U_g$ to the LR feature map from the previous step:
$H_g = U_g\!\left(L_{g-1}\right).$
After up-projection, the HR feature map undergoes down-projection through the down-projection operation $D_g$:
$L_g = D_g\!\left(H_g\right).$
To ensure that the network effectively integrates multi-scale features, each up-projection
and down-projection operation utilizes concatenated feature maps from all previous
HR and LR stages. The concatenation process for down-projection can be described as
$L_g = D_g\!\left(\mathrm{Cat}\!\left(H_1, H_2, \dots, H_g\right)\right),$
where $\mathrm{Cat}$ denotes the concatenation operation along the channel dimension.
After processing through all groups, the final output of the Feedback Block is obtained by concatenating all intermediate LR feature maps and applying a final compression operation:
$F_{\mathrm{out}} = C_{\mathrm{out}}\!\left(\mathrm{Cat}\!\left(L_1, L_2, \dots, L_6\right)\right),$
where $C_{\mathrm{out}}$ is another $1\times 1$ convolutional layer that reduces the concatenated features to the desired output dimensionality.
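The following sketch illustrates this structure with the configuration above (6 groups, 48 channels); the single strided (de)convolutions used for $U_g$ and $D_g$, the projection kernel sizes, and the recurrent input concatenation are simplifying assumptions.

```python
class FeedbackBlock(nn.Module):
    """Sketch: 1x1 compression, dense up/down projection groups, 1x1 output fusion."""
    def __init__(self, channels=48, groups=6, scale=4):
        super().__init__()
        k, s, p = {4: (8, 4, 2), 8: (12, 8, 2)}[scale]     # assumed projection kernels
        self.compress_in = nn.Conv2d(2 * channels, channels, kernel_size=1)          # C_in
        self.up_projs = nn.ModuleList(
            [nn.ConvTranspose2d(channels * (g + 1), channels, k, s, p) for g in range(groups)])
        self.down_projs = nn.ModuleList(
            [nn.Conv2d(channels * (g + 1), channels, k, s, p) for g in range(groups)])
        self.compress_out = nn.Conv2d(channels * groups, channels, kernel_size=1)    # C_out

    def forward(self, f_in, last_state):
        # Recurrent input: current features concatenated with the previous iteration's output.
        l = self.compress_in(torch.cat([f_in, last_state], dim=1))   # L_0
        lr_feats, hr_feats = [l], []
        for g in range(len(self.up_projs)):
            h = self.up_projs[g](torch.cat(lr_feats, dim=1))    # H_g = U_g(Cat(L_0..L_{g-1}))
            hr_feats.append(h)
            l = self.down_projs[g](torch.cat(hr_feats, dim=1))  # L_g = D_g(Cat(H_1..H_g))
            lr_feats.append(l)
        return self.compress_out(torch.cat(lr_feats[1:], dim=1))    # F_out from L_1..L_6
```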
The recurrent nature of the Feedback Block is key to its ability to iteratively refine
feature maps. In each iteration, the block takes the output from the previous step
and feeds it back as input for further refinement. This process of recurrent feedback
allows the network to progressively enhance feature representations, making it particularly
effective for tasks that require high-fidelity FSR. The design of the Feedback Block
is driven by the need to iteratively improve the quality of feature representations.
By alternating between up-projection and down-projection, the block captures and integrates
features across different scales, making it adept at reconstructing fine details and
complex structures in HR facial images. The concatenation of features from various
scales, followed by a compression step, ensures that the final feature representation
is both rich in detail and computationally efficient to process.
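As a usage sketch, the recurrent unrolling over the 4 refinement steps might look like the following hypothetical helper; the zero initial state is an assumption.

```python
def recurrent_refine(feat, feedback_block, steps=4):
    """Unroll the feedback block: each step's output becomes the next step's hidden state."""
    state = torch.zeros_like(feat)       # assumed zero initial state
    for _ in range(steps):               # "4 steps" from the configuration above
        state = feedback_block(feat, state)
    return state
```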
In conclusion, the Feedback Block plays a pivotal role in the model's ability to generate
high-quality super-resolution images. Its iterative structure, combining up-projection,
down-projection, and feature concatenation, allows for effective multi-scale feature
integration and refinement. This iterative refinement process, as illustrated in Fig. 2, is essential for achieving the desired level of detail and accuracy in the reconstructed
images.
4. Experiment
4.1. Dataset
We conduct our experiments using the CelebA dataset [34], a large-scale face attributes dataset widely used in face-related computer vision
tasks. For our study, we utilize the first 36,000 images for training and the subsequent
1,000 images for testing. This split provides a comprehensive training set while maintaining
an adequate test set for thorough evaluation. The training images undergo a preprocessing
stage where we coarsely crop the face regions using a face detection algorithm. These
cropped images are then resized to $128\times128$ pixels without any pre-alignment procedures; facial features are not significantly distorted or stretched by this resizing, and skipping alignment preserves the natural variations in facial pose and expression, allowing our model to learn from and adapt to a diverse range of facial orientations and expressions. Fig. 3 presents a selection of sample images from our training dataset, showcasing the variety
in facial features, expressions, and image quality present in the CelebA dataset.
Fig. 3. Sample images from the CelebA dataset used in our training process.
These images demonstrate the diversity in facial features, expressions, and image
quality. For our RFSR method, we use color images in the training process. This decision
enables our network to learn and reconstruct the full spectrum of facial details,
including subtle color variations that are crucial for realistic face reconstruction.
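A hypothetical sketch of the split and resizing described above is shown below; the file layout and the source of the face crops are assumptions.

```python
from pathlib import Path
from PIL import Image

def build_split(crop_dir="celeba_crops"):
    """First 36,000 coarsely cropped CelebA faces for training, the next 1,000 for testing."""
    files = sorted(Path(crop_dir).glob("*.jpg"))
    return files[:36000], files[36000:37000]

def load_hr(path, size=128):
    """Resize a face crop to 128x128 RGB without alignment."""
    return Image.open(path).convert("RGB").resize((size, size), Image.BICUBIC)
```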
4.2. Degradation Models
To comprehensively evaluate the effectiveness of our proposed RFSR method for various
types of image degradation, we employ three degradation models to simulate LR images.
The first model is bicubic downsampling (Bic), which we implement using the Matlab
function imresize with the bicubic option. This model is applied with scaling factors
of $4\times$ and $8\times$, simulating different levels of resolution loss. The second
degradation model, bicubic downsampling with noise (BicN), builds upon the first model
by adding Gaussian noise after the downsampling process. Specifically, after applying
bicubic downsampling with scaling factors of $4\times$ and $8\times$, we add Gaussian
noise with a noise level of 15. A noise level of $n$ indicates Gaussian noise with standard deviation $n$ on a pixel intensity range of $[0, 255]$. This model simulates scenarios where the
LR image is affected by both resolution loss and sensor noise. To create a more challenging
scenario, we introduce a third degradation model: blur, bicubic downsampling, and
noise (BBicN). In this model, we first blur the HR image using a Gaussian kernel of
size $7 \times 7$ with a standard deviation of 1.6. We then apply bicubic downsampling
with scaling factors of $4\times$ and $8\times$, followed by the addition of Gaussian
noise with a noise level of 30. This model represents a complex degradation process
involving blur, downsampling, and severe noise. These degradation models allow us
to test our RFSR method under various conditions, from simple resolution reduction
to complex scenarios involving multiple types of image degradation. By using these
diverse degradation models, we aim to demonstrate the robustness and effectiveness
of our proposed method in handling different types and levels of image quality deterioration
commonly encountered in real-world FSR tasks.
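The three degradation pipelines can be summarized by the following sketch; note that OpenCV's bicubic resizing only approximates Matlab's imresize, and the function name is illustrative.

```python
import numpy as np
import cv2

def degrade(hr, scale=8, mode="BBicN"):
    """Simulate an LR image with Bic, BicN, or BBicN degradation (values as reported above)."""
    img = hr.astype(np.float32)
    if mode == "BBicN":                                   # 7x7 Gaussian blur, sigma 1.6
        img = cv2.GaussianBlur(img, (7, 7), 1.6)
    h, w = img.shape[:2]
    lr = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    if mode in ("BicN", "BBicN"):                         # additive Gaussian noise
        sigma = 15 if mode == "BicN" else 30              # noise level on a [0, 255] range
        lr = lr + np.random.normal(0.0, sigma, lr.shape)
    return np.clip(lr, 0, 255).astype(np.uint8)
```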
4.3. Training Setting
We initialize the network parameters using the Xavier initialization method. All transformer
blocks in our Restormer-based architecture use the Gaussian Error Linear Unit (GELU)
as the activation function. Our model is implemented using the PyTorch framework and
optimized using the AdamW optimizer with cosine learning rate scheduling. The initial
learning rate is set to $3\times {10}^{-4}$ and gradually decreases to $1 \times 10^{-6}$
over the course of training. We use a batch size of 8 for our experiments. The momentum
parameters $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively. Weight decay is applied
with a factor of $1 \times {10}^{-4}$ to prevent overfitting. We train our RFSR model
for a total of 300 epochs. The training process takes approximately 24 hours on a
single NVIDIA RTX 3090 GPU. To assess the quality of the super-resolved images, we
employ two widely used objective image quality assessment metrics: Peak Signal-to-Noise
Ratio (PSNR) and Structural Similarity Index (SSIM). All metrics are calculated on
the Y-channel of the YCbCr color space of the super-resolved images. For the loss
function, we use a combination of L1 loss and perceptual loss. The perceptual loss
is computed using the features extracted from a pre-trained VGG-19 network. The total
loss is a weighted sum of these two components, with the weights empirically set to
balance the contribution of each loss term. To enhance the model's generalization
capability, we apply several data augmentation techniques during training. These include random horizontal flipping with a probability of 0.5, random rotation with angles selected from the range of $\pm10$ degrees, and random brightness and contrast adjustments with factors selected from the range of $\pm15\%$ (corresponding to a factor range of 0.85 to 1.15). These augmentations are applied sequentially, with each transformation having an independent probability of being applied to a given image.
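A sketch of this optimization and augmentation setup is shown below; `model` is a placeholder, and the independent application probability of each augmentation (0.5 here) is an assumption.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
import torchvision.transforms as T

def make_training_setup(model, epochs=300):
    """Optimizer, cosine schedule, and augmentations with the reported hyperparameters."""
    optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)
    augment = T.Compose([
        T.RandomHorizontalFlip(p=0.5),                                        # p = 0.5
        T.RandomApply([T.RandomRotation(degrees=10)], p=0.5),                 # +/- 10 degrees
        T.RandomApply([T.ColorJitter(brightness=0.15, contrast=0.15)], p=0.5),  # +/- 15%
    ])
    return optimizer, scheduler, augment
```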
4.4. Comparisons with State-of-the-Art Methods
We compare our proposed RFSR method with several state-of-the-art super-resolution
approaches, each representing different advancements in the field. For fair comparison,
all models are trained on the same CelebA dataset using the same training split and
under consistent settings. ESRGAN [24] enhances the traditional GAN-based super-resolution methods by introducing a more
robust architecture and a perceptual loss function, which focuses on high-frequency
details to generate visually pleasing images. SwinIR [30], based on Swin Transformers, leverages hierarchical feature representations and attention
mechanisms to effectively handle image restoration tasks, achieving high-quality reconstructions.
EDSR [6] employs an enhanced deep residual network that simplifies the network structure by
removing unnecessary batch normalization layers, leading to significant improvements
in image quality and computational efficiency. VQ-VAE-2 [35] utilizes vector quantization and autoencoders to generate high-fidelity images, providing
a unique approach to handling image super-resolution tasks. SRFlow [36] introduces a novel approach using normalizing flows to model the distribution of
HR images, enabling it to generate diverse and high-quality outputs. Finally, EfficientNet-SR
[37] applies the EfficientNet architecture, known for its balance between accuracy and
efficiency, to the task of image super-resolution, achieving impressive results with
relatively low computational cost. To evaluate the performance of these methods, we
use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) as metrics,
focusing on the Y-channel of the YCbCr color space.
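For reference, the Y-channel evaluation can be sketched as follows, using scikit-image implementations; the BT.601 luma conversion is an assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_y_channel(sr_rgb, hr_rgb):
    """PSNR/SSIM on the Y channel of YCbCr, as reported in Tables 1 and 3."""
    def to_y(img):                          # img: HxWx3 uint8
        r, g, b = [img[..., i].astype(np.float64) / 255.0 for i in range(3)]
        return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b   # ITU-R BT.601 luma
    y_sr, y_hr = to_y(sr_rgb), to_y(hr_rgb)
    psnr = peak_signal_noise_ratio(y_hr, y_sr, data_range=255.0)
    ssim = structural_similarity(y_hr, y_sr, data_range=255.0)
    return psnr, ssim
```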
Table 1 showcases the performance of various super-resolution methods, including our proposed
RFSR, across different degradation models (Bic, BicN, and BBicN), using PSNR and SSIM
metrics. For the Bic model, which involves simple bicubic downsampling, our RFSR method achieves the highest PSNR of 27.65 dB and SSIM of 0.7734, indicating superior reconstruction fidelity compared to other methods. In the more challenging BicN model, which adds Gaussian noise to the downsampled images, RFSR again leads with a PSNR of 25.67 dB and an SSIM of 0.7195, demonstrating robustness against noise. The BBicN model, combining blur, downsampling, and high-level Gaussian noise, presents the most complex degradation scenario. Here, RFSR attains a PSNR of 23.99 dB and an SSIM of 0.6892,
outperforming other methods and showing its capability to handle severe degradation
while maintaining structural integrity. Overall, our RFSR method consistently excels
across all degradation models, highlighting its robustness and effectiveness. The
advanced transformer-based architecture of RFSR, which effectively captures and reconstructs
fine details, sets a new benchmark in FSR, significantly enhancing both PSNR and SSIM
metrics.
We then compare our RFSR method with state-of-the-art approaches visually. As shown
in Figs. 4 and 5, our method demonstrates superior performance in reconstructing high-quality face
images under various degradation conditions.
Table 1. Benchmark results with different degradation models (PSNR in dB).

| Methods | PSNR (Bic) | SSIM (Bic) | PSNR (BicN) | SSIM (BicN) | PSNR (BBicN) | SSIM (BBicN) |
|---|---|---|---|---|---|---|
| ESRGAN [24] | 27.0503 | 0.7632 | 25.1312 | 0.7101 | 23.6529 | 0.6589 |
| SwinIR [30] | 27.2504 | 0.7671 | 25.3233 | 0.7128 | 23.7621 | 0.6612 |
| EDSR [6] | 27.3412 | 0.7689 | 25.4525 | 0.7146 | 23.8231 | 0.6628 |
| VQ-VAE-2 [35] | 27.1711 | 0.7643 | 25.2443 | 0.7110 | 23.7021 | 0.6597 |
| SRFlow [36] | 27.3103 | 0.7679 | 25.4041 | 0.7133 | 23.7802 | 0.6619 |
| EfficientNet-SR [37] | 27.4051 | 0.7701 | 25.5056 | 0.7157 | 23.8412 | 0.6635 |
| RFSR (Ours) | 27.6502 | 0.7734 | 25.6712 | 0.7195 | 23.9913 | 0.6673 |
Table 2. Face recognition evaluation on the BBicN degradation SR results from each method.

| Methods | Performance | Methods | Performance |
|---|---|---|---|
| Bicubic | 0.8058 | EDSR | 0.8530 |
| ESRGAN | 0.8412 | VQ-VAE-2 | 0.8645 |
| SwinIR | 0.8620 | SRFlow | 0.8580 |
| EfficientNet-SR | 0.8710 | RFSR (Ours) | 0.8935 |
Fig. 4 illustrates the visual comparison of different methods under the bicubic downsampling
(Bic) model. While all methods show improvements over the LR input, our RFSR method
stands out in several aspects. The facial features reconstructed by RFSR are noticeably
sharper and more defined, especially in critical areas such as the eyes, nose, and
mouth. For instance, the eye region in our result shows clearer iris details and more
natural eyelash rendering compared to other methods. The skin texture produced by
RFSR also appears more realistic, avoiding the over-smoothing effect seen in some
competing methods like ESRGAN or the slight blurriness in SwinIR results.
In Fig. 5, we present the visual results under the more challenging BicN model, which introduces
noise to the downsampled images. The impact of noise is evident across all methods,
but our RFSR demonstrates remarkable resilience. While methods like EDSR and VQ-VAE-2
struggle to maintain clear facial structures in the presence of noise, our approach
preserves the overall facial integrity and fine details. The hair texture in our result,
for example, retains more natural waviness and individual strand definition, whereas
other methods tend to produce a more smudged or artificial appearance. Notably, our
method excels in preserving the subtle contours and expressions of the face. The nasolabial
folds and slight smile lines are more accurately reconstructed in our results, contributing
to a more lifelike and expressive face image. This is particularly evident when compared
to methods like SRFlow or EfficientNet-SR, which, while effective in general super-resolution
tasks, seem to struggle with the nuanced details of facial features under noisy conditions.
Furthermore, the color fidelity in our reconstructions is superior. The skin tone
appears more natural and consistent across the face, avoiding the color distortions
or uneven patches sometimes seen in the results of other methods. This is crucial
for maintaining the overall realism and quality of the super-resolved face images.
The effectiveness of our RFSR method is particularly evident in the more complex degradation
scenarios. Even as the noise level increases, our method maintains a consistent quality
in facial reconstruction. This is in contrast to some other approaches, where the
quality degrades more noticeably with increased noise, resulting in loss of facial
details or introduction of artifacts.
In summary, the visual comparisons in Figs. 4 and 5 clearly demonstrate the superiority of our RFSR method in FSR tasks. Our approach
not only produces sharper and more detailed facial features but also shows remarkable
robustness against various types of image degradation, particularly noise. This visual
evidence, combined with the quantitative results from Table 1, strongly supports the effectiveness of our proposed method in generating high-quality,
realistic face images from LR inputs.
Fig. 4. Visual comparison of different super-resolution methods under the bicubic
downsampling (Bic) model. Images are from the CelebA test set, downsampled by a factor
of $8 \times$. From left to right: LR input, ESRGAN, SwinIR, EDSR, VQ-VAE-2, SRFlow,
EfficientNet-SR, and our RFSR (proposed).
Fig. 5. Visual comparison of different super-resolution methods under the bicubic
downsampling with noise (BicN) model. Images are from the CelebA test set, downsampled
by a factor of $8\times$ and added Gaussian noise with a noise level of 15. From left
to right: LR noisy input, ESRGAN, SwinIR, EDSR, VQ-VAE-2, SRFlow, EfficientNet-SR,
and our RFSR (proposed).
To corroborate the practical benefit of our method, we further performed face verification experiments using Additive Angular Margin Loss for Deep Face Recognition (ArcFace) [38]. We constructed 1,000 positive sample pairs and 9,000 negative sample pairs based on the SR results (BBicN) from each method. The results are shown in Table 2; our reconstruction results retain identity information better than those of the compared methods.
The face verification results clearly demonstrate the superiority of our RFSR method
in preserving identity-related features. Our method achieves the highest performance
score of 0.8935, significantly outperforming all other compared methods. This indicates
that the face images reconstructed by RFSR not only have better visual quality but
also retain more accurate and distinctive identity information. The performance gap
between RFSR and other methods is particularly notable. For instance, EfficientNet-SR,
which shows the second-best performance, achieves a score of 0.8710, while our method
surpasses it by a margin of 0.0225. This improvement is substantial in the context
of face verification tasks, where even small increases in accuracy can have significant
practical implications. It is worth noting that all learning-based methods show improvements
over the basic bicubic interpolation (0.8058), highlighting the effectiveness of deep
learning approaches in FSR. However, the varying degrees of improvement among different
methods underscore the importance of architectural choices and training strategies
in preserving identity features.
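For completeness, a hypothetical sketch of the verification protocol is shown below; `embed` stands in for an ArcFace embedding extractor, and reporting the best-threshold accuracy is an assumption about the "Performance" metric in Table 2.

```python
import numpy as np

def verification_score(pairs, labels, embed):
    """Cosine-similarity face verification over (img_a, img_b) pairs with 0/1 identity labels."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.array([cos(embed(a), embed(b)) for a, b in pairs])
    labels = np.asarray(labels, dtype=bool)
    # Sweep thresholds and report the best verification accuracy.
    thresholds = np.linspace(sims.min(), sims.max(), 200)
    return max(float(((sims >= t) == labels).mean()) for t in thresholds)
```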
4.5. Effect of Transformer Blocks and Refinement Module
To thoroughly evaluate the effectiveness of our proposed RFSR method, we conduct an
ablation study to analyze the impact of the transformer blocks and the recurrent SR
module. We design three network variants to clearly demonstrate how different components
contribute to the overall performance: RFSR (Ours) as the complete model with all
components, BasicNet V1 as RFSR without the recurrent SR module, and BasicNet V2 as
RFSR without transformer blocks. We train these variants on the FSR task using the
BicN degradation model and evaluate their performance in terms of PSNR over 50 epochs.
The results are presented in Fig. 6.
As shown in Fig. 6, the complete RFSR model consistently outperforms all other variants, achieving the
highest PSNR values throughout the training process. This demonstrates the synergistic
effect of combining transformer blocks with the recurrent SR module. The RFSR model
reaches a PSNR of approximately 25.67 dB by the end of training, setting a new benchmark
for FSR performance. BasicNet V1, which lacks the recurrent SR module, shows the second-best
performance. Its PSNR curve closely follows that of the full RFSR model, but with
a consistent gap of about 0.2 dB, settling around 25.47 dB. This observation highlights
the significant role of the recurrent SR module in enhancing fine details and improving
overall image quality. The recurrent module appears to be particularly effective in
recovering subtle facial features that may be missed by the transformer blocks alone.
BasicNet V2, which eliminates the transformer blocks entirely, shows the lowest performance
among all variants. Its PSNR values plateau around 25.13 dB, significantly lower than
the other models. The substantial gap between BasicNet V2 and the other variants,
particularly the full RFSR model (about 0.54 dB difference), underscores the critical
role of transformer blocks in achieving high-quality FSR. This result suggests that
the transformer architecture is fundamental to capturing the long-range dependencies
and complex feature interactions necessary for effective face image reconstruction.
Interestingly, all models show rapid improvement in the initial epochs, followed by
a more gradual increase in PSNR. However, the full RFSR model maintains a slightly
steeper improvement curve even in later epochs, indicating its superior learning capacity
and potential for further improvement with extended training. This observation suggests
that the combination of transformer blocks and the recurrent SR module not only yields
better results but also provides a more robust and flexible learning framework. To
further validate the effectiveness of our RFSR model, we conduct additional experiments
using different degradation models. Table 3 presents the PSNR and SSIM results for all variants across Bic, BicN, and BBicN degradation
models.
The results in Table 3 consistently demonstrate the superiority of the full RFSR model across all degradation
scenarios. The performance gap is particularly pronounced in more challenging degradation
models like BBicN, where the full model's ability to handle complex degradations becomes
evident.
Fig. 6. Ablation study on effects of transformer blocks and recurrent SR module with
BicN degradation model.
Table 3. Quantitative results of ablation study under different degradation models (PSNR in dB).

| Model | Bic (PSNR/SSIM) | BicN (PSNR/SSIM) | BBicN (PSNR/SSIM) |
|---|---|---|---|
| RFSR (Ours) | 27.65 / 0.7734 | 25.67 / 0.7195 | 23.99 / 0.6892 |
| BasicNet V1 | 27.45 / 0.7711 | 25.47 / 0.7173 | 23.84 / 0.6871 |
| BasicNet V2 | 27.10 / 0.7662 | 25.13 / 0.7124 | 23.58 / 0.6829 |