
  1. (Department of Imaging Science and Arts, GSAIM, Chung-Ang University, Seoul, South Korea. {looloo330, jihyongoh}@cau.ac.kr)
  2. (Department of Virtual Convergence, GSAIM, Chung-Ang University, Seoul, South Korea. cheom@cau.ac.kr)



Implicit neural representation, Neural field, Activation function, Positional encoding, Spectral bias

1. INTRODUCTION

Low-level vision (LLV) refers to fundamental tasks that operate directly on pixel- or signal-level data to restore, enhance, or reconstruct degraded visual inputs. Typical examples include image super-resolution [1, 2], denoising [3], deblurring [4], and inpainting [5]. These tasks form the foundation of many high-level vision (HLV) applications, being crucial for enhancing perceptual quality and frequently serving as essential preprocessing steps for downstream objectives such as object detection [6, 7], semantic segmentation [8, 9], and scene understanding [10].

Alongside these developments, the landscape of LLV tasks has expanded significantly with rapid advances in 3D vision methods, now encompassing areas such as 3D scene reconstruction [11-13], novel view synthesis [14], and occupancy field estimation [26]. These emerging tasks are increasingly vital in fields like robotics [16], autonomous driving [17], augmented reality [18, 19], and digital twin systems [20], all of which demand high-fidelity, continuous representations of complex spatial structures, a capability that traditional grid-based approaches [21] struggle to provide.

Against this backdrop, Implicit Neural Representations (INRs) have emerged as a unified and powerful paradigm for addressing these challenges. Traditional neural representation methods such as grid- [21], voxel- [22], or point-based models [23], which explicitly store sampled values in discrete structures, inevitably suffer from discretization artifacts, prohibitive memory consumption due to cubic scaling, and poor scalability across resolutions. In contrast, INRs model data as continuous signals parameterized by coordinate-based neural networks (typically multilayer perceptrons, MLPs), enabling resolution-independent and compact representations. Moreover, the universality of INR frameworks allows the same architecture to be seamlessly applied across diverse tasks and modalities, including audio [25], image restoration [24], and 3D scene reconstruction [26], all without architectural modification.

Despite rapid advances in INRs and their remarkable versatility across domains, the field still lacks a unified and systematic overview. Most existing studies focus on specific tasks or isolated architectural choices, which can obscure the broader methodological landscape and impede cross-domain understanding. In this work, we provide a comprehensive survey of the foundations, design principles, and practical applications of INRs. Fig. 1 illustrates our taxonomy of INR-based approaches, which outlines the key methodological categories, challenges, datasets, and applications discussed in the subsequent sections and also reflects the organization of this survey. We classify INRs according to their core architectural paradigms, including activation functions, positional encodings, Fourier-based reparameterization, hybrid strategies, and conditioning mechanisms. Additionally, we discuss key challenges such as spectral bias [27] and high-frequency reconstruction, introduce benchmark datasets and evaluation metrics, and summarize prominent real-world use cases.

Fig. 1. Overview of the taxonomy of INRs and survey structure.


2. METHODOLOGY

2.1. Activation Functions towards Spectral and Spatial Inductive Bias Control

In implicit neural representations (INRs), the design of activation functions plays a pivotal role in shaping the model’s inductive biases. While standard activations like ReLU [28] exhibit a strong spectral bias toward low-frequency signals, recent methods propose periodic and localized activations to enable richer signal modeling. This subsection introduces representative activation functions (namely, SIREN [29], FINER [32], HOSC [31], Sinc [30], and WIRE [33]), each designed to control the spectral domain, the spatial domain, or both.

(a) SIREN: Sinusoidal representation networks

SIREN [29] adopts a sinusoidal activation of the form

(1)
$ \sigma(x) = \sin(\omega x), $

where $x$ denotes the input coordinate and $\omega$ is a scaling factor that controls the frequency range. This periodic non-linearity enables the network to effectively represent high-frequency signals.
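As a minimal sketch of Eq. (1), the following NumPy code applies the sinusoidal activation inside a small coordinate MLP, where each hidden layer computes $\sin(\omega(\mathbf{W}\mathbf{x} + \mathbf{b}))$. The layer sizes, the value $\omega = 30$, the initialization bounds, and all function names are illustrative assumptions that follow commonly used SIREN settings; they are not values specified in this survey.

```python
import numpy as np

def siren_layer(x, W, b, omega=30.0):
    """One sinusoidal layer: sin(omega * (x @ W + b))."""
    return np.sin(omega * (x @ W + b))

def init_siren(sizes, omega=30.0, seed=0):
    """Uniform initialization; the bounds are an assumption taken from common SIREN practice."""
    rng = np.random.default_rng(seed)
    params = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        bound = 1.0 / n_in if i == 0 else np.sqrt(6.0 / n_in) / omega
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
        params.append((W, np.zeros(n_out)))
    return params

def siren_forward(coords, params, omega=30.0):
    """Sinusoidal hidden layers followed by a linear output layer."""
    h = coords
    for W, b in params[:-1]:
        h = siren_layer(h, W, b, omega)
    W, b = params[-1]
    return h @ W + b

coords = np.linspace(-1.0, 1.0, 8).reshape(-1, 1)   # 1D input coordinates
params = init_siren([1, 16, 16, 1])
out = siren_forward(coords, params)                  # shape (8, 1)
```

Larger values of `omega` shift the layer toward higher frequencies, so in practice this scaling factor is treated as a hyperparameter.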

(b) FINER: Variable-periodic activation for flexible spectral bias tuning

FINER [32] addresses the frequency range limitation of fixed-periodic activations by employing a variable-periodic function:

(2)
$ \sigma(x) = \sin((|x|+1)x), $

where $x$ denotes the input coordinate. Unlike SIREN’s fixed-periodic activation $\sin(\omega x)$, this formulation introduces frequency variation that depends on the input magnitude, thereby enabling coverage of higher-frequency components. As a result, FINER provides a simple and architecture-agnostic means to mitigate spectral bias [34].
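To make the input-dependent frequency of Eq. (2) concrete, note that differentiating the phase $(|x|+1)x$ gives a local angular frequency of $2|x| + 1$. The short sketch below checks this numerically with a central difference; the helper names and the sample points are assumptions introduced only for illustration.

```python
import numpy as np

def finer(x):
    """Variable-periodic activation of Eq. (2)."""
    return np.sin((np.abs(x) + 1.0) * x)

def local_frequency(x, eps=1e-6):
    """Central-difference derivative of the phase (|x| + 1) * x."""
    phase = lambda t: (np.abs(t) + 1.0) * t
    return (phase(x + eps) - phase(x - eps)) / (2.0 * eps)

x = np.array([0.5, 1.0, 2.0, -2.0])
freq = local_frequency(x)   # close to 2|x| + 1 at each sample point
```

Inputs with larger magnitude therefore oscillate faster, whereas a fixed-periodic activation $\sin(\omega x)$ has the same local frequency $\omega$ everywhere.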

(c) HOSC: Preserving sharp features with tunable periodic activations

HOSC [31] modifies the sine activation by introducing a sharpness parameter:

(3)
$ \sigma(x;\beta) = \tanh(\beta \sin x), $

where $x$ denotes the input coordinate and $\beta$ is a tunable parameter controlling the sharpness of oscillations. As $\beta$ increases, $\tanh(\beta \sin x)$ approaches a square wave, i.e., $\text{sign}(\sin x)$. Smaller values of $\beta$ result in smooth oscillations, while larger values produce sharper transitions. This tunability enables the model to flexibly adapt between smooth and piecewise-constant representations, making HOSC particularly effective for tasks that require sharp boundary preservation.
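The effect of the sharpness parameter in Eq. (3) can be illustrated with a few lines of NumPy. The sketch below measures how far $\tanh(\beta \sin x)$ is from the square wave $\text{sign}(\sin x)$ on an interval where $\sin x > 0$; the interval and the two values of $\beta$ are arbitrary assumptions.

```python
import numpy as np

def hosc(x, beta):
    """HOSC activation of Eq. (3)."""
    return np.tanh(beta * np.sin(x))

# On (0, pi) we have sin(x) > 0, so the square-wave limit sign(sin x) equals 1.
x = np.linspace(0.1, np.pi - 0.1, 50)
gap_small_beta = np.max(np.abs(hosc(x, 1.0) - 1.0))    # smooth oscillation, large gap
gap_large_beta = np.max(np.abs(hosc(x, 50.0) - 1.0))   # nearly a square wave, tiny gap
```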

(d) Sinc: Bandlimited activation with ideal frequency selectivity

The sinc activation is defined as

(4)
$ \sigma(x) = \frac{\sin(\pi x)}{\pi x}, $

where $x$ denotes the input coordinate, the factor $\pi$ normalizes the cutoff frequency, and $\sigma(0) = 1$ is taken as the limiting value. This function is the impulse response of an ideal low-pass filter: in the frequency domain it becomes a rectangular pulse, providing ideal frequency selectivity by uniformly passing components within a specific bandwidth while rejecting those outside. This property is advantageous for reconstructing band-limited signals and suppressing aliasing artifacts in INRs. However, sinc has infinite support in the spatial domain, which makes it less suitable for modeling spatially localized structures and can introduce ringing artifacts when truncated.
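A minimal sketch of Eq. (4) is given below. As written, the expression is undefined at $x = 0$; NumPy's `np.sinc` implements the same normalized form $\sin(\pi x)/(\pi x)$ and returns the limiting value 1 there. The wrapper name and the sample points are assumptions for illustration.

```python
import numpy as np

def sinc_activation(x):
    """Normalized sinc of Eq. (4); np.sinc handles x = 0 via the limit value 1."""
    return np.sinc(x)

x = np.array([0.0, 0.5, 1.0, 10.5])
y = sinc_activation(x)
# The tail decays only like 1 / (pi * |x|), which reflects the infinite spatial support.
tail_bound = 1.0 / (np.pi * np.abs(x[-1]))
```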

(e) WIRE: Wavelet-based activation for space-frequency localization

WIRE [33] employs a Gabor wavelet as its activation function:

(5)
$ \sigma(x) = e^{-s_0 x^2} \cos(\omega_0 x), $

where $x$ denotes the input coordinate, $\omega_0$ is the frequency parameter of the cosine carrier, and $s_0$ controls the Gaussian envelope that provides spatial localization. The cosine term enables the modeling of high-frequency details, while the Gaussian attenuation confines the response spatially and mitigates abrupt truncation. This combination reduces ringing artifacts, enhances robustness to weight initialization, and improves performance in real-world settings by effectively capturing both localized and high-frequency patterns.
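The real-valued Gabor form of Eq. (5) can be sketched as follows. The default values of $\omega_0$ and $s_0$ and the function name are assumptions chosen only to make the spatial decay visible; suitable values depend on the signal.

```python
import numpy as np

def wire_gabor(x, omega0=10.0, s0=5.0):
    """Gabor activation of Eq. (5): Gaussian envelope times cosine carrier."""
    return np.exp(-s0 * x**2) * np.cos(omega0 * x)

x = np.array([0.0, 0.5, 3.0])
y = wire_gabor(x)
# y[0] equals 1, while the response at x = 3 is suppressed by exp(-45).
```

In contrast to the sinc activation, whose tail decays only polynomially, the Gaussian envelope makes the response negligible a short distance from the origin.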

Recently, beyond categories (a)–(e), there have been attempts to leverage concepts from classical signal processing and to apply diverse activation functions from deep learning. For example, FLAIR [35] is designed under the theoretical constraint of the time–frequency uncertainty principle, so that temporal localization and frequency selectivity are learned jointly. Unlike the periodic-function-based approaches in (a) and (b), this method focuses on modeling only the essential hidden features required for representation, similar to (d) and (e). As a result, it achieves more efficient and sparse representations.

2.2. Positional Encodings for Frequency and Spatial Localization

While activation functions determine the network’s ability to model non-linear signals, the way input coordinates are encoded before entering the network plays an equally crucial role in shaping spectral and spatial inductive biases. Positional encodings aim to embed the input coordinates $\mathbf{x}$ into a higher-dimensional space $\gamma(\mathbf{x})$ to enrich the network’s capacity to capture high-frequency content and spatial structures.

(a) Basic Fourier encoding: The simplest form of positional encoding applies a single frequency sinusoid:

(6)
$ \gamma(x) = [\cos(2\pi x), \sin(2\pi x)]^\top. $

This encoding projects the input onto two oscillatory bases, enabling the model to incorporate periodicity. However, it lacks the ability to capture a broad range of frequencies, which limits its expressiveness in complex signal reconstruction.

(b) Fixed-frequency positional encoding: To address the limited frequency coverage of the Basic Fourier Encoding, a multiscale extension is often adopted:

(7)
$ \gamma(x) = \left[ x, \cos\left(\frac{2\pi\omega_j x}{m}\right), \sin\left(\frac{2\pi\omega_j x}{m}\right) \right]_{j=1}^m{}^\top, $

where $x$ denotes the input coordinate, $\omega_j$ is the $j$-th element of a predefined frequency set, $m$ is the embedding dimension, and $^\top$ indicates that the resulting vector is represented as a column vector. This encoding introduces a spectrum of fixed frequencies, thereby enhancing the model’s ability to represent both fine- and coarse-grained spatial details compared to the Basic Fourier Encoding. Such encodings have been widely adopted in tasks involving spatially dense predictions, such as NeRF [14] and image restoration [36].
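Eq. (7) can be sketched for scalar coordinates as follows. The frequency set $\omega_j \in \{1, 2, 4, 8\}$ and the function name are assumptions made for illustration, and the output groups all cosine terms before all sine terms rather than interleaving them, which only permutes the feature order.

```python
import numpy as np

def fixed_frequency_encoding(x, omegas):
    """Eq. (7) for scalar coordinates x of shape (N,); returns shape (N, 1 + 2m)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    omegas = np.asarray(omegas, dtype=float).reshape(1, -1)
    m = omegas.shape[1]
    angles = 2.0 * np.pi * omegas * x / m          # shape (N, m)
    return np.concatenate([x, np.cos(angles), np.sin(angles)], axis=1)

omegas = 2.0 ** np.arange(4)                       # assumed frequency set {1, 2, 4, 8}
gamma = fixed_frequency_encoding(np.linspace(0.0, 1.0, 5), omegas)   # shape (5, 9)
```

The embedding dimension grows as $1 + 2m$ per input coordinate, which is the source of the additional computation and memory cost of this encoding.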

(c) Random Fourier features (RFF): Random Fourier Features [37] further extend this idea by injecting stochasticity into the frequency components:

(8)
$ \gamma(x) = [\cos(2\pi Bx), \sin(2\pi Bx)]^\top, \quad B \sim \mathcal{N}(0, \omega^2), $

where $x$ denotes the input coordinate, $B$ is a Gaussian random matrix with entries sampled from a normal distribution $\mathcal{N}(0, \omega^2)$, $\omega$ controls the variance of frequency sampling, and $^\top$ indicates a column vector representation. This stochastic basis sampling provides a Monte Carlo approximation of shift-invariant kernels, thereby improving generalization to unseen signals. RFFs are particularly effective in applications requiring robustness and uncertainty modeling [37].
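The following sketch implements Eq. (8) and illustrates the kernel view: for Gaussian $B$, the averaged inner product of two encodings approximates the shift-invariant kernel $\exp(-2\pi^2\omega^2\|x - y\|^2)$. The number of features, the value of $\omega$, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def random_fourier_features(x, B):
    """Eq. (8): x has shape (N, d), B has shape (k, d); returns shape (N, 2k)."""
    proj = 2.0 * np.pi * x @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

def approx_kernel(x, y, B):
    """Monte Carlo kernel estimate: mean over features of cos(2*pi*B(x - y))."""
    k = B.shape[0]
    return random_fourier_features(x, B) @ random_fourier_features(y, B).T / k

rng = np.random.default_rng(0)
omega = 10.0                                   # assumed standard deviation of the frequencies
B = rng.normal(0.0, omega, size=(64, 2))       # entries drawn from N(0, omega^2)
x = rng.uniform(0.0, 1.0, size=(5, 2))
gamma = random_fourier_features(x, B)          # shape (5, 128)
```

Because $\cos^2 + \sin^2 = 1$ for every feature, each encoded point has squared norm exactly $k$, so the kernel estimate of a point with itself is always 1.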

(d) Wavelet positional encoding (WPE): While the previous encodings are global in nature, Wavelet Positional Encoding [38] introduces spatial locality by combining sinusoidal basis functions with Gaussian envelopes:

(9)
$ \gamma_L(p) = \left[ e^{-\frac{(p-w_c^i)^2}{2{w_s^i}^2}} \cos\left(\frac{p-w_c^i}{w_s^i}\right), e^{-\frac{(p-w_c^i)^2}{2{w_s^i}^2}} \sin\left(\frac{p-w_c^i}{w_s^i}\right) \right]_{i=1}^M{}^\top, $

where $p$ denotes the input coordinate, $w_c^i$ is the center of the $i$-th Gaussian window, $w_s^i$ is its scale parameter, and $M$ is the number of frequency pairs. Each pair of sine and cosine terms is modulated by a Gaussian envelope, producing localized frequency bases. This localization enables the model to capture spatially compact and high-frequency structures. The Gaussian attenuation further suppresses ringing artifacts and sharp transitions, making WPE especially suitable for real-world signals with non-uniform frequency distributions.
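A minimal sketch of Eq. (9) for scalar coordinates is shown below. The window centers, the shared scale of 0.1, and the function name are assumptions; in a real model these parameters would be chosen or learned for the signal at hand.

```python
import numpy as np

def wavelet_positional_encoding(p, centers, scales):
    """Eq. (9) for scalar coordinates p of shape (N,); returns shape (N, 2M)."""
    p = np.asarray(p, dtype=float).reshape(-1, 1)
    c = np.asarray(centers, dtype=float).reshape(1, -1)
    s = np.asarray(scales, dtype=float).reshape(1, -1)
    u = (p - c) / s                                # normalized offset from each center
    envelope = np.exp(-0.5 * u**2)                 # Gaussian window exp(-(p - c)^2 / (2 s^2))
    return np.concatenate([envelope * np.cos(u), envelope * np.sin(u)], axis=1)

centers = np.linspace(0.0, 1.0, 4)                 # assumed window centers
scales = np.full(4, 0.1)                           # assumed window scales
gamma = wavelet_positional_encoding([0.0, 0.5], centers, scales)     # shape (2, 8)
```

Each feature responds only near its own center: at $p = 0$ the window centered at 0 contributes a cosine term of 1, while the window centered at 1 is suppressed by a factor of $e^{-50}$.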

While different positional encoding schemes enrich the representation capacity of INRs, their computational and memory trade-offs also need to be considered. (a) Basic Fourier Encoding requires minimal computation and memory as it applies a single sinusoidal mapping, but it remains limited in expressiveness. (b) Fixed-Frequency Positional Encoding extends this by projecting inputs onto multiple predefined frequencies, which improves representation power but increases embedding dimensionality, leading to higher computational cost and memory usage. (c) Random Fourier Features (RFF) introduce stochastic sampling of frequency bases, enhancing generalization and robustness at the expense of additional overhead from random matrix multiplications. (d) Wavelet Positional Encoding (WPE) further incorporates Gaussian envelopes to provide spatial locality and compact high-frequency modeling, yet this requires more complex operations and additional storage. Overall, these encoding strategies illustrate a fundamental trade-off between representation expressiveness, computation time, and memory efficiency, which should be carefully balanced depending on the target application.

2.3. Fourier-based Reparameterization

Existing INR methods commonly suffer from the spectral bias problem [27], where networks tend to learn low-frequency components first while struggling to capture high-frequency details. A multilayer perceptron (MLP), which underlies most INR formulations, can be expressed as

(10)
$ \mathbf{y}^{(n)} = \sigma\left(\mathbf{W}^{(n)}\mathbf{y}^{(n-1)} + \mathbf{b}^{(n)}\right), $

where $\mathbf{y}^{(n-1)}$ denotes the output from the previous layer, $\mathbf{W}^{(n)}$ and $\mathbf{b}^{(n)}$ represent the learnable weight matrix and bias vector of the $n$-th layer, respectively, and $\sigma(\cdot)$ is the nonlinear activation function.

To mitigate spectral bias, two representative strategies have been studied. In the activation-based approach, the design of $\sigma(\cdot)$ is modified so that the network can better capture high-frequency signals. In the PE-based approach, the input $\mathbf{y}^{(n-1)}$ is transformed into a higher-dimensional representation, thereby reducing the low-frequency preference.

Beyond these strategies, Shi et al. [46] proposed a Fourier-based reparameterization of the weight matrix. Instead of learning $\mathbf{W}^{(n)}$ directly, it is expressed as the product of two components: a trainable coefficient matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{d_n \times M}$ and a fixed set of $M$ Fourier bases $\mathbf{B}^{(n)} \in \mathbb{R}^{M \times d_{n-1}}$. Each Fourier basis is defined by varying frequency $\omega$ and phase $\phi$ of a cosine function, such that the $(i, j)$-th element of $\mathbf{B}^{(n)}$ is

(11)
$ b_{ij} = \cos(\omega_i z_j + \phi_i), \quad i = 1,\dots,M; \ j = 1,\dots,d_{n-1}, $

where $\mathbf{z} = \{z_j\}_{j=1}^{d_{n-1}}$ denotes the sampling positions. The layer output then becomes

(12)
$ \mathbf{y}^{(n)} = \sigma\left(\mathbf{A}^{(n)}\mathbf{B}^{(n)}\mathbf{y}^{(n-1)} + \mathbf{b}^{(n)}\right). $

By constraining $\mathbf{W}^{(n)}$ to lie in the span of Fourier components, this formulation embeds frequency priors directly into the parameterization and offers an alternative way of alleviating spectral bias, complementing activation- and PE-based designs.
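Eqs. (11) and (12) can be sketched as follows. The sampling positions $z_j$ (uniform on $[-\pi, \pi]$), the frequencies and phases, the use of $\tanh$ as $\sigma$, and all names are assumptions made for illustration; they are not the settings of [46]. The deliberately small $M = 3$ only makes the structure of $\mathbf{W} = \mathbf{A}\mathbf{B}$ easy to inspect.

```python
import numpy as np

def fourier_bases(omegas, phis, d_in):
    """Eq. (11): B[i, j] = cos(omega_i * z_j + phi_i), returned with shape (M, d_in)."""
    z = np.linspace(-np.pi, np.pi, d_in)           # assumed sampling positions
    omegas = np.asarray(omegas, dtype=float)
    phis = np.asarray(phis, dtype=float)
    return np.cos(np.outer(omegas, z) + phis[:, None])

def reparam_layer(y_prev, A, B, b, sigma=np.tanh):
    """Eq. (12): the weight matrix is the product of trainable A and fixed B."""
    W = A @ B                                      # shape (d_out, d_in)
    return sigma(W @ y_prev + b)

rng = np.random.default_rng(0)
d_in, d_out, M = 8, 6, 3
B = fourier_bases([1.0, 2.0, 3.0], [0.0, np.pi / 4, np.pi / 2], d_in)   # fixed bases
A = rng.normal(size=(d_out, M))                    # trainable coefficients
y = reparam_layer(rng.normal(size=d_in), A, B, np.zeros(d_out))
```

During training only `A` and the bias would receive gradient updates, while `B` stays fixed.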

2.4. A Combined Strategy for Non-linear Compactness

Beyond the separate use of activation functions and positional encodings, recent advancements have led to combined strategies that unify both spectral and spatial inductive biases within a single functional design. TRIDENT [39] exemplifies this integrated approach by merging frequency-aware encoding, spatial localization, and non-linear transformation into a cohesive representation. Rather than treating sinusoidal encoding and activation as distinct components, TRIDENT incorporates both through a radial basis function (RBF)-like formulation with exponential non-linearity, enabling compact and expressive modeling within implicit neural representations (INRs).

At its core, TRIDENT encodes the input coordinate $x$ using a combination of sinusoidal basis functions modulated by a Gaussian envelope. The function is defined as:

(13)
$ \phi(x) = \exp\left( -s_0 \cdot \left[ x, \cos(2\pi\sigma^0 x), \sin(2\pi\sigma^0 x), \dots, \cos(2\pi\sigma^{(m-1)/m} x), \sin(2\pi\sigma^{(m-1)/m} x) \right]^2 \right), $

where $s_0$ is a scaling parameter controlling spatial concentration, and $\sigma$ determines the geometric progression of frequency components. This formulation jointly models local and global structures while maintaining numerical stability through soft spatial weighting.
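As a rough sketch of Eq. (13) for scalar coordinates, the code below builds the sinusoidal feature vector with frequencies $\sigma^{j/m}$ and applies the exponential element-wise. Reading the square in Eq. (13) as an element-wise square is an assumption, as are the default values of $m$, $\sigma$, and $s_0$ and the function name.

```python
import numpy as np

def trident(x, m=4, sigma=10.0, s0=1.0):
    """Eq. (13) for scalar coordinates x of shape (N,); returns shape (N, 1 + 2m)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    freqs = sigma ** (np.arange(m) / m)            # sigma^(j/m) for j = 0, ..., m-1
    angles = 2.0 * np.pi * x * freqs               # shape (N, m)
    feats = np.concatenate([x, np.cos(angles), np.sin(angles)], axis=1)
    return np.exp(-s0 * feats**2)                  # element-wise square, then exponential

phi = trident(np.array([0.0, 0.25]))               # shape (2, 9), values in (0, 1]
```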

The design of TRIDENT induces three core properties:

(a) Order compactness: By embedding sinusoidal terms within an exponential function, TRIDENT implicitly encodes high-order polynomial behaviors via power-series expansion. This allows rich structural details to be modeled without explicitly deep or wide networks.

(b) Frequency compactness: The inclusion of multiple harmonics, spaced in a log-linear manner, allows the model to efficiently capture low- and high-frequency components. The dual use of $\cos$ and $\sin$ further ensures balanced representation of even and odd frequency modes.

(c) Spatial compactness: The Gaussian envelope localizes the response of the function, concentrating representational energy within a compact region of the input space. This spatial attenuation mitigates ringing artifacts and improves generalization in local-detail-sensitive tasks.

Together, these characteristics form the non-linear trilogy of TRIDENT. As a combined strategy, it serves as an effective drop-in replacement for conventional PE + activation stacks, enhancing the expressiveness, compactness, and task generalization of architectures for INRs.

2.5. Implicit Neural Conditioning with Prior Knowledge Embeddings

Implicit Neural Conditioning (INCODE) proposes a conditional architecture for INRs that stabilizes training and enhances expressiveness by embedding prior knowledge into the network’s activation modulation. At the core of INCODE [40] lies a composer network, which replaces fixed sinusoidal activations with a generalized adaptive form:

(14)
$ \sigma(x) = a \sin(b\omega_0 x + c) + d, $

where the parameters $(a, b, c, d)$ are dynamically predicted by a harmonizer network, conditioned on a latent embedding extracted from a pre-trained feature encoder. This design is motivated by the observation that sinusoidal activations, while effective for capturing fine details, are highly sensitive to initialization and to task-dependent frequency priors. By leveraging pre-trained representations, such as a ResNet encoder, as a source of learned signal statistics, the harmonizer provides an informed initialization strategy that guides the activation’s shape and phase. This approach reduces the need for manual hyperparameter tuning and stabilizes convergence, particularly in early training. INCODE has shown robust performance across diverse modalities, including image, audio, and 3D scene domains, and demonstrates strong generalization in tasks such as super-resolution, inpainting, and denoising, with faster and more stable training dynamics than conventional INRs.
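The modulated activation of Eq. (14) can be sketched as follows. The `harmonizer` shown here is a hypothetical linear stand-in for the harmonizer network, and the latent vector replaces the output of a pre-trained encoder; neither reflects the actual INCODE architecture, and $\omega_0 = 30$ is an assumed value.

```python
import numpy as np

def incode_activation(x, a, b, c, d, omega0=30.0):
    """Eq. (14): amplitude a, frequency scale b, phase c, and offset d."""
    return a * np.sin(b * omega0 * x + c) + d

def harmonizer(latent, W, bias):
    """Hypothetical stand-in: a linear map from a latent vector to (a, b, c, d)."""
    a, b, c, d = W @ latent + bias
    return a, b, c, d

latent = np.ones(8)                                # placeholder for an encoder embedding
W = np.zeros((4, 8))                               # zero weights: output equals the bias
bias = np.array([1.0, 1.0, 0.0, 0.0])
a, b, c, d = harmonizer(latent, W, bias)
x = np.linspace(-1.0, 1.0, 5)
y = incode_activation(x, a, b, c, d)
```

With $(a, b, c, d) = (1, 1, 0, 0)$ the activation reduces to the fixed sinusoid $\sin(\omega_0 x)$, so the four parameters can be read as learned deviations from a SIREN-style activation.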

2.6. Function Decomposition with Learnable Operators

The Kolmogorov–Arnold Network (KAN) draws inspiration from the Kolmogorov–Arnold representation theorem, which states that any multivariate continuous function on a bounded domain can be written as a finite composition of continuous univariate functions and addition. KAN [41] operationalizes this idea by introducing learnable univariate functions along the network edges, replacing conventional scalar weights. Specifically, each edge implements a spline-based transformation $f_w(x)$, where $f_w$ is a parameterized function rather than a fixed weight. Neurons in a KAN simply perform summation without additional non-linearities, as the non-linearity is embedded in the edge functions themselves. This architecture grants precise control over inductive biases, enabling efficient approximation of complex, high-dimensional mappings with fewer parameters. KANs [42] have demonstrated superior performance and representation efficiency on tasks such as PDE solving, where they outperform traditional MLPs of similar or even larger size. Their compositional design not only alleviates the curse of dimensionality but also enhances interpretability, making them particularly well-suited for applications requiring structured generalization and robust approximation behavior [43].
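The edge-function idea can be sketched with a toy layer in which every edge carries its own learnable univariate function and every node only sums its inputs. For brevity, the sketch uses piecewise-linear interpolation over fixed knots as a simplified stand-in for the spline parameterization; the class name, knot range, and initialization are assumptions and do not reproduce the implementation of [41].

```python
import numpy as np

class PiecewiseLinearKANLayer:
    """Toy KAN-style layer: one learnable univariate function per edge, summed at each node."""

    def __init__(self, n_in, n_out, n_knots=8, x_min=-1.0, x_max=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.knots = np.linspace(x_min, x_max, n_knots)
        # Knot values per edge (output j, input i) play the role of learnable parameters.
        self.values = rng.normal(0.0, 0.1, size=(n_out, n_in, n_knots))

    def __call__(self, x):
        """Map x of shape (n_in,) to shape (n_out,); nodes apply no extra non-linearity."""
        n_out, n_in, _ = self.values.shape
        out = np.zeros(n_out)
        for j in range(n_out):
            for i in range(n_in):
                out[j] += np.interp(x[i], self.knots, self.values[j, i])
        return out

layer = PiecewiseLinearKANLayer(n_in=3, n_out=2)
y = layer(np.array([0.2, -0.5, 0.1]))              # shape (2,)
```

Setting every edge's knot values equal to the knot positions turns each edge into the identity on the knot range, so the layer then returns the sum of its inputs at every output node.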

3. CHALLENGES

Implicit Neural Representations (INRs) have demonstrated remarkable advantages, offering full differentiability, smoothness, compactness, and adaptability to arbitrary resolution in representing data. These properties mean that an INR models a signal as a continuous, differentiable function (enabling gradient-based optimization and integration into physics-inspired tasks), stores information efficiently in network weights rather than dense grids, and can be queried at any coordinate resolution. In principle, such traits allow INRs to capture fine-grained details without huge memory requirements and to generalize beyond fixed grids. However, realizing these ideals in practice presents significant open challenges. Key desired properties often conflict with one another, and current models of INRs face limitations that impact their performance and generalization capability on complex real-world signals.

3.1. Spectral Bias

One fundamental challenge is the spectral bias inherent in standard MLP-based INRs, which causes a predisposition toward learning low-frequency (smooth) components of signals at the expense of high-frequency details. Networks with conventional activations like ReLU or $\tanh$ struggle to faithfully represent signals with rich high-frequency content and fine details, instead favoring coarse approximations. This bias is problematic because many signals (textures in images, sharp edges, high-pitch variations in audio, etc.) contain critical high-frequency information. As a result, INRs with spectral bias tend to produce overly smooth reconstructions that miss small structures or rapid variations. The loss of detail directly degrades performance in tasks such as image super-resolution [1], detailed 3D shape modeling, or audio synthesis [44]. Moreover, this bias can hinder generalization: a model that only learns low-frequency structure may not adapt well when finer-scale patterns are required. Overcoming spectral bias is thus crucial for improving INRs’ accuracy and their ability to generalize to signals with diverse frequency content.

3.2. Mitigating Spectral Bias: Limitations and Trade-offs

Addressing this low-frequency bias and achieving faithful high-frequency representation remain largely unsolved problems. Recent research has made progress by introducing specialized activation functions and encoding schemes to expand the frequency response of INRs. For instance, sinusoidal activations (as in SIREN) and positional encoding with high-frequency Fourier features have been used to enable the network to learn more high-frequency content than a vanilla MLP. These approaches do mitigate the bias, allowing INRs to capture finer details than before. However, significant limitations persist. Even SIREN, which leverages a periodic activation, can struggle with very complex or higher-frequency details when those exceed the single-scale periodic basis it provides. Many enhanced INR models require carefully chosen hyperparameters or initialization schemes to balance frequency components, and they may still exhibit trade-offs between smoothness and detail. Increasing a network’s capacity for high frequencies (through deeper networks, Gabor wavelet activations, or extreme positional encodings) can introduce challenges such as higher computational cost and a risk of overfitting to noise. In practice, models designed to capture very fine details sometimes become overly sensitive to minor signal variations, which harms their generalization to new or noisier inputs. Thus, current solutions only partially address the high-frequency challenge, and designing INRs that robustly represent both low- and high-frequency content without adverse side effects remains an active area of research.

3.3. Balancing Compactness and Expressiveness

Another open challenge lies in balancing the compactness and expressiveness of INRs. A hallmark of implicit representations is their memory efficiency: complex signals are encoded in relatively few neural parameters, with memory scaling primarily according to model size rather than output resolution. This compactness is crucial for scalability to high-dimensional data [74]. However, there is an inherent tension between maintaining compactness and ensuring sufficient representational capacity to capture highly detailed or large-scale structures.

In theory, the memory usage of INRs grows with the complexity of the represented function rather than directly with output resolution, which suggests excellent scalability. In practice, however, representing extremely high-resolution or highly complex signals often requires larger networks or dense sampling during training, thereby diminishing the efficiency advantage. For example, to capture fine textures or geometric details, one may increase the network width or depth, or employ extensive multi-frequency encodings, which increases both the parameter count and the training cost. As a result, scalability remains a bottleneck: many INR methods struggle or become impractically slow when extended to ultra-high-resolution images or fine-grained 3D geometries.
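A back-of-the-envelope count illustrates the scaling argument. The sketch below compares the number of parameters of an assumed coordinate MLP (2 inputs, three hidden layers of width 256, 3 outputs) with the number of values stored by an RGB pixel grid; the architecture is hypothetical, and the comparison says nothing about whether such a network could actually fit a given image.

```python
def mlp_param_count(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

sizes = [2, 256, 256, 256, 3]          # assumed INR architecture: (x, y) -> (r, g, b)
n_params = mlp_param_count(sizes)      # 133123, independent of the query resolution
grid_1k = 1024 * 1024 * 3              # 3145728 stored values
grid_4k = 4096 * 4096 * 3              # 50331648 stored values
```

The grid grows by a factor of 16 between the two resolutions while the parameter count stays fixed; the practical difficulty described above is that fitting the finer signal may require enlarging `sizes`.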

The key challenge, therefore, is to retain a compact representation while scaling to these complexities. Recent approaches have explored spatially adaptive designs, such as local subnetworks and multi-grid structures [48], which allocate higher representational capacity only where needed. However, integrating such spatial adaptivity while maintaining full differentiability and architectural simplicity remains non-trivial.

3.4. Resolution Adaptability

The adaptability to resolution (or resolution-independence) of INRs is a double-edged sword. On one hand, because INRs are defined as continuous functions, a single model can be queried at any resolution, offering built-in super-resolution and continuous zoom capabilities. This is a major advantage over grid-based representations and contributes to the generalization of INRs beyond fixed discretizations.
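Querying at an arbitrary resolution amounts to evaluating the same function on a different coordinate grid, as the sketch below shows. Here `toy_inr` is a hypothetical closed-form stand-in for a trained network, and the grid sizes are arbitrary assumptions.

```python
import numpy as np

def make_grid(height, width):
    """Coordinates in [-1, 1]^2 with shape (height * width, 2), ordered as (x, y)."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=1)

def toy_inr(coords):
    """Stand-in for a trained INR: any function of continuous coordinates."""
    return np.sin(3.0 * coords[:, 0]) * np.cos(2.0 * coords[:, 1])

low = toy_inr(make_grid(16, 16)).reshape(16, 16)
high = toy_inr(make_grid(61, 61)).reshape(61, 61)   # 4x finer sampling of the same function
```

Because both grids sample one underlying function, every fourth sample of `high` coincides with `low`; no separate upsampling step is involved.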

On the other hand, guaranteeing consistent fidelity across different query resolutions is challenging. If INRs are trained on data at a certain scale or sampling density, querying them at a much finer resolution might reveal interpolation artifacts or missing high-frequency detail that the models never learned. The models might either over-smooth those upsampled regions (due to spectral bias) or, if forced to fit every training sample exactly, they could reproduce high-frequency noise or aliasing when queried off-grid. Current training regimes for INRs often implicitly assume a certain target resolution or distribution of sample points; using these models far outside this range can lead to degraded quality. Techniques like multi-scale supervision and anti-aliasing filters are being explored to ensure that the continuous representations remain faithful when extrapolating to new scales.

Yet, achieving truly resolution-robust representations with INRs is still an open issue. Recent advanced models (e.g., FINER networks with dynamic frequency scaling) explicitly attempt to adapt to varying levels of detail, yielding more stable results across scales. Nonetheless, a general solution for making INRs reliably handle arbitrary resolution queries without retraining or quality loss has not been fully established.

4. DATASETS AND EVALUATION: BASIC GUIDELINE

Implicit Neural Representations (INRs) have been widely adopted across various domains such as 2D image processing [24], 3D scene representation [14], and audio signal modeling [50]. Due to their flexible and continuous nature, the datasets and evaluation protocols for INRs are constructed to reflect the diverse task settings they support. This section summarizes representative datasets, evaluation metrics, and common loss functions used in these tasks.

4.1. Tasks and Representative Datasets

INRs are evaluated in several domains, categorized as follows:

  • 2D image tasks: Includes image reconstruction [24], single image super-resolution (SISR) [1], denoising [3], inpainting [5], and CT reconstruction [47].

  • 3D scene representation: Covers novel view synthesis with Neural Radiance Fields (NeRF) [14], 3D occupancy reconstruction [26], and signed distance function (SDF) [49] modeling for continuous implicit surface representation.

  • Audio signal representation: Includes tasks such as speech synthesis [50], audio super-resolution [51], and music representation.

Representative datasets for these domains are summarized in Table 1, highlighting widely used benchmarks across 2D images, 3D scenes, CT reconstruction, and audio signals.

Table 1. Comparison of representative datasets by domain.

| Domain | Task examples | Representative datasets |
| --- | --- | --- |
| 2D image | Super-resolution [1], inpainting [5], denoising [3], fitting [24] | DIV2K [53], [54], Kodak [55], Set5 [56] |
| CT reconstruction | Tomographic reconstruction [57] | Mayo Clinic TCIA [58] |
| 3D scene | Novel view synthesis [14], occupancy [26], signed distance function (SDF) [49] | NeRF [14], Stanford 3D Scanning Repository [59], LLFF [60] |
| Audio | Audio synthesis [50], super-resolution [51] | VCTK [61], LibriSpeech [62], NSynth [63] |

4.2. Evaluation Metrics

The evaluation metrics commonly employed in these domains are summarized in Table 2, covering perceptual measures for 2D images, IoU-based criteria for 3D occupancy, rendering quality metrics for NeRF-based view synthesis, and intelligibility scores for audio signals.

Table 2. Evaluation metrics by task type.

| Domain | Metrics | Notes |
| --- | --- | --- |
| 2D image | PSNR↑, SSIM↑ [64], LPIPS↓ [65] | Traditional and perceptual similarity |
| 3D occupancy | IoU↑, Chamfer Distance↓ [66] | Spatial reconstruction accuracy |
| View synthesis (NeRF) | PSNR↑, SSIM↑, LPIPS↓ | Rendered view quality |
| Audio | SI-SNR↑ [67], PESQ↑ [68], STOI↑ [69] | Perceptual and intelligibility metrics |
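Two of the simpler metrics in Table 2 can be computed directly. The sketch below gives standard definitions of PSNR and IoU; the function names, the assumption that intensities lie in $[0, 1]$, and the convention of returning 1 for two empty occupancy grids are choices made for illustration.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_value**2 / mse)

def iou(pred_occ, true_occ):
    """Intersection over union of two boolean occupancy grids; higher is better."""
    pred = np.asarray(pred_occ, dtype=bool)
    true = np.asarray(true_occ, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0                      # assumed convention when both grids are empty
    return np.logical_and(pred, true).sum() / union

score_db = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))   # MSE of 0.01 gives 20 dB
overlap = iou([1, 1, 0, 0], [1, 0, 1, 0])                 # 1 shared cell out of 3 occupied
```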

Fig. 2. Signal representation results brought by WIRE [33].


5. APPLICATIONS: BASIC GUIDELINE

5.1. Overview

INRs provide a powerful alternative to traditional representations across a wide range of tasks. Their ability to represent signals as continuous functions enables applications that require flexible resolution, memory efficiency, and smooth signal interpolation.

5.2. Detailed Examples

Signal representation: Implicit neural representations can serve as a flexible paradigm for modeling continuous signals across different domains. As illustrated in Fig. 2, one canonical task is 2D image fitting [24], where INRs directly regress pixel intensities of natural images, providing a fundamental benchmark to assess their ability to approximate continuous functions in low-level vision. Another important example is 3D occupancy prediction [26], where INRs capture volumetric occupancy fields that encode the geometry of objects, highlighting their capacity to model complex spatial structures in a compact and continuous form.

Image restoration: INRs can be applied to a wide range of low-level downstream tasks, as shown in Fig. 3. The top row illustrates image denoising [3], where INRs remove noise from natural images and reconstruct clean signals without relying on handcrafted filtering priors. The second row shows CT reconstruction [47] from limited projections, in which INRs model the underlying continuous attenuation field, enabling faithful reconstructions even from sparse-view measurements. The third row depicts $4\times$ super-resolution (SR), where INRs upscale low-resolution inputs to recover fine-grained details. Thanks to their capability for continuous representation, INRs naturally extend beyond fixed integer ratios and enable arbitrary-scale SR (e.g., $\times 4.7$, $\times 8.6$, or $\times 11.4$) [1], underscoring their flexibility compared to discrete upsampling approaches.

Fig. 3. Image restoration tasks using implicit neural representations (results from WIRE [33]).


Neural rendering: Neural Radiance Fields (NeRF) [14] have sparked extensive research activity in neural rendering. Activation functions developed for INRs have been adopted within NeRF’s MLP backbone, enabling richer function approximation than conventional ReLU activations. Notably, INR models with carefully designed non-linear activations can surpass the traditional ReLU + positional encoding (P.E.) [36] baseline, as demonstrated in Fig. 4. Furthermore, Fig. 5 shows that such activation-based INRs are more robust when the number of training views is reduced from 100 to as few as 25, significantly outperforming P.E.-based NeRF formulations under sparse supervision.
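To make the two design routes concrete, the sketch below contrasts an input-side frequency encoding with activation-side alternatives. It assumes the commonly cited forms: a NeRF-style encoding with frequencies $2^k\pi$, a sine activation $\sin(\omega_0 x)$ as in SIREN, and a complex Gabor wavelet $e^{j\omega_0 x}e^{-|s_0 x|^2}$ as in WIRE. The function names and the default values of `omega0` and `s0` are illustrative assumptions, not settings taken from the compared experiments.

```python
import cmath
import math


def positional_encoding(x, num_freqs):
    """Input-side route: lift a scalar to [sin(2^k pi x), cos(2^k pi x)] pairs."""
    out = []
    for k in range(num_freqs):
        out.append(math.sin((2 ** k) * math.pi * x))
        out.append(math.cos((2 ** k) * math.pi * x))
    return out


def relu(x):
    """Baseline activation, typically paired with positional encoding."""
    return max(0.0, x)


def siren(x, omega0=30.0):
    """Activation-side route: periodic sine activation."""
    return math.sin(omega0 * x)


def wire(x, omega0=10.0, s0=10.0):
    """Activation-side route: complex Gabor wavelet, localized in space and frequency."""
    return cmath.exp(1j * omega0 * x) * math.exp(-abs(s0 * x) ** 2)


# Compare the responses of each activation on a few pre-activation values.
for x in (-0.2, 0.0, 0.05, 0.2):
    print(f"x={x:+.2f}  relu={relu(x):.3f}  siren={siren(x):+.3f}  |wire|={abs(wire(x)):.3f}")
print("P.E. dimension for 10 frequencies:", len(positional_encoding(0.3, 10)))
```

In the first route the network input is expanded while the activation stays ReLU; in the second the raw coordinates are fed directly and the non-linearity itself carries the high-frequency capacity.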

Fig. 4. Qualitative and quantitative results on the Lego dataset. INR-based activations (e.g., WIRE, SIREN) outperform ReLU + positional encoding both visually and in PSNR (reported in WIRE [33]).


Fig. 5. Qualitative results with varying numbers of training views on the Drums dataset (results from WIRE [33]).


5.3. Discussion and Future Work

Owing to their inherent compactness, INRs can serve as competitive alternatives to transformer-based [70] or CNN-based [71] architectures in low-level vision tasks, while being directly trainable and deployable in real-world domains. Generalised Implicit Neural Representations [72] have already shown potential in handling non-Euclidean domains, such as generating texture signals via Gray-Scott reaction-diffusion simulations on the Stanford bunny mesh, modeling protein solvent-excluded surface graphs, and capturing social dynamics in the US election through county-level Facebook Social Connectedness networks. We expect future research to further exploit such generalization, paving the way for practical and domain-adaptive INR applications.

Furthermore, INR research should not remain limited to per-scene optimization but should expand toward multi-scene generalizable INRs, which would ensure broader applicability beyond synthetic benchmarks. In parallel, efficiency and compression perspectives open another promising direction. For instance, SINR [73] introduces a sparsity-driven compression framework that leverages high-dimensional dictionary-based sparse codes, achieving substantial reductions in storage requirements while preserving high-quality decoding across diverse modalities. Complementarily, the bit-plane decomposition approach [74] targets the digital precision bottleneck by predicting bit-planes, enabling lossless representation even for high bit-depth signals and facilitating faster convergence with constant model size. Together, these lines of research highlight the importance of advancing INR efficiency and compression to enable practical deployment across large-scale, real-world scenarios.
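The idea behind predicting bit-planes can be illustrated with plain integer arithmetic. The sketch below shows only the decomposition and its exact inverse, together with an assumed decoding rule in which per-plane probabilities are thresholded at 0.5; it is not the training procedure of [74], and the function names and the 16-bit example value are hypothetical.

```python
def to_bit_planes(value, bit_depth):
    """Split an unsigned integer into bit_depth binary digits, MSB first."""
    if not 0 <= value < (1 << bit_depth):
        raise ValueError("value does not fit in the given bit depth")
    return [(value >> k) & 1 for k in range(bit_depth - 1, -1, -1)]


def from_bit_planes(bits):
    """Recombine MSB-first binary digits into the original integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def decode_predictions(probs, threshold=0.5):
    """Assumed decoding rule: threshold per-plane probabilities, then recombine."""
    return from_bit_planes([1 if p > threshold else 0 for p in probs])


sample = 51234  # an arbitrary 16-bit intensity
bits = to_bit_planes(sample, 16)
# Imperfect predictions that still fall on the correct side of the threshold.
soft = [0.9 if b else 0.1 for b in bits]
print(bits)
print("exact recovery:", decode_predictions(soft) == sample)
```

Under this view, each plane is a binary target, so a prediction only needs to land on the correct side of the threshold for the recombined value to be bit-exact, regardless of the bit depth.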

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-23524035). This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Graduate School of Virtual Convergence support program (IITP-2024-RS-2024-00418847) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

References

1
Y. Chen, S. Liu, X. Wang, Learning continuous image representation with local implicit image function, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
2
H. Chang, Q. Ding, Face super-resolution via Restormer attention and feedback-enhanced facial prior integration, IEIE Transactions on Smart Processing & Computing, Vol. 14, No. 5, pp. 616-630, 2025
3
Z. Yan, Z. Liu, J. Li, Boosting of implicit neural representation-based image denoiser, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
4
Y. Yan, W. Ren, Y. Guo, R. Wang, X. Cao, Image deblurring via extreme channels prior, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4003-4011, 2017
5
M. Bertalmio, G. Sapiro, V. Caselles, C. Ballester, Image inpainting, Proceedings of ACM SIGGRAPH, pp. 417-424, 2000
6
S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems (NeurIPS), 2015
7
M. Allmamun, F. Akter, M. B. U. Talukdar, S. Chakraborty, J. Uddin, Drone detection and tracking using deep convolutional neural networks from real-time CCTV footage, IEIE Transactions on Smart Processing & Computing, Vol. 13, No. 4, pp. 313-321, 2024
8
J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
9
B. A. Lodhi, R. Ullah, S. Imran, M. Imran, B.-S. Kim, Sensenet: densely connected, fully convolutional network with bottleneck skip connection for image segmentation, IEIE Transactions on Smart Processing & Computing, Vol. 13, No. 4, pp. 328-336, 2024
10
Y. Zhou, Q. Ye, J. Qiu, Z. Li, J. Han, SceneGraphNet: neural message passing for 3D indoor scene augmentation, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019
11
J. L. Schönberger, J.-M. Frahm, Structure-from-motion revisited, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
12
C. Kwag, S. S. Hwang, Neural rendering survey targeted on speed, quality, 3D reconstruction, and editing, IEIE Transactions on Smart Processing & Computing, Vol. 14, No. 2, pp. 191-204, 2025
13
Q. Yu, Application of 3D scene reconstruction in sports public service based on pyramid LK optical flow method and RANSAC algorithm, IEIE Transactions on Smart Processing & Computing, Vol. 14, No. 4, pp. 457-470, 2025
14
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng, NeRF: representing scenes as neural radiance fields for view synthesis, Proceedings of the European Conference on Computer Vision (ECCV), 2020
15
L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, A. Geiger, Occupancy networks: learning 3D reconstruction in function space, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
16
S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 2016
17
C. Chen, A. Seff, A. Kornhauser, J. Xiao, Deep-driving: learning affordance for direct perception in autonomous driving, Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2722-2730, 2015
18
J. Carmigniani, B. Furht, M. Anisetti, P. Ceravolo, E. Damiani, M. Ivkovic, Augmented reality technologies, systems and applications, Multimedia Tools and Applications, Vol. 51, No. 1, pp. 341-377, 2011
19
D. Lin, Image recognition processing technology based on virtual reality technology and adaptive feature fusion, IEIE Transactions on Smart Processing & Computing, Vol. 14, No. 6, pp. 715-727, 2025
20
C. Boje, A. Guerriero, S. Kubicki, Y. Rezgui, Towards a semantic construction digital twin: directions for future research, Automation in Construction, Vol. 114, pp. 103179, 2020
21
C. B. Choy, D. Xu, J. Gwak, K. Chen, S. Savarese, 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction, Proceedings of the European Conference on Computer Vision (ECCV), pp. 628-644, 2016
22
L. Liu, J. Gu, K. Z. Lin, T.-S. Chua, C. Theobalt, Neural sparse voxel fields, Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 15651-15663, 2020
23
S. A. Bello, S. Yu, C. Wang, J. M. Adam, J. Li, Deep learning on 3D point clouds, Remote Sensing, Vol. 12, No. 11, pp. 1729, 2020
24
D. Ulyanov, A. Vedaldi, V. Lempitsky, Deep image prior, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9446-9454, 2018
25
K. Su, M. Chen, E. Shlizerman, INRAS: implicit neural representation for audio scenes, Advances in Neural Information Processing Systems (NeurIPS), Vol. 35, pp. 8144-8158, 2022
26
L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, A. Geiger, Occupancy networks: learning 3D reconstruction in function space, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4460-4470, 2019
27
N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, A. Courville, On the spectral bias of neural networks, Proceedings of the International Conference on Machine Learning (ICML), pp. 5301-5310, 2019
28
A. F. Agarap, Deep learning using rectified linear units (ReLU), arXiv preprint arXiv:1803.08375, 2018
29
V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, G. Wetzstein, Implicit neural representations with periodic activation functions, Advances in Neural Information Processing Systems (NeurIPS), 2020
30
H. Saratchandran, S. Ramasinghe, V. Shevchenko, A. Long, S. Lucey, A sampling theory perspective on activations for implicit neural representations, arXiv preprint arXiv:2402.05427, 2024
31
D. Serrano, J. Szymkowiak, P. Musialski, HOSC: a periodic activation function for preserving sharp features in implicit neural representations, arXiv preprint arXiv:2401.10967, 2024
32
Z. Liu, H. Zhu, Q. Zhang, J. Fu, W. Deng, Z. Ma, Y. Guo, X. Cao, FINER: flexible spectral-bias tuning in implicit neural representation by variable-periodic activation functions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
33
V. Saragadam, D. LeJeune, J. Tan, G. Balakrishnan, A. Veeraraghavan, R. G. Baraniuk, WIRE: wavelet implicit neural representations, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18507-18516, 2023
34
H. Zhu, Z. Liu, FINER++: building a family of variable-periodic functions for activating implicit neural representation, arXiv preprint arXiv:2407.19434, 2024
35
S. Ko, D. Kye, K. Min, C. Eom, J. Oh, FLAIR: frequency- and locality-aware implicit neural representations, arXiv preprint arXiv:2508.13544, 2025
36
M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, R. Ng, Fourier features let networks learn high frequency functions in low dimensional domains, Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 7537-7547, 2020
37
A. Rahimi, B. Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems (NeurIPS), 2007
38
H. Zhao, Z. Gao, Y. Wang, R. Xiong, Y. Zhang, Adaptive wavelet-positional encoding for high-frequency information learning in implicit neural representation, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 39, No. 10, pp. 10430-10438, 2025
39
Z. Shen, Y. Cheng, R. H. Chan, Trident: the non-linear trilogy for implicit neural representations, arXiv preprint arXiv:2311.13610, 2023
40
A. Kazerouni, R. Azad, A. Hosseini, D. Merhof, U. Bagci, INCODE: implicit neural conditioning with prior knowledge embeddings, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2024
41
Z. Liu, Y. Wang, S. Vaidya, KAN: Kolmogorov-Arnold networks, arXiv preprint arXiv:2404.19756, 2024
42
M. Heidari, R. Rezaeian, R. Azad, D. Merhof, H. Soltanian-Zadeh, I. Hacihaliloglu, SL2A-INR: single-layer learnable activation for implicit neural representation, arXiv preprint arXiv:2409.10836, 2024
43
Z. Liu, P. Ma, KAN 2.0: Kolmogorov-Arnold networks meet science, arXiv preprint arXiv:2408.10205, 2024
44
J. Zuiderveld, M. Federici, E. Bekkers, Towards lightweight controllable audio synthesis with conditional implicit neural representations, Advances in Neural Information Processing Systems (NeurIPS), 2021
45
W. K. Han, B. Lee, H. Cho, S. Im, K. H. Jin, Towards lossless implicit neural representation via bit plane decomposition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
46
K. Shi, X. Zhou, S. Gu, Improved implicit neural representation with Fourier reparameterized training, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 25985-25994, 2024
47
A. W. Reed, H. Kim, R. Anirudh, K. A. Mohan, K. Champley, J. Kang, S. Jayasuriya, Dynamic CT reconstruction from limited views with implicit neural representations and parametric motion fields, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2258-2268, 2021
48
Y. Jiang, H. M. Kwan, T. Peng, G. Gao, F. Zhang, X. Zhu, J. Sole, D. Bull, HIIF: hierarchical encoding based implicit image function for continuous super-resolution, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2289-2299, 2025
49
J. J. Park, P. Florence, J. Straub, R. Newcombe, S. Lovegrove, DeepSDF: learning continuous signed distance functions for shape representation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 165-174, 2019
50
Y. Liu, C. Jin, ICGAN: an implicit conditioning method for interpretable feature control of neural audio synthesis, arXiv preprint arXiv:2406.07131, 2024
51
V. Kuleshov, S. Z. Enam, S. Ermon, Audio super resolution using neural networks, arXiv preprint arXiv:1708.00853, 2017
52
J. Lee, J. Tack, N. Lee, J. Shin, Meta-learning sparse implicit neural representations, Advances in Neural Information Processing Systems (NeurIPS), 2021
53
E. Agustsson, R. Timofte, NTIRE 2017 challenge on single image super-resolution: dataset and study, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017
54
R. Timofte, S. Gu, J. Wu, L. Van Gool, L. Zhang, M.-H. Yang, M. Haris, NTIRE 2018 challenge on single image super-resolution: methods and results, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018
55
E. Kodak, Kodak lossless true color image suite, 1999
56
M. Bevilacqua, A. Roumy, C. Guillemot, M. L. Alberi-Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding, Proceedings of the British Machine Vision Conference (BMVC), 2012
57
G. Wang, J. C. Ye, B. De Man, Deep learning for tomographic image reconstruction, Nature Machine Intelligence, Vol. 2, No. 12, pp. 737-748, 2020
58
T. R. Moen, B. Chen, D. R. Holmes, X. Duan, Z. Yu, L. Yu, S. Leng, J. G. Fletcher, C. H. McCollough, Low-dose CT image and projection dataset, Medical Physics, Vol. 48, No. 2, pp. 902-911, 2021
59
The Stanford 3D Scanning Repository, Stanford University
60
B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, A. Kar, Local light field fusion: practical view synthesis with prescriptive sampling guidelines, ACM Transactions on Graphics (TOG), Vol. 38, No. 4, pp. 1-14, 2019
61
J. Yamagishi, C. Veaux, K. MacDonald, CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92), 2019
62
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, LibriSpeech: an ASR corpus based on public domain audio books, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, 2015
63
J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, K. Simonyan, Neural audio synthesis of musical notes with WaveNet autoencoders, Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1068-1077, 2017
64
Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 600-612, 2004
65
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586-595, 2018
66
H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, H. C. Wolf, Parametric correspondence and chamfer matching: two new techniques for image matching, Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1977
67
J. Le Roux, S. Wisdom, H. Erdogan, J. R. Hershey, SDR-half-baked or well done?, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626-630, 2019
68
Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001
69
C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, A short-time objective intelligibility measure for time-frequency weighted noisy speech, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4214-4217, 2010
70
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, An image is worth 16x16 words: transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, 2020
71
A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (NeurIPS), Vol. 25, 2012
72
D. Grattarola, P. Vandergheynst, Generalised implicit neural representations, Advances in Neural Information Processing Systems (NeurIPS), 2022
73
D. Jayasundara, S. Rajagopalan, Y. Ranasinghe, T. D. Tran, V. M. Patel, SINR: sparsity driven compressed implicit neural representations, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
74
W. K. Han, B. Lee, H. Cho, S. Im, K. H. Jin, Towards lossless implicit neural representation via bit plane decomposition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Sukhun Ko

Sukhun Ko received his B.S. degree in big data convergence and began pursuing an M.S. degree in imaging science at the Graduate School of Advanced Imaging Science, Multimedia & Film (GSAIM), Chung-Ang University, Seoul, South Korea, in March 2025. His research interests include low-level vision tasks such as image and video super-resolution, video frame interpolation, and implicit neural representations. He is currently a member of the Creative Vision and Multimedia Lab (CMLab, https://cmlab.cau.ac.kr/) at Chung-Ang University.

Chanho Eom

Chanho Eom is an assistant professor at the Graduate School of Advanced Imaging Science, Multimedia & Film (GSAIM), Chung-Ang University, Seoul, Korea. He received his B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University in 2017 and 2023, respectively. He previously worked as a researcher at the Samsung Advanced Institute of Technology (SAIT). His research interests include computer vision and deep learning, particularly retrieval, person re-identification, and video analysis, both in theory and applications.

Jihyong Oh

Jihyong Oh is an assistant professor in the Department of Imaging Science at the Graduate School of Advanced Imaging Science, Multimedia & Film (GSAIM), Chung-Ang University (CAU; Seoul, South Korea), and has led the Creative Vision and Multimedia Lab (https://cmlab.cau.ac.kr/) since September 2023. He was previously a postdoctoral researcher at VICLAB, KAIST. He received the B.E., M.E., and Ph.D. degrees in Electrical Engineering from KAIST in 2017, 2019, and 2023, respectively, and was a research intern at Meta Reality Labs in 2022. His research focuses on low-level vision, image/video restoration, 3D vision, and generative AI. He has published at CVPR, ICCV, ECCV, AAAI, TCSVT, and Remote Sensing, and serves as a reviewer for CVPR, ICCV, ECCV, SIGGRAPH, IEEE TPAMI, TIP, TGRS, and Access. He received Outstanding Reviewer Awards for ICCV 2021 and CVPR 2024.