1. Introduction
Cultural heritage is categorized into tangible and intangible cultural heritage. Tangible
cultural heritage is what we traditionally refer to as cultural heritage, that is,
cultural artifacts of historical, artistic and scientific value [1]. Intangible cultural heritage, on the other hand, refers to traditional cultural
expressions created through the participation and innovation of the people, including
traditional folk performing arts [2], folk rituals and festivals, traditional knowledge of nature and the universe and its transmission, and traditional handicrafts. Intangible cultural heritage
is a pearl nurtured in the long history of the peoples of the world [3], a bond and carrier of national emotions, and a spiritual home that has remained
unchanged for millennia as human civilization traces its roots [4]. In the context of globalization and modernization, ICH has gradually lost its original
soil and social environment [5]. The industrial revolution and the rapid development of urbanization have changed
people's production, life and way of thinking, making the living environment of folk
culture deteriorate and the cultural space be marginalized [6]. Traditional folk crafts, because they are transmitted by word of mouth and handed
down from generation to generation by the old generation of artists in a specific
region, are subject to the limitations of time, space and territory and are at risk
of being lost [7]. How to preserve and disseminate this cultural heritage has become a topic of discussion
among scholars [8].
Prior to the 1990s [9], the preservation of intangible cultural heritage in China relied on traditional
and inefficient methods such as photographs, interviews, and audio recordings, and
the precious heritage thus preserved was damaged to varying degrees by the erosion
of time and improper storage. Today, digital technology provides a brand-new means
of safeguarding intangible cultural heritage; in particular, the Internet can disseminate
ICH widely. Digitization technologies turn audiences from mere spectators into interactive
participants, for example through animations and multimedia digital imagery [10].
In the 1990s, UNESCO launched the Memory of the World (MOW) project with the aim of
preserving and accessing cultural heritage and raising awareness of cultural heritage
in all countries [11]. For a while, "digitization" became a buzzword in cultural heritage preservation.
The National Endowment for the Humanities and the University of Oxford in the United
Kingdom are collaborating on a project to digitize works from Shakespeare's era [12], proposing to consolidate Shakespearean materials from the United Kingdom, the United
States, and elsewhere around the globe up to the year 2000, and to develop a user
interface that allows users to query these databases for relatively in-depth comparison
and research. The Visual Media Center at Columbia University focuses on the potential
of new media to facilitate the interpretation and preservation of the built environment [13].
Hui-Jeong Han et al. explored ways of digitally managing and utilizing NRLs in Korea,
analyzed the current state of their digital archiving in depth, and proposed policies
grounded in cultural governance and normative management [14].
In recent years, digital technology has played an essential role in cultural heritage
preservation, offering a variety of methods to digitize and safeguard both tangible
and intangible cultural assets. Earlier digital preservation methods primarily involved
2D imaging techniques, such as high-resolution photography and photogrammetry, which,
while effective in capturing surface details, often fell short in representing the
three-dimensional intricacies of artifacts [15]. Techniques like Structured Light Scanning (SLS) and Laser Scanning have since been
employed to generate more accurate 3D models, providing a fuller representation of
the physical characteristics of heritage objects. However, these methods are typically
resource-intensive, requiring specialized equipment and significant post-processing,
which limits their accessibility and scalability, especially in resource-constrained
environments [16]. Recent advances in 3D reconstruction, such as Multi-View Stereo (MVS) and volumetric
capture, have improved the fidelity and usability of digital models. These methods
allow for the creation of detailed 3D models from multiple 2D images but often require
extensive computational resources and complex workflows [17]. Additionally, voxel-based approaches have been explored for their ability to capture
volumetric data, but they frequently struggle with maintaining high resolution and
texture detail in complex scenes [18]. Neural Radiance Fields (NeRF) technology has emerged as a promising alternative,
addressing many of the limitations of previous methods. NeRF is a scene representation
method via neural networks, capable of generating high-quality 3D models from a limited
set of 2D images [19]. The technique utilizes neural networks to learn the implicit representation of a
scene, allowing for realistic reproduction of light propagation and fine details [20]. This gives NeRF a significant advantage in the 3D reconstruction of complex scenes,
especially when dealing with the nuanced textures and lighting conditions characteristic
of intangible cultural heritage (ICH). Despite its potential, there is a notable gap
in research specifically applying NeRF technology to the reproduction of animated
scenes of ICH. This paper aims to fill this gap by proposing a NeRF-based system for
ICH animation scene reproduction [21].
2. Neural Radiance Fields Synthesis Algorithm
NeRF synthesizes virtual point-of-view images by ray casting and volume rendering.
For each pixel of the input image, a ray is cast from the camera and densely sampled
along its length to optimize a large MLP, and every parameter of this MLP must be
updated many times during training. The resulting computational complexity of training
and rendering is high, which makes it difficult to satisfy the requirements of interactive
visualization. To achieve real-time, high-quality virtual point-of-view synthesis,
the implicit neural representation must be both fast and of high quality.
To address these problems, a fast NeRF viewpoint synthesis algorithm combining hash
coding and a feature texture mesh is proposed. The method first accelerates the training
of the implicit neural representation through a hash coding network, reduces the number
of redundant trainable parameters in the MLP, and achieves fast construction of the
implicit neural representation by trading a modest amount of additional memory for
a large reduction in computational cost.
2.1. Algorithm Overview
The fast NeRF viewpoint synthesis algorithm based on a feature texture mesh proposed
in this paper constructs a fast implicit representation of the 3D scene through a
single-resolution hash coding module, a continuous scene representation training module,
and a joint optimization and rendering module, and simultaneously generates the 3D
scene model and the feature texture maps in order to realize real-time virtual viewpoint
image synthesis, as shown in Fig. 1. The process begins with the input of 2D images and corresponding camera parameters.
These inputs undergo preprocessing, including resizing and depth map prediction. Next,
hash coding is applied to optimize feature vectors, which are then fed into the Multi-Layer
Perceptron (MLP) for training. Finally, the trained model renders new views from arbitrary
viewpoints.
In the single-resolution hash coding module, a feature texture grid model G must first
be predefined to initialize the implicit scene representation. Unlike NeRF, which samples
points along rays, the algorithm computes the intersection point $x \in \mathbb{R}^3$
of each camera ray with the initialized grid, together with the vertex coordinates
$v \in \mathbb{R}^3$ of the grid cell containing that intersection, as the spatial
input to the subsequent hash coding step. The spatial positions are mapped by the
spatial hash function h(v) to obtain the encoded feature embedding vector $\hat{F}$.
In the continuous scene representation training module, the feature embedding vector
$\hat{F}$ output by the hash coding network is passed through three small MLPs that
output, for each ray sampling point, the transparency $\alpha$, the texture feature
vector f, and the color value c. Alpha compositing, rather than full volume rendering,
is then used to synthesize the color value $\widehat{C}$ of each camera ray, and the
network is supervised by the MSE loss between the rendered color $\widehat{C}$ and
the ground-truth color $C_{gt}$.
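To make the alpha compositing step described above concrete, the following is a minimal PyTorch-style sketch of combining per-intersection opacities and colors along each ray into a single pixel color; the function name and tensor shapes are illustrative and not taken from the paper's implementation.

```python
import torch

def alpha_composite(alphas: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of the samples along each ray.

    alphas: (N, S) opacity of each of S mesh intersections per ray, in [0, 1].
    colors: (N, S, 3) RGB value predicted for each intersection.
    Returns: (N, 3) composited ray colors C_hat.
    """
    # Transmittance before each sample: product of (1 - alpha) over all earlier samples.
    one_minus = 1.0 - alphas
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), one_minus[:, :-1]], dim=1), dim=1
    )
    weights = transmittance * alphas              # contribution of each intersection
    return (weights.unsqueeze(-1) * colors).sum(dim=1)
```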
In the joint optimization and rendering module, the continuous transparency value
$\alpha$ obtained from the MLP network is binarized to obtain the discrete transparency
$\alpha_b$, the discrete values are back-propagated using the straight-through estimator
(STE), and the continuous and discrete models are then co-trained stably to achieve
correct rendering of the semi-transparent feature texture mesh. In the rendering stage,
the feature texture mesh G is extracted as an explicit discrete mesh model, and the
texture mapping of the model is computed in real time by a color network. The design
of each module, as well as the loss functions used to supervise the neural network
optimization, is described in detail below.
Fig. 1. Schematic diagram of the algorithm.
2.2. Single Resolution Hash Coding Module
A new parametric encoding assists the training of the MLP before spatial locations
are fed into the neural network for inference, in order to reduce the size of the
MLP and thus the training time of the implicit neural representation. This encoding
allows the implicit neural representation to be trained with a smaller MLP than previous
methods without sacrificing the quality of virtual point-of-view image synthesis,
thereby reducing the number of network parameter updates required as the sampled points
on each ray pass through the MLP.
The method relies on a sparse grid structure containing trainable feature vectors,
so a polygonal grid G storing texture features must first be predefined, with each
vertex of the grid storing a feature embedding vector F. The grid is initialized by
defining a regular grid of size $128 \times 128 \times 128$ in a unit cube centered
at the origin; a vertex v is created within each cell, and each edge of the grid is
used to create a quadrilateral face connecting four neighboring grid vertices. Grid
vertices correspond one-to-one to feature vectors through a hash table and are queried
with the spatial hash function h(v). The feature embedding vectors F are optimized
by stochastic gradient descent jointly with the network parameters of the MLP. A purely
MLP-based representation is optimized globally and exhibits a smooth inductive bias,
but its optimization and evaluation are computationally expensive; in contrast, the
feature-texture-mesh-based representation is updated locally during optimization and
evaluation, expresses local detail well, and is computationally more efficient. The
key to this process is the sparse grid structure of trainable feature vectors.
The pseudocode of the hash coding process is as follows:
Initialize grid G with size $128 \times 128 \times 128$
For each vertex v in G:
Create a feature embedding vector F
Store F in the hash table using spatial hash function h(v)
For each ray r(t) = o + td emitted from the camera:
Compute intersection point x of r with grid G
Obtain 3D coordinates v of the vertex where x is located
Encode v using h(v) to obtain feature embedding vector F
Perform trilinear interpolation on F to obtain interpolated feature vector $\hat{F}$
Pass $\hat{F}$ to MLP for further processing
Optimize F and MLP parameters using stochastic gradient descent
The input to the hash coding network differs from the spatial points sampled along
rays in NeRF: because a predefined feature texture mesh is used, the input is the
intersection of each ray with that mesh. First, a ray r(t) = o + td is emitted from
the camera position and passes through the mesh G. The intersection point x of the
ray r with the predefined mesh is computed, which also yields the interior intersection
points of the ray with the mesh, and the 3D coordinates v of the vertices of the cell
containing the intersection are hash-encoded to obtain the feature embedding vectors
F = enc(v; $\phi$), where $\phi$ denotes the trainable parameters of the feature-textured
mesh. Trilinear interpolation is then applied to obtain the feature embedding vector
of the intersection point itself. The feature embedding vectors are queried through
the spatial hash function on a hash table containing T feature embedding vectors of
dimension F, so the number of trainable parameters $\theta$ is T $\times$ F. The hash
coding process is shown in Fig. 2.
Fig. 2. Diagram of the hash coding process.
The hash coding process represented in Fig. 2 is as follows.
For a given input 3D coordinate x, the grid cell containing it is first located, and
the vertex positions v of that cell are obtained.
A trainable feature embedding vector F is stored at each vertex of the mesh, so each
vertex position v must be mapped to its feature embedding vector in the hash table
via a spatial hash function h(v):

$h(v) = \left( \bigoplus_{i=1}^{d} v_i \pi_i \right) \bmod T$

where $\oplus$ denotes the bitwise XOR operation, d is the dimension of the input
vector, $v_i$ denotes each dimension of the input three-dimensional position, each
$\pi_i$ is a unique large prime number, mod denotes the remainder operation, and T
is the number of feature vectors in the hash table. The formula indicates that the
three-dimensional coordinates of the point are used as the key: each coordinate dimension
is multiplied by $\pi_i$, the results are combined by bitwise XOR, and the remainder
with respect to T gives the index of the corresponding value in the hash table, i.e.,
the feature embedding vector F. In the experiments, the number of vertices in the
mesh is smaller than T, so the mapping between vertices and feature vectors is 1 : 1,
which alleviates the effect of hash collisions.
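As an illustration of this lookup, the following Python sketch implements a spatial hash of the form described above; the specific prime constants and table size are borrowed from common hash-grid implementations and are assumptions, since the paper does not list them.

```python
import numpy as np

# Large primes used for hashing; the first is conventionally 1 so the x-dimension is
# passed through unchanged (an assumption, not a value given in the paper).
PRIMES = (1, 2_654_435_761, 805_459_861)

def spatial_hash(v: np.ndarray, table_size: int) -> np.ndarray:
    """h(v) = (XOR_i v_i * pi_i) mod T for integer vertex coordinates v of shape (..., 3)."""
    v = v.astype(np.uint64)
    h = v[..., 0] * np.uint64(PRIMES[0])
    for i in range(1, v.shape[-1]):
        h ^= v[..., i] * np.uint64(PRIMES[i])
    return h % np.uint64(table_size)

# Example: look up the feature embedding of one grid vertex in a table of T vectors.
T, F_DIM = 2**19, 2                                   # illustrative sizes
feature_table = np.random.randn(T, F_DIM).astype(np.float32)
idx = spatial_hash(np.array([17, 42, 93]), T)
feature = feature_table[idx]
```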
Then, according to the position of x within its grid cell, trilinear feature interpolation
is performed. Trilinear interpolation is an interpolation method carried out in 3D
space: after the feature vector $F_i$ of each vertex of the cell has been obtained,
the feature vector of the spatial coordinate x is computed by trilinear interpolation.
The resulting interpolated feature vector $\hat{F} \in \mathbb{R}^F$ is the input to
the MLP. With single-resolution hash coding, only the feature embedding vectors of
the cell containing the input coordinates need to be updated during each optimization
step of the feature mesh, and only a small number of weights and biases of the shallow
MLPs need to be updated in the subsequent optimization. In NeRF, by contrast, every
gradient back-propagated through the network requires updating the weights and biases
of every layer and every channel of the MLP for each sample point, which is computationally
intensive and leads to long training times.
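The trilinear interpolation step can be sketched as follows; the array layout and function signature are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np

def trilinear_interpolate(x: np.ndarray, cell_min: np.ndarray, cell_size: float,
                          corner_features: np.ndarray) -> np.ndarray:
    """Trilinearly interpolate the 8 corner feature vectors of the grid cell containing x.

    x:               (3,) query point inside the cell.
    cell_min:        (3,) coordinates of the cell's minimum corner.
    cell_size:       edge length of the cubic cell.
    corner_features: (2, 2, 2, F) features at the 8 corners, indexed by (z, y, x) offsets.
    Returns the interpolated feature vector of shape (F,).
    """
    # Normalized position of x inside the cell, each component in [0, 1].
    tx, ty, tz = (x - cell_min) / cell_size
    # Interpolate along x, then y, then z.
    c00 = corner_features[0, 0, 0] * (1 - tx) + corner_features[0, 0, 1] * tx
    c01 = corner_features[0, 1, 0] * (1 - tx) + corner_features[0, 1, 1] * tx
    c10 = corner_features[1, 0, 0] * (1 - tx) + corner_features[1, 0, 1] * tx
    c11 = corner_features[1, 1, 0] * (1 - tx) + corner_features[1, 1, 1] * tx
    c0 = c00 * (1 - ty) + c01 * ty
    c1 = c10 * (1 - ty) + c11 * ty
    return c0 * (1 - tz) + c1 * tz
```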
2.3. Joint Optimization and Rendering Module
In this module, the continuous transparency $\alpha$ produced by the transparency
MLP must be binarized into the discrete transparency $\alpha_b$. When rasterizing
the feature texture mesh, semi-transparent faces would have to be sorted by depth
and rendered in order to guarantee correct alpha compositing, and general hardware-based
rendering pipelines do not support rendering semi-transparent meshes; the continuous
alpha opacities generated by the MLP are therefore discretized to achieve correct
rendering of the polygon mesh.
After obtaining the binarized transparency, the continuous and discrete models are
jointly trained to ensure that the network performs correct alpha compositing for
the mesh; the joint optimization process of the continuous and discrete models is
shown in Fig. 3.
Fig. 3. Joint optimization process of discrete and continuous models.
In the process shown in Fig. 3, the intersection point of the ray and the mesh is mapped by hash coding to an
interpolated feature embedding vector $\hat{F}$, which is passed through the transparency-domain
MLP and the feature-domain MLP to obtain the continuous transparency value $\alpha$
and the continuous texture feature f. The continuous transparency is binarized to
obtain the discrete transparency $\alpha_b$, with a corresponding discrete texture
feature $f_b$. The continuous and discrete texture features are passed through the
color MLP to obtain the continuous color value c and the discrete color value $c_b$,
and alpha compositing then yields the continuous rendered color $\widehat{C}(r)$ and
the discrete rendered color $\widehat{C}_b(r)$. The loss of each against the ground-truth
color is computed separately, and the final loss of the joint optimization model is
the sum of the discrete-model loss and the continuous-model loss, thereby realizing
the joint optimization of the continuous and discrete models.
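A minimal PyTorch sketch of the straight-through binarization used to couple the two branches is shown below; the 0.5 threshold is an assumption, since the paper does not state the binarization rule explicitly.

```python
import torch

def binarize_ste(alpha: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binarize continuous opacities while letting gradients pass straight through.

    Forward pass: hard 0/1 values (the threshold is an assumption, not given in the paper).
    Backward pass: gradient of the identity, so the continuous network keeps learning.
    """
    hard = (alpha > threshold).float()
    return alpha + (hard - alpha).detach()

# During joint optimization both the continuous alpha and binarize_ste(alpha) branches
# are rendered and supervised, and their losses are summed.
```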
In the rendering stage, the jointly optimized polygonal feature texture mesh is stored
in OBJ format, and the quadrilateral faces in the mesh are selectively retained according
to whether they are visible. When the camera rays pass through the feature texture
mesh, the weight w of each intersection point, i.e., of each quadrilateral face, is
computed, and for each ray only the face with the largest weight is kept. By retaining
only these maximum-weight quadrilateral faces, together with the corresponding faces
in the UV map, the visible quadrilateral faces are preserved, and texture feature
UV maps are stored only for polygons visible in the input viewpoint images. The mesh
represents the extent of the scene at the points where the rays intersect it (i.e.,
the interior of the mesh), and a mesh size of $128 \times 128 \times 128$ is sufficient
to cover the scene.
After that, the pixel coordinates in the model's 2D texture material are converted
to 3D coordinates, and the discrete opacity $\alpha_b$ and texture feature values
$f_b$ corresponding to those 3D coordinates are baked into the texture UV map by iterating
vertex by vertex over the quadrilateral faces of the feature texture mesh; the result
is stored in a losslessly compressed PNG file, as shown in Fig. 4.
Fig. 4. Extraction of texture maps.
The UV texture maps of the model ultimately store the transparency $\alpha_b$ and
texture features $f_b$ generated by the MLPs. Each vertex of every retained quadrilateral
face is therefore fed into the trained MLPs to generate the corresponding transparency
and texture feature vector, with an 8-dimensional feature vector output per vertex.
The texture feature values, which lie in the range [0, 1], are quantized to [0, 255],
and the 8-channel texture feature vector is stored as two 4-channel RGB$\alpha$ texture
material maps. At render time, the texture features read from these maps and the
line-of-sight direction are fed into the color network $\mathcal{H}$ for real-time
color mapping computation, which ultimately achieves real-time rendering of the mesh
model.
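The baking step can be sketched as follows, assuming NumPy and Pillow for image I/O; the channel layout, file names, and texture resolution are illustrative.

```python
import numpy as np
from PIL import Image

def bake_feature_maps(features: np.ndarray, out_prefix: str) -> None:
    """Quantize an (H, W, 8) feature texture with values in [0, 1] to two RGBA PNGs.

    Channel 0 could hold the binarized opacity and the remaining channels the texture
    features; the exact channel layout here is illustrative, not the paper's.
    """
    q = np.clip(np.round(features * 255.0), 0, 255).astype(np.uint8)
    # Split the 8 channels into two 4-channel RGBA images and save them losslessly.
    Image.fromarray(q[..., :4], mode="RGBA").save(f"{out_prefix}_0.png")
    Image.fromarray(q[..., 4:8], mode="RGBA").save(f"{out_prefix}_1.png")

# Example: bake a 1024x1024 UV map of 8-dimensional features.
bake_feature_maps(np.random.rand(1024, 1024, 8).astype(np.float32), "texture")
```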
2.4. Loss Function Design
During the optimization of the feature texture mesh, in order to prevent a mesh vertex
v from leaving its corresponding grid cell, the vertices are kept inside their cells
during training by a penalty loss built around a penalty factor $\mathcal{I}(v)$:
when the vertex v leaves its grid cell, $\mathcal{I}(v)$ takes the value 1000 and
the loss becomes large, approximately $1000\|v\|$; when the vertex remains inside
the cell, $\mathcal{I}(v)$ is 0 and the loss is $0.01\|v\|$.
Meanwhile, in order to optimize the implicit neural representation, the network is
trained by computing the mean squared error between the predicted and true colors
of the pixels. The RGB reconstruction color loss is

$\mathcal{L}_c = \sum_{r \in R} \left\| \widehat{C}(r) - C_{gt}(r) \right\|_2^2$

where $C_{gt}(r)$ denotes the ground-truth pixel color and R is the set of sampled
camera rays.
After binarization, the discrete and continuous models are trained jointly, so the
discrete-model loss $\mathcal{L}_c^{bin}$ and the continuous-model loss $\mathcal{L}_c$
must both be computed; both supervise the network through the RGB reconstruction color
loss. The continuous-model loss is given above, while the discrete model first computes
the discrete color value $\widehat{C}_b$ from the binarized transparency and then
evaluates

$\mathcal{L}_c^{bin} = \sum_{r \in R} \left\| \widehat{C}_b(r) - C_{gt}(r) \right\|_2^2$

The final loss for the continuous and discrete models is the sum of the two terms:

$\mathcal{L} = \mathcal{L}_c + \mathcal{L}_c^{bin}$
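Putting the pieces together, a PyTorch-style sketch of the combined objective might look as follows; the tensor shapes, the outside-cell mask, and the way the penalty term is averaged are illustrative assumptions consistent with the description above.

```python
import torch

def rgb_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean-squared RGB reconstruction loss over a batch of rays, shape (N, 3)."""
    return ((pred - gt) ** 2).sum(dim=-1).mean()

def total_loss(c_pred, c_pred_bin, c_gt, vertex_offsets, outside_mask):
    """Sum of the continuous loss L_c, the discrete loss L_c^bin and the vertex penalty.

    vertex_offsets: (V, 3) displacement of each grid vertex from its rest position.
    outside_mask:   (V,) boolean, True where a vertex has left its grid cell.
    The 1000 / 0.01 penalty factors follow the description in the text.
    """
    l_c = rgb_loss(c_pred, c_gt)
    l_c_bin = rgb_loss(c_pred_bin, c_gt)
    norms = vertex_offsets.norm(dim=-1)
    penalty = torch.where(outside_mask, 1000.0 * norms, 0.01 * norms).mean()
    return l_c + l_c_bin + penalty
```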
3. NeRF Animated Reconstruction of Intangible Cultural Scenes
Two obstacles to applying NeRF in practice are the long training time required for
each individual scene and the long rendering time for each virtual point-of-view image
after training, which cannot meet the demands of real-time reconstruction and rendering.
At the same time, NeRF only needs a certain number of 2D images and the corresponding
intrinsic and extrinsic camera parameters as input in order to render realistic images
of arbitrary virtual viewpoints between the input viewpoints.
With the rapid development of Web technology, traditional 2D Web pages are becoming
insufficient. WebGL, a JavaScript API, enables accelerated rendering of high-performance
2D or 3D graphics through the device's hardware in compatible browsers. Three.js,
a 3D engine built on WebGL, simplifies the creation of 3D web pages by encapsulating
WebGL's complex APIs. This allows for real-time interactive rendering, such as the
NeRF-based fast point-of-view reconstruction application discussed in this paper.
Developed using HTML, JavaScript, React, and Three.js, the application enables efficient
3D web page creation and interactive experiences.
3.1. System Design Program
The development of the interactive platform for intangible culture 3D scene includes
three modules: data uploading and processing module, visualization training module,
and interaction and rendering module. The overall design flow of training and rendering
interaction is shown in Fig. 5.
Fig. 5. NeRF fast viewpoint reconstruction design flow.
Fig. 5 shows the design flow of NeRF-based rapid 3D reconstruction. In the first step,
users upload their own datasets, which are transferred to the cloud through the data
reading module. In the second step, the cloud reads the uploaded data for preprocessing
and performs feature extraction and matching with COLMAP to estimate the camera parameters
of the images. In the third step, the cloud trains on the preprocessed dataset and
performs real-time rendering, with the rendered images displayed through the front-end
page. In the fourth step, after the cloud training is complete, the rendering and
interaction module allows users to interact with the 3D scene, manipulate the camera
to set a rendering path, and finally save the rendered video. The flowchart of these
steps is shown in Fig. 6.
Fig. 6. NeRF fast viewpoint reconstruction engineering application realization flow.
3.2. Module Functional Design
Data Upload and Preprocessing Module: Users upload an intangible-culture-related dataset
for training, consisting of a collection of images or a video file. The image dataset
is uploaded through the upload button on the page, and the images are saved to the
images folder on the cloud server. If the user uploads a video file, the system uses
ffmpeg (the vframes option) to split the video into frames, and the resulting frame
sequence is likewise saved to the images folder. The image sequence is then processed
by COLMAP's feature_extractor and exhaustive_matcher commands for feature extraction
and matching, after which the model_converter command is used to export the estimated
intrinsic and extrinsic parameters of the images, and the result is saved as a .json
file.
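A minimal Python sketch of how the cloud-side preprocessing might invoke ffmpeg and COLMAP is given below; the frame rate, directory layout, and the added sparse-reconstruction (mapper) step are assumptions, and the conversion of COLMAP's output into the .json file used for training is project-specific and omitted.

```python
import subprocess
from pathlib import Path

def preprocess(upload: Path, workdir: Path) -> None:
    """Turn an uploaded video or image folder into a COLMAP reconstruction (sketch)."""
    images = workdir / "images"
    images.mkdir(parents=True, exist_ok=True)
    if upload.suffix in {".mp4", ".mov", ".avi"}:
        # Extract frames from the uploaded video (2 fps is an arbitrary choice here).
        subprocess.run(["ffmpeg", "-i", str(upload), "-vf", "fps=2",
                        str(images / "%04d.jpg")], check=True)
    db, sparse, text = workdir / "database.db", workdir / "sparse", workdir / "text"
    sparse.mkdir(exist_ok=True)
    text.mkdir(exist_ok=True)
    # COLMAP feature extraction, exhaustive matching and sparse reconstruction.
    subprocess.run(["colmap", "feature_extractor", "--database_path", str(db),
                    "--image_path", str(images)], check=True)
    subprocess.run(["colmap", "exhaustive_matcher", "--database_path", str(db)], check=True)
    subprocess.run(["colmap", "mapper", "--database_path", str(db),
                    "--image_path", str(images), "--output_path", str(sparse)], check=True)
    # Export the model in text form; building the training .json from it is omitted here.
    subprocess.run(["colmap", "model_converter", "--input_path", str(sparse / "0"),
                    "--output_path", str(text), "--output_type", "TXT"], check=True)
```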
Visualization Training Module: After reading the preprocessed intangible culture dataset,
the system starts the optimization training of the implicit neural representation.
In each iteration, a predetermined number of pixels is sampled from the 2D images,
and a single-resolution vertex grid and three small network structures are initialized:
density_model.init(), feature_model.init(), and color_model.init(). The rendered pixels
are obtained by computing the intersections of the camera rays with the grid and feeding
those intersections into the networks. The colors of the sampled pixels and the rendered
pixel colors are then used for the loss calculation. The training process is visualized,
and each iteration allocates a certain amount of GPU resources to real-time rendering.
The user can control the training speed by adjusting the ratio of training to rendering
resources and the resolution of the rendered image. Users can also switch between
different rendered outputs of the network, such as outputting RGB color values to
generate color images or outputting volume density values to generate depth images,
in order to achieve a diversified rendering of animated scenes of intangible culture.
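To illustrate the three small networks mentioned above, here is a hedged PyTorch sketch; the layer widths, depths, and feature dimensions are illustrative choices, not values reported by the paper.

```python
import torch
import torch.nn as nn

def small_mlp(in_dim: int, out_dim: int, hidden: int = 64, layers: int = 2) -> nn.Sequential:
    """A small fully connected network; width and depth are illustrative choices."""
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

FEAT_DIM = 2      # dimension of the interpolated hash-grid feature (assumed)
TEX_DIM = 8       # dimension of the baked texture feature, as described in Section 2.3

density_model = small_mlp(FEAT_DIM, 1)            # -> transparency alpha
feature_model = small_mlp(FEAT_DIM, TEX_DIM)      # -> texture feature f
color_model = small_mlp(TEX_DIM + 3, 3)           # texture feature + view direction -> RGB

feat = torch.randn(1024, FEAT_DIM)                # interpolated features of 1024 intersections
view_dirs = torch.randn(1024, 3)
alpha = torch.sigmoid(density_model(feat))
f = feature_model(feat)
rgb = torch.sigmoid(color_model(torch.cat([f, view_dirs], dim=-1)))
```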
Interaction and Rendering Module: After training, users can interact with the scene
and adjust the rendering path by manipulating the camera. The 3D scene is set up with
Three.js, including the camera, scene, and renderer. A perspective camera is created
with THREE.PerspectiveCamera and controlled with OrbitControls(camera, dom); keyframe
positions are added to the render path and adjusted through camera.position.set and
camera.lookAt.
The rendering process is realized by deferred rendering, which is handled in two phases:
the geometry information storage phase and the RGB mapping calculation phase. In the
geometry storage phase, the geometry information of the scene is stored in the cache
and assigned to the Three.js shader texture material. In the RGB mapping computation
phase, a 2D rectangle is rendered, the stored parameters are read from the cache,
and the camera ray direction and feature texture vectors from the feature texture
map are passed to a small MLP that composites the visually relevant pixel color values.
Eventually, the 2D rectangle is stored as an RGB image and baked onto the model to
achieve a realistic reproduction of the animated scenes of intangible culture, as
shown in Fig. 7.
Fig. 7. Rendering process diagram.
3.3. Computational Efficiency and Scalability
Our method was designed with computational efficiency in mind, leveraging a hash coding
and feature texture mesh approach that significantly reduces the number of redundant
trainable parameters in the MLP network. This optimization allows for faster training
times without compromising the quality of the generated 3D models. On a single Nvidia
TITAN RTX2080 Ti, our method demonstrated faster convergence during training, completing
the process in approximately 20% less time than traditional NeRF approaches. In addition,
our method has been tested on various scene complexities and scales effectively with
increasing data volume. The use of hash coding allows the method to maintain high
performance even as the number of input images or the resolution of the 3D models
increases.
However, as scene complexity grows, there is a corresponding increase in computational
load, particularly in terms of memory usage and processing time. Despite this, our
method's ability to selectively allocate resources to critical areas helps mitigate
the impact, ensuring that performance remains within acceptable limits. In real-world
applications, particularly those involving the digital preservation of cultural heritage,
additional factors such as environmental conditions (e.g., lighting variability) and
the physical accessibility of artifacts can introduce variability in the input data.
This variability may affect the quality of the reconstruction, requiring further refinement
of the input dataset or additional post-processing steps to achieve the desired outcome.
4. Results and Discussion
4.1. Experimental Data Collection
The algorithm is evaluated on the ScanNet dataset, an RGB-D video dataset containing
2.5 million views in more than 1500 scans. Three scenes from ScanNet are selected
for this experiment. For each scene, 10 images of the local region are selected, and
three of them are used as the test set for novel view synthesis. The
dataset underwent a series of preprocessing steps to ensure optimal input quality
for the NeRF algorithm. Initially, images were resized to a resolution of $484 \times
648$ to standardize the input dimensions and reduce computational overhead. Following
resizing, depth maps were predicted using a model pre-trained on the ScanNet dataset.
These depth maps provide critical information about the spatial structure of scenes,
which enhances the accuracy of the NeRF reconstruction.
VSCode was used for code writing and modification; the algorithmic framework was implemented
with the PyTorch library and trained on a single Nvidia TITAN RTX2080 Ti. In the depth
map prediction phase, a pre-trained model is loaded and fine-tuned on the ScanNet
dataset to predict depth maps for the ScanNet images, and the depth information is
then added to the NeRF network for training. The algorithm uses the Adam optimizer
with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha = 10^{-5}$, and a
batch size of 214. The network is trained by minimizing the loss function defined
above until convergence.
4.2. Evaluation Indicators
In NeRF and related algorithms, quantitative evaluation is required in order to compare
the advantages and disadvantages of different methods. The quantitative evaluation
metrics are mainly the following:
Peak Signal-to-Noise Ratio: PSNR is a commonly used metric for evaluating image quality.
It computes the peak signal-to-noise ratio of an image and can be used to evaluate
the difference between the image rendered by the NeRF model and the real image. It
is calculated as

$PSNR = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{MSE}\right)$

where $MAX_I$ is the maximum possible pixel value of the image (for 8-bit images,
$MAX_I$ is usually 255) and MSE is the mean squared error between the predicted image
and the real image:

$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(P(i) - G(i)\right)^2$

where $P(i)$ is the pixel value of the predicted image, $G(i)$ is the pixel value
of the real image, and $N$ is the total number of pixels in the image.
Structural Similarity Index: SSIM is a structural similarity index that takes into
account the brightness, contrast, and structure of the image, and therefore better
evaluates the similarity between the image generated by the NeRF model and the real
image. It is calculated as

$SSIM = \frac{(2 u_p u_g + C_1)(2\sigma_{pg} + C_2)}{(u_p^2 + u_g^2 + C_1)(\sigma_p^2 + \sigma_g^2 + C_2)}$

where $u_p$ and $u_g$ are the luminance means of the predicted and real images, respectively,
$\sigma_p$ and $\sigma_g$ are their luminance standard deviations, $\sigma_{pg}$ is
the luminance covariance of the predicted and real images, and $C_1$ and $C_2$ are
constants used to prevent the denominator from being zero.
Learned Perceptual Image Patch Similarity: LPIPS is a deep-learning-based image similarity
metric that better captures the factors contributing to human-perceived image quality,
and therefore better evaluates the quality of images generated by the NeRF model.
It can be written as

$LPIPS = \frac{1}{N}\sum_{i=1}^{N}\left\| \phi(P_i) - \phi(G_i) \right\|_2^2$

where $P_i$ and $G_i$ are the i-th image patch in the predicted and real images, respectively,
and $\phi(\cdot)$ is a pre-trained convolutional neural network used to extract the
feature representation of an image patch.
Training time: Since NeRF models require substantial computational resources and time
for training, their performance also needs to be evaluated in terms of training time,
including the overall training time and the training time per epoch.
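In practice, these metrics can be computed with off-the-shelf libraries; the following sketch assumes scikit-image for PSNR/SSIM and the lpips package for LPIPS, with images given as float arrays in [0, 1].

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # pretrained perceptual network

def evaluate(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute PSNR, SSIM and LPIPS between two (H, W, 3) images with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```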
4.3. Tests Results
Tables 1 and 2 show the comparison of the quantitative metrics of the new views generated
by this algorithm and other algorithms in different scenarios. Table 1 shows the PSNR values and Table 2 shows the SSIM values, which are metrics for evaluating the similarity between images,
and the larger the value, the more similar the images are. From the table, it can
be seen that the algorithm achieves better results in each of the metrics.
Table 1. Comparison of quantitative experimental results in ScanNet dataset (PSNR$\uparrow$).
| Method | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Scene 6 |
| NSVF | 15.71 | 24.41 | 17.13 | 21.08 | 26.88 | 22.29 |
| NeRF | 15.76 | 27.73 | 16.57 | 23.94 | 25.18 | 22.76 |
| NeRF/COLMAP | 21.38 | 27.97 | 18.64 | 27.32 | 27.42 | 24.35 |
| Our method | 20.09 | 28.54 | 22.39 | 30.53 | 29.57 | 26.78 |
Table 2. Comparison of quantitative experimental results in ScanNet dataset (SSIM$\uparrow$).
| Method | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Scene 6 |
| NSVF | 0.721 | 0.821 | 0.734 | 0.792 | 0.873 | 0.795 |
| NeRF | 0.735 | 0.843 | 0.725 | 0.818 | 0.835 | 0.801 |
| NeRF/COLMAP | 0.796 | 0.882 | 0.756 | 0.896 | 0.898 | 0.837 |
| Our method | 0.782 | 0.901 | 0.802 | 0.941 | 0.926 | 0.879 |
Table 3 shows the quantitative evaluation metrics between rendered images at known viewing
angles. Data from 8 scenes are used for testing, and the average value of each metric
is taken as the data in the table. The experimental results show that the new view
generated by the algorithm proposed in this paper is similar to the real image to
a larger extent than the NeRF algorithm. By comparison, voxel-based methods tend to
suffer from lower resolution and detail in texture representation, while point cloud
approaches often struggle to maintain consistency in lighting and shading. Mesh-based
techniques, though effective in certain structured environments, often require extensive
preprocessing and are less adaptable to complex, unstructured scenes. In contrast,
our method not only excels in capturing fine details and realistic lighting but also
does so with greater computational efficiency.
Table 3. Comparison of quantitative experimental results of different methods in ScanNet
dataset.
| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| NeRF | 29.51 | 0.914 | 0.342 |
| Our method | 31.68 | 0.956 | 0.194 |
The PSNR values are visualized in Fig. 8, where both the training and testing data are taken from the ScanNet dataset. It can
be seen that the present method achieves better results in both training and testing.
On the ScanNet dataset, this algorithm generates clearer novel views than the NeRF
method. Taking PSNR, SSIM, and LPIPS as the quantitative evaluation metrics, the algorithm
achieves a PSNR of 31.68, an SSIM of 0.956, and an LPIPS of 0.194, outperforming the
NeRF algorithm on all three metrics.
Fig. 8. Comparison of the two methods (PSNR); (a) Train set; (b) Test set.
In addition to the quantitative metrics, we conducted a qualitative analysis to visually
demonstrate the improvements achieved by our method. Fig. 8 illustrates a comparison between scenes generated using our method and those created
with traditional NeRF methods. As shown, our approach produces more accurate texture
details and realistic light propagation, particularly evident in the intricate features
of cultural artifacts and natural elements such as lighting and shadow consistency.
5. Conclusion
In this paper, we presented a novel approach to the preservation and dissemination
of intangible cultural heritage (ICH) through the application of Neural Radiance Fields
(NeRF) technology. Our proposed algorithm significantly enhances the quality and efficiency
of 3D scene reconstruction and viewpoint synthesis compared to traditional methods.
By leveraging NeRF's ability to generate high-quality 3D models from a limited set
of 2D images, we developed a system that facilitates the real-time rendering and interactive
display of animated cultural scenes.
Our experiments on the ScanNet dataset demonstrated the superior performance of our
method over traditional NeRF-based approaches, achieving a PSNR of 31.68, SSIM of
0.956, and LPIPS of 0.194. The system offers data uploading, preprocessing, visualization
training, and interactive rendering, providing a comprehensive solution for digital
preservation of ICH. It preserves the visual and structural integrity of cultural
artifacts while engaging audiences with interactive experiences. However, handling
large-scale datasets and complex artifacts remains a challenge, necessitating further
optimization.
In conclusion, the integration of NeRF technology into the preservation of intangible
cultural heritage offers new avenues for safeguarding and sharing cultural expressions.
The high-quality 3D reconstructions and interactive capabilities of our system have
important theoretical and practical implications, paving the way for future research
and applications in cultural heritage preservation. The success of our method underscores
the potential of advanced digital technologies in maintaining the continuity of cultural
heritage amidst the challenges of globalization and modernization. Future work could
focus on several areas to enhance the proposed method further. First, optimizing the
algorithm's efficiency could reduce training times and computational costs, making
it more accessible for larger datasets and real-time applications. Exploring other
datasets, particularly those involving diverse cultural artifacts and environments,
would help generalize the algorithm's applicability across different contexts. Additionally,
integrating user feedback into the development process could lead to more user-friendly
interfaces and tools, ensuring that the technology meets the needs of cultural heritage
professionals and other stakeholders.