1. Introduction
Cultural heritage is categorized into tangible and intangible cultural heritage. Tangible
cultural heritage is what we traditionally refer to as cultural heritage, that is,
cultural artifacts of historical, artistic and scientific value [1]. Intangible cultural heritage, on the other hand, refers to traditional cultural
expressions created through the participation and innovation of the people, including
traditional folk performing arts [2], folk rituals and festivals, traditional knowledge of nature and the universe and its transmission, and traditional handicrafts. Intangible cultural heritage
is a pearl nurtured in the long history of the peoples of the world [3], a bond and carrier of national emotions, and a spiritual home that has remained
unchanged for millennia as human civilization traces its roots [4]. In the context of globalization and modernization, ICH has gradually lost its original
soil and social environment [5]. The industrial revolution and the rapid development of urbanization have changed
people's production, life and way of thinking, making the living environment of folk
culture deteriorate and the cultural space be marginalized [6]. Traditional folk crafts, because they are transmitted by word of mouth and handed
down from generation to generation by the old generation of artists in a specific
region, are subject to the limitations of time, space and territory and are at risk
of being lost [7]. How to preserve and disseminate this cultural heritage has become a topic of discussion
among scholars [8].
Prior to the 1990s [9], the preservation of intangible cultural heritage in China relied on traditional
and inefficient methods such as photographs, interviews, and audio recordings, and
the precious heritage thus preserved was damaged to varying degrees by the erosion
of time and improper storage. Today, digital technology provides a brand-new means
of safeguarding intangible cultural heritage; in particular, the Internet can disseminate
ICH widely. Digitization technologies turn audiences from mere spectators into interactive
participants, for example through animations and multimedia digital imagery [10].
In the 1990s, UNESCO launched the Memory of the World (MOW) project with the aim of
preserving and accessing cultural heritage and raising awareness of cultural heritage
in all countries [11]. For a while, "digitization" became a buzzword in cultural heritage preservation.
The National Endowment for the Humanities and the University of Oxford in the United
Kingdom are collaborating on a project to digitize works from Shakespeare's era [12], proposing to consolidate Shakespearean materials from the United Kingdom, the United
States, and elsewhere around the globe up to the year 2000, and to develop a user
interface that allows users to query these databases for relatively in-depth comparison
and research. The Visual Media Center at Columbia University focuses on the potential
of new media to facilitate the interpretation and preservation of the built environment [13].
Hui-Jeong Han et al. explored ways of digitally managing and utilizing NRLs in Korea,
analyzed the current state of their digital archiving in depth, and proposed policies
grounded in cultural governance and normative management [14].
In recent years, digital technology has played an essential role in cultural heritage
preservation, offering a variety of methods to digitize and safeguard both tangible
and intangible cultural assets. Earlier digital preservation methods primarily involved
2D imaging techniques, such as high-resolution photography and photogrammetry, which,
while effective in capturing surface details, often fell short in representing the
three-dimensional intricacies of artifacts [15]. Techniques like Structured Light Scanning (SLS) and Laser Scanning have since been
employed to generate more accurate 3D models, providing a fuller representation of
the physical characteristics of heritage objects. However, these methods are typically
resource-intensive, requiring specialized equipment and significant post-processing,
which limits their accessibility and scalability, especially in resource-constrained
environments [16]. Recent advances in 3D reconstruction, such as Multi-View Stereo (MVS) and volumetric
capture, have improved the fidelity and usability of digital models. These methods
allow for the creation of detailed 3D models from multiple 2D images but often require
extensive computational resources and complex workflows [17]. Additionally, voxel-based approaches have been explored for their ability to capture
volumetric data, but they frequently struggle with maintaining high resolution and
texture detail in complex scenes [18]. Neural Radiance Fields (NeRF) technology has emerged as a promising alternative,
addressing many of the limitations of previous methods. NeRF is a scene representation
method via neural networks, capable of generating high-quality 3D models from a limited
set of 2D images [19]. The technique utilizes neural networks to learn the implicit representation of a
scene, allowing for realistic reproduction of light propagation and fine details [20]. This gives NeRF a significant advantage in the 3D reconstruction of complex scenes,
especially when dealing with the nuanced textures and lighting conditions characteristic
of intangible cultural heritage (ICH). Despite its potential, there is a notable gap
in research specifically applying NeRF technology to the reproduction of animated
scenes of ICH. This paper aims to fill this gap by proposing a NeRF-based system for
ICH animation scene reproduction [21].
2. Neural Radiance Fields Synthesis Algorithm
NeRF synthesizes virtual point-of-view images by ray casting and volume rendering.
For each pixel of the input image, a ray is cast from the camera and densely sampled
along its length to optimize a large MLP, and every parameter of this MLP must be
updated many times during training. The resulting computational complexity of training
and rendering is high, which makes it difficult to satisfy the requirements of interactive
visualization. To achieve real-time, high-quality virtual point-of-view synthesis,
the implicit neural representation must be both fast and of high quality.
To address these problems, a fast NeRF viewpoint synthesis algorithm combining hash
coding and a feature texture mesh is proposed. The method first accelerates the training
of the implicit neural representation through a hash coding network, reduces the number
of redundant trainable parameters in the MLP, and achieves fast construction of the
implicit neural representation by trading a modest amount of additional memory for
a large reduction in computational cost.
2.1. Algorithm Overview
The fast NeRF viewpoint synthesis algorithm based on a feature texture mesh proposed
in this paper constructs a fast implicit representation of the 3D scene through a
single-resolution hash coding module, a continuous scene representation training module,
and a joint optimization and rendering module, and simultaneously generates the 3D
scene model and the feature texture maps in order to realize real-time virtual viewpoint
image synthesis, as shown in Fig. 1. The process begins with the input of 2D images and corresponding camera parameters.
These inputs undergo preprocessing, including resizing and depth map prediction. Next,
hash coding is applied to optimize feature vectors, which are then fed into the Multi-Layer
Perceptron (MLP) for training. Finally, the trained model renders new views from arbitrary
viewpoints.
In the single-resolution hash coding module, a feature texture grid model G must first
be predefined to initialize the implicit scene representation. Unlike NeRF, which samples
points along rays, the algorithm computes the intersection point $x \in \mathbb{R}^3$
of each camera ray with the initialized grid, together with the vertex coordinates
$v \in \mathbb{R}^3$ of the grid cell containing that intersection, as the spatial
input to the subsequent hash coding step. The spatial positions are mapped by the
spatial hash function h(v) to obtain the encoded feature embedding vector $\hat{F}$.
In the continuous scene representation training module, the feature embedding vector
$\hat{F}$ output by the hash coding network is passed through three small MLPs that
output, for each ray sampling point, the transparency $\alpha$, the texture feature
vector f, and the color value c. Alpha compositing, rather than full volume rendering,
is then used to synthesize the color value $\widehat{C}$ of each camera ray, and the
network is supervised by the MSE loss between the rendered color $\widehat{C}$ and
the ground-truth color $C_{gt}$.
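To make the alpha compositing step described above concrete, the following is a minimal PyTorch-style sketch of combining per-intersection opacities and colors along each ray into a single pixel color; the function name and tensor shapes are illustrative and not taken from the paper's implementation.

```python
import torch

def alpha_composite(alphas: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of the samples along each ray.

    alphas: (N, S) opacity of each of S mesh intersections per ray, in [0, 1].
    colors: (N, S, 3) RGB value predicted for each intersection.
    Returns: (N, 3) composited ray colors C_hat.
    """
    # Transmittance before each sample: product of (1 - alpha) over all earlier samples.
    one_minus = 1.0 - alphas
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), one_minus[:, :-1]], dim=1), dim=1
    )
    weights = transmittance * alphas              # contribution of each intersection
    return (weights.unsqueeze(-1) * colors).sum(dim=1)
```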
In the joint optimization and rendering module, the continuous transparency value
$\alpha$ obtained from the MLP network is binarized to obtain the discrete transparency
$\alpha_b$, the discrete values are back-propagated using the straight-through estimator
(STE), and the continuous and discrete models are then co-trained stably to achieve
correct rendering of the semi-transparent feature texture mesh. In the rendering stage,
the feature texture mesh G is extracted as an explicit discrete mesh model, and the
texture mapping of the model is computed in real time by a color network. The design
of each module, as well as the loss functions used to supervise the neural network
optimization, is described in detail below.
Fig. 1. Schematic diagram of the algorithm.
2.2. Single Resolution Hash Coding Module
A new parametric encoding assists the training of the MLP before spatial locations
are fed into the neural network for inference, in order to reduce the size of the
MLP and thus the training time of the implicit neural representation. This encoding
allows the implicit neural representation to be trained with a smaller MLP than previous
methods without sacrificing the quality of virtual point-of-view image synthesis,
thereby reducing the number of network parameter updates required as the sampled points
on each ray pass through the MLP.
The method relies on a sparse grid structure containing trainable feature vectors,
so a polygonal grid G storing texture features must first be predefined, with each
vertex of the grid storing a feature embedding vector F. The grid is initialized by
defining a regular grid of size $128 \times 128 \times 128$ in a unit cube centered
at the origin; a vertex v is created within each cell, and each edge of the grid is
used to create a quadrilateral face connecting four neighboring grid vertices. Grid
vertices correspond one-to-one to feature vectors through a hash table and are queried
with the spatial hash function h(v). The feature embedding vectors F are optimized
by stochastic gradient descent jointly with the network parameters of the MLP. A purely
MLP-based representation is optimized globally and exhibits a smooth inductive bias,
but its optimization and evaluation are computationally expensive; in contrast, the
feature-texture-mesh-based representation is updated locally during optimization and
evaluation, expresses local detail well, and is computationally more efficient. The
key to this process is the sparse grid structure of trainable feature vectors.
The pseudocode of the hash coding process is as follows:
Initialize grid G with size $128 \times 128 \times 128$
For each vertex v in G:
Create a feature embedding vector F
Store F in the hash table using spatial hash function h(v)
For each ray r(t) = o + td emitted from the camera:
Compute intersection point x of r with grid G
Obtain 3D coordinates v of the vertex where x is located
Encode v using h(v) to obtain feature embedding vector F
Perform trilinear interpolation on F to obtain interpolated feature vector $\hat{F}$
Pass $\hat{F}$ to MLP for further processing
Optimize F and MLP parameters using stochastic gradient descent
The input to the hash coding network differs from the spatial points sampled along
rays in NeRF: because a predefined feature texture mesh is used, the input is the
intersection of each ray with that mesh. First, a ray r(t) = o + td is emitted from
the camera position and passes through the mesh G. The intersection point x of the
ray r with the predefined mesh is computed, which also yields the interior intersection
points of the ray with the mesh, and the 3D coordinates v of the vertices of the cell
containing the intersection are hash-encoded to obtain the feature embedding vectors
F = enc(v; $\phi$), where $\phi$ denotes the trainable parameters of the feature-textured
mesh. Trilinear interpolation is then applied to obtain the feature embedding vector
of the intersection point itself. The feature embedding vectors are queried through
the spatial hash function on a hash table containing T feature embedding vectors of
dimension F, so the number of trainable parameters $\theta$ is T $\times$ F. The hash
coding process is shown in Fig. 2.
Fig. 2. Diagram of the hash coding process.
The hash coding process represented in Fig. 2 is as follows.
For a given input 3D coordinate x, the grid cell containing it is first located, and
the vertex positions v of that cell are obtained.
A trainable feature embedding vector F is stored at each vertex of the mesh, so each
vertex position v must be mapped to its feature embedding vector in the hash table
via a spatial hash function h(v):

$h(v) = \left( \bigoplus_{i=1}^{d} v_i \pi_i \right) \bmod T$

where $\oplus$ denotes the bitwise XOR operation, d is the dimension of the input
vector, $v_i$ denotes each dimension of the input three-dimensional position, each
$\pi_i$ is a unique large prime number, mod denotes the remainder operation, and T
is the number of feature vectors in the hash table. The formula indicates that the
three-dimensional coordinates of the point are used as the key: each coordinate dimension
is multiplied by $\pi_i$, the results are combined by bitwise XOR, and the remainder
with respect to T gives the index of the corresponding value in the hash table, i.e.,
the feature embedding vector F. In the experiments, the number of vertices in the
mesh is smaller than T, so the mapping between vertices and feature vectors is 1 : 1,
which alleviates the effect of hash collisions.
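As an illustration of this lookup, the following Python sketch implements a spatial hash of the form described above; the specific prime constants and table size are borrowed from common hash-grid implementations and are assumptions, since the paper does not list them.

```python
import numpy as np

# Large primes used for hashing; the first is conventionally 1 so the x-dimension is
# passed through unchanged (an assumption, not a value given in the paper).
PRIMES = (1, 2_654_435_761, 805_459_861)

def spatial_hash(v: np.ndarray, table_size: int) -> np.ndarray:
    """h(v) = (XOR_i v_i * pi_i) mod T for integer vertex coordinates v of shape (..., 3)."""
    v = v.astype(np.uint64)
    h = v[..., 0] * np.uint64(PRIMES[0])
    for i in range(1, v.shape[-1]):
        h ^= v[..., i] * np.uint64(PRIMES[i])
    return h % np.uint64(table_size)

# Example: look up the feature embedding of one grid vertex in a table of T vectors.
T, F_DIM = 2**19, 2                                   # illustrative sizes
feature_table = np.random.randn(T, F_DIM).astype(np.float32)
idx = spatial_hash(np.array([17, 42, 93]), T)
feature = feature_table[idx]
```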
Then, according to the position of x within its grid cell, trilinear feature interpolation
is performed. Trilinear interpolation is an interpolation method carried out in 3D
space: after the feature vector $F_i$ of each vertex of the cell has been obtained,
the feature vector of the spatial coordinate x is computed by trilinear interpolation.
The resulting interpolated feature vector $\hat{F} \in \mathbb{R}^F$ is the input to
the MLP. With single-resolution hash coding, only the feature embedding vectors of
the cell containing the input coordinates need to be updated during each optimization
step of the feature mesh, and only a small number of weights and biases of the shallow
MLPs need to be updated in the subsequent optimization. In NeRF, by contrast, every
gradient back-propagated through the network requires updating the weights and biases
of every layer and every channel of the MLP for each sample point, which is computationally
intensive and leads to long training times.
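The trilinear interpolation step can be sketched as follows; the array layout and function signature are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np

def trilinear_interpolate(x: np.ndarray, cell_min: np.ndarray, cell_size: float,
                          corner_features: np.ndarray) -> np.ndarray:
    """Trilinearly interpolate the 8 corner feature vectors of the grid cell containing x.

    x:               (3,) query point inside the cell.
    cell_min:        (3,) coordinates of the cell's minimum corner.
    cell_size:       edge length of the cubic cell.
    corner_features: (2, 2, 2, F) features at the 8 corners, indexed by (z, y, x) offsets.
    Returns the interpolated feature vector of shape (F,).
    """
    # Normalized position of x inside the cell, each component in [0, 1].
    tx, ty, tz = (x - cell_min) / cell_size
    # Interpolate along x, then y, then z.
    c00 = corner_features[0, 0, 0] * (1 - tx) + corner_features[0, 0, 1] * tx
    c01 = corner_features[0, 1, 0] * (1 - tx) + corner_features[0, 1, 1] * tx
    c10 = corner_features[1, 0, 0] * (1 - tx) + corner_features[1, 0, 1] * tx
    c11 = corner_features[1, 1, 0] * (1 - tx) + corner_features[1, 1, 1] * tx
    c0 = c00 * (1 - ty) + c01 * ty
    c1 = c10 * (1 - ty) + c11 * ty
    return c0 * (1 - tz) + c1 * tz
```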
2.3. Joint Optimization and Rendering Module
In this module, the continuous transparency $\alpha$ produced by the transparency
MLP must be binarized into the discrete transparency $\alpha_b$. When rasterizing
the feature texture mesh, semi-transparent faces would have to be sorted by depth
and rendered in order to guarantee correct alpha compositing, and general hardware-based
rendering pipelines do not support rendering semi-transparent meshes; the continuous
alpha opacities generated by the MLP are therefore discretized to achieve correct
rendering of the polygon mesh.
After obtaining the binarized transparency, the continuous and discrete models are
jointly trained to ensure that the network performs correct alpha compositing for
the mesh; the joint optimization process of the continuous and discrete models is
shown in Fig. 3.
Fig. 3. Joint optimization process of discrete and continuous models.
In the process shown in Fig. 3, the intersection point of the ray and the mesh is mapped by hash coding to an
interpolated feature embedding vector $\hat{F}$, which is passed through the transparency-domain
MLP and the feature-domain MLP to obtain the continuous transparency value $\alpha$
and the continuous texture feature f. The continuous transparency is binarized to
obtain the discrete transparency $\alpha_b$, with a corresponding discrete texture
feature $f_b$. The continuous and discrete texture features are passed through the
color MLP to obtain the continuous color value c and the discrete color value $c_b$,
and alpha compositing then yields the continuous rendered color $\widehat{C}(r)$ and
the discrete rendered color $\widehat{C}_b(r)$. The loss of each against the ground-truth
color is computed separately, and the final loss of the joint optimization model is
the sum of the discrete-model loss and the continuous-model loss, thereby realizing
the joint optimization of the continuous and discrete models.
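A minimal PyTorch sketch of the straight-through binarization used to couple the two branches is shown below; the 0.5 threshold is an assumption, since the paper does not state the binarization rule explicitly.

```python
import torch

def binarize_ste(alpha: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binarize continuous opacities while letting gradients pass straight through.

    Forward pass: hard 0/1 values (the threshold is an assumption, not given in the paper).
    Backward pass: gradient of the identity, so the continuous network keeps learning.
    """
    hard = (alpha > threshold).float()
    return alpha + (hard - alpha).detach()

# During joint optimization both the continuous alpha and binarize_ste(alpha) branches
# are rendered and supervised, and their losses are summed.
```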
In the rendering stage, the jointly optimized polygonal feature texture mesh is stored
in OBJ format, and the quadrilateral faces in the mesh are selectively retained according
to whether they are visible. When the camera rays pass through the feature texture
mesh, the weight w of each intersection point, i.e., of each quadrilateral face, is
computed, and for each ray only the face with the largest weight is kept. By retaining
only these maximum-weight quadrilateral faces, together with the corresponding faces
in the UV map, the visible quadrilateral faces are preserved, and texture feature
UV maps are stored only for polygons visible in the input viewpoint images. The mesh
represents the extent of the scene at the points where the rays intersect it (i.e.,
the interior of the mesh), and a mesh size of $128 \times 128 \times 128$ is sufficient
to cover the scene.
After that, the pixel coordinates in the model's 2D texture material are converted
to 3D coordinates, and the discrete opacity $\alpha_b$ and texture feature values
$f_b$ corresponding to those 3D coordinates are baked into the texture UV map by iterating
vertex by vertex over the quadrilateral faces of the feature texture mesh; the result
is stored in a losslessly compressed PNG file, as shown in Fig. 4.
Fig. 4. Extraction of texture maps.
The UV texture maps of the model ultimately store the transparency $\alpha_b$ and
texture features $f_b$ generated by the MLPs. Each vertex of every retained quadrilateral
face is therefore fed into the trained MLPs to generate the corresponding transparency
and texture feature vector, with an 8-dimensional feature vector output per vertex.
The texture feature values, which lie in the range [0, 1], are quantized to [0, 255],
and the 8-channel texture feature vector is stored as two 4-channel RGB$\alpha$ texture
material maps. At render time, the texture features read from these maps and the
line-of-sight direction are fed into the color network $\mathcal{H}$ for real-time
color mapping computation, which ultimately achieves real-time rendering of the mesh
model.
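The baking step can be sketched as follows, assuming NumPy and Pillow for image I/O; the channel layout, file names, and texture resolution are illustrative.

```python
import numpy as np
from PIL import Image

def bake_feature_maps(features: np.ndarray, out_prefix: str) -> None:
    """Quantize an (H, W, 8) feature texture with values in [0, 1] to two RGBA PNGs.

    Channel 0 could hold the binarized opacity and the remaining channels the texture
    features; the exact channel layout here is illustrative, not the paper's.
    """
    q = np.clip(np.round(features * 255.0), 0, 255).astype(np.uint8)
    # Split the 8 channels into two 4-channel RGBA images and save them losslessly.
    Image.fromarray(q[..., :4], mode="RGBA").save(f"{out_prefix}_0.png")
    Image.fromarray(q[..., 4:8], mode="RGBA").save(f"{out_prefix}_1.png")

# Example: bake a 1024x1024 UV map of 8-dimensional features.
bake_feature_maps(np.random.rand(1024, 1024, 8).astype(np.float32), "texture")
```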
2.4. Loss Function Design
During the optimization of the feature texture mesh, in order to prevent a mesh vertex
v from leaving its corresponding grid cell, the vertices are kept inside their cells
during training by a penalty loss built around a penalty factor $\mathcal{I}(v)$:
when the vertex v leaves its grid cell, $\mathcal{I}(v)$ takes the value 1000 and
the loss becomes large, approximately $1000\|v\|$; when the vertex remains inside
the cell, $\mathcal{I}(v)$ is 0 and the loss is $0.01\|v\|$.
Meanwhile, in order to optimize the implicit neural representation, the network is
trained by computing the mean squared error between the predicted and true colors
of the pixels. The RGB reconstruction color loss is

$\mathcal{L}_c = \sum_{r \in R} \left\| \widehat{C}(r) - C_{gt}(r) \right\|_2^2$

where $C_{gt}(r)$ denotes the ground-truth pixel color and R is the set of sampled
camera rays.
After binarization, the discrete and continuous models are trained jointly, so the
discrete-model loss $\mathcal{L}_c^{bin}$ and the continuous-model loss $\mathcal{L}_c$
must both be computed; both supervise the network through the RGB reconstruction color
loss. The continuous-model loss is given above, while the discrete model first computes
the discrete color value $\widehat{C}_b$ from the binarized transparency and then
evaluates

$\mathcal{L}_c^{bin} = \sum_{r \in R} \left\| \widehat{C}_b(r) - C_{gt}(r) \right\|_2^2$

The final loss for the continuous and discrete models is the sum of the two terms:

$\mathcal{L} = \mathcal{L}_c + \mathcal{L}_c^{bin}$
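Putting the pieces together, a PyTorch-style sketch of the combined objective might look as follows; the tensor shapes, the outside-cell mask, and the way the penalty term is averaged are illustrative assumptions consistent with the description above.

```python
import torch

def rgb_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean-squared RGB reconstruction loss over a batch of rays, shape (N, 3)."""
    return ((pred - gt) ** 2).sum(dim=-1).mean()

def total_loss(c_pred, c_pred_bin, c_gt, vertex_offsets, outside_mask):
    """Sum of the continuous loss L_c, the discrete loss L_c^bin and the vertex penalty.

    vertex_offsets: (V, 3) displacement of each grid vertex from its rest position.
    outside_mask:   (V,) boolean, True where a vertex has left its grid cell.
    The 1000 / 0.01 penalty factors follow the description in the text.
    """
    l_c = rgb_loss(c_pred, c_gt)
    l_c_bin = rgb_loss(c_pred_bin, c_gt)
    norms = vertex_offsets.norm(dim=-1)
    penalty = torch.where(outside_mask, 1000.0 * norms, 0.01 * norms).mean()
    return l_c + l_c_bin + penalty
```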
3. NeRF Animated Reconstruction of Intangible Cultural Scenes
Two obstacles to applying NeRF in practice are the long training time required for
each individual scene and the long rendering time for each virtual point-of-view image
after training, which cannot meet the demands of real-time reconstruction and rendering.
At the same time, NeRF only needs a certain number of 2D images and the corresponding
intrinsic and extrinsic camera parameters as input in order to render realistic images
of arbitrary virtual viewpoints between the input viewpoints.
With the rapid development of Web technology, traditional 2D Web pages are becoming
insufficient. WebGL, a JavaScript API, enables accelerated rendering of high-performance
2D or 3D graphics through the device's hardware in compatible browsers. Three.js,
a 3D engine built on WebGL, simplifies the creation of 3D web pages by encapsulating
WebGL's complex APIs. This allows for real-time interactive rendering, such as the
NeRF-based fast point-of-view reconstruction application discussed in this paper.
Developed using HTML, JavaScript, React, and Three.js, the application enables efficient
3D web page creation and interactive experiences.
3.1. System Design Program
The development of the interactive platform for intangible culture 3D scene includes
three modules: data uploading and processing module, visualization training module,
and interaction and rendering module. The overall design flow of training and rendering
interaction is shown in Fig. 5.
Fig. 5. NeRF fast viewpoint reconstruction design flow.
Fig. 5 shows the design flow of NeRF-based rapid 3D reconstruction. In the first step,
users upload their own datasets, which are transferred to the cloud through the data
reading module. In the second step, the cloud reads the uploaded data for preprocessing
and performs feature extraction and matching with COLMAP to estimate the camera parameters
of the images. In the third step, the cloud trains on the preprocessed dataset and
performs real-time rendering, with the rendered images displayed through the front-end
page. In the fourth step, after the cloud training is complete, the rendering and
interaction module allows users to interact with the 3D scene, manipulate the camera
to set a rendering path, and finally save the rendered video. The flowchart of these
steps is shown in Fig. 6.
Fig. 6. NeRF fast viewpoint reconstruction engineering application realization flow.
3.2. Module Functional Design
Data Upload and Preprocessing Module: Users upload an intangible-culture-related dataset
for training, consisting of a collection of images or a video file. The image dataset
is uploaded through the upload button on the page, and the images are saved to the
images folder on the cloud server. If the user uploads a video file, the system uses
ffmpeg (the vframes option) to split the video into frames, and the resulting frame
sequence is likewise saved to the images folder. The image sequence is then processed
by COLMAP's feature_extractor and exhaustive_matcher commands for feature extraction
and matching, after which the model_converter command is used to export the estimated
intrinsic and extrinsic parameters of the images, and the result is saved as a .json
file.
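A minimal Python sketch of how the cloud-side preprocessing might invoke ffmpeg and COLMAP is given below; the frame rate, directory layout, and the added sparse-reconstruction (mapper) step are assumptions, and the conversion of COLMAP's output into the .json file used for training is project-specific and omitted.

```python
import subprocess
from pathlib import Path

def preprocess(upload: Path, workdir: Path) -> None:
    """Turn an uploaded video or image folder into a COLMAP reconstruction (sketch)."""
    images = workdir / "images"
    images.mkdir(parents=True, exist_ok=True)
    if upload.suffix in {".mp4", ".mov", ".avi"}:
        # Extract frames from the uploaded video (2 fps is an arbitrary choice here).
        subprocess.run(["ffmpeg", "-i", str(upload), "-vf", "fps=2",
                        str(images / "%04d.jpg")], check=True)
    db, sparse, text = workdir / "database.db", workdir / "sparse", workdir / "text"
    sparse.mkdir(exist_ok=True)
    text.mkdir(exist_ok=True)
    # COLMAP feature extraction, exhaustive matching and sparse reconstruction.
    subprocess.run(["colmap", "feature_extractor", "--database_path", str(db),
                    "--image_path", str(images)], check=True)
    subprocess.run(["colmap", "exhaustive_matcher", "--database_path", str(db)], check=True)
    subprocess.run(["colmap", "mapper", "--database_path", str(db),
                    "--image_path", str(images), "--output_path", str(sparse)], check=True)
    # Export the model in text form; building the training .json from it is omitted here.
    subprocess.run(["colmap", "model_converter", "--input_path", str(sparse / "0"),
                    "--output_path", str(text), "--output_type", "TXT"], check=True)
```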
Visualization Training Module: After reading the preprocessed intangible culture dataset,
the system starts the optimization training of the implicit neural representation.
In each iteration, a predetermined number of pixels is sampled from the 2D images,
and a single-resolution vertex grid and three small network structures are initialized:
density_model.init(), feature_model.init(), and color_model.init(). The rendered pixels
are obtained by computing the intersections of the camera rays with the grid and feeding
those intersections into the networks. The colors of the sampled pixels and the rendered
pixel colors are then used for the loss calculation. The training process is visualized,
and each iteration allocates a certain amount of GPU resources to real-time rendering.
The user can control the training speed by adjusting the ratio of training to rendering
resources and the resolution of the rendered image. Users can also switch between
different rendered outputs of the network, such as outputting RGB color values to
generate color images or outputting volume density values to generate depth images,
in order to achieve a diversified rendering of animated scenes of intangible culture.
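To illustrate the three small networks mentioned above, here is a hedged PyTorch sketch; the layer widths, depths, and feature dimensions are illustrative choices, not values reported by the paper.

```python
import torch
import torch.nn as nn

def small_mlp(in_dim: int, out_dim: int, hidden: int = 64, layers: int = 2) -> nn.Sequential:
    """A small fully connected network; width and depth are illustrative choices."""
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

FEAT_DIM = 2      # dimension of the interpolated hash-grid feature (assumed)
TEX_DIM = 8       # dimension of the baked texture feature, as described in Section 2.3

density_model = small_mlp(FEAT_DIM, 1)            # -> transparency alpha
feature_model = small_mlp(FEAT_DIM, TEX_DIM)      # -> texture feature f
color_model = small_mlp(TEX_DIM + 3, 3)           # texture feature + view direction -> RGB

feat = torch.randn(1024, FEAT_DIM)                # interpolated features of 1024 intersections
view_dirs = torch.randn(1024, 3)
alpha = torch.sigmoid(density_model(feat))
f = feature_model(feat)
rgb = torch.sigmoid(color_model(torch.cat([f, view_dirs], dim=-1)))
```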
Interaction and Rendering Module: After training, users can interact with the scene
and adjust the rendering path by manipulating the camera. The 3D scene is set up with
Three.js, including the camera, scene, and renderer. A perspective camera is created
with THREE.PerspectiveCamera and controlled with OrbitControls(camera, dom); keyframe
positions are added to the render path and adjusted through camera.position.set and
camera.lookAt.
The rendering process is realized by deferred rendering, which is handled in two phases:
the geometry information storage phase and the RGB mapping calculation phase. In the
geometry storage phase, the geometry information of the scene is stored in the cache
and assigned to the Three.js shader texture material. In the RGB mapping computation
phase, a 2D rectangle is rendered, the stored parameters are read from the cache,
and the camera ray direction and feature texture vectors from the feature texture
map are passed to a small MLP that composites the visually relevant pixel color values.
Eventually, the 2D rectangle is stored as an RGB image and baked onto the model to
achieve a realistic reproduction of the animated scenes of intangible culture, as
shown in Fig. 7.
Fig. 7. Rendering process diagram.
3.3. Computational Efficiency and Scalability
Our method was designed with computational efficiency in mind, leveraging a hash coding
and feature texture mesh approach that significantly reduces the number of redundant
trainable parameters in the MLP network. This optimization allows for faster training
times without compromising the quality of the generated 3D models. On a single Nvidia
TITAN RTX2080 Ti, our method demonstrated faster convergence during training, completing
the process in approximately 20% less time than traditional NeRF approaches. In addition,
our method has been tested on various scene complexities and scales effectively with
increasing data volume. The use of hash coding allows the method to maintain high
performance even as the number of input images or the resolution of the 3D models
increases.
However, as scene complexity grows, there is a corresponding increase in computational
load, particularly in terms of memory usage and processing time. Despite this, our
method's ability to selectively allocate resources to critical areas helps mitigate
the impact, ensuring that performance remains within acceptable limits. In real-world
applications, particularly those involving the digital preservation of cultural heritage,
additional factors such as environmental conditions (e.g., lighting variability) and
the physical accessibility of artifacts can introduce variability in the input data.
This variability may affect the quality of the reconstruction, requiring further refinement
of the input dataset or additional post-processing steps to achieve the desired outcome.
4. Results and Discussion
4.1. Experimental Data Collection
The algorithm is evaluated on the ScanNet dataset, an RGB-D video dataset containing
2.5 million views in more than 1500 scans. Three scenes from ScanNet are selected
for this experiment. For each scene, 10 images of the local region are selected, and
three of them are used as the test set for novel view synthesis. The
dataset underwent a series of preprocessing steps to ensure optimal input quality
for the NeRF algorithm. Initially, images were resized to a resolution of $484 \times
648$ to standardize the input dimensions and reduce computational overhead. Following
resizing, depth maps were predicted using a model pre-trained on the ScanNet dataset.
These depth maps provide critical information about the spatial structure of scenes,
which enhances the accuracy of the NeRF reconstruction.
VSCode was used for code writing and modification; the algorithmic framework was implemented
with the PyTorch library and trained on a single Nvidia TITAN RTX2080 Ti. In the depth
map prediction phase, a pre-trained model is loaded and fine-tuned on the ScanNet
dataset to predict depth maps for the ScanNet images, and the depth information is
then added to the NeRF network for training. The algorithm uses the Adam optimizer
with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha = 10^{-5}$, and a
batch size of 214. The network is trained by minimizing the loss function defined
above until convergence.
4.2. Evaluation Indicators
In NeRF and related algorithms, quantitative evaluation is required in order to compare
the advantages and disadvantages of different methods. The quantitative evaluation
metrics are mainly the following:
Peak Signal-to-Noise Ratio: PSNR is a commonly used metric for evaluating image quality.
It computes the peak signal-to-noise ratio of an image and can be used to evaluate
the difference between the image rendered by the NeRF model and the real image. It
is calculated as

$PSNR = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{MSE}\right)$

where $MAX_I$ is the maximum possible pixel value of the image (for 8-bit images,
$MAX_I$ is usually 255) and MSE is the mean squared error between the predicted image
and the real image:

$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(P(i) - G(i)\right)^2$

where $P(i)$ is the pixel value of the predicted image, $G(i)$ is the pixel value
of the real image, and $N$ is the total number of pixels in the image.
Structural Similarity Index: SSIM is a structural similarity index that takes into
account the brightness, contrast, and structure of the image, and therefore better
evaluates the similarity between the image generated by the NeRF model and the real
image. It is calculated as

$SSIM = \frac{(2 u_p u_g + C_1)(2\sigma_{pg} + C_2)}{(u_p^2 + u_g^2 + C_1)(\sigma_p^2 + \sigma_g^2 + C_2)}$

where $u_p$ and $u_g$ are the luminance means of the predicted and real images, respectively,
$\sigma_p$ and $\sigma_g$ are their luminance standard deviations, $\sigma_{pg}$ is
the luminance covariance of the predicted and real images, and $C_1$ and $C_2$ are
constants used to prevent the denominator from being zero.
Learned Perceptual Image Patch Similarity: LPIPS is a deep-learning-based image similarity
metric that better captures the factors contributing to human-perceived image quality,
and therefore better evaluates the quality of images generated by the NeRF model.
It can be written as

$LPIPS = \frac{1}{N}\sum_{i=1}^{N}\left\| \phi(P_i) - \phi(G_i) \right\|_2^2$

where $P_i$ and $G_i$ are the i-th image patch in the predicted and real images, respectively,
and $\phi(\cdot)$ is a pre-trained convolutional neural network used to extract the
feature representation of an image patch.
Training time: Since NeRF models require substantial computational resources and time
for training, their performance also needs to be evaluated in terms of training time,
including the overall training time and the training time per epoch.
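In practice, these metrics can be computed with off-the-shelf libraries; the following sketch assumes scikit-image for PSNR/SSIM and the lpips package for LPIPS, with images given as float arrays in [0, 1].

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # pretrained perceptual network

def evaluate(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute PSNR, SSIM and LPIPS between two (H, W, 3) images with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```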
4.3. Tests Results
Tables 1 and 2 show the comparison of the quantitative metrics of the new views generated
by this algorithm and other algorithms in different scenarios. Table 1 shows the PSNR values and Table 2 shows the SSIM values, which are metrics for evaluating the similarity between images,
and the larger the value, the more similar the images are. From the table, it can
be seen that the algorithm achieves better results in each of the metrics.
Table 1. Comparison of quantitative experimental results in ScanNet dataset (PSNR$\uparrow$).
| Method | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Scene 6 |
| NSVF | 15.71 | 24.41 | 17.13 | 21.08 | 26.88 | 22.29 |
| NeRF | 15.76 | 27.73 | 16.57 | 23.94 | 25.18 | 22.76 |
| NeRF/COLMAP | 21.38 | 27.97 | 18.64 | 27.32 | 27.42 | 24.35 |
| Our method | 20.09 | 28.54 | 22.39 | 30.53 | 29.57 | 26.78 |
Table 2. Comparison of quantitative experimental results in ScanNet dataset (SSIM$\uparrow$).
| Method | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Scene 6 |
| NSVF | 0.721 | 0.821 | 0.734 | 0.792 | 0.873 | 0.795 |
| NeRF | 0.735 | 0.843 | 0.725 | 0.818 | 0.835 | 0.801 |
| NeRF/COLMAP | 0.796 | 0.882 | 0.756 | 0.896 | 0.898 | 0.837 |
| Our method | 0.782 | 0.901 | 0.802 | 0.941 | 0.926 | 0.879 |
Table 3 shows the quantitative evaluation metrics between rendered images at known viewing
angles. Data from 8 scenes are used for testing, and the average value of each metric
is taken as the data in the table. The experimental results show that the new view
generated by the algorithm proposed in this paper is similar to the real image to
a larger extent than the NeRF algorithm. By comparison, voxel-based methods tend to
suffer from lower resolution and detail in texture representation, while point cloud
approaches often struggle to maintain consistency in lighting and shading. Mesh-based
techniques, though effective in certain structured environments, often require extensive
preprocessing and are less adaptable to complex, unstructured scenes. In contrast,
our method not only excels in capturing fine details and realistic lighting but also
does so with greater computational efficiency.
Table 3. Comparison of quantitative experimental results of different methods in ScanNet
dataset.
| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| NeRF | 29.51 | 0.914 | 0.342 |
| Our method | 31.68 | 0.956 | 0.194 |
The PSNR values are visualized in Fig. 8, where both the training and testing data are taken from the ScanNet dataset. It can
be seen that the present method achieves better results in both training and testing.
On the ScanNet dataset, this algorithm generates clearer novel views than the NeRF
method. Taking PSNR, SSIM, and LPIPS as the quantitative evaluation metrics, the algorithm
achieves a PSNR of 31.68, an SSIM of 0.956, and an LPIPS of 0.194, outperforming the
NeRF algorithm on all three metrics.
Fig. 8. Comparison of the two methods (PSNR); (a) Train set; (b) Test set.
In addition to the quantitative metrics, we conducted a qualitative analysis to visually
demonstrate the improvements achieved by our method. Fig. 8 illustrates a comparison between scenes generated using our method and those created
with traditional NeRF methods. As shown, our approach produces more accurate texture
details and realistic light propagation, particularly evident in the intricate features
of cultural artifacts and natural elements such as lighting and shadow consistency.
5. Conclusion
In this paper, we presented a novel approach to the preservation and dissemination
of intangible cultural heritage (ICH) through the application of Neural Radiance Fields
(NeRF) technology. Our proposed algorithm significantly enhances the quality and efficiency
of 3D scene reconstruction and viewpoint synthesis compared to traditional methods.
By leveraging NeRF's ability to generate high-quality 3D models from a limited set
of 2D images, we developed a system that facilitates the real-time rendering and interactive
display of animated cultural scenes.
Our experiments on the ScanNet dataset demonstrated the superior performance of our
method over traditional NeRF-based approaches, achieving a PSNR of 31.68, SSIM of
0.956, and LPIPS of 0.194. The system offers data uploading, preprocessing, visualization
training, and interactive rendering, providing a comprehensive solution for digital
preservation of ICH. It preserves the visual and structural integrity of cultural
artifacts while engaging audiences with interactive experiences. However, handling
large-scale datasets and complex artifacts remains a challenge, necessitating further
optimization.
In conclusion, the integration of NeRF technology into the preservation of intangible
cultural heritage offers new avenues for safeguarding and sharing cultural expressions.
The high-quality 3D reconstructions and interactive capabilities of our system have
important theoretical and practical implications, paving the way for future research
and applications in cultural heritage preservation. The success of our method underscores
the potential of advanced digital technologies in maintaining the continuity of cultural
heritage amidst the challenges of globalization and modernization. Future work could
focus on several areas to enhance the proposed method further. First, optimizing the
algorithm's efficiency could reduce training times and computational costs, making
it more accessible for larger datasets and real-time applications. Exploring other
datasets, particularly those involving diverse cultural artifacts and environments,
would help generalize the algorithm's applicability across different contexts. Additionally,
integrating user feedback into the development process could lead to more user-friendly
interfaces and tools, ensuring that the technology meets the needs of cultural heritage
professionals and other stakeholders.