A Quick Introduction to NeRFs

Deep Gan Team
4 min read · Nov 1, 2023

Among the award-winning papers from the 2023 Computer Vision and Pattern Recognition (CVPR) conference is a view-synthesis paper named DynIBaR: Neural Dynamic Image-Based Rendering. It is a progression of work beyond NeRFs (Neural Radiance Fields). A NeRF is a computer vision model designed to generate 3D scene representations from 2D images and their associated camera positions, first introduced in a research paper in 2020. Its core concepts are a fully connected neural network, volume rendering, and hierarchical volume sampling. In this blog post, we explain the basics of the NeRF algorithm and its use cases.

Algorithm

NeRFs (Neural Radiance Fields) are a class of algorithms that render novel camera views given a set of images and their camera positions. Note that the images are rendered from new camera positions not present in the original data. The neural network takes in a 5D coordinate along a ray cast into the scene (x, y, z, θ, φ), a 3D position plus a viewing direction, and outputs an RGB color and a volume density.
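
In the notation of the original NeRF paper, the scene is represented by a single function F_Θ, approximated by a neural network with weights Θ:

F_Θ : (x, y, z, θ, φ) → ((r, g, b), σ)

That is, a 3D location together with a viewing direction maps to an emitted color and a volume density.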

This NeRF model can then be used to volume-render a scene. A ray cast into the scene can be computed from an arbitrary camera position, and the neural network returns sampled RGB values along the ray together with volume densities. To render an image from a specific viewpoint, we cast a ray through each pixel into the scene and infer the RGB values from the neural network model. New views are generated by querying the 5D coordinates sampled along each camera ray. Once rendered images are obtained, they can be compared with the original input images on a per-pixel basis. The NeRF algorithm then optimizes the model so that, for a given camera view, it renders images as close to the ground truth as possible.
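
To make the ray casting concrete, here is a NumPy sketch that generates one ray per pixel of an image and samples 3D points along each ray. It assumes a pinhole camera with focal length focal (in pixels) and a 4×4 camera-to-world matrix c2w; the function names are our own, not from any NeRF library.

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """Generate one ray (origin, direction) per pixel of an H x W image."""
    i, j = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
    # Pixel directions in the camera frame (camera looks down -z).
    dirs = np.stack(
        [(i - W / 2) / focal, -(j - H / 2) / focal, -np.ones_like(i, dtype=float)],
        axis=-1,
    )
    # Rotate directions into the world frame; all origins are the camera center.
    rays_d = dirs @ c2w[:3, :3].T
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d

def sample_along_rays(rays_o, rays_d, near, far, n_samples):
    """Sample 3D points at evenly spaced depths t in [near, far] along each ray."""
    t = np.linspace(near, far, n_samples)
    pts = rays_o[..., None, :] + rays_d[..., None, :] * t[:, None]  # (H, W, n_samples, 3)
    return pts, t
```

The viewing direction (θ, φ) of every sample on a ray is simply that ray's direction, so together with the sampled points this gives the 5D coordinates the network is queried with.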

Volumetric rendering

The 5D neural radiance field represents a scene as the volume density and directionally emitted radiance at any point in space. The color of a ray passing through the scene is a nested integral. First, T(t) is the integrated volume density, denoting the accumulated transmittance along the ray from t_n to t, i.e., the probability that the ray travels from t_n to t without hitting any other particle. The volume density σ(x) can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location x. The outer integral accumulates the colors along the ray, weighted by transmittance and density.
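
Written out (this is the rendering integral from the original NeRF paper), the expected color C(r) of a camera ray r(t) = o + t·d with near and far bounds t_n and t_f is:

C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt,  where  T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds)

Here c(r(t), d) is the color emitted at point r(t) when viewed from direction d.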

Here is an example of a ray's values as a function of distance.

Note that this representation is continuous. In practice, NeRF estimates the integral numerically with quadrature, evaluating the network at discrete sample points along each ray rather than storing a discretized voxel grid.
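
Concretely, the paper approximates the integral with the alpha-compositing sum Ĉ(r) = Σ_i T_i (1 − exp(−σ_i δ_i)) c_i, where δ_i is the distance between adjacent samples. A minimal NumPy sketch of that rule for a single ray (variable names are our own):

```python
import numpy as np

def composite_ray(sigma, rgb, t):
    """Estimate a ray's color by quadrature over n samples.

    sigma: (n,) densities, rgb: (n, 3) colors, t: (n,) sample depths.
    """
    delta = np.diff(t, append=1e10)        # distances between adjacent samples
    alpha = 1.0 - np.exp(-sigma * delta)   # opacity contributed by each segment
    # T_i: probability the ray reaches sample i without being blocked earlier.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                # per-sample contribution to the pixel
    return (weights[:, None] * rgb).sum(axis=0)
```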

Data

The data used to train a NeRF model are video or image data together with the positional data (i.e., rotation and translation) of the camera. For example, a video recorded on your iPhone or Android camera will suffice. Note that the camera extrinsics are not captured by the camera itself and must be obtained separately, typically by estimating them with a structure-from-motion tool such as COLMAP.

Training the network

To train a NeRF model, we essentially overfit a model to our scene. Contrary to the general wisdom of building generalizable models, the NeRF approach fits a model to one specific scene, and that model can then be used in volumetric rendering to infer any novel view of that scene.

The model is trained to output RGB values and volume densities along each ray. The input data set includes the camera extrinsics for each frame, so every pixel of every frame corresponds to a known ray through 3D space. Because different images observe the same 3D points from different angles, the training data effectively contains RGB samples of each point from multiple viewing directions. The model fits a function that can infer the RGB value at a 3D point as seen from a range of angles, and carrying out this fitting across all points in the scene yields the NeRF model.
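
In practice, this fitting is gradient descent on a photometric loss: render a batch of rays, compare the result with the ground-truth pixel colors, and backpropagate. A rough PyTorch sketch of one step, assuming a model that maps sampled positions and viewing directions to (RGB, σ) and a render_rays helper implementing the quadrature shown earlier (both hypothetical names):

```python
import torch

def train_step(model, optimizer, rays_o, rays_d, target_rgb,
               near=2.0, far=6.0, n_samples=64):
    """One optimization step on a batch of B rays."""
    t = torch.linspace(near, far, n_samples, device=rays_o.device)
    # Sample points along each ray: (B, n_samples, 3).
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]
    dirs = rays_d[:, None, :].expand_as(pts)   # viewing direction per sample
    rgb, sigma = model(pts, dirs)              # query the radiance field
    pred_rgb = render_rays(rgb, sigma, t)      # alpha-composite to (B, 3) pixels
    loss = torch.mean((pred_rgb - target_rgb) ** 2)  # per-pixel MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```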

Neural Network Architecture

Note that the neural networks used in the NeRF algorithm are fully connected networks rather than other architectures like CNNs or transformers. Since the input to the network consists of only five values (the x, y, z position plus θ and φ for the viewing direction), architectures that try to exploit structural patterns in the input, such as convolutional kernels or attention, are not a natural fit for this task.
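
A minimal sketch of such a fully connected network in PyTorch. The actual model in the paper is similar in spirit but also applies a positional encoding to the inputs, uses a skip connection, and injects the viewing direction late in the network; all of that is omitted here for brevity.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Fully connected network: (x, y, z, θ, φ) -> ((r, g, b), σ)."""

    def __init__(self, in_dim=5, width=256, depth=8):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, width), nn.ReLU()]
            dim = width
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(width, 4)  # 3 color channels + 1 density

    def forward(self, coords):
        out = self.head(self.trunk(coords))
        rgb = torch.sigmoid(out[..., :3])  # colors constrained to [0, 1]
        sigma = torch.relu(out[..., 3])    # density must be non-negative
        return rgb, sigma
```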

Pros and Cons of NeRFs

While NeRFs offer interesting benefits, capturing subtle lighting changes and inferring novel views (even from inside solid objects), they present challenges that have made them impractical in many settings. At inference time, the model must be evaluated at many sample points for every single pixel you render; at 800×800 resolution with a couple hundred samples per ray, that is on the order of a hundred million network queries per frame, far too many for real-time rendering. Workarounds so far either give up some of the benefits of having the NeRF model render each pixel or rely on faster, more powerful hardware to accelerate inference and perform the rendering. In addition, there is still the training time needed to generate the NeRF model in the first place, with the base formulation taking hours to days per scene.

Use cases

NeRFs can be used for AR (Augmented Reality) by adding virtual 3D objects into real-world scenes. Because NeRFs capture view-dependent lighting changes, objects or scenes rendered with NeRFs can appear more “realistic” as one moves around the NeRF scene. However, NeRF rendering is a different process from traditional graphics rendering pipelines, so you are often left with a choice: keep a purely NeRF-rendered scene, or lose some information by converting the NeRF scene or object to a textured mesh such as an OBJ file.
