Unofficial (partial) reimplementation of GeNVS by Chan et al.




After the success of text-to-image models, generating 3D worlds and environments from text and image descriptions is one of the main challenges for generative models.

Optimization-based models, such as DreamFusion, have shown great promise at generating individual objects, but are computationally expensive. Other models have been developed, such as 3D-GANs, and 3D-diffusion models, which with enough training data should prove a great way to generate individual objects. However, generating a whole environment with multiple different objects is a more complex task.

Recently, Chan et al. proposed GeNVS, a model which converts an image to a 3D representation. A denoising diffusion model, conditioned on the rendering of this 3D representation from a novel camera position, is used to generate a plausible image from this view. The diffusion model allows the approach to refine missing details and generate unseen details in this new view.

Whilst this model is here trained on single images, I believe that this class of approach is the most promising route to generating 3D environments. This will later require substantial improvement in the way in which multiple volumetric representations are combined.

The point of this repository is to make a simple attempt to reproduce the model of Chan et al., using relatively limited computational resources.


Neural radiance fields - NeRFs for short - are a volumetric representation of a 3D object or environment. Whilst they are commonly implemented using a Deep Learning toolkit, they are not really a "Deep Learning" method: the "Neural" part refers to the representation of the volumetric field, which at least in the initial implementations was an implicit multilayer perceptron (MLP). These representations are then optimized using similar algorithms to those used in Deep Learning (Adam etc). More recent NeRF methods use voxels, triplanes or hash-tables as a (partially) explicit representation of the volumetric field.

This volumetric field is used to generate images from specified camera viewpoints using a volume rendering approach. At each point along a camera ray, the volumetric field supplies a density and an emitted colour. These are integrated along the ray (using a rendering equation) to generate the colour of the pixel.
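A minimal sketch of the integration along a single ray, using the standard discretised NeRF compositing rule. The function name, sample spacing and toy values are illustrative assumptions, not code from this repository.

```python
import numpy as np

def render_ray(density, colour, deltas):
    """Composite samples along one ray, front to back.

    density: (N,) non-negative densities at the sample points
    colour:  (N, C) emitted colour (or latent feature) at each sample
    deltas:  (N,) length of the ray segment belonging to each sample
    """
    alpha = 1.0 - np.exp(-density * deltas)            # opacity of each segment
    # Transmittance: fraction of light reaching sample i without being absorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                             # contribution of each sample
    pixel = (weights[:, None] * colour).sum(axis=0)
    opacity = weights.sum()
    return pixel, opacity

# Toy ray: empty (blue) space followed by a nearly opaque red surface.
density = np.array([0.0, 0.0, 50.0, 50.0])
colour = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
pixel, opacity = render_ray(density, colour, deltas=np.full(4, 0.25))
```

Samples with zero density contribute nothing, so the pixel takes the colour of the first dense region the ray meets. The same weights can composite a multi-channel latent feature instead of RGB.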

The task (Novel View Synthesis)

Given a view of an object from one direction, Novel View Synthesis is the task of reconstructing views of the object from other camera positions. This requires a model that is able to synthesise the unseen parts of the object. As in many cases this problem is underdetermined, with a distribution of different possible outputs, a generative model which takes noise as an input is preferable to a regression model. The objective for a regression model (typically least squares or L1) optimizes the model to produce the mean or median potential output, which can be grey and/or amorphous. Generative models, which generally have a noise input, are able to generate a distribution of outputs for a given conditioning input.
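A small numerical illustration of why a least-squares objective gives grey output, assuming a hypothetical unseen pixel that is black in half of the plausible completions and white in the other half (the data here are synthetic).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic targets: 0.0 (black) or 1.0 (white), each with probability 0.5.
targets = rng.integers(0, 2, size=10_000).astype(float)

# Brute-force search for the single prediction that minimises the squared error.
candidates = np.linspace(0.0, 1.0, 101)
mse = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)
best_l2 = candidates[mse.argmin()]    # close to the mean of the targets (mid grey)

# A generative model instead returns one draw from the distribution.
sample = rng.choice(targets)          # either black or white, never grey
```

The least-squares optimum sits near 0.5, a value that never occurs in the data, whereas a sample is always one of the plausible outcomes.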


GeNVS by Chan et al. combines three main learnable components:

  1. A network which projects images to features of a NeRF (aligned with the camera frustum). This NeRF is a voxel grid (like DVGO), of dimension 64x64x32x16 (height x width x depth x channels).

  2. A NeRF, used for volume rendering. This has two small MLPs which map the interpolated features at each point in 3D to opacity (1-channel) and a 16-channel latent feature. Using the standard NeRF volume rendering approach, these volumetric features are used to render a 16-channel, 64x64 image from the novel viewpoint.

  3. A denoising network, based on EDM of Karras et al. This uses the 16-channel rendered image (bilinearly upscaled to 128x128) to condition a denoising diffusion model, which generates the novel view. This is trained with a denoising objective.
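A sketch of the denoising objective for component 3, using the preconditioning and loss weighting from the EDM paper. Two assumptions are made here: that the rendered features are concatenated with the noisy image along the channel axis, and the placeholder network `zero_net`. Neither is taken from this repository.

```python
import numpy as np

SIGMA_DATA = 0.5  # data standard deviation used in the EDM paper

def edm_denoise(raw_net, x_noisy, sigma, cond):
    """Wrap a raw network with the EDM preconditioning to obtain D(x; sigma)."""
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / np.sqrt(sigma**2 + SIGMA_DATA**2)
    c_in = 1.0 / np.sqrt(sigma**2 + SIGMA_DATA**2)
    c_noise = np.log(sigma) / 4.0
    # Assumption: conditioning features are concatenated along the channel axis.
    net_in = np.concatenate([c_in * x_noisy, cond], axis=0)
    return c_skip * x_noisy + c_out * raw_net(net_in, c_noise)

def edm_loss(raw_net, x0, cond, sigma, rng):
    """Weighted squared error between the denoised image and the clean image."""
    noise = rng.standard_normal(x0.shape)
    denoised = edm_denoise(raw_net, x0 + sigma * noise, sigma, cond)
    weight = (sigma**2 + SIGMA_DATA**2) / (sigma * SIGMA_DATA) ** 2
    return weight * np.mean((denoised - x0) ** 2)

def zero_net(net_in, c_noise):
    """Placeholder network: ignores its inputs and predicts zeros for 3 channels."""
    return np.zeros((3,) + net_in.shape[1:])

rng = np.random.default_rng(0)
x0 = SIGMA_DATA * rng.standard_normal((3, 128, 128))   # target novel view
cond = rng.standard_normal((16, 128, 128))             # upscaled rendered features
loss = edm_loss(zero_net, x0, cond, sigma=1.0, rng=rng)
```

In training, `sigma` would be drawn at random for each example and `raw_net` would be the UNet.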

Part of the motivation for trying to reproduce this method is that the individual components are relatively lightweight (<100M parameters) and, at least in principle, could be pre-trained individually before being combined and fine-tuned.

Other approaches for this task

  • PixelNeRF. This is very similar to the network used in part 1, but generates an RGB NeRF which is used directly as the output. As this is a regression method, it is difficult to avoid blurriness / grey in the unseen portions of the model (as the objective tries to ensure that the model predicts the mean/median of the possible distribution).

  • VisionNeRF. I believe that this essentially behaves like a very good PixelNeRF, with a more powerful architecture. It still struggles with detail on the unobserved parts of the object (but copies colour across etc.)

  • NerfDiff. This is a similar approach to GeNVS but uses an RGB triplane NeRF - camera-aligned but not frustum-aligned. This model uses a larger denoising UNet (400M-1B), with the option of cross attention to the input view. As this outputs RGB images, it's possible to use a neat method a bit like score-distillation sampling during inference of the NeRF.

  • 3DiM. UNet with cross attention between views. A pose encoding of the camera ray start position and direction is also supplied to the denoising model. This proves difficult to train.

  • Zero123(XL). This is like 3DiM but based on Stable Diffusion, conditioning generation on the original view, an encoding of the original view, and the relative camera angle. This seems to work pretty well, but being Stable-Diffusion based it is quite difficult to train. There is no explicit 3D representation of the object, so consistency is not guaranteed - see One-2-3-45 for a neat method that quickly resolves this inconsistency.

Differences with the paper

There are some elements that are either unclear in the paper, deliberately changed, or just different by accident.

  • View-conditioned denoiser uses the k-diffusion implementation of EDM. This is similar to the original EDM implementation.

  • Auxiliary losses applied on volume rendered RGB images at novel views (first three channels of rendered latents). This probably impedes optimal performance later, but permits pre-training of just the input network and NeRF, independent of the denoising network. This RGB rendering is by necessity blurry (and cannot deal with ambiguity), but seems to train faster than the diffusion model.

  • Denoising diffusion model separately pre-trained (on the same training dataset as used for the whole model), and then combined with the input image->NeRF model.

  • Also (small) auxiliary losses applied to the occupancy of the rendered views. This attempted to stop the NeRF filling the entire volume of the frustum, but at the weighting I used I believe it had little effect in practice.

  • Image to NeRF model tends towards partial billboarding, with detail placed between the object and the camera. Attempted to correct this with an additional loss penalizing differences in depth and opacity between the image where the source view is in the same position as the camera, and the image where the source view is in a different position. This didn't seem to help massively - the model just seemed to generate more background density. Training with a depth objective would perhaps be a better approach, but the SRN dataset does not have depth images.

  • Only one or two (not three) views supplied for each batch, in order to train at batch size>1 on consumer hardware. (For later training up to three views supplied.)

  • Increased noise level in diffusion model - during training this seems to help the model give good predictions of x0, conditioned on the NeRF renderings, far more quickly than the default setting - but may hinder sampling high resolution details later.

  • Stochastic sampling - for whatever reason (insufficient training? discrepancy between training and sampling noise levels?) the deterministic samplers perform poorly on this dataset. The model here uses 250 steps of the Euler-like method from the EDM paper.

  • Simplistic autoregressive sampling - conditioning on supplied image, up to 4 intermediate images, and the previously generated image. Greatly improves sampling output but still flickers a bit with current trained model. Note that (to work well) sampling should start from a camera position near the supplied image and move gradually away from it.
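The stochastic Euler-like sampling mentioned in the list above can be sketched as follows. This is a simplified NumPy version of an Euler sampler with "churn" (noise re-injection) on the Karras noise schedule. It omits the optional churn range limits and noise scale from the EDM paper. The toy denoiser and the `s_churn` value are illustrative assumptions.

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels of the EDM schedule (high to low), followed by a final 0."""
    ramp = np.linspace(0.0, 1.0, n)
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return np.append((hi + ramp * (lo - hi)) ** rho, 0.0)

def sample_euler_churn(denoise, x, sigmas, s_churn=0.0, rng=None):
    """Euler steps; s_churn > 0 adds fresh noise before every step."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(sigmas) - 1
    gamma = min(s_churn / n, np.sqrt(2.0) - 1.0)
    for i in range(n):
        sigma_hat = sigmas[i] * (1.0 + gamma)
        if gamma > 0:
            # Raise the noise level from sigmas[i] to sigma_hat.
            x = x + np.sqrt(sigma_hat**2 - sigmas[i] ** 2) * rng.standard_normal(x.shape)
        d = (x - denoise(x, sigma_hat)) / sigma_hat    # estimate of dx/dsigma
        x = x + (sigmas[i + 1] - sigma_hat) * d        # Euler step to the next level
    return x

# Toy denoiser for a "dataset" containing a single point.
target = np.array([0.3, -0.7])
sigmas = karras_sigmas(250)
rng = np.random.default_rng(0)
x_init = sigmas[0] * rng.standard_normal(target.shape)
sample = sample_euler_churn(lambda x, sigma: target, x_init, sigmas, s_churn=40.0, rng=rng)
```

Setting `s_churn=0` gives the deterministic Euler sampler. Larger values inject more noise at each step, which can correct errors accumulated in earlier steps.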


python [transfer=path_to_ckpt]

Config file in config/config_{high,med,low}_noise.json

Actual training procedure was convoluted - pre-train diffusion model at 64x64, train with different image -> NeRF model, upscale to 128x128, and then replace image->NeRF model by DeepLabV3+.

TODO: Retrain with clearer procedure (on different dataset - chairs or COCO3D).

Data Preparation

Visit the SRN repository, download the data, and extract the files into /data/. Here we use 90% of the training data for training and 10% as the validation set.
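One way to make a reproducible 90/10 split is sketched below. Splitting by object (rather than by individual view), the seed, and the object names are assumptions made for illustration. The repository's own split may differ.

```python
import random

def split_objects(object_ids, val_fraction=0.1, seed=0):
    """Shuffle object IDs deterministically and hold out a validation fraction."""
    ids = sorted(object_ids)                 # fixed order before shuffling
    random.Random(seed).shuffle(ids)
    n_val = int(round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]          # (train, validation)

# Hypothetical object names.
train_ids, val_ids = split_objects([f"car_{i:04d}" for i in range(100)])
```

Splitting by object keeps every view of a validation car out of the training set.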

From , there is a pickle file that contains available view-png files per object.

Note that ShapeNet is only licensed for non-commercial use.

Pre-trained Model Weights

Model weights available at

Training procedure somewhat complex - original model generated by pretraining 64x64 diffusion model and image to NeRF models, combining them and then further finetuning at resolutions 64 and 128.

Increased noise levels (lognormal distribution with mean 1.0, standard deviation 1.4) were used for this (see k-configs/config_64_cars_noisy.json), and then reduced later in training. These are much higher noise levels than in the original paper, but we want good predictions of the denoised image at high noise levels.

Noise levels were dropped slightly (mean 0.5, standard deviation 1.4) for further fine-tuning. These are still considerably higher than those in the EDM paper (mean -1.2, standard deviation 1.2). There is scope for further experimentation - this task is different to standard diffusion as the conditioning with an extra image is far more informative than a text prompt.
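To give a feel for how different these settings are, the sketch below draws training noise levels for each one. It assumes, as in the EDM paper, that the quoted mean and standard deviation describe ln(sigma). The dictionary keys are labels invented for this example.

```python
import numpy as np

def sample_sigmas(n, mean, std, rng):
    """Draw noise levels with ln(sigma) ~ Normal(mean, std^2)."""
    return np.exp(mean + std * rng.standard_normal(n))

rng = np.random.default_rng(0)
settings = {
    "edm_default": (-1.2, 1.2),
    "pretraining": (1.0, 1.4),
    "fine_tuning": (0.5, 1.4),
}
medians = {
    name: float(np.median(sample_sigmas(100_000, mean, std, rng)))
    for name, (mean, std) in settings.items()
}
```

The median noise level of a lognormal distribution is exp(mean), so the pre-training setting has a median sigma roughly nine times higher than the EDM default (exp(1.0) against exp(-1.2)).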

Current results

Conditioned on a single view

Sampling using a fixed number of input views (default 1), around a spiral
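A sketch of how camera positions along such a spiral could be generated. The radius, elevation range and number of turns are made-up values for illustration, not the settings used for these samples.

```python
import numpy as np

def spiral_cameras(n_views, radius=1.3, elev_min=10.0, elev_max=60.0, turns=2.0):
    """Camera positions on a sphere, rising in elevation while circling the object."""
    t = np.linspace(0.0, 1.0, n_views)
    azim = 2.0 * np.pi * turns * t
    elev = np.radians(elev_min + (elev_max - elev_min) * t)
    return radius * np.stack(
        [np.cos(elev) * np.cos(azim), np.cos(elev) * np.sin(azim), np.sin(elev)],
        axis=-1,
    )

positions = spiral_cameras(40)   # (40, 3) array, each camera looking at the origin
```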

Conditioning image


Novel views generated (upper - denoised samples, lower- RGB renderings from NeRF)

Stochastic sampling

python --transfer=genvs-unofficial/ --stochastic

Deterministic sampling

python --transfer=genvs-unofficial/ --prefix cars_det

Sampling progress (stochastic)

python --transfer=genvs-unofficial/ --stochastic --progress

Sampling progress (deterministic)

python --transfer=genvs-unofficial/ --progress --prefix cars_det

Another initial view



Unconditional samples (supply pure noise conditioning image to diffusion model)


python --transfer=genvs-unofficial/ --stochastic --progress --unconditional --prefix uc




python --transfer=genvs-unofficial/ --progress --unconditional --prefix uc_det



Autoregressive sampling

This produces much better results than sampling from a single view, which strongly suggests that it is required for decent levels of multi-view consistency. It still struggles a little with flickering, and with consistency of details between the different sides of the vehicle. It is unclear whether this is due to insufficient training of the Image -> NeRF network, or a deficiency arising because pairs of features on opposite sides of the vehicle can never appear together in a single image. In the latter case, cross attention between views (3DiM, NerfDiff) may be a sensible addition to the denoising model.
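A sketch of how the conditioning frames could be chosen for each new frame: the supplied image, the previously generated frame, and up to 4 frames in between. Picking the intermediate frames evenly spaced is an assumption made for this example, and the function name is hypothetical.

```python
def conditioning_indices(frame, max_intermediate=4):
    """Frames used to condition frame number `frame` (frame 0 is the supplied image)."""
    if frame < 1:
        raise ValueError("frame 0 is the supplied image itself")
    previous = frame - 1
    between = list(range(1, previous))       # strictly between supplied and previous
    k = min(max_intermediate, len(between))
    # Assumption: intermediate frames are spread evenly over the frames so far.
    picks = [between[(2 * j + 1) * len(between) // (2 * k)] for j in range(k)]
    return sorted({0, previous, *picks})

first = conditioning_indices(1)     # only the supplied image is available
later = conditioning_indices(10)    # supplied image, 4 intermediates, previous frame
```

Because each frame is conditioned on the previous one, the camera should move gradually so that consecutive views overlap.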

First frame of each video is the conditioning image.


Additional approach - ControlNet built on segmind/tiny-sd

The controlnet branch contains an alternative approach in which a Stable Diffusion ControlNet, using the compressed architecture of segmind/tiny-sd, replaces the denoising diffusion model.

Denoising diffusion model pretrained on ShapeNet cars, image -> NeRF model taken from earlier model above. Further trained on ShapeNet cars for ~150 epochs.

NOTE THAT DIFFUSERS NEEDS PATCHING TO ALLOW CONTROLNETS WITH TINY-SD (the lack of mid-block additional residuals makes it think it is a T2I-Adapter).

Checkpoint available at

Sampling as for earlier model, but Classifier Free Guidance with cfg>1 is required for good results. Code also contains a (probably buggy) wrapper permitting use of the k-diffusion samplers (with non-zero churn) with such ControlNets. The ControlNet has an additional class input: 0 = conditioned on NeRF output, 1 = unconditional.
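The usual classifier-free guidance rule, combined with the class input described above, can be sketched as follows. The `model` call signature and the toy model are hypothetical stand-ins, not the interface of the code in the branch.

```python
import numpy as np

COND, UNCOND = 0, 1   # class input: 0 = conditioned on NeRF output, 1 = unconditional

def guided_prediction(model, x, sigma, nerf_features, cfg=2.0):
    """Extrapolate from the unconditional prediction towards the conditional one."""
    cond = model(x, sigma, nerf_features, class_label=COND)
    uncond = model(x, sigma, nerf_features, class_label=UNCOND)
    return uncond + cfg * (cond - uncond)

def toy_model(x, sigma, nerf_features, class_label):
    """Hypothetical model: returns ones when conditioned, zeros otherwise."""
    return np.ones_like(x) if class_label == COND else np.zeros_like(x)

x = np.zeros((4, 16, 16))
out = guided_prediction(toy_model, x, sigma=1.0, nerf_features=None, cfg=2.0)
```

With cfg=1 this reduces to the conditional prediction. Values above 1 push the output further away from the unconditional prediction.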

Samples shown with cfg=2, churn=40/250 (stochastic sampling), churn=0 (deterministic sampling). Autoregressive sampling with cfg=2, churn=0. All with 100 timesteps, Karras schedule.

Stochastic, non-autoregressive

Car 0





Car 1



Car 2



Car 3




Car 0


Car 1






(Initial conditioning on first frame of video)


Multi-view denoising (Stable diffusion)

Motivated by a series of recent papers, notably MVDream, Viewset Diffusion and SyncDreamer, I further refined the model to denoise multiple views at the same time.

This extends the "tiling" trick used with Stable Diffusion to generate more consistent videos (see Text2Video-Zero, where self-attention was extended across the frames of the video).

The model was trained to jointly denoise 4 random views at the same time - one or two supplied conditioning views, and two to three unseen views - with extended cross-attention.
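The idea of extending attention across views can be sketched in NumPy as below: every query attends to the keys and values of all views rather than only its own. The learned projections and multiple heads of a real UNet attention layer are left out, and the shapes are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def cross_frame_attention(q, k, v):
    """q, k, v: (views, tokens, dim). Each view attends to the tokens of all views."""
    views, tokens, dim = k.shape
    k_all = np.broadcast_to(k.reshape(1, views * tokens, dim), (views, views * tokens, dim))
    v_all = np.broadcast_to(v.reshape(1, views * tokens, dim), (views, views * tokens, dim))
    return attention(q, k_all, v_all)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 64, 32)) for _ in range(3))   # 4 views
out = cross_frame_attention(q, k, v)                             # (4, 64, 32)
```

The output has the same shape as per-view attention, so the layer can be swapped in without changing the rest of the network.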

At inference time, for a moderate number of views it is possible to denoise every view at once. However, for larger viewsets this is not possible. The strategy is to take some of the previously generated (noise-free) latents and add the appropriate amount of noise to them for the current timestep. These are then concatenated with the noisy latents and passed to the denoising UNet (with cross-frame attention). This tends to add consistency between views. For further consistency, in non-autoregressive sampling the input view is always used as one of these views (as it is by far the easiest view to denoise). For autoregressive sampling, the conditioning views are used as the extra latents.
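A sketch of the re-noising step, assuming the k-diffusion style parameterisation in which a latent at noise level sigma is the clean latent plus sigma times Gaussian noise. The function name and the latent shapes are illustrative assumptions.

```python
import numpy as np

def build_denoiser_input(noisy_latents, clean_latents, sigma, rng):
    """Re-noise finished latents to the current level and append them to the batch.

    noisy_latents: (n_new, C, H, W) views currently being denoised
    clean_latents: (n_ref, C, H, W) previously generated, noise-free views
    Returns the joint batch and the number of leading entries to keep afterwards.
    """
    renoised = clean_latents + sigma * rng.standard_normal(clean_latents.shape)
    batch = np.concatenate([noisy_latents, renoised], axis=0)
    return batch, len(noisy_latents)

rng = np.random.default_rng(0)
noisy = rng.standard_normal((3, 4, 16, 16)) * 5.0
reference = rng.standard_normal((2, 4, 16, 16))
batch, n_new = build_denoiser_input(noisy, reference, sigma=5.0, rng=rng)
# After the joint denoising step only the first n_new outputs are kept.
```

Re-noising puts the reference views at the same noise level as the views being denoised, so all entries of the batch look alike to the UNet.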

Checkpoint at

Stochastic, non-autoregressive sampling

Car 0





Car 1





Car 2



Car 3



CarB 0



CarB 1



CarB 2



CarB 3



Unconditional samples

Attention in groups of two (generated) images




  • Further training of model on a real multi-GPU system.

  • Investigate inference strategy further - which images to retain in conditioning, and whether to resample views?

  • Increase augmentation amount - current denoising model struggles with views which differ substantially from training set.

  • Train on larger, more general dataset.

  • Explore noise range schedules during training - start with fairly high noise levels and drop over time.

  • Also explore LR schedule.

  • Get a decent PixelNeRF to use as a starting point for training.

  • Similarly, obtain a decent k-diffusion model to fine-tune rather than train from scratch.

  • Explore using a T2I-Adapter rather than a big ControlNet.

  • Like SyncDreamer, use the rendered image as an additional input to Zero123, rather than SD.


K-diffusion from Katherine Crowson and others

NeRF rendering using an old version of ashawkey's excellent

Some data pipeline from and

DeepLabV3+ implementation from

