Zero-Shot Text-Guided Object
Generation with Dream Fields

Ajay
Jain
UC Berkeley, Google Research
Ben Mildenhall
Google Research
Jonathan T. Barron
Google Research
Pieter
Abbeel
UC Berkeley
Ben
Poole
Google Research
Paper (arXiv) Code Colab notebook demo

Abstract

We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods only generate objects from a handful of categories, such as ShapeNet. Instead, we guide generation with image-text models pre-trained on large datasets of captioned images from the web. Our method optimizes a Neural Radiance Field from many camera views so that rendered images score highly with a target caption according to a pre-trained CLIP model. To improve fidelity and visual quality, we introduce simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures. In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.


Example generated objects

Dream Fields can be trained with diverse captions written by artists or from COCO. Descriptions control the style of generated objects, such as color and context.

bouquet of flowers sitting in a clear glass vase.
a sculpture of a rooster.
Slide 1 of 11.

Compositional generation

The compositional nature of language allows users to combine concepts in novel ways and control generation. A template prompt describing a primary object (an armchair or a teapot) is stylized with 16 materials: avocado, glacier, orchid, pikachu, brain coral, gourd, peach, rubik's cube, doughnut, hibiscus, peacock, sardines, fossil, lotus root, pig, or strawberry. These prompt templates are sourced from DALL-E.

an archair in the shape of a ____.
an archair imitating a ____.
a teapot in the shape of a ____.
a teapot imitating a ____.

Related publications

Overview of DietNeRF's semantic consistency loss.

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis

Ajay Jain, Matthew Tancik, Pieter Abbeel ICCV 2021 International Conference on Computer Vision

DietNeRF regularizes Neural Radiance Fields with a CLIP-based loss to improve 3D reconstruction. Given only a few images of an object or scene, we reconstruct its 3D structure & render novel views using prior knowledge contained in large image encoders.

mip-NeRF integrated positional encoding

Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields

Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, Pratul Srinivasan ICCV 2021 International Conference on Computer Vision

NeRF is aliased, but we can anti-alias it by casting cones and prefiltering the positional encoding function. Dream Fields combine mip-NeRF's integrated positional encoding with Fourier features.


Citation

Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole. Zero-Shot Text-Guided Object Generation with Dream Fields. arXiv, 2021.

@article{jain2021dreamfields,
  author = {Jain, Ajay and Mildenhall, Ben and Barron, Jonathan T. and Abbeel, Pieter and Poole, Ben},
  title = {Zero-Shot Text-Guided Object Generation with Dream Fields},
  joural = {arXiv},
  month = {December},
  year = {2021},
}