Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

ICCV 2023 (Oral)

Lukas Höllein¹*, Ang Cao²*, Andrew Owens², Justin Johnson², Matthias Nießner¹
¹Technical University of Munich, ²University of Michigan
*joint first authorship
Text2Room Teaser

Text2Room generates textured 3D meshes from a given text prompt using 2D text-to-image models.

Abstract

We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. More specifically, we propose a continuous alignment strategy that iteratively fuses scene frames with the existing geometry to create a seamless mesh. Unlike existing works that focus on generating single objects or zoom-out trajectories from text, our method generates complete 3D scenes with multiple objects and explicit 3D geometry. We evaluate our approach using qualitative and quantitative metrics, demonstrating it as the first method to generate room-scale 3D geometry with compelling textures from only text as input.

Iterative Scene Generation

We iteratively create a textured 3D mesh from a sequence of camera poses. For each new pose, we render the current mesh to obtain partial RGB and depth renderings. We complete both using the respective inpainting models conditioned on the text prompt. Next, we perform depth alignment and mesh filtering to obtain an optimal next mesh patch, which is finally fused with the existing geometry (sketched in code below the figure).

Text2Room Iterative Scene Generation
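
To make the loop concrete, here is a minimal Python sketch of the procedure, not the authors' implementation: render_mesh, inpaint_rgb, predict_depth, backproject, filter_mesh, and fuse are hypothetical placeholders for the respective components. The depth alignment step is written out as a least-squares scale-and-shift fit, a common way to register monocular depth predictions against rendered depth.

import numpy as np

def align_depth(pred_depth, rendered_depth, observed):
    """Fit a scale and shift that registers the predicted depth to the
    depth rendered from the existing mesh, using observed pixels only."""
    x = pred_depth[observed].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    scale, shift = np.linalg.lstsq(A, rendered_depth[observed].ravel(), rcond=None)[0]
    return scale * pred_depth + shift

def generate_scene(prompt, poses, mesh):
    """One generation pass: render, inpaint, align depth, fuse."""
    for pose in poses:
        # Partial RGB/depth renderings of the current mesh; `observed`
        # marks pixels already covered by existing geometry.
        rgb, depth, observed = render_mesh(mesh, pose)      # hypothetical helper
        rgb = inpaint_rgb(rgb, observed, prompt)            # text-conditioned inpainting
        pred = predict_depth(rgb)                           # monocular depth estimation
        aligned = align_depth(pred, depth, observed)
        depth[~observed] = aligned[~observed]               # keep rendered depth where known
        patch = filter_mesh(backproject(rgb, depth, pose))  # optimal next mesh patch
        mesh = fuse(mesh, patch)                            # stitch patch into the scene
    return mesh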

Two-Stage Viewpoint Selection

A key part of our method is the choice of text prompts and camera poses from which the scene is synthesized. We propose a two-stage viewpoint selection strategy that samples each next camera pose from optimal positions and subsequently refines empty regions.

Text2Room Pipeline
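
Put together, the two stages amount to running the generation loop above with two different pose sources. The driver below is a hedged sketch; empty_mesh is a hypothetical scene container, and the two pose helpers are sketched in the stage sections that follow.

def text2room(prompt):
    """Both viewpoint-selection stages chained together."""
    mesh = empty_mesh()  # hypothetical empty scene container
    # Stage 1: predefined trajectories lay out the room and its furniture.
    mesh = generate_scene(prompt, generation_trajectories(), mesh)
    # Stage 2: extra poses sampled a posteriori to inpaint remaining holes.
    mesh = generate_scene(prompt, completion_poses(mesh), mesh)
    return mesh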

Generation Stage

In the first stage, we create the main parts of the scene, including the general layout and furniture. To this end, we successively render multiple predefined trajectories in different directions that together cover the whole room.
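
The exact pose schedule is an implementation detail; the sketch below assumes one simple variant (a camera panning in a full circle near the room center, looking outward) built from a standard look-at construction, rather than the paper's actual trajectories.

import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Camera-to-world matrix looking from `eye` toward `target`
    (assumes an OpenGL-style convention: camera looks along -z)."""
    fwd = target - eye
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = np.cross(right, fwd)
    pose[:3, 2] = -fwd
    pose[:3, 3] = eye
    return pose

def generation_trajectories(steps=20, radius=0.3):
    """One hypothetical predefined trajectory: rotate near the room
    center, looking outward, so all walls are eventually covered."""
    poses = []
    for angle in np.linspace(0.0, 2.0 * np.pi, steps, endpoint=False):
        direction = np.array([np.cos(angle), 0.0, np.sin(angle)])
        eye = radius * direction
        poses.append(look_at(eye, eye + direction))
    return poses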

Completion Stage

After the first stage, the scene layout and furniture are defined. Since the scene is generated on the fly, the mesh contains holes that were not observed by any camera. We complete the scene by sampling additional poses a posteriori that look at those holes.
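
One plausible way to realize this a-posteriori sampling is to render candidate poses and keep those that see a reasonable fraction of holes. The heuristic below, including its thresholds and the sample_random_poses helper, is an assumption for illustration, not the paper's exact criterion.

def completion_poses(mesh, n_candidates=200, max_poses=30):
    """Keep candidate poses that look at holes in the current mesh."""
    poses = []
    for pose in sample_random_poses(mesh, n_candidates):  # hypothetical sampler
        _, depth, observed = render_mesh(mesh, pose)
        hole_ratio = 1.0 - observed.mean()  # fraction of unobserved pixels
        if not 0.0 < hole_ratio < 0.5:
            continue  # fully observed already, or looking into the void
        if depth[observed].min() < 0.2:
            continue  # too close to existing geometry (heuristic threshold)
        poses.append(pose)
        if len(poses) == max_poses:
            break
    return poses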

Interactive 3D Mesh Viewer

Prompt: "A living room with a lit furnace, couch, and cozy curtains, bright lamps that make the room look well-lit."

BibTeX

@InProceedings{hoellein2023text2room,
    author    = {H\"ollein, Lukas and Cao, Ang and Owens, Andrew and Johnson, Justin and Nie{\ss}ner, Matthias},
    title     = {Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {7909-7920}
}