[Figure: panels labeled Coarse-Grained Prompt, Medium-Grained Prompt, and Fine-Grained Prompt.]
Realistic 3D indoor scene synthesis is crucial for Embodied AI and digital content creation. However, achieving high fidelity, strong generalization, and precise controllability remains challenging because scenes must satisfy complex semantic and physical constraints. Existing methods follow two paradigms: (1) training models on layout datasets to directly generate numerical 3D layouts, which often generalizes poorly to unseen room types; and (2) using LLMs/VLMs to produce open-vocabulary intermediate representations (e.g., scene graphs) followed by constraint-based optimization, which improves plausibility but sacrifices flexibility because the rules are predefined. Both approaches struggle to adapt to fine-grained user requirements. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions, without relying on intermediate representations or constraint-based optimization. DirectLayout decomposes generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements for plausibility. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) activation based on the 3D-Front dataset. Additionally, we design a CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment, which relies on in-context learning. Extensive experiments demonstrate that DirectLayout achieves strong semantic consistency, generalization, and physical plausibility.
Training Stage: The BEV Layout Generator is first fine-tuned on BEV layouts curated from the 3D-Front dataset, guided by CoT annotations generated by GPT-4o. It is then further optimized with DPO, using the CoT-Grounded Generative Layout Reward derived from the Spatial Evaluator (a VLM) and the Quantitative Evaluator (a reasoning LLM).
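To make the DPO step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each candidate layout has already received one scalar score from each evaluator, that the two scores are combined by a weighted average, and that the best and worst candidates form the preference pair. The names `combined_reward`, `pick_preference_pair`, the weight `w`, and the value of `beta` are hypothetical; only the form of the DPO loss itself is standard.

```python
import math
from typing import List, Tuple


def combined_reward(spatial: float, quantitative: float, w: float = 0.5) -> float:
    """Weighted average of the two evaluator scores (assumed combination rule)."""
    return w * spatial + (1.0 - w) * quantitative


def pick_preference_pair(candidates: List[Tuple[str, float, float]]) -> Tuple[str, str]:
    """Return (chosen, rejected) layout ids: highest and lowest combined reward.

    Each candidate is (layout_id, spatial_score, quantitative_score).
    """
    ranked = sorted(candidates, key=lambda c: combined_reward(c[1], c[2]))
    return ranked[-1][0], ranked[0][0]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy favors the chosen layout
    over the rejected one, relative to the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the chosen layout.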
Inference Stage: Given a text prompt, the BEV Layout Generator produces a 2D layout, which the 3D Layout Generator then lifts to a 3D layout. Iterative Asset-Layout Alignment refines the 3D scene by feeding the Spatial and Quantitative Evaluators' assessments back to the layout generators, ensuring consistency between the layout and the 3D assets produced by an object generator.
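The inference loop described above can be sketched as a generate, lift, evaluate, and retry cycle. This is an illustrative sketch under stated assumptions rather than the paper's code: the generators and evaluators are passed in as plain callables, feedback is a list of text messages, and the loop stops when no evaluator reports a problem or after `max_rounds`. The toy generator, lifter, and `collision_evaluator` are hypothetical stand-ins for the model-based components and use a simple axis-aligned box overlap test.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BevBox = Tuple[str, float, float, float, float]  # name, x, y, width, depth


@dataclass
class Box3D:
    name: str
    center: Tuple[float, float, float]  # x, y, z
    size: Tuple[float, float, float]    # width, depth, height


def overlaps(a: Box3D, b: Box3D) -> bool:
    """Axis-aligned overlap test on all three axes."""
    return all(abs(a.center[i] - b.center[i]) * 2 < a.size[i] + b.size[i]
               for i in range(3))


def collision_evaluator(layout: List[Box3D]) -> List[str]:
    """Toy stand-in for an evaluator: one message per colliding pair."""
    return [f"{a.name} collides with {b.name}"
            for i, a in enumerate(layout)
            for b in layout[i + 1:] if overlaps(a, b)]


def align(prompt: str,
          generate_bev: Callable[[str, List[str]], List[BevBox]],
          lift_to_3d: Callable[[List[BevBox], List[str]], List[Box3D]],
          evaluators: List[Callable[[List[Box3D]], List[str]]],
          max_rounds: int = 3) -> Tuple[List[Box3D], int]:
    """Regenerate the layout with evaluator feedback until it passes."""
    feedback: List[str] = []
    layout: List[Box3D] = []
    for round_idx in range(1, max_rounds + 1):
        bev = generate_bev(prompt, feedback)      # 2D layout
        layout = lift_to_3d(bev, feedback)        # add heights and z positions
        feedback = [msg for ev in evaluators for msg in ev(layout)]
        if not feedback:                          # all evaluators satisfied
            return layout, round_idx
    return layout, max_rounds


# Toy generators: the chair moves away from the table once feedback arrives.
def toy_bev(prompt: str, feedback: List[str]) -> List[BevBox]:
    chair_x = 2.0 if feedback else 0.5
    return [("table", 0.0, 0.0, 1.2, 0.8), ("chair", chair_x, 0.0, 0.5, 0.5)]


def toy_lift(bev: List[BevBox], feedback: List[str]) -> List[Box3D]:
    heights = {"table": 0.75, "chair": 0.9}  # assumed object heights
    return [Box3D(n, (x, y, heights[n] / 2), (w, d, heights[n]))
            for n, x, y, w, d in bev]


final_layout, rounds_used = align("a small dining corner", toy_bev, toy_lift,
                                  [collision_evaluator])
```

In this toy run the first layout places the chair inside the table's footprint, the evaluator reports the collision, and the second round resolves it. In the actual system the feedback would come from the VLM and reasoning-LLM evaluators and be supplied to the generators as in-context examples.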