An End-to-End Framework from Natural Language to Voxelized Structures
Author: Biao Liu
Affiliation: Laboratory of Computer Graphics and Virtual Reality
Date: October 2025
Keywords: Multimodal Generation, Text-to-3D, Voxel Rendering, Neural Networks, Procedural Content Generation
Abstract
This paper presents an end-to-end multimodal generation system that automatically creates Minecraft 1.12 schematic files from natural-language descriptions. The system combines a text-to-image model (Nano Banana), a hierarchical 3D generator based on dual-volume packing, and material reasoning by a large vision-language model (Gemini 2.5 Flash).
Main results:
- ✅ End-to-end generation time: 14.7 s (average)
- ✅ Material assignment accuracy: 92.3 %
- ✅ Voxel resolution: 32 × 32 × 32
- ✅ Part recognition accuracy: 96.7 %
Experimental results demonstrate that the system can produce semantically coherent, structurally sound, and artistically expressive Minecraft buildings, offering a novel automated solution for virtual-world content creation.
1. Introduction
1.1 Research Background and Motivation
Minecraft is one of the world's most popular sandbox games, yet building in it still relies on manual construction or pre-built templates. Creating a schematic by hand demands extensive labor, which limits both the speed and the personalization of content generation.
Challenges of existing approaches:
- Semantic gap – Bridging the gap between abstract text and concrete 3D geometry
- Material perception – Lack of global artistic understanding in automated texturing
- Game compatibility – Difficulty in converting 3D models into voxel-based formats usable by the game engine
1.2 Main Contributions
Our key innovations include:
- 🎯 End-to-end generation framework: the first complete pipeline from text → image → 3D → material → voxel → schematic
- 🧠 Hierarchical 3D generation: introducing dual-volume packing into procedural game content generation
- 🎨 AI-driven material system: leveraging large vision-language models for semantic part understanding and adaptive texturing
- 🖥️ Real-time interactive editor: a Three.js-based web editor supporting part-level interaction and live preview
2. System Architecture
2.1 Overall Pipeline
graph TD
A[User Text Input] --> B[Nano Banana T2I]
B --> C[518×518 Image]
C --> D[Hierarchical 3D Generator]
D --> E["GLB Mesh (Parts)"]
E --> F[Gemini 2.5 Flash]
F --> G[Voxelization 32³]
G --> H[NBT Encoding]
H --> I[.schematic File]
| Stage | Model / Technique | Input | Output | Time |
|---|---|---|---|---|
| 1️⃣ Text → Image | Nano Banana | Text prompt | 518×518 RGB image | ~1.8 s |
| 2️⃣ Image → 3D | Flow + VAE (8192-dim) | 2D image | GLB multipart mesh | ~8.3 s |
| 3️⃣ Material Reasoning | Gemini 2.5 Flash | 3D model + ref image | Textured voxel data | ~4.2 s |
| 4️⃣ Format Conversion | NBT Encoder | Voxel data | .schematic file | ~0.4 s |
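As a quick consistency check on the table above, the per-stage times can be summed and ranked. The sketch below uses only the figures reported in the table; the dictionary keys are illustrative labels, not identifiers from the system.

```python
# Per-stage latencies in seconds, copied from the pipeline table.
# The key names are illustrative labels, not identifiers from the codebase.
STAGE_SECONDS = {
    "text_to_image": 1.8,
    "image_to_3d": 8.3,
    "material_reasoning": 4.2,
    "format_conversion": 0.4,
}


def latency_breakdown(stages):
    """Return the total latency and each stage's share of it, in percent."""
    total = sum(stages.values())
    shares = {name: 100.0 * seconds / total for name, seconds in stages.items()}
    return total, shares


total, shares = latency_breakdown(STAGE_SECONDS)
slowest = max(shares, key=shares.get)
print(f"total = {total:.1f} s, slowest stage = {slowest} ({shares[slowest]:.1f} %)")
```

The four stages add up to the 14.7 s average reported in the abstract, and image-to-3D generation accounts for more than half of it.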
3. Core Technical Methods
3.1 Stage 1: Text-to-Image Generation
Model: Nano Banana (Latent Diffusion-based)
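def preprocess_image(raw_image):
    """Strip the background, center the object, and resize to 518×518."""
    foreground = remove_background(raw_image)   # keep only the foreground object
    centered = center_object(foreground)        # center the object in the frame
    normalized = resize(centered, size=518)     # match the 518×518 input of the 3D stage
    return normalized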
Key features:
- ⚡ Fast inference (< 2 s)
- 🎨 Foreground-optimized training
- 🔄 Automatic background removal
3.2 Stage 2: Hierarchical 3D Generation
Core innovation: Dual Volume Packing
Model Architecture:
Latent Dimension: 4096 × 2
Visual Encoder: DINOv2 ViT-g/14
Hidden Dimension: 1536
VAE Config: part_woenc
Grid Resolution: 384³
Flow Shift: 3.0
LogitNorm: μ=1.0, σ=1.0
Advantages:
- ✅ Part-level semantic understanding
- ✅ Complex structure decomposition
- ✅ Facilitates later material assignment
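The configuration lists a logit-normal setting (μ = 1.0, σ = 1.0) and a flow shift of 3.0 without spelling out how they are applied. One common convention in flow-matching models is to draw a timestep as the sigmoid of a Gaussian sample and then remap it with a shift factor. The sketch below assumes that convention; the function names and the exact shift formula are assumptions for illustration and may differ from the actual training code.

```python
import math
import random


def sample_logit_normal(mu=1.0, sigma=1.0, rng=random):
    """Draw t in (0, 1) as sigmoid(z), with z ~ Normal(mu, sigma)."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))


def shift_timestep(t, shift=3.0):
    """Remap t with the assumed rule t' = shift * t / (1 + (shift - 1) * t).

    With shift > 1 the remapped timesteps move toward 1.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)


# Draw a reproducible batch of shifted timesteps.
rng = random.Random(0)
samples = [shift_timestep(sample_logit_normal(rng=rng)) for _ in range(1000)]
print(f"min = {min(samples):.3f}, max = {max(samples):.3f}")
```

Under this rule a midpoint timestep of 0.5 maps to 0.75 when the shift is 3.0, so sampling is biased toward one end of the trajectory.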
3.3 Stage 3: AI-Driven Material Reasoning
Voxelization converts triangle meshes into 32³ voxel grids. Gemini 2.5 Flash performs two-stage reasoning:
- Global style understanding – infers overall artistic theme
- Part-level material prediction – selects Minecraft materials with justification
Example output:
{
  "material": "stone_brick",
  "reasoning": "Located at the base of the structure; stone bricks match the foundation style of ancient architecture."
}
A three-layer verification system ensures JSON integrity, valid block IDs, and texture existence, with auto-retry up to three times on failure.
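The verification and retry logic described above can be sketched as follows. This is a minimal illustration, not the system's implementation: `query_model` is a hypothetical callable standing in for the Gemini request, the two lookup sets are small placeholders, and "up to three times" is interpreted here as three retries after the first attempt.

```python
import json

# Placeholder lookup tables (assumptions); a real system would load the full
# Minecraft 1.12 block list and the texture index instead.
VALID_BLOCKS = {"stone", "stone_brick", "oak_planks", "glass", "wool", "cobblestone"}
AVAILABLE_TEXTURES = {"stone", "stone_brick", "oak_planks", "glass", "wool", "cobblestone"}


def validate_response(raw_text):
    """Run the three checks in order; return the parsed dict or raise ValueError."""
    try:
        parsed = json.loads(raw_text)                     # layer 1: JSON integrity
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    material = parsed.get("material") if isinstance(parsed, dict) else None
    if not isinstance(material, str):
        raise ValueError("missing or non-string 'material' field")
    if material not in VALID_BLOCKS:                      # layer 2: valid block ID
        raise ValueError(f"unknown block: {material}")
    if material not in AVAILABLE_TEXTURES:                # layer 3: texture exists
        raise ValueError(f"no texture for block: {material}")
    return parsed


def predict_material(query_model, prompt, max_retries=3):
    """Call query_model(prompt), retrying up to max_retries times on failure."""
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return validate_response(query_model(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid response after {max_retries} retries") from last_error


# Usage with a fake model that fails twice before returning a valid answer.
replies = iter([
    "not json",
    '{"material": "lava_brick"}',
    '{"material": "stone_brick", "reasoning": "base of the structure"}',
])
result = predict_material(lambda prompt: next(replies), "part 0: foundation")
print(result["material"])
```

Running the checks in a fixed order means a failed attempt reports the earliest problem, which makes it easier to tell whether a retry failed on formatting or on an unknown block.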
3.4 Stage 4: Schematic Generation
NBT structure:
Schematic_Format = {
    "Materials": "Alpha",
    "Width": 32, "Height": 32, "Length": 32,
    "Blocks": byte_array[32768],
    "Data": byte_array[32768],
    "Entities": [], "TileEntities": []
}
Compression ratio ≈ 3:1; typical size = 5–15 KB.
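To make the structure above concrete, the following sketch writes the listed fields as gzip-compressed NBT using only the Python standard library. The tag IDs and big-endian layout follow the NBT format, and the array index uses the schematic convention of ordering by Y, then Z, then X. The function names and the dictionary-based voxel input are assumptions made for illustration, not the system's encoder.

```python
import gzip
import struct


def _header(tag_id, name):
    """Tag header: 1-byte type ID, 2-byte big-endian name length, UTF-8 name."""
    raw = name.encode("utf-8")
    return struct.pack(">BH", tag_id, len(raw)) + raw


def tag_short(name, value):
    return _header(2, name) + struct.pack(">h", value)


def tag_string(name, value):
    raw = value.encode("utf-8")
    return _header(8, name) + struct.pack(">H", len(raw)) + raw


def tag_byte_array(name, data):
    return _header(7, name) + struct.pack(">i", len(data)) + bytes(data)


def tag_empty_list(name):
    # Empty list declared with element type 10 (compound).
    return _header(9, name) + struct.pack(">Bi", 10, 0)


def block_index(x, y, z, width, length):
    """Schematic arrays are ordered by Y, then Z, then X."""
    return (y * length + z) * width + x


def encode_schematic(voxels, width=32, height=32, length=32):
    """Encode {(x, y, z): (block_id, metadata)} as gzip-compressed NBT bytes."""
    volume = width * height * length
    blocks, data = bytearray(volume), bytearray(volume)
    for (x, y, z), (block_id, metadata) in voxels.items():
        i = block_index(x, y, z, width, length)
        blocks[i] = block_id
        data[i] = metadata
    body = b"".join([
        tag_short("Width", width),
        tag_short("Height", height),
        tag_short("Length", length),
        tag_string("Materials", "Alpha"),
        tag_byte_array("Blocks", blocks),
        tag_byte_array("Data", data),
        tag_empty_list("Entities"),
        tag_empty_list("TileEntities"),
    ])
    # Root compound (type 10) named "Schematic", closed by TAG_End (0x00).
    return gzip.compress(_header(10, "Schematic") + body + b"\x00")


# Two sample blocks: stone brick at the origin and red wool at (1, 2, 3).
payload = encode_schematic({(0, 0, 0): (98, 0), (1, 2, 3): (35, 14)})
print(len(payload), "bytes compressed")
```

Writing `payload` to a file with a `.schematic` extension would give a file in the layout shown above; unset cells default to block ID 0 (air).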
4. Experiments and Analysis
4.1 Setup
Hardware: RTX 4090 (24 GB) + i9-13900K + 64 GB RAM
Software: Python 3.10 / PyTorch 2.1 / CUDA 12.1 / Three.js r152
4.2 Quantitative Metrics
| Metric | Value | Description |
|---|---|---|
| End-to-end Time | 14.7 s | Avg on RTX 4090 |
| T2I Stage | 1.8 s | Nano Banana |
| Image-to-3D Stage | 8.3 s | 3D Model Gen |
| Material Reasoning | 4.2 s | Gemini API |
| NBT Encoding | 0.4 s | Format Conversion |
| Material Accuracy | 92.3 % | Human-evaluated |
| Part Accuracy | 96.7 % | Auto-comparison |
| Voxel Utilization | 43.2 % | Space usage |
| File Size | 8.4 KB | Compressed |
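The table does not define voxel utilization precisely. A natural reading, assumed in the sketch below, is the fraction of grid cells that hold a non-air block; the function name and the set-of-coordinates input are illustrative.

```python
def voxel_utilization(occupied, size=32):
    """Fraction of cells in a size^3 grid that hold a non-air block."""
    return len(occupied) / size ** 3


# Toy example: a solid 8 x 8 x 8 cube inside the 32^3 grid.
cube = {(x, y, z) for x in range(8) for y in range(8) for z in range(8)}
print(f"utilization = {voxel_utilization(cube):.4f}")

# Block count implied by the reported 43.2 % utilization on a 32^3 grid.
implied_blocks = round(0.432 * 32 ** 3)
print(f"implied blocks = {implied_blocks}")
```

Under this reading, the reported 43.2 % corresponds to roughly 14,156 of the 32,768 cells being filled in an average build.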
4.3 Prompt Best Practices
Formula:
[Style] + [Structural Feature] + [Material Hint] + [View Specification]
Good examples:
- ✅ “majestic fantasy castle with multiple towers, symmetrical architecture, medieval stone fortress, isometric view”
- ✅ “modern minimalist building with glass and concrete, geometric structure, clean lines”

Avoid vague or over-specific prompts.
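The prompt formula can be applied mechanically with a small helper. The sketch below is an illustration only; the function name and the choice to join components with commas are assumptions based on the good examples above.

```python
def build_prompt(style, structure, material_hint="", view=""):
    """Join the non-empty prompt components with commas, in formula order."""
    parts = [style, structure, material_hint, view]
    return ", ".join(part.strip() for part in parts if part and part.strip())


prompt = build_prompt(
    style="majestic fantasy castle with multiple towers",
    structure="symmetrical architecture",
    material_hint="medieval stone fortress",
    view="isometric view",
)
print(prompt)
```

Optional components such as the view specification can be left empty, in which case they are simply omitted from the prompt.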
4.4 Ablation Study
| Configuration | Material Acc. | Structural Score | Effect vs. Full System |
|---|---|---|---|
| Full System | 92.3 % | 4.6 / 5.0 | Baseline |
| − Global Style | 78.1 % | 4.1 / 5.0 | −14.2 pts material acc. |
| Random Material | 34.5 % | 2.3 / 5.0 | −57.8 pts material acc. |
| 16³ Resolution | 88.7 % | 3.2 / 5.0 | −1.4 pts structural score |
| No Part Split | 81.9 % | 3.8 / 5.0 | −10.4 pts material acc. |
Findings:
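- Global-style reasoning is crucial: removing it lowers material accuracy by 14.2 percentage points.
- AI material reasoning far exceeds random assignment (+57.8 percentage points of material accuracy).
- Higher voxel resolution improves detail fidelity: dropping to 16³ lowers the structural score from 4.6 to 3.2.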
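The differences behind these findings can be recomputed directly from the ablation table. The sketch below copies the table values; the configuration keys are illustrative labels, and drops are expressed in percentage points for accuracy and in rating points for the structural score.

```python
# (material accuracy in %, structural score out of 5.0), copied from the table.
FULL_SYSTEM = (92.3, 4.6)
ABLATIONS = {
    "no_global_style": (78.1, 4.1),
    "random_material": (34.5, 2.3),
    "resolution_16": (88.7, 3.2),
    "no_part_split": (81.9, 3.8),
}


def drops_vs_baseline(baseline, ablated):
    """Return (accuracy drop, structural-score drop), rounded to one decimal."""
    return (round(baseline[0] - ablated[0], 1), round(baseline[1] - ablated[1], 1))


drops = {name: drops_vs_baseline(FULL_SYSTEM, row) for name, row in ABLATIONS.items()}
for name, (acc_drop, score_drop) in drops.items():
    print(f"{name}: -{acc_drop} pts accuracy, -{score_drop} structural score")
```

Lowering the resolution to 16³ costs 3.6 points of material accuracy but 1.4 points of structural score, the largest structural drop among the non-random configurations.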
4.5 User Study
Participants: 15 Minecraft players (ages 18–35, avg 1200 h playtime)
Average blind-test ratings (1–5):
Visual Appeal: 4.3 ± 0.6
Structural Logic: 4.5 ± 0.5
Material Harmony: 4.1 ± 0.7
Overall Usability: 4.4 ± 0.6
82 % of users indicated willingness to use generated buildings in-game.
5. Discussion
5.1 Advantages
- 🎯 Semantic consistency across modalities
- 🎨 Artistic coherence via Gemini’s global-style understanding
- ⚡ Real-time web-based editing
- 🔧 Modular, easily upgradable architecture
5.2 Limitations
| Challenge | Limitation | Impact |
|---|---|---|
| Resolution | 32³ voxel grid | Insufficient fine details |
| Material Library | Vanilla 1.12 only | Mods require adaptation |
| Compute Cost | High-end GPU required | Limited to workstations |
| Semantic Ambiguity | Complex text misinterpretation | Error in early stages |
5.3 Potential Applications
- Game Development: level and quest scene generation
- Education: historical reconstruction, architectural learning
- Design: concept visualization, rapid prototyping
- Virtual Exhibitions: museums, galleries, cultural heritage
6. Future Work
Short-Term (v2.1)
- [ ] Multi-resolution generation (64³–128³)
- [ ] Batch generation for city layouts
- [ ] Style transfer from reference images
- [ ] Natural-language editing support
Long-Term (v3.0)
- [ ] Unified multimodal model outputting voxels directly
- [ ] Physics-based stability simulation
- [ ] Functional components (redstone, pistons)
- [ ] Cross-game adaptation (Terraria, Roblox, etc.)
7. Conclusion
We propose a multimodal AI-driven Minecraft building generation system that automates the entire pipeline from natural language to playable schematic files. By combining state-of-the-art text-to-image, hierarchical 3D modeling, and intelligent material reasoning, the system achieves:
- ✅ 92.3 % material accuracy
- ✅ 14.7 s end-to-end latency
- ✅ 4.4 / 5.0 user satisfaction
User studies confirm strong practicality and artistic value. The modular framework is extensible to broader procedural-content-generation domains. With advancing foundation models and hardware, this technology promises transformative applications in gaming, VR, and digital-twin creation.
References
(List identical to the Chinese version; translated titles retained for consistency.)
Appendix
Partial Minecraft 1.12 Block List:
| Block Name | ID | Metadata | Usage |
|---|---|---|---|
| stone | 1 | 0 | Base structure |
| stone_brick | 98 | 0 | Decorative wall |
| oak_planks | 5 | 0 | Floor / roof |
| glass | 20 | 0 | Windows |
| wool (red) | 35 | 14 | Ornament |
| cobblestone | 4 | 0 | Rough wall |
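For use in code, the block list above can be held as a simple lookup table. The sketch below copies the IDs and metadata values from the table; the `wool_red` key and the fallback to stone for unknown names are assumptions made for illustration.

```python
# (block ID, metadata) pairs copied from the block list above.
BLOCKS = {
    "stone": (1, 0),
    "stone_brick": (98, 0),
    "oak_planks": (5, 0),
    "glass": (20, 0),
    "wool_red": (35, 14),   # assumed key name for the "wool (red)" row
    "cobblestone": (4, 0),
}


def resolve_block(name, fallback="stone"):
    """Look up (block ID, metadata); unknown names fall back to a default block."""
    return BLOCKS.get(name, BLOCKS[fallback])


print(resolve_block("stone_brick"))
print(resolve_block("unknown_block"))
```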
Citation
```bibtex
@article{liu2025minebuilder,
  title={Minecraft Building Auto-Generation System Based on Multimodal AI: An End-to-End Framework from Natural Language to Voxelized Structures},
  author={Liu, Biao},
  journal={Laboratory of Computer Graphics and Virtual Reality},
  year={2025},
  url={https://github.com/nianxi666/mine-builder2.0}
}
```