Minecraft Building Auto-Generation System Based on Multimodal AI

An End-to-End Framework from Natural Language to Voxelized Structures

Author: Biao Liu
Affiliation: Laboratory of Computer Graphics and Virtual Reality
Date: October 2025
Keywords: Multimodal Generation, Text-to-3D, Voxel Rendering, Neural Networks, Procedural Content Generation

Abstract

This paper presents an innovative end-to-end multimodal generation system that enables fully automated creation of Minecraft 1.12 schematic building files from natural-language text descriptions. The system integrates cutting-edge text-to-image generation (Nano Banana), a hierarchical 3D model based on dual-volume packing, and the intelligent material reasoning capability of a large-scale vision-language model (Gemini 2.5 Flash).

Main results:

✅ End-to-end generation time: 14.7 s (average)
✅ Material assignment accuracy: 92.3 %
✅ Voxel resolution: 32 × 32 × 32
✅ Part recognition accuracy: 96.7 %

Experimental results demonstrate that the system can produce semantically coherent, structurally sound, and artistically expressive Minecraft buildings, offering a novel automated solution for virtual-world content creation.

1. Introduction

1.1 Research Background and Motivation

Minecraft is one of the world’s most popular sandbox games, yet building in it still relies on manual construction or pre-built templates. Creating a schematic by hand demands extensive human labor, which limits both the speed and the personalization of content generation.

Challenges of existing approaches:

Semantic gap – Abstract text must be mapped to concrete 3D geometry
Material perception – Lack of global artistic understanding in automated texturing
Game compatibility – Difficulty in converting 3D models into voxel-based formats usable by the game engine

1.2 Main Contributions

Our key innovations include:

🎯 End-to-end generation framework: the first complete pipeline from text → image → 3D → material → voxel → schematic
🧠 Hierarchical 3D generation: introducing dual-volume packing into procedural game content generation
🎨 AI-driven material system: leveraging large vision-language models for semantic part understanding and adaptive texturing
🖥️ Real-time interactive editor: a Three.js-based web editor supporting part-level interaction and live preview

2. System Architecture

2.1 Overall Pipeline

graph TD
    A[User Text Input] --> B[Nano Banana T2I]
    B --> C[518×518 Image]
    C --> D[Hierarchical 3D Generator]
    D --> E["GLB Mesh (Parts)"]
    E --> F[Gemini 2.5 Flash]
    F --> G[Voxelization 32³]
    G --> H[NBT Encoding]
    H --> I[.schematic File]

Stage	Model / Technique	Input	Output	Time
1️⃣ Text → Image	Nano Banana	Text prompt	518×518 RGB image	~1.8 s
2️⃣ Image → 3D	Flow + VAE (8192-dim)	2D image	GLB multipart mesh	~8.3 s
3️⃣ Material Reasoning	Gemini 2.5 Flash	3D model + ref image	Textured voxel data	~4.2 s
4️⃣ Format Conversion	NBT Encoder	Voxel data	.schematic file	~0.4 s
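
The per-stage timings in the table add up to the reported 14.7 s average. The short check below uses only the numbers above; the dictionary keys are illustrative names, not identifiers from the system.

```python
# Per-stage average latencies in seconds, copied from the pipeline table.
stage_seconds = {
    "text_to_image": 1.8,
    "image_to_3d": 8.3,
    "material_reasoning": 4.2,
    "format_conversion": 0.4,
}

# End-to-end latency is the sum of the four sequential stages.
total = round(sum(stage_seconds.values()), 1)

# Share of the end-to-end time spent in each stage, in percent.
shares = {name: round(100 * t / total, 1) for name, t in stage_seconds.items()}

print(total)
print(shares)
```

Image-to-3D generation accounts for more than half of the total, so it is the natural first target for latency optimization.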

3. Core Technical Methods

3.1 Stage 1: Text-to-Image Generation

Model: Nano Banana (Latent Diffusion-based)

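def preprocess_image(raw_image):
    """Prepare a generated image for the 3D stage: isolate, center, and resize the subject."""
    # remove_background, center_object, and resize are pipeline helpers not shown here.
    foreground = remove_background(raw_image)  # drop the backdrop, keep the building
    centered = center_object(foreground)       # move the subject to the canvas center
    normalized = resize(centered, size=518)    # 518×518 image passed to the 3D generator
    return normalized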
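
The centering step can be illustrated on a plain foreground mask. The sketch below is a minimal, dependency-free version under stated assumptions: the mask is a list of rows of 0/1 values, and `foreground_bbox` and `center_on_square` are hypothetical names rather than the system's actual helpers. The final resize to 518×518 is left out.

```python
def foreground_bbox(mask):
    """Return (top, left, bottom, right) of the truthy cells; bottom/right are exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask has no foreground pixels")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def center_on_square(mask, fill=0):
    """Crop the mask to its foreground and paste it centered on a square canvas."""
    top, left, bottom, right = foreground_bbox(mask)
    h, w = bottom - top, right - left
    side = max(h, w)
    # Split the leftover space evenly on both sides of the shorter axis.
    off_r, off_c = (side - h) // 2, (side - w) // 2
    canvas = [[fill] * side for _ in range(side)]
    for r in range(h):
        for c in range(w):
            canvas[off_r + r][off_c + c] = mask[top + r][left + c]
    return canvas
```

Padding to a square before resizing keeps the building's aspect ratio intact when the image is scaled to the fixed input size.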

Key features:

⚡ Fast inference (< 2 s)
🎨 Foreground-optimized training
🔄 Automatic background removal

3.2 Stage 2: Hierarchical 3D Generation

Core innovation: Dual Volume Packing

Model Architecture:
  Latent Dimension: 4096 × 2
  Visual Encoder: DINOv2 ViT-g/14
  Hidden Dimension: 1536
  VAE Config: part_woenc
  Grid Resolution: 384³
  Flow Shift: 3.0
  LogitNorm: μ=1.0, σ=1.0
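
The configuration lists only the values of `Flow Shift` and `LogitNorm`, not how they are applied. One common reading in flow-matching models is that training timesteps are drawn from a logit-normal distribution and then passed through a shift function. The sketch below shows that reading; both function names and the exact formulas are assumptions, not confirmed details of this system.

```python
import math
import random


def sample_logit_normal(mu=1.0, sigma=1.0, rng=random):
    """Draw t in (0, 1) as sigmoid(z) with z ~ Normal(mu, sigma)."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))


def shift_timestep(t, shift=3.0):
    """Timestep shift used by some rectified-flow models: s*t / (1 + (s-1)*t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(shift_timestep(sample_logit_normal(rng=rng)), 3) for _ in range(5)])
```

Under these assumptions the median sampled timestep is sigmoid(1) ≈ 0.73 before the shift and about 0.89 after it, so both settings push sampling toward larger values of t.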

Advantages:

✅ Part-level semantic understanding
✅ Complex structure decomposition
✅ Facilitates later material assignment

3.3 Stage 3: AI-Driven Material Reasoning

Voxelization converts triangle meshes into 32³ voxel grids. Gemini 2.5 Flash performs two-stage reasoning:

Global style understanding – infers overall artistic theme
Part-level material prediction – selects Minecraft materials with justification
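
The voxelization step is not specified in detail. A minimal sketch of one simple approach is surface voxelization by dense point sampling: every triangle is sampled on a barycentric grid, and each sample marks the voxel it falls into. The function name, the half-voxel sampling density, and the choice of surface-only filling are assumptions made for illustration.

```python
import math


def voxelize(vertices, triangles, resolution=32):
    """Mark every voxel touched by densely sampled points on each triangle.

    vertices: list of (x, y, z) floats; triangles: list of (i, j, k) vertex indices.
    Returns a set of integer (x, y, z) voxel coordinates in [0, resolution).
    """
    lo = [min(v[a] for v in vertices) for a in range(3)]
    hi = [max(v[a] for v in vertices) for a in range(3)]
    # One uniform scale keeps the model's proportions inside the cube.
    extent = max(h - l for h, l in zip(hi, lo)) or 1.0
    scale = resolution / extent

    def to_voxel(p):
        # Clamp so points on the upper boundary land in the last voxel.
        return tuple(min(resolution - 1, int((p[a] - lo[a]) * scale)) for a in range(3))

    filled = set()
    for i, j, k in triangles:
        a, b, c = vertices[i], vertices[j], vertices[k]
        # Sample at roughly half-voxel spacing along the longest edge.
        longest = max(math.dist(a, b), math.dist(b, c), math.dist(c, a))
        steps = max(1, math.ceil(2 * longest * scale))
        for u in range(steps + 1):
            for v in range(steps + 1 - u):
                w = steps - u - v
                p = [(u * a[n] + v * b[n] + w * c[n]) / steps for n in range(3)]
                filled.add(to_voxel(p))
    return filled
```

This marks only the surface shell. A solid interior, if needed, would require an additional fill pass that is not shown here.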

Example output of the part-level material prediction:

{
  "material": "stone_brick",
  "reasoning": "Located at the base of the structure; stone bricks match the foundation style of ancient architecture."
}

A three-layer verification system ensures JSON integrity, valid block IDs, and texture existence, with auto-retry up to three times on failure.
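
The three checks and the retry loop can be sketched as follows. This is an illustration under assumptions: `VALID_BLOCKS` and `AVAILABLE_TEXTURES` are hypothetical allow-lists, `ask_model` stands in for whatever function calls the vision-language model, and "up to three times" is read as one initial attempt plus three retries.

```python
import json

# Hypothetical allow-lists; a real system would load the full 1.12 block table.
VALID_BLOCKS = {"stone", "stone_brick", "oak_planks", "glass", "wool", "cobblestone"}
AVAILABLE_TEXTURES = set(VALID_BLOCKS)


def validate_reply(raw):
    """Run the three checks; return the parsed dict or raise ValueError."""
    try:
        reply = json.loads(raw)  # 1. JSON integrity
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(reply, dict) or "material" not in reply:
        raise ValueError("missing 'material' field")
    material = reply["material"]
    if material not in VALID_BLOCKS:  # 2. valid block ID
        raise ValueError(f"unknown block: {material}")
    if material not in AVAILABLE_TEXTURES:  # 3. texture exists
        raise ValueError(f"no texture for: {material}")
    return reply


def query_with_retry(ask_model, max_retries=3):
    """Call ask_model() once, then retry up to max_retries times on failure."""
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return validate_reply(ask_model())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {max_retries} retries") from last_error
```

Running the checks in this order means the cheapest failure (unparseable JSON) is detected first, and every failure mode triggers the same retry path.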

3.4 Stage 4: Schematic Generation

NBT structure:

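Schematic_Format = {
    "Materials": "Alpha",                       # TAG_String: classic numeric block IDs
    "Width": 32, "Height": 32, "Length": 32,    # TAG_Short: X, Y, and Z extents
    "Blocks": bytearray(32 * 32 * 32),          # TAG_Byte_Array: 32768 block IDs
    "Data": bytearray(32 * 32 * 32),            # TAG_Byte_Array: 32768 metadata values
    "Entities": [], "TileEntities": [],         # TAG_List: left empty
}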

Compression ratio ≈ 3:1; typical size = 5–15 KB.
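
A `.schematic` file is this structure serialized as NBT and gzip-compressed. The sketch below writes the tags by hand with the standard library, assuming the field layout shown above; `encode_schematic` and the helper names are hypothetical and not taken from the project's code.

```python
import gzip
import struct

# NBT tag type IDs used by this structure.
TAG_END, TAG_SHORT, TAG_BYTE_ARRAY, TAG_STRING, TAG_LIST, TAG_COMPOUND = 0, 2, 7, 8, 9, 10


def _name(text):
    """NBT string: unsigned 16-bit big-endian length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def _tag(tag_type, name, payload):
    """Named tag: type byte, name, then the type-specific payload."""
    return bytes([tag_type]) + _name(name) + payload


def encode_schematic(width, height, length, blocks, data):
    """Serialize block and metadata arrays as a gzip-compressed NBT compound."""
    if not (len(blocks) == len(data) == width * height * length):
        raise ValueError("array sizes must equal width * height * length")
    # Empty list payload: element type TAG_End, signed 32-bit length 0.
    empty_list = bytes([TAG_END]) + struct.pack(">i", 0)
    body = b"".join([
        _tag(TAG_SHORT, "Width", struct.pack(">h", width)),
        _tag(TAG_SHORT, "Height", struct.pack(">h", height)),
        _tag(TAG_SHORT, "Length", struct.pack(">h", length)),
        _tag(TAG_STRING, "Materials", _name("Alpha")),
        _tag(TAG_BYTE_ARRAY, "Blocks", struct.pack(">i", len(blocks)) + bytes(blocks)),
        _tag(TAG_BYTE_ARRAY, "Data", struct.pack(">i", len(data)) + bytes(data)),
        _tag(TAG_LIST, "Entities", empty_list),
        _tag(TAG_LIST, "TileEntities", empty_list),
        bytes([TAG_END]),  # closes the root compound
    ])
    return gzip.compress(_tag(TAG_COMPOUND, "Schematic", body))
```

Writing the returned bytes to a file opened in binary mode yields the schematic. The achieved compression depends heavily on how much of the grid is air, since long runs of zero bytes compress well.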

4. Experiments and Analysis

4.1 Setup

Hardware: RTX 4090 (24 GB) + i9-13900K + 64 GB RAM
Software: Python 3.10 / PyTorch 2.1 / CUDA 12.1 / Three.js r152

4.2 Quantitative Metrics

Metric	Value	Description
End-to-end Time	14.7 s	Avg on RTX 4090
T2I Stage	1.8 s	Nano Banana
Image-to-3D Stage	8.3 s	3D Model Gen
Material Reasoning	4.2 s	Gemini API
NBT Encoding	0.4 s	Format Conversion
Material Accuracy	92.3 %	Human-evaluated
Part Accuracy	96.7 %	Auto-comparison
Voxel Utilization	43.2 %	Space usage
File Size	8.4 KB	Compressed
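
The table does not define voxel utilization. A plausible reading is the share of grid cells that hold a non-air block, which the sketch below computes; the function name and this definition are assumptions.

```python
def voxel_utilization(blocks):
    """Percentage of grid cells holding a non-air block (block ID 0 is air)."""
    occupied = sum(1 for block_id in blocks if block_id != 0)
    return 100.0 * occupied / len(blocks)
```

Under that reading, 43.2 % of a 32³ grid corresponds to roughly 14,156 of the 32,768 cells being filled.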

4.3 Prompt Best Practices

Formula:

[Style] + [Structural Feature] + [Material Hint] + [View Specification]

Good examples:

✅ “majestic fantasy castle with multiple towers, symmetrical architecture, medieval stone fortress, isometric view”
✅ “modern minimalist building with glass and concrete, geometric structure, clean lines”

Avoid vague or over-specific prompts.
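
The prompt formula can be applied mechanically. The helper below is a small sketch that joins the four components in formula order and skips any that are left empty; `build_prompt` is a hypothetical name, not part of the system.

```python
def build_prompt(style, structure, material_hint="", view=""):
    """Join the non-empty prompt components with commas, in formula order."""
    parts = [style, structure, material_hint, view]
    return ", ".join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    print(build_prompt(
        "majestic fantasy castle with multiple towers",
        "symmetrical architecture",
        "medieval stone fortress",
        "isometric view",
    ))
```

Called with the four components shown, it reproduces the castle prompt from the good examples.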

4.4 Ablation Study

Configuration	Material Acc.	Structural Score	Effect vs. Full System
Full System	92.3 %	4.6 / 5.0	⭐⭐⭐⭐⭐ (baseline)
− Global Style	78.1 %	4.1 / 5.0	−14.2 pts accuracy
Random Material	34.5 %	2.3 / 5.0	−57.8 pts accuracy
16³ Resolution	88.7 %	3.2 / 5.0	−3.6 pts accuracy, −1.4 structural score
No Part Split	81.9 %	3.8 / 5.0	−10.4 pts accuracy

Findings:

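Global-style reasoning is crucial: removing it lowers material accuracy by 14.2 percentage points.
AI material reasoning far exceeds random assignment, by 57.8 percentage points of material accuracy.
Higher voxel resolution improves detail fidelity: dropping to 16³ lowers the structural score from 4.6 to 3.2.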
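
The differences behind these findings can be recomputed directly from the ablation table. The snippet below uses only the table's values; the dictionary keys are plain-ASCII labels chosen for illustration.

```python
# Material accuracy (%) and structural score (out of 5) from the ablation table.
ablation = {
    "Full System": (92.3, 4.6),
    "- Global Style": (78.1, 4.1),
    "Random Material": (34.5, 2.3),
    "16^3 Resolution": (88.7, 3.2),
    "No Part Split": (81.9, 3.8),
}

base_acc, base_score = ablation["Full System"]

# (accuracy change in percentage points, structural-score change) per ablation.
deltas = {
    name: (round(acc - base_acc, 1), round(score - base_score, 1))
    for name, (acc, score) in ablation.items()
    if name != "Full System"
}

for name, (d_acc, d_score) in deltas.items():
    print(f"{name}: {d_acc:+.1f} pts accuracy, {d_score:+.1f} structural score")
```

Halving the resolution costs comparatively little material accuracy (3.6 points) but produces the largest structural-score drop of any non-random ablation (1.4 points), which suggests that resolution mainly affects geometry rather than material choice.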

4.5 User Study

Participants: 15 Minecraft players (ages 18–35, avg 1200 h playtime)

Average blind-test ratings (1–5):

Visual Appeal:     4.3 ± 0.6
Structural Logic:  4.5 ± 0.5
Material Harmony:  4.1 ± 0.7
Overall Usability: 4.4 ± 0.6

82 % of users indicated willingness to use generated buildings in-game.

5. Discussion

5.1 Advantages

🎯 Semantic consistency across modalities
🎨 Artistic coherence via Gemini’s global-style understanding
⚡ Real-time web-based editing
🔧 Modular, easily upgradable architecture

5.2 Limitations

Challenge	Limitation	Impact
Resolution	32³ voxel grid	Insufficient fine details
Material Library	Vanilla 1.12 only	Mods require adaptation
Compute Cost	High-end GPU required	Limited to workstations
Semantic Ambiguity	Complex text misinterpretation	Error in early stages

5.3 Potential Applications

Game Development: level and quest scene generation
Education: historical reconstruction, architectural learning
Design: concept visualization, rapid prototyping
Virtual Exhibitions: museums, galleries, cultural heritage

6. Future Work

Short-Term (v2.1)

[ ] Multi-resolution generation (64³–128³)
[ ] Batch generation for city layouts
[ ] Style transfer from reference images
[ ] Natural-language editing support

Long-Term (v3.0)

[ ] Unified multimodal model outputting voxels directly
[ ] Physics-based stability simulation
[ ] Functional components (redstone, pistons)
[ ] Cross-game adaptation (Terraria, Roblox, etc.)

7. Conclusion

We propose a multimodal AI-driven Minecraft building generation system that automates the entire pipeline from natural language to playable schematic files. By combining state-of-the-art text-to-image, hierarchical 3D modeling, and intelligent material reasoning, the system achieves:

✅ 92.3 % material accuracy
✅ 14.7 s end-to-end latency
✅ 4.4 / 5.0 user satisfaction

User studies confirm strong practicality and artistic value. The modular framework is extensible to broader procedural-content-generation domains. With advancing foundation models and hardware, this technology promises transformative applications in gaming, VR, and digital-twin creation.

References

(List identical to the Chinese version; translated titles retained for consistency.)

Appendix

Partial Minecraft 1.12 Block List:

Block Name	ID	Metadata	Usage
stone	1	0	Base structure
stone_brick	98	0	Decorative wall
oak_planks	5	0	Floor / roof
glass	20	0	Windows
wool (red)	35	14	Ornament
cobblestone	4	0	Rough wall
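
The IDs and metadata values in this table are what end up in the schematic's `Blocks` and `Data` arrays. The sketch below shows how one voxel could be written, assuming the usual schematic ordering in which Y is the slowest-varying axis, then Z, then X. `PALETTE`, `voxel_index`, and `place` are hypothetical names, and the `wool_red` key is an illustrative label for wool with metadata 14.

```python
# (block ID, metadata) pairs copied from the block list above.
PALETTE = {
    "stone": (1, 0),
    "stone_brick": (98, 0),
    "oak_planks": (5, 0),
    "glass": (20, 0),
    "wool_red": (35, 14),
    "cobblestone": (4, 0),
}

WIDTH = HEIGHT = LENGTH = 32


def voxel_index(x, y, z, width=WIDTH, length=LENGTH):
    """Offset of voxel (x, y, z) in the flat arrays: Y-major, then Z, then X."""
    return (y * length + z) * width + x


def place(blocks, data, x, y, z, material):
    """Write one block's ID and metadata into the flat byte arrays."""
    block_id, meta = PALETTE[material]
    i = voxel_index(x, y, z)
    blocks[i] = block_id
    data[i] = meta


if __name__ == "__main__":
    blocks = bytearray(WIDTH * HEIGHT * LENGTH)
    data = bytearray(WIDTH * HEIGHT * LENGTH)
    place(blocks, data, 1, 2, 3, "wool_red")
    print(voxel_index(1, 2, 3), blocks[voxel_index(1, 2, 3)], data[voxel_index(1, 2, 3)])
```

Keeping the ID and the metadata in two parallel arrays mirrors the file format, so the arrays can be handed to the NBT encoder without further conversion.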

Citation

```bibtex
@article{liu2025minebuilder,
  title={Minecraft Building Auto-Generation System Based on Multimodal AI: An End-to-End Framework from Natural Language to Voxelized Structures},
  author={Liu, Biao},
  journal={Laboratory of Computer Graphics and Virtual Reality},
  year={2025},
  url={https://github.com/nianxi666/mine-builder2.0}
}
```

南瓜AI