Minecraft Building Auto-Generation System Based on Multimodal AI

An End-to-End Framework from Natural Language to Voxelized Structures

Author: Biao Liu
Affiliation: Laboratory of Computer Graphics and Virtual Reality
Date: October 2025
Keywords: Multimodal Generation, Text-to-3D, Voxel Rendering, Neural Networks, Procedural Content Generation


Abstract

This paper presents an end-to-end multimodal generation system that automatically creates Minecraft 1.12 schematic files from natural-language descriptions. The system combines text-to-image generation (Nano Banana), hierarchical 3D generation based on dual-volume packing, and material reasoning by a large vision-language model (Gemini 2.5 Flash).

Main results:

  • ✅ End-to-end generation time: 14.7 s (average)
  • ✅ Material assignment accuracy: 92.3 %
  • ✅ Voxel resolution: 32 × 32 × 32
  • ✅ Part recognition accuracy: 96.7 %

Experimental results demonstrate that the system can produce semantically coherent, structurally sound, and artistically expressive Minecraft buildings, offering a novel automated solution for virtual-world content creation.


1. Introduction

1.1 Research Background and Motivation

Minecraft is one of the world's most popular sandbox games, yet building in it still relies on manual construction or pre-built templates. Creating a schematic by hand is labor-intensive, which limits both how quickly content can be produced and how far it can be personalized.

Challenges of existing approaches:

  1. Semantic gap – abstract text must be mapped to concrete 3D geometry
  2. Material perception – automated texturing lacks a global understanding of artistic style
  3. Game compatibility – 3D models are hard to convert into voxel formats that the game engine can load

1.2 Main Contributions

Our key innovations include:

  • 🎯 End-to-end generation framework: the first complete pipeline from text → image → 3D → material → voxel → schematic
  • 🧠 Hierarchical 3D generation: introducing dual-volume packing into procedural game content generation
  • 🎨 AI-driven material system: leveraging large vision-language models for semantic part understanding and adaptive texturing
  • 🖥️ Real-time interactive editor: a Three.js-based web editor supporting part-level interaction and live preview

2. System Architecture

2.1 Overall Pipeline

graph TD
    A[User Text Input] --> B[Nano Banana T2I]
    B --> C[518×518 Image]
    C --> D[Hierarchical 3D Generator]
    D --> E["GLB Mesh (Parts)"]
    E --> F[Gemini 2.5 Flash]
    F --> G[Voxelization 32³]
    G --> H[NBT Encoding]
    H --> I[.schematic File]

| Stage | Model / Technique | Input | Output | Time |
|---|---|---|---|---|
| 1️⃣ Text → Image | Nano Banana | Text prompt | 518×518 RGB image | ~1.8 s |
| 2️⃣ Image → 3D | Flow + VAE (8192-dim) | 2D image | GLB multipart mesh | ~8.3 s |
| 3️⃣ Material Reasoning | Gemini 2.5 Flash | 3D model + ref image | Textured voxel data | ~4.2 s |
| 4️⃣ Format Conversion | NBT Encoder | Voxel data | .schematic file | ~0.4 s |
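
As a quick consistency check on the stage times above, the following sketch sums them and reports each stage's share of the total. The `Stage` class and function names are illustrative and not part of the described system; only the numbers come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    technique: str
    seconds: float  # approximate per-stage time reported in the table

PIPELINE = [
    Stage("Text -> Image", "Nano Banana", 1.8),
    Stage("Image -> 3D", "Flow + VAE", 8.3),
    Stage("Material Reasoning", "Gemini 2.5 Flash", 4.2),
    Stage("Format Conversion", "NBT Encoder", 0.4),
]

def total_seconds(stages):
    """End-to-end time as the sum of the stage times."""
    return round(sum(s.seconds for s in stages), 1)

def share_of_total(stages):
    """Percentage of the end-to-end time spent in each stage."""
    total = sum(s.seconds for s in stages)
    return {s.name: round(100 * s.seconds / total, 1) for s in stages}
```

The sum reproduces the reported 14.7 s average, and the image-to-3D stage accounts for more than half of it.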

3. Core Technical Methods

3.1 Stage 1: Text-to-Image Generation

Model: Nano Banana (Latent Diffusion-based)

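def preprocess_image(raw_image):
    """Prepare a generated image for the image-to-3D stage."""
    foreground = remove_background(raw_image)  # keep only the foreground object
    centered = center_object(foreground)       # center the object in the frame
    normalized = resize(centered, size=518)    # match the 518×518 input size
    return normalized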

Key features:

  • ⚡ Fast inference (< 2 s)
  • 🎨 Foreground-optimized training
  • 🔄 Automatic background removal
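
The helpers used by `preprocess_image` are not defined in the text. The sketch below gives toy, pure-Python stand-ins so the flow can be run end to end. Everything here is an assumption made for illustration: images are lists of rows of `(r, g, b, a)` tuples, and "background" is taken to mean near-white pixels, whereas a real system would more likely use a segmentation or matting model.

```python
def remove_background(img, threshold=250):
    """Make near-white pixels transparent (toy rule, assumed for illustration)."""
    return [[(r, g, b, 0) if min(r, g, b) >= threshold else (r, g, b, a)
             for (r, g, b, a) in row] for row in img]

def center_object(img):
    """Crop to the opaque bounding box, then pad to a centered square."""
    ys = [y for y, row in enumerate(img) if any(p[3] > 0 for p in row)]
    xs = [x for row in img for x, p in enumerate(row) if p[3] > 0]
    if not ys:
        return img  # nothing opaque, leave the image unchanged
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)
    crop = [row[left:right + 1] for row in img[top:bottom + 1]]
    h, w = len(crop), len(crop[0])
    side = max(h, w)
    pad_y, pad_x = (side - h) // 2, (side - w) // 2
    out = [[(0, 0, 0, 0)] * side for _ in range(side)]
    for y, row in enumerate(crop):
        out[pad_y + y][pad_x:pad_x + w] = row
    return out

def resize(img, size):
    """Nearest-neighbor resize to size x size."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def preprocess_image(raw_image, size=518):
    """Background removal, centering, and resizing, in that order."""
    return resize(center_object(remove_background(raw_image)), size)
```

Nearest-neighbor sampling keeps the sketch dependency-free; an image library with proper interpolation would be the practical choice.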

3.2 Stage 2: Hierarchical 3D Generation

Core innovation: Dual Volume Packing

Model Architecture:
  Latent Dimension: 4096 × 2
  Visual Encoder: DINOv2 ViT-g/14
  Hidden Dimension: 1536
  VAE Config: part_woenc
  Grid Resolution: 384³
  Flow Shift: 3.0
  LogitNorm: μ=1.0, σ=1.0
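
The text does not say how the `Flow Shift` and `LogitNorm` settings are used. In flow-matching models they commonly control how training timesteps are drawn, so the sketch below shows one plausible reading: sample `t` from a logit-normal distribution, then warp it with a shift map. Both the interpretation and the function names are assumptions, not the authors' confirmed procedure.

```python
import math
import random

def logit_normal(mu=1.0, sigma=1.0, rng=random):
    """Sample t in (0, 1) as sigmoid(u) with u ~ Normal(mu, sigma)."""
    u = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-u))

def apply_shift(t, shift=3.0):
    """Warp t with a common flow-shift map; shift > 1 moves t toward 1."""
    return shift * t / (1.0 + (shift - 1.0) * t)

def sample_timestep(mu=1.0, sigma=1.0, shift=3.0, rng=random):
    """Assumed combination of the two settings: sample, then shift."""
    return apply_shift(logit_normal(mu, sigma, rng), shift)
```

With a shift of 3.0, a midpoint sample of `t = 0.5` maps to `0.75`, so sampling concentrates toward one end of the trajectory. Which end corresponds to noise depends on the model's convention.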

Advantages:

  • ✅ Part-level semantic understanding
  • ✅ Complex structure decomposition
  • ✅ Facilitates later material assignment


3.3 Stage 3: AI-Driven Material Reasoning

Voxelization converts triangle meshes into 32³ voxel grids. Gemini 2.5 Flash performs two-stage reasoning:

  1. Global style understanding – infers overall artistic theme
  2. Part-level material prediction – selects Minecraft materials with justification

Example output:

{
  "material": "stone_brick",
  "reasoning": "Located at the base of the structure; stone bricks match the foundation style of ancient architecture."
}
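
The voxelization step mentioned at the start of this section is not specified further. A minimal sketch, assuming simple surface voxelization by point sampling, is shown below: each triangle is sampled on a barycentric grid and every sample marks the cell it falls in. The function name and the sampling density are illustrative choices.

```python
import math

def voxelize(triangles, resolution=32):
    """Surface-voxelize triangles into a set of (x, y, z) grid cells."""
    verts = [v for tri in triangles for v in tri]
    lo = [min(v[i] for v in verts) for i in range(3)]
    # Uniform scale from the largest axis extent, so proportions are preserved.
    extent = max(max(v[i] for v in verts) - lo[i] for i in range(3)) or 1.0
    cell = extent / resolution  # edge length of one voxel in mesh units

    def to_cell(p):
        return tuple(min(resolution - 1, int((p[i] - lo[i]) / cell))
                     for i in range(3))

    filled = set()
    for a, b, c in triangles:
        longest = max(math.dist(a, b), math.dist(b, c), math.dist(c, a))
        n = max(1, math.ceil(2 * longest / cell))  # about two samples per voxel
        for i in range(n + 1):
            for j in range(n + 1 - i):
                u, v = i / n, j / n
                w = 1.0 - u - v
                p = [u * a[k] + v * b[k] + w * c[k] for k in range(3)]
                filled.add(to_cell(p))
    return filled
```

This marks only the surface shell. Filling interiors or voxelizing each part separately would need extra steps that the text does not describe.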

A three-layer verification system ensures JSON integrity, valid block IDs, and texture existence, with auto-retry up to three times on failure.
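
The three checks and the retry loop could be sketched as follows. The block table, the texture set, the `ask_model` callable, and the fallback material are all hypothetical placeholders; the text only states what is checked and that up to three retries are made.

```python
import json

# Hypothetical lookup tables; IDs follow the appendix block list.
BLOCK_IDS = {"stone": (1, 0), "stone_brick": (98, 0), "oak_planks": (5, 0),
             "glass": (20, 0), "cobblestone": (4, 0)}
TEXTURES = set(BLOCK_IDS)  # assumed: one texture per known block

def validate_reply(raw, block_ids, textures):
    """Return the material name if the reply passes all three checks, else None."""
    try:
        reply = json.loads(raw)                 # layer 1: JSON integrity
    except json.JSONDecodeError:
        return None
    if not isinstance(reply, dict):
        return None
    material = reply.get("material")
    if not isinstance(material, str) or material not in block_ids:
        return None                             # layer 2: valid block ID
    if material not in textures:
        return None                             # layer 3: texture exists
    return material

def choose_material(ask_model, block_ids, textures, max_retries=3, fallback="stone"):
    """Query the model once, then retry up to max_retries times on failure."""
    for _ in range(1 + max_retries):
        material = validate_reply(ask_model(), block_ids, textures)
        if material is not None:
            return material
    return fallback  # assumed behavior once all retries are exhausted
```

What happens after the last failed retry is not stated in the text; returning a neutral default block is one option, raising an error is another.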


3.4 Stage 4: Schematic Generation

NBT structure:

Schematic_Format = {
    "Materials": "Alpha",
    "Width": 32, "Height": 32, "Length": 32,
    "Blocks": byte_array[32768],
    "Data": byte_array[32768],
    "Entities": [], "TileEntities": []
}

Compression ratio ≈ 3:1; typical size = 5–15 KB.
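
To make the structure above concrete, here is a minimal sketch of an encoder for the classic `.schematic` layout using only the standard library. NBT is big-endian and the file is gzip-compressed. The helper names are illustrative, and the project's actual encoder may differ.

```python
import gzip
import struct

TAG_END, TAG_SHORT, TAG_BYTE_ARRAY, TAG_STRING, TAG_LIST, TAG_COMPOUND = 0, 2, 7, 8, 9, 10

def _header(tag, name):
    """Tag type byte, then a length-prefixed UTF-8 name (big-endian)."""
    raw = name.encode("utf-8")
    return struct.pack(">bH", tag, len(raw)) + raw

def _short(name, value):
    return _header(TAG_SHORT, name) + struct.pack(">h", value)

def _string(name, text):
    raw = text.encode("utf-8")
    return _header(TAG_STRING, name) + struct.pack(">H", len(raw)) + raw

def _byte_array(name, data):
    return _header(TAG_BYTE_ARRAY, name) + struct.pack(">i", len(data)) + bytes(data)

def _empty_list(name):
    # Element type plus a zero length; no payload follows.
    return _header(TAG_LIST, name) + struct.pack(">bi", TAG_COMPOUND, 0)

def block_index(x, y, z, width, length):
    """Position of voxel (x, y, z) in the flat Blocks/Data arrays (YZX order)."""
    return (y * length + z) * width + x

def encode_schematic(width, height, length, blocks, data):
    """Return gzip-compressed NBT bytes for a .schematic file."""
    if not (len(blocks) == len(data) == width * height * length):
        raise ValueError("Blocks and Data must hold width*height*length bytes")
    body = b"".join([
        _short("Width", width), _short("Height", height), _short("Length", length),
        _string("Materials", "Alpha"),
        _byte_array("Blocks", blocks), _byte_array("Data", data),
        _empty_list("Entities"), _empty_list("TileEntities"),
        bytes([TAG_END]),
    ])
    return gzip.compress(_header(TAG_COMPOUND, "Schematic") + body)
```

Writing the returned bytes to a file with the `.schematic` extension should give something that schematic-aware tools can open. That has not been tested here against any specific tool.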


4. Experiments and Analysis

4.1 Setup

Hardware: RTX 4090 (24 GB) + i9-13900K + 64 GB RAM

Software: Python 3.10 / PyTorch 2.1 / CUDA 12.1 / Three.js r152


4.2 Quantitative Metrics

| Metric | Value | Description |
|---|---|---|
| End-to-end Time | 14.7 s | Average on RTX 4090 |
| T2I Stage | 1.8 s | Nano Banana |
| Image-to-3D Stage | 8.3 s | 3D model generation |
| Material Reasoning | 4.2 s | Gemini API |
| NBT Encoding | 0.4 s | Format conversion |
| Material Accuracy | 92.3 % | Human-evaluated |
| Part Accuracy | 96.7 % | Automatic comparison |
| Voxel Utilization | 43.2 % | Space usage |
| File Size | 8.4 KB | Compressed |
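
The voxel utilization metric is described only as "space usage". Assuming it means the fraction of grid cells holding a non-air block, it could be computed as in this sketch. That definition is an assumption, as is the function name.

```python
def voxel_utilization(blocks):
    """Fraction of grid cells that hold a non-air block (block ID 0 is air)."""
    return sum(1 for b in blocks if b != 0) / len(blocks)
```

Under that reading, 43.2 % of a 32³ grid (32,768 cells) corresponds to roughly 14,000 placed blocks.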

4.3 Prompt Best Practices

Formula:

[Style] + [Structural Feature] + [Material Hint] + [View Specification]

Good examples:

  • “majestic fantasy castle with multiple towers, symmetrical architecture, medieval stone fortress, isometric view”
  • “modern minimalist building with glass and concrete, geometric structure, clean lines”

Avoid vague or over-specific prompts.
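
The prompt formula can be sketched as a small helper that joins the four slots in order. The function name and the default view are illustrative assumptions, not part of the described system.

```python
def build_prompt(style, structure, material_hint, view="isometric view"):
    """Join [Style] + [Structural Feature] + [Material Hint] + [View Specification]."""
    parts = [style, structure, material_hint, view]
    # Skip empty slots so a missing hint does not leave a stray comma.
    return ", ".join(p.strip() for p in parts if p and p.strip())
```

For example, `build_prompt("majestic fantasy castle with multiple towers", "symmetrical architecture", "medieval stone fortress")` reproduces the first good example above.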

4.4 Ablation Study

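| Configuration | Material Acc. | Structural Score | Effect vs. Full System |
|---|---|---|---|
| Full System | 92.3 % | 4.6 / 5.0 | ⭐⭐⭐⭐⭐ |
| − Global Style | 78.1 % | 4.1 / 5.0 | −14.2 pp accuracy |
| Random Material | 34.5 % | 2.3 / 5.0 | −57.8 pp accuracy |
| 16³ Resolution | 88.7 % | 3.2 / 5.0 | −1.4 pts structural score |
| No Part Split | 81.9 % | 3.8 / 5.0 | −10.4 pp accuracy |

pp = percentage points.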

Findings:

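  • Global-style reasoning is crucial: removing it lowers material accuracy by 14.2 percentage points.
  • AI material reasoning far exceeds random assignment (+57.8 percentage points of material accuracy).
  • Higher voxel resolution improves detail fidelity: at 16³ the structural score falls from 4.6 to 3.2.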
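
The effect sizes follow from subtracting each ablated configuration from the full system. The sketch below recomputes them from the table values; the variable and function names are illustrative.

```python
FULL = (92.3, 4.6)  # material accuracy (%), structural score (out of 5)
ABLATIONS = {
    "- Global Style": (78.1, 4.1),
    "Random Material": (34.5, 2.3),
    "16^3 Resolution": (88.7, 3.2),
    "No Part Split": (81.9, 3.8),
}

def drops(full, ablations):
    """Per configuration: (accuracy drop in percentage points, score drop)."""
    return {name: (round(full[0] - acc, 1), round(full[1] - score, 1))
            for name, (acc, score) in ablations.items()}
```

For the 16³ configuration this gives a 3.6-point accuracy drop and a 1.4-point structural-score drop, so the 1.4 reported for that row refers to the structural score rather than to material accuracy.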

4.5 User Study

Participants: 15 Minecraft players (ages 18–35, avg 1200 h playtime)

Average blind-test ratings (1–5):

Visual Appeal:    4.3 ± 0.6
Structural Logic: 4.5 ± 0.5
Material Harmony: 4.1 ± 0.7
Overall Usability:4.4 ± 0.6

82 % of users indicated willingness to use generated buildings in-game.


5. Discussion

5.1 Advantages

  1. 🎯 Semantic consistency across modalities
  2. 🎨 Artistic coherence via Gemini’s global-style understanding
  3. ⚡ Real-time web-based editing
  4. 🔧 Modular, easily upgradable architecture

5.2 Limitations

| Challenge | Limitation | Impact |
|---|---|---|
| Resolution | 32³ voxel grid | Insufficient fine details |
| Material Library | Vanilla 1.12 only | Mods require adaptation |
| Compute Cost | High-end GPU required | Limited to workstations |
| Semantic Ambiguity | Complex text misinterpretation | Error in early stages |

5.3 Potential Applications

  • Game Development: level and quest scene generation
  • Education: historical reconstruction, architectural learning
  • Design: concept visualization, rapid prototyping
  • Virtual Exhibitions: museums, galleries, cultural heritage

6. Future Work

Short-Term (v2.1)

  • [ ] Multi-resolution generation (64³–128³)
  • [ ] Batch generation for city layouts
  • [ ] Style transfer from reference images
  • [ ] Natural-language editing support

Long-Term (v3.0)

  • [ ] Unified multimodal model outputting voxels directly
  • [ ] Physics-based stability simulation
  • [ ] Functional components (redstone, pistons)
  • [ ] Cross-game adaptation (Terraria, Roblox, etc.)

7. Conclusion

We propose a multimodal AI-driven Minecraft building generation system that automates the entire pipeline from natural language to playable schematic files. By combining state-of-the-art text-to-image, hierarchical 3D modeling, and intelligent material reasoning, the system achieves:

  • 92.3 % material accuracy
  • 14.7 s end-to-end latency
  • 4.4 / 5.0 user satisfaction

User studies confirm strong practicality and artistic value. The modular framework is extensible to broader procedural-content-generation domains. With advancing foundation models and hardware, this technology promises transformative applications in gaming, VR, and digital-twin creation.



Appendix

Partial Minecraft 1.12 Block List:

| Block Name | ID | Metadata | Usage |
|---|---|---|---|
| stone | 1 | 0 | Base structure |
| stone_brick | 98 | 0 | Decorative wall |
| oak_planks | 5 | 0 | Floor / roof |
| glass | 20 | 0 | Windows |
| wool (red) | 35 | 14 | Ornament |
| cobblestone | 4 | 0 | Rough wall |
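
As an illustration of how this table feeds the schematic arrays, the sketch below maps material names to `(ID, metadata)` pairs and fills flat `Blocks` and `Data` byte arrays. The dictionary keys mirror the table; the function name and the voxel-dictionary input format are assumptions.

```python
# (block ID, metadata) pairs taken from the table above.
BLOCK_TABLE = {
    "stone": (1, 0),
    "stone_brick": (98, 0),
    "oak_planks": (5, 0),
    "glass": (20, 0),
    "wool (red)": (35, 14),
    "cobblestone": (4, 0),
}

def fill_arrays(voxels, width=32, height=32, length=32):
    """Turn {(x, y, z): material_name} into flat Blocks and Data byte arrays."""
    blocks = bytearray(width * height * length)  # zero-filled, and ID 0 is air
    data = bytearray(width * height * length)
    for (x, y, z), name in voxels.items():
        i = (y * length + z) * width + x  # schematic arrays use YZX ordering
        blocks[i], data[i] = BLOCK_TABLE[name]
    return bytes(blocks), bytes(data)
```

Cells that are not listed stay at ID 0, which is air.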

Citation

```bibtex
@article{liu2025minebuilder,
  title={Minecraft Building Auto-Generation System Based on Multimodal AI: An End-to-End Framework from Natural Language to Voxelized Structures},
  author={Liu, Biao},
  journal={Laboratory of Computer Graphics and Virtual Reality},
  year={2025},
  url={https://github.com/nianxi666/mine-builder2.0}
}
```