Microsoft Releases TRELLIS.2: High-Fidelity 3D Generation via O-Voxel Architecture

Open-source release targets professional workflows with 1536³ resolution and PBR material support

PSEEDR Editorial

Microsoft has officially open-sourced TRELLIS.2, a 4-billion-parameter AI model that generates high-resolution 3D assets with physically based rendering (PBR) materials from a single image. Released in December 2025 under the MIT license, the model combines a flow-matching diffusion Transformer (DiT) with a novel sparse voxel representation to address persistent topological limitations in generative 3D workflows.

The release of TRELLIS.2 marks a significant architectural shift in image-to-3D generation, moving away from the iso-surface field methods that have dominated recent generative approaches. With its 4-billion-parameter flow-matching DiT, Microsoft has prioritized structural fidelity and topological flexibility over the purely implicit representations found in earlier models.

At the core of TRELLIS.2 is the introduction of the "O-Voxel" (Omni-Voxel) representation. Traditional 3D generation methods often rely on Signed Distance Fields (SDF) or Neural Radiance Fields (NeRF), which can struggle with complex, non-watertight geometries such as open surfaces (e.g., clothing, vegetation) or non-manifold structures. The O-Voxel system employs a "field-free" sparse voxel structure that encodes both geometry and appearance features within a unified latent space. This allows the model to reconstruct complex internal structures and intricate topologies that previous architectures frequently smoothed over or failed to render correctly.
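To make the "field-free" idea concrete, here is a minimal sketch, assuming a plain Python dict keyed by integer (x, y, z) coordinates. It is not the O-Voxel implementation: the class `SparseVoxelGrid`, the helper `open_sheet`, and the five-channel feature layout are all hypothetical. It stores only occupied voxels, each carrying geometry and appearance features side by side, and it builds a one-voxel-thick open sheet, a shape with no inside or outside for a signed distance field to assign a sign to.

```python
from dataclasses import dataclass, field


@dataclass
class SparseVoxelGrid:
    """Hypothetical sparse voxel store: only occupied cells are kept."""

    resolution: int
    # Maps (x, y, z) -> feature tuple; empty space costs nothing.
    voxels: dict = field(default_factory=dict)

    def add_voxel(self, x: int, y: int, z: int, features: tuple) -> None:
        if not all(0 <= c < self.resolution for c in (x, y, z)):
            raise ValueError(f"voxel {(x, y, z)} lies outside the grid")
        self.voxels[(x, y, z)] = features

    def occupancy_ratio(self) -> float:
        # Fraction of the dense grid that is actually stored.
        return len(self.voxels) / self.resolution ** 3


def open_sheet(resolution: int, size: int) -> SparseVoxelGrid:
    """Build a one-voxel-thick square sheet: an open, non-watertight surface."""
    grid = SparseVoxelGrid(resolution)
    z = resolution // 2
    for x in range(size):
        for y in range(size):
            # Assumed layout: (albedo_r, albedo_g, albedo_b, roughness, metallic)
            grid.add_voxel(x, y, z, (0.8, 0.2, 0.2, 0.5, 0.0))
    return grid


grid = open_sheet(resolution=64, size=32)
print(len(grid.voxels), grid.occupancy_ratio())
```

The sheet occupies 1,024 of the 262,144 cells in a 64³ grid, and nothing in the store requires it to enclose a volume, which is the property that lets a sparse representation keep clothing-like or leaf-like geometry that a signed field would have to close off or discard.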

Performance benchmarks for TRELLIS.2 indicate a focus on high-resolution output suitable for professional workflows. The model supports a maximum generation resolution of 1536³, a substantial increase over the 512³ output typical of many concurrent open-source alternatives. While initial reports cited generation speeds of 30 to 40 seconds, verified benchmarks on NVIDIA H100 hardware give a more nuanced picture. A full generation cycle at the maximum 1536³ resolution takes approximately 60 seconds, split between 35 seconds for shape generation and 25 seconds for material processing. The 30 to 40 second figure therefore matches the shape-only phase at 1536³ rather than a complete run; a full generation at the medium resolution of 1024³ is faster still, at approximately 17 seconds.
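The back-of-envelope check below uses only the figures quoted above. It shows how much larger the addressable grid becomes at each resolution and which of the reported H100 timings actually fall inside the "30 to 40 seconds" window. The dense cell counts are an upper bound, not a memory estimate: a sparse model stores only occupied voxels. The variable names are illustrative.

```python
# Dense cell counts for the grid resolutions named in the article.
dense_cells = {r: r ** 3 for r in (512, 1024, 1536)}
# How many times larger each grid is than the 512^3 baseline.
scale_vs_512 = {r: n // dense_cells[512] for r, n in dense_cells.items()}

# Approximate H100 timings reported in the article, in seconds.
reported_seconds = {
    "1536^3 full (shape + material)": 35 + 25,
    "1536^3 shape only": 35,
    "1024^3 full": 17,
}

# Which reported runs fall inside the initially cited 30-40 s range?
low, high = 30, 40
in_window = [name for name, secs in reported_seconds.items() if low <= secs <= high]

for r, n in dense_cells.items():
    print(f"{r}^3: {n:,} cells ({scale_vs_512[r]}x the 512^3 grid)")
print("Runs inside the 30-40 s window:", in_window)
```

Moving from 512³ to 1536³ multiplies the addressable grid by 27, which is why sparsity matters at the top resolution, and only the shape-only phase at 1536³ lands in the originally reported range.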

The model's architecture pairs a sparse voxel-based 3D variational autoencoder (VAE) with the flow-matching DiT. This combination lets the system denoise sparse voxel grids directly and generate PBR materials alongside geometry. That capability is critical for integrating generative assets into modern game engines and rendering pipelines, which expect distinct texture maps for roughness, metalness, and albedo rather than lighting baked into a single color texture.
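Flow-matching inference integrates a learned velocity field from noise at t = 0 to data at t = 1. The toy sketch below illustrates that sampling loop for a single voxel's material features, under heavy simplifying assumptions: there is no VAE and no network, the trained DiT is replaced by an analytic "oracle" velocity that already knows the target, and the straight-line (rectified-flow) interpolation path is assumed. TRELLIS.2 itself operates on learned latents at every occupied voxel. The names `CHANNELS`, `oracle_velocity`, and `sample` are hypothetical.

```python
import random

# Hypothetical per-voxel PBR feature layout (not TRELLIS.2's actual latent).
CHANNELS = ("albedo_r", "albedo_g", "albedo_b", "roughness", "metallic")


def oracle_velocity(x, t, target):
    """Stand-in for the trained DiT (assumption, for illustration only).

    On the straight path x_t = (1 - t) * noise + t * target, the velocity
    that reaches the target at t = 1 is (target - x_t) / (1 - t).
    A real model must predict this from x_t without seeing the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # t = 0: pure Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always < 1, so the division in oracle_velocity is safe
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = (0.8, 0.2, 0.2, 0.5, 0.0)
result = sample(target)
print(dict(zip(CHANNELS, (round(c, 6) for c in result))))
```

Because the oracle field is exactly straight, Euler integration lands on the target; with a learned field that is only approximately straight, the step count becomes a trade-off between speed and fidelity.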

Microsoft's decision to release the model weights under the MIT license places TRELLIS.2 in direct competition with proprietary solutions and other open-weights models like TripoSR and LGM. The permissive licensing structure allows for commercial integration, potentially accelerating the adoption of generative 3D tools in game development and industrial design sectors. By offering a model that resolves the "watertight" constraints of SDFs while maintaining the speed advantages of sparse representations, TRELLIS.2 addresses a specific gap in the automated 3D asset creation pipeline.
