Apple Releases SHARP: Sub-Second Monocular 3D Synthesis via Feedforward Networks
Feedforward architecture replaces iterative optimization to deliver high-fidelity 3D assets for spatial computing.
Apple researchers have open-sourced SHARP, a feedforward neural network capable of generating photorealistic 3D Gaussian Splats from a single image in under one second. Released in December 2025, the project represents a shift away from computationally expensive optimization-based methods, achieving rendering speeds exceeding 100 frames per second while significantly reducing visual error metrics compared to previous state-of-the-art models.
The release of SHARP (Sharp Monocular View Synthesis) addresses a persistent bottleneck in generative 3D computer vision: the trade-off between generation speed and visual fidelity. Traditional approaches, such as Neural Radiance Fields (NeRF) or optimization-based 3D Gaussian Splatting (3DGS), typically require minutes of iterative training per scene to produce a high-quality 3D representation. SHARP sidesteps this latency with a "single-shot feedforward neural network" architecture: a Vision Transformer (ViT) encoder processes the input image, and the model regresses the parameters of the 3D Gaussians directly from it, eliminating the need for per-scene optimization.
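To make the idea of direct parameter regression concrete, the following is a minimal sketch, not Apple's implementation, of a feedforward head that maps encoder features to per-pixel Gaussian parameters (position, scale, rotation, opacity, color) in a single pass. The parameter layout, tensor shapes, and class names are illustrative assumptions.

```python
# Minimal sketch (not from the SHARP codebase): a feedforward head that regresses
# per-pixel 3D Gaussian parameters from image features in one forward pass,
# with no per-scene optimization loop.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps an encoder feature map to per-pixel 3D Gaussian parameters."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # 3 (xyz) + 3 (log scale) + 4 (rotation quaternion) + 1 (opacity) + 3 (RGB) = 14
        self.proj = nn.Conv2d(feat_dim, 14, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        raw = self.proj(feats)                                   # (B, 14, H, W)
        xyz, log_scale, quat, opacity, rgb = torch.split(raw, [3, 3, 4, 1, 3], dim=1)
        return {
            "xyz": xyz,                                          # positions (assumed metric scale)
            "scale": log_scale.exp(),                            # strictly positive scales
            "rotation": nn.functional.normalize(quat, dim=1),    # unit quaternions
            "opacity": opacity.sigmoid(),
            "rgb": rgb.sigmoid(),
        }

# Single forward pass: in practice an encoder (e.g. a ViT backbone) produces `feats`.
feats = torch.randn(1, 256, 32, 32)                              # placeholder encoder output
gaussians = GaussianHead()(feats)
print({k: tuple(v.shape) for k, v in gaussians.items()})
```

The point of the sketch is structural: every Gaussian attribute is produced by one network evaluation, which is what allows sub-second generation, in contrast to 3DGS pipelines that iteratively fit those same attributes per scene.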
Performance metrics published in the accompanying research paper (arXiv:2512.10685) indicate a substantial leap in output quality. SHARP lowers Learned Perceptual Image Patch Similarity (LPIPS) by 25-34% and Deep Image Structure and Texture Similarity (DISTS) by 21-43% compared to prior best-performing models. This fidelity is maintained despite the system generating 3D scenes in less than one second, a speed the researchers characterize as "3 orders of magnitude faster" than optimization-based alternatives. The system supports real-time rendering at rates exceeding 100 frames per second, making it viable for interactive applications.
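For context on what the headline metric measures, the snippet below shows how LPIPS is typically computed with the reference lpips package; it is not part of the SHARP release, and the tensors are placeholders standing in for a rendered view and a ground-truth photograph.

```python
# Sketch of an LPIPS evaluation using the reference `lpips` package (not part of
# the SHARP release). Lower values mean the rendered view is perceptually closer
# to the ground-truth image.
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")          # AlexNet-based perceptual metric

# Placeholder images, normalized to [-1, 1] as the library expects.
rendered = torch.rand(1, 3, 256, 256) * 2 - 1
reference = torch.rand(1, 3, 256, 256) * 2 - 1

score = loss_fn(rendered, reference)
print(f"LPIPS: {score.item():.4f}")        # SHARP reports 25-34% lower values than prior models
```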
A critical technical differentiator for SHARP is its handling of spatial dimensions. Unlike many monocular depth estimators that provide only relative depth information, SHARP predicts "absolute scale 3D models". This capability enables precise camera motion and realistic parallax effects, which are essential for integrating generated assets into mixed-reality environments. This focus on absolute scale suggests the technology is aligned with the requirements of spatial computing platforms, where accurate depth perception is necessary for user comfort and immersion.
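To illustrate why absolute scale matters for parallax, here is a small self-contained example (not from the SHARP codebase) using an assumed pinhole camera: with metric depth, a physical camera translation maps to a definite pixel shift, whereas a relative-depth output leaves that shift ambiguous up to an unknown scale factor.

```python
# Illustration of absolute vs. relative depth (assumed pinhole camera, not SHARP code):
# with metric depth and known intrinsics, a 5 cm camera move produces a concrete
# parallax shift in pixels. Relative depth cannot pin this number down.
import numpy as np

# Assumed intrinsics: focal lengths and principal point in pixels.
fx, fy, cx, cy = 1000.0, 1000.0, 640.0, 360.0

def unproject(u, v, depth_m):
    """Pixel + metric depth -> 3D point in camera coordinates (metres)."""
    return np.array([(u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m])

def project(p):
    """3D point (metres) -> pixel coordinates."""
    return np.array([fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy])

point = unproject(800, 400, depth_m=2.0)       # a point 2 m in front of the camera
camera_shift = np.array([0.05, 0.0, 0.0])      # move the camera 5 cm to the right

parallax = project(point - camera_shift) - project(point)
print(f"pixel shift: {parallax[0]:.1f} px")    # about -25 px: the point slides left as the camera moves right
```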
Apple has released the project via the apple/ml-sharp repository on GitHub, with the code under an Apache 2.0 license and the pre-trained weights under a separate license. Notably, the documentation highlights GPU acceleration support via CUDA for video rendering paths. This inclusion facilitates adoption within the broader academic and industrial research communities, which predominantly rely on NVIDIA hardware, despite Apple's own focus on Metal for its silicon.
While the model sets a new standard for speed and quality, reliance on single-view inputs presents inherent challenges regarding occlusion handling. The system must infer geometry for unseen angles based on learned priors from the training data, a common limitation in monocular synthesis. However, the strong zero-shot generalization reported across multiple datasets suggests the ViT encoder effectively captures robust 3D structural priors.
Key Takeaways
- SHARP generates 3D Gaussian Splats from a single image in under one second, replacing minute-long optimization processes.
- The model reduces visual error metrics (LPIPS and DISTS) by over 20% compared to previous state-of-the-art methods.
- Unlike relative depth estimators, SHARP produces absolute scale 3D representations, enabling precise camera tracking.
- The architecture utilizes a Vision Transformer (ViT) encoder for direct parameter regression.
- Apple has open-sourced the project with CUDA support, targeting the broader AI research community.