GPT Image 2 dropped a week ago and devs are already pushing it way beyond standard image gen.

The standout case: Leon Lin's single-prompt 3D world generation. Not just flat renders, but actual navigable 3D environments from text. This hints at latent spatial understanding in a model that was never explicitly trained for volumetric output.
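
I haven't seen Leon's exact setup, and whatever turns the output into a navigable scene presumably happens downstream. As a sketch of just the single-prompt step, assuming the OpenAI Images API and a hypothetical `gpt-image-2` model id (the model name and prompt are my guesses, not his):

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",  # hypothetical id -- adjust to whatever ships
    prompt=(
        "Interior of a cluttered workshop, wide-angle view, "
        "consistent geometry suitable for 3D reconstruction"
    ),
    n=4,  # several samples to compare spatial coherence across
    size="1024x1024",
)

for i, img in enumerate(result.data):
    with open(f"workshop_{i}.png", "wb") as f:
        f.write(base64.b64decode(img.b64_json))
```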

Technical implications:

- Model might be encoding depth/geometry info in its latent space beyond 2D pixel distributions (testable with a linear probe; see the sketch after this list)

- Potential bridge between diffusion models and NeRF/3D Gaussian Splatting pipelines

- Opens the door to text-to-3D without a separate mesh-generation step
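
The first bullet is testable. There's published probing work on Stable Diffusion suggesting that a simple linear probe on intermediate latents can recover coarse depth, and the same experiment would transfer here. A minimal sketch with the latent/depth pairs stubbed out as random placeholders; in practice you'd extract real latents from the model and pair them with depth maps from a monocular estimator:

```python
import torch
import torch.nn as nn

# Placeholder data: swap in real latents extracted from the image model,
# paired with depth maps from a monocular estimator like MiDaS.
N, C, H, W = 256, 16, 32, 32           # assumed latent shape
latents = torch.randn(N, C, H, W)      # stand-in for model latents
depths = torch.rand(N, 1, H, W)        # stand-in for target depth maps

# A 1x1 conv is a per-position linear probe: if it predicts depth well,
# depth is linearly decodable from the latent at each spatial location.
probe = nn.Conv2d(C, 1, kernel_size=1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    pred = probe(latents)
    loss = nn.functional.mse_loss(pred, depths)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final probe MSE: {loss.item():.4f}")
```

Low probe error on held-out images would be direct evidence for that bullet; chance-level error would suggest the 3D coherence lives somewhere other than the latent grid.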

The fact that this works from a single prompt (no iterative refinement) suggests the model's learned priors include scene-composition rules that translate to 3D spatial coherence.
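
One cheap way to pressure-test that claim: export a few nearby views from the generated world, run each through an off-the-shelf monocular depth estimator, and check that the depth maps agree. A rough sketch using MiDaS via torch.hub; the view filenames are placeholders for whatever the export produces:

```python
import numpy as np
import torch
from PIL import Image

# Off-the-shelf monocular depth estimator (downloads weights on first run).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

def estimate_depth(path: str) -> np.ndarray:
    img = np.array(Image.open(path).convert("RGB"))
    with torch.no_grad():
        pred = midas(transform(img))
    return pred.squeeze().numpy()

# Placeholder paths: two nearby views exported from the generated scene,
# assumed to share the same resolution.
d0 = estimate_depth("view_000.png")
d1 = estimate_depth("view_001.png")

# Crude coherence score: correlation of relative depth across two nearby
# views. Genuinely 3D-consistent scenes should score high; "2D wallpaper"
# scenes tend to fall apart here.
corr = np.corrcoef(d0.ravel(), d1.ravel())[0, 1]
print(f"depth correlation between adjacent views: {corr:.3f}")
```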

This is the kind of emergent capability that makes you wonder what else is hiding in these foundation models. Worth digging into the prompt engineering Leon used; my guess is that specific phrasing nudges the model toward its spatial-reasoning priors.
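
Until the prompts are published, that part is guesswork. If specific phrasing really does gate the behavior, the obvious experiment is to ablate candidate spatial cues one subset at a time and score each output with something like the depth-consistency check above. A hypothetical harness (the cue phrases are my inventions, not Leon's):

```python
from itertools import combinations

BASE = "A stone courtyard with a fountain at the center"

# Candidate spatial cue phrases to ablate -- guesses, not known triggers.
SPATIAL_CUES = [
    "consistent perspective geometry",
    "fixed camera height of 1.6 meters",
    "navigable first-person viewpoint",
    "stable vanishing points",
]

def build_prompts(base: str, cues: list[str]) -> list[str]:
    """Every subset of cue phrases appended to the base prompt."""
    prompts = []
    for r in range(len(cues) + 1):
        for subset in combinations(cues, r):
            prompts.append(", ".join([base, *subset]))
    return prompts

# Generate an image per prompt, score each with the depth-consistency
# check, and see which cue phrases actually move spatial coherence.
for p in build_prompts(BASE, SPATIAL_CUES):
    print(p)
```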