Meta unveils LlamaMM 2: spatial-aware multimodal model for storyboarding and audio

AI · 6 min read

LlamaMM 2 extends the Llama family into multimodal territory, with a particular emphasis on spatial relationships in images and scenes. Its outputs include annotated storyboard frames, object bounding maps, scene descriptions with depth cues, and binaural audio snippets that convey where sounds sit in 3D space.
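To make the link between image position, depth and audio placement concrete, here is a minimal Python sketch. It is not Meta's method or output format: the `PlacedObject` fields, the 90-degree field of view and the function names are all assumptions made for illustration, and constant-power stereo panning with inverse-distance attenuation is used as a deliberately crude stand-in for true binaural rendering.

```python
import math
from dataclasses import dataclass


@dataclass
class PlacedObject:
    """Hypothetical record for one detected object (not a LlamaMM 2 schema)."""
    label: str
    x_center: float  # horizontal centre of the bounding box, 0.0 (left) to 1.0 (right)
    depth_m: float   # estimated distance from the camera, in metres


def azimuth_deg(obj: PlacedObject, horizontal_fov_deg: float = 90.0) -> float:
    """Angle of the object relative to the camera axis (pinhole camera model).

    0 is straight ahead; positive values are to the right.
    """
    half_width = math.tan(math.radians(horizontal_fov_deg / 2))
    return math.degrees(math.atan((2 * obj.x_center - 1) * half_width))


def stereo_gains(obj: PlacedObject, horizontal_fov_deg: float = 90.0):
    """Return (left, right) gains from constant-power panning and 1/distance falloff."""
    half_fov = horizontal_fov_deg / 2
    pan = (azimuth_deg(obj, horizontal_fov_deg) / half_fov + 1) / 2
    pan = min(max(pan, 0.0), 1.0)  # 0 = hard left, 1 = hard right
    attenuation = 1.0 / max(obj.depth_m, 1.0)  # objects closer than 1 m are not boosted
    left = math.cos(pan * math.pi / 2) * attenuation
    right = math.sin(pan * math.pi / 2) * attenuation
    return left, right


if __name__ == "__main__":
    door = PlacedObject(label="door", x_center=0.8, depth_m=3.0)
    print(azimuth_deg(door), stereo_gains(door))
```

An object in the centre of the frame gets equal gain in both channels, while one at the right edge is sent almost entirely to the right channel and grows quieter with distance. A real binaural renderer would also model interaural time differences and head-related filtering, which this sketch ignores.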

Meta positions the model for XR prototyping: designers can sketch scenes, prompt LlamaMM 2 for alternative compositions, and receive both visual and spatialized audio suggestions that feed directly into Unity and Unreal importer tools. The model also produces a short scene script that specifies camera movements and audio cues.
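The format of that scene script is not described, so the following Python sketch only illustrates how camera movements and audio cues might be held in one structure and merged into a single timeline before being handed to an engine importer. Every class, field name and value here (`CameraMove`, `AudioCue`, `SceneScript`, `"dolly_in"` and so on) is hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CameraMove:
    """Hypothetical camera instruction covering a time span in seconds."""
    start_s: float
    end_s: float
    move: str  # e.g. "dolly_in", "pan_left"


@dataclass
class AudioCue:
    """Hypothetical spatialized sound event."""
    at_s: float
    source: str
    azimuth_deg: float  # 0 is straight ahead, positive is to the right
    distance_m: float


@dataclass
class SceneScript:
    """Hypothetical container for one scene's camera and audio directions."""
    title: str
    camera: list = field(default_factory=list)
    audio: list = field(default_factory=list)

    def timeline(self):
        """Merge camera moves and audio cues into one list sorted by start time."""
        events = [(m.start_s, "camera", m.move) for m in self.camera]
        events += [(c.at_s, "audio", c.source) for c in self.audio]
        return sorted(events)

    def to_json(self) -> str:
        """Serialize to JSON, a plausible interchange format for importer tools."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    script = SceneScript(
        title="Alley entrance",
        camera=[CameraMove(0.0, 2.5, "dolly_in"), CameraMove(2.5, 4.0, "pan_left")],
        audio=[AudioCue(1.0, "footsteps", azimuth_deg=-30.0, distance_m=4.0)],
    )
    for event in script.timeline():
        print(event)
    print(script.to_json())
```

Keeping camera and audio events on one time axis reflects the pairing the article describes: a change of camera angle and the matching shift in acoustic position can be reviewed and revised together.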

Privacy and moderation remain front and center: Meta offers hosted and enterprise self-hosted options and provides filters for copyrighted characters and sensitive content. Several XR studios highlighted the workflow speed-up for pre-production, noting the model’s ability to iterate rapidly on camera angles and acoustic positioning.