Meta releases Llama 3 Micro for mobile inference on Quest and Android

Llama 3 Micro is a quantized model with a slimmed-down architecture, optimized for low-memory environments and for vectorized execution on mobile NPUs. According to Meta's own benchmarks, it delivers responsive chat interactions and short-form generation within battery-friendly power profiles on recent Quest hardware.

The company also shipped a mobile runtime that supports dynamic batching, streaming output, and selective offload to the cloud for heavier tasks. For developers, Meta provided an SDK with safety filters and usage quotas, intended to help constrain hallucinations and control compute costs.
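To illustrate what "selective offload" can mean in practice, here is a minimal sketch of a routing policy. It assumes that a simple token-count threshold decides when a request is heavy enough to send to the cloud. The `Request` type, the `route` function, and the `local_budget` value are all hypothetical. They are not part of Meta's SDK, and the announcement does not describe how the SDK exposes this behavior.

```python
from dataclasses import dataclass


# Hypothetical request descriptor; field names are illustrative only,
# not taken from Meta's SDK.
@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int


def route(req: Request, online: bool, local_budget: int = 1024) -> str:
    """Return "local" or "cloud" for a request.

    Assumption: a request counts as heavy when prompt tokens plus requested
    output tokens exceed local_budget, a hypothetical on-device limit.
    Offline devices always run the request locally.
    """
    total = req.prompt_tokens + req.max_new_tokens
    if not online:
        # No connectivity: the on-device model is the only option.
        return "local"
    # Online: keep light requests on device, offload heavy ones.
    return "cloud" if total > local_budget else "local"


if __name__ == "__main__":
    print(route(Request(200, 128), online=True))   # short chat turn
    print(route(Request(900, 512), online=True))   # long generation
    print(route(Request(900, 512), online=False))  # same request, offline
```

A real runtime would likely weigh more signals, such as battery level, thermal state, or measured latency, but a fixed threshold keeps the trade-off easy to see: light requests stay on the device, and heavy ones use the cloud only when it is reachable.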

Meta emphasized interoperability with the Spark AR and Horizon SDKs, so that conversational agents, in-VR NPCs, and local content generation can all be built on a single model stack. The announcement positions Llama 3 Micro as an on-device fallback for latency-sensitive applications that must keep working offline.