Vision-language-action (VLA) models are a new class of foundation models that take images and text as input and emit actions, from UI clicks to robot joint commands. This industry study compares deployment results across several labs and product teams.
What is working
Grounding language instructions in pixels lets a single model generalize to tasks it was never explicitly trained on. Teams report strong zero-shot transfer on navigation and simple manipulation.
Where it breaks
Long-horizon planning, latency budgets, and safety guarantees remain unsolved. The study recommends pairing VLA models with classical controllers for any task with physical risk.
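One way to picture that pairing is a thin classical safety layer that sits between the model and the actuators. The sketch below is a minimal illustration in Python, not a method described by the study. The names `JointLimits` and `safe_command`, the limit values, and the rule of zeroing any command that would push a joint out of range are all assumptions made for the example. The VLA model itself is left out; only its proposed joint velocities appear as input.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class JointLimits:
    """Hypothetical per-joint limits enforced by the classical layer."""
    min_pos: float  # lowest allowed joint position (rad)
    max_pos: float  # highest allowed joint position (rad)
    max_vel: float  # largest allowed speed magnitude (rad/s)


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def safe_command(
    proposed_vel: Sequence[float],
    current_pos: Sequence[float],
    limits: Sequence[JointLimits],
    dt: float,
) -> List[float]:
    """Filter velocities proposed by a learned policy (assumed interface).

    Each velocity is clamped to the joint's speed limit. If applying it
    for dt seconds would leave the position range, the joint is stopped.
    """
    safe = []
    for vel, pos, lim in zip(proposed_vel, current_pos, limits):
        vel = clamp(vel, -lim.max_vel, lim.max_vel)
        next_pos = pos + vel * dt
        if next_pos < lim.min_pos or next_pos > lim.max_pos:
            vel = 0.0  # conservative fallback: hold the joint still
        safe.append(vel)
    return safe


# Example: the first joint's command is too fast and gets clamped.
limits = [JointLimits(-1.0, 1.0, 0.5), JointLimits(-1.0, 1.0, 0.5)]
print(safe_command([2.0, -0.2], [0.0, 0.0], limits, dt=0.1))  # [0.5, -0.2]

# Example: a joint near its upper limit is stopped instead of overshooting.
print(safe_command([0.5], [0.98], limits[:1], dt=0.1))  # [0.0]
```

The point of the design is that the limits are checked by simple, auditable code regardless of what the model proposes. A production system would likely need more than this, for example acceleration limits, collision checks, and a watchdog for the latency budget.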