A new class of foundation models takes images and text as input and emits actions, from UI clicks to robot joint commands. This industry study compares deployment results across several labs and product teams.

What is working

Grounding language instructions in pixels lets a single model generalize to tasks it was never explicitly trained on. Teams report strong zero-shot transfer on navigation and simple manipulation.

image + instruction -> VLA model -> action sequence
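
The pipeline above can be read as a narrow interface: an observation goes in, a list of actions comes out. The following is a minimal sketch of that interface, not any lab's actual API. The names `Observation`, `Action`, `StubPolicy`, and `run_step` are hypothetical, and the stub stands in for a real model by always clicking the image center.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """Hypothetical input pair: pixel rows plus a text instruction."""
    image: Sequence[Sequence[int]]  # placeholder for real pixel data
    instruction: str


@dataclass(frozen=True)
class Action:
    """Hypothetical action record, e.g. a UI click or a joint command."""
    kind: str
    values: Tuple[float, ...]


class StubPolicy:
    """Stand-in for a VLA model: ignores the instruction, clicks the center."""

    def act(self, obs: Observation) -> List[Action]:
        height = len(obs.image)
        width = len(obs.image[0]) if height else 0
        return [Action("click", (width // 2, height // 2))]


def run_step(policy, image, instruction) -> List[Action]:
    """image + instruction -> policy -> action sequence."""
    return policy.act(Observation(image, instruction))


if __name__ == "__main__":
    blank = [[0] * 4 for _ in range(2)]  # 4 pixels wide, 2 pixels tall
    print(run_step(StubPolicy(), blank, "click the button"))
```

Keeping the policy behind a single `act` method means a real model could replace the stub without changing the calling code.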

Where it breaks

Long-horizon planning, latency budgets, and safety guarantees remain unsolved. The study recommends pairing VLA models with classical controllers for any task with physical risk.
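
One simple form of the pairing recommended above is a classical limiter placed between the model and the actuators. The sketch below assumes the model proposes joint position targets. The function name `safety_filter`, its parameters, and the limit values in the example are illustrative assumptions, not details from the study.

```python
from typing import List, Sequence


def safety_filter(
    proposed: Sequence[float],
    current: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    max_step: float,
) -> List[float]:
    """Bound each proposed joint target by a step limit and joint limits."""
    safe = []
    for p, c, lo, hi in zip(proposed, current, lower, upper):
        # Rate limit: never move more than max_step in a single update.
        step = max(-max_step, min(max_step, p - c))
        # Position limit: never leave the allowed joint range.
        safe.append(max(lo, min(hi, c + step)))
    return safe


if __name__ == "__main__":
    # The model asks for large moves; the filter allows only small ones.
    print(safety_filter([1.0, -2.0], [0.0, 0.0], [-0.5, -0.5], [0.5, 0.5], 0.2))
```

The filter never consults the model, so its bounds hold whatever the model outputs. It does not address long-horizon planning or latency.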