Multimodal · Industry Study · 12/29/2025 · 10 min read

Vision-Language-Action Models: An Industry Study

A cross-lab survey of how multimodal foundation models are being adapted to output actions for agents and robots, and where they break.

Brohan et al. · Multiple labs · 2025 survey

Priya Shah

Multimodal research writer

A new class of foundation models takes images and text as input and emits actions, from UI clicks to robot joint commands. This industry study compares deployment results across several labs and product teams.

What is working

Grounding language instructions in pixels lets a single model generalize across tasks it was never explicitly trained on. Teams report strong zero-shot transfer on navigation and simple manipulation.

image + instruction -> VLA model -> action sequence
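The pipeline above can be read as a function signature. The following is a minimal sketch of that interface, not any lab's actual API: `Observation`, `Action`, `stub_policy`, and `run_step` are hypothetical names introduced for illustration, and the stub uses a trivial keyword rule where a real system would run model inference.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """One model input: a camera frame or screenshot plus a text instruction."""
    image: Sequence[Sequence[float]]  # stand-in for an H x W pixel grid
    instruction: str


@dataclass(frozen=True)
class Action:
    """One model output; the payload depends on the kind of agent."""
    kind: str                  # e.g. "click" (UI agent) or "joint_delta" (robot)
    values: Tuple[float, ...]  # (x, y) pixel coordinates or per-joint deltas


# Any callable with this shape can play the "VLA model" role in the pipeline.
Policy = Callable[[Observation], List[Action]]


def stub_policy(obs: Observation) -> List[Action]:
    """Hypothetical placeholder for a VLA model (assumption: no real inference)."""
    height, width = len(obs.image), len(obs.image[0])
    if "click" in obs.instruction.lower():
        # UI-style action: click the centre of the frame.
        return [Action("click", (width // 2, height // 2))]
    # Robot-style action: a zero delta for three joints, i.e. hold position.
    return [Action("joint_delta", (0.0, 0.0, 0.0))]


def run_step(policy: Policy, image, instruction: str) -> List[Action]:
    """image + instruction -> policy -> action sequence."""
    return policy(Observation(image=image, instruction=instruction))
```

Keeping the policy behind a plain callable type means the same calling code works whether the actions are UI clicks or joint commands, which mirrors the range of outputs the study describes.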

Where it breaks

Long-horizon planning, latency budgets, and safety guarantees remain unsolved. The study recommends pairing VLA models with classical controllers for any task with physical risk.
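One way to picture that pairing is a conventional safety layer that sits between the model and the actuators. The sketch below is an assumption about what such a layer might look like, not the study's method: `SafetyLimits` and `safe_target` are hypothetical names, and a simple rate limit plus joint-range clamp stands in for a full classical controller.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SafetyLimits:
    """Hypothetical per-joint limits enforced outside the learned model."""
    lower: Sequence[float]  # minimum allowed position per joint
    upper: Sequence[float]  # maximum allowed position per joint
    max_step: float         # largest allowed change per joint per control tick


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def safe_target(
    current: Sequence[float],
    proposed: Sequence[float],
    limits: SafetyLimits,
) -> List[float]:
    """Filter a model-proposed joint target before it reaches the actuators.

    The model only proposes; this layer decides what is actually commanded.
    """
    sizes = {len(current), len(proposed), len(limits.lower), len(limits.upper)}
    if len(sizes) != 1:
        raise ValueError("current, proposed, and limits must have the same length")

    filtered = []
    for cur, tgt, low, high in zip(current, proposed, limits.lower, limits.upper):
        # Rate limit: never move more than max_step in one tick.
        step = clamp(tgt - cur, -limits.max_step, limits.max_step)
        # Range limit: never leave the joint's allowed interval.
        filtered.append(clamp(cur + step, low, high))
    return filtered
```

For example, with limits of [-1.0, 1.0] on both joints and `max_step=0.5`, a proposed jump from `[0.0, 0.75]` to `[2.0, 1.5]` is filtered to `[0.5, 1.0]`. Because the checks are deterministic and independent of the model, they can be verified separately from it.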

Citation

Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.
