Online reinforcement learning on real robots is slow and unsafe. Offline RL learns policies purely from previously collected trajectories, sidestepping the cost of live trial-and-error.

The distribution-shift problem

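The central challenge is that a learned policy may prefer actions absent from the dataset. Value estimates for those actions are unreliable because the critic was never trained on them, and bootstrapped targets can compound the resulting overestimation. The paper constrains the policy to stay close to the data distribution while still improving over it.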
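
To make the idea of "staying close to the data" concrete, here is a minimal sketch of one common way to express such a constraint: a policy loss that rewards high critic values but penalizes deviation from the logged actions. It loosely follows the TD3+BC style of behavior regularization, which is an assumption for illustration and not necessarily the method used in the paper. The function name, the argument names, and the weight alpha are all hypothetical.

```python
import numpy as np


def behavior_regularized_loss(policy_actions, dataset_actions, q_values, alpha=2.5):
    """Illustrative policy loss: improve value while staying near the data.

    policy_actions:  (N, action_dim) actions the policy proposes at dataset states
    dataset_actions: (N, action_dim) actions actually logged at those states
    q_values:        (N,) critic estimates for the proposed actions
    alpha:           hypothetical trade-off weight; larger favors value improvement
    """
    policy_actions = np.asarray(policy_actions, dtype=float)
    dataset_actions = np.asarray(dataset_actions, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    # Rescale the value term so alpha is less sensitive to the critic's magnitude.
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)

    # Squared distance between proposed and logged actions, averaged over the batch.
    bc_penalty = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=1))

    # Minimizing this raises Q while discouraging out-of-distribution actions.
    return -lam * np.mean(q_values) + bc_penalty


if __name__ == "__main__":
    logged = [[0.0], [0.0]]
    q = [1.0, 1.0]
    near = behavior_regularized_loss([[0.1], [0.1]], logged, q)
    far = behavior_regularized_loss([[1.0], [1.0]], logged, q)
    print("loss near data:", near)
    print("loss far from data:", far)
```

With equal critic values, the proposal farther from the logged actions receives the larger loss, so the policy only moves away from the data when the critic promises enough improvement to pay for the penalty. In a real system the critic values would come from a learned Q-function and the loss would be minimized by gradient descent; both are omitted here.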

Field note: offline RL shines when you already have large logs from teleoperation or scripted controllers.

Results on manipulation

On a suite of grasping and stacking tasks, the conservative offline method matches online baselines without any robot interaction during training, which is a meaningful safety and cost improvement.