Online reinforcement learning on real robots is slow and unsafe. Offline RL learns policies purely from previously collected trajectories, sidestepping the cost of live trial-and-error.
The distribution-shift problem
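The central challenge is that a learned policy may prefer actions absent from the dataset, where value estimates are unreliable. With no observed outcomes to correct them, overestimated values for such actions go unchecked, and a policy that maximizes estimated value is drawn toward exactly those errors. The paper constrains the policy to stay close to the data distribution while still improving over it.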
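To make the idea concrete, here is a minimal sketch of one simple way to keep a policy near the data: restrict the greedy choice to actions that actually appear in the dataset. This is an illustrative assumption, not the paper's method, which is not specified here. The function name `support_constrained_greedy`, the `min_count` threshold, the action names, and the value estimates are all hypothetical.

```python
from collections import Counter


def support_constrained_greedy(q_values, dataset_actions, min_count=1):
    """Return the highest-value action among those seen at least min_count times.

    q_values: dict mapping action -> estimated value for a single state.
    dataset_actions: actions logged for that state in the offline dataset.
    """
    counts = Counter(dataset_actions)  # missing actions count as 0
    supported = [a for a in q_values if counts[a] >= min_count]
    if not supported:
        raise ValueError("no action has enough support in the dataset")
    return max(supported, key=lambda a: q_values[a])


# Hypothetical value estimates for one state. "throw" never appears in the
# data, so its high estimate is an untested extrapolation.
q_values = {"grasp": 0.6, "push": 0.4, "throw": 0.9}
dataset_actions = ["grasp", "push", "grasp", "push", "push"]

unconstrained = max(q_values, key=q_values.get)
constrained = support_constrained_greedy(q_values, dataset_actions)
print(unconstrained, constrained)  # prints "throw grasp"
```

The unconstrained greedy policy picks the unseen action, while the constrained one picks "grasp". That choice stays inside the data yet differs from the most frequently logged action ("push"), which is the sense in which a constrained policy can still improve on the behavior that generated the dataset. Practical methods typically use softer constraints, such as a penalty on out-of-distribution values or a divergence term, rather than a hard count threshold.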
Results on manipulation
On a suite of grasping and stacking tasks, the conservative offline method matches online baselines while never touching the robot during training, a meaningful safety and cost improvement.