Expressive Value Learning for
Scalable Offline Reinforcement Learning

Nicolas Espinosa Dice1, Kianté Brantley2, Wen Sun1

1Cornell University, 2Harvard University

Paper | Code | Thread

Abstract

Reinforcement learning (RL) has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which limits scalability to larger base policies. We ask: how can we develop a scalable offline RL approach without relying on distillation or BPTT? We introduce Expressive Value Learning for Scalable Offline RL (EVOR), a scalable offline RL approach that integrates both expressive policies and expressive value functions. During training, EVOR learns an optimal, regularized Q-function via flow matching. At inference time, EVOR extracts a policy via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining.
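
To make the inference-time policy extraction concrete, the sketch below illustrates one simple way such a procedure can look: sample candidate actions from the expressive base policy and select among them using the learned Q-function. This is a minimal best-of-N style illustration under stated assumptions, not the paper's implementation; policy_sampler, q_function, and num_candidates are hypothetical names, and EVOR's actual accept/reject rule may differ.

import torch

@torch.no_grad()
def extract_action(policy_sampler, q_function, state, num_candidates=64):
    # Hypothetical interfaces (assumptions, not the paper's API):
    #   policy_sampler(states) -> candidate actions from the expressive
    #   (e.g., flow-matching) base policy, shape (N, action_dim)
    #   q_function(states, actions) -> scalar values, shape (N, 1)

    # Draw a batch of candidate actions from the base policy for this state.
    states = state.unsqueeze(0).expand(num_candidates, -1)   # (N, state_dim)
    actions = policy_sampler(states)                          # (N, action_dim)

    # Score every candidate with the learned Q-function.
    scores = q_function(states, actions).squeeze(-1)          # (N,)

    # Keep the highest-scoring candidate. Increasing num_candidates scales
    # inference-time search without any retraining of the base policy.
    best = scores.argmax()
    return actions[best]

Because selection happens entirely at inference time, compute can be traded for performance by raising num_candidates, which is one way to read the abstract's claim of compute-scalable search without retraining.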

Citation

@article{espinosa2025expressive,
  title={Expressive Value Learning for Scalable Offline Reinforcement Learning},
  author={Espinosa-Dice, Nicolas and Brantley, Kiant{\'e} and Sun, Wen},
  journal={arXiv preprint},
  year={2025}
}