Simple, Scalable Distributional Reinforcement Learning

Nicolas Espinosa Dice¹, Kianté Brantley², Wen Sun¹

¹Cornell University, ²Harvard University

Pre-print coming soon

Abstract

Recent work on robot foundation models has demonstrated that vision–language–action (VLA) policies are a promising avenue for generalization across diverse tasks. However, fine-tuning such models via online reinforcement learning (RL) is often infeasible due to resource constraints. Offline RL is an appealing alternative, but existing approaches typically require either backpropagation through time, which is computationally expensive and can degrade pre-trained perception and language representations, or distillation, which introduces compounding errors between the teacher and student networks. Moreover, large VLA models remain slow at inference, necessitating asynchronous decision and execution techniques such as action chunking. We introduce Scalable Distributional Reinforcement Learning (SDRL), a framework for offline fine-tuning and inference-time scaling of large VLA-based policies. Instead of directly optimizing the policy, SDRL trains a distributional reward model to estimate a distribution over discounted rewards-to-go. Using the learned reward model, SDRL approximates the optimal Q-function, enabling inference-time policy extraction and scaling through best-of-N search, without policy gradients or distillation. Our approach further incorporates action chunking to accelerate inference, yielding the best of both worlds: efficient training and scalable inference.
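To make the inference-time procedure described above concrete, the sketch below illustrates best-of-N selection over action chunks scored by a learned distributional reward model. This is a minimal illustration under our own assumptions, not the authors' released code: the class and function names (`DistributionalRewardModel`, `sample_chunks`, `best_of_n`), the categorical-support parameterization, and all hyperparameters are hypothetical placeholders standing in for the pre-trained VLA policy and the paper's trained model.

```python
# Hypothetical sketch of SDRL-style inference: score N sampled action chunks
# with a distributional reward model and execute the highest-scoring chunk.
# Names and architectures here are illustrative assumptions, not the paper's API.
import torch
import torch.nn as nn


class DistributionalRewardModel(nn.Module):
    """Categorical distribution over discounted reward-to-go for a (state, action-chunk) pair."""

    def __init__(self, obs_dim: int, act_dim: int, chunk_len: int,
                 num_bins: int = 51, v_min: float = 0.0, v_max: float = 100.0):
        super().__init__()
        # Fixed support of reward-to-go values (assumed categorical parameterization).
        self.register_buffer("support", torch.linspace(v_min, v_max, num_bins))
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * chunk_len, 256),
            nn.ReLU(),
            nn.Linear(256, num_bins),
        )

    def forward(self, obs: torch.Tensor, action_chunk: torch.Tensor) -> torch.Tensor:
        # Logits over the support, given the observation and a flattened action chunk.
        x = torch.cat([obs, action_chunk.flatten(start_dim=-2)], dim=-1)
        return self.net(x)

    def expected_value(self, obs: torch.Tensor, action_chunk: torch.Tensor) -> torch.Tensor:
        # Expected reward-to-go, used as a Q-like score for ranking candidate chunks.
        probs = torch.softmax(self.forward(obs, action_chunk), dim=-1)
        return probs @ self.support


@torch.no_grad()
def best_of_n(obs, sample_chunks, reward_model, n: int = 16):
    """Sample N action chunks from the base policy and keep the highest-scoring one."""
    chunks = sample_chunks(obs, n)                          # (n, chunk_len, act_dim)
    scores = reward_model.expected_value(obs.expand(n, -1), chunks)
    return chunks[scores.argmax()]


if __name__ == "__main__":
    obs_dim, act_dim, chunk_len = 32, 7, 8
    model = DistributionalRewardModel(obs_dim, act_dim, chunk_len)
    obs = torch.randn(obs_dim)

    # Stand-in for a pre-trained VLA policy's chunked action sampler.
    def sample_chunks(o, n):
        return torch.randn(n, chunk_len, act_dim)

    best_chunk = best_of_n(obs, sample_chunks, model, n=16)
    print(best_chunk.shape)  # torch.Size([8, 7])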