Planning, Fast and Slow: Online Reinforcement Learning with Action-Free Offline Data via Multiscale Planners

  • Chengjie Wu
  • Hao Hu
  • Yiqin Yang
  • Ning Zhang
  • Chongjie Zhang

Research output: Contribution to journal, Conference article, peer-reviewed

Abstract

The surge in volumes of video data offers unprecedented opportunities for advancing reinforcement learning (RL). This growth has motivated the development of passive RL, seeking to convert passive observations into actionable insights. This paper explores the prerequisites and mechanisms through which passive data can be utilized to improve online RL. We show that, in identifiable dynamics, where action impact can be distinguished from stochasticity, learning on passive data is statistically beneficial. Building upon the theoretical insights, we propose a novel algorithm named Multiscale State-Centric Planners (MSCP) that leverages two planners at distinct scales to offer guidance across varying levels of abstraction. The algorithm's fast planner targets immediate objectives, while the slow planner focuses on achieving longer-term goals. Notably, the fast planner incorporates pessimistic regularization to address the distributional shift between offline and online data. MSCP effectively handles the practical challenges involving imperfect pretraining and limited dataset coverage. Our empirical evaluations across multiple benchmarks demonstrate that MSCP significantly outperforms existing approaches, underscoring its proficiency in addressing complex, long-horizon tasks through the strategic use of passive data.

Original language: English
Pages (from-to): 53515-53541
Number of pages: 27
Journal: Proceedings of Machine Learning Research
Volume: 235
State: Published - 2024
Event: 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: Jul 21 2024 to Jul 27 2024
