arxiv:2210.08323

A Policy-Guided Imitation Approach for Offline Reinforcement Learning

Published on Apr 5, 2023

Authors:

Abstract

Offline reinforcement learning method that decomposes policy learning into guide and execute components, enabling stable training while achieving out-of-distribution generalization through state-compositionality.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Offline reinforcement learning (RL) methods can generally be categorized into two types: RL-based and imitation-based. RL-based methods could in principle enjoy out-of-distribution generalization but suffer from erroneous off-policy evaluation. Imitation-based methods avoid off-policy evaluation but are too conservative to surpass the dataset. In this study, we propose an alternative approach that inherits the training stability of imitation-style methods while still allowing logical out-of-distribution generalization. We decompose the conventional reward-maximizing policy in offline RL into a guide-policy and an execute-policy. During training, the guide-policy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling it where it should go so that the reward can be maximized, serving as the Prophet. By doing so, our algorithm allows state-compositionality from the dataset, rather than the action-compositionality of prior imitation-style methods. We dub this new approach Policy-guided Offline RL (POR). POR demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline RL. We also highlight the benefits of POR in terms of improving with supplementary suboptimal data and easily adapting to new tasks by changing only the guide-policy.
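
To make the guide/execute decomposition concrete, here is a minimal sketch on a toy 1-D problem. It is not the authors' implementation: the linear models, the `scores` signal, the exponential weighting with `temperature`, and all function names are assumptions chosen for illustration. The abstract only states that both policies are trained in a supervised, decoupled manner from dataset transitions.

```python
import numpy as np


def fit_linear(inputs, targets, weights=None):
    """Weighted least-squares fit of targets ~= [inputs, 1] @ W."""
    X = np.hstack([inputs, np.ones((len(inputs), 1))])
    if weights is None:
        weights = np.ones(len(inputs))
    sw = np.sqrt(weights)[:, None]  # scale rows so lstsq minimises the weighted error
    W, *_ = np.linalg.lstsq(X * sw, targets * sw, rcond=None)
    return W


def predict(W, inputs):
    X = np.hstack([inputs, np.ones((len(inputs), 1))])
    return X @ W


def train_sketch(states, actions, next_states, scores, temperature=0.2):
    # Execute-policy (assumed form): plain supervised regression a ~= pi(s, s').
    exec_W = fit_linear(np.hstack([states, next_states]), actions)
    # Guide-policy (assumed form): regression s' ~= g(s), with weights that
    # favour transitions carrying a higher score. The weighting is hypothetical.
    w = np.exp((scores - scores.max()) / temperature)
    guide_W = fit_linear(states, next_states, w)
    return guide_W, exec_W


def act(guide_W, exec_W, state):
    s = np.atleast_2d(state)
    target = predict(guide_W, s)  # guide: where to go next
    return predict(exec_W, np.hstack([s, target]))[0]  # execute: how to get there


# Toy dataset: s' = s + a, and moving right (larger a) is scored higher.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(500, 1))
actions = rng.uniform(-1.0, 1.0, size=(500, 1))
next_states = states + actions
scores = actions[:, 0]

guide_W, exec_W = train_sketch(states, actions, next_states, scores)
action = act(guide_W, exec_W, [0.0])
```

In this sketch only `guide_W` encodes where to go, so refitting the guide with different scores would change behavior without touching the execute-policy, which mirrors the adaptation claim in the abstract.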
