Title | POPO: Pessimistic Offline Policy Optimization
Authors | He Q (何强)1,2; Hou XW (侯新文)2; Liu Y (刘禹)2
Date | 2022-04
Conference Date | 23-27 May 2022
Conference Location | Singapore, Singapore
Keywords | reinforcement learning; offline optimization; out-of-distribution
DOI | 10.1109/ICASSP43922.2022.9747886 |
Abstract | Offline reinforcement learning (RL) aims to optimize a policy from large pre-recorded datasets without interaction with the environment. This setting offers the promise of exploiting diverse, static datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy deep RL methods perform poorly when facing arbitrary off-policy datasets. In this work, we show that value-based deep RL algorithms suffer from an estimation gap in the offline setting. To eliminate this gap, we propose a novel offline RL algorithm, Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function. To demonstrate the effectiveness of POPO, we perform experiments on datasets of varying quality. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming the tested state-of-the-art offline RL algorithms on benchmark tasks.
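The abstract names a pessimistic value function as the mechanism for closing the estimation gap but does not spell out a construction. The sketch below is a minimal illustration of one common way to realize pessimism in offline RL (a min-over-ensemble TD target penalized by ensemble disagreement); it is our own assumption-laden example, not the authors' POPO algorithm, and all names (`QNetwork`, `pessimistic_target`, `beta`) are hypothetical.

```python
# Hypothetical sketch of a pessimistic Bellman backup for offline RL.
# NOT the POPO implementation from the paper; it only illustrates the
# general idea of penalizing value estimates so that out-of-distribution
# actions are not overvalued (the "estimation gap" the abstract describes).

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Simple state-action value network."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def pessimistic_target(q1, q2, next_state, next_action, reward, done,
                       gamma: float = 0.99, beta: float = 0.5):
    """Pessimistic TD target: take the minimum over two target critics and
    subtract a penalty proportional to their disagreement, a cheap proxy
    for epistemic uncertainty on out-of-distribution actions."""
    with torch.no_grad():
        t1 = q1(next_state, next_action)
        t2 = q2(next_state, next_action)
        q_min = torch.min(t1, t2)
        disagreement = (t1 - t2).abs()
        pessimistic_q = q_min - beta * disagreement
        return reward + gamma * (1.0 - done) * pessimistic_q
```

The penalty term pushes the target toward a lower bound exactly where the critics disagree, which is where arbitrary offline datasets give the least coverage; beta trades off pessimism against value underestimation.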
Publisher | IEEE
Language | English
Document Type | Conference Paper
Source URL | http://ir.ia.ac.cn/handle/173211/48891
Collection | Research Center for Integrated Information Systems / Brain-Machine Fusion and Cognitive Assessment
Corresponding Author | Hou XW (侯新文)
Affiliations | 1. University of Chinese Academy of Sciences; 2. Institute of Automation, Chinese Academy of Sciences
Recommended Citation (GB/T 7714) | He Q, Hou XW, Liu Y. POPO: Pessimistic Offline Policy Optimization[C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, 23-27 May 2022. IEEE.