POPO: Pessimistic Offline Policy Optimization

doi:10.1109/ICASSP43922.2022.9747886


	He Q(何强)1,2; Hou XW(侯新文)2; Liu Y(刘禹)2
	2022-04
Conference date	23-27 May 2022
Conference location	Singapore, Singapore
Keywords	reinforcement learning; offline optimization; out-of-distribution
DOI	10.1109/ICASSP43922.2022.9747886
Abstract	Offline reinforcement learning (RL) aims to optimize a policy from large pre-recorded datasets without interacting with the environment. This setting offers the promise of using diverse, static datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy deep RL methods perform poorly when facing arbitrary off-policy datasets. In this work, we show that value-based deep RL algorithms suffer from an estimation gap in the offline setting. To eliminate the estimation gap, we propose a novel offline RL algorithm, which we term Pessimistic Offline Policy Optimization (POPO), that learns a pessimistic value function. To demonstrate the effectiveness of POPO, we perform experiments on datasets of varying quality. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming the tested state-of-the-art offline RL algorithms on benchmark tasks.
Proceedings publisher	IEEE
Language	English
Content type	Conference paper
Source URL	[http://ir.ia.ac.cn/handle/173211/48891]
Collection	Research Center for Integrated Information Systems_Brain-Machine Fusion and Cognitive Evaluation
Corresponding author	Hou XW(侯新文)
Author affiliations	1. University of Chinese Academy of Sciences; 2. Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	He Q, Hou XW, Liu Y. POPO: Pessimistic Offline Policy Optimization[C]. In:. Singapore, Singapore. 23-27 May 2022.
