Setting up a well-designed reward function has been challenging for many reinforcement learning (RL) applications. Preference-based reinforcement learning (PbRL) provides a new framework that avoids reward engineering by leveraging human preferences (e.g., preferring apples over oranges) as the reward signal. Therefore, improving the efficiency with which preference data are used becomes critical. In this work, we propose Meta-Reward-Net (MRN), a data-efficient PbRL framework that incorporates bi-level optimization for both reward and policy learning. The key idea of MRN is to adopt the performance of the Q-function as the learning target. Based on this, MRN learns the Q-function and the policy in the inner level, while in the outer level it adaptively updates the reward function according to the performance of the Q-function on the preference data. Our experiments on simulated robotic manipulation tasks and locomotion tasks demonstrate that MRN outperforms prior methods when few preference labels are available and significantly improves data efficiency, achieving state-of-the-art performance in preference-based RL. Ablation studies further demonstrate that MRN learns a more accurate Q-function than prior work and shows clear advantages when only a small amount of human feedback is available. The source code and videos of this project are released at https://sites.google.com/view/meta-reward-net.
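To make the bi-level structure concrete, here is a minimal PyTorch sketch of one such update, under illustrative assumptions that are not the released implementation: the policy is omitted for brevity, the inner level takes a single differentiable gradient step on the Q-function under the current learned reward, and the outer level scores preference segments with the one-step-updated Q-function and backpropagates a Bradley-Terry-style cross-entropy loss into the reward network. All names, shapes, and hyperparameters below are placeholders.

```python
# Minimal sketch of bi-level reward learning in the spirit of MRN (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

obs_dim, act_dim, hidden = 4, 2, 64
gamma, inner_lr = 0.99, 1e-2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

reward_net = mlp(obs_dim + act_dim, 1)   # learned reward \hat{r}_psi(s, a)
q_net = mlp(obs_dim + act_dim, 1)        # Q_theta(s, a)
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def td_loss(q_params, batch):
    # TD error of Q under the current learned reward; the next action a2 is assumed
    # to come from the (omitted) policy.
    s, a, s2, a2, done = batch
    q = functional_call(q_net, q_params, torch.cat([s, a], -1)).squeeze(-1)
    with torch.no_grad():
        q_next = functional_call(q_net, q_params, torch.cat([s2, a2], -1)).squeeze(-1)
    r = reward_net(torch.cat([s, a], -1)).squeeze(-1)  # keeps the graph to the reward params
    target = r + gamma * (1 - done) * q_next
    return ((q - target) ** 2).mean()

def preference_loss(q_params, seg0, seg1, label):
    # Outer-level objective on preference data: segments are scored with the Q-function
    # (an assumption standing in for "the performance of the Q-function on preferences").
    def score(seg):
        s, a = seg
        return functional_call(q_net, q_params, torch.cat([s, a], -1)).sum()
    logits = torch.stack([score(seg0), score(seg1)])
    return F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))

def bilevel_step(rl_batch, seg0, seg1, label):
    # Inner level: one differentiable gradient step on Q under the current reward.
    params = dict(q_net.named_parameters())
    grads = torch.autograd.grad(td_loss(params, rl_batch),
                                list(params.values()), create_graph=True)
    updated = {k: v - inner_lr * g for (k, v), g in zip(params.items(), grads)}
    # Outer level: update the reward net so the updated Q better explains the preferences.
    reward_opt.zero_grad()
    preference_loss(updated, seg0, seg1, label).backward()
    reward_opt.step()
    # Ordinary inner update of Q (and, in the full method, the policy) under the new reward.
    q_opt.zero_grad()
    td_loss(dict(q_net.named_parameters()), rl_batch).backward()
    q_opt.step()

if __name__ == "__main__":
    # Dummy data just to show the shapes; label = 0 means segment 0 is preferred.
    B, T = 32, 10
    rl_batch = (torch.randn(B, obs_dim), torch.randn(B, act_dim),
                torch.randn(B, obs_dim), torch.randn(B, act_dim), torch.zeros(B))
    seg0 = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
    seg1 = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
    bilevel_step(rl_batch, seg0, seg1, label=torch.tensor(0))
```

The one-step differentiable lookahead is a standard approximation of the outer-level gradient in bi-level optimization and is used here only to illustrate how the reward update can be driven by the Q-function's fit to preference labels; consult the paper and released code for the exact objectives and update order.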