Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational with respect to a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.
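A minimal sketch of the feedback model referenced above, under assumptions about notation not fixed in the abstract (the symbols $\beta$, $G$, $B$, and the observation sequences $o_{1:T}$ are illustrative): a Boltzmann-rational human who sees only observations, not full trajectories, compares the expected returns under their belief over which trajectories could have produced those observations.

\[
P\big(o^{(1)}_{1:T} \succ o^{(2)}_{1:T}\big)
= \frac{\exp\!\big(\beta\, \mathbb{E}_{\tau \sim B(\cdot \mid o^{(1)}_{1:T})}[G(\tau)]\big)}
       {\exp\!\big(\beta\, \mathbb{E}_{\tau \sim B(\cdot \mid o^{(1)}_{1:T})}[G(\tau)]\big)
      + \exp\!\big(\beta\, \mathbb{E}_{\tau \sim B(\cdot \mid o^{(2)}_{1:T})}[G(\tau)]\big)},
\]

where $G(\tau)$ is the return of trajectory $\tau$, $B(\cdot \mid o_{1:T})$ is the human's belief over trajectories given the observation sequence, and $\beta$ is the rationality coefficient. Because the expectation is taken over observations rather than true trajectories, a policy can raise its apparent return by shaping what the human observes, which is the mechanism behind the deception and overjustification failure cases discussed in the abstract.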