Reinforcement Learning for Finite Horizons (ReLeaF)

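Basic Information

Grant number:
EP/X021513/1
Principal investigator:
Sven Schewe
Amount:
$260,000
Host institution:
University of Liverpool
Host institution country:
United Kingdom
Project category:
Fellowship
Fiscal year:
2022
Funding country:
United Kingdom
Start and end dates:
2022 to (no data)
Project status:
Not concluded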

Source:
https://gtr.ukri.org/projects?ref=EP%2FX021513%2F1
Keywords:
Reinforcement Learning Finite Horizons ReLeaF

Project Abstract

Reinforcement learning (RL) is a technique for learning how to take actions in an initially unknown environment in order to optimise an expected outcome, which is modelled as maximising a cumulative reward. Learning algorithms whose goals are written as temporal specifications have three key ingredients: the translation from the specification to appropriate finite automata; the translation of these finite automata to reward structures, such that a strategy that obtains optimal rewards is guaranteed to provide optimal control; and a wrapper into a discounting scheme that, for appropriate parameters, ensures that a learner converges to an optimal strategy.

We will consider the RL problem for a popular specification language used in automation and motion planning, the finite-horizon linear-time temporal logic LTLf. In particular, we will study model-free RL algorithms, which are more suitable than their model-based counterparts for real-world applications where the behaviour of the environment is hard to predict. We will propose learning algorithms that translate finite-horizon LTL to reward structures, with formal guarantees of satisfying the given goals for environments modelled as Markov Decision Processes (MDPs).

We will extend our techniques to infinite-state MDPs, including variations where formal guarantees can be provided (such as countable, finitely branching MDPs), and study conditions under which our techniques provide guarantees in more general classes, such as smoothness guarantees for compact MDPs. We will complement these lines of research by looking at goals with constraints. This effectively means considering prioritised goals, where meeting safety constraints takes precedence, while other properties, such as efficiency, are treated as tie-breakers among strategies that provide the same safety guarantees.
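
The three ingredients named in the abstract (specification to automaton, automaton to reward, discounted learning) can be illustrated with a small sketch. This is not the project's method: the corridor environment, the hand-written automaton for the LTLf formula `(!hazard) U goal`, the reward of 1 on acceptance, and all parameter values are assumptions chosen only to make the idea concrete. The learner is plain tabular Q-learning over the product of environment state and automaton state.

```python
import random

# Hypothetical 1-D corridor (an assumption for illustration):
# cell 0 is a hazard, cell 4 is the goal, the agent starts in cell 2.
N_CELLS, START, HAZARD, GOAL = 5, 2, 0, 4
ACTIONS = (-1, +1)   # index 0 = move left, index 1 = move right
SLIP = 0.1           # probability that a move is reversed

# Ingredient 1: a hand-written DFA for the LTLf formula (!hazard) U goal.
# States: 0 = undecided, 1 = accepted (sink), 2 = rejected (sink).
def dfa_step(q, cell):
    if q != 0:
        return q
    if cell == GOAL:
        return 1
    if cell == HAZARD:
        return 2
    return 0

# Ingredient 2: reward 1 exactly when the DFA first enters its accepting state.
def reward(q, q_next):
    return 1.0 if (q == 0 and q_next == 1) else 0.0

def env_step(cell, action, rng):
    move = ACTIONS[action]
    if rng.random() < SLIP:
        move = -move
    return min(max(cell + move, 0), N_CELLS - 1)

# Ingredient 3: discounted, model-free learning (tabular Q-learning)
# on product states (cell, dfa_state).
def q_learning(episodes=5000, gamma=0.99, alpha=0.05, eps=0.2,
               horizon=50, seed=0):
    rng = random.Random(seed)
    Q = {((c, q), a): 0.0
         for c in range(N_CELLS) for q in range(3) for a in range(2)}
    for _ in range(episodes):
        cell, q = START, 0
        for _ in range(horizon):
            s = (cell, q)
            if rng.random() < eps:
                a = rng.randrange(2)                          # explore
            else:
                a = max(range(2), key=lambda b: Q[(s, b)])    # exploit
            cell2 = env_step(cell, a, rng)
            q2 = dfa_step(q, cell2)
            r = reward(q, q2)
            done = q2 != 0   # the episode ends once the DFA has decided
            best_next = max(Q[((cell2, q2), b)] for b in range(2))
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            cell, q = cell2, q2
    return Q

def greedy_action(Q, cell, q=0):
    return max(range(2), key=lambda b: Q[((cell, q), b)])

if __name__ == "__main__":
    Q = q_learning()
    print("greedy action at start:", ACTIONS[greedy_action(Q, START)])
```

Because the reward depends only on the automaton transition, the learner never sees the formula itself. Under these assumptions, maximising discounted reward favours reaching the goal before the hazard, and doing so quickly.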
