Developing inquisitive, model-based agents for reinforcement learning

开发好奇的、基于模型的强化学习代理

基本信息

批准号：
RGPIN-2019-06079
负责人：
White, Adam
金额：
$ 2.04万
依托单位：
University of Alberta
依托单位国家：
加拿大
项目类别：
Discovery Grants Program - Individual
财政年份：
2020
资助国家：
加拿大
起止时间：
2020-01-01 至 2021-12-31
项目状态：
已结题

来源：
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=714340
关键词：
Developing inquisitive model based agents

项目摘要

Natural agents, like animals, learn from a life-time of experience. Most artificial learning systems do not. Newborns begin life with a frenzy of learning: attempting to master their muscle twitches and make sense of their visual inputs. This knowledge is continuously reused and refined throughout life. Our current Artificial Intelligence (AI) systems are well-suited to problems with a clear cause and effect relationship between the system's decisions and the utility of those decisions. Swimming into a shark will cause a loss of life. Shooting an alien ship will increase the score. However, in problems where the consequences of a decision are significantly delayed, it is more difficult to learn this mapping. The most challenging and largely unsolved AI benchmark problems feature such delayed consequences. It is common practice for state-of-the-art systems to train for the equivalent of 30 days on each Atari game, and still achieve well-below human performance in games that feature delayed consequences. One way to deal with the problem of delayed consequences is for the AI to construct its own understanding of how the world works, usually called a model of the world. A model encodes the regularities of the world. For example a model might encode: (1) when I am lined up with a shark and I decide to fire a torpedo, the shark will disappear, and (2) if I am standing on a platform and I decide to jump down, I will end up on the ground. Given access to a model of this form, an AI can mentally simulate future situations that would result from behaving in particular ways without actually interacting with the world. Just as a human can decide where they might end-up if the took a new path down to the river. We can imagine the outcome of taking this alternative path without physically doing it, and avoid unnecessary exploration unless we decide it is valuable to do so. Model-based mental simulation can dramatically improve the efficiency of learning. The remaining question is how does the system decide how to best make use of mental simulation. People often decide to try out things they have never done before. We choose to engage in activities that are mentally and physically challenging, but not beyond our abilities. Humans are motivated by novelty, curiosity and knowledge seeking, and bored by things we already know about. Combining this idea with a model could allow an AI to simulate different ways of behaving, preferring those ways of behaving that result in reduction of uncertainty and the acquisition of new knowledge. With a model, the AI can generate its own internal feedback to focus its mental simulations. The objective of this research program is two fold: (1) to design new approaches for representing and learning models of the world, and (2) to integrate mechanisms that can guide mental simulations (planning) and exploration toward uncertainty and knowledge acquisition.

自然主体就像动物一样，从一生的经验中学习。大多数人工学习系统都没有。新生儿以疯狂的学习开始了他们的生活：试图掌握他们的肌肉抽搐并理解他们的视觉输入。这些知识在一生中不断地重复使用和完善。我们当前的人工智能 (AI) 系统非常适合解决系统决策与这些决策的效用之间具有明确因果关系的问题。游进鲨鱼会导致死亡。射击外星飞船会增加分数。然而，在决策结果明显延迟的问题中，学习这种映射会更加困难。最具挑战性且基本上尚未解决的人工智能基准问题的特点是这种延迟后果。对于最先进的系统来说，通常的做法是在每款 Atari 游戏上进行相当于 30 天的训练，但在具有延迟后果的游戏中仍能实现远低于人类的表现。处理延迟后果问题的一种方法是人工智能构建自己对世界如何运作的理解，通常称为世界模型。模型编码了世界的规律。例如，模型可能会编码：（1）当我与鲨鱼排成一排并决定发射鱼雷时，鲨鱼会消失，（2）如果我站在平台上并决定跳下去，我最终会落在地面上。如果能够访问这种形式的模型，人工智能可以在心理上模拟因特定行为方式而导致的未来情况，而无需与世界实际互动。就像一个人可以决定如果采取一条新的路径到河边他们最终会去哪里一样。我们可以想象走这条替代道路的结果，而无需实际行动，并避免不必要的探索，除非我们认为这样做是有价值的。基于模型的心理模拟可以极大地提高学习效率。剩下的问题是系统如何决定如何最好地利用心理模拟。人们经常决定尝试一些他们以前从未做过的事情。我们选择从事对精神和身体有挑战性的活动，但不超出我们的能力。人类受新奇、好奇和求知的驱使，并对已知的事物感到厌倦。将这个想法与模型相结合可以让人工智能模拟不同的行为方式，更喜欢那些能够减少不确定性和获取新知识的行为方式。通过模型，人工智能可以生成自己的内部反馈来集中其心理模拟。该研究计划的目标有两个：（1）设计新的方法来表示和学习世界模型，（2）整合可以指导心理模拟（规划）和探索不确定性和知识获取的机制。

项目成果

期刊论文数量（0）

专著数量（0）

科研奖励数量（0）

会议论文数量（0）

专利数量（0）

数据更新时间：{{ journalArticles.updateTime }}

DOI：
{{ item.doi }}
发表时间：
{{ item.publish_year }}
期刊：
{{ item.journal_name }}
影响因子：
{{ item.factor }}
作者：
{{ item.authors }}
通讯作者：
{{ item.author }}

数据更新时间：{{ journalArticles.updateTime }}

作者：
{{ item.author }}

数据更新时间：{{ monograph.updateTime }}

作者：
{{ item.author }}

数据更新时间：{{ sciAawards.updateTime }}

作者：
{{ item.author }}

数据更新时间：{{ conferencePapers.updateTime }}

作者：
{{ item.author }}

数据更新时间：{{ patent.updateTime }}

White, Adam其他文献

Questioning Anglocentrism in plural policing studies: Private security regulation in Belgium and the United Kingdom

DOI：
10.1177/14773708211014853
发表时间：
2021-05-12
期刊：
EUROPEAN JOURNAL OF CRIMINOLOGY
影响因子：
1.9
作者：
Leloup, Pieter;White, Adam
通讯作者：
White, Adam

Multi-timescale nexting in a reinforcement learning robot

DOI：
10.1177/1059712313511648
发表时间：
2014-04-01
期刊：
ADAPTIVE BEHAVIOR
影响因子：
1.6
作者：
Modayil, Joseph;White, Adam;Sutton, Richard S.
通讯作者：
Sutton, Richard S.

From eye-blinks to state construction: Diagnostic benchmarks for online representation learning.

DOI：
10.1177/10597123221085039
发表时间：
2023-03
期刊：
ADAPTIVE BEHAVIOR
影响因子：
1.6
作者：
Rafiee, Banafsheh;Abbas, Zaheer;Ghiassian, Sina;Kumaraswamy, Raksha;Sutton, Richard S.;Ludvig, Elliot A.;White, Adam
通讯作者：
White, Adam