CIF: Small: Accelerating Stochastic Approximation for Optimization and Reinforcement Learning

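Basic Information

Award number:
2306023
Principal investigator:
Sean Meyn
Amount:
$600,000
Host institution:
University of Florida
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2023
Funding country:
United States
Project period:
2023-07-01 to 2026-06-30
Project status:
Ongoing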

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2306023&HistoricalAwards=false
Keywords:
CIF Small Accelerating Stochastic Approximation

Project Abstract

This project concerns the design and analysis of recursive algorithms, which have broad applications in engineering and computer science. Recursive algorithms play a crucial role in machine learning systems such as ChatGPT, which rely on large amounts of data for training. Reinforcement learning uses recursive algorithms to train computer programs; among the most famous examples are programs that excel at games such as Go and chess. Training is interpreted as "learning" optimal responses (e.g., the next move) based on observations (e.g., the current configuration of a chessboard). While stochastic approximation is recognized as a mathematical model for recursive algorithms and plays a major role in the mathematical theory of learning, the supporting theory has not kept pace with empirical success. In reinforcement learning, it is often uncertain whether training will be successful or how much training is required. Along with fundamental research to create new foundations for algorithmic learning, the project involves graduate student mentoring, dissemination of new and existing research results through online video lectures, and dissemination through the Workshop on Cognition and Control, organized by the investigator and held annually at the University of Florida, which attracts speakers from across the U.S. and abroad.

Techniques will be developed to ensure stability and accelerate convergence of stochastic approximation algorithms in terms of transients and variance. New approaches to algorithm design will include techniques based on ordinary differential equation methods, recent theory of Markov processes, and approaches to learning based on quasi-random exploration. Much of the work in algorithm design reduces to a feedback control problem, initially posed in continuous time to leverage concepts from nonlinear control and stability theory. A remarkable example is the Newton-Raphson flow, which is globally convergent under mild assumptions. A dependable "algorithmic feedback law" in continuous time is then translated into a reliable and efficient algorithm implemented in discrete time. The general theory will be developed within two specific application areas: reinforcement learning and gradient-free optimization. Reinforcement learning presents the greatest challenge because, to date, there is little theory available to establish the stability of these recursive algorithms outside of very special cases. Moreover, in recent work the investigator and his students have shown that Markovian memory can result in very slow convergence, even when the algorithm is optimized; in such cases it is necessary to change the algorithmic goal without negatively impacting the quality of the final solution delivered by the algorithm. In reinforcement learning, the primary objective is to efficiently learn an effective rule for decision making (i.e., a policy). Fortunately, there is great freedom in choosing a criterion of fit for learning the best policy within a given class.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
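
As a rough illustration of the Newton-Raphson flow mentioned in the abstract, the sketch below integrates the scalar ODE d/dt theta = -f(theta) / f'(theta) with a forward-Euler step. Along the exact flow the residual satisfies f(theta_t) = exp(-t) * f(theta_0), which is the source of the global convergence property. The test function f, the step size, the time horizon, and all function names are assumptions chosen for illustration. This is a deterministic toy example and not one of the project's algorithms, which address the stochastic setting.

```python
import math


def f(theta):
    # Hypothetical test function; its unique real root is theta* = 1.
    return theta ** 3 + theta - 2.0


def df(theta):
    # Derivative of f; strictly positive, so the flow is defined everywhere.
    return 3.0 * theta ** 2 + 1.0


def newton_raphson_flow(theta0, step=0.01, horizon=10.0):
    """Forward-Euler discretization of d/dt theta = -f(theta) / f'(theta)."""
    theta = theta0
    n_steps = int(round(horizon / step))
    for _ in range(n_steps):
        # Discrete-time counterpart of the continuous-time feedback law.
        theta -= step * f(theta) / df(theta)
    return theta


theta0 = 5.0
theta_T = newton_raphson_flow(theta0)

# Along the exact flow, f(theta_t) = exp(-t) * f(theta_0).
predicted_residual = math.exp(-10.0) * f(theta0)
print(theta_T, f(theta_T), predicted_residual)
```

Setting step=1.0 in this discretization recovers the classical Newton-Raphson iteration; smaller steps track the continuous-time flow more closely at the cost of more iterations.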
