CIF: Small: Accelerating Stochastic Approximation for Optimization and Reinforcement Learning

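Basic Information

Award number:
2306023
Principal investigator:
Sean Meyn
Amount:
$600,000
Host institution:
University of Florida
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2023
Funding country:
United States
Start and end dates:
2023-07-01 to 2026-06-30
Project status:
Ongoing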

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2306023&HistoricalAwards=false
Keywords:
CIF Small Accelerating Stochastic Approximation

Project Abstract

This project concerns the design and analysis of recursive algorithms, which have broad applications in engineering and computer science. Recursive algorithms play a crucial role in machine learning systems such as ChatGPT, which rely on large amounts of data for training. Reinforcement learning uses recursive algorithms to train computer programs; the most famous examples include programs that excel at games such as Go and chess. Training is interpreted as "learning" optimal responses (e.g., the next move) based on observations (e.g., the current configuration of a chessboard). While stochastic approximation is recognized as a mathematical model for recursive algorithms and plays a major role in the mathematical theory of learning, the supporting theory has not kept pace with empirical success. In reinforcement learning, it is often uncertain whether training will be successful or how much training is required. Along with fundamental research to create new foundations for algorithmic learning, the research project also involves graduate student mentoring, dissemination of new and existing research results through online video lectures, and dissemination through the Workshop on Cognition and Control organized by the investigator, which is held annually at the University of Florida and attracts speakers from across the U.S. and abroad.
Techniques will be developed to ensure stability and accelerate convergence of stochastic approximation algorithms in terms of transients and variance. New approaches to algorithm design will include techniques based on ordinary differential equation methods, recent theory of Markov processes, and approaches to learning based on quasi-random exploration. Much of the work in algorithm design reduces to a feedback control problem, initially posed in continuous time to leverage concepts from nonlinear control and stability theory. A remarkable example is the Newton-Raphson flow, which is globally convergent under mild assumptions. A dependable "algorithmic feedback law" in continuous time is then translated into a reliable and efficient algorithm implemented in discrete time. The general theory will be developed within two specific application areas: reinforcement learning and gradient-free optimization. Reinforcement learning presents the greatest challenge because, to date, there is little theory available to establish the stability of these recursive algorithms outside of very special cases. Moreover, in recent work the investigator and his students have shown that Markovian memory can result in very slow convergence, even when the algorithm is optimized; in such cases it is necessary to change the algorithmic goal without negatively impacting the quality of the final solution delivered by the algorithm. In the case of reinforcement learning, the primary objective is to efficiently learn an effective rule for decision making (i.e., a policy). Fortunately, there is great freedom in choosing a criterion of fit for learning the best policy within a given class.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
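The step from a continuous-time feedback law to a discrete-time algorithm can be illustrated with a minimal sketch. The Newton-Raphson flow for a root-finding problem f(θ) = 0 is commonly written as dθ/dt = -[∂f(θ)]⁻¹ f(θ), along which f(θ_t) = e^{-t} f(θ_0). The code below applies a plain Euler discretization to a scalar example. The function names, the test function, the step size, and the number of steps are all assumptions chosen for illustration; none of them comes from the project itself, and the sketch is deterministic (no noise or Markovian memory).

```python
def newton_raphson_flow_euler(f, df, theta0, step=0.1, n_steps=200):
    """Euler discretization of the scalar flow d(theta)/dt = -f(theta) / f'(theta).

    f      : function whose root is sought (illustrative assumption)
    df     : derivative of f, assumed nonzero along the trajectory
    theta0 : initial condition
    step   : Euler step size (hypothetical choice)
    """
    theta = theta0
    for _ in range(n_steps):
        # One Euler step of the continuous-time "algorithmic feedback law".
        theta -= step * f(theta) / df(theta)
    return theta


# Hypothetical test problem: f(theta) = theta^3 + theta - 2 has its only real root at theta = 1.
def f(theta):
    return theta**3 + theta - 2.0


def df(theta):
    return 3.0 * theta**2 + 1.0


theta_hat = newton_raphson_flow_euler(f, df, theta0=5.0)
print("estimate:", theta_hat, "residual:", f(theta_hat))
```

With step = 1 the recursion reduces to the classical Newton-Raphson iteration; a smaller step follows the continuous-time trajectory more closely at the cost of more iterations.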