MCS: AF: Small: Algorithms for Large Scale Prediction Problems


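Basic information

Award number:
1115788
Principal investigator:
Peter Bartlett
Amount:
$350,000
Host institution:
University of California-Berkeley
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2011
Funding country:
United States
Start and end dates:
2011-07-15 to 2015-06-30
Project status:
Completed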

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1115788&HistoricalAwards=false
Keywords:
MCS AF Small Algorithms Large

Project abstract

In large scale prediction problems that arise in many application areas, data is plentiful, and it is computational resources that constrain the performance of prediction methods. The broad goal of this research project is the design and analysis of methods for large scale prediction problems that make effective use of limited computational resources. The main aims are: to improve our understanding of the tradeoff between the accuracy of a prediction method and its computational requirements; to develop model selection methods that adaptively choose the model complexity to give the best predictive accuracy for the available computational resources; to improve our understanding of the difficulty of solving large scale prediction problems using distributed computational resources; to develop analysis techniques and methods for asynchronous online prediction, which exploit the flexibility to respond to queries out of order; and hence to develop effective methods for large scale prediction problems.
As data acquisition and storage have become cheaper, enormous data sets have become available in many areas, including web information retrieval, the biological, medical, and physical sciences, manufacturing, finance and retail. Consequently, for many statistical prediction problems, the amount of data available is so huge that we can treat it as unlimited. For instance, in using image and caption data to train a prediction rule that can automatically choose appropriate labels for images, the web provides an effectively unlimited supply of training data. Similar situations arise in using click stream data to predict the choices of visitors to a popular web site, or in using customers' ratings of movies to make useful recommendations. For these large scale prediction problems, the bottleneck to performance is not the amount of data but the computational resources that are available.
Many modern prediction methods have been designed and analyzed from the perspective that data is precious: they aim for optimal predictive accuracy for a given sample size. But for large scale problems, this is the wrong perspective; computation is the precious resource that must be used wisely. This shift in perspective introduces some novel tradeoffs. One of the most important tradeoffs arises in choosing the complexity of a prediction rule. Should we use our computational resources trying to optimize over a very complex family of prediction rules, which would not allow us to gather much data? Or should we save computation by using simpler prediction rules, and instead spend this computation on gathering more data? This research project is aimed at improving our understanding of these tradeoffs, and hence developing strategies for large scale prediction problems that best exploit the available computational resources.
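The complexity-versus-data tradeoff described above can be made concrete with a toy calculation. The sketch below is not taken from the project. It assumes a hypothetical cost model in which processing one example with a rule of complexity d costs d compute units, and an illustrative error bound of the form 1/d (approximation error) plus sqrt(d/n) (estimation error). Under those assumptions it picks the complexity that minimizes the bound for a fixed compute budget.

```python
import math

def best_complexity(budget, complexities):
    """Pick the model complexity minimizing a toy error bound under a compute budget.

    Assumptions (illustrative only, not from the project abstract):
      - processing one example with a rule of complexity d costs d units,
        so a budget of B units affords n = B // d examples;
      - the error bound is 1/d (approximation) + sqrt(d / n) (estimation).
    Returns a tuple (d, n, bound), or None if no complexity is affordable.
    """
    best = None
    for d in complexities:
        n = budget // d  # number of examples affordable at complexity d
        if n == 0:
            continue  # this complexity is too expensive for even one example
        bound = 1.0 / d + math.sqrt(d / n)
        if best is None or bound < best[2]:
            best = (d, n, bound)
    return best

if __name__ == "__main__":
    candidates = [1, 4, 16, 64, 256]
    for budget in (2**16, 2**24):
        d, n, bound = best_complexity(budget, candidates)
        print(f"budget={budget}: complexity={d}, examples={n}, bound={bound:.4f}")
```

In this toy model a larger budget favors a more complex rule, while a small budget is better spent on a simpler rule trained on more examples per unit of compute. The real tradeoff depends on the actual cost and error behavior of the prediction method, which this sketch does not attempt to model.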
