CAREER: Towards a Big Data Application Server Stack

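Basic information

Award number:
1351047
Principal investigator:
Tyson Condie
Amount:
$464,700
Institution:
University of California-Los Angeles
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2014
Funding country:
United States
Period:
2014-02-01 to 2019-01-31
Status:
Completed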

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1351047&HistoricalAwards=false
Keywords:
CAREER Towards Big Data Application

Project abstract

Google's MapReduce inspired much of the work in Big Data analytics and has served as a template for open-source systems like Apache Hadoop. The MapReduce programming model has wide applicability, but widespread adoption has exposed some limitations, such as the lack of support for iteration (which is common in machine learning algorithms), stream processing, graph analytics, and real-time and interactive queries. Beyond the programming framework, the underlying implementation offers a template for how to scale out massively distributed computations: break them up into small tasks that can be carried out in parallel by partitioning the underlying data, and save intermediate state to mitigate the impact of partial failures (which must be planned for when running on large clusters). The challenge, then, is to build implementations of other programming frameworks (e.g., SQL and machine learning) that share the scale-out and fault-tolerance runtime characteristics of MapReduce without imposing its limitations. Resource managers such as Apache Hadoop YARN, Google Omega, and Berkeley Mesos take a first step in this direction by separating resource allocation from the details of higher-level programming models and languages. Resource managers multiplex several jobs on the same underlying machine cluster, thereby increasing utilization and fostering clean-slate software stacks. When the task executing in a container (a slice of a single machine's resources: CPU/GPU, memory, disk) is finished, the container is returned to the resource manager, where it is made available to other jobs. Unlike in higher-level stacks, a container is a blank-slate process, designed to host arbitrary computations. This project prescribes further reusable software layers that capture issues such as: How many resources should I dedicate to a job? What are the redundant code pathways, and can I provide them in a reusable library? What are the right language and runtime abstractions? Exploring these questions in the context of systems like MapReduce and related SQL implementations, ML toolkits, storage systems, and messaging systems, on next-generation resource managers, is the primary focus of our work.

The goal is to unify a suite of large-scale data processing tasks on a single runtime layer, built on modern resource managers (the cloud operating systems). Our results will factor out commonalities in specialized systems and provide them in a single underlying runtime system, shortening the time to "market" for the next ready-to-use Big Data toolkit, which in turn would increase the availability of such tools to the broader community. Experience gained by implementing and deploying applications at scale, over next-generation resource managers, could help inform critical design choices in the development of future cloud computing platforms, and hence impact a broad range of scientific, engineering, national security, healthcare, and business applications. The project offers enhanced opportunities for research-based advanced training of graduate and undergraduate students, including members of groups that are currently under-represented in computer science, in databases, machine learning, and cloud computing.
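The scale-out template described in the abstract (partition the data, run small tasks in parallel, save intermediate state so a partial failure does not force a full recomputation) can be illustrated with a minimal single-machine sketch. This is not code from the project: the function names, the word-count workload, the JSON checkpoint files, and the thread pool standing in for cluster workers are all assumptions made for illustration.

```python
import json
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def partition(records, num_partitions):
    """Split the input into slices, one per task (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def run_map_task(task_id, records, checkpoint_dir):
    """Count words in one partition and save the result as intermediate state.

    If a checkpoint for this task already exists (for example, the job is
    re-run after a partial failure), the saved result is reused.
    """
    checkpoint = Path(checkpoint_dir) / f"map-{task_id}.json"
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())
    counts = Counter(word for line in records for word in line.split())
    checkpoint.write_text(json.dumps(counts))
    return dict(counts)


def reduce_counts(partials):
    """Merge the per-partition counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def word_count(records, num_partitions, checkpoint_dir):
    """Run one map task per partition in parallel, then reduce."""
    parts = partition(records, num_partitions)
    # Assumption: a thread pool stands in for workers on a cluster.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [pool.submit(run_map_task, i, part, checkpoint_dir)
                   for i, part in enumerate(parts)]
        partials = [f.result() for f in futures]
    return reduce_counts(partials)


if __name__ == "__main__":
    lines = ["big data", "data stack", "big big stack"]
    with tempfile.TemporaryDirectory() as tmp:
        print(word_count(lines, num_partitions=2, checkpoint_dir=tmp))
```

Running `word_count` a second time against the same checkpoint directory skips every map task whose output was already saved, which is the property the abstract relies on when it says intermediate state mitigates partial failures.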

Project outcomes

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Tyson Condie

I do declare: consensus in a logic language

DOI:
Published:
2010
Journal:
OPSR
Impact factor:
0
Authors:
P. Alvaro; Tyson Condie; Neil Conway; J. Hellerstein; Russell Sears
Corresponding author:
Russell Sears

REEF: Retainable Evaluator Execution Framework

DOI:
10.14778/2536274.2536318
Published:
2013
Journal:
Proc. VLDB Endow.
Impact factor:
0
Authors:
Byung; Tyson Condie; C. Curino; Raghu Ramakrishnan; Russell Sears; Markus Weimer
Corresponding author:
Markus Weimer

Declarative Systems

DOI:
Published:
2011
Journal:
Impact factor:
0
Authors:
Tyson Condie
Corresponding author:
Tyson Condie

mCerebrum and Cerebral Cortex: A Real-time Collection, Analytic, and Intervention Platform for High-frequency Mobile Sensor Data

DOI:
Published:
2017
Journal:
American Medical Informatics Association Annual Symposium
Impact factor:
0
Authors:
T. Hnat; Syed Monowar Hossain; Nasir Ali; Simona Carini; Tyson Condie; I. Sim; M. Srivastava; Santosh Kumar
Corresponding author:
Santosh Kumar