DC: Small: Collaborative Research: DARE: Declarative and Scalable Recovery

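Basic Information

Award number:
1017073
Principal investigator:
Andrea Arpaci-Dusseau
Amount:
$190,000
Host institution:
University of Wisconsin-Madison
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2010
Funding country:
United States
Project period:
2010-09-15 to 2013-08-31
Project status:
Completed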

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1017073&HistoricalAwards=false
Keywords:
DC Small Collaborative Research DARE

Project Abstract

One dominant characteristic of today's large-scale computing systems is the prevalence of large storage clusters. Storage clusters at the scale of hundreds or thousands of commodity machines are increasingly being deployed. At companies like Amazon, Google, Yahoo, and others, thousands of nodes are managed as a single system.

While large clusters have brought many benefits, they also bring a new challenge: a growing number and frequency of failures that must be managed. Bits, sectors, disks, machines, racks, and many other components fail. With millions of servers and hundreds of data centers, there are millions of opportunities for these components to fail. Failing to deal with failures will directly impact the reliability and availability of data and jobs.

Unfortunately, data-loss stories continue to appear. For example, in March 2009, Facebook lost millions of photos due to simultaneous disk failures that "should" rarely happen at the same time (but did); in July 2009, a large bank was fined a record total of 3 million pounds after losing data on thousands of its customers; more recently, in October 2009, T-Mobile Sidekick, which uses Microsoft's cloud service, also lost its customer data. These incidents show that existing large-scale storage systems remain fragile in the face of failures.

To address the challenges of large-scale recovery, the goal of this project is to: (1) identify the fundamental problems of recovery in today's scalable world of computing, (2) improve the reliability, performance, and scalability of existing large-scale recovery, and (3) explore formally grounded languages to empower rigorous specification of recovery properties and behaviors. Our vision is to build systems that "DARE to fail": systems that deliberately fail themselves, exercise recovery routinely, and enable easy and correct deployment of new recovery policies.

For more information, please visit this website: http://boom.cs.berkeley.edu/dare/
