III: Large: Collaborative Research: Web Archive Cooperative

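Basic Information

Award number:
1009916
Principal investigator:
Hector Garcia-Molina
Amount:
$2.3505 million
Host institution:
Stanford University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2010
Funding country:
United States
Period:
2010-08-01 to 2015-07-31
Project status:
Completed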

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1009916&HistoricalAwards=false
Keywords:
III Large Collaborative Research Web

Project Abstract

Web Science is an emerging discipline that studies the Web: how human activity is shaped by Web interactions, how the Web can benefit society, and how Web technologies can be improved. Central to Web Science is access to data that records the history of the Web, as well as data that records human activity (e.g., posed queries, tagged pages, Twitter updates). It is currently very difficult for academic researchers to obtain such Web data because it is hard to locate, it is fragmented across diverse sites, and is recorded using inconsistent formats and strategies. This project will build a Web Archive Cooperative (WAC) that will integrate existing archives (repositories of Web data), making it feasible to access large volumes of data in a simplified fashion. The WAC will be a virtual service, providing search facilities and access mechanisms to existing resources. These resources will not just be Web pages, but all types of available Web information, such as query logs, tag annotations, blogs, profiles and Twitter updates. Furthermore, resources will also include the software tools for building and managing Web archives.
The project will explore three goals for a resource discovery service: (1) the manual or automated discovery of entire existing Web related archives; (2) the selection among known archives of the ones that support a specific research question; and (3) the identification of individual resources from within the selected archives. Tools for characterizing discovered archives, especially for the case where the archive does not provide rich descriptive metadata, will also be developed. Characterization of an archive includes elements such as an estimate of the archive's coverage, particulars of the crawling parameters, like dates/frequencies, crawl duration, depth, per-site ceiling on the number of collected pages, content statistics, and link structure.
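As a rough illustration of the characterization elements listed above, the following Python sketch derives a few of them (capture date range, largest per-site page count, content statistics, link count, and a coverage estimate) from a list of page records. The `PageRecord` layout, the `characterize` function, and the idea of estimating coverage against a reference sample of URLs are all assumptions made for this example; the abstract does not specify how the WAC tools compute these values.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlsplit


@dataclass(frozen=True)
class PageRecord:
    """One captured page; a hypothetical record layout, not a WAC format."""
    url: str
    captured: date
    content_type: str
    outlinks: tuple = ()


def characterize(records, reference_urls):
    """Summarize an archive from its page records.

    reference_urls is an assumed sample of URLs the archive is expected
    to hold; coverage is the fraction of that sample actually present.
    """
    urls = {r.url for r in records}
    pages_per_site = Counter(urlsplit(r.url).netloc for r in records)
    dates = [r.captured for r in records]
    reference = set(reference_urls)
    return {
        "pages": len(records),
        "first_capture": min(dates) if dates else None,
        "last_capture": max(dates) if dates else None,
        # Largest number of pages collected from any single site.
        "max_pages_per_site": max(pages_per_site.values(), default=0),
        "content_types": dict(Counter(r.content_type for r in records)),
        "link_count": sum(len(r.outlinks) for r in records),
        "coverage": len(urls & reference) / len(reference) if reference else None,
    }
```

A summary like this could stand in for descriptive metadata when an archive does not publish any, although real crawl duration, depth, and frequency would need information that this simplified record does not carry.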
Mechanisms for integrating diverse archives will be developed, and the mechanisms will be applied to site reconstruction (from various archives) and archive views (a logical fusion of resources from multiple sources). Since integration issues are so challenging, an experimental testbed will be set up with small but diverse resources. The testbed will contain several crawls of the same target sites, each obtained with different crawlers and using different parameters. The testbed will also contain related resources. Storage trading schemes will be developed, allowing members to trade local backup space for remote space. A Web archive replication tool will be developed based on existing notions for self-preserving objects. Alternatives for replica synchronization will be studied.
Workshops to bring together key Web Science researchers will be organized to discuss available resources and impediments to sharing. These workshops will drive research and identify needed tools and protocols. With small groups of participants, challenge problems will be established, e.g., combining a set of Web archives. Reports of these results at future workshops can incentivize others to participate in the WAC. In addition, an Advisory Board of industrial, government, and academic experts has been set up to guide the project. A Summer Institute for Web Science graduate students will be held. At this Institute, students will learn to use the latest tools and will learn from each other's experiences in dealing with Web data. In addition, a one-day workshop will be developed, to be offered at Web Science conferences (WWW, SIGIR, etc.) to educate participants about WAC resources. An undergraduate Web Sciences track for computer science majors will be set up, taking advantage of WAC resources.
The project will have impact in two ways. First, it will provide tools and services that facilitate access to Web resources. Any researcher, from a computer scientist studying efficient Web search, to a social scientist studying how human beliefs are changing today, to a historian studying how the early Web evolved, to a biologist understanding how disease spreads, will benefit from the work. Second, the project motivates students and young researchers to stay in academia. Currently top talent is flowing to industry because only they have comprehensive Web data, and it is so hard to do significant Web Science at universities. The WAC can provide an alternative, attracting more researchers and teachers to this important area.
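The site reconstruction and archive view ideas in the abstract can be pictured with a small Python sketch. It assumes each archive is a list of `(url, capture_date, content)` tuples and that a reconstructed site simply takes, for every URL, the capture closest in time to a target date across all archives. Both the data layout and this selection rule are hypothetical choices for illustration; the abstract does not describe the project's actual integration mechanisms.

```python
from datetime import date


def reconstruct_site(archives, target):
    """Pick, for each URL, the capture closest in time to `target`.

    archives maps a hypothetical archive name to a list of
    (url, capture_date, content) tuples.
    Returns {url: (archive_name, capture_date, content)}.
    On a tie, the capture seen first is kept.
    """
    best = {}
    for name, captures in archives.items():
        for url, captured, content in captures:
            gap = abs((captured - target).days)
            current = best.get(url)
            if current is None or gap < current[0]:
                best[url] = (gap, name, captured, content)
    # Drop the internal gap value before returning the view.
    return {url: entry[1:] for url, entry in best.items()}


# Two crawls of the same (made-up) site, taken at different times.
archives = {
    "crawl_a": [
        ("http://x.example/", date(2010, 1, 1), "old home"),
        ("http://x.example/about", date(2010, 6, 1), "about"),
    ],
    "crawl_b": [
        ("http://x.example/", date(2010, 5, 20), "new home"),
    ],
}
view = reconstruct_site(archives, date(2010, 6, 1))
```

Here `view` draws the home page from `crawl_b` and the about page from `crawl_a`, which is the sense in which an archive view is a logical fusion of resources from several sources rather than a copy of any one archive.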