EFFICIENT SPECTRAL APPROACHES FOR FINDING UNDERLYING STRUCTURES IN BIG DATA


Basic Information

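Grant number:
9278252
Principal investigator:
Yuval Kluger
Amount:
$0.41 million
Host institution:
YALE UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2016
Funding country:
United States
Project period:
2016-05-24 to 2019-04-30
Project status:
Completed

Source: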
https://reporter.nih.gov/project-details/9278252
Keywords:
Algorithmic Analysis Algorithms Architecture Big Data Binding Bioinformatics Biological Categories Cells ChIP-seq Clinical Research Computer software Confusion Data Data Analyses Data Analytics Data Discovery Data Set Databases Detection Development Dimensions Employee Strikes Event Excision Future Genomics Goals Knowledge Discovery Language Lead Massive Parallel Sequencing Mathematics Memory Methods Microprocessor Nucleotides Numeric Rating Scale Pattern Performance Randomized Research Running Science Signal Transduction Stream Structure Techniques Time Validation Variant base big biomedical data computerized data processing computerized tools design experimental study falls genome wide association study genomic data health care delivery high dimensionality improved insertion/deletion mutation insight interest learning strategy novel novel strategies population stratification prototype public health relevance repository tool user-friendly whole genome

Project Abstract

DESCRIPTION (provided by applicant): Recently we developed several spectral approaches for analyzing very large genomics datasets or complete databases that fall into the category of big data (BD). The first approach performs SVD or PCA using randomization, which can dramatically accelerate the computation of their eigenvectors and eigenvalues relative to the standard Lanczos algorithm implemented in all common software packages. Computing PCA and the SVD more efficiently could revolutionize the innumerable biomedical applications based on them, e.g. population stratification in very large GWAS. These algorithms produce higher accuracy than classical (deterministic) methods, enable the processing of data streams that are too large to store, and parallelize easily on multicore microprocessors. Our second novel approach is an unsupervised spectral learning method. It provides new mathematical insights of striking conceptual simplicity for ranking multiple competing algorithms without access to validation data, and for optimally combining this ensemble of algorithms to obtain improved predictions in the absence of ground truth. A tool that lets end users optimally rank or combine algorithms for the analysis of genomics data is a practical and efficient way to remove the confusion among end users and bioinformaticians who must decide which tool to choose for their study, since many of the biological results inferred by different tools are in disagreement. Choosing the best-performing algorithm or pipeline is essential, as it can often lead to a substantial improvement in the quality of the readout from massively parallel sequencing experiments. Moreover, combining these tools typically results in performance superior to that of the best-performing algorithm.
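The abstract does not spell out the randomized algorithm, so the following is only a minimal sketch of one common randomized SVD scheme (a Gaussian range finder with a few power iterations), not the applicants' implementation. The function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are illustrative assumptions, and this version keeps the whole matrix in memory rather than streaming it.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top `rank` singular triplets of A via random projection.

    Illustrative sketch only: names and defaults are assumptions, not taken
    from the project described above.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)

    # Sketch the column space of A with a random Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k)))

    # A few power iterations sharpen the sketch when singular values decay slowly.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Project A onto the k-dimensional subspace and take an exact SVD there.
    B = Q.T @ A                                   # shape (k, n), small
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                               # lift back to m dimensions

    return U[:, :rank], s[:rank], Vt[:rank]
```

For PCA under the same assumptions, one would center the columns first, for example `randomized_svd(X - X.mean(axis=0), rank=k)`; the rows of the returned `Vt` then approximate the principal axes. The expensive operations are products with `A` and `A.T`, which is the part that parallelizes readily.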
Our goal is to establish a team whose focus is to provide and disseminate full-blown implementations of spectral BD tools and methods that have broad applicability across the spectrum of biomedical sciences, clinical research, and healthcare delivery. Specifically, we will develop scalable PCA and SVD for genomics and biomedical applications, and further advance our spectral method for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data. Moreover, we will develop scalable dimensionality reduction techniques for organizing BD from biomedical applications.
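The idea of ranking competing pipelines without validation data can be illustrated with a small sketch. It rests on assumptions that the abstract does not state: labels are binary and coded as -1/+1, the classifiers make conditionally independent errors, and most of them are better than chance. Under those assumptions the off-diagonal part of the covariance matrix between classifier outputs is approximately rank one, and the entries of its leading eigenvector track each classifier's balanced accuracy. The function name `spectral_rank` and the simple diagonal-refitting loop are hypothetical choices for illustration, not the applicants' method.

```python
import numpy as np

def spectral_rank(predictions, n_iter=100):
    """Rank binary classifiers from their predictions alone (no labels).

    predictions: array of shape (n_classifiers, n_samples), entries in {-1, +1}.
    Illustrative sketch: assumes conditionally independent classifiers.
    """
    P = np.asarray(predictions, dtype=float)
    R = np.cov(P)                       # covariance between classifiers

    # Only the off-diagonal entries follow the rank-one model, so the
    # diagonal is repeatedly replaced by its rank-one estimate (a heuristic).
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(R)
        lam, v = eigvals[-1], eigvecs[:, -1]
        np.fill_diagonal(R, lam * v**2)

    # Eigenvectors are defined up to sign; assume most classifiers beat chance.
    if v.sum() < 0:
        v = -v

    ranking = np.argsort(-v)            # best classifier first
    # Weighted vote using the eigenvector entries as weights.
    ensemble = np.where(v @ P >= 0, 1, -1)
    return v, ranking, ensemble
```

Here `ranking` orders the classifiers from the estimated best to the estimated worst, and `ensemble` is a weighted vote that gives more weight to classifiers the eigenvector scores highly. Strongly correlated classifiers violate the independence assumption and would need a different treatment.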
