Biology-aware machine learning methods for characterizing microbiome genotype and phenotype

Basic Information

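Grant number:
10275055
Principal investigator:
Siavash Mir arabbaygi
Amount:
$344,700
Host institution:
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-09-15 to 2026-08-31
Project status:
Not yet concluded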

Project Abstract

PROJECT SUMMARY

The Mirarab laboratory designs leading computational methods for answering biological and biomedical questions, focusing on scalability and accuracy. These methods span several areas (e.g., microbiome profiling, multiple sequence alignment, and phylogenomics), and a common thread among them is evolutionary modeling. The lab has developed scalable and accurate methods for reconstructing evolutionary histories (i.e., phylogenies) and using these histories in downstream biomedical applications. Reconstructing phylogenies is a fundamental goal and a precursor to many biological analyses. Methods developed by this lab (e.g., ASTRAL) are at the forefront of modern genome-wide phylogenetics. Moreover, biomedical research increasingly uses evolutionary histories in diverse areas like microbiome analyses, immunology, epidemiology, and comparative genomics. While the lab has previously focused more on inferring species histories, it has recently started to shift its focus to developing methods for microbiome analyses. The inference and the use of evolutionary histories in analyzing environmental microbiome samples present a unique set of challenges.

In the next five years, the Mirarab lab will focus on designing, testing, and applying improved methods for statistical analyses of microbiome data. These methods will target two questions. (i) Profiling: What organisms constitute a given sample? (ii) Association: How are samples different in their organismal composition, and how do these differences connect to measurable characteristics of their environment? While both questions have been subject to considerable research, many computational challenges remain, providing an opportunity for better methods to make a significant impact. Instead of focusing solely on new algorithms, the lab will also work on building better reference datasets and combining data from multiple sources. Thus, the project aims to harness the unprecedented computational power, large available datasets, and recent advances in machine learning to dramatically improve the state of the art. The project will not use off-the-shelf machine learning methods in a black-box fashion. Instead, it develops methods that incorporate biological knowledge (e.g., of the evolutionary relationships) into machine learning methods in a principled, biologically motivated fashion.

The lab will pursue several ambitious goals for both profiling and association questions. The project will (i) create methods to infer a continuously updated reference alignment and tree encompassing all sequenced prokaryotic genomes (half a million currently) to be used for profiling, (ii) build methods for ultra-sensitive sample profiling, (iii) use deep learning to connect data obtained using amplicon sequencing and metagenomics, (iv) build discordance-aware phylogenetic measures of sample differentiation, and (v) develop machine learning methods for associating a profiled microbiome to phenotypes of interest such as disease. These new methods will draw on statistics, machine learning, discrete optimization, and high-performance computing. Consistent with the goals of MIRA, the project may explore new unforeseen opportunities if they fit its general goals.
