A Comprehensive Genomic Community Resource of Transcriptional Regulation


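Basic Information

Grant number:
10625529
Principal investigator:
Anshul Kundaje
Amount:
About $809,400
Host institution:
UNIV OF MASSACHUSETTS MED SCH WORCESTER
Host institution country:
United States
Project category:
Fiscal year:
2022
Funding country:
United States
Project period:
2022-06-01 to 2027-03-31
Project status:
Ongoing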

Source:
https://reporter.nih.gov/project-details/10625529
Keywords:
ATAC-seq Algorithms Atlases Automobile Driving Base Pairing Benchmarking Binding CRISPR/Cas technology Catalogs Cells ChIP-seq Chromatin Code Collaborations Collection Communities Community Outreach Computer Models DNA DNA Sequence Data Data Analyses Data Set Development Disease Education and Outreach Educational workshop Elements Epigenetic Process Exons Functional disorder Future Genes Genomics Histones Human Human BioMolecular Atlas Program Human Genome Human Genome Project Human body Individual International Interruption Maps Mediating Methods Modeling Nematoda Online Systems Organism Pattern Physiology Process Quality Control Registries Regulatory Element Research Research Personnel Resolution Resources Role Scheme Signal Transduction Specific qualifier value Techniques Technology Testing Time Tissues Training Trans-Omics for Precision Medicine Transcriptional Regulation Untranslated RNA Variant Visualization Work base cell type community building community setting data analysis pipeline data repository deep learning deep learning model deep sequencing design epigenome epigenomics experimental study follow-up genome wide association study in silico in vivo Model machine learning model model development novel online resource outreach predictive modeling public repository repository sequence learning syntax tool trait transcription factor

Project Abstract

The Human Genome Project (HGP) completed the first draft human genome sequence two decades ago. The HGP revealed that human complexity arises from only approximately 20,000 coding genes, roughly the same number as much simpler organisms such as nematodes. Intricate patterns of transcriptional regulation mediated by non-coding regulatory elements specify the myriad cell types and states required for human complexity. Genome-wide association studies have subsequently identified thousands of disease-associated variants, many of which interrupt the function of these non-coding elements to disrupt transcriptional regulation. Thus, in order to better understand human physiology and pathophysiology, comprehensive atlases of regulatory elements are essential. Many previous efforts, including the International Human Epigenome Consortium (IHEC), the FANTOM Consortium, the Roadmap Epigenomics Project, and the ENCODE Project, have aimed to build comprehensive collections of regulatory elements, as well as computational models to better predict regulatory activity and understand the sequence features underlying regulatory function. ENCODE (2003-2022) is a large-scale consortium effort which aims to annotate every functional non-coding element of the human genome; during our work on the project, we built a Registry of approximately 1 million human candidate cis-regulatory elements (cCREs). We further developed deep-learning approaches which model the transcription factor motif syntax that underlies element function at base-pair resolution and built two web-based resources, SCREEN and Factorbook, to make our results accessible to the scientific community.
Here, we propose to extend this framework to build the Community Resource for Transcriptional Regulation (CRTR), a comprehensive atlas of non-coding regulatory elements and machine-learning models which will encompass community and consortium deep-sequencing data, both bulk and single cell, across a broad array of cell types and states. Our project has five aims. First, we aim to curate community and consortium data for inclusion in CRTR and perform uniform processing and quality control. Second, we aim to train deep-learning sequence models on bulk epigenetic datasets to identify transcription factor motif syntax driving regulatory element activity in distinct tissues and cell types. Third, we aim to train sequence models on single cell datasets to identify transcription factor motif syntax driving transcriptional regulation in high-resolution cell states and during cell state transitions. Fourth, we aim to use the aforementioned results to build comprehensive benchmark datasets and machine-learning model collections, which will aid future analysts in designing new models to predict regulatory readouts. Fifth, we aim to build a state-of-the-art web-based user interface to enable users to perform integrative analyses and in silico experimentation with CRTR, and hold workshops and other outreach to maximize the impact of the resource and its accessibility to the broader scientific community.
