Genome Assemblies, Analyses, and Comparisons

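Basic Information

Grant number:
10927044
Principal investigator:
David LANDSMAN
Amount:
$276,700
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Ongoing

Source:
https://reporter.nih.gov/project-details/10927044
Keywords: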
Acceleration Adopted Binding Biology Biomedical Research ChIP-seq Chordata Classification Code Collaborations Collection Communities Complex Computational Biology Computing Methodologies DNA DNA sequencing Data Data Analyses Data Files Data Set Databases Decontamination Detection Development Directories Disease Education Elements Enzymes Eukaryota Excision FAIR principles Gene Expression Profiling Genes Genome Genomics Genotype Hour Intention Knowledge Label Language Linux Methods Mexico Names Occupations Opuntia Organism Pathway interactions Phenotype Phylogenetic Analysis Play Process Provider Publishing Pythons RNA RNA analysis Recipe Reproducibility Research Research Personnel Role Running Sampling Scientist Sequence Homologs Structure Taxonomy Technology Tissue-Specific Gene Expression Training Transcript base bioinformatics tool cloud platform computational pipelines cost data analysis pipeline database structure design experimental study genome database improved interactive tool interest laboratory experiment laptop next generation sequence data organizational structure portability programs public database reference genome screening tool transcriptome transcriptome sequencing

Project Abstract

Development of PM4NGS, a project management framework for NGS data analysis: NGS data analysis has advanced the design, implementation, and execution of many complex computational biology pipelines. For computational biologists, pipelines are multi-step methods that should follow the FAIR (Findability, Accessibility, Interoperability, and Reusability) data principles and guarantee reproducibility, portability, and scalability. Workflow languages and managers, Docker containers, and scientific computational notebooks have been adopted by the scientific community with the intention of improving the reproducibility, portability, maintainability, and shareability of computational pipelines. Following these principles, our group has developed PM4NGS [6], a project management framework for NGS data analysis. This framework comprises the automatic creation of a standard organizational structure of directories and files; bioinformatics tool management, using Docker/Biocontainers or Conda/Bioconda; data analysis pipelines in the Common Workflow Language (CWL) format; and pre-configured Jupyter notebooks with minimal Python code. The framework was designed as a fully interactive tool for data analysis on personal laptops or workstations. It can also be used as an educational tool to train new bioinformaticians to organize an NGS data analysis project, because it shows a detailed view of the pipeline components. PM4NGS currently includes four NGS data analysis workflows as templates: differential gene expression and GO enrichment analysis from RNA-Seq data; differential binding analysis from ChIP-Seq data; DNA motif binding detection from ChIP-exo data; and transcriptome assembly, including annotation and submission for unannotated organisms. These templates can be reused or modified to create new computational biology workflows. This framework aims to reduce the gap between researchers in experimental laboratories who produce NGS data and the workflows used to analyze that data.
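The idea of automatically creating a standard directory structure can be illustrated with a minimal Python sketch. The directory names and the `create_project` helper below are hypothetical and chosen only for illustration; they are not PM4NGS's actual template or API.

```python
from pathlib import Path

# Hypothetical layout for illustration only; not the actual PM4NGS template.
PROJECT_DIRS = ("config", "data", "notebooks", "results", "workflows")


def create_project(root, name):
    """Create the same directory skeleton for every new analysis project."""
    project = Path(root) / name
    for sub in PROJECT_DIRS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (project / sub).mkdir(parents=True, exist_ok=True)
    # Record the layout in a README so the project documents itself.
    lines = ["Project: " + name] + ["- " + sub for sub in PROJECT_DIRS]
    (project / "README.txt").write_text("\n".join(lines) + "\n")
    return project
```

Calling `create_project("/tmp", "rnaseq_demo")` would produce the same skeleton on any machine, which is the property a standard project structure is meant to provide.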
The complexity of working with multiple directories, data files, and programs on the Linux command line interface is managed completely by PM4NGS, allowing researchers to focus on result interpretation.

De novo transcriptome assembly, annotation, and submission for unannotated organisms: We have developed a transcriptome assembly, annotation, and submission workflow for unannotated organisms. This workflow was implemented as a PM4NGS-based template and designed to run on the Google Cloud Platform (GCP). Users can run the PM4NGS Jupyter notebooks on their personal laptops or workstations and submit the more compute-intensive jobs to the GCP. As part of the development, a suitability study was published to demonstrate the benefits of using a public cloud provider for computational biology experiments [7]. We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with 500,000 transcripts can be processed in less than 2 hours, with a computing cost of about $200 to $250. This workflow was used to assemble, annotate, and submit the Opuntia streptacantha transcriptome from the BioProject PRJNA320545 (a collaboration with scientists at Universidad Autónoma de San Luis Potosí, Mexico). The workflow uses Trinity to assemble RNA-Seq raw reads into transcripts. The transcripts are clustered to create Trinity genes. Homologous sequences are identified to annotate the transcripts with GO terms, enzyme names, and conserved domains. The functional annotation for the Opuntia streptacantha transcriptome is published, with additional information about the assembly and differential gene expression analysis of two experimental conditions, at https://www.ncbi.nlm.nih.gov/research/nopaldb/.
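The cloud figures quoted above can be turned into per-unit numbers with simple arithmetic. The sketch below assumes the quoted cost is a range of $200 to $250 and uses 2 hours as the upper bound on run time; the function names are illustrative.

```python
# Figures taken from the abstract; the cost is read as a $200 to $250 range.
TRANSCRIPTS = 500_000
COST_LOW_USD = 200.0
COST_HIGH_USD = 250.0
MAX_HOURS = 2.0


def cost_per_thousand(total_cost_usd, n_transcripts):
    """Cost in USD to annotate 1,000 transcripts."""
    return total_cost_usd / (n_transcripts / 1000)


def min_throughput_per_hour(n_transcripts, max_hours):
    """Lower bound on transcripts processed per hour."""
    return n_transcripts / max_hours


low = cost_per_thousand(COST_LOW_USD, TRANSCRIPTS)
high = cost_per_thousand(COST_HIGH_USD, TRANSCRIPTS)
rate = min_throughput_per_hour(TRANSCRIPTS, MAX_HOURS)
print(f"${low:.2f} to ${high:.2f} per 1,000 transcripts")  # $0.40 to $0.50
print(f"at least {rate:,.0f} transcripts per hour")        # at least 250,000
```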
Detection and removal of foreign contamination in RNA-Seq samples: Our transcriptome assembly and annotation pipeline includes a workflow to detect and remove foreign RNA contamination from the input samples. RNA-Seq contamination has played a large role in producing misleading conclusions in multiple studies. It is most troublesome if the target organism does not have a reference genome or annotation in public databases. We have developed GTax, a taxonomically structured database of genomic sequences that can be used with BLAST for taxonomic classification and contamination filtering. This approach efficiently detects and eliminates contaminant reads in RNA-Seq data. GTax genomic sequences were extracted from the NCBI Genome database using NCBI Datasets. The database includes a subset of the latest assemblies of a collection of reference genomes. Sequences were filtered by RefSeq accession prefixes to reduce the database size and to exclude possibly contaminated sequences. The sequences were organized into 19 mutually exclusive, hierarchical taxonomic groups. For example, taxa in the Viridiplantae kingdom are divided into three GTax groups: the Liliopsida group, which includes all monocotyledon sequences; the Eudicotyledons group, which includes all dicotyledon sequences; and the Viridiplantae group, into which all of the other taxa in the Viridiplantae kingdom are placed. The same principle is applied to the Chordata phylum and all taxonomy groups from Neoteleostei to Sarcopterygii. Finally, all remaining eukaryote taxa are placed in the Eukaryota taxonomy group. This taxonomically structured division of the genomic sequences in GTax keeps phylogenetically closely related species in the same taxonomy group and greatly reduces the size of the searchable BLAST database. The Sauropsida group, which is the biggest group and contains 1,073 sequences and 46,172,754,879 total bases, is only 6.84% of the NT database. The sequences in the current version of GTax represent 72.18% of the NT database.
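The mutually exclusive, hierarchical grouping described above can be sketched as a "most specific group wins" lookup. This is an illustration only: it lists just the Viridiplantae and Eukaryota groups named in the text, the lineages are abbreviated by hand, and `assign_group` is a hypothetical helper rather than part of GTax.

```python
# Groups ordered from most specific to most general; the first match wins,
# which makes the groups mutually exclusive. Only groups named in the text
# are listed here, not the full set of 19.
GROUP_PRIORITY = ("Liliopsida", "Eudicotyledons", "Viridiplantae", "Eukaryota")


def assign_group(lineage):
    """Return the most specific group found in a root-to-species lineage."""
    names = set(lineage)
    for group in GROUP_PRIORITY:
        if group in names:
            return group
    return None  # lineage falls outside the groups listed in this sketch


# Abbreviated, hand-written lineages used only to exercise the function.
rice = ["Eukaryota", "Viridiplantae", "Liliopsida", "Oryza sativa"]
nopal = ["Eukaryota", "Viridiplantae", "Eudicotyledons", "Opuntia streptacantha"]
moss = ["Eukaryota", "Viridiplantae", "Bryopsida", "Physcomitrium patens"]
yeast = ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

for lineage in (rice, nopal, moss, yeast):
    print(lineage[-1], "->", assign_group(lineage))
```

Because the general groups are checked last, a monocot never lands in the catch-all Viridiplantae group, and a non-plant eukaryote falls through to Eukaryota.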
Our decontamination approach begins by screening the RNA-Seq reads (using BLAST) against the taxonomy group of the target organism. In these cases, we can screen millions of RNA-Seq reads against less than 6% of the NT database. Then, reads that remain unidentified are screened against the remaining GTax taxonomy groups. Reads that match the taxonomy group of the target organism are labeled as correct. Reads that match no group are labeled as unidentified.
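The two-stage screening logic can be sketched as follows. The `has_hit` callback stands in for a BLAST search and is supplied by the caller; the toy marker database, the group names in it, and the "foreign" label for reads that match a non-target group are assumptions made for illustration, not part of the described pipeline.

```python
def screen_reads(reads, target_group, all_groups, has_hit):
    """Label reads in two stages: target group first, then all other groups.

    has_hit(sequence, group) -> bool is a stand-in for a BLAST search.
    """
    labels = {}
    # Stage 1: screen every read against the target organism's group only.
    unidentified = []
    for read_id, seq in reads.items():
        if has_hit(seq, target_group):
            labels[read_id] = "correct"
        else:
            unidentified.append(read_id)
    # Stage 2: screen only the leftover reads against the remaining groups.
    other_groups = [g for g in all_groups if g != target_group]
    for read_id in unidentified:
        match = next((g for g in other_groups if has_hit(reads[read_id], g)), None)
        labels[read_id] = f"foreign:{match}" if match else "unidentified"
    return labels


# Toy stand-in for a sequence search: a read "hits" a group if it contains
# one of that group's marker substrings. Purely illustrative.
TOY_DB = {"Eudicotyledons": ("ACGT",), "Bacteria": ("TTTT",)}


def toy_has_hit(seq, group):
    return any(marker in seq for marker in TOY_DB.get(group, ()))


reads = {"r1": "GGACGTGG", "r2": "CCTTTTCC", "r3": "GAGAGAGA"}
result = screen_reads(reads, "Eudicotyledons", list(TOY_DB), toy_has_hit)
print(result)
```

Running the first stage against a single group keeps the bulk of the search small; only reads that fail it reach the larger second-stage search.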

Project Outputs

Journal articles (40)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Whole-Genome Sequences of Two Campylobacter coli Isolates from the Antimicrobial Resistance Monitoring Program in Colombia.

DOI:
10.1128/genomea.00131-16
Publication date:
2016-03-17
Journal:
Genome announcements
Impact factor:
0
Authors:
Bernal JF;Donado-Godoy P;Valencia MF;León M;Gómez Y;Rodríguez F;Agarwala R;Landsman D;Mariño-Ramírez L
Corresponding author:
Mariño-Ramírez L

Effect of the transposable element environment of human genes on gene length and expression.

DOI:
10.1093/gbe/evr015
Publication date:
2011
Journal:
Genome biology and evolution
Impact factor:
3.3
Authors:
Jjingo D;Huda A;Gundapuneni M;Mariño-Ramírez L;Jordan IK
Corresponding author:
Jordan IK

Workflow and web application for annotating NCBI BioProject transcriptome data.

DOI:
10.1093/database/bax008
Publication date:
2017-01-01
Journal:
Database: the journal of biological databases and curation
Impact factor:
0
Authors:
Vera Alvarez R;Medeiros Vidal N;Garzón-Martínez GA;Barrero LS;Landsman D;Mariño-Ramírez L
Corresponding author:
Mariño-Ramírez L

Whole-Genome Sequence of Multidrug-Resistant Campylobacter coli Strain COL B1-266, Isolated from the Colombian Poultry Chain.

DOI:
10.1128/genomea.00130-16
Publication date:
2016-03-17
Journal:
Genome announcements
Impact factor:
0
Authors:
Bernal JF;Donado-Godoy P;Arévalo A;Duarte C;Realpe ME;Díaz PL;Gómez Y;Rodríguez F;Agarwala R;Landsman D;Mariño-Ramírez L
Corresponding author:
Mariño-Ramírez L

Prediction of transposable element derived enhancers using chromatin modification profiles.