Genome Assemblies, Analyses, and Comparisons
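Basic information
- Grant number: 10927044
- Principal investigator:
- Amount: $276,700
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year:
- Funding country: United States
- Project period:
- Project status: Ongoing
- Source: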
- Keywords: Acceleration; Adopted; Binding; Biology; Biomedical Research; ChIP-seq; Chordata; Classification; Code; Collaborations; Collection; Communities; Complex; Computational Biology; Computing Methodologies; DNA; DNA sequencing; Data; Data Analyses; Data Files; Data Set; Databases; Decontamination; Detection; Development; Directories; Disease; Education; Elements; Enzymes; Eukaryota; Excision; FAIR principles; Gene Expression Profiling; Genes; Genome; Genomics; Genotype; Hour; Intention; Knowledge; Label; Language; Linux; Methods; Mexico; Names; Occupations; Opuntia; Organism; Pathway interactions; Phenotype; Phylogenetic Analysis; Play; Process; Provider; Publishing; Pythons; RNA; RNA analysis; Recipe; Reproducibility; Research; Research Personnel; Role; Running; Sampling; Scientist; Sequence Homologs; Structure; Taxonomy; Technology; Tissue-Specific Gene Expression; Training; Transcript; base; bioinformatics tool; cloud platform; computational pipelines; cost; data analysis pipeline; database structure; design; experimental study; genome database; improved; interactive tool; interest; laboratory experiment; laptop; next generation sequence data; organizational structure; portability; programs; public database; reference genome; screening; tool; transcriptome; transcriptome sequencing
Project abstract
Development of PM4NGS, a project management framework for NGS data analysis:
NGS data analysis has advanced the design, implementation, and execution of many complex computational biology pipelines. For computational biologists, pipelines are multi-step methods that should follow the FAIR (Findability, Accessibility, Interoperability, and Reusability) data principles and guarantee reproducibility, portability and scalability.
Workflow languages and managers, Docker containers, and scientific computational notebooks have been adopted by the scientific community to improve the reproducibility, portability, maintainability, and shareability of computational pipelines.
Following these principles, our group has developed PM4NGS [6], a project management framework for NGS data analysis. The framework comprises the automatic creation of a standard organizational structure of directories and files; bioinformatics tool management, using Docker/Biocontainers or Conda/Bioconda; data analysis pipelines in the Common Workflow Language (CWL) format; and pre-configured Jupyter notebooks with minimal Python code. It was designed as a fully interactive tool for data analysis on personal laptops or workstations. It can also be used as an educational tool to train new bioinformaticians in how to organize an NGS data analysis project, because it exposes a detailed view of the pipeline components.
PM4NGS currently includes four NGS data analysis workflows as templates: differential gene expression and GO enrichment analysis from RNA-Seq data; differential binding analysis from ChIP-Seq data; DNA motif binding detection from ChIP-exo data; and transcriptome assembly, including annotation and submission for unannotated organisms. These templates can be reused or modified to create new computational biology workflows. This framework aims to reduce the gap between researchers in experimental laboratories, producing NGS data, and the workflows for the data analysis. The complexity of working with multiple directories, data files, and programs on the Linux command line interface is managed completely by PM4NGS, allowing researchers to focus on result interpretation.
De novo transcriptome assembly, annotation, and submission for unannotated organisms:
We have developed a transcriptome assembly, annotation, and submission workflow for unannotated organisms. This workflow was implemented as a PM4NGS-based template and designed to run on the Google Cloud Platform (GCP). Users can run the PM4NGS Jupyter notebooks on their personal laptops or workstations and submit the more compute-intensive jobs to the GCP. As part of the development, a suitability study was published to demonstrate the benefits of using a public cloud provider for computational biology experiments [7]. We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with 500,000 transcripts can be processed in less than 2 hours, with a computing cost of about $200 to $250.
This workflow was used to assemble, annotate, and submit the Opuntia streptacantha transcriptome from the BioProject PRJNA320545 (a collaboration with scientists at Universidad Autónoma de San Luis Potosí, Mexico). The workflow uses Trinity to assemble RNA-Seq raw reads into transcripts. The transcripts are clustered to create Trinity genes. Homologous sequences are identified to annotate the transcripts with GO terms, enzyme names, and conserved domains. The functional annotation for the Opuntia streptacantha transcriptome is published, with additional information about the assembly and the differential gene expression analysis of two experimental conditions, at https://www.ncbi.nlm.nih.gov/research/nopaldb/.
Detection and removal of foreign contamination on RNA-Seq samples:
Our transcriptome assembly and annotation pipeline includes a workflow to detect and remove foreign RNA contamination from the input samples. RNA-Seq contamination has played a large role in misleading multiple research conclusions. It is most troublesome if the target organism does not have a reference genome or annotation in public databases. We have developed GTax, a taxonomically structured database of genomic sequences that can be used with BLAST for taxonomic classification and contamination filtering. This approach efficiently detects and eliminates contaminant reads in RNA-Seq data.
GTax genomic sequences were extracted from the NCBI Genome database using the NCBI Datasets tool. The database includes a subset of the latest assemblies of a collection of reference genomes. Sequences were filtered by RefSeq accession prefix to reduce the database size and to exclude possibly contaminated sequences. The sequences were organized into 19 mutually exclusive and hierarchical taxonomic groups. For example, taxonomies in the Viridiplantae kingdom are divided into three GTax groups: the Liliopsida group, which includes all monocotyledon sequences; the Eudicotyledons group, which includes all dicotyledon sequences; and the Viridiplantae group, into which all of the other taxa in the Viridiplantae kingdom are placed. The same principle is applied to the Chordata phylum and all taxonomy groups from Neoteleostei to Sarcopterygii. Finally, all remaining Eukaryote taxa are placed in the Eukaryota taxonomy group. This taxonomically structured division of the genomic sequences in GTax keeps phylogenetically closely related species in the same taxonomy group and greatly reduces the size of the searchable BLAST database. The Sauropsida group, the largest, contains 1,073 sequences and 46,172,754,879 total bases, which is only 6.84% of the size of the NT database. The current version of GTax as a whole amounts to 72.18% of the size of the NT database.
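The "mutually exclusive and hierarchical" grouping can be illustrated with a minimal Python sketch. This is not the GTax implementation: the group list is abbreviated to the four groups named in the text, the ordering from most to least specific is an assumption about how exclusivity could be enforced, and the lineages are simplified, illustrative inputs.

```python
# Hypothetical, abbreviated group list, ordered from most to least specific.
# GTax defines 19 groups; only the ones named in the text are listed here.
GROUPS_BY_SPECIFICITY = ["Liliopsida", "Eudicotyledons", "Viridiplantae", "Eukaryota"]


def assign_group(lineage):
    """Return the most specific group that appears in a taxon's lineage.

    Checking the groups from most to least specific guarantees that each
    taxon lands in exactly one group (mutual exclusivity), while broader
    groups act as catch-alls for everything not claimed by a narrower one.
    """
    lineage_set = set(lineage)
    for group in GROUPS_BY_SPECIFICITY:
        if group in lineage_set:
            return group
    return None  # not covered by this abbreviated list


# Simplified, illustrative lineages (not full NCBI Taxonomy lineages).
rice = ["Eukaryota", "Viridiplantae", "Liliopsida", "Oryza sativa"]
nopal = ["Eukaryota", "Viridiplantae", "Eudicotyledons", "Opuntia streptacantha"]
moss = ["Eukaryota", "Viridiplantae", "Bryophyta", "Physcomitrium patens"]
yeast = ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

for name, lineage in [("rice", rice), ("nopal", nopal), ("moss", moss), ("yeast", yeast)]:
    print(name, "->", assign_group(lineage))
```

Under these assumptions a monocot falls into Liliopsida, a dicot into Eudicotyledons, any other green plant into the Viridiplantae catch-all, and every remaining eukaryote into Eukaryota.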
Our decontamination approach is initiated with a screening of the RNA-Seq reads (using BLAST) against the taxonomy group of the target organism. In these cases, we can screen millions of RNA-Seq reads against less than 6% of the NT database. Then, unidentified reads are screened against the remainder of the GTax taxonomy groups. Reads labeled as correct are those that match the taxonomy group of the target organism. Those that remain unidentified are labeled as such.
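The two-stage labeling described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name, the input shapes (pre-computed BLAST results passed in as a set and a dict), and the "foreign:" label for reads that match another taxonomy group are all assumptions, since the text names only the "correct" and "unidentified" outcomes.

```python
def label_reads(read_ids, target_hits, other_group_hits):
    """Label reads after a two-stage screen.

    read_ids          -- iterable of all read identifiers
    target_hits       -- set of reads that matched the target organism's
                         taxonomy group in the first screen
    other_group_hits  -- dict mapping a read to the other taxonomy group it
                         matched in the second screen (only reads missed by
                         the first screen are expected here)
    """
    labels = {}
    for read_id in read_ids:
        if read_id in target_hits:
            labels[read_id] = "correct"
        elif read_id in other_group_hits:
            # Assumed label: the text does not name this category.
            labels[read_id] = "foreign:" + other_group_hits[read_id]
        else:
            labels[read_id] = "unidentified"
    return labels


# Toy example with hypothetical read names and hits.
reads = ["read1", "read2", "read3"]
stage1 = {"read1"}                 # matched the target group
stage2 = {"read2": "Sauropsida"}   # matched a different group
labels = label_reads(reads, stage1, stage2)
print(labels)
```

Screening against the target group first means that most reads from a clean sample are resolved in the small first database, and only the remainder is compared against the other groups.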
Project outcomes
Journal articles (40)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Whole-Genome Sequences of Two Campylobacter coli Isolates from the Antimicrobial Resistance Monitoring Program in Colombia.
- DOI: 10.1128/genomea.00131-16
- Published: 2016-03-17
- Journal:
- Impact factor: 0
- Authors: Bernal JF; Donado-Godoy P; Valencia MF; León M; Gómez Y; Rodríguez F; Agarwala R; Landsman D; Mariño-Ramírez L
- Corresponding author: Mariño-Ramírez L
Effect of the transposable element environment of human genes on gene length and expression.
- DOI: 10.1093/gbe/evr015
- Published: 2011
- Journal:
- Impact factor: 3.3
- Authors: Jjingo D; Huda A; Gundapuneni M; Mariño-Ramírez L; Jordan IK
- Corresponding author: Jordan IK
Workflow and web application for annotating NCBI BioProject transcriptome data.
- DOI: 10.1093/database/bax008
- Published: 2017-01-01
- Journal:
- Impact factor: 0
- Authors: Vera Alvarez R; Medeiros Vidal N; Garzón-Martínez GA; Barrero LS; Landsman D; Mariño-Ramírez L
- Corresponding author: Mariño-Ramírez L
Whole-Genome Sequence of Multidrug-Resistant Campylobacter coli Strain COL B1-266, Isolated from the Colombian Poultry Chain.
- DOI: 10.1128/genomea.00130-16
- Published: 2016-03-17
- Journal:
- Impact factor: 0
- Authors: Bernal JF; Donado-Godoy P; Arévalo A; Duarte C; Realpe ME; Díaz PL; Gómez Y; Rodríguez F; Agarwala R; Landsman D; Mariño-Ramírez L
- Corresponding author: Mariño-Ramírez L
Prediction of transposable element derived enhancers using chromatin modification profiles.
- DOI: 10.1371/journal.pone.0027513
- Published: 2011
- Journal:
- Impact factor: 3.7
- Authors: Huda A; Tyagi E; Mariño-Ramírez L; Bowen NJ; Jjingo D; Jordan IK
- Corresponding author: Jordan IK
Other grants by David LANDSMAN
Analysis Of Gene Regulatory Sequences From Whole Chromosomes And Genomes
- Grant number: 7735074
- Fiscal year:
- Funding amount: $276,700
- Project category:
Structural And Functional Analysis Of Protein Sequence Families
- Grant number: 7735069
- Fiscal year:
- Funding amount: $276,700
- Project category:
Structural-Functional Analysis-Protein Sequence Families
- Grant number: 7148031
- Fiscal year:
- Funding amount: $276,700
- Project category:
Structural and Functional Analysis of Gene and Protein Sequence Families
- Grant number: 10018390
- Fiscal year:
- Funding amount: $276,700
- Project category:
Structural and Functional Analysis of Gene and Protein Sequence Families
- Grant number: 9353157
- Fiscal year:
- Funding amount: $276,700
- Project category:
Gene Regulatory Sequences From Whole Chromosome /Genome
- Grant number: 6843578
- Fiscal year:
- Funding amount: $276,700
- Project category:
Structural And Functional Analysis Of Protein Sequence F
- Grant number: 6681342
- Fiscal year:
- Funding amount: $276,700
- Project category:
Gene Regulatory Sequences And Protein Binding in Genome Sequences
- Grant number: 8943221
- Fiscal year:
- Funding amount: $276,700
- Project category:
Gene Regulatory Sequences And Protein Binding in Genome Sequences
- Grant number: 10688917
- Fiscal year:
- Funding amount: $276,700
- Project category:
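Similar NSFC (National Natural Science Foundation of China) grants
Mechanism by which strontium and silver ion sustained-release titanium surfaces promote osseointegration by regulating NLRP3 inflammasome activation through mitophagy
- Grant number: 82301139
- Approval year: 2023
- Funding amount: 300,000 CNY
- Project category: Young Scientists Fund
Mechanism by which marigold flavonoids alleviate intestinal oxidative stress injury in broilers via the MAPK/Nrf2-ARE pathway
- Grant number: 32302787
- Approval year: 2023
- Funding amount: 300,000 CNY
- Project category: Young Scientists Fund
Mechanism by which the gut microbiota and its metabolites regulate pork quality through mRNA m6A modification
- Grant number: 32330098
- Approval year: 2023
- Funding amount: 2,200,000 CNY
- Project category: Key Program
Mechanism by which PUFAs improve the low-salinity adaptation of Litopenaeus vannamei through SREBPs
- Grant number: 32303021
- Approval year: 2023
- Funding amount: 300,000 CNY
- Project category: Young Scientists Fund
Mechanism by which the hydroxylase EGLN3 promotes EMT and metastasis of lung cancer cells by regulating macrophage reprogramming
- Grant number: 82373030
- Approval year: 2023
- Funding amount: 490,000 CNY
- Project category: General Program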
Similar overseas grants
BRAIN CONNECTS: PatchLink, scalable tools for integrating connectomes, projectomes, and transcriptomes
- Grant number: 10665493
- Fiscal year: 2023
- Funding amount: $276,700
- Project category:
Tele-Sox: A Tele-Medicine solution based on wearables and gamification to prevent Venous thromboembolism in Oncology Geriatric Patients
- Grant number: 10547300
- Fiscal year: 2023
- Funding amount: $276,700
- Project category:
A Flexible High-Throughput Immunological Assay to Support Next-Generation Influenza Vaccine Studies
- Grant number: 10655239
- Fiscal year: 2023
- Funding amount: $276,700
- Project category:
Optimizing the Generation of Monoclonal Antibodies for Prevention and Treatment of HSV Disease
- Grant number: 10717320
- Fiscal year: 2023
- Funding amount: $276,700
- Project category:
Enhancing Drug Discovery Research by Free Energy Modeling
- Grant number: 10730788
- Fiscal year: 2023
- Funding amount: $276,700
- Project category: