Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles
Basic information
- Grant number: 10701081
- Principal investigator:
- Amount: $583,700
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2021
- Funding country: United States
- Project period: 2021-09-17 to 2024-08-31
- Project status: Completed
- Source:
- Keywords: 2019-nCoV; Address; Age; Agreement; Algorithms; Base Sequence; COVID-19; COVID-19 pandemic; Clinical; Clinical Data; Collaborations; Communicable Diseases; Coronavirus; Data; Data Analyses; Data Set; Epidemiology; Evolution; Funding; Genbank; Gender; Genetic; Genotype; Geography; Goals; Health; International; Intervention; Joints; Journals; Knowledge; Link; Location; Manuals; Metadata; Methods; Modeling; Natural Language Processing; Ontology; Outcome; Patient Care; Patients; Peer Review; Performance; Phylogenetic Analysis; Population; Population Group; Populations at Risk; Printing; Probability; Public Health; Publications; Publishing; Race; Records; Relative Risks; Reporting; Research; Research Personnel; Resolution; Resources; Risk; SARS coronavirus; Scientist; Sequence Analysis; Severities; Specific qualifier value; System; Testing; Text; Unified Medical Language System; United States National Institutes of Health; Update; Viral; Viral Genome; Virus; Work; clinical phenotype; cohort; comorbidity; coronavirus disease; dashboard; data sharing; deep learning; demographics; field study; genomic epidemiology; heuristics; improved; insight; novel; pandemic disease; population health; prevent; public database; public repository; residence; response; secondary analysis; sex; text searching; transmission process; trend; virus characteristic
Project abstract
Project Summary
In response to the COVID-19 pandemic, scientists have published over one hundred thousand research articles
and made available over eight hundred thousand virus genome sequences. These sequences, along with their
metadata, can be used to understand virus evolution and spread and their implications for public health, a field of
study called genomic epidemiology. However, these sequence records do not typically contain patient metadata
such as demographics, clinical severity, or comorbidities, preventing researchers from uncovering trends in
population health. To understand the severity of the problem, we analyzed nearly 748 thousand SARS-CoV-2
records from GISAID and 60 thousand from GenBank for the presence of patient metadata. Age and gender were
represented in < 1% of GenBank records; in GISAID, 26% of records included sex and 24% included age. For
other fields, the amount of missing data is even more pronounced, with neither resource providing information on
a patient's race and only GISAID specifying severity (i.e., ICU) in less than 5% of records. To address missing
virus metadata, researchers could use the publication associated with the new sequences; however, the virus
sequence record is often not updated with a link to the publication. From the set of records that we analyzed,
3.4% (of 748K) in GISAID and < 1% (of 117K) in GenBank had a link to a publication. This greatly hinders
secondary data analysis of these sequences and limits the ability to use them at scale to uncover associations
between the viral genome, transmission risk, and health outcomes. The goal of this proposal is to enhance
genomic epidemiology and population health of COVID-19 with a framework to continuously and automatically
enrich SARS-CoV-2 nucleic acid sequence metadata in public databases such as GenBank and GISAID with
metadata in associated published articles. We will incorporate input from clinicians at the front-line of patient
care during the pandemic and build on our NIH funded work (R01AI117011), which used Natural Language
Processing (NLP) to enrich the geographic metadata of a sequence record using its corresponding published
article. We have used these data in virus phylogeographic models and shown the benefit of using enriched
metadata for modeling virus evolution and spread. The availability of SARS-CoV-2 sequences, paired with
full-text COVID-19 articles and preprints, presents an opportunity for metadata enrichment and scientific discovery
beyond our prior work. Our specific aims are to: (1) enrich SARS-CoV-2 sequence metadata using text extracted
from publications and (2) derive key epidemiologic insights for different patient demographics using our enriched
SARS-CoV-2 sequence dataset. We will leverage our prior joint work funded by the NIH to enable the secondary
use of enriched metadata for genomic epidemiology to improve our understanding of SARS-CoV-2 evolution and
spread among different population groups. We will disseminate the enriched data through our GeoBoost2 data
dashboard, GenBank LinkOut and the i2b2 platform. The latter will more immediately allow integration with
COVID-specific clinical data shared by the 4CE Consortium.
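
The metadata-coverage figures quoted in the summary above (the share of records that carry age, sex, or a publication link) can be illustrated with a minimal sketch. This is not the project's actual pipeline: the field names, the toy records, and the set of placeholder strings treated as missing are all assumptions made for illustration, and they do not reflect the real GISAID or GenBank schemas.

```python
from typing import Iterable, Mapping, Optional

# Placeholder strings counted as "missing". This set is an assumption;
# real repositories use a wider variety of conventions.
MISSING = {"", "unknown", "n/a", "na", "not provided", "missing"}


def is_present(value: Optional[str]) -> bool:
    """Return True if a metadata value carries usable information."""
    return value is not None and value.strip().lower() not in MISSING


def field_coverage(
    records: Iterable[Mapping[str, Optional[str]]],
    fields: Iterable[str],
) -> dict:
    """Compute, for each field, the fraction of records with a usable value."""
    records = list(records)
    fields = list(fields)
    if not records:
        return {f: 0.0 for f in fields}
    return {
        f: sum(is_present(r.get(f)) for r in records) / len(records)
        for f in fields
    }


# Toy records with hypothetical field names and accession labels.
toy_records = [
    {"accession": "A1", "age": "34", "sex": "female", "pubmed_id": None},
    {"accession": "A2", "age": "unknown", "sex": "male", "pubmed_id": None},
    {"accession": "A3", "age": "", "sex": None, "pubmed_id": "12345"},
    {"accession": "A4", "age": None, "sex": "unknown", "pubmed_id": None},
]

coverage = field_coverage(toy_records, ["age", "sex", "pubmed_id"])
for field, fraction in coverage.items():
    print(f"{field}: {fraction:.0%} of records")
```

Treating placeholder strings such as "unknown" as missing matters in practice, because a field that is technically populated may still carry no information; the exact list would need to be tuned to each repository.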
Project outcomes
Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by GRACIELA GONZALEZ HERNANDEZ
Other grants by GRACIELA GONZALEZ HERNANDEZ
Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles
- Grant number: 10681068
- Fiscal year: 2022
- Funding amount: $583,700
- Project category:
Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles
- Grant number: 10390667
- Fiscal year: 2021
- Funding amount: $583,700
- Project category:
Tracking Evolution and Spread of Viral Genomes by Geospatial Observation Error
- Grant number: 9249484
- Fiscal year: 2016
- Funding amount: $583,700
- Project category:
Text Processing and Geospatial Uncertainty for Phylogeography of Zoonotic Viruses
- Grant number: 8698542
- Fiscal year: 2013
- Funding amount: $583,700
- Project category:
Mining Social Network Postings for Mentions of Potential Adverse Drug Reactions
- Grant number: 8222740
- Fiscal year: 2012
- Funding amount: $583,700
- Project category:
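Similar grants from the National Natural Science Foundation of China
Research on neuromorphic vision object recognition algorithms driven by spatiotemporal sequences
- Grant number: 61906126
- Approval year: 2019
- Funding amount: CNY 240,000
- Project category: Young Scientists Fund
Ontology-driven spatial semantic modeling of address data and address matching methods
- Grant number: 41901325
- Approval year: 2019
- Funding amount: CNY 220,000
- Project category: Young Scientists Fund
Research on optimized address mapping table design and memory access optimization for large-capacity solid-state drives
- Grant number: 61802133
- Approval year: 2018
- Funding amount: CNY 230,000
- Project category: Young Scientists Fund
Research on memory safety defense techniques for objects targeted by memory attacks
- Grant number: 61802432
- Approval year: 2018
- Funding amount: CNY 250,000
- Project category: Young Scientists Fund
Research on IP-address-driven multipath routing and traffic transmission control
- Grant number: 61872252
- Approval year: 2018
- Funding amount: CNY 640,000
- Project category: General Program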
Similar overseas grants
Alzheimer's Disease and Related Dementia-like Sequelae of SARS-CoV-2 Infection: Virus-Host Interactome, Neuropathobiology, and Drug Repurposing
- Grant number: 10661931
- Fiscal year: 2023
- Funding amount: $583,700
- Project category:
Infant Immunologic and Neurologic Development following Maternal Infection in Pregnancy during Recent Epidemics
- Grant number: 10784250
- Fiscal year: 2023
- Funding amount: $583,700
- Project category:
Interactions of SARS-CoV-2 infection and genetic variation on the risk of cognitive decline and Alzheimer’s disease in Ancestral and Admixed Populations
- Grant number: 10628505
- Fiscal year: 2023
- Funding amount: $583,700
- Project category:
The impact of immune escape on the epidemiology and evolutionary dynamics of the COVID-19 pandemic in Yucatan, Mexico
- Grant number: 10741899
- Fiscal year: 2023
- Funding amount: $583,700
- Project category:
Impact of SARS-CoV-2 infection on respiratory viral immune responses in children with and without asthma
- Grant number: 10568344
- Fiscal year: 2023
- Funding amount: $583,700
- Project category: