Leveraging k-mer sketching statistics to enhance metagenomic methods and alignment algorithms

Basic Information

Project Summary

In the face of increasing data sizes, sketching techniques such as MinHash sketching and its winnowed version have been among the most effective in facilitating scalable analysis. Frequently, though, bioinformatic algorithms using these techniques do not account for the randomness inherent in both the sketching process and the mutation processes that generate the data (e.g. sequencing errors or evolutionary mutations). This project directly addresses this limitation by laying the statistical foundations for how these sketching approaches interact with mutation processes and k-mer-based techniques, resulting in new algorithms for important biomedical problems.

Aim 1 derives, for the first time, confidence and prediction intervals for frequently used sketching-based bioinformatics quantities that until now existed only as point estimates. To do so, it relies on sophisticated techniques from probability theory. The mathematical foundations laid by Aim 1 will not only help us achieve the biological aims of this proposal, but will also serve as a basis for quantifying the performance of future sketching-based bioinformatics algorithms.

Aim 2 will then use these results to develop the first metagenomic taxonomic profiling algorithm that accounts for the uncertainty present when predicting the presence and relative abundance of microorganisms in a sample. This will resolve a long-standing issue in this field by providing researchers with an informed way to filter their noisy data without sacrificing sensitivity, thereby facilitating biomedical discoveries (e.g. novel CRISPR systems). In addition, this aim will result in the first scalable method to quickly estimate the fraction of a metagenomic sample that is not described by current reference databases, thus illuminating which datasets contain the highest quantity of novel genetic material and hence the greatest potential for biological discovery (e.g. novel antibiotics). Aim 2 will be achieved using techniques from compressive sensing as well as probability theory.

Aim 3 will both use and extend the results of Aim 1 to quantifiably improve one of the most fundamental tools in a computational biologist's toolkit: sequence alignment. This will equip modern sequence aligners with much-needed significance scores and confidence intervals, and will allow parameter settings to be selected automatically to achieve a desired precision or recall. Because aligners are ubiquitous in biomedical research, even a small improvement in the accuracy and features of an aligner will have tremendous impact. Aim 3 will be achieved using techniques from probabilistic algorithms.

Finally, the long-term objective of this proposal is to provide researchers with a toolkit that enables the development of scalable k-mer-based sketching algorithms without sacrificing their ability to quantify statistical significance.
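To make the "point estimate" mentioned under Aim 1 concrete, here is a minimal Python sketch of one such quantity: a FracMinHash containment index and the mutation rate it implies. It is not the project's code. The function names, the choice of BLAKE2b as the hash, and the `scale` parameter are assumptions made for illustration. It also assumes a simple mutation model in which each nucleotide mutates independently with probability p, so that a k-mer survives unmutated with probability (1 - p)^k, and it ignores reverse complements. The confidence intervals that the project derives around this estimate are not reproduced here.

```python
import hashlib

HASH_SPACE = 2 ** 64  # size of the 64-bit hash range used below


def kmers(seq, k):
    """Return the set of distinct k-mers in seq (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frac_minhash(seq, k, scale):
    """Keep the hashes of k-mers that fall in the lowest 1/scale of the hash range.

    With scale=1 every k-mer is kept; larger values keep a smaller fraction.
    """
    threshold = HASH_SPACE // scale
    sketch = set()
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h < threshold:
            sketch.add(h)
    return sketch


def containment(sketch_a, sketch_b):
    """Estimate |A ∩ B| / |A| from two sketches built with the same k and scale."""
    if not sketch_a:
        raise ValueError("sketch_a is empty; use a smaller scale or a longer sequence")
    return len(sketch_a & sketch_b) / len(sketch_a)


def mutation_rate_point_estimate(c, k):
    """Invert c = (1 - p)^k under the simple mutation model assumed above."""
    return 1.0 - c ** (1.0 / k)
```

For two genomes sketched with the same `k` and `scale`, `mutation_rate_point_estimate(containment(a, b), k)` gives a single number with no indication of how far it may be from the true rate, which is the gap the summary describes.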

Project Outputs

Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash.
  • DOI:
    10.1101/gr.277651.123
  • Published:
    2023-07
  • Journal:
  • Impact factor:
    7
  • Authors:
    Rahman Hera, Mahmudur; Pierce-Ward, N Tessa; Koslicki, David
  • Corresponding author:
    Koslicki, David
Finding phylogeny-aware and biologically meaningful averages of metagenomic samples: L2UniFrac.

Similar NSFC (National Natural Science Foundation of China) Grants

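Evaluation methods, influence mechanisms, and optimization strategies for the microclimate of traditional Jiangnan villages, based on advanced algorithms and behavioral analysis
  • Grant number:
    52378011
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project category:
    General Program
Key influencing factors and efficient algorithms for opinion dynamics on social networks
  • Grant number:
    62372112
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project category:
    General Program
Employee algorithm-avoidance behavior: conceptual structure, scale development, and multilevel influence mechanisms, from a perspective integrating big-data and small-data research methods
  • Grant number:
    72372021
  • Approval year:
    2023
  • Funding amount:
    CNY 400,000
  • Project category:
    General Program
The impact of algorithmic human resource management on employees' algorithm-coping behavior and job performance: a study of pathways through employee cognition and affect
  • Grant number:
    72372070
  • Approval year:
    2023
  • Funding amount:
    CNY 400,000
  • Project category:
    General Program
Influencing factors and mechanisms of the algorithmic divide
  • Grant number:
    72304017
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project category:
    Young Scientists Fund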

Similar Overseas Grants

Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
  • Grant number:
    10462257
  • Fiscal year:
    2023
  • Funding amount:
    $443,500
  • Project category:
New Algorithms for Cryogenic Electron Microscopy
  • Grant number:
    10543569
  • Fiscal year:
    2023
  • Funding amount:
    $443,500
  • Project category:
Move and Snooze: Adding insomnia treatment to an exercise program to improve pain outcomes in older adults with knee osteoarthritis
  • Grant number:
    10797056
  • Fiscal year:
    2023
  • Funding amount:
    $443,500
  • Project category:
Elucidating causal mechanisms of ethanol-induced analgesia in BXD recombinant inbred mouse lines
  • Grant number:
    10825737
  • Fiscal year:
    2023
  • Funding amount:
    $443,500
  • Project category:
High-throughput thermodynamic and kinetic measurements for variant effects prediction in a major protein superfamily
  • Grant number:
    10752370
  • Fiscal year:
    2023
  • Funding amount:
    $443,500
  • Project category: