NIRG: FARSPhase: a Flexible, widely Applicable, Robust, and Scalable phasing algorithm for human genetics

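Basic information

Grant number:
MR/M000370/1
Principal investigator:
John Hickey
Amount:
$482,600
Host institution:
University of Edinburgh
Host institution country:
United Kingdom
Project type:
Research Grant
Fiscal year:
2015
Funding country:
United Kingdom
Start and end dates:
2015 to (no data)
Project status:
Completed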

Source:
https://gtr.ukri.org/projects?ref=MR%2FM000370%2F1
Keywords:
NIRG FARSPhase Flexible widely Applicable

Project abstract

In computational genetics, phasing is the modelling of the underlying haploid structure of diploid genotypes. It is important for many genetic studies because inheritance actually takes place at the haploid level, even though we can only directly observe diploid genotypes with current mainstream technologies. In many applications haplotypes provide richer and more useful information than genotypes alone. Applications of haplotype phase include understanding the interplay of genetic variation and disease, enabling identity-by-descent models for use in heritability analysis, gene association studies and genomic prediction, imputation of un-typed genetic variation, prioritizing individuals for sequencing, calling genotypes, detecting genotype error, inferring human demographic history, inferring points of recombination, detecting recurrent mutation and signatures of selection, and modelling cis-regulation of gene expression.

Human genetics data sets that will likely be phased in the future can be categorised into: (i) huge populations of nominally unrelated individuals (e.g. 500,000 individuals, UK Biobank); (ii) smaller subsets of such populations (e.g. data collected in individual studies); (iii) large (e.g. 50,000 individuals) or small (e.g. 1,000 individuals) data sets collected from isolated populations with high degrees of relatedness within them (e.g. Orcades - Orkney, deCODE - Iceland, VIKING - Sweden); (iv) data sets with and without pedigree information; (v) data sets that combine several of these features (e.g. Generation Scotland); and (vi) data sets with different types of genomic information (e.g. single nucleotide polymorphisms, low- or high-coverage sequence, short or longer sequence reads, etc.).

There are many phasing methods for human genetics data and these can be broadly classified into two groups: (i) heuristic methods (e.g. Long-Range Phasing (LRP)); and (ii) probabilistic methods (e.g. Hidden Markov Models (HMM)). Phasing is computationally intensive, and the size and features of different data sets make them more or less suited to particular methods. LRP is computationally fast in comparison to HMM, but is only applicable to situations where individuals share relatively recent ancestry (e.g. within 10 generations) and thus share relatively long haplotypes (e.g. 5 to 10 cM in length). Isolated populations (e.g. as in Orcades, Orkney) are ideally suited to LRP, but huge populations with hundreds of thousands of nominally unrelated individuals may also be suitable (e.g. UK Biobank). Applying current HMMs to such huge populations is computationally intractable. However, HMMs are more suited than LRP to subsets of such populations, because HMMs only require that individuals share short haplotypes (e.g. <1 cM) inherited from very distant common ancestors (e.g. 50 to 100 generations ago).

LRP and HMM methods are complementary in many ways. LRP models long haplotypes; HMMs model short haplotypes. HMM methods are more flexible and can better model uncertainty in the data. LRP methods are computationally much more efficient and are also more accurate in scenarios to which they are suited. LRP methods are also more amenable to incorporation of pedigree information. A combined algorithm could exploit this complementarity.

The objective of this proposal is to develop FARSPhase: a Flexible, widely Applicable, Robust, and Scalable phasing algorithm for human genetics that combines the best features of LRP, other heuristics, and HMM methods into a single framework. As well as meeting the phasing needs of small data sets, this research will, if successful, enable huge data sets to be phased, thereby opening the possibility of more powerful analysis. The developed algorithm will be built into a user-friendly software package using best practices in software engineering, and its performance will be tested on a wide range of simulated and real data sets that reflect the likely future phasing scenarios for human genetics.
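To make the phasing problem described in the abstract concrete, the following minimal sketch enumerates every pair of haplotypes that is consistent with one observed diploid genotype. It assumes biallelic markers coded as alternative-allele counts (0, 1 or 2); the function name `consistent_haplotype_pairs` is hypothetical and is not part of FARSPhase or of any LRP or HMM software. It only illustrates why phase is ambiguous: a genotype with k heterozygous sites (k ≥ 1) is compatible with 2^(k-1) distinct haplotype pairs, which is the space a phasing method has to resolve using information shared across individuals.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Return all unordered haplotype pairs whose allele sum equals the genotype.

    genotype: sequence of alternative-allele counts (0, 1 or 2), one per marker.
    This coding is an assumption made for the illustration.
    """
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype values must be 0, 1 or 2")

    # Only heterozygous sites (count 1) are ambiguous.
    het_sites = [i for i, g in enumerate(genotype) if g == 1]

    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
        hap_a = [g // 2 for g in genotype]
        hap_b = list(hap_a)
        # At each heterozygous site the two haplotypes carry opposite alleles.
        for site, bit in zip(het_sites, bits):
            hap_a[site] = bit
            hap_b[site] = 1 - bit
        # Sort within the pair so that (a, b) and (b, a) count once.
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


if __name__ == "__main__":
    for hap_a, hap_b in consistent_haplotype_pairs([1, 2, 1, 0]):
        print(hap_a, hap_b)
```

Exhaustive enumeration grows exponentially with the number of heterozygous sites, so it is usable only for toy inputs; it is shown here purely to illustrate the ambiguity, not as a phasing method.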

Project outputs

Journal articles (10)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

A hybrid method for the imputation of genomic data in livestock populations.

DOI:
10.1186/s12711-017-0300-y
Published:
2017-03-03
Journal:
Genetics, selection, evolution : GSE
Impact factor:
0
Authors:
Antolín R;Nettelblad C;Gorjanc G;Money D;Hickey JM
Corresponding author:
Hickey JM

MOESM8 of A hybrid method for the imputation of genomic data in livestock populations

DOI:
10.6084/m9.figshare.c.3708046_d8
Published:
2017
Journal:
Impact factor:
0
Authors:
Antolín R
Corresponding author:
Antolín R

MOESM3 of A hybrid method for the imputation of genomic data in livestock populations

DOI:
10.6084/m9.figshare.c.3708046_d3
Published:
2017
Journal:
Impact factor:
0
Authors:
Antolín R
Corresponding author:
Antolín R

Effect of manipulating recombination rates on response to selection in livestock breeding programs.

DOI:
10.1186/s12711-016-0221-1
Published:
2016-06-22
Journal:
Genetics, selection, evolution : GSE
Impact factor:
0
Authors:
Battagin M;Gorjanc G;Faux AM;Johnston SE;Hickey JM
Corresponding author:
Hickey JM

A family-based phasing algorithm for sequence data

DOI:
10.1101/504480
Published:
2018
Journal:
Impact factor:
0
Authors:
Battagin M
Corresponding author:
Battagin M

Other publications by John Hickey

Spatial Dissection of the Bone Marrow Microenvironment in Multiple Myeloma By High Dimensional Multiplex Tissue Imaging

DOI:
10.1182/blood-2023-189255
Published:
2023-11-02
Journal:
Conference abstract
Impact factor:
Authors:
Marc-Andrea Baertsch;Alexander Brobeil;John Hickey;Maximilian Haist;Alexandra Maria Poos;Guolan Lu;Wilson Kuswanto;Christian Schuerch;Harald Voehringer;Wolfgang Huber;Gunhild Mechtersheimer;Carsten Mueller-Tidow;Peter Schirmacher;Katja Weisel;Roland Fenk;Hartmut Goldschmidt;Yury Goltsev;Marc S. Raab;Niels Weinhold;Garry P. Nolan
Corresponding author:
Garry P. Nolan