Metagenomic samples vary widely across space and time, so it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric is a robust and widely used measure of the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be improved by finding the average, also known as the barycenter, of the samples with respect to the UniFrac distance. However, such a UniFrac average may include negative entries, in which case it is no longer a valid representation of a metagenomic community.
To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L2UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which the average is easy to compute, producing biologically meaningful, environment-specific “representative samples.” We demonstrate the usefulness of such representative samples, as well as the extended use of L2UniFrac in efficient clustering of metagenomic samples, and we provide mathematical characterizations and proofs of the desired properties of L2UniFrac.
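To illustrate why a Euclidean-type variant admits an easy average, the following minimal sketch assumes that L2UniFrac is the Euclidean distance between vectors of subtree masses, each weighted by the square root of its branch length. That assumption, the toy tree, and the names `push_up`, `inverse_push_up`, `l2_unifrac`, and `representative` are hypothetical choices made for illustration; they are not taken from the linked implementation.

```python
import math

# Hypothetical toy tree: node i hangs from PARENT[i] by an edge of length
# LENGTH[i]. Node 4 is the root (parent -1). Children always have smaller
# indices than their parents, which the single-pass loops below rely on.
PARENT = [3, 3, 4, 4, -1]
LENGTH = [1.0, 2.0, 1.5, 0.5, 0.0]

# Assumed weighting: sqrt(branch length) per edge; the root keeps weight 1.
WEIGHT = [math.sqrt(l) if p >= 0 else 1.0 for p, l in zip(PARENT, LENGTH)]


def push_up(abundance):
    """Map per-node abundances to weighted subtree masses."""
    mass = list(abundance)
    for node, parent in enumerate(PARENT):
        if parent >= 0:
            mass[parent] += mass[node]
    return [w * m for w, m in zip(WEIGHT, mass)]


def inverse_push_up(pushed):
    """Recover per-node abundances from weighted subtree masses."""
    mass = [v / w for v, w in zip(pushed, WEIGHT)]
    abundance = list(mass)
    for node, parent in enumerate(PARENT):
        if parent >= 0:
            abundance[parent] -= mass[node]
    return abundance


def l2_unifrac(p, q):
    """Euclidean distance between the pushed-up vectors (assumed form)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(push_up(p), push_up(q))))


def representative(samples):
    """Average the pushed-up vectors, then map the mean back to the tree."""
    pushed = [push_up(s) for s in samples]
    mean = [sum(column) / len(pushed) for column in zip(*pushed)]
    return inverse_push_up(mean)


samples = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]
rep = representative(samples)
print(rep)
print(l2_unifrac(samples[0], samples[1]))
```

Because the push-up map in this sketch is linear, the Euclidean mean of the pushed-up vectors is the push-up of the componentwise mean of the samples, so the recovered vector is nonnegative and sums to one whenever every input sample does.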
A prototype implementation is provided at https://github.com/KoslickiLab/L2-UniFrac.git. All figures, data, and analysis can be reproduced at https://github.com/KoslickiLab/L2-UniFrac-Paper.