Microbial community metagenomes and individual microbial genomes are becoming increasingly accessible by means of high-throughput sequencing. Organismal membership within a community is typically assessed using one or a few taxonomic marker genes such as 16S rDNA, and these same genes are also employed to reconstruct molecular phylogenies. There is thus a growing need to bioinformatically catalog strongly conserved core genes that can serve as effective taxonomic markers, to assess the agreement among phylogenies generated from different core genes, and to characterize the biological functions that are enriched within core genes and thus conserved throughout large microbial clades. We present a method to recursively identify core genes (i.e., genes ubiquitous within a microbial clade) in high throughput from a large number of complete input genomes. We analyzed over 1,100 genomes to produce core gene sets spanning 2,861 bacterial and archaeal clades, ranging in size from one to >2,000 genes in inverse correlation with the α-diversity (total phylogenetic branch length) spanned by each clade. These cores are enriched, as expected, for housekeeping functions including translation, transcription, and replication, in addition to significant representations of regulatory, chaperone, and conserved uncharacterized proteins. In agreement with previous manually curated core gene sets, phylogenies constructed from one or more of these core genes agree with those built using 16S rDNA sequence similarity, suggesting that systematic core gene selection can be used to optimize both comparative genomics and the determination of microbial community structure. Finally, we examine functional phylogenies constructed by clustering genomes by the presence or absence of orthologous gene families and show that they provide an informative complement to standard sequence-based molecular phylogenies.
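The recursive definition of a core (genes ubiquitous within a clade) can be illustrated with a minimal sketch. It is not the method presented here. It assumes strict ubiquity, meaning a gene family must occur in every genome of the clade, and it assumes that genes have already been assigned to orthologous families. The function name `core_genes`, the dictionary-based taxonomy, and the toy genomes are all hypothetical and serve only as illustration.

```python
from typing import Dict, List, Set


def core_genes(
    clade: str,
    tree: Dict[str, List[str]],
    genome_genes: Dict[str, Set[str]],
    cores: Dict[str, Set[str]],
) -> Set[str]:
    """Return the gene families found in every genome under `clade`.

    `tree` maps each internal clade to its (non-empty) list of children;
    any name absent from `tree` is treated as a genome (leaf).
    The core of every internal clade is recorded in `cores`.
    """
    if clade not in tree:
        # Leaf: a single genome's "core" is simply its own gene content.
        return set(genome_genes[clade])
    # Internal clade: its core is the intersection of its children's cores.
    child_cores = [core_genes(c, tree, genome_genes, cores) for c in tree[clade]]
    cores[clade] = set.intersection(*child_cores)
    return cores[clade]


# Toy taxonomy and gene content (illustrative data only).
tree = {
    "Bacteria": ["Firmicutes", "Proteobacteria"],
    "Firmicutes": ["genome_A", "genome_B"],
    "Proteobacteria": ["genome_C"],
}
genome_genes = {
    "genome_A": {"rpoB", "recA", "dnaK", "spo0A"},
    "genome_B": {"rpoB", "recA", "spo0A"},
    "genome_C": {"rpoB", "recA", "dnaK"},
}

cores: Dict[str, Set[str]] = {}
core_genes("Bacteria", tree, genome_genes, cores)
for name in tree:
    print(name, sorted(cores[name]))
```

In this toy example the core shrinks from three families in each phylum-level clade to two (`recA`, `rpoB`) at the root, mirroring the inverse relationship between core size and the diversity spanned by a clade. A practical implementation would likely need to tolerate incomplete annotations rather than require strict presence in every genome.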