Transcription factors (TFs) are DNA-binding proteins that play a central role in regulating gene expression. Identifying the DNA-binding sites of TFs is a key task in understanding transcriptional regulation, cellular processes and disease. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) enables genome-wide identification of in vivo TF binding sites. However, it is still difficult to map every TF in every cell line owing to cost and the limited availability of biological material, which poses a major obstacle to integrated analysis of gene regulation. To address this problem, we propose a novel computational approach, TFBSImpute, for predicting additional TF binding profiles by leveraging information from available ChIP-seq TF binding data. TFBSImpute fuses the dataset into a 3-mode tensor and imputes missing TF binding signals via simultaneous completion of multiple TF binding matrices with positional consistency. Evaluated against observed ChIP-seq TF binding profiles, the signals predicted by our method show overall similarity to the experimental data, and TFBSImpute significantly outperforms baseline approaches. In addition, motif analysis shows that TFBSImpute performs better than the baselines at capturing binding motifs enriched in the observed data, indicating that its higher performance is not simply due to averaging related samples. We anticipate that our approach will constitute a useful complement to experimental mapping of TF binding, which will benefit further study of regulatory mechanisms and disease.
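To make the idea of imputing signals in a 3-mode tensor concrete, the sketch below fills the missing entries of a small TF × cell line × genomic position tensor with a generic low-rank CP model fitted by ridge-regularized alternating least squares on the observed entries only. It is an illustration under stated assumptions, not the TFBSImpute algorithm: the abstract does not specify the model or its positional-consistency term, so that term is omitted here, and the names `cp_complete` and `_update_factor`, the mode ordering, the rank and the regularization strength are all hypothetical choices.

```python
import numpy as np


def _update_factor(X, M, F1, F2, reg):
    """Ridge least-squares update of the mode-0 factor, given the other two.

    X, M : arrays of shape (n0, n1, n2); M is True where X is observed.
    F1   : (n1, r) factor for mode 1; F2 : (n2, r) factor for mode 2.
    """
    r = F1.shape[1]
    # One design row per (j, k) pair, flattened in the same order as X[i].
    D = (F1[:, None, :] * F2[None, :, :]).reshape(-1, r)
    out = np.zeros((X.shape[0], r))
    for i in range(X.shape[0]):
        obs = M[i].reshape(-1)           # observed (j, k) pairs for slice i
        Di = D[obs]
        yi = X[i].reshape(-1)[obs]       # unobserved values are never read
        out[i] = np.linalg.solve(Di.T @ Di + reg * np.eye(r), Di.T @ yi)
    return out


def cp_complete(X, mask, rank=2, n_iters=50, reg=1e-2, seed=0):
    """Impute unobserved entries of a 3-mode tensor with a rank-`rank` CP fit.

    X    : (n_tf, n_cell, n_pos) signal tensor (assumed mode ordering).
    mask : boolean array, True where X holds an observed value.
    Returns a copy of X with unobserved entries replaced by model predictions.
    """
    rng = np.random.default_rng(seed)
    n_tf, n_cell, n_pos = X.shape
    A = rng.random((n_tf, rank))
    B = rng.random((n_cell, rank))
    C = rng.random((n_pos, rank))
    for _ in range(n_iters):
        # Update each factor in turn, holding the other two fixed.
        A = _update_factor(X, mask, B, C, reg)
        B = _update_factor(X.transpose(1, 0, 2), mask.transpose(1, 0, 2), A, C, reg)
        C = _update_factor(X.transpose(2, 0, 1), mask.transpose(2, 0, 1), A, B, reg)
    recon = np.einsum('ir,jr,kr->ijk', A, B, C)
    # Keep observed values untouched; fill only the missing ones.
    return np.where(mask, X, recon)
```

In this toy setting, an unmapped TF and cell line pair corresponds to a fully masked fiber `X[i, j, :]`, and the prediction for it borrows strength from the same TF in other cell lines and from other TFs in the same cell line. A simple averaging baseline would also use those related samples, which is why the motif analysis is needed to distinguish the two.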