The generation of genomic binding and accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for identifying DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs rather than merely over-represented sequence elements. The key distinguishing feature of this algorithm is that it combines a dynamic search space and a learned threshold for finding discriminative motifs with the modeling of motifs as full position weight matrices (PWMs) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than, that of motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
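To make the idea of a PWM paired with a learned threshold concrete, the following is a minimal sketch and not the MotifSpec implementation. It assumes a toy 4-bp motif, a uniform background, forward-strand scanning only, and a threshold chosen by maximizing accuracy on labeled sequences; the function names (log_odds_pwm, best_score, learn_threshold) and all data are hypothetical.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log-odds PWM (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def best_score(pwm, seq):
    """Highest PWM score over all windows of seq (forward strand only)."""
    width = len(pwm)
    scores = [
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


def learn_threshold(pwm, positives, negatives):
    """Pick the score cutoff that maximizes accuracy on the labeled sequences."""
    pos = [best_score(pwm, s) for s in positives]
    neg = [best_score(pwm, s) for s in negatives]
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(pos + neg)):
        # A sequence is called "bound" when its best window scores at least t.
        correct = sum(p >= t for p in pos) + sum(n < t for n in neg)
        accuracy = correct / (len(pos) + len(neg))
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Toy data (hypothetical): a GATA-like motif and four short sequences.
counts = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
positives = ["CCGATACC", "TTTGATAG"]
negatives = ["CCCCCCCC", "TTGTTTGT"]

pwm = log_odds_pwm(counts)
threshold, accuracy = learn_threshold(pwm, positives, negatives)
print(f"threshold={threshold:.3f} accuracy={accuracy:.2f}")
```

In this sketch the motif is judged by how well its scores separate the two sequence sets, not by how often it occurs, which is the distinction between predictive and over-represented motifs drawn above. A real application would also scan the reverse strand and evaluate the threshold on held-out data.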