Alternative splicing (AS) is an important mechanism in the development of many cancers, and novel or aberrant AS patterns can act as independent oncogenic drivers. Cancer-specific AS is also a potentially effective target for personalized cancer therapeutics. However, detecting AS events remains challenging, especially when the events are novel. The difficulty is compounded by existing transcriptome annotation databases, which are far from comprehensive, particularly with regard to cancer-specific AS. In addition, traditional sequencing technologies are severely limited by the short length of the reads they generate, which rarely span more than a single splice junction. Given these challenges, transcriptomic long-read (LR) sequencing is a promising technology for the detection and discovery of AS. We present Freddie, a computational annotation-independent isoform discovery and detection tool. Freddie takes as input the transcriptomic LR sequencing of a sample together with its genomic split alignment and computes a set of isoforms for that sample. It first partitions the input reads into sets that can be processed independently and in parallel. For each partition, Freddie segments the genomic alignment of the reads into canonical exon segments. The goal of this segmentation is to represent any potential isoform as a subset of these canonical exons. The segmentation is formulated as an optimization problem and solved with a dynamic programming algorithm. Freddie then reconstructs the isoforms by jointly clustering and error-correcting the reads, using the canonical segmentation as a succinct representation. This clustering and error-correcting step is formulated as an optimization problem, the Minimum Error Clustering into Isoforms (MErCi) problem, and is solved using integer linear programming (ILP).
We compare the performance of Freddie on simulated datasets with that of other isoform detection tools that vary in their dependence on annotation databases. We show that Freddie is more accurate than the other tools, including those given the complete ground-truth annotation. We also run Freddie on a transcriptomic LR dataset generated in-house from a prostate cancer cell line with a matched short-read RNA-seq dataset. The isoforms reported by Freddie have a higher short-read cross-validation rate than those reported by the other tested tools. Freddie is open source and available at https://github.com/vpc-ccg/freddie/.
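To illustrate how a canonical segmentation can serve as a succinct representation of reads, the following minimal Python sketch may help. It is not Freddie's implementation. It replaces the dynamic programming segmentation with a naive rule (every observed alignment boundary becomes a breakpoint) and replaces the MErCi ILP with a grouping of reads that share an identical segment vector, with no error correction. The function names, the half-open coordinate convention, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict


def canonical_segments(reads):
    """Naive segmentation: cut at every observed interval boundary.

    `reads` maps a read name to its aligned (start, end) intervals,
    assumed half-open. Gaps between exons also become segments here.
    """
    points = sorted({p for ivs in reads.values() for iv in ivs for p in iv})
    return list(zip(points, points[1:]))


def read_vector(intervals, segments):
    """Binary vector: 1 if a segment lies fully inside an aligned interval."""
    return tuple(
        int(any(s <= a and b <= e for s, e in intervals))
        for a, b in segments
    )


def group_identical(reads):
    """Stand-in for clustering: group reads with the same segment vector."""
    segments = canonical_segments(reads)
    groups = defaultdict(list)
    for name, intervals in reads.items():
        groups[read_vector(intervals, segments)].append(name)
    return segments, dict(groups)


# Hypothetical toy reads: r2 skips the middle exon that r1 and r3 include.
reads = {
    "r1": [(100, 200), (300, 400), (500, 600)],
    "r2": [(100, 200), (500, 600)],
    "r3": [(100, 200), (300, 400), (500, 600)],
}
segments, groups = group_identical(reads)
for vector, names in groups.items():
    print(vector, names)
```

In this toy case each group corresponds to one candidate isoform expressed as a subset of the segments. Real long reads carry alignment errors, so exact matching of vectors would fragment the groups, which is why the abstract describes clustering and error correction being solved jointly.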