POWRE: Combining Data Mining and Information Visualization Techniques with a Molecular Biology Sequence Similarity Database System

Basic Information

Award number:
9753283
Principal investigator:
Elizabeth Shoop
Amount:
$70,600
Institution:
University of Minnesota-Twin Cities
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
1998
Funding country:
United States
Project period:
1998-01-01 to 1999-12-31
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=9753283&HistoricalAwards=false
Keywords:
POWRE Combining Data Mining Information

Project Abstract

The main objective of this project is to aid genome researchers with the task of elucidating patterns and clusters in large amounts of biological data. For genome researchers who are interested in comparing gene or protein sequences to the sequences within one genome or across genomes, this task involves executing hundreds of thousands of similarity searches that produce text output. This project involves the development of two specific software tools for visualizing and exploring the similarity data in a database of biological sequence similarity results. The first tool will be an Interactive Categorization Tool. This tool will display attributes of selected similarity database objects in a 2D scatterplot and enable dynamic manipulation of the display. This will enable the genome researcher to explore the attributes of similarities and categorize the similarities based on those attributes. For example, the genome researcher will be able to vary the input parameters of a function for computing the strength of each detected similarity and display a plot with the strength of each similarity shown as the color of each point, and the points situated in the 2D space based on score and statistical significance as the X and Y axes. The tool will enable genome researchers to dynamically manipulate the generation of higher-level concepts or categories for detected similarities (strong, marginal, and weak similarities as opposed to individual similarities with particular values of score and statistical significance that are more difficult to compare). This will lead to their ability to categorize hits as orthologous or paralogous, based on various attributes of the detected similarities. Score and p-value are not the only attributes that can be used -- the system is general enough that other attributes, such as percent identity, percent conserved, and length of alignment, among others, could be used in functions.
Thus, genome researchers can conduct exploration at different stages of the genome comparison research process. The second tool will be a Cluster Exploration Tool. Using the results from data mining techniques that cluster like sequences together, genome researchers will be able to visualize the similarities among the sequences in the clusters. For example, the tool can be used for a cluster of new unknown sequences that were found similar to members of a group of known sequences. The new sequences can be positioned as nodes on the left in a bipartite graph, and the known sequences that they are similar to can be positioned along the right. Lines drawn between the nodes, colored differently based on the strength of the hits, will enable the researcher to visualize the connectedness of the sequences in the cluster. Details about each sequence and each similarity in the cluster can be obtained from the DBMS. This will enable genome researchers to study groups of orthologous or paralogous sequences. A key feature of these tools is that they will be 'thin' clients (often referred to as applets) that communicate with the underlying DBMS via queries formulated visually by the genome researchers. The use of Java-based components for these tools will enable them to be easily used and shared by the bioinformatics community and the genome research community. The development of these tools will demonstrate the feasibility of the thin-client approach that is the hallmark of the network computing architecture philosophy.
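
The two ideas in the abstract, a parameterized strength function that maps each similarity to a strong, marginal, or weak category, and a bipartite graph whose edges carry that category, can be illustrated with a small sketch. The abstract does not specify the actual function, its parameters, or any cutoffs, so everything below is an assumption made for illustration: the `Similarity` record, the weighted-sum formula in `strength`, the cutoff values in `categorize`, and the sequence IDs are all hypothetical. Python is used here for brevity, although the project itself proposed Java-based components.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Similarity:
    """One detected similarity (hit) between a new and a known sequence."""
    query: str      # hypothetical ID of the new, unknown sequence
    subject: str    # hypothetical ID of the known sequence
    score: float    # alignment score
    p_value: float  # statistical significance of the hit


def strength(hit, score_weight=1.0, significance_weight=1.0):
    """Illustrative strength function; the weights are the adjustable inputs."""
    # Clamp the p-value so that log10 is defined when p_value is 0.
    significance = -math.log10(max(hit.p_value, 1e-300))
    return score_weight * hit.score + significance_weight * significance


def categorize(value, strong_cutoff=200.0, marginal_cutoff=100.0):
    """Map a numeric strength onto higher-level categories (assumed cutoffs)."""
    if value >= strong_cutoff:
        return "strong"
    if value >= marginal_cutoff:
        return "marginal"
    return "weak"


def bipartite_edges(hits, **weights):
    """Return (new sequence, known sequence, category) edges for one cluster."""
    return [(h.query, h.subject, categorize(strength(h, **weights))) for h in hits]


# Made-up hits, used only to exercise the sketch.
hits = [
    Similarity("new1", "knownA", score=250.0, p_value=1e-40),
    Similarity("new1", "knownB", score=110.0, p_value=1e-5),
    Similarity("new2", "knownA", score=40.0, p_value=0.01),
]
edges = bipartite_edges(hits)
```

In a display, the category on each edge would select the line color in the bipartite graph (or the point color in the scatterplot), and calling `bipartite_edges(hits, score_weight=2.0)` re-categorizes the same hits under different input parameters, which is the kind of dynamic manipulation the abstract describes.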
