Clustering methods for classical data are well established, though the associated algorithms focus primarily on partitioning methods and agglomerative hierarchical methods. With the advent of massive data sets that are too large to be analyzed by traditional techniques, new paradigms are needed. Symbolic data methods offer one solution to this problem. While symbolic data can be important and arise naturally in their own right, they are particularly relevant for data that emerge from the aggregation of larger data sets. One such format arises when the observations are histogram-valued in ℝ^p, rather than points in ℝ^p as in classical data. This paper considers the problem of constructing hierarchies using a divisive polythetic algorithm based on dissimilarity measures derived for histogram observations. WIREs Comput Stat 2017, 9:e1405. doi: 10.1002/wics.1405
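To illustrate how histogram-valued observations can emerge from aggregation, the following minimal sketch groups classical points (here with p = 1) by a label and summarizes each group as a vector of relative frequencies over shared bins. The function names `to_histogram` and `aggregate`, the sample records, and the bin edges are hypothetical and chosen only for illustration; the sketch does not reproduce the dissimilarity measures or the divisive algorithm of the paper.

```python
from collections import defaultdict


def to_histogram(values, edges):
    """Return relative frequencies of `values` over the bins defined by `edges`.

    Bins are half-open [edges[k], edges[k+1]), except the last, which is closed.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v <= edges[-1]:
            raise ValueError(f"value {v} lies outside the bin range")
        # Default to the last bin so that the right-most edge is included.
        k = len(counts) - 1
        for i in range(len(counts)):
            if v < edges[i + 1]:
                k = i
                break
        counts[k] += 1
    total = len(values)
    return [c / total for c in counts]


def aggregate(records, edges):
    """Group (key, value) records by key and summarize each group as a histogram."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: to_histogram(vals, edges) for key, vals in groups.items()}


# Hypothetical classical data: (group label, measurement) pairs.
records = [("A", 1.0), ("A", 2.5), ("A", 3.5), ("A", 0.5),
           ("B", 3.0), ("B", 3.9)]
edges = [0.0, 2.0, 4.0]  # assumed common bin edges for all groups

histograms = aggregate(records, edges)
for key, hist in histograms.items():
    print(key, hist)
```

Each group is now a single histogram-valued observation rather than a cloud of points, which is the kind of input a clustering method for symbolic data would operate on. Using common bin edges across groups is a simplifying assumption of this sketch; histogram observations need not share the same subintervals in general.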