Contemporary datasets can be immense and complex, so summarizing and extracting information frequently precedes any analysis. The summarizing techniques are many and varied, and are driven by the underlying scientific questions of interest. One type of resulting dataset contains so-called histogram-valued observations. While such datasets are becoming increasingly pervasive, methodologies for analysing them remain inadequate. One area of interest is cluster analysis. Unfortunately, to date, no readily computable dissimilarity, similarity, or distance measures exist for multivariate histogram-valued data. To redress that problem, the present article introduces several dissimilarity measures for histogram data. In particular, it introduces extensions of the Gowda-Diday and Ichino-Yaguchi measures for interval data, along with extensions of some DeCarvalho measures. In addition, a cumulative distribution measure is developed for histograms. These new measures are illustrated on the Fisher iris data and applied to a U.S. temperature dataset.