Persistence diagrams are widely used in data visualization to quantify the underlying features of filtered topological spaces. Many applications need distances between diagrams, but computing those distances is expensive. In this paper, we propose a persistence diagram hashing framework that learns a binary code representation of persistence diagrams, allowing distances to be computed quickly. The framework is built on a generative adversarial network (GAN) with a diagram distance loss function that steers the learning process. Instead of using standard representations, we hash diagrams into binary codes, which have natural advantages in large-scale tasks. Training is domain-oblivious: the model can be trained purely on synthetic, randomly created diagrams. As a consequence, the method applies directly to various datasets without retraining. When compared using the fast Hamming distance, these binary codes maintain topological similarity between datasets better than other vectorized representations. To evaluate the method, we apply our framework to diagram clustering and compare its quality and performance with the state of the art. We also demonstrate scalability on a dataset of 10,000 persistence diagrams, which is not possible with current techniques. Our experimental results show that the method is significantly faster and potentially uses less memory, while producing comparisons of comparable or better quality.
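To illustrate why binary codes make distance computation cheap, the following is a minimal sketch of pairwise Hamming distances over packed binary codes. It is not the framework's implementation: the codes are random placeholders rather than learned hashes, the 64-bit code length is an arbitrary assumption, and the helper names `pack_codes` and `hamming_matrix` are hypothetical.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value (0..255).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack an (n, k) array of 0/1 values into (n, ceil(k / 8)) bytes."""
    return np.packbits(bits.astype(np.uint8), axis=1)


def hamming_matrix(packed: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of Hamming distances between packed codes."""
    # XOR marks the bit positions where two codes differ.
    xor = packed[:, None, :] ^ packed[None, :, :]
    # Count the differing bits per byte, then sum over the bytes of each code.
    return POPCOUNT[xor].sum(axis=2, dtype=np.int64)


# Placeholder codes (assumption): 5 random 64-bit codes, not learned hashes.
rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(5, 64))
D = hamming_matrix(pack_codes(codes))
```

Each distance reduces to an XOR and a bit count over a few bytes per diagram, and a 64-bit code occupies 8 bytes once packed. The resulting matrix `D` could be passed to any clustering routine that accepts precomputed distances.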