Scientific applications continue to grow and produce extremely large amounts of data, which require efficient compression algorithms for long-term storage. Compression errors can have a deleterious impact on downstream processing, so it is crucial to preserve all the “known” Quantities of Interest (QoI) during compression. Most existing approaches bound the reconstruction error of the original data, or primary data (PD), but provide no direct control over the error in the QoI derived from it. In this work, we propose a physics-informed compression technique composed of two parts: (i) reduction of the PD with bounded errors and (ii) preservation of the QoI. In the first step, we combine tensor decompositions, autoencoders, product quantizers, and error-bounded lossy compressors to bound the reconstruction error at high levels of compression. In the second step, we use constraint satisfaction post-processing followed by quantization to preserve the QoI. To illustrate the challenges of reducing the reconstruction errors of both the PD and the QoI, we focus on simulation data generated by a large-scale fusion code, XGC, which can produce tens of petabytes in a single day. The results show that our approach achieves high compression ratios while preserving the QoI within scientifically acceptable bounds.
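To make the two-part structure concrete, the following is a minimal sketch and not the pipeline described above. It assumes a 1D distribution sampled on a velocity grid, replaces the tensor decomposition, autoencoder, and product quantizer stage with plain uniform scalar quantization, and takes the QoI to be two linear moments (the sum and the velocity-weighted sum). The function names `quantize`, `dequantize`, `moments`, and `restore_moments` are hypothetical and chosen only for illustration.

```python
import math


def quantize(values, error_bound):
    """Uniform scalar quantization with bin width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct values; each lies within error_bound of the original."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def moments(f, v):
    """Two linear QoI (assumed for this sketch): sum(f) and sum(v * f)."""
    return sum(f), sum(vi * fi for vi, fi in zip(v, f))


def restore_moments(f_hat, v, target):
    """Minimum-norm correction of f_hat so that its moments match target.

    The correction has the form delta_i = a + b * v_i, where (a, b) solve
    the 2x2 linear system given by the two moment constraints.
    """
    n = len(f_hat)
    m0, m1 = moments(f_hat, v)
    r0, r1 = target[0] - m0, target[1] - m1  # moment residuals
    s1 = sum(v)
    s2 = sum(vi * vi for vi in v)
    det = n * s2 - s1 * s1  # nonzero when the grid has distinct points
    a = (r0 * s2 - r1 * s1) / det
    b = (n * r1 - s1 * r0) / det
    return [fi + a + b * vi for fi, vi in zip(f_hat, v)]


if __name__ == "__main__":
    # Assumed test data: a Maxwellian-like profile on a uniform grid.
    v = [-3.0 + 0.1 * i for i in range(61)]
    f = [math.exp(-0.5 * x * x) for x in v]
    eb = 1e-2

    f_hat = dequantize(quantize(f, eb), eb)      # step (i): bounded PD error
    target = moments(f, v)                        # QoI stored with the codes
    f_fix = restore_moments(f_hat, v, target)     # step (ii): QoI correction

    print("max PD error before correction:",
          max(abs(x - y) for x, y in zip(f, f_hat)))
    print("QoI error before correction:",
          [abs(x - y) for x, y in zip(moments(f_hat, v), target)])
    print("QoI error after correction:",
          [abs(x - y) for x, y in zip(moments(f_fix, v), target)])
```

In this sketch the correction restores the two moments up to floating-point round-off, but it also shifts individual values, so the pointwise bound on the PD would need to be re-checked after the post-processing step. Nonlinear QoI would require an iterative constraint solver rather than the closed-form 2x2 solve used here.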