Interconnect networks are the foundation of modern high-performance computing (HPC) systems. Parallel discrete event simulation (PDES) is a cornerstone of the study of large-scale networking systems because it models and simulates the real-world behavior of HPC facilities, but its computational cost is growing at an unsustainable rate. The research community is therefore interested in building a surrogate-ready PDES framework in which an accurate surrogate model forecasts HPC behavior and replaces computationally expensive PDES phases. In this paper, we focus on forecasting application iteration times, the key indicator of large-scale networking performance, using network features such as the bandwidth consumed and the busy time on routers. We apply five representative methods (LAST, Average, ARIMA, LSTM, and our proposed framework, LSTM-Feat) to forecast the iteration times of MILC, an exemplar application, running on a dragonfly system. By incorporating network features, LSTM-Feat can capture dependencies between network features and iteration times, which facilitates forecasting. The experiments demonstrate the effectiveness of incorporating network features into surrogate models and the potential of surrogate models to accelerate PDES.
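To make the two simplest baselines concrete, the following is a minimal Python sketch of what LAST and Average forecasters might look like. The abstract does not define these methods, so the definitions here are assumptions: LAST is taken to repeat the most recently observed iteration time, and Average is taken to repeat the mean of the observed history (optionally over a trailing window). The function names, the `window` parameter, and the error metric are all hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import List, Optional, Sequence


def forecast_last(history: Sequence[float], horizon: int) -> List[float]:
    """Assumed LAST baseline: repeat the most recent observed iteration time."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon


def forecast_average(
    history: Sequence[float], horizon: int, window: Optional[int] = None
) -> List[float]:
    """Assumed Average baseline: repeat the mean of the observed history.

    If `window` is given (a hypothetical parameter), only the last `window`
    observations are averaged.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    recent = history[-window:] if window else history
    return [mean(recent)] * horizon


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between observed and forecast iteration times."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the paper.
    observed = [1.0, 2.0, 3.0]
    print(forecast_last(observed, horizon=2))
    print(forecast_average(observed, horizon=2))
```

Baselines of this kind use only the iteration-time history. A feature-aware model such as LSTM-Feat would additionally take per-iteration network features (for example, consumed bandwidth and router busy time) as inputs.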