Cluster expansion (CE) is a powerful theoretical tool for studying the configuration-dependent properties of substitutionally disordered systems. Typically, a CE model is built by fitting a few tens or hundreds of target quantities calculated with first-principles approaches. To validate the reliability of the model, a convergence test of the cross-validation (CV) score with respect to the training set size is commonly conducted to verify the sufficiency of the training data. However, such a test only confirms the convergence of the predictive capability of the CE model within the training set; whether convergence of the CV score implies robust thermodynamic simulation results, such as the order-disorder phase-transition temperature Tc, remains unknown. In this work, using carbon-defective MoC1-x as a model system and aided by the machine-learning force field technique, a training data pool of about 13000 configurations is efficiently obtained, from which different training sets of the same size are drawn at random. By conducting parallel Monte Carlo simulations with CE models trained on these randomly selected training sets, the uncertainty in the calculated Tc is evaluated at different training set sizes. It is found that a training set size sufficient for the CV score to converge still leads to a significant uncertainty in the predicted Tc, and that this uncertainty can be considerably reduced by enlarging the training set to a few thousand configurations. This work highlights the importance of using a large training set to build an optimal CE model that yields robust statistical modeling results, and the facility provided by the machine-learning force field approach for efficiently producing adequate training data.
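The uncertainty-quantification procedure described above can be illustrated with a minimal, self-contained sketch. A CE model is linear in the cluster correlation functions (energy = correlations times effective cluster interactions), so the sketch below fits least-squares models to synthetic data, computes a k-fold CV score, and measures the spread of a model-derived scalar across many randomly drawn training sets of the same size. All data here are synthetic stand-ins: the matrix `X`, the "true" interactions `J_true`, and the downstream observable are hypothetical, not the actual MoC1-x correlation functions or Tc.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a CE fitting problem: E = X @ J + noise,
# where rows of X play the role of cluster correlation functions
# and J the effective cluster interactions (ECIs).
n_pool, n_clusters = 2000, 12
X = rng.normal(size=(n_pool, n_clusters))
J_true = rng.normal(size=n_clusters)
E = X @ J_true + 0.05 * rng.normal(size=n_pool)  # mimic DFT noise

def cv_score(Xs, ys, k=10):
    """k-fold cross-validation RMSE of a least-squares CE fit."""
    idx = np.arange(len(ys))
    folds = np.array_split(rng.permutation(idx), k)
    mse = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        J, *_ = np.linalg.lstsq(Xs[train], ys[train], rcond=None)
        mse.append(np.mean((Xs[fold] @ J - ys[fold]) ** 2))
    return float(np.sqrt(np.mean(mse)))

def prediction_spread(size, n_models=20):
    """Std. dev. of a model-derived scalar across models fitted to
    different random training sets of the given size. The scalar
    J @ J_true is a toy stand-in for a downstream observable such
    as a Monte Carlo transition temperature."""
    preds = []
    for _ in range(n_models):
        sub = rng.choice(n_pool, size=size, replace=False)
        J, *_ = np.linalg.lstsq(X[sub], E[sub], rcond=None)
        preds.append(J @ J_true)
    return float(np.std(preds))

for size in (50, 200, 1000):
    sub = rng.choice(n_pool, size=size, replace=False)
    print(f"N={size:4d}  CV={cv_score(X[sub], E[sub]):.4f}  "
          f"spread={prediction_spread(size):.4f}")
```

In this toy setting the CV score quickly saturates near the noise floor, while the spread of the derived observable keeps shrinking as the training set grows, mirroring the paper's observation that CV convergence alone does not guarantee a converged Tc.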