Spectroscopic data, particularly diffraction data, contain detailed crystal and microstructure information and are therefore crucial for materials discovery. Powder X-ray diffraction (XRD) patterns are highly effective for identifying crystals. Although machine learning (ML) has significantly advanced the analysis of powder XRD patterns, progress is hindered by a lack of training data. To address this, we introduce SimXRD, the largest open-source dataset of simulated XRD patterns to date, to accelerate the development of crystallographic informatics. SimXRD comprises 4,065,346 simulated powder XRD patterns, representing 119,569 distinct crystal structures under 33 simulated conditions that mimic real-world variations. We find that crystal symmetry inherently follows a long-tailed distribution, and we evaluate 21 sequence learning models on SimXRD. The results indicate that existing neural networks struggle to classify crystals belonging to low-frequency classes. The present work highlights the academic significance and the engineering novelty of simulated XRD patterns in this interdisciplinary field.