The need for data privacy and security, enforced through increasingly strict data protection regulations, makes it difficult to use healthcare data for machine learning. In particular, the transfer of data between different hospitals is often not permissible, so cross-site pooling of data is not an option. The Personal Health Train (PHT) concept proposed within the GO-FAIR initiative implements an 'algorithm to the data' paradigm that ensures that distributed data can be accessed for analysis without transferring any sensitive data. We present PHT-meDIC, an open-source implementation of the PHT concept that is deployed in production. Containerization allows us to easily deploy even complex data analysis pipelines (e.g., genomics, image analysis) across multiple sites in a secure and scalable manner. We discuss the underlying technological concepts, security models, and governance processes. The implementation has been successfully applied to distributed analyses of large-scale data, including applications of deep neural networks to medical image data.