Learning from incomplete data by combining physiological knowledge and machine learning

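Basic information

Grant number:
562032-2021
Principal investigator:
Layton, AnitaAT
Amount:
$14,600
Host institution:
University of Waterloo
Host institution country:
Canada
Program:
Alliance Grants
Fiscal year:
2022
Funding country:
Canada
Project period:
2022-01-01 to 2023-12-31
Project status:
Completed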

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=743653
Keywords:
Learning incomplete data combining physiological

Project abstract

Data analytics is playing an increasingly critical role in our decision making. However, data-based prediction is often challenged by missing data. Machine learning is well suited to analyzing massive datasets, but its accuracy can be hampered by poor data quality. This problem is particularly critical in health and environmental records, which are notoriously incomplete and erroneous. Errors in the data fed to a machine learning model may yield costly or even lethal mistakes in its predictions. To address this gap, this project seeks to develop a method that maximizes the amount of information that can be gleaned from a physiological dataset with missing data. To accomplish that goal, we will first develop an innovative data imputation method that is based on known physiology. Specifically, the method will be based on HumMod, a state-of-the-art computational model of human physiology. HumMod is presently formulated for a middle-aged man only; hence, in Objective 1 we will extend HumMod to take into account sex and age, by developing instantiations of the model for a middle-aged woman, an older man, and an older woman. In Objective 2, we will apply the sex- and age-specific HumMod models to clean data provided by our partners, and then apply machine learning analysis to predict individual health status. The accuracy of the prediction will be compared with analogous predictions made by the same machine learning model applied to datasets cleaned by other data imputation methods. We expect our physiologically based data imputation method to outperform alternative methods, many of which can introduce bias or are sensitive to outliers. The impact of this project will be greatly strengthened by the participation of industry partner AstraZeneca Canada, which will provide a cash contribution, and of non-profit Diabetes Action Canada, which will provide access to a large database.
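The idea of imputing missing values from known physiology, and then comparing against a generic imputation method, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: it does not use HumMod or any partner data, the records are synthetic, and the only "physiological knowledge" is the textbook approximation of mean arterial pressure, MAP = DBP + (SBP - DBP) / 3. It also compares imputation error directly, which is a simplification of the downstream prediction accuracy that the project proposes to compare. All function names are hypothetical.

```python
import math
import random
import statistics


def simulate_records(n, seed=0):
    """Generate synthetic blood-pressure records (illustrative values, not real data)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        dbp = rng.gauss(80, 10)
        sbp = dbp + abs(rng.gauss(40, 10))
        # Textbook approximation of mean arterial pressure, plus measurement noise.
        map_value = dbp + (sbp - dbp) / 3 + rng.gauss(0, 1)
        records.append({"sbp": sbp, "dbp": dbp, "map": map_value})
    return records


def mask_map(records, fraction, seed=1):
    """Copy the records, hide roughly `fraction` of MAP values, and keep the hidden truth."""
    rng = random.Random(seed)
    masked, truth = [], {}
    for i, rec in enumerate(records):
        rec = dict(rec)
        if rng.random() < fraction:
            truth[i] = rec["map"]
            rec["map"] = None
        masked.append(rec)
    return masked, truth


def impute_mean(records):
    """Generic baseline: fill every missing MAP with the mean of the observed MAPs."""
    observed = [r["map"] for r in records if r["map"] is not None]
    fill = statistics.fmean(observed)
    return [dict(r, map=fill) if r["map"] is None else dict(r) for r in records]


def impute_physiology(records):
    """Knowledge-based stand-in: fill missing MAP from the same record's SBP and DBP."""
    return [
        dict(r, map=r["dbp"] + (r["sbp"] - r["dbp"]) / 3) if r["map"] is None else dict(r)
        for r in records
    ]


def rmse(imputed, truth):
    """Root-mean-square error of the imputed MAP values against the hidden truth."""
    errors = [(imputed[i]["map"] - value) ** 2 for i, value in truth.items()]
    return math.sqrt(statistics.fmean(errors))


records = simulate_records(500)
masked, truth = mask_map(records, fraction=0.3)
rmse_mean = rmse(impute_mean(masked), truth)
rmse_phys = rmse(impute_physiology(masked), truth)
print(f"mean imputation RMSE:       {rmse_mean:.2f} mmHg")
print(f"physiology imputation RMSE: {rmse_phys:.2f} mmHg")
```

In this sketch the knowledge-based fill uses the other measurements in the same record, whereas the mean fill ignores them. A full physiological model such as HumMod would play a far richer version of that role.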
