High-dimensional Data Analysis: Modeling Unobserved Heterogeneity in Data, and Studying Imbalanced Classification Problems

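Basic Information

  • Grant number:
    RGPIN-2020-05011
  • Principal investigator:
  • Amount:
    $17,500
  • Host institution:
  • Host institution country:
    Canada
  • Project category:
    Discovery Grants Program - Individual
  • Fiscal year:
    2021
  • Funding country:
    Canada
  • Start and end dates:
    2021-01-01 to 2022-12-31
  • Project status:
    Completed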

Project Abstract

Data science has become the center of attention in a wide range of scientific disciplines, thanks to ever-expanding means of data collection in today's world. The unprecedented size and structural complexity of current data in many applications call for computationally efficient and statistically sound methodologies for extracting useful information from such data. Toward this goal, my research program focuses on analyzing high-dimensional data. More specifically, over the five years of this proposal, my short-term objectives are:

I) Statistical modeling of heterogeneous high-dimensional data: In applications such as health sciences, engineering and environment, social sciences, and financial econometrics, high-dimensional data often arise from heterogeneous populations consisting of multiple hidden homogeneous sub-populations. Finite mixture of regressions (FMR) and Markov regime-switching autoregressive (MSAR) models provide flexible tools for capturing unobserved heterogeneity in data; the latter models are used for time series data. In practice, when fitting such models to a dataset, one faces three inferential problems: order selection (estimating the number of hidden sub-populations or regimes), variable selection, and so-called post-selection statistical inference, such as hypothesis testing or confidence intervals for the parameters of a data-driven selected model. Despite the wide application of these models, rigorous methodological developments addressing these problems in the growing literature on high-dimensional statistics have been very limited. In my short-term objectives, I will investigate new likelihood-based regularization techniques for order selection in FMR and MSAR, and for variable selection in sparse dynamic FMR and vector MSAR with fixed order and in high-dimensional settings. Establishing such results will pave the way toward post-selection inference problems, which are the subject of my long-term objectives.
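For readers unfamiliar with FMR models, the following is a generic textbook-style sketch of a Gaussian finite mixture of regressions and a penalized log-likelihood of the kind that "likelihood-based regularization" usually refers to. The notation (K, π_k, β_k, σ_k, p_λ) is an assumption made here for illustration and is not taken from the proposal, and the lasso penalty mentioned in the comments is only one common choice, not necessarily the one the project studies.

```latex
% Illustrative notation (assumed, not from the proposal):
%   K        number of hidden sub-populations (the "order" of the mixture)
%   \pi_k    mixing proportion of component k
%   \beta_k  regression coefficients of component k (p covariates)
%   \phi     normal density with the given mean and variance
%   p_\lambda  a sparsity penalty, e.g. the lasso p_\lambda(t) = \lambda t
\begin{align}
  f(y \mid x; \Psi)
    &= \sum_{k=1}^{K} \pi_k \,
       \phi\!\left(y;\, x^{\top}\beta_k,\, \sigma_k^{2}\right),
       \qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1, \\
  \tilde{\ell}_n(\Psi)
    &= \sum_{i=1}^{n} \log f(y_i \mid x_i; \Psi)
       \;-\; \sum_{k=1}^{K} \sum_{j=1}^{p} p_{\lambda}\!\left(|\beta_{kj}|\right).
\end{align}
```

Under this sketch, variable selection corresponds to some coefficients β_kj being estimated as exactly zero when the penalized log-likelihood is maximized, while order selection concerns the choice of K itself.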
II) High-dimensional imbalanced classification problems: In applications such as fraud detection, medical diagnosis, or equipment malfunction detection, classification tasks often suffer from both high dimensionality and imbalance in the observed frequency of some classes in the training data. The imbalance arises either from the data collection process or because some classes are indeed rare in the population. Because data are scarce in the minority class(es), conventional discriminative methods are often biased toward the majority class(es), resulting in much higher misclassification rates for the minority class(es). Imbalanced classification problems are generally hard, so I begin by studying imbalanced linear binary cases. I will investigate the utility of divide-and-conquer techniques coupled with hard-thresholding variable selection methods for bias correction in standard linear discriminant analysis toward the minority class in high dimensions. I will also study multi-class problems.
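The bias toward the majority class described above can be seen in a small simulation. The sketch below is only an illustration under assumed settings (Gaussian classes with identity covariance, 20 features of which 5 are informative, a 950/50 training split, all chosen here and not taken from the proposal). It fits plain two-class linear discriminant analysis with NumPy and compares per-class error rates; it does not implement the divide-and-conquer or hard-thresholding corrections the project proposes to study.

```python
import numpy as np


def fit_lda(X, y):
    """Fit two-class LDA with a pooled covariance and empirical class priors."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # Pooled within-class covariance estimate.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (n0 + n1 - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    # The log prior ratio shifts the decision boundary toward the minority class.
    b = -0.5 * w @ (mu0 + mu1) + np.log(n1 / n0)
    return w, b


def predict_lda(w, b, X):
    """Assign class 1 when the linear discriminant score is positive."""
    return (X @ w + b > 0).astype(int)


def simulate(n0, n1, p, shift, rng):
    """Draw n0 majority (class 0) and n1 minority (class 1) Gaussian samples."""
    X0 = rng.standard_normal((n0, p))
    X1 = rng.standard_normal((n1, p))
    X1[:, :5] += shift  # only the first 5 features separate the classes
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y


def per_class_error(y_true, y_pred):
    """Return (error on class 0, error on class 1)."""
    return tuple(float(np.mean(y_pred[y_true == k] != k)) for k in (0, 1))


rng = np.random.default_rng(0)
# Imbalanced training set: 95% majority, 5% minority (hypothetical settings).
X_train, y_train = simulate(n0=950, n1=50, p=20, shift=0.8, rng=rng)
# Balanced test set so that both per-class error rates are estimated well.
X_test, y_test = simulate(n0=5000, n1=5000, p=20, shift=0.8, rng=rng)

w, b = fit_lda(X_train, y_train)
err_majority, err_minority = per_class_error(y_test, predict_lda(w, b, X_test))
print(f"majority-class error: {err_majority:.3f}")
print(f"minority-class error: {err_minority:.3f}")
```

In this setup the minority-class error comes out far larger than the majority-class error, because the estimated prior pushes the decision boundary deep into the minority class's region. A bias correction of the kind mentioned in the abstract would aim to narrow that gap.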

Project Outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Khalili, Abbas

Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space
  • DOI:
    10.1093/biostatistics/kxq048
  • Publication date:
    2011-01-01
  • Journal:
  • Impact factor:
    2.1
  • Authors:
    Khalili, Abbas;Chen, Jiahua;Lin, Shili
  • Corresponding author:
    Lin, Shili
Disseminated Intravascular Coagulation Associated with Large Deletion of Immunoglobulin Heavy Chain
Autosomal Recessive Agammaglobulinemia: A Novel Non-sense Mutation in CD79a
  • DOI:
    10.1007/s10875-014-9989-3
  • Publication date:
    2014-02-01
  • Journal:
  • Impact factor:
    9.1
  • Authors:
    Khalili, Abbas;Plebani, Alessandro;Aghamohammadi, Asghar
  • Corresponding author:
    Aghamohammadi, Asghar
Order Selection in Finite Mixture Models With a Nonsmooth Penalty

Other grants held by Khalili, Abbas

High-dimensional Data Analysis: Modeling Unobserved Heterogeneity in Data, and Studying Imbalanced Classification Problems
  • Grant number:
    RGPIN-2020-05011
  • Fiscal year:
    2022
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
High-dimensional Data Analysis: Modeling Unobserved Heterogeneity in Data, and Studying Imbalanced Classification Problems
  • Grant number:
    RGPIN-2020-05011
  • Fiscal year:
    2020
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
  • Grant number:
    RGPIN-2015-03805
  • Fiscal year:
    2019
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
  • Grant number:
    RGPIN-2015-03805
  • Fiscal year:
    2018
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
  • Grant number:
    RGPIN-2015-03805
  • Fiscal year:
    2017
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
  • Grant number:
    RGPIN-2015-03805
  • Fiscal year:
    2016
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
  • Grant number:
    RGPIN-2015-03805
  • Fiscal year:
    2015
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Model selection and statistical inference in mixture distributions and hidden Markov (regression) models
  • Grant number:
    386578-2010
  • Fiscal year:
    2014
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Model selection and statistical inference in mixture distributions and hidden Markov (regression) models
  • Grant number:
    386578-2010
  • Fiscal year:
    2013
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual
Model selection and statistical inference in mixture distributions and hidden Markov (regression) models
  • Grant number:
    386578-2010
  • Fiscal year:
    2012
  • Funding amount:
    $17,500
  • Project category:
    Discovery Grants Program - Individual

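Similar National Natural Science Foundation of China (NSFC) grants

Large-scale development of amplicon capture probes for metazoan taxa based on automated analysis of genomic data
  • Grant number:
    32370477
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project category:
    General Program
Design and data analysis of structured experiments
  • Grant number:
    12371259
  • Approval year:
    2023
  • Funding amount:
    CNY 435,000
  • Project category:
    General Program
Landslide evolution mechanisms in interbedded soft and hard strata based on multi-source survey data fusion and probabilistic analysis
  • Grant number:
    42307257
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project category:
    Young Scientists Fund
Hybrid knowledge- and data-driven uncertainty analysis and optimization methods for lattice structures with defects
  • Grant number:
    12302149
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project category:
    Young Scientists Fund
Key methods for complex omics data analysis based on interpretable deep learning
  • Grant number:
    62373200
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project category:
    General Program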

Similar overseas grants

I-Corps: Vision analysis system using inferred three-dimensional data to analyze and correct a user’s pose in relation to 3D space
  • Grant number:
    2403992
  • Fiscal year:
    2024
  • Funding amount:
    $17,500
  • Project category:
    Standard Grant
Oral pathogen - mediated pro-tumorigenic transformation through disruption of an Adherens Junction - associated RNAi machinery
  • Grant number:
    10752248
  • Fiscal year:
    2024
  • Funding amount:
    $17,500
  • Project category:
Robust Three-Dimensional Pattern Recognition based on Object Oriented Data Analysis
  • Grant number:
    23K16900
  • Fiscal year:
    2023
  • Funding amount:
    $17,500
  • Project category:
    Grant-in-Aid for Early-Career Scientists
Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
  • Grant number:
    10462257
  • Fiscal year:
    2023
  • Funding amount:
    $17,500
  • Project category:
Core D: Integrated Computational Analysis Core
  • Grant number:
    10555896
  • Fiscal year:
    2023
  • Funding amount:
    $17,500
  • Project category: