
Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space

Basic information

DOI:
10.1093/biostatistics/kxq048
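Publication date:
2011-01-01
Journal:
Impact factor:
2.1
Corresponding author:
Lin, Shili
CAS journal ranking:
Mathematics, Zone 2
Document type:
Article
Authors: Khalili, Abbas; Chen, Jiahua; Lin, Shili
Research areas: --
MeSH terms: --
Keywords: --
Source link: PubMed record page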

Abstract

Rapid advancement in modern technology has allowed scientists to collect data of unprecedented size and complexity. This is particularly the case in genomics applications. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a small subset of a large number of features based on relatively small sample sizes, which may even be coming from multiple subpopulations. As such, selecting the correct predictive features (variables) for each subpopulation is the key. To address this issue, we consider the problem of feature selection in finite mixture of sparse normal linear (FMSL) models in large feature spaces. We propose a 2-stage procedure to overcome computational difficulties and large false discovery rates caused by the large model space. First, to deal with the curse of dimensionality, a likelihood-based boosting is designed to effectively reduce the number of candidate features. This is the key thrust of our new method. The greatly reduced set of features is then subjected to a sparsity inducing procedure via a penalized likelihood method. A novel scheme is also proposed for the difficult problem of finding good starting points for the expectation-maximization estimation of mixture parameters. We use an extended Bayesian information criterion to determine the final FMSL model. Simulation results indicate that the procedure is successful in selecting the significant features without including a large number of insignificant ones. A real data example on gene transcription regulation is also presented.
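The abstract mentions expectation-maximization estimation of mixture parameters. As a rough illustration of that one ingredient only, the sketch below fits a plain finite mixture of normal linear regressions by EM. It is not the authors' procedure: the likelihood-based boosting screen, the sparsity-inducing penalty, the extended Bayesian information criterion, and the proposed starting-point scheme are all omitted. The function name `em_mixture_regression`, the random soft initialization, and the small ridge term are assumptions made for this example.

```python
import numpy as np

def em_mixture_regression(X, y, n_components=2, n_iter=200, seed=0):
    """Unpenalized EM for a finite mixture of normal linear regressions.

    Hypothetical helper for illustration; no feature screening or penalty.
    Returns mixing proportions, coefficients, variances, and log-likelihood.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Assumption: random soft memberships as the starting point
    # (the paper's own initialization scheme is not reproduced here).
    resp = rng.dirichlet(np.ones(n_components), size=n)
    for _ in range(n_iter):
        # M-step: weighted least squares within each component.
        pis = resp.mean(axis=0)
        betas = np.empty((n_components, p))
        sigma2 = np.empty(n_components)
        for k in range(n_components):
            w = resp[:, k]
            Xw = X * w[:, None]
            # Tiny ridge term only to keep the linear solve stable.
            betas[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            resid_k = y - X @ betas[k]
            sigma2[k] = max((w * resid_k**2).sum() / (w.sum() + 1e-12), 1e-8)
        # E-step: posterior membership probabilities, computed in log space.
        resid = y[:, None] - X @ betas.T  # shape (n, n_components)
        log_dens = (np.log(np.maximum(pis, 1e-300))
                    - 0.5 * np.log(2 * np.pi * sigma2)
                    - 0.5 * resid**2 / sigma2)
        log_norm = np.logaddexp.reduce(log_dens, axis=1)
        resp = np.exp(log_dens - log_norm[:, None])
    return pis, betas, sigma2, log_norm.sum()

# Usage on synthetic data from two subpopulations with opposite slopes.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
labels = rng.integers(0, 2, size=200)
y = np.where(labels == 0, 3.0 * x, -3.0 * x) + 0.3 * rng.normal(size=200)
pis, betas, sigma2, loglik = em_mixture_regression(X, y)
```

In a high-dimensional setting such as the one the abstract describes, an unpenalized fit like this would not be usable directly, which is why the abstract's procedure first reduces the candidate features and then applies a penalized likelihood.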
