When dealing with highly imbalanced data, the cost-sensitive random forest algorithm suffers from insufficient learning of minority-class samples caused by bootstrap sampling, while the large proportion of majority-class samples easily weakens the cost-sensitive mechanism. In this paper, the majority-class samples are first clustered, and each cluster is then down-sampled multiple times according to a weak balance criterion; the selected majority-class samples are merged with the minority-class samples of the original training set to generate multiple new, still imbalanced data sets for training cost-sensitive decision trees. A clustering-based weakly balanced cost-sensitive random forest algorithm is thereby proposed, which not only allows the minority-class samples to be learned sufficiently, but also, by reducing the number of majority-class samples, keeps the cost-sensitive mechanism largely unaffected. Experiments show that the proposed algorithm performs better on highly imbalanced data sets.
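To make the procedure summarised above more concrete, the following Python sketch shows one possible realisation: the majority class is clustered with k-means, each cluster is down-sampled so that the drawn majority subset is still larger than the minority class (a stand-in for the weak balance criterion), and each resulting data set trains a cost-sensitive decision tree via class weights. The function names, the weak-balance ratio, the cluster count, and the cost weights are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a clustering-based, weakly balanced, cost-sensitive forest.
# The weak-balance ratio, cluster count and cost weights are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def train_weakly_balanced_cs_forest(X, y, n_trees=10, n_clusters=5,
                                    weak_balance_ratio=3.0, cost_ratio=5.0,
                                    random_state=0):
    rng = np.random.default_rng(random_state)
    maj, mino = 0, 1                         # assume label 1 is the minority class
    X_min, X_maj = X[y == mino], X[y == maj]

    # Cluster the majority class once; every tree draws samples from each cluster.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X_maj)

    # "Weak balance": keep more majority than minority samples per data set.
    n_keep = int(weak_balance_ratio * len(X_min))

    forest = []
    for _ in range(n_trees):
        # Down-sample each cluster proportionally to its size (assumed criterion).
        picked = []
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            take = max(1, int(round(n_keep * len(idx) / len(X_maj))))
            picked.append(rng.choice(idx, size=min(take, len(idx)), replace=False))
        X_sub = np.vstack([X_maj[np.concatenate(picked)], X_min])
        y_sub = np.concatenate([np.full(len(X_sub) - len(X_min), maj),
                                np.full(len(X_min), mino)])

        # Cost-sensitive decision tree: misclassifying the minority class costs more.
        tree = DecisionTreeClassifier(class_weight={maj: 1.0, mino: cost_ratio},
                                      random_state=random_state)
        forest.append(tree.fit(X_sub, y_sub))
    return forest

def predict_majority_vote(forest, X):
    # Aggregate the ensemble by simple majority voting over binary predictions.
    votes = np.stack([tree.predict(X) for tree in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Because each tree sees a different weakly balanced subset of the majority class, the minority samples appear in every training set, while the reduced majority counts leave room for the class-weight cost mechanism to take effect.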