The integrity of sample labels strongly affects classification accuracy in supervised learning. In real data, however, factors such as randomness in the labeling process and annotators' lack of expertise inevitably contaminate labels with noise, meaning that a sample's observed label differs from its true label. To reduce the negative impact of noisy labels on classifier accuracy, this paper proposes a noisy label correction method. The method uses a base classifier to classify the observed samples and to estimate the noise rate, which together identify the samples with noisy labels. It then relabels those samples with the base classifier's predictions, yielding a data set in which the noisy labels have been corrected. Experiments on synthetic and real data sets show that this relabeling algorithm improves classification results across different base classifiers and different noise rates. On the synthetic data sets, accuracy improves by about 5% over the algorithm without noise reduction. On the CIFAR and MNIST data sets in the high noise rate setting, the F1 value of the relabeling algorithm is on average more than 7% higher than that of Elk08 and Nat13, and 53% higher than that of the noise-free algorithm.
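The relabeling pipeline described above (classify with a base classifier, estimate the noise rate, flag suspect samples, relabel them) can be illustrated with a minimal sketch. This is not the paper's implementation: the nearest-centroid base classifier, the use of the disagreement fraction as the noise rate estimate, the distance-margin ranking, and all function names (`fit_centroids`, `relabel`, `make_noisy_blobs`, and so on) are assumptions made for illustration only.

```python
import math
import random


def fit_centroids(X, y):
    """Assumed base classifier: one mean vector per observed class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))


def relabel(X, y_observed, noise_rate=None):
    """Flag suspected noisy labels and replace them with predicted labels.

    If noise_rate is None, it is estimated as the fraction of samples whose
    prediction disagrees with the observed label (an assumption of this sketch).
    """
    centroids = fit_centroids(X, y_observed)
    y_pred = [predict(centroids, x) for x in X]
    disagree = [i for i, (p, o) in enumerate(zip(y_pred, y_observed)) if p != o]
    if noise_rate is None:
        noise_rate = len(disagree) / len(X)

    def margin(i):
        # How much closer the sample is to the predicted class than to its observed class.
        return (math.dist(X[i], centroids[y_observed[i]])
                - math.dist(X[i], centroids[y_pred[i]]))

    # Relabel at most round(noise_rate * n) samples, most suspicious first.
    budget = round(noise_rate * len(X))
    flagged = sorted(disagree, key=margin, reverse=True)[:budget]
    y_corrected = list(y_observed)
    for i in flagged:
        y_corrected[i] = y_pred[i]
    return y_corrected, noise_rate, flagged


def make_noisy_blobs(n_per_class=100, flip_fraction=0.2, seed=0):
    """Hypothetical toy data: two 2-D Gaussian blobs with a fraction of labels flipped."""
    rng = random.Random(seed)
    X, y_true = [], []
    for label, center in ((0, -3.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y_true.append(label)
    y_noisy = list(y_true)
    for i in rng.sample(range(len(X)), int(flip_fraction * len(X))):
        y_noisy[i] = 1 - y_noisy[i]
    return X, y_true, y_noisy


def accuracy(y_a, y_b):
    """Fraction of positions where the two label lists agree."""
    return sum(a == b for a, b in zip(y_a, y_b)) / len(y_a)


X, y_true, y_noisy = make_noisy_blobs()
y_corrected, estimated_rate, flagged = relabel(X, y_noisy)
print("estimated noise rate:", estimated_rate)
print("label accuracy before:", accuracy(y_noisy, y_true))
print("label accuracy after: ", accuracy(y_corrected, y_true))
```

On well-separated toy data like this, most flipped labels are restored, because the class centroids stay on the correct side of the decision boundary even when they are fitted to noisy labels. With heavily overlapping classes or very high noise rates the base classifier itself degrades, which is why the choice of base classifier and the noise rate estimate matter in the experiments reported above.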