We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a deep sparse autoencoder on a large dataset of images: the model has 1 billion connections, and the dataset has 10 million 200x200 pixel images downloaded from the Internet. We train this network using model parallelism and asynchronous SGD on a cluster of 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting from these learned features, we trained our network to recognize 22,000 object categories from ImageNet and achieved a 70% relative improvement over the previous state of the art.
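To make the term "sparse autoencoder" concrete, the following is a minimal single-layer sketch in plain Python. It is not the network described above: the tied weights, the sigmoid hidden units, the L1-style sparsity penalty, the toy data, and every name and hyperparameter (`train`, `lam`, `lr`, `TOY_DATA`) are assumptions chosen only for illustration. It trains on unlabeled vectors with ordinary single-process SGD, with no model parallelism or asynchrony.

```python
import math
import random

# Hypothetical toy data: two unlabeled binary "patterns", repeated.
TOY_DATA = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W, x):
    """Encode h = sigmoid(W x), then decode x_hat = W^T h (tied weights)."""
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W]
    x_hat = [sum(W[i][j] * h[i] for i in range(len(W))) for j in range(len(x))]
    return h, x_hat


def loss(W, x, lam):
    """Squared reconstruction error plus a sparsity penalty on activations."""
    h, x_hat = forward(W, x)
    recon = sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return recon + lam * sum(h)  # h >= 0, so sum(h) is its L1 norm


def mean_loss(W, data, lam):
    return sum(loss(W, x, lam) for x in data) / len(data)


def sgd_step(W, x, lam, lr):
    """One in-place SGD update of W on a single example x."""
    h, x_hat = forward(W, x)
    n_hidden, n_in = len(W), len(x)
    # Gradient of the loss with respect to the reconstruction.
    err = [2.0 * (a - b) for a, b in zip(x_hat, x)]
    # Gradient with respect to each hidden pre-activation
    # (decoder path plus sparsity term, through the sigmoid).
    dz = [
        (sum(err[j] * W[i][j] for j in range(n_in)) + lam) * h[i] * (1.0 - h[i])
        for i in range(n_hidden)
    ]
    # W appears in both the decoder and the encoder, so both terms add up.
    for i in range(n_hidden):
        for j in range(n_in):
            W[i][j] -= lr * (h[i] * err[j] + dz[i] * x[j])


def train(data, n_hidden=3, lam=0.01, lr=0.05, epochs=300, seed=0):
    """Train on unlabeled data; return (weights, initial loss, final loss)."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    initial = mean_loss(W, data, lam)
    for _ in range(epochs):
        for x in data:
            sgd_step(W, x, lam, lr)
    return W, initial, mean_loss(W, data, lam)


if __name__ == "__main__":
    W, initial, final = train(TOY_DATA)
    print(f"mean loss before training: {initial:.3f}, after: {final:.3f}")
```

No labels appear anywhere in the training loop: the only signal is how well each input is reconstructed, which is the sense in which the features are learned from unlabeled data.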
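A "relative improvement" is measured against the previous result rather than in absolute percentage points. The short sketch below shows the arithmetic. The accuracy values in it are hypothetical placeholders, not results reported here, and the function name `relative_improvement` is likewise an assumption.

```python
def relative_improvement(baseline, new):
    """Return the improvement of `new` over `baseline` as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    baseline_accuracy = 0.10
    new_accuracy = 0.17
    gain = relative_improvement(baseline_accuracy, new_accuracy)
    print(f"relative improvement: {gain:.0%}")
```

In this made-up case the absolute gain is 7 percentage points, while the relative gain is about 70%, so the two figures should not be confused.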