This notebook paper presents an overview and comparative analysis of our system designed for the untrimmed video classification task in the ActivityNet Challenge 2016. We investigate and exploit multiple spatio-temporal clues, i.e., frames, motion (optical flow), and short video clips, using 2D or 3D convolutional neural networks (CNNs). The mechanisms of different quantization methods are studied as well. Furthermore, improved dense trajectories with Fisher vector encoding on long video clips and MFCC audio features are utilized. All activities are classified by late fusing the predictions of one-versus-rest linear SVMs learned on each clue. Finally, OCR is employed to refine the prediction scores.
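As a rough illustration of the late-fusion step, the sketch below trains a one-versus-rest linear SVM on the pre-extracted features of each clue and combines their decision scores by a weighted sum. The feature matrices, dimensions, clue names, and fusion weights are placeholders for illustration only, not the authors' actual pipeline.

```python
# A minimal sketch of late fusion over per-clue one-vs-rest linear SVMs,
# assuming features (e.g., frame CNN, flow CNN, iDT+FV) are already extracted.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_train, n_test, n_classes = 200, 50, 5

# Hypothetical pre-extracted features: one (train, test) matrix pair per clue.
clues = {
    "frame_cnn": (rng.normal(size=(n_train, 128)), rng.normal(size=(n_test, 128))),
    "flow_cnn":  (rng.normal(size=(n_train, 128)), rng.normal(size=(n_test, 128))),
    "idt_fv":    (rng.normal(size=(n_train, 256)), rng.normal(size=(n_test, 256))),
}
y_train = rng.integers(0, n_classes, size=n_train)

# Illustrative fusion weights; in practice these would be tuned on validation data.
weights = {"frame_cnn": 1.0, "flow_cnn": 1.0, "idt_fv": 0.5}

fused = np.zeros((n_test, n_classes))
for name, (X_tr, X_te) in clues.items():
    # LinearSVC trains one-versus-rest linear SVMs for multiclass labels by default.
    clf = LinearSVC(C=1.0).fit(X_tr, y_train)
    # Late fusion: weighted sum of per-clue decision scores.
    fused += weights[name] * clf.decision_function(X_te)

predictions = fused.argmax(axis=1)
```

Summing decision scores rather than hard labels preserves each clue's confidence, so a clue that is highly certain can outvote several weakly confident ones.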