This paper addresses the problem of jointly detecting and recounting abnormal events in videos. Recounting abnormal events, i.e., explaining why they are judged to be abnormal, is an unexplored but critical task in video surveillance, because it helps human observers quickly judge whether detections are false alarms. Describing events in a human-understandable form for recounting requires learning generic knowledge about visual concepts (e.g., objects and actions). Although convolutional neural networks (CNNs) have achieved promising results in learning such concepts, how to use CNNs effectively for abnormal event detection remains an open question, mainly due to the environment-dependent nature of anomaly detection. In this paper, we tackle this problem by integrating a generic CNN model with environment-dependent anomaly detectors. Our approach first trains a CNN on multiple visual tasks to capture semantic information useful for detecting and recounting abnormal events. By appropriately plugging this model into the anomaly detectors, we can detect and recount abnormal events while exploiting the discriminative power of CNNs. Our approach outperforms the state of the art on the Avenue and UCSD Ped2 benchmarks for abnormal event detection and also produces promising results for abnormal event recounting.
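The division of labor the abstract describes, a generic (environment-independent) feature extractor feeding environment-dependent anomaly detectors, can be sketched in a few lines. The sketch below is purely illustrative and is not the paper's implementation: a fixed random projection stands in for a pretrained multi-task CNN, and a nearest-neighbor distance to features of normal training frames stands in for whatever detector a given environment uses; all names (`extract_features`, `NearestNeighborDetector`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a frozen, pretrained multi-task CNN: a fixed linear map
# from raw frame vectors (64-dim) to semantic feature vectors (16-dim).
W = rng.normal(size=(64, 16))

def extract_features(frames):
    """Map raw frames of shape (n, 64) to semantic features (n, 16)."""
    return frames @ W

class NearestNeighborDetector:
    """Environment-dependent detector, fit on normal frames only."""
    def fit(self, feats):
        self.bank = feats          # features of normal training frames
        return self

    def score(self, feats):
        # Anomaly score = distance to the closest normal feature vector.
        d = np.linalg.norm(feats[:, None, :] - self.bank[None, :, :], axis=-1)
        return d.min(axis=1)

# Fit on normal frames from one environment, then score new frames.
normal = rng.normal(size=(200, 64))
det = NearestNeighborDetector().fit(extract_features(normal))

test_normal = rng.normal(size=(10, 64))
test_anom = rng.normal(loc=5.0, size=(10, 64))   # shifted -> anomalous
assert det.score(extract_features(test_anom)).mean() > \
       det.score(extract_features(test_normal)).mean()
```

Because only the detector is refit per environment, the same feature extractor can also be queried for the most responsive visual-concept task, which is what makes recounting possible.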