Anomaly Locality in Video Surveillance

Federico Landi             Cees G. M. Snoek             Rita Cucchiara


This paper addresses the detection of real-world anomalies, such as burglaries and assaults, in surveillance videos. Although anomalies are generally local, as they happen in a limited portion of the frame, no previous work on the subject has studied the contribution of locality. In this work, we explore the impact of considering spatiotemporal tubes instead of whole-frame video segments. For this purpose, we enrich existing surveillance videos with spatial and temporal annotations: the result is the first dataset for anomaly detection with bounding box supervision in both its training and test sets.

Our experiments show that a network trained with spatiotemporal tubes performs better than the analogous model trained with whole-frame videos. In addition, we find that locality is robust to different kinds of errors in the tube extraction phase at test time. Finally, we demonstrate that our network can provide spatiotemporal proposals for unseen surveillance videos leveraging only video-level labels. By doing so, we enlarge our spatiotemporal anomaly dataset without the need for further human labeling.


Given an input clip, we aim to determine whether the observed scene is normal or anomalous. Equivalently, we want our model to output the probability that an unusual event is taking place in the input video. The output of our model is a continuous number between 0 and 1, so we cast anomaly detection as a regression problem. Additionally, we want to focus on the precise locality where the anomaly occurs: we do so by including a novel tube extraction module in our architecture. With this approach, we can change the granularity of our analysis from full-frame videos to spatiotemporal tubes. Our model consists of three main components: a tube extraction module, a video encoder, and a regression network.
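To make the three-stage structure concrete, here is a minimal sketch of the data flow from clip to score. It is not our implementation: the function names (`extract_tube`, `encode`, `anomaly_score`), the union-box cropping rule, the mean-intensity "encoder", and the single-weight regression head are all assumptions chosen to keep the example short and dependency-free. A real system would use a learned video network in place of `encode`.

```python
import math


def extract_tube(clip, boxes):
    """Crop every frame to the union of the per-frame boxes.

    clip  : list of frames; each frame is a list of rows of grayscale pixels
    boxes : one (xmin, ymin, xmax, ymax) tuple per frame
    Using a single union box keeps all frames of the tube the same size;
    this is an illustrative choice, not necessarily the one in the paper.
    """
    xmin = min(b[0] for b in boxes)
    ymin = min(b[1] for b in boxes)
    xmax = max(b[2] for b in boxes)
    ymax = max(b[3] for b in boxes)
    return [[row[xmin:xmax] for row in frame[ymin:ymax]] for frame in clip]


def encode(tube):
    """Placeholder encoder: one feature, the mean pixel intensity of the tube."""
    pixels = [p for frame in tube for row in frame for p in row]
    return [sum(pixels) / len(pixels)]


def anomaly_score(features, weights, bias):
    """Regression head: linear layer plus sigmoid, so the score lies in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy clip: 4 grayscale frames of 6 rows x 8 columns, constant intensity 0.5.
clip = [[[0.5] * 8 for _ in range(6)] for _ in range(4)]
boxes = [(2, 1, 6, 5)] * 4  # the same hypothetical box in every frame

tube = extract_tube(clip, boxes)
score = anomaly_score(encode(tube), weights=[0.0], bias=0.0)
```

With untrained (zero) weights the score is uninformative; the point is only that the tube, rather than the whole frame, is what reaches the encoder and the regression head.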


To the best of our knowledge, none of the existing anomaly detection datasets provides spatiotemporal annotations for unusual events in its training set. To overcome the lack of labeled data, we enrich a portion of the recently proposed UCF-Crime dataset with spatiotemporal annotations. We start by selecting six of the 13 anomalous categories present in UCF-Crime, with particular attention to human-based anomalies: Arrest, Assault, Burglary, Robbery, Stealing, and Vandalism. We then select 100 videos belonging to these categories, resulting in more than an hour of video sequences. Finally, we use Vatic to annotate bounding boxes for anomalous events. Although we do not use them in our experiments, action class labels for tasks such as action recognition or localization are also available. For further details about our annotation policy, please refer to our paper.
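As a starting point for working with bounding box annotations of this kind, the sketch below parses lines in the plain-text layout that Vatic's dump command produces by default: track id, xmin, ymin, xmax, ymax, frame, lost, occluded, generated, and a quoted label. Whether the released annotation files follow exactly this layout is an assumption here, so check a file before relying on it. The sample lines and the helper names (`parse_vatic_line`, `boxes_by_frame`) are hypothetical.

```python
import shlex
from collections import namedtuple

# Field order follows Vatic's default text dump (assumed to apply to these files).
Box = namedtuple(
    "Box", "track_id xmin ymin xmax ymax frame lost occluded generated label"
)


def parse_vatic_line(line):
    """Parse one annotation line into a Box; the label is a quoted string."""
    fields = shlex.split(line)  # shlex strips the quotes around the label
    nums = [int(v) for v in fields[:9]]
    flags = [bool(v) for v in nums[6:9]]
    return Box(*nums[:6], *flags, fields[9])


def boxes_by_frame(lines):
    """Group visible boxes by frame index, skipping boxes flagged as lost."""
    frames = {}
    for line in lines:
        if not line.strip():
            continue
        box = parse_vatic_line(line)
        if box.lost:
            continue  # the object is outside the view in this frame
        frames.setdefault(box.frame, []).append(box)
    return frames


# Made-up sample lines, for illustration only.
sample = [
    '0 10 20 50 80 0 0 0 0 "Burglary"',
    '0 12 22 52 82 1 0 0 1 "Burglary"',
    '0 0 0 0 0 2 1 0 1 "Burglary"',
]
frames = boxes_by_frame(sample)
```

Grouping by frame makes it straightforward to look up the boxes for any frame of a clip, for example when cropping a spatiotemporal tube.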

[Figure: example frames from UCFCrime2Local for the categories Arrest, Assault, Burglary, Robbery, Stealing, and Vandalism.]


You can download our annotations here. We do not redistribute the videos, which can be downloaded from the UCF-Crime website. If you have any questions, please contact the author.


If you use our annotations or find our work useful for your research, please cite:

  	  title={Anomaly Locality in Video Surveillance},
  	  author={Landi, Federico and Snoek, Cees GM and Cucchiara, Rita},
  	  journal={arXiv preprint arXiv:1901.10364},