
Deep Learning in videos

Despite the achievements of Deep Learning on image, text, and audio data, it is still not clear how these techniques can be successfully applied to video-related tasks such as video summarization, event detection and video concept detection. In this setting, the main goal of our research is to develop and test new Deep Learning algorithms for Temporal Video Segmentation and Concept Detection in videos.

Temporal video segmentation is a well-established problem in video analysis. It aims at organizing a video into groups of frame sequences according to some coherence criterion. In the case of edited video, frames are grouped into shots (sequences of frames taken by the same camera), and shots can in turn be grouped into semantically coherent segments, also called stories because they often tell a story. Since the resulting segments can be automatically tagged and annotated, this decomposition enables fine-grained search and re-use of existing video archives, and is therefore of great interest for both scientific research and industrial applications.
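The grouping of shots into segments can be illustrated with a minimal sketch. It assumes each shot is already summarized by a small feature vector, and it starts a new segment whenever two adjacent shots are farther apart than a threshold. The function names, the toy descriptors and the threshold are all hypothetical; this is a simple baseline for illustration, not the method developed in this project.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_shots(shot_features, threshold):
    """Group consecutive shots into segments.

    A new segment starts whenever the distance between the descriptors
    of two adjacent shots exceeds `threshold` (an assumed, hand-picked
    value). Returns a list of segments, each a list of shot indices.
    """
    if not shot_features:
        return []
    segments = [[0]]
    for i in range(1, len(shot_features)):
        if euclidean(shot_features[i - 1], shot_features[i]) > threshold:
            segments.append([i])      # boundary: open a new segment
        else:
            segments[-1].append(i)    # same story: extend the current one
    return segments


# Toy 2-D descriptors for five shots (made-up values).
shots = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (1.0, 0.9)]
segments = group_shots(shots, threshold=0.5)
print(segments)  # [[0, 1], [2, 3, 4]]
```

Because only adjacent shots are compared, every segment is a contiguous run of shots, which matches the temporal nature of the problem.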

We developed a new approach to Temporal Video Segmentation that outperformed all existing approaches on benchmark datasets. It relies on a multimodal Deep Neural Network that combines static appearance, features from the video transcript, and concept detection in a metric learning fashion. The core of the algorithm is a concept detection strategy that analyses the video transcript and builds classifiers for relevant concepts “on the fly”, exploiting the ImageNet database.
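To make the idea of metric learning over multimodal shot descriptors more concrete, here is a minimal sketch. It assumes the three modalities are fused by simple concatenation and that training uses a contrastive loss on pairs of shots; both choices, along with every name and value below, are illustrative assumptions and not a description of the actual network.

```python
from math import sqrt


def fuse(appearance, transcript, concepts):
    """Concatenate per-modality descriptors into one shot vector.

    Concatenation is an assumed fusion strategy, used for illustration.
    """
    return list(appearance) + list(transcript) + list(concepts)


def contrastive_loss(a, b, same_segment, margin=1.0):
    """Contrastive loss for one pair of shot vectors.

    Pairs from the same segment are pulled together (squared distance);
    pairs from different segments are pushed apart until their distance
    reaches `margin` (a hypothetical hyperparameter).
    """
    d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if same_segment:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy one-dimensional descriptors per modality (made-up values).
shot_a = fuse([0.0], [0.0], [0.0])
shot_b = fuse([0.3], [0.0], [0.4])   # close to shot_a
shot_c = fuse([3.0], [0.0], [4.0])   # far from shot_a

loss_same = contrastive_loss(shot_a, shot_b, same_segment=True)
loss_diff_near = contrastive_loss(shot_a, shot_b, same_segment=False)
loss_diff_far = contrastive_loss(shot_a, shot_c, same_segment=False)
```

In a trained model the vectors would come from a learned embedding rather than raw descriptors, so that distances between shots reflect whether they belong to the same story.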

Over the next months, our research will refine, analyse and fine-tune this approach, and develop new algorithms for concept detection and temporal clustering. We also plan to perform theoretical studies on Convolutional Neural Networks and Recurrent Neural Networks for structured prediction. These architectures will be trained and fine-tuned on large-scale video and image datasets, which requires adequate computational resources. We expect the results to have a relevant scientific impact, both at international computer vision and multimedia conferences and in top-tier journals.



Imagelab has received three important grants for this project:

  • The NVIDIA Hardware Grant, with the donation of one Tesla K40 GPU.

  • The Italian Supercomputing Resource Allocation (ISCRA) Grant from CINECA.

  • The Facebook AI Research Partnership, with the donation of a GPU-based server.


1 Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "Hierarchical Boundary-Aware Neural Encoder for Video Captioning". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, July 22-25, 2017. Conference.
2 Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use". Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, Florence, Italy, June 19-21, 2017. DOI: 10.1145/3095713.3095735. Conference.
3 Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita. "A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation". Proceedings of the 2016 ACM on Multimedia Conference, Amsterdam, The Netherlands, pp. 733-734, October 15-19, 2016. DOI: 10.1145/2964284.2973825. Conference.
4 Paci, Francesco; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita; Benini, Luca. "Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager". Computer Vision - ECCV 2016 Workshops, vol. 9913, Amsterdam, The Netherlands, pp. 589-602, October 8-10, 2016. DOI: 10.1007/978-3-319-46604-0_42. Conference.
5 Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks". IEEE Transactions on Multimedia, pp. 1-14, 2016. Journal.
6 Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features". Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, New York, USA, pp. 23-29, June 6-9, 2016. DOI: 10.1145/2911996.2912012. Conference.
7 Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "A Deep Siamese Network for Scene Detection in Broadcast Videos". Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, pp. 1199-1202, October 26-30, 2015. DOI: 10.1145/2733373.2806316. Conference.
