Deep Learning in videos
Despite the achievements of Deep Learning on image, text, and audio data, it is still not clear how these techniques can be successfully applied to video-related tasks, such as video summarization, event detection, and video concept detection. In this setting, the main goal of our research is to develop and test new Deep Learning algorithms for Temporal Video Segmentation and Concept Detection in videos.
Temporal video segmentation is a well-established problem in video analysis: it aims at organizing a video into groups of frame sequences according to some coherency criterion. In the case of edited video, frames are grouped into shots (sequences of frames taken by the same camera), and shots can in turn be grouped into semantically coherent segments, also called stories, since they often correspond to story-telling units. Since the resulting segments can be automatically tagged and annotated, this decomposition enables fine-grained search and re-use of existing video archives, and is therefore of great interest both scientifically and for industrial applications.
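As a rough illustration of the shot-detection step (not the method described in this project), a classic baseline compares color histograms of consecutive frames and marks a boundary where they differ strongly. The sketch below is a minimal example under that assumption; the threshold value is purely illustrative and would be tuned per corpus in practice.

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Per-channel color histogram, normalized to sum to 1."""
    hist = np.concatenate(
        [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
         for c in range(frame.shape[-1])]
    ).astype(float)
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.5):
    """Return frame indices where a new shot begins.

    `threshold` is a hypothetical value: the L1 distance between two
    normalized histograms ranges from 0 (identical) to 2 (disjoint).
    """
    hists = [frame_histogram(f) for f in frames]
    boundaries = []
    for i in range(1, len(hists)):
        # Abrupt color change between consecutive frames -> cut candidate
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            boundaries.append(i)
    return boundaries
```

Real systems are considerably more robust (handling gradual transitions, camera motion, and flashes), but the principle of thresholding a frame-to-frame dissimilarity is the common starting point.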
We developed a new approach to Temporal Video Segmentation which outperformed existing approaches on benchmark datasets. It relies on a multimodal Deep Neural Network that combines static appearance, features from the video transcript, and concept detection in a metric learning fashion. The core of the algorithm is a concept detection strategy which analyses the video transcript and builds classifiers for relevant concepts "on the fly", exploiting the ImageNet database.
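The metric-learning idea can be sketched as follows: once shots are mapped into a learned embedding space, consecutive shots whose embeddings are close are grouped into the same story. The function below is a minimal sketch under that assumption; the embeddings stand in for the multimodal (appearance + transcript + concept) features, and the function name and threshold are illustrative, not part of the published system.

```python
import numpy as np

def group_shots_into_scenes(embeddings, threshold=0.3):
    """Assign a scene label to each shot: start a new scene whenever the
    cosine distance between consecutive shot embeddings exceeds
    `threshold` (an illustrative value, not a tuned parameter)."""
    labels = [0]
    for i in range(1, len(embeddings)):
        a, b = embeddings[i - 1], embeddings[i]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        dist = 1.0 - cos  # 0 = same direction, 2 = opposite
        labels.append(labels[-1] + 1 if dist > threshold else labels[-1])
    return labels
```

In the actual approach the embedding itself is learned so that shots belonging to the same story are pulled together and shots from different stories are pushed apart, which is what makes such a simple distance-based grouping effective.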
Our research in the coming months will include a complete refinement, analysis, and fine-tuning of this approach, as well as the development of new algorithms for concept detection and temporal clustering. We also plan to carry out theoretical studies on Convolutional Neural Networks and Recurrent Neural Networks for structured prediction. Training and fine-tuning these architectures will be performed on large-scale video and image datasets, and therefore requires adequate computational resources. A relevant scientific impact is expected, both at international computer vision and multimedia conferences and in top-tier journals.
Imagelab has received three important grants for this project:
The NVIDIA Hardware Grant, with the donation of one Tesla K40 GPU.
The Italian Supercomputing Resource Allocation (ISCRA) Grant from CINECA.
The Facebook AI Research Partnership, with the donation of a GPU-based server.
Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita
"A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation"
Proceedings of the 24th ACM international conference on Multimedia,
Amsterdam, The Netherlands,
15 - 19 October 2016,
DOI: 10.1145/2964284.2973825 Conference
Paci, Francesco; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
"Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager"
Proceedings of the 14th European Conference on Computer Vision Workshops,
Amsterdam, The Netherlands,
8-10 October 2016,
Conference
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
"Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks"
IEEE Transactions on Multimedia, pp. 1-14, 2016,
Journal
Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
"Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features"
Proceedings of the 6th ACM on International Conference on Multimedia Retrieval,
New York, USA,
6-9 June 2016,
DOI: 10.1145/2911996.2912012 Conference