Back to the research area

Visual Saliency Prediction

When human observers look at an image, attentive mechanisms drive their gazes towards salient regions. Emulating such ability has been studied for more than 80 years by neuroscientists and by computer vision researchers, while only recently, thanks to the large spread of deep learning, saliency prediction models have achieved a considerable improvement.

Motivated by the importance of automatically estimating the human focus of attention on images, we developed two different saliency prediction models which have overcome previous methods by a big margin winning the LSUN Saliency Challenge at CVPR17, Honolulu, Hawaii.

LSUN Challenge 2017

The latest saliency prediction model developed by Imagelab has ranked first in the LSUN 2017 Saliency Challenge, which took place at CVPR, Honolulu, Hawaii.

Our model integrates an LSTM-based attentive mechanism to iteratively attend and refine predictions at different locations. A variation of that model is currently under submission to a journal. You can download the preprint of the paper here.

Saliency Attentive Model (SAM)


Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations. In this paper we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and we present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a Convolutional LSTM that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. Additionally, to tackle the center bias present in human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture overcomes the current state of the art on two public saliency prediction datasets. We further study the contribution of each key components to demonstrate their robustness on different scenarios.


Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model

M.Cornia, L.Baraldi, G.Serra, R.Cucchiara


Poster and Source Code:

Multi-Level Network (ML-Net)


Current state of the art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network, that weights low and high level feature maps, and a prior learning network. We compare our solution with state of the art saliency models on two public benchmarks datasets. Results show that our model outperforms under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark.


A Deep Multi-Level Network for Saliency Prediction

M.Cornia, L.Baraldi, G.Serra, R.Cucchiara

ICPR 2016

Poster and Source Code:


1 Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita "Visual Saliency for Image Captioning in New Multimedia Services" Proceedings of the 2017 IEEE International Conference on Multimedia and Expo Workshops, Hong Kong, July 10-14, 2017, 2017 Conference
2 Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita "A Deep Multi-Level Network for Saliency Prediction" Proceedings of the 23rd International Conference on Pattern Recognition, Cancun, Mexico, 4-8 Dec 2016, 2016 Conference
3 Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita "Multi-Level Net: a Visual Saliency Prediction Model" Computer Vision ECCV 2016 Workshops, vol. 9914, Amsterdam, The Netherlands, pp. 302 -315 , October 9th, 2016, 2016
DOI: 10.1007/978-3-319-48881-3_21 Conference