Imagelab datasets



Pandora has been specifically created for head center localization, head pose and shoulder pose estimation, and is inspired by the automotive context. A fixed frontal device acquires the upper body of the subjects, simulating the point of view of a camera placed inside the dashboard. Subjects also perform driving-like actions, such as grasping the steering wheel, looking at the rear-view or side mirrors, shifting gears and so on.
Pandora contains more than 250k full-resolution RGB (1920x1080 pixels) and depth images (512x424 pixels) with the corresponding annotations: 110 annotated sequences using 10 male and 12 female actors.
Garments as well as various objects are worn or used by the subjects to create head and/or shoulder occlusions. For example, people wear prescription glasses, sunglasses, scarves and caps, and manipulate smartphones, tablets or plastic bottles.

Keywords: head pose, shoulder pose, automotive, cvpr


MotorMark is composed of more than 30k frames. A variety of subjects is guaranteed (35 subjects in total).
We recreate an automotive context: the subject sits behind a real car dashboard and performs real in-car actions, such as rotating the steering wheel and shifting gears.
Subjects are asked to follow a constrained path (4 LEDs are placed in correspondence with the speedometer, the rev counter, the infotainment system and the left wing mirror), to rotate their head to fixed positions, or to move their head freely. Besides, subjects can wear glasses, sunglasses and a scarf, to generate partial face and landmark occlusions.
The annotation of 68 landmark positions on both RGB and depth frames is available, following the ISO MPEG-4 standard. The ground truth has been manually generated.
The annotator was provided with an initial estimate produced by the algorithm included in the dlib library, which gives landmark positions on RGB images. The projection of the landmark coordinates onto the depth images is carried out by exploiting the internal calibration tool of the Microsoft Kinect SDK.
RGB and depth images are acquired at a spatial resolution of 1280x720 (HD) and 512x424 pixels, respectively.
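The RGB-to-depth projection itself is handled internally by the Kinect SDK's calibration tool. As a rough illustration of what such a mapping does, here is a minimal pinhole-camera sketch; the intrinsic matrices below are made-up placeholders (not the Kinect's actual calibration) and the two cameras are assumed co-located:

```python
import numpy as np

# Illustrative intrinsics only; real values come from the Kinect SDK's
# internal calibration, not from this sketch.
RGB_K = np.array([[1050.0,    0.0, 640.0],
                  [   0.0, 1050.0, 360.0],
                  [   0.0,    0.0,   1.0]])
DEPTH_K = np.array([[365.0,   0.0, 256.0],
                    [  0.0, 365.0, 212.0],
                    [  0.0,   0.0,   1.0]])

def rgb_to_depth(u, v, z):
    """Map an RGB pixel (u, v) at depth z to depth-image coordinates,
    assuming identical extrinsics for the two cameras."""
    # Back-project to a 3D point on the viewing ray, then re-project
    # with the depth camera intrinsics.
    p = np.linalg.inv(RGB_K) @ np.array([u, v, 1.0]) * z
    q = DEPTH_K @ (p / p[2])
    return q[0], q[1]
```

In the real pipeline the per-pixel depth value and the calibrated extrinsics replace the assumptions made here.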

Keywords: facial landmark, depth images, automotive, landmarking


DR(eye)VE: a dataset for attention-based tasks with applications to autonomous and assisted driving.
The dataset is composed of 74 video sequences of 5 minutes each, for more than 500,000 captured and annotated frames.
The labeling contains drivers' gaze fixations and their temporal integration, providing task-specific saliency maps.
Geo-referenced locations, driving speed and course complete the set of released data.
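The description does not spell out the exact fixation-to-saliency procedure; a common approach, sketched here under that assumption, accumulates the fixation points of a temporal window and smooths each with a Gaussian:

```python
import numpy as np

def fixation_map(fixations, shape=(108, 192), sigma=6.0):
    """Turn a list of (x, y) gaze fixations into a dense saliency map.

    Each fixation contributes an isotropic Gaussian; the result is
    normalized to [0, 1]. Resolution and sigma are illustrative values,
    not the dataset's actual parameters."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    sal = np.zeros((h, w))
    for x, y in fixations:
        sal += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    peak = sal.max()
    return sal / peak if peak > 0 else sal
```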

Keywords: driver attention, driver gaze, ADAS, assisted driving, autonomous driving

BBC Planet Earth Dataset

The BBC Planet Earth dataset contains ground truth shot and scene annotations for each of the 11 episodes of the BBC Planet Earth educational TV series. Each shot and scene has been manually annotated and verified by a set of human experts. Moreover, since scene detection is a subjective task, we collected scene annotations from 5 different people.

Keywords: scene detection, shot detection

M-VAD Names

The dataset consists of 84.6 hours of video from 92 Hollywood movies. For each movie, we provide manually annotated visual appearances of characters, their face and body tracks, and associations with textual mentions in the captions. The dataset has been annotated in a semi-automatic way, by leveraging face recognition and clustering techniques, and then manually refined. It contains a total of 22,459 video clips, divided into appropriate train, validation and test splits. The number of unique characters found in the screenplays is 1,392, while the overall number of mentions is 29,862; however, we found errors in the captions of the M-VAD dataset for many films, so the correct numbers of unique characters and mentions are actually lower. Finally, the unique annotated characters are 908. They appear a total of 20,631 times in the video clips and in 53,665 extracted tracks.

Keywords: video captioning, naming, face identification

Surrounding Vehicles Awareness dataset

In order to collect data, we exploit the Script Hook V library, which allows using Grand Theft Auto V (GTAV) video game native functions.
We develop a framework in which the game camera automatically toggles between frontal and bird's-eye view at each game time step: in this way we are able to gather information about the spatial occupancy of the vehicles in the scene from both views (bounding boxes, distances, yaw rotations).
We associate vehicles information across the two views by querying the game engine for entity IDs. 
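The cross-view association step amounts to a join on the engine's entity IDs; a minimal sketch of that join (the record fields below are hypothetical, chosen to mirror the released annotations) could look like this:

```python
from dataclasses import dataclass

# Hypothetical record layout; the real data comes from Script Hook V
# queries to the GTAV game engine.
@dataclass
class VehicleObs:
    entity_id: int    # unique per-vehicle ID assigned by the engine
    bbox: tuple       # (x, y, w, h) in the current view
    distance: float   # distance from the ego camera
    yaw: float        # vehicle yaw rotation

def associate(frontal, birdeye):
    """Pair observations of the same vehicle across the two views,
    keyed on the engine-provided entity ID."""
    by_id = {obs.entity_id: obs for obs in birdeye}
    return [(f, by_id[f.entity_id]) for f in frontal
            if f.entity_id in by_id]
```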

Keywords: automotive, gta, surrounding vehicles


We collected a new dataset explicitly designed and created for Human-Computer Interaction.
10 gesture types: zoom in, zoom out, scroll up, scroll down, slide left, slide right, rotate, back, ok, exit; the dataset also contains a no-action class, in which subjects stand in front of the camera in a neutral pose.
Gestures are performed by 10 subjects (not all subjects perform all actions), for a total of 168 instances. They are acquired standing in front of a stationary Kinect 1 device, and only the upper part of the body (shoulders, elbows, wrists and hands) is involved; each gesture is performed with the same arm by all subjects, whether they are left- or right-handed.

Keywords: HCI, depth images, gestures


Following a common practice in the literature, we built a dataset that includes both synthetic and real images. The provided dataset is suitable for a wide range of applications, ranging from document processing to surveillance, and features significant variability in terms of resolution, image density and number of components. All images are provided in 1-bit-per-pixel PNG format, with 0 (black) being background and 1 (white) being foreground.
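As an informal illustration of how this convention is consumed, a minimal 4-connectivity labeling pass over such a binary array might look like the following; this is a didactic BFS sketch, not one of the optimized algorithms the dataset is meant to benchmark:

```python
import numpy as np
from collections import deque

def label(binary):
    """4-connectivity connected components labeling (simple BFS sketch).

    Expects a 2D array following the dataset convention:
    0 = background, 1 = foreground. Returns (label image, count)."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] == 1 and labels[sy, sx] == 0:
                # New component: flood-fill it with a fresh label.
                count += 1
                labels[sy, sx] = count
                q = deque([(sy, sx)])
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x),
                                   (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] == 1
                                and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            q.append((ny, nx))
    return labels, count
```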

Keywords: Connected Components Labeling, Binary Images.

Maramotti Dataset for Gesture Recognition

This dataset contains videos taken in the Maramotti modern art museum, in which paintings, sculptures and objets d'art are exhibited. The camera is placed on the user's head and captures an 800x450, 25 frames per second image sequence. The Maramotti dataset contains 700 video sequences, recorded by five different persons, each performing seven hand gestures in front of different artworks: like, dislike, point, ok, slide left to right, slide right to left and take a picture. Some of them (the point, ok, like and dislike gestures) are static, others (the two slide gestures) are dynamic.

Keywords: gesture recognition, cultural heritage

Interactive Museum for Gesture Recognition

The Interactive Museum dataset consists of 700 video sequences, all shot with a wearable camera in an interactive exhibition room, in which paintings and artworks are projected onto a wall in a virtual museum fashion. The camera is placed on the user's head and captures an 800x450, 25 frames per second, 24-bit RGB image sequence. Five different users perform seven hand gestures: like, dislike, point, ok, slide left to right, slide right to left and take a picture.

Keywords: gesture recognition, cultural heritage


3DPeS (3D People Surveillance Dataset) is a surveillance dataset designed mainly for people re-identification in multi-camera systems with non-overlapping fields of view, but also applicable to many other tasks, such as people detection, tracking, action analysis and trajectory analysis.

Available data: the camera settings and the 3D environment reconstruction, hundreds of recorded videos, the camera calibration parameters, and the identities of hundreds of people, each detected more than once from different points of view.

It contains numerous video sequences taken from a real surveillance setup, composed of 8 different surveillance cameras monitoring a section of the campus of the University of Modena and Reggio Emilia. Data were collected over the course of several days.

Keywords: re-identification, Visor, 3dpes, calibration

Visor - VIdeo Surveillance Online Repository

ViSOR contains a large set of multimedia data and the corresponding annotations. The repository has been conceived as a support tool for different research projects.
Together with the videos, ViSOR contains metadata annotations, both manually annotated ground-truth data and automatically obtained outputs of particular systems. In this manner, users of the repository can validate their own algorithms as well as perform comparative evaluations. ViSOR also contains two datasets for people re-identification: 3DPeS and SARC3D.

Keywords: Visor, surveillance, 3dPes, Sarc3D