Research on Videosurveillance & HBU
Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World
Multi-people tracking in an open-world setting requires particular care in precise detection. Moreover, temporal continuity in the detection phase becomes more important when scene clutter introduces the challenging problem of occluded targets. For this purpose, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans.
Our model explicitly deals with occluded body parts by hallucinating plausible locations for joints that are not visible. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created a Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is, to date, the largest dataset of human body parts for people tracking in urban scenarios (about 500,000 frames and more than 10 million body poses).
Our architecture, trained on virtual data, also generalizes well to public real-world tracking benchmarks when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.
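As a rough illustration of how separate visible and occluded heatmaps could be read out for a single joint, the sketch below picks the strongest response across the two maps and reports which branch it came from. This is not the network described above: the function name `read_joint`, the threshold value and the toy 2x2 maps are assumptions made for the example.

```python
def read_joint(visible_map, occluded_map, threshold=0.5):
    """Return (row, col, state) of the strongest response, or None.

    Both maps are 2D lists of scores for the same joint type.
    `threshold` is a hypothetical confidence cut-off.
    """
    best = None  # (score, row, col, state)
    for state, heatmap in (("visible", visible_map), ("occluded", occluded_map)):
        for r, row in enumerate(heatmap):
            for c, score in enumerate(row):
                # Strict comparison: on a tie the visible branch wins.
                if best is None or score > best[0]:
                    best = (score, r, c, state)
    if best is None or best[0] < threshold:
        return None
    return best[1], best[2], best[3]


# Toy example: the occluded branch has the strongest peak.
visible = [[0.1, 0.2], [0.0, 0.3]]
occluded = [[0.0, 0.9], [0.1, 0.0]]
print(read_joint(visible, occluded))
```

A real system would extract several peaks per map (one per person) and then group them; the single-peak readout only shows how the two branches complement each other.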
GAN4Surveillance: Generative Adversarial Networks for Attribute Classification
Security is of fundamental importance in a world where terrorist attacks are steadily increasing. Governments and agencies face these realities every day, but the means at their disposal are not always sufficient to prevent those attacks effectively. The security domain draws on many fields of science and engineering and offers many areas of study. This research activity focuses on classifying the attributes (such as age and sex) and carried items (backpacks, bags, etc.) of people seen through security cameras. Computer Vision techniques based on Deep Learning and generative models are exploited to address this problem automatically. We explore the generalization capability of adversarial networks to enhance the resolution of people images and to hallucinate occluded body parts.
Duke Imagelab Multi-Target, Multi-Camera Tracking Project
DukeMTMC aims to accelerate advances in multi-target multi-camera tracking. It provides a tracking system that works within and across cameras, a new large scale HD video data set recorded by 8 synchronized cameras with more than 7,000 single camera trajectories and over 2,000 unique identities, and a new performance evaluation method that measures how often a system is correct about who is where.
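One way to quantify "how often a system is correct about who is where" is an identity-level F1 score. The sketch below assumes that ground-truth and predicted trajectories have already been matched one-to-one, and that the resulting identity true positives, false positives and false negatives are available as counts. The function name and the example numbers are illustrative, not taken from the project.

```python
def identity_f1(id_true_pos, id_false_pos, id_false_neg):
    """Identity F1: harmonic mean of identity precision and identity recall.

    The three counts are assumed to come from a prior one-to-one matching
    between ground-truth and predicted trajectories.
    """
    denominator = 2 * id_true_pos + id_false_pos + id_false_neg
    if denominator == 0:
        return 0.0  # nothing to evaluate
    return 2 * id_true_pos / denominator


# Hypothetical counts: 80 correctly identified detections,
# 20 identity false positives, 20 identity false negatives.
print(identity_f1(80, 20, 20))
```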
Although prejudice cannot be directly observed, nonverbal behaviours provide profound hints about people's inclinations. In this study, we use recent sensing technologies and machine learning techniques to automatically infer the results of psychological questionnaires frequently used to assess implicit prejudice. In particular, we recorded 32 students in discussion with both white and black collaborators. We then identified a set of features that can be extracted automatically and measured their degree of correlation with psychological scores. Results confirmed that automated analysis of nonverbal behaviour is feasible, thus paving the way for innovative clinical tools and eventually more secure societies.
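The correlation step can be illustrated with a plain Pearson coefficient between one automatically extracted feature and the questionnaire scores. This is a minimal sketch: the text does not state which correlation measure was used, and the feature values below are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one nonverbal feature per participant vs. a test score.
feature = [0.9, 1.4, 1.1, 2.0, 1.7]
score = [0.2, 0.5, 0.3, 0.9, 0.6]
print(round(pearson(feature, score), 3))
```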
Action and Gesture Recognition for Human Computer Interaction
This research activity addresses the problems of human action and gesture recognition. In particular, we are developing a complete framework for Human Computer Interaction (HCI), where custom gestures can be adopted by each user. Continuous gesture recognition, one-shot learning and transfer learning algorithms are taken into account.
HCI is the discipline that studies models and techniques for the interaction between people and computers. Its historical evolution starts in the ‘70s, when Command Line Interfaces (CLI) were created. Although these interfaces are quick, they are difficult to use because of their mnemonic component, i.e. users have to remember the exact command to interact with the computer. In the ‘80s Graphical User Interfaces (GUI) were developed: they are more user friendly than CLIs and introduce new devices (like the mouse) and metaphors (like windows, drag-and-drop, and the desktop). Natural User Interfaces (NUI) were conceived in the ‘90s. They are intuitive and invisible because users do not need any physical device to interact with the computer: they just perform natural actions with their body, their natural and innate language.
For these reasons, NUIs require systems that are able to automatically detect and recognize actions and gestures in a video stream. They have recently gained popularity thanks to new low-cost technologies that make it easy to detect and precisely track human body joints in 3D space.
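To make the one-shot setting with custom gestures concrete, the sketch below classifies a tracked 3D joint trajectory by comparing it to a single recorded template per gesture using dynamic time warping (DTW). DTW is a stand-in chosen for the example, not necessarily the technique used in the framework, and the gesture names and coordinates are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of 3D points."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(sequence, templates):
    """Return the name of the template closest to `sequence` under DTW."""
    return min(templates, key=lambda name: dtw_distance(sequence, templates[name]))


# One recorded example per custom gesture (hypothetical hand positions).
templates = {
    "swipe_right": [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    "raise_hand": [(0, 0, 0), (0, 1, 0), (0, 2, 0)],
}
query = [(0, 0, 0), (0.5, 0, 0), (1, 0.1, 0), (2, 0, 0)]
print(classify(query, templates))
```

Because DTW aligns sequences of different lengths, the query may be performed faster or slower than the template without retraining.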
Group Detection and Crowd Analysis
Behavior analysis will play a central role in future video surveillance systems, as research on this topic has proven promising in helping to discover public safety risks or predict crimes. Nevertheless, trying to understand complex interactions in the scene just by looking at each individual separately is unrealistic, due to the inherent social nature of human behavior. These interactions occur neither at the individual level nor at the crowd level: they typically involve small subsets of people, namely groups. We thus believe future challenges will reside in enhancing action analysis by considering social interactions among small gatherings of people sharing a common goal. To this end, group detection becomes a mandatory step for modern crowd surveillance systems.
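As a deliberately simple illustration of what group detection produces, the sketch below partitions people into groups by linking anyone standing within a distance threshold and taking the connected components. Proximity alone is an assumption made for brevity; practical group detection would also use cues such as motion direction and persistence over time. The function name and the 1.5 m threshold are hypothetical.

```python
import math


def detect_groups(positions, max_distance=1.5):
    """Group people whose ground-plane positions are transitively close.

    Returns a sorted list of groups, each a list of indices into `positions`.
    """
    n = len(positions)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical positions in metres: two pairs and one isolated person.
people = [(0, 0), (1, 0), (10, 10), (10.5, 10), (30, 30)]
print(detect_groups(people))
```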
Multiple People Tracking
Multiple Target Tracking is an important task within the field of computer vision. The proliferation of high-powered computers, the availability of high-quality and inexpensive video cameras, and the increasing need for automated video analysis have generated a great deal of interest in tracking algorithms. The problem is often addressed in a paradigm named tracking-by-detection, where detections are given ahead of time and the purpose of tracking is to merge these detections into separate identities. The real challenge in Multi Target Tracking is how to deal with noisy detections (missed and false detections) and with long occlusions. In this work we leveraged cognitive psychology studies to develop a human-inspired model.
People re-identification aims at finding multiple instances of the same person in images or videos based on appearance features. Imagelab attempted to solve the re-identification problem by means of 3D body models, which provide spatial support for the appearance features.
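The tracking-by-detection step of merging detections into identities can be sketched as a greedy association between existing tracks and the detections of a new frame, scored by bounding-box overlap (IoU). This is a generic baseline shown for illustration, not the human-inspired model mentioned above; the function names and the `min_iou` threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedily match tracks (id -> last box) to detections (list of boxes).

    Returns (matches, unmatched): matches maps track id to detection index,
    unmatched lists detection indices that would start new tracks.
    """
    candidates = sorted(
        ((iou(box, det), track_id, det_idx)
         for track_id, box in tracks.items()
         for det_idx, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, track_id, det_idx in candidates:
        if score < min_iou:
            break  # remaining pairs overlap too little
        if track_id in matches or det_idx in used:
            continue
        matches[track_id] = det_idx
        used.add(det_idx)
    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
print(associate(tracks, detections))
```

Missed detections leave a track without a match and false detections show up as unmatched entries, which is exactly where noisy detections and long occlusions make the problem hard.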
ViSOR - Video Surveillance Online Repository
ViSOR contains a large set of multimedia data and the corresponding annotations. The repository has been conceived as a support tool for different research projects.
Together with the videos, ViSOR contains metadata annotations, both manually annotated ground-truth data and automatically obtained outputs of a particular system. In this way, users of the repository can validate their own algorithms and carry out comparative evaluations. ViSOR also contains two datasets for people re-identification: 3DPES and SARC3D.
People trajectory analysis and anomaly detection
People trajectory analysis is a recurrent task in many pattern recognition applications, such as surveillance, behavior analysis, video annotation, and many others. We develop a new framework for analyzing trajectory shape that is invariant to spatial shifts of people's motion in the scene.
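One simple way to obtain a shape description that is invariant to spatial shifts is to replace absolute positions with the heading angle of each step, since translating a trajectory leaves its step directions unchanged. The sketch below assumes that representation and compares two equally long trajectories by their mean angular difference; it illustrates the invariance property and is not the framework itself.

```python
import math


def heading_angles(points):
    """Direction (radians) of each step of a 2D trajectory."""
    return [
        math.atan2(y1 - y0, x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def shape_distance(a, b):
    """Mean absolute angular difference between two equally long trajectories."""
    angles_a, angles_b = heading_angles(a), heading_angles(b)
    if len(angles_a) != len(angles_b) or not angles_a:
        raise ValueError("trajectories must have the same length (>= 2 points)")
    total = 0.0
    for u, v in zip(angles_a, angles_b):
        diff = abs(u - v) % (2 * math.pi)
        total += min(diff, 2 * math.pi - diff)  # wrap around the circle
    return total / len(angles_a)


path = [(0, 0), (1, 0), (2, 1), (3, 3)]
shifted = [(x + 5, y - 2) for x, y in path]   # same shape, different place
straight = [(0, 0), (0, 1), (0, 2), (0, 3)]   # different shape
print(shape_distance(path, shifted), shape_distance(path, straight))
```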
People Tracking From Multiple Cameras
Outdoor surveillance is one of the most attractive applications of video processing and analysis. Robust algorithms must be defined and tuned to cope with the non-idealities of outdoor scenes. For instance, in a public park, an automatic video surveillance system must discriminate between shadows, reflections, waving trees, people standing still or moving, and other objects. Visual knowledge coming from multiple cameras can disambiguate cluttered and occluded targets by providing a continuous, consistent labeling of tracked objects across the different views.