VSSN 06

Computer Vision Laboratory
Institute for Advanced Computer Studies
&
Computer Science Department
University of Maryland
College Park MD, USA

ABSTRACT

People often appear in small groups in surveillance video; to analyze their movements or to recognize them based on their appearance, it would be useful to segment the groups into individuals. Our research group has worked on this problem for the past several years, and this paper reviews the sequence of approaches we have taken to address it.

The first approach is described in [1]. Here, we assume that we have a single surveillance video camera and that people enter the field of view of that camera one at a time. As each person enters, we first build a color appearance model of that person. This model is obtained by segmenting the body into head, torso and legs and then modeling the color of each region using kernel density estimation. When people come together (e.g., one person walks in front of another), we construct a maximum likelihood segmentation of that group based on the known color appearance models and occlusion modeling. The algorithm finds the most likely locations of each person in each view as well as a depth ordering of the people; it does this without conducting a combinatorial search over position, employing a mean-shift-like procedure to track people’s locations. However, this algorithm makes the restrictive assumption that people will enter the camera field one at a time; additionally, because of the way it constructs the depth ordering, it does not scale well to larger numbers of people.
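The color modeling and maximum likelihood labeling steps can be illustrated with a minimal sketch. It is not the implementation of [1]: it assumes a Gaussian kernel with a single fixed bandwidth, treats each person as one color model rather than separate head, torso and legs regions, and omits occlusion modeling and the depth ordering entirely. The function names `kde_log_likelihood` and `ml_labels` and the `bandwidth` parameter are hypothetical.

```python
import numpy as np

def kde_log_likelihood(pixels, samples, bandwidth=0.05):
    """Log-density of each pixel color under a Gaussian KDE built from samples.

    pixels:  (P, d) colors to evaluate.
    samples: (S, d) colors collected while the person was seen in isolation.
    bandwidth is an assumed isotropic kernel width, not a value from [1].
    """
    pixels = np.atleast_2d(np.asarray(pixels, dtype=float))
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    d = samples.shape[1]
    # Scaled squared distance between every pixel and every sample: (P, S).
    diff = pixels[:, None, :] - samples[None, :, :]
    expo = -0.5 * np.sum(diff ** 2, axis=2) / bandwidth ** 2
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2) - np.log(len(samples))
    # Log-sum-exp over samples keeps far-away colors from underflowing to -inf.
    m = expo.max(axis=1, keepdims=True)
    return log_norm + m[:, 0] + np.log(np.exp(expo - m).sum(axis=1))

def ml_labels(pixels, models, bandwidth=0.05):
    """Assign each pixel to the person whose color model explains it best."""
    scores = np.stack(
        [kde_log_likelihood(pixels, s, bandwidth) for s in models], axis=1
    )
    return np.argmax(scores, axis=1)
```

Each entry of `models` would be the array of color samples gathered for one person; `ml_labels` then returns, for every foreground pixel, the index of the most likely person under color alone.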

In [2] we presented a multi-camera algorithm that removes the restrictive assumption that people enter individually, and that tracks all of the people in the joint camera fields of view in a common ground plane coordinate system. This algorithm employed a novel wide-baseline stereo reconstruction method that avoided point correspondences, performing region matching between views instead. It used an incremental method to find new people and build their appearance models, which were more detailed than the models used in [1]. The system cycled between segmentation and 3D localization: if the 3D positions of people are known, each view can be accurately segmented into individual people; conversely, if each view can be so segmented, the 3D positions of people can be determined. Iterating on these steps a few times per frame quickly converged to accurate locations and segmentations. However, this algorithm needed strong calibration of the cameras for its stereo reconstruction.

Most recently, in [3], we described a multiview approach that requires only ground plane calibration for each camera. It uses a 3D localization method introduced by Hu et al. [4], embedded in a particle filtering tracking framework, to simultaneously detect, segment and track people moving on a ground plane.
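To make concrete what ground plane calibration provides, the sketch below assumes that each camera's calibration is expressed as a 3x3 homography `H` mapping image coordinates to ground plane coordinates. It maps a person's foot point from each view onto the ground plane and then averages the per-camera estimates. The averaging step is only an illustrative stand-in; it is not the principal-axis method of [4] or the particle filter of [3], and the names `image_to_ground` and `fuse_ground_estimates` are hypothetical.

```python
import numpy as np

def image_to_ground(H, image_points):
    """Map image points (N, 2) to ground plane coordinates with homography H."""
    pts = np.atleast_2d(np.asarray(image_points, dtype=float))
    # Lift to homogeneous coordinates, apply H, then divide by the last component.
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ np.asarray(H, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]

def fuse_ground_estimates(homographies, foot_points):
    """Average the ground plane positions implied by one foot point per camera.

    A plain mean is an assumption made for illustration only.
    """
    estimates = [
        image_to_ground(H, p)[0] for H, p in zip(homographies, foot_points)
    ]
    return np.mean(estimates, axis=0)
```

Because only the image-to-ground mapping is needed, no full camera pose or intrinsic calibration enters this computation, which is the practical appeal of requiring ground plane calibration alone.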

We are currently pursuing an approach that requires only one camera and ground plane calibration. I will describe this work briefly at the end of the presentation.

REFERENCES

[1] Elgammal, A., and Davis, L.S., Probabilistic framework for segmenting people under occlusion, International Conference on Computer Vision, 2001, 145-152.
[2] Mittal, A., and Davis, L.S., M2 Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene, International Journal of Computer Vision, 2005.
[3] Kim, K., and Davis, L.S., Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering, European Conference on Computer Vision, May 2006.
[4] Hu, W., Hu, M., Zhou, X., Tan, T., and Maybank, S.J., Principal axis-based correspondence between multiple cameras for people tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, 2006, 663-667.

Keynote speech

Segmenting People in Small Groups

Prof. Larry Davis
