People often appear in small groups in surveillance video; to analyze their movements, or to recognize them based on their appearance, it is useful to be able to segment those groups into individuals. Our research group has worked on this problem for the past several years, and this paper reviews the sequence of approaches we have taken to address it.
The first approach is described in [1]. Here, we assume that we have a single surveillance camera and that people enter its field of view one at a time. As each person enters, we build a color appearance model of that person: the body is segmented into head, torso, and legs, and the color of each region is modeled using kernel density estimation. When people come together (e.g., one person walks in front of another), we construct a maximum likelihood segmentation of the group based on the known color appearance models and an occlusion model. The algorithm finds the most likely location of each person in each view, as well as a depth ordering of the people; it does this without conducting a combinatorial search over positions, instead employing a mean-shift-like procedure to track people's locations. However, this algorithm makes the restrictive assumption that people enter the camera's field of view one at a time; additionally, because of the way it constructs the depth ordering, it would not scale well to larger numbers of people.
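As a toy illustration of the per-region color models (not the original implementation; the class, the synthetic colors, and the use of scipy's `gaussian_kde` are assumptions made for this sketch), one region's appearance model and its likelihood evaluation might look like:

```python
# Hypothetical sketch: a KDE color model for one body region.
import numpy as np
from scipy.stats import gaussian_kde

class RegionColorModel:
    """KDE over the RGB pixels sampled from one body region
    (head, torso, or legs) of one person."""
    def __init__(self, pixels_rgb):
        # pixels_rgb: (N, 3) array of sample colors from the region
        self.kde = gaussian_kde(pixels_rgb.T)  # scipy expects (dims, N)

    def likelihood(self, pixels_rgb):
        # Per-pixel density of the query colors under this model
        return self.kde(pixels_rgb.T)

# Toy usage: two people with differently colored torsos
rng = np.random.default_rng(0)
torso_a = rng.normal([200, 30, 30], 10, size=(500, 3))  # reddish torso
torso_b = rng.normal([30, 30, 200], 10, size=(500, 3))  # bluish torso
model_a = RegionColorModel(torso_a)
model_b = RegionColorModel(torso_b)

pixel = np.array([[195, 35, 28]])  # a reddish observation
print(bool(model_a.likelihood(pixel) > model_b.likelihood(pixel)))  # True
```

In the full system, a model like this would be built for each region of each person, and the per-pixel likelihoods would feed the maximum likelihood segmentation of the group.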
In [2] we presented a multi-camera algorithm that removes the restrictive assumption that people enter individually, and that tracks all of the people in the joint camera fields of view in a common ground-plane coordinate system. This algorithm employs a novel wide-baseline stereo reconstruction method that performs region matching between views without point correspondences. It uses an incremental method to find new people and build their appearance models, which are more detailed than the models used in [1]. The system cycles between segmentation and 3D localization: if the 3D positions of the people are known, then each view can be accurately segmented into individual people; conversely, if each view is so segmented, then the 3D positions of the people can be determined. Iterating these two steps a few times per frame quickly converges to accurate locations and segmentations. However, this algorithm requires strong calibration of the cameras for its stereo reconstruction. Most recently, in [3], we describe a multiview approach that requires only ground-plane calibration for each camera. It uses a 3D localization method introduced by Hu et al. [4], embedded in a particle filtering tracking framework, to simultaneously detect, segment, and track people moving on a ground plane.
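The segmentation/localization cycle of the multi-camera approach can be caricatured as a fixed-point iteration. The sketch below is a deliberately simplified stand-in (the "localization" and "segmentation" steps are toy averaging operations, not the real algorithms): each camera view contributes a noisy ground-plane estimate per person, localization fuses the views into position estimates, and segmentation refines each view using those positions.

```python
# Toy stand-in for the segmentation <-> 3D localization iteration;
# all quantities and update rules here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
true_pos = np.array([[0.0, 0.0], [3.0, 1.0]])  # two people on the ground plane
# Three "views", each a noisy per-person ground-plane estimate
views = [true_pos + rng.normal(0, 0.3, true_pos.shape) for _ in range(3)]

seg = [v.copy() for v in views]  # initial per-view "segmentations"
for _ in range(5):               # a few iterations per frame
    # Localization: fuse the per-view estimates into 3D positions
    pos = np.mean(seg, axis=0)
    # Segmentation: refine each view's estimates using the fused positions
    seg = [0.5 * v + 0.5 * pos for v in seg]

print(np.round(pos, 2))  # close to true_pos, up to observation noise
```

The real system's steps are far richer (appearance-based segmentation, wide-baseline region matching), but the alternating structure is the same.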
We are currently pursuing an approach that requires only one camera and ground plane calibration. I will describe this work briefly at the end of the presentation.
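For readers unfamiliar with the particle filtering machinery mentioned above, here is a minimal predict/weight/resample loop for a single person moving on the ground plane. It illustrates only the generic framework, not the search-guided method of the third approach; the motion model, noise levels, and observation model are all assumptions for the sketch.

```python
# Minimal particle filter for one person on the ground plane (illustrative).
import numpy as np

rng = np.random.default_rng(2)
n_particles = 500
particles = rng.uniform(-5, 5, size=(n_particles, 2))  # ground-plane hypotheses

true_pos = np.array([0.0, 0.0])
for t in range(20):
    true_pos = true_pos + np.array([0.2, 0.1])     # the person walks
    obs = true_pos + rng.normal(0, 0.3, 2)         # noisy detection

    particles += rng.normal(0, 0.3, particles.shape)  # predict: random walk
    d2 = np.sum((particles - obs) ** 2, axis=1)       # weight: obs likelihood
    w = np.exp(-d2 / (2 * 0.3 ** 2))
    w /= w.sum()
    idx = rng.choice(n_particles, n_particles, p=w)   # resample
    particles = particles[idx]

estimate = particles.mean(axis=0)
print(np.round(estimate, 2))  # tracks the walking person
```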
[1] Elgammal, A., and Davis, L.S., Probabilistic framework for segmenting people under occlusion, International Conference on Computer Vision, 2001, pp. 145-152.
[2] Mittal, A., and Davis, L.S., M2 Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene, International Journal of Computer Vision.
[3] Kim, K., and Davis, L.S., Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering, European Conference on Computer Vision, May 2006.
[4] Hu, W., Hu, M., Zhou, X., Tan, T., and Maybank, S.J., Principal axis-based correspondence between multiple cameras for people tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 663-667, 2006.