Automated pedestrian detection, counting and tracking have received significant attention in the computer vision community of late. As such, a variety of techniques have been investigated using both traditional 2D computer vision techniques and, more recently, 3D stereo information. However, to date, a quantitative assessment of the performance of stereo-based pedestrian detection has been problematic, mainly due to the lack of standard stereo-based test data and an agreed methodology for carrying out the evaluation. This has forced researchers into making subjective comparisons between competing approaches. In [1], we propose a framework for the quantitative evaluation of a stereo-based pedestrian detection system. The synthetic and real-world test data sets from that paper are freely available to download via the links provided below. This allows researchers to benchmark systems, not only with respect to other stereo-based approaches, but also with respect to more traditional 2D approaches.
The robust segmentation and tracking of pedestrians under unconstrained conditions introduces a multitude of complicating factors that make it one of the most challenging problems in computer vision. These challenges include occlusions, varying environmental conditions, lack of intensity variations between objects and the huge amount of possible variations in the orientation, pose, size and appearance of humans in a given scene. We believe that such issues must be reflected in data-sets in order to make significant progress. Therefore, many of the data sets provided on this webpage are designed to incorporate a similar high level of difficulty to that which arises in real-world pedestrianised scenarios. In addition to incorporating multiple levels of difficulty into an evaluation, we also believe that approaches should be evaluated at both the component and system level. For stereo-based approaches this means not only evaluating final system output, but also disparity estimation. Therefore datasets for evaluating both disparity estimation and final system-level output of a pedestrian detection technique are provided. See [1] for more details.
In order to quantitatively evaluate a disparity estimation approach within a stereo-based pedestrian detection algorithm, a data-set with groundtruth disparities is required. Unfortunately, standard disparity data-sets, such as the Tsukuba, Venus, or Map data-sets, may not be applicable to pedestrian detection algorithms. This can be due to the lack of a groundplane region within the scene, which is a constraint required in some approaches, or the inability to obtain the background models required by a proposed algorithm. In addition, as pedestrian detection techniques are generally designed for real-world pedestrianised scenes, it would be advantageous to determine the robustness of the disparity estimation technique within this context for a range of challenging scenarios such as varying lighting conditions, shadows, a lack of texture at depth discontinuities, homogeneous foreground and background regions, and, most importantly, pedestrians exhibiting a variety of clothing, orientations, speeds of movement, distances and scales.
We have developed a new synthetic data-set designed to incorporate a number of difficulties associated with typical pedestrianised scenarios. This data-set consists of 8 scenes, where each 3D scene was designed to incorporate a flat groundplane, one or more background objects, and a number of varying pedestrian models. Throughout the data-set, a variety of texture maps were chosen for the groundplane. These vary from textured or tiled surfaces to a single homogeneous colour. Finally, a variety of ambient and directional lighting sources were introduced, designed to mimic the lighting conditions of both indoor and outdoor scenarios. Depending on the lighting conditions, shadows (both cast- and self-shadows) range from subtle to strong.

Each 3D scene is rendered from each virtual camera viewpoint twice; the first rendering contains no foreground objects, and can be used to initialise background models -- see the images below. The groundtruth disparity maps for the two renderings of a synthetic scene can also be seen in the images below.
Using these synthetic scenes and their respective groundtruths, a proposed disparity estimation technique can be evaluated quantitatively and benchmarked against other disparity estimation techniques. We recommend using the Middlebury College open source stereo algorithm evaluation test bed for this benchmarking. The Middlebury framework provides a standalone C++ implementation of many stereo algorithms, from which a large variety of algorithms can be applied to each of the 16 image-pairs. In addition, it provides a module for the quantitative evaluation of disparity results. See [1], [2] for more details. The results of our disparity estimation evaluation from these synthetic image-pairs can be viewed here.
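As an illustration of what such a quantitative evaluation measures, two statistics commonly reported for disparity maps are the RMS disparity error and the fraction of "bad" pixels whose error exceeds a threshold. The following is a minimal sketch only, not the Middlebury code: the function name, the nested-list image representation, the default threshold of 1.0 pixel and the optional `invalid` marker are all assumptions made for the example.

```python
import math


def disparity_errors(estimated, groundtruth, bad_threshold=1.0, invalid=None):
    """Return (rms_error, bad_pixel_fraction) for two disparity maps.

    estimated, groundtruth: equally sized 2D lists of disparities in pixels.
    bad_threshold: absolute error above which a pixel counts as "bad"
                   (assumed default of 1.0 pixel).
    invalid: optional groundtruth value marking pixels to skip (assumption;
             by default every pixel is evaluated).
    """
    squared_sum = 0.0
    bad = 0
    count = 0
    for est_row, gt_row in zip(estimated, groundtruth):
        for est, gt in zip(est_row, gt_row):
            if invalid is not None and gt == invalid:
                continue
            err = abs(est - gt)
            squared_sum += err * err
            if err > bad_threshold:
                bad += 1
            count += 1
    if count == 0:
        raise ValueError("no valid groundtruth pixels to evaluate")
    return math.sqrt(squared_sum / count), bad / count
```

Computing both statistics over the whole image is the simplest choice; restricting them to particular regions (for example near depth discontinuities) would only require a different pixel mask.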
Each of the synthetic scenes can be downloaded using the links provided in the table below. Each tarball provides the following for its scene: (1) all required stereo rig calibration data; (2) the right and left stereo images (in ppm format) for both renderings of the scene; (3) the right and left groundtruth disparity images (in ppm format) for both renderings of the scene; and (4) the bounding boxes of each pedestrian for use in "Methodology 1: 2D Pedestrian Detection Evaluation" (see the following section).
| 3D Scene | # Pedestrians | Download |
|---|---|---|
| 1 | 12 | Download (5.5Mb) |
| 2 | 11 | Download (5.5Mb) |
| 3 | 8 | Download (3.1Mb) |
| 4 | 12 | Download (3.4Mb) |
| 5 | 11 | Download (4.6Mb) |
| 6 | 14 | Download (5.8Mb) |
| 7 | 16 | Download (5.3Mb) |
| 8 | 13 | Download (4.7Mb) |
| Total | 97 | |
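The stereo and groundtruth images in each tarball are in ppm format. As a rough sketch, a binary PPM can be read with the Python standard library alone. This example assumes the files are 8-bit binary ("P6") PPMs, which is not stated above, and the function name is hypothetical. How pixel values in the groundtruth images map to disparities is not specified here either, so that should be checked against the data and [1].

```python
def read_ppm(data):
    """Parse an 8-bit binary (P6) PPM held in a bytes object.

    Returns (width, height, pixels), where pixels is a list of rows and
    each row is a list of (r, g, b) integer tuples.
    """
    tokens = []
    pos = 0
    # The header holds four whitespace-separated tokens: magic number,
    # width, height and maxval. A '#' starts a comment up to end of line.
    while len(tokens) < 4:
        byte = data[pos:pos + 1]
        if byte == b"":
            raise ValueError("truncated PPM header")
        if byte.isspace():
            pos += 1
        elif byte == b"#":
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
        else:
            start = pos
            while pos < len(data) and not data[pos:pos + 1].isspace():
                pos += 1
            tokens.append(data[start:pos])
    pos += 1  # a single whitespace byte separates maxval from the raster
    magic = tokens[0]
    width, height, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    if magic != b"P6" or maxval > 255:
        raise ValueError("only 8-bit binary (P6) PPM is supported")
    raster = data[pos:pos + 3 * width * height]
    if len(raster) != 3 * width * height:
        raise ValueError("truncated PPM raster")
    pixels = []
    for y in range(height):
        row = []
        for x in range(width):
            offset = 3 * (y * width + x)
            row.append(tuple(raster[offset:offset + 3]))
        pixels.append(row)
    return width, height, pixels
```

A file would be read with something like `read_ppm(open(path, "rb").read())`, where `path` points at one of the extracted images.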
In order to quantitatively evaluate the final system-level output of a pedestrian detection technique, we propose the application of two differing methodologies. The first technique evaluates the proposed approach using 2D image plane comparison techniques -- this technique could also be used by monocular approaches to pedestrian detection. The second technique evaluates the proposed pedestrian detection technique using 3D groundtruth information. See [1] for more details.
Methodology 1: 2D Pedestrian Detection Evaluation
The first methodology proposed to evaluate a pedestrian detection technique is based on traditional 2D image plane comparison techniques. In this approach, stereo test sequences are manually groundtruthed by positioning a separate bounding box around each person in an image. In this process, a person is defined as someone who has a section of their body above the waist, no matter how small, visible in the image. If all that can be seen of a person in the image is an outstretched hand or part of a backpack, then they are counted as being present. However, if just a leg or foot is present, then they are not counted. It should be noted that, due to camera offsets in the stereo rig, not all pedestrians visible in one camera are visible in the other (especially at the edges of the images). For this reason, only the right camera of the stereo rig is used, and all groundtruths are created with respect to this camera regardless of whether or not the pedestrian appears in the alternative image. The only other constraint for creating groundtruth regions was that people who are further than 8 metres from the camera are not considered valid pedestrians (however, this only affects the Corridor sequence).
Using this approach, a proposed pedestrian detection system may be evaluated via precision and recall metrics, which can be determined using the groundtruth bounding boxes and the bounding boxes of pedestrians detected by a proposed system. See [1] for more details. Three real-world stereo test sequences (which are rectified and synchronised) from three scenarios were created and groundtruthed using this methodology. Each of these sequences plus their respective groundtruths can be downloaded using the links provided in the table below. In each sequence, all the images are rectified and in png format. In addition, each sequence tarball contains the timing information of the image sequence, a background image (if required) and a number of images which can be used to calibrate the groundplane via a calibration shape. An overview of each of the three evaluation test sequences is now given.
The Overhead scenario is set in an indoor environment with the camera positioned at around 3 metres above the ground and orientated back towards the groundplane. The camera has a limited field of view and due to its proximity with the groundplane it does not encounter significant occlusion problems. The lighting conditions in the scene are stable. The scene is brightly illuminated with a highly reflective ground surface.
The Corridor scenario is set in an indoor environment with the camera positioned just above 2 metres from the ground and orientated at 30 degrees towards the groundplane. The lighting conditions are relatively stable, however the ambient lighting of the scene does fluctuate more than the previous scenario due to a number of skylights overhead. The scene's illumination is more challenging than that of the Overhead sequence as it is brightly illuminated on one side, and dark on the other side, again due to the skylights. Finally, the scene contains a staircase on the right hand side, where people descend and ascend at will.
The DCU Corner scenario is set in an outdoor environment with a camera setup similar to that of the Corridor scenario. As this is an outdoor environment, the ambient lighting of the scene does fluctuate; however, due to fairly constant cloud cover, the magnitude of these changes is relatively minor. This scenario is intended to mimic a busy pedestrianised shopping area. Pedestrians in the scene walk in multiple directions with respect to the camera. Some change direction whilst walking, either by their own free will or by necessity in order to avoid persons travelling towards them. In addition, persons can be seen to enter and exit the scene at the same time in close proximity, whilst others may be somewhat occluded for some or all of their time in front of the camera. Regarding pedestrian appearance, some are drinking coffee whilst walking, others use mobile phones, and others carry shoulder bags.
| Sequence | # Pedestrians | Frames | Frame Rate (Hz) | Time (in mins.) | Download |
|---|---|---|---|---|---|
| DCU Corner | 5126 | 828 | ≈ 4.6 | 3.02 | Download (465Mb) |
| Overhead | 657 | 418 | ≈ 6.5 | 1.10 | Download (823Mb) |
| Corridor | 1027 | 697 | ≈ 5.3 | 2.26 | Download (345Mb) |
| 2D Total | 6810 | 1943 | ≈ 4.6--6.5 | 6.38 | |
It should be noted that the above table provides the average frame rates. In all the test sequences, the latency between frames varied due to the software used to stream the images from the Digiclops camera to a computer hard disk. In our data-sets, minimum and maximum latency between frames were recorded at 0.08 and 1.5 seconds respectively.
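The precision and recall evaluation of Methodology 1 can be illustrated with a short sketch. The matching rule used here is an assumption made for the example and not necessarily the criterion defined in [1]: a detection is paired with a groundtruth box when their intersection-over-union is at least 0.5, and pairs are assigned greedily, best overlap first, so that each box is used at most once. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def count_matches(detections, groundtruth, min_iou=0.5):
    """Greedy one-to-one matching; returns the number of true positives."""
    candidates = sorted(
        ((iou(d, g), i, j)
         for i, d in enumerate(detections)
         for j, g in enumerate(groundtruth)),
        reverse=True,
    )
    used_det, used_gt = set(), set()
    true_positives = 0
    for score, i, j in candidates:
        if score < min_iou:
            break  # remaining candidates overlap even less
        if i in used_det or j in used_gt:
            continue
        used_det.add(i)
        used_gt.add(j)
        true_positives += 1
    return true_positives


def precision_recall(frames, min_iou=0.5):
    """frames: iterable of (detections, groundtruth) box lists per frame.

    Returns (precision, recall) accumulated over all frames; a ratio with
    a zero denominator is reported as 0.0 (a convention chosen here).
    """
    tp = n_det = n_gt = 0
    for detections, groundtruth in frames:
        tp += count_matches(detections, groundtruth, min_iou)
        n_det += len(detections)
        n_gt += len(groundtruth)
    precision = tp / n_det if n_det else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall
```

Accumulating counts over the whole sequence, rather than averaging per-frame ratios, keeps frames with few pedestrians from dominating the result.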
We also propose to evaluate a system using a second methodology based on groundtruth data captured using a 3D Vicon infrared motion analysis system. The Vicon system is an automated motion capture system that tracks the 3D position of infrared reflective markers with a high degree of accuracy (up to 1mm in a 6m space). Using this groundtruth data we propose a technique for evaluating stereo-based pedestrian detection techniques with respect to precision, recall, and 3D positional and height statistics -- see [1] for more details. For our experiments 12 Vicon cameras were employed and five different test sequences (which are rectified and synchronised) from the Vicon scenario (see below) were recorded. The first sequence, Vicon 1, consisted of 1 person; the second sequence, Vicon 2, consisted of 2 people; the third sequence, Vicon 4, consisted of 4 people; and the final two sequences, Vicon 8A and Vicon 8B, each consisted of 8 people.
The Vicon scenario is set in an indoor environment with the camera positioned just above 2 metres from the ground. This setup is similar to that of the Corridor scene with regard to camera placement and orientation. The lighting conditions are stable; however, as with the Overhead scene, the groundplane is highly specular and brightly illuminated. In this scenario, a number of people (between 1 and 8) are constrained to an ellipse of 3.15 by 5.5 metres in width and length respectively. Therefore, unlike all of the previous sequences, the number of persons in the scene remains constant. It should be noted, however, that at times not all of them are visible due to occlusions.
| Sequence | # Pedestrians | Frames | Frame Rate (Hz) | Time (in mins.) | Download |
|---|---|---|---|---|---|
| Vicon 1 | 198 | 198 | ≈ 5.4 | 0.86 | Download (282Mb) |
| Vicon 2 | 526 | 263 | ≈ 5.5 | 0.94 | Download (307Mb) |
| Vicon 4 | 1296 | 324 | ≈ 5.3 | 1.18 | Download (363Mb) |
| Vicon 8A | 2104 | 263 | ≈ 5.4 | 0.96 | Download (306Mb) |
| Vicon 8B | 2120 | 265 | ≈ 5.4 | 1.00 | Download (315Mb) |
| 3D Total | 6244 | 1313 | ≈ 5.3--5.5 | 4.94 | |
It should be noted that the above table provides the average frame rates. In all the test sequences, the latency between frames varied due to the software used to stream the images from the Digiclops camera to a computer hard disk. In our data-sets, minimum and maximum latency between frames were recorded at 0.08 and 1.5 seconds respectively.
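For the 3D methodology, positional and height statistics can be computed once each detection has been paired with its Vicon groundtruth. The sketch below assumes that pairing has already been done, and that both detections and groundtruth are expressed as (x, y, height) in metres in a common groundplane coordinate system. The choice of mean Euclidean position error and mean absolute height error is illustrative, not necessarily the statistics reported in [1], and the function name is hypothetical.

```python
import math


def positional_height_stats(pairs):
    """Summarise 3D accuracy for matched detections.

    pairs: list of (detected, groundtruth) tuples, where each element is
           (x, y, height) in metres on a shared groundplane (assumption).
    Returns (mean_position_error, mean_abs_height_error) in metres.
    """
    if not pairs:
        raise ValueError("no matched pairs to evaluate")
    position_error = 0.0
    height_error = 0.0
    for (dx, dy, dh), (gx, gy, gh) in pairs:
        # Euclidean distance between the two groundplane positions.
        position_error += math.hypot(dx - gx, dy - gy)
        height_error += abs(dh - gh)
    n = len(pairs)
    return position_error / n, height_error / n
```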
The results of the disparity estimation evaluation from the 16 synthetic stereo image-pairs can be viewed here.
The results of the final pedestrian detection and tracking system from the 8 test sequences are available on-line here.
[1] P. Kelly, N.E. O'Connor and A.F. Smeaton, "A Framework for Evaluating Stereo-Based Pedestrian Detection Techniques", IEEE Transactions on Circuits and Systems for Video Technology, Volume 18, Issue 8, Pages 1163-1167, August 2008.
[2] P. Kelly, "Pedestrian detection and tracking using stereo vision techniques", Ph.D. dissertation, Dublin City University (DCU), 2007. [Online] Download here.