Xiaoxu Ma    W. Eric L. Grimson
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, MA 02139, USA
xiaoxuma@csail.mit.edu    welg@csail.mit.edu

Abstract
In this paper we propose an approach to vehicle classification under a mid-field surveillance framework. We develop a repeatable and discriminative feature based on edge points and modified SIFT descriptors, and introduce a rich representation for object classes. Experimental results show that the proposed approach is promising for vehicle classification in surveillance video despite significant challenges such as limited image size and quality and large intra-class variation. Comparisons demonstrate that the proposed approach outperforms other methods.

1. Introduction

Visual object recognition aims to classify observed objects into semantically meaningful categories. In this paper we focus on vehicle classification in a mid-field video surveillance framework with a single static, uncalibrated camera.

Several scenarios motivate our work. Activity monitoring around vital assets (embassy protection, port facility protection) often involves categorizing patterns of behavior, both to monitor the normal flow of activity and to serve as a baseline for detecting possibly anomalous behavior. Such categorization is based in part on the trajectories of moving objects, but also depends on the type of object. Hence it is of value to categorize objects by type, including subclasses of types. For example, trucks and vans may not be expected to visit certain parts of a site; a sedan approaching a person may indicate an arranged pick-up, whereas a taxi may simply correspond to a person leaving. In multi-camera settings, it is important to correlate activities across many different fields of view, which requires establishing correspondence between observations in non-overlapping views. Again, there is a need to classify objects into subclasses to support this determination of correspondence.

Compared with object recognition from still images, the fact that a surveillance framework deals with video sequences simplifies the recognition task in several ways. Moving objects can be separated from a static background reasonably well by background modeling and subtraction, so the problem of clutter can be minimized. Similarly, variation in scale is not a major challenge, since objects can be extracted and normalized.

However, there are still great challenges to this problem. Vehicles are generally textureless. Limited object image size and quality pose particular difficulties. Varying lighting conditions in video surveillance further complicate the problem. The requirement to distinguish similar classes, such as sedans vs. taxis, makes the problem even harder. To tackle these challenges, this paper introduces an edge-based rich representation, which yields finer categorizations by modeling more details and improves robustness through over-complete information. The proposed approach augments edge points into repeatable and discriminative features, combines several existing techniques with modifications that fit them better to the problem at hand, and produces models that perform sufficiently well to serve the purposes discussed above. Considering our applications, we focus on a fixed view angle. Our method achieves a 1.5% average error rate on cars vs. minivans classification; for even more similar object types, such as sedans vs. taxis, it gives only a 4.24% error rate.
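To make the background modeling and subtraction step concrete, the sketch below uses a per-pixel temporal median as the background model and thresholded frame differencing to extract a moving object's silhouette and bounding box. This is a minimal illustration on synthetic frames, not the authors' implementation; the median model, threshold value, and all function names are our own assumptions.

```python
# Minimal background-subtraction sketch for a static camera (illustrative only).
import numpy as np

def estimate_background(frames):
    """Model the static background as the per-pixel temporal median."""
    return np.median(frames, axis=0)

def foreground_mask(frame, background, thresh=25):
    """Mark pixels that differ from the background model beyond a threshold."""
    return np.abs(frame.astype(int) - background.astype(int)) > thresh

def bounding_box(mask):
    """Tight bounding box of the foreground region: (row0, row1, col0, col1)."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return int(r0), int(r1) + 1, int(c0), int(c1) + 1

# Synthetic sequence: uniform gray background with one bright 10x10 "vehicle"
# moving left to right across ten 64x64 frames.
frames = np.full((10, 64, 64), 100, dtype=np.uint8)
for t in range(10):
    frames[t, 30:40, 5 * t:5 * t + 10] = 200

bg = estimate_background(frames)            # recovers the static background
mask = foreground_mask(frames[9], bg)       # silhouette of the moving object
print(bounding_box(mask))                   # -> (30, 40, 45, 55)
```

Once the bounding box is known, the object image can be cropped and rescaled to a canonical size, which is why variation in scale is not a major obstacle in this setting.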
1.1. Related work
Researchers have investigated various 3D-model-based approaches to object recognition [8, 11, 16, 22]. These methods require geometric measurements such as edge/surface normals [8], saliency-based grouping of lines or curves [10, 11, 16, 22], or solving the 3D-to-2D projection [11, 16]. Such requirements become less well-posed for vehicle recognition in a surveillance framework, where images are of limited size and quality.