Computer Vision Methods for Robot Tasks: Motion Detection, Depth Estimation and Tracking

Wednesday, 15 June, 2011
A long-term goal of robotics research is to build robots that behave, and even look, like human beings. To work with and for people, robots should be capable of interacting autonomously with human-populated, everyday environments. To that end, they are equipped with sensors, which allow them to perceive their environment, and with actuators, which carry out actions in response to those percepts.
Among the senses, vision is undoubtedly the most important because of the information it can provide: knowledge about physical objects and scenes can be obtained from visual input. However, modelling the 3D world from 2D information is still a challenge. For that reason, among the wide variety of issues to be solved, we have focused on perceiving the environment to obtain information about nearby objects and humans, using motion as a primary cue. Motion plays a central role because it provides a stimulus for detecting moving objects in the observed scene. Moreover, motion makes it possible to derive other characteristics, such as object shape, speed or trajectory, which are meaningful for detection and recognition. Nevertheless, the motion observable in visual input may be due to different factors: movement of the imaged objects (targets and/or vacillating background elements), movement of the observer, motion of the light sources, or a combination of some or all of them. Image analysis for motion detection therefore depends on which of these factors are considered. In particular, this Thesis focuses on motion detection, location and tracking in images captured by perspective and fisheye cameras. Note that the cameras are static, that is, egomotion is not considered, although all the other factors can occur at any time.
Under that assumption, we propose a complete, sensor-independent visual system that provides robust target motion detection, location and tracking. First, the way sensors acquire images of the world, in terms of resolution distribution and pixel neighbourhood, is studied so that a proper spatial analysis of motion can be carried out. Then, a novel background maintenance approach for robust target motion detection is implemented. Two different situations are considered: (1) a fixed camera observes a constant background in which objects of interest are moving; and (2) a still camera observes objects of interest moving against a dynamic background. The purpose of this distinction is to develop, from the first analysis, a surveillance mechanism that removes the need to observe a scene free of foreground elements for several seconds while a reliable initial background model is obtained, since that condition cannot be guaranteed when a robotic system works in a dynamic, unknown environment. Furthermore, in pursuit of an ideal background maintenance system, other canonical problems are addressed, so that the proposed approach successfully deals with gradual and global changes in illumination, the distinction between foreground and background elements whether they are moving or motionless, and non-uniform vacillating backgrounds, among others.
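To make the idea of background maintenance concrete, the following is a minimal sketch of a generic running-average background model with selective update. It is not the novel approach proposed in this Thesis; the function names, the threshold and the learning rate alpha are all assumptions chosen for illustration, and images are represented as plain nested lists of grey levels.

```python
from typing import List

Frame = List[List[float]]  # greyscale image as rows of pixel intensities
Mask = List[List[bool]]    # True where motion is detected


def detect_motion(background: Frame, frame: Frame, threshold: float = 25.0) -> Mask:
    """Flag pixels that differ from the background model by more than `threshold`.

    `threshold` is an illustrative value, not one taken from the Thesis.
    """
    return [
        [abs(pixel - model) > threshold for pixel, model in zip(frame_row, model_row)]
        for frame_row, model_row in zip(frame, background)
    ]


def update_background(background: Frame, frame: Frame, mask: Mask,
                      alpha: float = 0.05) -> Frame:
    """Blend the new frame into the model, but only at pixels without motion.

    Skipping foreground pixels keeps moving targets out of the model, while the
    slow blend (rate `alpha`, an assumed value) follows gradual illumination changes.
    """
    return [
        [model if moving else (1.0 - alpha) * model + alpha * pixel
         for pixel, model, moving in zip(frame_row, model_row, mask_row)]
        for frame_row, model_row, mask_row in zip(frame, background, mask)
    ]
```

In a processing loop, each incoming frame would first be passed to detect_motion and the resulting mask then used in update_background. A simple model of this kind still needs a foreground-free start and does not handle vacillating backgrounds or sudden global illumination changes, which are among the canonical problems mentioned above.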
Once moving objects are robustly detected in an image, an object representation is designed in order to track the identified targets over time. Each detected target should be represented in a way that allows the correspondence between its images at different time steps to be properly established. Note that the same object can take on countless appearances, and changes in image contrast, intensity or colour can lead to a mismatch. As a solution, we propose an invariant representation of an object. It identifies an object among a broad range of objects, even when targets leave and re-enter the scene, by being robust to partial (or total) occlusions and capable of learning new targets from a single frame. Moreover, the designed representation includes a feature array that helps to discard false matches and to make correct decisions. In this regard, depth is a key element in robotics applications such as navigation, to avoid obstacles; manipulation, to grasp objects; or safe interaction, to avoid damaging any other element, especially when it is a human being. Therefore, depth is studied from two different points of view, depending on the level of accuracy required at any given time. On the one hand, the active vision paradigm is used to build a relative representation of objects that are actively bound in time for the task at hand. On the other hand, a reasoning inference process for distance estimation from visual input is presented. It provides context-dependent comparisons that help the system decide which objects it has to focus on.
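As a rough illustration of how a feature array and a depth estimate might jointly discard false matches, the sketch below matches a newly detected target against known ones. It is not the representation designed in this Thesis: the Target fields, the Euclidean feature distance and both thresholds are hypothetical choices made only for the example.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Sequence


@dataclass
class Target:
    label: str
    features: Sequence[float]  # hypothetical appearance descriptor
    depth: float               # hypothetical estimated distance to the camera


def best_match(candidate: Target, known: List[Target],
               max_feature_dist: float = 0.5,
               max_depth_gap: float = 1.0) -> Optional[Target]:
    """Return the known target most similar to `candidate`, or None.

    Targets whose depth differs too much from the candidate are discarded
    first; the nearest remaining descriptor must also be close enough.
    Both thresholds are assumed values.
    """
    plausible = [t for t in known if abs(t.depth - candidate.depth) <= max_depth_gap]
    if not plausible:
        return None
    nearest = min(plausible, key=lambda t: dist(t.features, candidate.features))
    if dist(nearest.features, candidate.features) > max_feature_dist:
        return None
    return nearest


def by_proximity(targets: List[Target]) -> List[Target]:
    """Order targets from closest to farthest according to estimated depth."""
    return sorted(targets, key=lambda t: t.depth)
```

A candidate for which best_match returns None could be treated as a new target to be learned, and by_proximity gives one simple way to let the system attend first to the objects closest to it.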
The methods proposed in this Thesis provide important advances over state-of-the-art computer vision approaches in terms of robot reliability. The motion detection algorithm allows the system to adapt well to its environment, as it properly deals with most of the vision problems that arise in dynamic, unstructured environments. In addition, the proposed object representation satisfies the requirements of a recognition task, even when movements are accompanied by changes in shape and/or size. Moreover, integrating depth estimation into the system improves the matching process and helps to arrange targets in order of proximity, so that the system can focus on the objects closest to it. All these contributions are validated through an extensive set of experiments and applications using different testbeds of real environments with real and/or virtual targets.