3

The Active Vision Idea

3.1 WHAT IS ‘ACTIVE’?

The word ‘active’ naturally appears as the opposite of ‘passive’, and it carries with it the idea of ‘action’. The point is: who is the actor, and what is acted upon? The purpose of a vision system, be it natural or artificial, seems to be the perception of an external physical scene without implying any influence on it. In other words, we could think that keeping our eyes open, or acquiring data from a camera, and processing the data so collected is exactly what we need to visually ‘perceive’ the world. We will discuss these supposedly sufficient conditions later on. For now we want to make clear from the beginning that by active vision we mean the approach followed by a system that manages its own resources in order to simplify a perceptual task. The resources on which the observer builds its perceptual abilities comprise visual sensors and processing devices, but also other types of sensors, actuators and even illumination devices, if needed.

It is important to distinguish between active sensing and active vision. A sensor is considered active if it conveys some sort of energy into the external environment in order to measure, through the reflected energy, some physical property of the scene; examples are laser scanners and ultrasonic range finders. An active observer may use either passive or active sensors, or both, but the concept of activity we are introducing relates to the observer, not to the sensor.

The active observer influences acquisition parameters such as camera position and orientation, lens focus, aperture and focal length, and sampling resolution. It also manages its computational resources by devoting computing power and processes to critical sub-tasks or processes useful for reaching a goal. In other words, active vision has to deal with the control of acquisition parameters, with selective attention to phenomena that take place in the spatio-temporal surroundings, and with the planning of activities to perform in order to reach a goal. The main point here is that an intelligent observer should make use of all its resources in an economical and integrated way to perform a given task. By taking this point of view, as we do, we can no longer consider vision as an isolated function.

3.2 HISTORICAL FRAME

The history of research in computer vision starts at the beginning of the 1960s, together with the history of artificial intelligence and, ultimately, with the increasing availability of computing resources and languages. The active vision paradigm, however, was made explicit and began spreading in the scientific community only in the second half of the 1980s. Why? There are several reasons, but we can group them into two classes: technological limitations and cultural influences.

Let us start with technological limitations. Useful computer vision has to face the difficulties posed by non-trivial, real-world images. A single image provides a significant amount of data, which becomes enormous if we have to deal with an image sequence in a situation that involves motion, and we want to approach real-time performance. Moreover, anything can happen in an image: not only noise, but the extreme variability of the real world constitutes a very difficult problem at the computational-theory and algorithmic levels (Marr, 1982). So, in the search for a solution, a heavier and heavier burden is laid on the available computing resources. For these reasons the predominant approach of the past considered a static observer trying to recognize objects by analysing monocular images or, at most, stereo pairs. Motion was seen as a difficult problem that it was wise to avoid whenever possible, as, for example, in object recognition and scene understanding tasks in a static environment. It was thought that observer motion would further complicate these already difficult tasks. This intuition would later be proved wrong.

Beyond that, acting on camera extrinsic and intrinsic parameters posed technical and cost problems connected with the limited availability of lightweight TV cameras, efficient and easy-to-control motor systems, motorized lenses, fast acquisition devices and communication channels, and multi-processor systems.

The other class of reasons for not considering motion stemmed from the origin of computer vision, rooted in symbolic artificial intelligence. Inspired by the traditional AI perspective, vision was then mostly seen as an input functionality for the higher reasoning processes. The visual act was thus decomposed into successive layers of computation, organized in an open-loop hierarchy.

At the lowest level were processes devoted to noise reduction, image enhancement, edge-point extraction, stereo disparity calculation, and so on. At an intermediate level were processes that extracted and organized feature sets from one or more intrinsic images. At the highest level these feature sets were compared with internal models in order to recognize objects and to reason about the physical reality that produced the image(s) and that had been ‘reconstructed’ by the previous processes.

Unfortunately, the task revealed itself to be much more difficult than expected. At the beginning of research in artificial vision it was perhaps conceivable to tacitly accept the naive idea that vision, being so easy and effortless for us, would be almost as easy for computers; this idea quickly dissolved under the disillusionment of the first attempts at real image analysis tasks.

One of the reasons that make vision impenetrable is the fact that the important processes that give rise to perception in humans escape the domain of introspection. So, while symbolic AI has been inspired by the logical paradigm of thought, it very soon became clear that the same might not hold satisfactorily for perception.

One could ask whether it has been satisfactory or successful at all for AI itself; but vision goes beyond the difficulties shared with AI by adding the puzzling involvement of the so-called ‘primary processes’ that account for perceptual organization and interpolation. As gestaltist researchers have pointed out (Kanizsa, 1979), the primary processes themselves have an active character.

Among the merits of the active approach to computer vision is the recognition of the fact that visual capabilities should be considered and designed as strictly integrated within the global architecture of an intelligent autonomous system.

The very same control loop of perception and action, typical of the active vision approach, seems to point toward a stricter relationship between ‘vision’ and ‘thought’. The same relationship had previously been recognized by psychologists who refused the artificially sharp separation between the two concepts (Kanizsa, 1979). The active approach raises vision to confront artificial intelligence on an equal footing, by recognizing the importance of processes like categorizing, hypothesis formation and testing, and action planning. This approach to computer vision emerged after the difficulties encountered by the reconstructionist school, which flourished in the eighties.

3.3 THE RECONSTRUCTIONIST SCHOOL

The reconstructionist school, which found its highest expression in the influential work by Marr (1982), long neglected the importance of the observer as being immersed in the scene and interacting with it in ways much more complex than the simple acquisition of images. Nevertheless, an important part of Marr’s contribution, perhaps the most important one, is the rigorous computational approach to vision. In this perspective he stated a clear subdivision of the solution of a problem into three levels: the computational theory, the algorithmic level and the implementation.

In the work by Marr and the other researchers who followed his approach, attention is explicitly focused on what is present in a scene and where it is, and on the process of extracting this information from images. To know what is present in a scene means to recognize objects, and the most distinctive characteristic that allows us to discriminate perceptually between different objects is shape.

Shape is difficult to define, but it is easy to accept its relationship with the topological and geometrical properties of the object. In the same way, to know where an object is can mean to know its geometrical relationship with the three-dimensional world in which it is placed. Thus, as an obvious consequence, we see the importance given to the problem of obtaining a good geometrical representation of the viewed scene. This approach has been called reconstructionist because it tries to reconstruct the three-dimensional structure of the scene from its two-dimensional projections on images.

It has given rise to many sophisticated algorithms sometimes collectively referred to as ‘shape from X’, where ‘X’ can be instantiated, for example, by ‘shading’, ‘contour’ or ‘texture’. These techniques recover the local three-dimensional shape of the surface of an object from two-dimensional image cues.

Problems arise from several factors, some of which are:

• The image formation process itself, that is, the projection of a three-dimensional scene onto a two-dimensional surface, which entails a loss of depth information.

• The uncertainty about the number and direction of the illuminants.

• The variability of the reflectance characteristics of object surfaces.

• The complex interaction among local reflectance, surface orientation, and illumination and observation directions.

Due to these and other factors, many shape from X problems suffer from ill-posedness, nonlinearity, instability, or a combination of these.

3.4 ACTIVE RECONSTRUCTION

In a series of seminal papers – the first presented in 1987 – Aloimonos and others (Aloimonos et al., 1987; Aloimonos et al., 1988; Aloimonos and Shulman, 1989) demonstrated how a controlled motion of the sensor can considerably reduce the complexity of the classical vision problems of the ‘shape from X’ group cited above.

For all three of the cases cited above, the authors showed that adequately controlled observer motion results in a mathematical simplification of the computations.

Let us take for example shape from shading, that is, the process of inferring the local orientation of object surfaces from apparent luminance values (Horn, 1977; Horn, 1986). It turns out that if we want to solve it with a single monocular view, we need to make additional assumptions about the smoothness and uniformity of the reflectance characteristics of the surface, and about the uniformity of lighting. These often unrealistic assumptions limit the applicability of techniques and algorithms based on them, but they must be adopted to regularize a generally underconstrained, ill-posed problem like passive shape from shading. Things become easier with two images in a stereo configuration, leaving aside the correspondence problem, and even better with three images, but the problem is still nonlinear. Moreover, stereo images pose two different kinds of problems in short and long baseline configurations: a short baseline makes the correspondence problem easier and allows the equations to be linearized by keeping only up to the linear term of a Taylor series expansion, but the resulting accuracy is low; a long baseline reaches good accuracy but suffers from a difficult correspondence problem that can trap the search for a solution in a local minimum.
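To see where the missing constraints come from, here is a minimal sketch of the single-view problem, assuming the standard Lambertian setting (uniform albedo $\rho$ and a single distant light source in direction $\mathbf{s} = (s_1, s_2, s_3)$; an illustration, not the general case treated by the authors). Writing the surface as $z(x,y)$ with gradients $p = \partial z/\partial x$ and $q = \partial z/\partial y$, the image irradiance equation reads

\[
I(x,y) \;=\; \rho\,\mathbf{n}(x,y)\cdot\mathbf{s} \;=\; \rho\,\frac{-p\,s_1 - q\,s_2 + s_3}{\sqrt{1 + p^2 + q^2}}.
\]

Each pixel thus supplies one measurement but two unknowns $(p,q)$, plus $\rho$ and $\mathbf{s}$ when they are not known in advance: the problem is underconstrained at every point, and the smoothness and uniformity assumptions are what supply the missing equations.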

Aloimonos and colleagues presented an active technique that combines the advantages of both stereo configurations. This technique decouples the process of finding the unknowns (surface orientation and reflectance parameters, plus lighting distribution) at every single point in the image. The adopted reference system is fixed. The camera moves by shifting the position of the optical centre in the XY plane, which is parallel to the image plane. For the mathematical details the reader can refer to one of the papers (Aloimonos et al., 1987; Aloimonos et al., 1988; Aloimonos and Shulman, 1989).

The effect of camera motion is that more data are collected in a controlled way, such that the equations for the different surface points become coupled along time (subsequent camera positions) and are no longer coupled across adjacent points on the surface.

This gets rid of the old unrealistic assumptions of maximal surface smoothness and uniform reflectance and lighting. In this way it is possible to separate shape information from reflectance and lighting variations, thus detecting and measuring all of them. Mathematical and computational advantages of an active approach to other shape-from-X modules have been shown in a similar way.
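Schematically, and without reproducing the authors’ derivation: if each known camera position $k = 1, \dots, K$ yields an independent measurement $I_k$ of the same scene point, the unknowns of that point collect into a single local system

\[
I_k \;=\; f_k(p, q, \rho, \mathbf{s}), \qquad k = 1, \dots, K,
\]

with $K$ equations in a fixed number of unknowns and no terms linking the point to its neighbours. Choosing the motion so that the functions $f_k$ are well conditioned is precisely the kind of known, controlled constraint that replaces the smoothness assumptions.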

Given the underconstrained nature of shape reconstruction problems from intensity images, the strength of the active approach lies in substituting the unrealistic constraints artificially imposed on the nature of the outer scene with actual and perfectly known constraints given by controlled observer motion.

The main task of model-based object recognition, primarily targeted by shape from X techniques, has often been accompanied by automatic model extraction, in which the same techniques for surface reconstruction were used for building a CAD-like model of an unknown object. Chapter 4 of this book, by E. Trucco, gives a comprehensive treatment of both these tasks, along with an extensive bibliography.

We add here that the possibility of fine control of camera parameters can also overcome problems posed by difficult shapes like tori (Kutulakos and Dyer, 1994), and that the planning of multiple observations greatly improves the efficiency of object recognition techniques (Gremban and Ikeuchi, 1994; Hutchinson and Kak, 1989). These advantages can be summarized as the acquired abilities of eliminating ambiguities and of incrementally gathering several cues that contribute to the solution of the problem at hand.

3.5 PURPOSIVE VISION

Object recognition constitutes only one of the major visual problems of interest; other activities build on information obtained through vision. The most important one is navigation, that is, moving around in the world with the help of visual sensors. Moreover there are many problems, or subproblems, that an intelligent, autonomous, perceiving agent must frequently solve, and that need only some specific, task-dependent information to be extracted from images. Some of these problems are: obstacle detection and avoidance, detection of independent motion, tracking of a moving object, interception, hand-eye coordination, etc. Trying to solve each of these problems separately can lead to much more economical solutions and to the realization of effective working modules to be activated on demand when the need arises (Aloimonos, 1990; Ikeuchi and Hebert, 1990). This alternative approach has been called purposive vision, referring to the utilitarian exploitation of both system resources and surrounding-world information, or animate vision, with approximately the same meaning (Ballard and Brown, 1992).

We have already seen that an active approach can help in the recovery of the three-dimensional structure of the observed scene. Actually, even if the geometrical structure could be extracted through an analysis of image sequences under controlled observer motion, the practical implementation of a working structure-from-motion module remains difficult. Such a module would provide us with solutions to both the object recognition and the navigation problems, but the use we make of our visual system may not always entail complete scene reconstruction. Many tasks that make use of visual input need only some very specific information to be extracted. This means that not all visual information, and not every part or object present in a scene, should be extracted or analysed in detail, but only the single elements that are currently useful for the completion of the tasks at hand.

This approach has led some researchers to develop specific algorithms for solving specific sub-tasks such as detection of independent motion by a moving observer, tracking, and estimation of relative depth (Ballard and Brown, 1992; Huang and Aloimonos, 1991; Fermuller and Aloimonos, 1992; Sharma and Aloimonos, 1991). Tracking, in particular, has attracted the attention of many researchers because of its importance (Fermuller and Aloimonos, 1992; Coombs and Brown, 1991; the first seven papers in Blake and Yuille, 1992). There are different reasons for such attention. By 3D-tracking of a moving object, the observer can keep it in sharp focus at the centre of the image, in order to collect as much information as possible about it.

At the same time, it is easier to isolate the tracked object from the background through the motion blur of the latter. The tracking motion parameters of the observer give direct information on relative position and speed, useful for example in interception tasks. For the same reason, tracking a fixed environment point in the observer’s image space gives information on observer trajectory and speed, extremely useful for navigation. There is another important aspect to consider: tracking, or fixation, establishes a relationship between the reference frames of the observer and of the fixated object (Ballard and Brown, 1992). This helps in translating observer-centred descriptions extracted from the images into object-centred descriptions stored as models, thus smoothing the path to model-based object recognition.
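As a minimal illustration of the control loop such tracking implies (a deliberately simplified sketch with hypothetical names and gains, not the controller of Coombs and Brown, 1991, which must also handle delays and prediction), consider a proportional fixation step that keeps the target at the image centre:

```python
def fixation_step(target_px, centre_px, gain=0.1):
    """One step of a proportional gaze controller.

    target_px: (x, y) image position of the tracked object
    centre_px: (x, y) image centre (the fovea in a foveated sensor)
    Returns pan and tilt velocity commands that reduce the retinal error.
    """
    error_x = target_px[0] - centre_px[0]   # horizontal retinal slip
    error_y = target_px[1] - centre_px[1]   # vertical retinal slip
    return gain * error_x, gain * error_y   # drive the axes toward the target

# Closing the loop: the pan/tilt commands themselves encode the target's
# direction and angular speed relative to the observer, which is the
# information needed for interception or for estimating one's own motion.
pan_cmd, tilt_cmd = fixation_step((340, 250), (320, 240))
```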

Now, some questions arise. If there is an advantage in selecting the information to be extracted from the scene, what kind of information is best suited to the task and to the resources at hand? How should sensing be combined with action, vision with behaviour? How much explicit representation do we need? And what kinds of representations?

3.6 INTELLIGENT PERCEPTION AND THE ACTIVE APPROACH

By taking a systemic view of perception and perceiving agents, we should not overlook any of the ‘subsystems’ that contribute to the perceptual task. Let us consider a visual system that surely works: our own. We constantly move our eyes, head and body to act in the world, and we have other sensors, in our ears and muscles, that provide information about our movements; so observer motion cannot be such an obstacle on the road to vision; it is probably more of an advantage. It is certainly an effective way to collect an amazing amount of useful information while mastering the huge spatio-temporal dimension of the input data, by sequentially directing the fovea only to small regions of interest in the scene. This paradigm allows us to separate the what and where parts (Marr, 1982) of the classical image understanding problem.

Another aspect of the human visual system that has attracted researchers is the foveated eye. By distributing the individual sensing elements with varying resolution it is possible to keep the data rate at levels that are orders of magnitude lower than those of uniformly sampling sensors (Schwartz, 1977; Sandini and Tagliasco, 1980), while keeping the same high level of detail in the fixated region of interest. Again, these sensors would be of little use without active gaze-control capabilities.
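A rough back-of-the-envelope sketch of that data-rate saving, assuming an idealized log-polar layout (rings whose radii grow geometrically, a fixed number of cells per ring; all figures are illustrative, not taken from the cited papers):

```python
import math

def uniform_samples(width: int, height: int) -> int:
    """Sample count of a uniformly sampled sensor."""
    return width * height

def log_polar_samples(r_fovea: float, r_max: float,
                      n_sectors: int, growth: float) -> int:
    """Sample count of an idealized log-polar retina: rings from the
    foveal radius outwards, radii multiplied by `growth` at each ring,
    `n_sectors` cells per ring (the fovea itself is not counted)."""
    n_rings = math.ceil(math.log(r_max / r_fovea, growth))
    return n_rings * n_sectors

# A 1024x1024 uniform sensor vs a log-polar one covering the same field:
print(uniform_samples(1024, 1024))            # 1048576
print(log_polar_samples(5, 512, 64, 1.05))    # 6080: two orders of magnitude fewer
```

The price is that detail survives only near the fixation point, which is precisely why such sensors presuppose gaze control.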

There is an aspect of image understanding that deserves careful consideration. As shown by Yarbus (1967), the visual exploration of a picture by a human observer follows different paths according to the particular task the observer has been assigned. Not only may different global tasks cause the exploration of different regions in the image, but some regions that are of particular significance for the given task are analysed more carefully; gaze is positioned on them insistently, often coming back to these points of interest after a round trip to other regions. These experiments suggest two conclusions:

• What is important is to collect, in as short a time as possible, the information that allows the observer to perform the assigned task, leaving aside the rest.

• To answer some questions about the scene one needs to analyse a few restricted areas for a long time, meaning perhaps that the extraction of certain kinds of information is difficult compared to the extraction of others.

This view is not completely new, and the idea that a complete internal representation of the world might not be the best approach to the vision problem can be traced, for example, to the ecological approach proposed by Gibson (1950, 1966, 1979). According to him, the world around us acts as a huge external repository of the information necessary to act, and we directly extract, from time to time, the elements that we need.

3.7 ACTIVE COMPUTATIONAL VISION AND REPRESENTATIONS

The considerations presented in the previous paragraph introduce the other problem that we wanted to address: representations. Do we need them? And, if so, to what extent? And of which kind? The ecological approach seems to suggest a negative answer to the first of these questions, and the work by Brooks (1987), in the tradition of behaviourism, follows this approach. Conversely, representations play a very important role in the Marr paradigm. This is mainly due to the rigorous computational approach he followed. In this computational perspective the role of representations is correctly emphasized as that of formal schemes that make explicit some information about an entity. Moreover, formal schemes are the only objects that can be manipulated and transformed, one into the other, by computer programs. It should also be noticed that even the extreme position of Brooks against representation is substantially a position against the use of certain kinds of it, and even his robots make use of representations. The point is that there exist many forms of representation, influenced by the available hardware for both information gathering (sensors) and processing (neurons, transistors).

Going back to the computational vision approach, the visual process, or at least its prominent part, is seen as transforming an image into a different kind of representation. The image, or input representation, is generally taken to be a regularly spaced grid of pixels, each carrying a numerical measure of the irradiance at the corresponding point on the sensor. In this way what is made explicit by the representation is the measure of light, while other information, such as the reflectance of the surface, texture, shape, and the relative position and motion of objects, is left in the background. The effort is then to devise a process, or a collection of processes, that extracts interesting information by transforming the input representation into a new one that makes that information explicit. Given the enormous difficulties often encountered on this path, we could ask ourselves whether the addition of some different kind of input signal could help.
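As a small illustration of this notion of representation (an idealized sketch, not any specific system): the input grid makes irradiance explicit, and a simple transformation can make a different property, local contrast, explicit instead:

```python
import numpy as np

# Input representation: a regular grid of irradiance measurements,
# here a stand-in array in place of real sensor data.
image = np.random.rand(64, 64)

# A transformation into a new representation: finite-difference
# gradients make explicit where the measured light changes, i.e. the
# raw material of edge maps, while leaving absolute brightness implicit.
gy, gx = np.gradient(image)
edge_strength = np.hypot(gx, gy)
```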

And how should the output representation be organized? What kind of information, embedded in the input, should be made explicit? The answer depends on the task for which visual processing is performed. The suitability of a representation for a given task depends on whether it makes the information relevant to the task itself explicit and easy to use. The traditional reconstructionist approach gave absolute preference to shape and spatial relations. Since shape was the key point, the final output representation was chosen to be a three-dimensional model of the observed scene. Model-based recognition uses internal geometrical representations, often of the same kind as those used in CAD systems, to be matched against the representation produced by the visual processing.

It is clear that a certain amount of representation is needed, at least because today we are forced to use digital computers and software programs. But it is not necessarily the case that these representations must resemble the geometrical appearance of the observed scene. We agree that the key point is space representation, but it seems reasonable to suggest a role for eye movements and gaze control in an internal representation of space. Just think about what happens when somebody throws a ball to you. The key task is visually tracking the ball, foreseeing its trajectory in 3D space, and organizing and activating muscle commands in order to intercept it. During that time interval a CAD-like 3D scene reconstruction has no meaning; the spatial relationship between you and the ball is really the only important parameter. And the tight integration, in the same control loop, of ball detection and localization in the image with the oculomotor commands that allow for fixation and tracking of the ball represents the most economical solution to the task (Coombs and Brown, 1991; Brown, 1990). Vergence commands give the distance of the ball at every instant, and they can also represent it. Gaze-control commands give the direction in space from the observer to the ball and, similarly, they can represent it, along with the velocity information that is necessary for interception.
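The claim that vergence can stand in for distance rests on simple fixation geometry (a standard stereo relation, stated here as an illustration rather than taken from the cited papers): for a binocular head with baseline $b$ fixating a target on the midline, the vergence angle $\gamma$ between the two optical axes satisfies

\[
d \;=\; \frac{b}{2\,\tan(\gamma/2)},
\]

so holding fixation on the ball means that the current vergence command is, implicitly, a continuously updated measurement of its distance $d$.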

If internal motion commands can play a significant role in ‘representing’ the dynamic situation of tracking and interception (Berthoz, 1993), there is no apparent reason why they could not take an active role in space representation in general. The spontaneous exploration carried out by any normal individual introduced for the first time into an unknown environment – a room, for example – can reasonably be interpreted as the process of building a model of the space in which he/she is situated. But a satisfying answer to the question of whether this representation entails 3D surface models, or internal exploratory movement commands, or both, is still far from being available.

3.8 CONCLUSIONS

Active vision is one of the hottest research subjects in the field of computer vision (Aloimonos, 1992; Swain, 1994; Fiala et al., 1994). This approach has already proved its usefulness in solving some hard problems in computer vision, and it has also demonstrated its power in complex real-time applications (Dickmanns and Christians, 1989; Dickmanns et al., 1990); it is still in full development, so further results are expected. It will not solve all the problems that still confront researchers in this stimulating area, but it has already achieved the result of revitalizing interest in the research community. Moreover, two important aspects of visual perception have been put in a different perspective: the position of vision, and of perception in general, in the context of an intelligent autonomous agent, and the nature, role and meaning of representations. Vision should not be considered an isolated function; rather, it should be viewed as a powerful tool, so strictly integrated with the capabilities of an intelligent system that any artificial separation between vision and intelligence should be banned. These interactions will probably bring new insights and advantages to the study of both disciplines. In the same way, active vision has the merit of having made explicit the relationship between perception and action, and the consequent importance of control (Clark and Ferrier, 1988; Rimey and Brown, 1994).

REFERENCES

Aloimonos, J., Purposive and qualitative active vision. Proc. 10th Int. Conf. on Pattern Recognition. Atlantic City, New Jersey, 1990.

Aloimonos, J., ed. CVGIP: Image Understanding, special issue on purposive, qualitative, active vision. Academic Press; 1992;56(1).

Aloimonos, J., Shulman, D. Integration of visual modules: an extension of the Marr paradigm. Boston, Massachusetts: Academic Press; 1989.

Aloimonos, J., Weiss, I., Bandyopadhyay, A., Active vision. Proc. 1st IEEE Int. Conf. on Computer Vision. London, UK, 1987:35–54.

Aloimonos, J., Weiss, I., Bandyopadhyay, A. Active vision. Int. Journal of Computer Vision. 1988;1:333–356.

Ballard, D.H., Brown, C.M. Principles of animate vision. CVGIP Image Understanding. 1992;56(1):3–21.

Berthoz A., ed. Multisensory control of movement. New York: Oxford University Press, 1993.

Blake A., Yuille A., eds. Active vision. Cambridge, Massachusetts: MIT Press, 1992.

Brooks, R., Intelligence without representation. Proc. Workshop on the Foundations of AI, 1987.

Brown, C.M. Gaze controls with interactions and delays. IEEE Trans. on Systems, Man and Cybernetics. 1990;20(3).

Clark, J.J., Ferrier, N.J. Modal control of an attentive vision system. In: Proc. 2nd Int. Conf. on Computer Vision. Tampa, Florida: IEEE Press; 1988:514–523.

Coombs, D.J., Brown, C.M. Cooperative gaze holding in binocular robot vision. IEEE Control Systems. 1991:24–33.

Dickmanns, E.D., Christians, T. Relative 3D-state estimation for autonomous visual guidance of road vehicles. Intelligent Autonomous Systems. 1989;2:683–693.

Dickmanns, E.D., Mysliwetz, B., Christians, T. An integrated spatio-temporal approach to automated visual guidance of autonomous vehicles. IEEE Trans. on Systems, Man and Cybernetics. 1990;20:1273–1284.

Fermuller, C., Aloimonos, J., Tracking facilitates 3-D motion estimation. Technical Report CAR-TR-618. Univ. of Maryland, 1992.

Fiala, J.C., Lumia, R., Roberts, K.J., Wavering, A.J. TRICLOPS: A tool for studying active vision. Int. Journal of Computer Vision. 1994;12(2/3):231–250.

Gibson, J.J. The perception of the visual world. New York: Houghton Mifflin; 1950.

Gibson, J.J. The senses considered as perceptual systems. New York: Houghton Mifflin; 1966.

Gibson, J.J. The ecological approach to visual perception. New York: Houghton Mifflin; 1979.

Gremban, K.D., Ikeuchi, K. Planning multiple observations for object recognition. Int. Journal of Computer Vision. 1994;12(2/3):137–172.

Horn, B.K.P. Understanding image intensities. Artificial Intelligence. 1977;8:201–231.

Horn, B.K.P. Robot vision. New York: McGraw Hill; 1986.

Huang, L., Aloimonos, J., Relative depth from motion using normal flow: an active and purposive solution. Technical Report CAR-TR-535. Univ. of Maryland, 1991.

Hutchinson, S., Kak, A. Planning sensing strategies in a robot work cell with multi-sensor capabilities. IEEE Trans. on Robotics and Automation. 1989;5(6).

Ikeuchi, K., Hebert, M., Task-oriented vision. Proc. DARPA Image Understanding Workshop. 1990:497–507.

Kanizsa, G. Organization in vision. New York: Praeger; 1979.

Kutulakos, K.N., Dyer, C.R. Recovering shape by purposive viewpoint adjustment. Int. Journal of Computer Vision. 1994;12(2/3):113–136.

Marr, D. Vision. San Francisco, California: Freeman; 1982.

Rimey, R.D., Brown, C.M. Control of selective perception using Bayes nets and decision theory. Int. Journal of Computer Vision. 1994;12(2/3):173–207.

Sandini, G., Tagliasco, V. An anthropomorphic retina-like structure for scene analysis. Comp. Vision, Graphics and Image Processing. 1980;14:365–372.

Schwartz, E.L. Spatial mapping in the primate sensory projection: analytical structure and relevance to perception. Biological Cybernetics. 1977;25:181–194.

Sharma, R., Aloimonos, J., Robust detection of independent motion: an active and purposive solution. Technical Report CAR-TR-534. Univ. of Maryland, 1991.

Swain, M., ed. Special issue on active vision II. Int. Journal of Computer Vision. 1994;12(2/3).

Yarbus, A.L. Eye movements and vision. New York: Plenum Press; 1967.
