Chapter 3. The Mechanics of Sight

The human eye is a marvel of biological engineering. Seventy percent of all of the body’s sensory receptors are located in the eyes. Forty percent of the cerebral cortex is thought to be involved in some aspect of processing visual information. To fully understand the key technologies behind augmented and virtual reality, it is important to understand the primary sensory mechanism these systems address. In this chapter we will explore the mechanics of human sight, highlighting the physiological processes enabling us to visually perceive real and virtual worlds.

The Visual Pathway

Everything starts with light, the key underlying stimulus for human sight. Light is a form of electromagnetic radiation capable of exciting the retina and producing a visual sensation, and it moves through space as waves move over the surface of a pond. Electromagnetic radiation is classified by its wavelength, the distance between two consecutive crests of a wave. The entirety of the electromagnetic spectrum includes radio waves, infrared, visible light, ultraviolet, x-rays, and gamma rays. As shown in Figure 3.1, the human eye is sensitive only to a narrow band within the electromagnetic spectrum, falling between wavelengths of roughly 380 and 740 nanometers. A nanometer (nm) is one billionth of a meter.
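Because wavelength fully determines where radiation falls in this spectrum, the visible band described above can be expressed in a few lines of code. The following is a minimal sketch (function names are illustrative, not from the text) that also converts wavelength to frequency via the relation frequency = speed of light / wavelength:

```python
# Speed of light in a vacuum, in meters per second
C = 299_792_458

def is_visible(wavelength_nm: float) -> bool:
    """Return True if the wavelength lies in the ~380-740 nm visible band."""
    return 380 <= wavelength_nm <= 740

def frequency_thz(wavelength_nm: float) -> float:
    """Convert a wavelength in nanometers to frequency in terahertz (f = c / wavelength)."""
    return C / (wavelength_nm * 1e-9) / 1e12

print(is_visible(550))             # green light -> True
print(is_visible(1000))            # near infrared -> False
print(round(frequency_thz(550)))   # ~545 THz
```

Note that, as the chapter text points out, the 380 nm and 740 nm boundaries used here are conventional rather than physically sharp.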


Credit: Illustration by Peter Hermes Furian

Figure 3.1 This diagram shows all the regions within the electromagnetic spectrum. The color callout shows the small portion to which the retinas of the eyes have natural sensitivity.

To be clear, there are no precisely defined boundaries between the bands of the electromagnetic spectrum; rather, they fade into each other like the bands in a rainbow. Gamma ray wavelengths blend into x-rays, which blend into ultraviolet, and so on. The longest wavelengths perceptible by humans correspond to light we see as red, and the shortest wavelengths correspond to light that we see as violet. However, the spectrum does not contain all the colors that a healthy human eye can distinguish. For instance, nonspectral colors such as pinks, purples, and magentas do not correspond to any single wavelength; they are perceived only as mixtures of multiple wavelengths.

In the real world, an object’s visible color is determined by the wavelengths of light it absorbs or reflects, a property known as spectral reflectance. Only the reflected wavelengths reach our eyes and are discerned as color. As a simple example, the leaves of many common plants reflect green wavelengths while absorbing red, orange, blue, and violet.

Entering the Eye

The human eye is a complex optical sensor but relatively easy to understand when thought of as functionally similar to a camera. Light enters through a series of optical elements, where it is refracted and focused. A diaphragm is adjusted to control the amount of light passing through an aperture, which ultimately falls onto an image plane. As shown in Figure 3.2, the human eye performs the same basic functions, with the cornea and crystalline lens providing focus while the iris serves as the diaphragm, adjusting appropriately to allow just the right amount of light to pass through the aperture. Instead of coming to rest on film, the inverted light field falls onto the extremely sensitive retina.


Credit: Illustration by guniita

Figure 3.2 This illustration shows a vertical cross section through the human eye revealing major structures and chambers.

The Cornea

Light from all directions within the visual field initially enters the eye through the cornea, a transparent, dome-shaped structure, the surface of which is composed of highly organized cells and proteins. Most of the refraction of light by the eye (~80%) takes place at the air-cornea interface because of the cornea’s curvature and the large difference in the indices of refraction. Set behind the cornea is another transparent structure called the crystalline lens, which serves as a fine-focus mechanism because its shape can be changed, thus providing variability to the effective focal length of the optical system (Delamere, 2005). The space between the two optical elements, known as the anterior chamber, is filled by a clear, watery fluid called the aqueous humor, which is produced by the ciliary body. The aqueous humor provides nutrients (notably amino acids and glucose) for the central cornea and lens because they do not have their own blood supply. The front of the cornea receives the same nutrients via tears spread across the surface as a person blinks.


After light passes through the cornea and the aqueous-filled anterior chamber, a portion of that light then passes through a hole located in the center of the colored structure of the eye known as the iris. This hole, known as the pupil, allows the light to strike the retina. The pupil is black in appearance because most of the light entering through the hole is absorbed by the interior of the eye with little if any reflectance.

As described earlier, similar to the aperture of a camera diaphragm, the size of the pupil can be varied to account for changes in visual stimulus, with the iris dilating or constricting as shown in Figure 3.3. In low-light situations, the pupil expands to allow more light to enter. In brightly lit conditions, the pupil contracts in size. This involuntary reaction is called the pupillary reflex.


Credit: Illustration by hfsimaging

Figure 3.3 Similar to the aperture of a camera diaphragm, the size of the pupil varies to account for changes in visual stimulus and focus.

Crystalline Lens

Light passing through the pupil immediately enters a new optical element known as the crystalline lens. This is an almost perfectly transparent, flexible structure and is composed of concentrically arranged shells of fiber cells. The most superficial fibers are metabolically active and, similar to the cornea, the crystalline lens receives all its nutrition from the fluids that surround it.


The crystalline lens is held in position by the ciliary muscles and delicate suspensory ligaments around its outer circumference. As shown in Figure 3.4, when the eye is in a relaxed state, such as when you are simply looking off in the distance, the crystalline lens assumes a flattened shape, thus providing the maximum focal length for distance viewing. To assume this shape, the ciliary muscle that encircles the crystalline lens transitions from a constricted state to an enlarged, open state. In doing so, it exerts outward tension on the suspensory ligaments (zonules) connecting the muscle and lens, which in turn pulls the lens flat. When the eye focuses on near-field objects, the process is reversed. The ciliary muscle that encircles the crystalline lens constricts, thereby relieving tension on the suspensory ligaments and allowing the lens to naturally reassume a more rounded, biconvex (convex on both sides) shape, thus increasing the refractive power needed to clearly focus on the near-field object. This variable process by which the optical power of the eye is changed to allow an observer to rapidly switch focus between objects at different depths of field is referred to as accommodation.


Credit: Illustration by alila

Figure 3.4 This illustration shows the process of accommodation in which the optical power of the eye is changed to allow an observer to rapidly switch focus between objects.

It is widely believed that blurring on the retina is the stimulus for accommodation, although the process is also strongly linked to vergence, discussed later in this chapter (Leigh and Zee, 2015, 524).

As will be seen in Chapter 21, “Human Factors Considerations,” this extraordinary reflex action, although perfect for viewing our real-world surroundings, is highly problematic when using most current 3D display technologies. One of these challenges is the fact that current 3D displays present images on a 2D surface. Thus, focus cues such as vergence, as well as blur in the retinal image, specify the depth of the display surface rather than the depths in the depicted scene. Additionally, the uncoupling of vergence and accommodation required by 3D displays frequently reduces your ability to fuse the binocular stimulus and causes discomfort and fatigue for the viewer (Lambooij et al., 2009).

It is interesting to note that the crystalline lens is highly deformable up to about 40 years of age, at which point it progressively begins losing elasticity. As a result of increasing stiffness due to metabolic activity in the outer shell, by the mid-fifties, ciliary muscle contraction is no longer able to change the shape of the lens (Atchison, 1995; Duane, 1912).

Image Inversion

Thus far we have seen that the eye is equipped with a compound lens system. Light enters the eye by passing from air into a denser medium (the cornea), which performs ~80% of the refraction for focusing, with the crystalline lens performing the remaining ~20%. Although the cornea is a strong fixed lens, the crystalline lens is a variable, double-convex lens. Following the refraction rules for a converging lens, light rays pass through the focal point on the opposite side. As shown in Figure 3.5 (although not at proper scale), the result is that the light field entering the eye is optically inverted before reaching the retina.
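The converging behavior described above can be made concrete with the thin-lens equation, 1/f = 1/d_o + 1/d_i. The sketch below uses a simplified "reduced eye" with an assumed effective focal length of about 17 mm (an illustrative round figure, not a value from the text):

```python
def image_distance_m(focal_m: float, object_m: float) -> float:
    """Thin-lens equation solved for image distance: 1/f = 1/d_o + 1/d_i."""
    return 1.0 / (1.0 / focal_m - 1.0 / object_m)

# Reduced-eye model with an assumed ~17 mm effective focal length
f = 0.017
print(round(image_distance_m(f, 2.0) * 1000, 2))   # object 2 m away  -> ~17.15 mm
print(round(image_distance_m(f, 0.5) * 1000, 2))   # object 0.5 m away -> ~17.6 mm
```

The nearer object focuses measurably farther behind the lens; because the retina cannot move, the lens must instead shorten its focal length, which is exactly the accommodation process described above.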


Credit: Illustration by peterhermesfurian

Figure 3.5 The double convex shape of the crystalline lens results in the inversion of the light field entering the eye.

Vitreous Body

After passing through the crystalline lens, light then enters the interior chamber of the eye, which is filled with a clear gel-like substance known as vitreous humor. As you would expect, this substance has optical properties well suited to the easy passage of light. The vitreous is 98% water, along with hyaluronic acid, which increases viscosity, a network of fine collagen fibrils that provides its jelly-like properties, as well as various salts and sugars (Suri and Banerjee, 2006). The substance is essentially stagnant, is not actively regenerated, and is not served by any blood vessels.

Image Formation and Detection

We have finally reached the point where light begins the process of being converted from waves deriving from a small detectable sliver of the electromagnetic spectrum into a form that allows us to actually “see.” The mechanisms enabling this conversion operate under conditions ranging from starlight to sunlight, recognize the positioning of objects in space, and enable us to discern shape, size, color, textures, and other dimensional aspects to interpret and derive meaning from our surroundings.

The Retina

Visual perception begins as the optical components of the eye focus light onto the retina (from the Latin rete, meaning net), a multilayered sensory tissue that covers about 65 percent of the interior surface of the eye and serves a similar function as the film (or a CMOS/CCD image sensor) in a camera. The thickness of the retina ranges from 0.15 mm to 0.320 mm (Kolb et al., 1995). As shown in Figure 3.6, near the middle of the retina is a feature called the macula, and the center of the macula contains the fovea. The fovea is naturally centered on objects when we fixate on them and is the point on the retina with the greatest acuity. The entire intricate superstructure of the eye exists in the interests of the retina (Hubel, 1995).


Credit: Illustration by Zyxwv99 via Wikimedia under a CC BY 2.0 license

Figure 3.6 This image shows the major features of the retina as viewed through an ophthalmoscope.

An amazing aspect of the retina is the fact that it is nearly completely transparent (Huang et al., 1991; Slater and Usoh, 1993; D’Amico, 1994). As light falls upon the retina, it actually passes straight through until it comes into focus on the outermost, or deepest, layer, known as the pigment epithelium, as shown in Figure 3.7. That image is then reflected back into the immediately adjacent layer, where the photoreceptor neurons are located.


Credit: Illustration by OpenStax College via Wikimedia under a CC 3.0 license

Figure 3.7 This cross-sectional illustration shows the complex structure of the human retina.

Rods and Cones

The photoreceptors of the eye, referred to as rods and cones due to their shapes, actually face away from the incoming light. Rods are more numerous, are responsible for vision at low light levels, and serve as highly sensitive motion detectors. Found mostly in the peripheral regions of the retina, rods are responsible for our peripheral vision. Cones are active at higher light levels, provide high spatial acuity, and are responsible for our color sensitivity.

Light reflected from the pigmented epithelium results in a chemical reaction with two photopigments: iodopsin in cones (activated in photopic or bright conditions) and rhodopsin in rods (activated in scotopic or dark conditions). This reaction, known as isomerization, results in changes in the electrical properties of the photoreceptors and the release of neurotransmitters (chemical transmitters/transmitter substances). These neurotransmitters stimulate neighboring neurons, thus enabling impulses to be passed from one cell to the next.

Based on their measured response curves shown in Figure 3.8, individual cones are sensitive to one of three overlapping wavelength ranges: red-sensitive cones (the most numerous) show peak sensitivity at a wavelength of 564 nm, green at 533 nm, and blue at 437 nm. Rods show a peak sensitivity to wavelengths around 498 nm (green-blue) (FAA, 2016).


Credit: Illustration by Pancrat via Wikimedia under a CC BY 3.0 license

Figure 3.8 This graph shows the spectral sensitivity curves of short (437), medium (533), and long (564) wavelength cones compared to that of rods (498).
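The general shape of such response curves can be loosely mimicked by Gaussian curves centered on the peak wavelengths quoted above. This is a deliberately crude illustration, not the measured data plotted in Figure 3.8 (real photoreceptor curves are asymmetric), and the 50 nm width used below is an arbitrary assumption:

```python
import math

# Peak sensitivities (nm) quoted in the text for each receptor type
PEAKS_NM = {"S_cone": 437, "M_cone": 533, "L_cone": 564, "rod": 498}

def sensitivity(receptor: str, wavelength_nm: float, width_nm: float = 50.0) -> float:
    """Relative sensitivity (0..1) of a receptor at a wavelength,
    modeled as a Gaussian around the receptor's peak."""
    peak = PEAKS_NM[receptor]
    return math.exp(-((wavelength_nm - peak) ** 2) / (2 * width_nm ** 2))

# At 550 nm (green), M and L cones respond strongly while S cones barely do:
for receptor in PEAKS_NM:
    print(receptor, round(sensitivity(receptor, 550), 2))
```

Even this toy model reproduces the key point of trichromatic vision: a single wavelength excites the three cone types in a distinctive ratio, and it is that ratio the brain interprets as color.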

Impulses from the rods and cones stimulate bipolar cells, which in turn stimulate ganglion cells. These impulses continue into the axons of the ganglion cells, through the optic nerve and disk, and to the visual centers in the brain.

Rod and Cone Density

There are approximately 100–120 million rod and 7–8 million cone photoreceptors in each retina (Riggs, 1971). As shown in the distribution graph in Figure 3.9, most cones are concentrated in the fovea, while rods are absent there but dense elsewhere. Despite the fact that perception in typical daytime light levels is dominated by cone-mediated vision, the total number of rods in the human retina far exceeds the number of cones (Purves et al., 2001).


Credit: Illustration by Cmglee via Wikimedia under a CC 3.0 license

Figure 3.9 This graph shows why we see color (photopic vision) most clearly in our direct line of sight. The peak number of cones occurs in the fovea, where it reaches approximately 150,000 cones per square millimeter.

It is important to note that there are no photoreceptors in the optic disk, more accurately known as the optic nerve head. This lack of photoreceptors means there is no light detected in this area, resulting in a blind spot for each eye. The blind spot for the left eye is located to the left of the center of vision and vice versa for the right eye. With both eyes open, we do not perceive the blind spots because the field of view of each eye overlaps with the other, although they still can be experienced. Follow the instructions in the caption of Figure 3.10 to find your blind spots.


Credit: Illustration by S. Aukstakalnis

Figure 3.10 To find your blind spots, start by placing this book flat on a table. Cover your left eye and look at the dot on the left side of this image with your right eye. Remain aware of the cross on the right without looking directly at it. Slowly move your face closer to the image. At some point, you will see the cross disappear into the blind spot of your right eye. Reverse the process to find the blind spot for the left eye.

Spatial Vision and Depth Cues

Based on the visual processes described in the previous section, billions of pieces of information are sent to the cerebral cortex every second for analysis. This stream of information undergoes repeated refinement, with each level in the hierarchy representing an increase in organizational complexity. At each level, neurons are organized according to highly specific stimulus preferences, with the cortical destinations of impulses differentiating content and cues. Theoretically, the nature of the representations (patterns of nerve impulses) is thought to shift from analogue to symbolic (Mather, 2009).

In this section, we will explore many of the specific triggers, or cues, that are believed to enable the brain to perceive depth in the visual stimuli entering the eyes.

Extraretinal Cues

Extraretinal depth cues are those triggers or pieces of information that are not derived from light patterns entering the eye and bathing the retina, but from other physiological processes. In this section we explore the two most dominant of these cues.


Accommodation

As described in the previous section, when the human eye is in a relaxed state, such as when you are simply looking off in the distance, the crystalline lens of the eye is flattened, thereby providing the maximum focal length for distance viewing. As shown in Figure 3.11, when the eye focuses on near-field objects, the process is reversed. The ciliary muscle that encircles the crystalline lens constricts, thereby relieving tension on the suspensory ligaments and allowing the lens to snap into a more rounded, biconvex shape, thus increasing the refractive power needed to clearly focus on the near-field object.


Credit: Illustration by S. Aukstakalnis

Figure 3.11 Accommodation is the process by which an observer’s eye changes optical power to obtain a clear image or focus on an object on a different focal plane. In this illustration, constriction and relaxation of the radial ciliary muscle affects the focal length of the crystalline lens.

Accommodation is an involuntary physiological process by which the optical power of the eye’s lens changes to focus light entering the eye and falling on the retina. It is widely believed that blurring on the retina is the stimulus for accommodation, although the process is also strongly linked to vergence (Leigh and Zee, 2015, 524). It is also theorized that movement of the ciliary muscles themselves contributes to this cue (Helmholtz et al., 1944).
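Because optical power is conventionally measured in diopters (the reciprocal of focal length in meters), the accommodation the lens must supply, relative to its relaxed, distance-focused state, is approximately the reciprocal of the viewing distance. A minimal sketch of this standard rule of thumb (the function name is illustrative):

```python
def accommodative_demand(distance_m: float) -> float:
    """Approximate accommodation (in diopters) needed to focus an object
    at the given distance, relative to the relaxed, distance-focused eye.
    Demand is the reciprocal of the viewing distance in meters."""
    return 1.0 / distance_m

print(accommodative_demand(0.25))  # reading distance (25 cm) -> 4.0 D
print(accommodative_demand(2.0))   # object 2 m away -> 0.5 D
```

The steep rise in demand at near distances is one reason near-field display surfaces, discussed below, place such unusual loads on this reflex.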


Vergence

One of the most powerful depth cues, and an oculomotor function at the foundation of binocular vision, is vergence: the pointing of the foveae of both eyes at an object in the near field. As shown in Figure 3.12, this process entails the eyes simultaneously rotating about their vertical axes in opposite directions to the degree necessary so that, when looking at a nearby object, the projected image of that object is aligned with the center of the retina of each eye. When looking at an object in the near field, the eyes rotate toward each other, or converge. When looking at an object in the far field, the eyes rotate away from each other, or diverge.


Credit: Eye illustration by Ginko

Figure 3.12 Vergence is the simultaneous movement of both eyes in opposite directions to obtain or maintain binocular vision.

When the eyes rotate in opposite directions, this is known as disconjugate movement. All other eye movements, in which the eyes move together in the same direction, are conjugate.

Accommodation and vergence are normally tightly coupled physiological processes. As an example, when focusing your eyes on something in the distance and then shifting your attention to an object closer to you, that process starts with your eyes converging on that object in the near field. This results in the image of that object appearing larger on the retina and out of focus. That blurriness in turn triggers the accommodation reflex, which results in a change in the focal power of the crystalline lens bringing the image on the retina into sharp focus.
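The geometric side of this coupling is easy to quantify. For a fixation point straight ahead, the vergence angle follows from the interpupillary distance (the ~63 mm average cited later in this chapter is assumed below), while accommodative demand in diopters is simply the reciprocal of the distance. A hedged sketch:

```python
import math

IPD_M = 0.063  # assumed average interpupillary distance (~63 mm)

def vergence_angle_deg(distance_m: float, ipd_m: float = IPD_M) -> float:
    """Angle (degrees) between the two eyes' lines of sight when fixating
    a point straight ahead at the given distance: 2 * atan(ipd / 2d)."""
    return math.degrees(2 * math.atan(ipd_m / (2 * distance_m)))

# Both signals fall off together as fixation distance grows:
for d in (0.25, 0.5, 2.0):
    print(f"{d} m: vergence {vergence_angle_deg(d):.1f} deg, "
          f"accommodation {1 / d:.1f} D")
```

Because both quantities vary with the same distance, the brain can treat either as a proxy for the other; stereoscopic displays break exactly this correspondence.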

The vergence and accommodation processes are extremely important for virtual and augmented reality enthusiasts to understand. As pointed out in Chapter 21, users of flat panel-based stereoscopic head-mounted displays often complain of headaches and eye strain. That side effect is caused by the eyes having to remain focused on a flat plane (the display surface) that is within inches of the eye. Even if you are paying attention to and accommodating for objects at what appear to be differing focal planes within the virtual space, the depth of field is just simulated. The imagery presented to each eye is portrayed on a 2D display surface, and that is where the eye remains focused. Ultimately, this constant focusing in the near field on the surface of the display elements, which is made possible by the constricting of the ciliary muscle surrounding the edge of the lens, greatly contributes to such discomfort. Further, there is a mismatch, or decoupling, in the sensory cues provided to the brain by the vergence and accommodation processes.

An additional aspect to the vergence cue just described comes in the form of tension in the six extraocular muscles shown in Figure 3.13 that control eye movement (Jung et al., 2010).


Credit: Illustration by alila

Figure 3.13 This illustration shows the six muscles used in movement of the eyeball. The lateral rectus and medial rectus (opposite side, not shown) are the primary muscles controlling vergence.

Binocular Cues

Binocular depth cues are those triggers or pieces of information that are detected as a result of viewing a scene with two eyes, each from a slightly different vantage point. These two scenes are integrated by our brain to construct a 3D interpretation of our real or virtual surroundings.


Stereopsis

Binocular vision is sight with two eyes. The primary depth cue for binocular vision is known as stereopsis, which is a direct result of retinal, or horizontal, disparity. We have two eyes laterally separated by an average distance of about 2.5 inches (63 mm) (Dodgson, 2004), with each eye capturing the scene from a slightly different angle. As shown in Figure 3.14, stereopsis is the perception of depth constructed by the brain from the differences between these two retinal images.


Credit: Illustration by S. Aukstakalnis

Figure 3.14 Stereopsis is the perception of depth and 3D structure obtained on the basis of visual information deriving from two eyes.

Within a binocular view, each point in one retina is said to have a corresponding point in the other retina (Howard and Rogers, 2012, 150). These retinal corresponding points correlate to an area before the observer called the horopter, such as is shown in Figure 3.15. The term horopter (meaning the horizon of vision) was introduced in 1613 by François d’Aguilon, a Belgian Jesuit mathematician, physicist, and architect. Theoretically, the horopter is the locus of all object points that, for a given point of binocular fixation, are imaged on corresponding retinal elements (Howard and Rogers, 2012, 150). A curve can thus be drawn through the object of regard such that every point on the curve projects to corresponding points on the retinas of both eyes and is therefore seen as single.


Credit: Illustration by Vlcekmi3 via Wikimedia under a CC 3.0 license

Figure 3.15 This illustration depicts the concept of a horopter, which is the locus of points in space having the same disparity as fixation. This can be defined theoretically as the points in space that project on anatomically identical, or corresponding, points in the two retinas. Note how points R, P, and Q map to identical positions on both retinas.

According to this model, if corresponding points lie at regular horizontal distances from the fovea of each eye, the horopter would be a circle passing through the centers of rotation of the two eyes and the fixation point. Thus, as the point of fixation gets closer, this circle becomes smaller (Bhola, 2006).
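For points near the line of sight, the disparity that drives stereopsis can be approximated with the small-angle relation disparity ≈ a(1/d_target − 1/d_fixation), where a is the interocular separation. The following is a sketch under that approximation, using the 63 mm average separation cited above:

```python
import math

IPD_M = 0.063  # average interocular separation (~63 mm) cited above

def disparity_deg(fixation_m: float, target_m: float, ipd_m: float = IPD_M) -> float:
    """Approximate retinal disparity (degrees) of a target at one distance
    while the eyes fixate another, via the small-angle approximation
    disparity ~= ipd * (1/target - 1/fixation)."""
    return math.degrees(ipd_m * (1.0 / target_m - 1.0 / fixation_m))

# A target 10 cm nearer than a 1 m fixation point:
print(round(disparity_deg(1.0, 0.9), 2))   # crossed (positive) disparity
# The same 10 cm offset at 5 m produces a far smaller disparity:
print(round(disparity_deg(5.0, 4.9), 3))
```

The inverse-square falloff visible here is why stereopsis is a strong cue for near-field depth but contributes little at large viewing distances.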

This simple concept of binocular disparity leading to stereopsis can be demonstrated using the stereo pair shown in Figure 3.16. This print form of stereo images can be viewed using what is known as the cross-eyed viewing method. For those who have never tried, it may take a few minutes to master, but the effort is worth it. To get started, position this book approximately two feet in front of you and, while looking at the image pair straight on, slowly cross your eyes. You will then begin to perceive a third image in the center. Vary the degree to which your eyes are crossed until you can form a stable middle image and watch as the astronaut floats above the moon surface.


Credit: Image courtesy of NASA

Figure 3.16 The stereoscopic disparity of the two images in a 3D pair is a strong indicator of distance. To view this stereo pair, slowly cross your eyes and attempt to fuse together a third, combined image.

As a side note, once you have successfully fused the two images and can perceive depth, slowly tilt your head to the left and right. The progressive vertical separation of the images is the result of a loss of stereopsis as the images on the two retinas are displaced.

Finally, as is seen elsewhere in this chapter, neurons in the visual cortex of the brain have been identified that assist in the creation of stereopsis from binocular disparity.

Monocular Cues

Monocular depth cues are those triggers or pieces of information that are derived from light patterns on the retinas but are not dependent on both eyes. As will be seen in this section, monocular cues are divided between those requiring movement of light patterns across the retina (that is, viewer motion) and those that can be discerned from a fixed viewing position.

Motion Parallax

Motion parallax is a strong, relative motion cue within which objects that are closer to a moving observer appear to move faster than objects that are farther away (Gibson et al., 1959; Ono et al., 1986). Figure 3.17 illustrates this phenomenon. From a physiological perspective, this perceptual phenomenon is the result of the speed at which an image moves across the retinas of the eyes. Objects closer to the observer pass into, through, and out of the field of view considerably faster than objects off in the distance.


Credit: Illustration elements by sergeiminsk and Ginko

Figure 3.17 Motion parallax is the perception that nearby objects appear to move more rapidly in relation to your own motion than background features.

This visual cue provides important information about relative depth differences and can reliably convey 3D scene layout and help enable navigation in the environment (Helmholtz, 1925). This retinal image motion results in two types of motion boundaries: those parallel to the direction of observer movement, which provide shear, and those at right angles to the direction of observer movement, which provide dynamic occlusion, in which objects in the near field dynamically cover and uncover objects in the far field (Yoonessi and Baker, 2013).
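The speed difference underlying motion parallax can be sketched with the small-angle relation ω ≈ v/d for a stationary object at its closest lateral distance d from an observer translating at speed v (an idealized geometry for illustration, not a formula from the text):

```python
import math

def angular_speed_deg_s(observer_speed_ms: float, distance_m: float) -> float:
    """Peak angular speed (deg/s) at which a stationary object sweeps across
    the visual field of an observer translating past it: omega = v / d."""
    return math.degrees(observer_speed_ms / distance_m)

# Walking at 1.5 m/s past a fence 2 m away versus hills 500 m away:
print(round(angular_speed_deg_s(1.5, 2.0), 1))    # ~43 deg/s
print(round(angular_speed_deg_s(1.5, 500.0), 2))  # ~0.17 deg/s
```

The two-orders-of-magnitude difference in retinal image speed is the raw signal from which the visual system recovers relative depth.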


Occlusion

Also known as interposition, occlusion cues are generated when one object blocks an observer’s view of another object. In such a situation, the blocking object is perceived as being closer to the observer. This is clearly shown in Figure 3.18, in which the progressive interposition of cars provides a strong indication of depth. Occlusion indicates relative (as opposed to absolute) distance.


Credit: Illustration by joyfull

Figure 3.18 Occlusion, or interposition, is a simple but powerful depth cue within which one object partially blocks another.

Recent investigations have reinforced the potential importance of these cues in stereoscopic depth perception (Harris & Wilcox, 2009). Some research suggests that the primary function of monocular cues in stereoscopic depth perception is to define depth discontinuities and the boundaries of the occluding objects (Anderson, 1994; Gillam and Borsting, 1988; Nakayama and Shimojo, 1990).

Deletion and Accretion

Two components of the occlusion phenomenon are known as deletion (hiding) and accretion (revealing) and refer to the degree that an object or surface in the near field reveals or covers objects or surfaces in the far field as your viewpoint translates past their position. In both real and virtual environments, if an object or surface in the near field is significantly closer to the observer than that in the far field, the deletion or accretion of the distant object will occur at a faster rate as you move by, such as shown in Figure 3.19. Alternatively, if two objects are in the far field but in close proximity to each other, the rate at which deletion or accretion will occur is much slower.


Credit: Illustration by S. Aukstakalnis

Figure 3.19 The human visual system produces the perception of depth even when the only useful visual structural information comes from motion.

If it is not already apparent, the deletion and accretion phenomenon applies regardless of the direction of an observer’s movement.

This cue is extremely important to remember when designing virtual environment simulations. Under the right circumstances, these cues can be heavily leveraged to produce a variety of interesting effects.

Linear Perspective

Linear perspective is the monocular depth cue provided by the apparent convergence of parallel lines toward a single point in the distance (Khatoon, 2011, 98). As shown in Figure 3.20, looking down this Hong Kong skyway, the walls appear to converge, although we know they remain parallel along the entire length.


Credit: Image by Warren R.M. Stuart via Flickr under a CC 2.0 license

Figure 3.20 Linear perspective is a depth cue within which parallel lines recede into the distance, giving the appearance of drawing closer together. The more the lines converge, the farther away they appear.

Kinetic Depth Effect (Structure from Motion)

Kinetic depth effect is the perception of an object’s complex 3D structure from that object’s motion. Although challenging to explain and demonstrate without moving media, consider a cube suspended between a light and a wall. If motionless, the cube could appear as any of the random silhouettes shown in Figure 3.21. Even the square shape in the upper left would be recognized as just that—a square. But as the cube rotates through the remaining views, most observers would quickly recognize the source of the silhouettes as a cube, even in the absence of other depth information or surface details.


Credit: Illustration by S. Aukstakalnis

Figure 3.21 The kinetic depth effect demonstrates the ability to perceive 3D structure from moving 2D views and silhouettes.

This phenomenon first appeared in scientific literature in the 1950s based on experiments performed by research scientists Hans Wallach and D. N. O’Connell (Wallach and O’Connell, 1953). Widely studied since, there are two key theories as to how the 3D forms are perceived. The first is the result of changes in the pattern of stimulation on the retina as the object moves, and the second is related to previous experience. In most situations, the kinetic depth effect is experienced along with other depth cues, such as that of motion parallax described earlier.

Familiar Size

As the name of this cue indicates, if we know how large an object is, our brain can use that understanding to estimate absolute distances, such as shown in Figure 3.22. Some studies suggest that this cue can be recharacterized as an awareness of the relationship between the size of one’s body and the size of an object: knowing the size of an object must be anchored to some relative metric, and the body is really the only relevant thing we have to which sizes can be compared (Linkenauger et al., 2013).


Credit: Image by Anoldent via Flickr under a CC 2.0 license

Figure 3.22 The familiar size cue draws upon existing observer knowledge of an object in view to help estimate absolute distances.

Relative Size

As shown in Figure 3.23, if two objects are similar in size but located at different distances from the observer, we perceive the one that casts the smaller image on the retina as being farther away, and the one with the larger image as being closer. This depth cue is heavily weighted by personal experience.


Credit: Illustration by S. Aukstakalnis

Figure 3.23 If two objects are of equal size, then if one is farther away, it will occupy a smaller area on the retina. A larger retinal image makes something appear closer.
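The inverse relationship between distance and retinal image size can be sketched numerically. In this illustration (the 1.8 m height and the two distances are assumed values for demonstration), the nearer of two equal-sized objects subtends roughly four times the visual angle of one four times farther away.

```python
import math

def visual_angle_deg(size_m, distance_m):
    """Visual angle (in degrees) subtended by an object at a given distance."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

# Two people of identical 1.8 m height, one at 5 m and one at 20 m:
near = visual_angle_deg(1.8, 5.0)
far = visual_angle_deg(1.8, 20.0)
print(round(near, 1), round(far, 1))  # → 20.4 5.2
```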

Aerial Perspective

Aerial perspective (also known as atmospheric perspective) refers to the effect of light being scattered by particles in the atmosphere, such as water vapor and smoke, between an observer and a distant object or scene. As shown in Figure 3.24, as this distance increases, the contrast between an object or scene feature and its background decreases, as does the contrast of the object's markings and details. As can be seen in the photograph, the mountains in the distance become progressively less saturated and shift toward the background color. Leonardo da Vinci referred to this cue as "the perspective of disappearance."


Credit: Image by WSilver via Flickr under a CC 2.0 license

Figure 3.24 This photo shows loss of color saturation, contrast, and detail as distance increases from the observer.

This atmospheric effect arises because visible blue light has a short wavelength, around 475 nm, and is therefore scattered more efficiently by the molecules in the atmosphere, which is why the sky usually appears blue. At sunrise and sunset, orange (590 nm) and red (650 nm) colors dominate the scene because those longer wavelengths are scattered less efficiently by the atmosphere.
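The wavelength dependence at work here is Rayleigh scattering, whose strength is proportional to 1/λ⁴. Using the 475 nm and 650 nm values from the paragraph above, a quick sketch shows blue light scattering roughly three and a half times more strongly than red:

```python
# Rayleigh scattering intensity is proportional to 1 / wavelength^4,
# so shorter (blue) wavelengths scatter far more strongly than longer (red) ones.
def relative_scattering(wavelength_nm, reference_nm=475):
    """Scattering strength of a wavelength relative to blue light at ~475 nm."""
    return (reference_nm / wavelength_nm) ** 4

print(round(1 / relative_scattering(650), 1))  # → 3.5 (blue vs. red)
```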

Texture Gradient

Texture gradient is a strong depth cue within which there is a gradual change in the appearance of textures and patterns of objects from coarse to fine (or less distinct) as distance from the observer increases. As shown in Figure 3.25, the individually defined cobblestones progressively become less and less distinguishable as distance increases from the observer until they appear to blend into a continuous surface.


Credit: Image by Jeremy Keith via Flickr under a CC 2.0 license

Figure 3.25 This photograph shows a wonderful example of texture gradient, within which there is a gradual change in the appearance of textures and patterns of objects from coarse to fine as distance from the observer increases.

Three key features can be identified in this cue (Mather, 2006):

Perspective gradient—Separation of texture elements perpendicular to the surface slant or angle of viewing appears to decrease with distance.

Compression gradient—The apparent height of texture elements decreases with increasing distance.

Density gradient—The density, or number of elements per unit area, increases with increasing distance.
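These gradients fall directly out of perspective projection. In a simple pinhole model, a ground point at distance Z, viewed by an eye at height H, projects to an image coordinate proportional to H/Z, so equally spaced texture rows on the ground compress in the image as distance grows. The focal length and eye height below are assumed values for illustration only.

```python
# For an eye at height H above a textured ground plane, a pinhole model
# projects a ground point at distance Z to image coordinate y = f * H / Z.
# Equal 1 m spacings on the ground therefore compress in the image with
# distance -- the compression gradient described above.
f, H = 0.017, 1.6  # assumed focal length (m) and eye height (m)

rows = [2, 3, 4, 10, 11, 12]           # ground distances in meters
y = {Z: f * H / Z for Z in rows}       # projected image heights

near_gap = y[2] - y[3]                 # image gap between rows 2 m and 3 m away
far_gap = y[10] - y[11]                # image gap between rows 10 m and 11 m away
print(near_gap > far_gap)              # prints True: nearby elements are spread farther apart
```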


Lighting and Shadows

Lighting, shading, and shadows are powerful cues in the perception of scene depth and object geometry, and their effects vary widely. The angle and sharpness of shadows influence perceived depth. Shadows and reflections cast by one object on another provide information about distances and positioning. Smaller, more clearly defined shadows typically indicate that an object is in close proximity to the surface or object upon which the shadow is cast. Conversely, enlarging a shadow and blurring its edges can create the perception of greater depth. The manner in which light interacts with an irregular surface reveals significant information about geometry and texture. A number of these effects are shown in Figure 3.26.


Credit: Illustration by Julian Herzog via Wikimedia under a CC 4.0 license

Figure 3.26 This photograph illustrates how shading and shadows can significantly impact the perception of depth in a closed space.

Optical Expansion

Extend your arm straight out, palm up, and slowly move your hand toward your face. As your hand moves closer, the image projected on your retina grows in size isotropically and increasingly occludes the background. Known as optical expansion, this cue allows an observer to perceive not only that an object is moving, but its distance as well (Ittelson, 1951). Sensitivity to this dynamic stimulus develops at a young age and has been observed in infants, who show a coordinated defensive response to an object approaching straight on (Bower et al., 1970). A still-frame example of this cue is shown in Figure 3.27. Not only does the object grow larger as distance decreases, but background cues increasingly disappear.


Credit: Illustration by S. Aukstakalnis

Figure 3.27 Within the optical expansion cue, the visual image increases size on the retina as an object comes toward us, causing the background to be increasingly occluded.
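The rate of this expansion is itself informative: dividing the visual angle an object subtends by its rate of growth yields an estimate of the time remaining until contact, a quantity studied extensively in the perception literature. The sketch below illustrates the idea with assumed values (a 0.1 m object approaching at 1 m/s, sampled 0.1 s apart); it is a numerical illustration, not a model of the actual neural computation.

```python
import math

def visual_angle(size_m, distance_m):
    """Visual angle (radians) subtended by an object at a given distance."""
    return 2 * math.atan(size_m / (2 * distance_m))

# A 0.1 m wide object approaching at 1 m/s, sampled 0.1 s apart:
theta1 = visual_angle(0.1, 2.0)
theta2 = visual_angle(0.1, 1.9)
rate = (theta2 - theta1) / 0.1        # expansion rate, d(theta)/dt
tau = theta2 / rate                   # estimated time to contact (seconds)
print(round(tau, 1))                  # → 2.0, close to the true ~1.9-2.0 s remaining
```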

Relative Height

In a normal viewing situation, objects on a common plane in the near field of your vision are projected onto a lower portion of the retinal field than objects that are farther away. This phenomenon can be seen in the simple example shown on the left side of Figure 3.28. The key is your height relative to the objects in your field of view. Conversely, if the objects being viewed are on a common plane above your viewpoint, such as a line of ceiling lanterns, those closest to you will appear in a higher portion of the retinal field than those farther away. Artists have used this technique for centuries to depict depth in 2D drawings and paintings.


Credit: Images by Naomi / Mitch Altman via Flickr under a CC 2.0 license

Figure 3.28 Relative height is a concept where distant objects are seen or portrayed as being smaller and higher in relation to items that are closer.


Conclusion

As has been shown throughout this chapter, the human visual system is a remarkable sensing and interpretation mechanism capable of a high dynamic range of performance. Many of the capabilities explored have direct relevance to the overall subject matter of this book. For instance, the processes of vergence and accommodation have direct implications for the design of both fully immersive and augmented head-mounted displays. Understanding the various cues used by our visual system to perceive depth can contribute significantly to the actual design of virtual environments.

Throughout the remainder of this book there are numerous instances where we refer back to the content of this chapter, underscoring the importance of understanding how our primary sensory mechanism functions. Beyond this chapter, enthusiasts and practitioners alike are strongly encouraged to build upon their knowledge in this area by digging into the papers and other resources provided in Appendix A, "Bibliography," at the end of this book.
