Chapter 2. Seeing the World Anew

We are at the beginning of a massive change in how we see and experience reality. Computer vision, machine learning, new types of cameras, sensors, and wearable devices are extending human perception in extraordinary ways. Augmented Reality (AR) is giving us new eyes.

AR’s evolution as a new communications medium is rooted in the history of the moving image and early cinema. In 1929, pioneering filmmaker Dziga Vertov wrote about the power of the camera to depict a new reality: “I am a mechanical eye. I, a machine, show you the world as only I can see it.” Vertov’s famous film Man with a Movie Camera used innovative camera angles and techniques to defy the limitations of human vision.

Vertov experimented with novel vantage points, from filming atop moving vehicles like a motorcycle to placing a camera on the train tracks as a train passed overhead. He also explored a new sense of time and space by superimposing images and by speeding up and slowing down film. Vertov used the emerging technology of the mechanical camera to extend the capabilities of the human eye and create new ways of seeing. He wrote, “My path leads to the creation of a fresh perspective of the world. I decipher in a new way a world unknown to you.”

Nearly a century later, Vertov’s path has led us to AR revealing a new reality and understanding of our world. The camera plays a central role in how AR technology traditionally works: a camera is paired with computer vision to scan and decipher our physical surroundings. AR previously relied heavily on fiducial markers (black and white geometric patterns) or images to augment two-dimensional (2-D) surfaces, such as a print magazine.

The real world, however, is not flat; we experience it in three-dimensional (3-D) space. Unlike 2-D markers and images, 3-D depth-sensing cameras such as the Microsoft Kinect and Intel’s RealSense can recognize, map, and understand our spatial surroundings. These cameras are replacing fiducial markers and images, changing the way computers see, translate, and augment 3-D environments.

Vertov’s work explored how the camera, as a mechanical eye, could defy the limits of human vision. He presented novel perspectives depicting what it would be like if a human could see like a camera. Depth-sensing cameras like Kinect and RealSense pose the opposite question: what if a camera and computer could see like a human? AR technology is beginning to mimic human perception, allowing us to see in completely new ways.

You Are the Controller

When introduced in 2010, Kinect changed the way we experienced AR. Kinect’s tag line was “You are the controller.” By simply moving your body as you naturally would, you triggered and directed the AR experience.

Prior to Kinect, for AR to appear on your body, you would need to cover yourself in 2-D fiducial markers, have an image printed on your clothing, or get an AR tattoo. But with Kinect, the experience instantly became more immersive because there was no barrier between you and the augmentation; it was you. Standing in front of a screen powered by Kinect, you could see and interact with a transformed version of yourself, as though standing in front of a magical digital mirror. The augmentations followed your movement and responded to your gestures, creating an experience unique to you.

Artists immediately embraced Kinect as a creative tool for building new types of interactive experiences. Chris Milk’s “The Treachery of Sanctuary” (2012) is a beautiful example of Kinect used in an art installation. You are invited to stand in front of a series of three interactive panels that represent the creative process through birth, death, and regeneration. Your body is mirrored back to you as a dark shadow, with different transformations occurring in each panel. In the first panel, your body disintegrates into rising birds. As you move to the second panel, those same birds swoop in to assail you. In the third and final panel, your body sprouts giant wings, and by flapping your arms, your form takes flight, rising off the ground and ascending into the sky.

Milk writes in an artist statement:1

What is interesting to me is the two-way conversation between the work and the viewer. The participant is an active character in the content and concept of the piece, and while the technology allows that interactivity, the emphasis is on the experience, on transcending past the enabling innovation to the spiritual immersion.

Part of Kinect’s magic is that the technology becomes invisible because it is easy to use: you stand in front of it and move around. The experience is reactive to you; your body movements trigger what happens. The technology enables the experience, but without you, there is no content. The technology recedes into the background and you become the focus, quite literally.

Observing Movement and Predicting Activities

Kinect uses a depth-sensing camera to see the world in three dimensions. It works by projecting a pattern of infrared light points onto a room; by measuring how that pattern lands on the people and objects in view, the camera’s sensor can calculate the distance to each point. Software processes the depth data to identify any human shapes that might be in view, like heads or limbs. Kinect uses a skeletal model that breaks down the human body into multiple segments and joints. Programmed with more than 200 poses, the software understands how a human body moves and is able to predict what movement your body is likely to make next.
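To make the skeletal-model idea concrete, here is a minimal sketch of how this kind of pose classification can work, assuming hypothetical joint names and a tiny library of stored poses. It is an illustration of the concept, not Microsoft’s actual pipeline:

```python
import math

# Hypothetical skeletal frame: joint name -> (x, y, z) position in meters.
# Real Kinect tracking exposes around 20 joints; only a few are shown here.
POSE_LIBRARY = {
    "arms_raised": {"head": (0, 1.7, 2), "left_hand": (-0.4, 1.9, 2), "right_hand": (0.4, 1.9, 2)},
    "arms_down":   {"head": (0, 1.7, 2), "left_hand": (-0.3, 0.9, 2), "right_hand": (0.3, 0.9, 2)},
}

def pose_distance(frame, reference):
    """Sum of Euclidean distances between matching joints."""
    return sum(
        math.dist(frame[joint], reference[joint])
        for joint in reference
        if joint in frame
    )

def classify_pose(frame):
    """Return the stored pose closest to the observed skeletal frame."""
    return min(POSE_LIBRARY, key=lambda name: pose_distance(frame, POSE_LIBRARY[name]))

observed = {"head": (0.1, 1.68, 2.1), "left_hand": (-0.38, 1.85, 2.0), "right_hand": (0.42, 1.88, 2.1)}
print(classify_pose(observed))  # -> "arms_raised"
```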

Prediction is an important aspect of human perception that we use extensively in our daily activities to interact with our surroundings. Jeff Hawkins, the founder of Palm Computing (the company that gave us the first handheld computer) and author of the book On Intelligence (Times Books, 2004), describes the human brain as a memory system that stores and plays back experiences to help us predict what will happen next.

Hawkins points out that the human brain is making constant predictions about what is going to happen in our environment. We experience the world through a sequence of patterns that we store and recall, which we then use to match up against reality to anticipate what will happen next.

Using Kinect, researchers at Cornell University’s Personal Robotics Lab programmed a robot to anticipate human actions and assist in tasks like pouring a drink or opening a refrigerator door. The robot observes your body movements to detect what action is currently taking place. It accesses a video database of about 120 household activities (ranging from brushing your teeth, to eating, to putting food in the microwave) to predict what movement you will make next. The robot then plans ahead to assist you in that task.
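As a rough illustration of activity anticipation (not the Cornell system itself), the sketch below counts which activity tends to follow which in a handful of made-up sequences and predicts the most likely next step:

```python
from collections import Counter, defaultdict

# Hypothetical training data: observed sequences of household activities.
# The Cornell system learned from video; this toy version just counts
# which activity tends to follow which.
sequences = [
    ["open_fridge", "take_bottle", "pour_drink"],
    ["open_fridge", "take_bottle", "pour_drink"],
    ["open_fridge", "take_food", "use_microwave"],
]

transitions = defaultdict(Counter)
for seq in sequences:
    for current, nxt in zip(seq, seq[1:]):
        transitions[current][nxt] += 1

def predict_next(current_activity):
    """Return the activity most often observed after the current one."""
    following = transitions.get(current_activity)
    return following.most_common(1)[0][0] if following else None

print(predict_next("take_bottle"))  # -> "pour_drink"
```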

Building a 3-D Map with SLAM Technology

For a robot to move through an environment and perform such activities, it needs to be able to create a map of its surroundings and understand its location within it. Roboticists developed Simultaneous Localization and Mapping (SLAM) technology to accomplish this task. The sensors traditionally required to build such maps were expensive and bulky; Kinect introduced an affordable and lightweight alternative. Videos of Kinect-enabled robots appeared on YouTube within weeks of the device’s release, ranging from a quadrotor flying autonomously around a room to a robot capable of navigating rubble to find earthquake survivors.
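A toy sketch of the SLAM idea, with made-up numbers and a crude blending step standing in for a proper filter, shows the two intertwined tasks: correcting the robot’s pose against landmarks it has already mapped, and adding new landmarks to the map using that corrected pose:

```python
# A toy illustration of the SLAM idea (not a production algorithm):
# the robot keeps a pose estimate and a map of landmark positions,
# updating both as it moves and re-observes landmarks it has seen before.

robot = [0.0, 0.0]          # estimated (x, y) pose
landmark_map = {}           # landmark id -> estimated (x, y) position

def step(odometry, observations):
    """odometry: (dx, dy) motion estimate.
    observations: {landmark_id: (rel_x, rel_y)} relative to the robot."""
    # 1. Predict: move the pose estimate by the odometry reading.
    robot[0] += odometry[0]
    robot[1] += odometry[1]

    # 2. Correct: re-observed landmarks pull the pose back toward consistency.
    for lid, (rx, ry) in observations.items():
        if lid in landmark_map:
            lx, ly = landmark_map[lid]
            # Where the robot would be if the stored landmark position is right.
            implied = (lx - rx, ly - ry)
            # Blend prediction and implied position (a crude stand-in for a filter).
            robot[0] = 0.5 * robot[0] + 0.5 * implied[0]
            robot[1] = 0.5 * robot[1] + 0.5 * implied[1]

    # 3. Map: add newly seen landmarks using the corrected pose.
    for lid, (rx, ry) in observations.items():
        if lid not in landmark_map:
            landmark_map[lid] = (robot[0] + rx, robot[1] + ry)

step((1.0, 0.0), {"door": (2.0, 0.5)})   # first sighting: add "door" to the map
step((1.0, 0.0), {"door": (1.1, 0.5)})   # re-sighting: correct the drifting pose
print(robot, landmark_map)
```

A real system replaces the crude blending with a probabilistic filter, such as an extended Kalman filter or a particle filter, and must cope with noisy sensors and ambiguous landmark matches.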

Google’s self-driving car also uses SLAM technology with its own cameras and sensors. The car processes both map and sensor data to determine its location and to detect objects around it based on their size, shape, and movement. Software predicts what those objects might do next, and the car performs a responsive action, such as yielding to a pedestrian crossing the street.

SLAM is not limited to autonomous vehicles, robots, or drones; humans can use it to map their environment, too. MIT developed one of the first examples of a wearable SLAM device for human users. The system was initially designed for emergency personnel like first responders who enter unknown territory. With a Kinect camera worn on the chest, a digital 3-D map is built in real time as the user moves through the environment. Specific locations can be annotated using a handheld pushbutton. The map can be shared and immediately transferred wirelessly to an offsite commander.

SLAM also makes possible new forms of game play. Ball Invasion, created in 2011 by 13th Lab in Stockholm, Sweden, is an early example of integrating SLAM into an AR game. Holding your iPad in front of you, you see your physical surroundings filled with virtual targets to shoot and chase. What made Ball Invasion unique was that the virtual elements interacted with your physical world: virtual bullets bounced off the wall in front of you, and virtual invading balls hid behind your furniture. As you played and moved the iPad’s camera around, you built a real-time 3-D map of the environment to enable these interactions. In 2012, 13th Lab released PointCloud, an iOS Software Development Kit (SDK) with SLAM 3-D technology for app developers. 13th Lab was acquired by VR technology company Oculus in 2014.

Today, SLAM is one of the underlying technologies behind Google’s Tango AR computing platform. In 2015, tablet development kits for Tango became available to professional developers first, with Tango-enabled smartphones introduced later in 2016 (the Lenovo Phab 2 Pro) and 2017 (the Asus ZenFone AR). Tango makes possible experiences such as precise navigation without GPS, windows into virtual 3-D worlds, measuring spaces in real time, and games that know where they are in a room and what’s around them. Google describes2 the goal of Tango as giving “mobile devices a human-scale understanding of space and motion.”
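As a hedged illustration of one such capability, measuring a space from a depth image can be as simple as back-projecting two picked pixels through the camera’s intrinsics and taking the distance between them. The intrinsic values below are hypothetical; a real device reports its own:

```python
import math

# Back-project a depth-image pixel into a 3-D point using pinhole intrinsics.
# fx, fy, cx, cy are hypothetical values for illustration only.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

def deproject(u, v, depth_m):
    """Pixel (u, v) with depth in meters -> (x, y, z) in camera coordinates."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return (x, y, depth_m)

def measure(p1, p2):
    """Distance in meters between two (u, v, depth) picks on the depth image."""
    return math.dist(deproject(*p1), deproject(*p2))

# Two corners of a table top, each tapped on screen with its sensed depth.
print(round(measure((100, 240, 1.5), (540, 240, 1.5)), 2), "meters")
```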

Our smartphones are already an extension of ourselves, and with advancements like Tango, smartphones are beginning to see, learn, and understand the world as we do. This will give rise to new types of interactions in which the virtual is seamlessly mapped to our physical reality and is contextually relevant, creating a deeper sense of immersion. The lines between the virtual and the real will blur even more. Technology will not only understand our surroundings, but perhaps help guide us through our daily lives with a newfound intelligence and awareness.

Helping the Blind to See

If we can bring vision to computers and tablets, why not use that same technology to help people see? Rajiv Mongia, director of the Intel RealSense Interaction Design Group, and his team have developed a portable prototype of a wearable device that uses RealSense 3-D camera technology to help people who are vision-impaired gain a better sense of their surroundings.

Demonstrated live on stage at the 2015 International Consumer Electronics Show (CES) in Las Vegas, the RealSense Spatial Awareness Wearable consists of a vest fitted with a computer that connects wirelessly to eight thumb-sized vibrating sensors worn across the chest, torso, and near the ankles of each leg. It uses depth information to sense the environment around the wearer and sends feedback through haptic technology, using vibration motors to provide tactile cues.

The vibration sensors are comparable to the vibration mode on your phone, with the intensity of the vibration proportional to how close an object is to you: the closer the object, the stronger the vibration; the farther away, the weaker.
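A minimal sketch of this proximity-to-vibration mapping might look like the following; the ranges and the linear falloff are assumptions for illustration, not Intel’s implementation:

```python
# A hedged sketch of the proximity-to-vibration mapping described above
# (hypothetical ranges and scaling; not Intel's implementation).

MIN_RANGE_M = 0.3   # anything closer vibrates at full strength
MAX_RANGE_M = 3.0   # anything farther produces no vibration

def vibration_strength(distance_m):
    """Map an object's distance to a motor strength between 0.0 and 1.0."""
    if distance_m <= MIN_RANGE_M:
        return 1.0
    if distance_m >= MAX_RANGE_M:
        return 0.0
    # Linear falloff: closer objects produce stronger vibration.
    return (MAX_RANGE_M - distance_m) / (MAX_RANGE_M - MIN_RANGE_M)

for d in (0.2, 1.0, 2.5, 4.0):
    print(f"{d} m -> strength {vibration_strength(d):.2f}")
```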

Darryl Adams, a technical project manager at Intel, has been testing the system. Adams was diagnosed with retinitis pigmentosa 30 years ago and says the technology allows him to make the most of the vision he does have by augmenting his peripheral vision with the sensation of touch.

For me, there is tremendous value in the ability to recognize when change occurs in my periphery. If I am standing still and I feel a vibration, I am instantly able to turn in the general direction to see what has changed. This would typically be somebody approaching me, so in this case I can greet them, or at least acknowledge they are there. Without the technology, I typically miss this type of change in my social space so it can often be a bit awkward.

The system was tested on three wearers, each with very different needs and levels of vision, from low vision to fully blind. Mongia and his team are working on making the system scalable with modular hardware components to allow users to select the combination of sensors and haptic output that best suits their specific situation.

Adams would like to see the software become context-aware so that the system can respond to the wearer’s needs in any given scenario. He thinks this technology could evolve to include features like facial recognition or eye tracking. This way, the wearer can be alerted when someone is looking at them rather than just knowing there is a person nearby.

Artificial Intelligence (AI) could further be integrated to provide wearable systems with a better understanding of the wearer’s context. Methods like machine learning can help give computers some of the abilities of a human brain, enabling computer programs to learn to perform new tasks when exposed to new data, without being explicitly programmed for those tasks.

Teaching a Computer to See with Machine Learning

OrCam, a device designed for the visually impaired, uses machine learning to help wearers interpret and better interact with their physical surroundings. The device can read text and recognize things like faces, products, and paper currency.

The OrCam device features a camera that clips onto a pair of glasses and continuously scans the wearer’s field of view. The camera is connected by a thin cable to a portable computer that fits in your pocket. Instead of vibration sensors (like those in the RealSense Spatial Awareness Wearable), OrCam uses audio: a bone-conduction speaker reads text aloud and announces the names of objects and people to the wearer.

With OrCam, the wearer shows the device what he is interested in by pointing. “Point at a book, the device will read it,” says Yonatan Wexler,3 head of research and development at OrCam. “Move your finger along a phone bill, and the device will read the lines letting you figure out who it is from and the amount due.” To teach the system to read, it is repeatedly shown millions of examples so that the algorithms learn to focus on relevant and reliable patterns.

Wexler says there is no need to point when identifying people and faces. “The device will tell you when your friend is approaching you. It takes about ten seconds to teach the device to recognize a person,” he says. “All it takes is having that person look at you and then stating their name.” OrCam takes a photo of the person and stores it within the system’s memory. The next time the camera views the person, the device will recognize that person, and even identify them by name.

OrCam uses machine learning to recognize faces. The research and development team provided OrCam with hundreds of thousands of images of all different kinds of faces in order to teach the program how to recognize an individual face. When a user is wearing OrCam, the program sorts through the stored images, rejecting those that do not match, until only the one matching picture remains. This process of face recognition takes only moments and occurs each time the wearer encounters someone they have photographed with the device.
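OrCam’s matching pipeline is proprietary, but a common way to implement this kind of recognition is to compare face embeddings, numeric vectors produced by a trained model, against a stored gallery. The sketch below uses made-up names and vectors purely for illustration:

```python
import math

# Hypothetical stored gallery: name -> face embedding (a numeric vector
# produced by some face-recognition model; the values here are made up).
gallery = {
    "Ada":   [0.12, 0.80, 0.35],
    "Grace": [0.65, 0.10, 0.72],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def identify(embedding, threshold=0.9):
    """Return the best-matching stored name, or None if nothing is close enough."""
    name, score = max(
        ((n, cosine_similarity(embedding, e)) for n, e in gallery.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None

print(identify([0.13, 0.79, 0.36]))  # -> "Ada"
print(identify([0.90, 0.90, 0.90]))  # -> None (no confident match)
```

A real deployment would also need a detection step to find, crop, and align the face in the camera frame before computing its embedding.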

Training the Brain to See with Sound

OrCam is trained to see your world and provide an oral translation of your immediate surroundings. A different approach is taken by vision technologies like the vOICe and EyeMusic. Instead of using machine learning to tell the wearer what she is looking at, these technologies explore how the human brain can be trained to see with other senses, effectively learning how to see with sound.

Neuroscientist Amir Amedi asks, “What if we found a way to deliver the visual information in the brain of individuals who are visually impaired, bypassing the problems in their eyes?” Brain imaging studies conducted by Amedi and his team show that when people who have been blind since birth use systems like the vOICe and EyeMusic to “see,” they activate the same category-dependent processing areas of the brain as people who are sighted. However, instead of traveling through the visual cortex, the signal enters the brain through the auditory cortex and is diverted to the proper spot of the brain.

The vOICe system (OIC = “Oh, I See”) translates images from a camera into audio signals to help people who are congenitally blind see. Developed by Peter Meijer, the vOICe consists of a pair of sunglasses fitted with a small camera that is connected to a computer and a pair of headphones. (The system can also be used on a smartphone by downloading the software and using the phone’s built-in camera.)

The vOICe software turns your surroundings into a “soundscape.” The camera continuously scans the environment from left to right, converting each pixel into a beep: the beep’s frequency encodes the pixel’s vertical position, and its volume encodes the pixel’s brightness. Brighter objects are therefore louder, and higher-placed objects sound higher in pitch.
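A simplified sketch of that pixel-to-beep mapping (the real vOICe software is more sophisticated) might look like this:

```python
# A hedged sketch of the pixel-to-sound mapping described above.

def image_to_soundscape(image, low_hz=500.0, high_hz=5000.0):
    """image: rows of brightness values 0-255, row 0 at the top.
    Yields (time_step, frequency_hz, volume) beeps, scanning left to right."""
    height = len(image)
    width = len(image[0])
    for column in range(width):                  # left-to-right scan
        for row in range(height):
            brightness = image[row][column]
            if brightness == 0:                  # black pixels are silent
                continue
            # Higher rows map to higher frequencies.
            frequency = low_hz + (high_hz - low_hz) * (height - 1 - row) / (height - 1)
            volume = brightness / 255.0          # brighter pixels are louder
            yield column, frequency, volume

tiny_image = [
    [0, 255, 0],
    [128, 0, 0],
]
for beep in image_to_soundscape(tiny_image):
    print(beep)
```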

Amedi and his colleagues have trained people who were born blind to “see” using the vOICe and EyeMusic, a more recent app developed by Amedi that additionally conveys color, assigning a different musical instrument to each one. For example, blue is signified by a trumpet, red by chords played on an organ, and yellow by a violin. White is represented by human voices, and black by silence.

Amedi says training one’s brain to learn to see this way takes about 70 hours. Users are taught how to identify broad categories of objects, including faces, bodies, and landscapes. Each is processed in the visual cortex of the brain. “Everyone thinks that the brain organizes according to the senses, but our research suggests this is not the case,” says Amedi.4 “The human brain is more flexible than we thought.”

Research and inventions like Amedi’s and Meijer’s show us that the traditional definition of what it means to see is changing. It will continue to change as both computers and the human brain are learning to see in new ways together.

Choose Your Own Reality

The ability to see and interpret our surroundings with the assistance of computer vision also makes it possible to filter our reality and selectively see, or unsee, the world around us. This includes the possibility of removing things from our reality that we do not wish to see.

Black Mirror, a popular television series satirizing modern technology, imagined the ability to block people in real life at the press of a button in the episode “White Christmas” (2014). Instead of seeing the person you have blocked, you see a white silhouette in the shape of a person and hear muffled sounds, and the blocked person sees the same of you. In 2010, years before the episode aired, Japanese developer Takayuki Fukatsu built a demonstration not so different from the technology it depicts. Using Kinect and openFrameworks, Fukatsu’s Optical Camouflage shows a human figure blending into the background until it becomes effectively invisible.
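A hedged sketch of the underlying effect, assuming a person mask already provided by a depth sensor and a background frame captured while the scene was empty (an illustration of the idea, not Fukatsu’s code):

```python
# "Optical camouflage" sketch: paint the person's pixels with a background
# frame captured while the scene was empty, so the figure seems to vanish.

def camouflage(frame, background, person_mask):
    """frame, background: 2-D grids of pixel values of equal size.
    person_mask: same shape, True where a person was detected."""
    return [
        [
            background[y][x] if person_mask[y][x] else frame[y][x]
            for x in range(len(frame[0]))
        ]
        for y in range(len(frame))
    ]

frame      = [[9, 9, 9], [9, 9, 9]]
background = [[1, 2, 3], [4, 5, 6]]
mask       = [[False, True, True], [False, True, False]]
print(camouflage(frame, background, mask))
# -> [[9, 2, 3], [9, 5, 9]]  (the "person" pixels vanish into the background)
```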

Dr. Steve Mann is a professor of electrical engineering and computer science at the University of Toronto and is referred to as “the father of wearable computing.” Mann defined the term “Mediated Reality” in the 1990s. He says,5 “Mediated Reality differs from virtual reality (or augmented reality) in the sense that it allows us to filter out things we do not wish to have thrust upon us against our will.” For Mann, wearable computing devices provide the user with “a self-created personal space.” Mann has used Mediated Reality to substitute personal notes and directions in place of advertisements.

New media artist Julian Oliver credits Mann’s work as inspiration for “The Artvertiser,” a Mediated Reality project initiated in 2008, which he developed in collaboration with Damian Stewart and Arturo Castro. The Artvertiser is a software platform that replaces billboard advertisements with art in real time. It works by teaching computers to recognize advertisements, which are then transformed into a virtual canvas upon which artists can exhibit images or video. The artwork is viewed through a handheld device that looks like a pair of binoculars.

Rather than referring to this as a form of AR technology, Oliver considers The Artvertiser to be an example of “Improved Reality.” He describes the project as a reclaiming of our public spaces from “read-only” to “read-write” platforms. The Artvertiser applies a subversive approach to reveal and temporarily intercept environments that are dominated by advertising.

Brand Killer (2015) is a contemporary project that builds upon Mann’s and Oliver’s work. Created by a group of University of Pennsylvania students, Tom Catullo, Alex Crits-Christoph, Jonathan Dubin, and Reed Rosenbluth, it blurs advertisements in real time for its wearer. The students ask, “What if we lived in a world where consumers were blind to the excesses of corporate branding?” Brand Killer is a custom-built head-mounted display that uses OpenCV image processing to recognize and block brands and logos from the user’s point of view in real time. It’s “AdBlock for Real Life,” they state.
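As a rough sketch of the approach (not the students’ code), the snippet below uses OpenCV template matching to locate one known logo in a frame and blur it. The file names are hypothetical, and a real system would need far more robust detection than a single template:

```python
# Find a known logo in a camera frame by template matching and blur it.
import cv2

def blur_logo(frame, logo, threshold=0.8):
    """frame, logo: BGR images loaded with cv2.imread. Returns the frame with
    the best logo match blurred out if the match is confident enough."""
    result = cv2.matchTemplate(frame, logo, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val >= threshold:
        x, y = max_loc
        h, w = logo.shape[:2]
        region = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(region, (51, 51), 0)
    return frame

# Hypothetical file names, for illustration only.
frame = cv2.imread("street_scene.jpg")
logo = cv2.imread("known_logo.png")
cv2.imwrite("filtered_scene.jpg", blur_logo(frame, logo))
```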

We already mediate our reality while we’re on the internet by blocking ads and even people with whom we no longer want to interact. Beyond advertising and other people, what else will we choose to remove or block from our vision with Mediated Reality?

As we design the future of AR, we will need to consider whether digitally filtering, mediating, and substituting content to one’s choosing will enhance our reality or separate us from the world and one another. It is my hope that these new technologies will be used in ways that support human interaction, connection, and communication, and even build empathy.

Although we are often inclined to erase from our reality things we do not want to see, such as homelessness, poverty, and sickness, these are realities that we, as a society, must actively address. Mediated Reality has the potential to foster a culture of avoidance and even ignorance. We should not turn a blind eye to the realities of reality.

The positive side of Mediated Reality is that it could be used as a way to provide focus. The technology has the potential to create a future with fewer distractions and more human-to-human moments. We are already bombarded by technology and notifications; what if Mediated Reality provided an easy way to switch off distractions completely, if only temporarily?

Another critical question is who will be authoring this new reality? Will it be individuals, corporations, or groups of people? Whose Mediated Reality will we be privy to and what visual filters or tools for interception will come to exist? To use Oliver’s terms, will we be part of a read-write environment, or read-only?

In the same way that the internet is read-write, I believe AR, with Mediated Reality a part of it, will be, too. Tim Berners-Lee, inventor of the World Wide Web, envisioned the internet as a place to share experiences in new and powerful ways. “The original thing I wanted to do was make it a collaborative medium, a place where we can all meet and read and write,” he says.6 The internet reframed the way we share and consume information and AR has the power to do this, as well.

With examples such as enabling the visually impaired to gain a form of sight, artists imagining new interactive experiences, and robots assisting humans in daily life, AR presents a new way of perceiving the world. AR has the ability to improve people’s lives and inspire life-changing ways of engaging with our surroundings and one another.

If we replace the word “machine” with the word “human” in Vertov’s sentiment at the beginning of this chapter, “I, a machine, show you the world as only I can see it,” we get the richness of what the internet enables: a global collection of shared stories of human experiences and perspectives. To have a positive impact on society and contribute to humanity in a meaningful way, AR will need to find ways to mirror the original vision for the World Wide Web and be inclusive, not exclusive.
