© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
J.-M. Chung, Emerging Metaverse XR and Video Multimedia Technologies, https://doi.org/10.1007/978-1-4842-8928-0_3

3. XR HMDs and Detection Technology

Jong-Moon Chung1  
(1)
Seoul, Korea (Republic of)
 
This chapter focuses on XR head-mounted displays (HMDs), the XR operation process, and XR feature extraction and detection technologies. Details of the XR technologies applied in state-of-the-art devices (including HMDs) are introduced in this chapter. Among XR operations, feature extraction and detection is the most computationally burdensome as well as the most energy- and time-consuming part of the XR process. The most important XR feature extraction and detection schemes are introduced in this book, which include scale-invariant feature transform (SIFT), speeded-up robust feature (SURF), features from accelerated segment test (FAST), binary robust independent elementary features (BRIEF), oriented FAST and rotated BRIEF (ORB), and binary robust invariant scalable keypoints (BRISK) (www.coursera.org/learn/ar-technologies-video-streaming). This chapter consists of the following sections:
  • XR HMDs

  • XR Operation Process

  • XR Feature Extraction and Description

XR HMDs

For an XR experience, HMDs provide the most immersive sense of presence, so XR HMDs play a significant role in metaverse and game services. Therefore, in this section, some representative XR HMDs are presented, including the Meta Quest 2, Microsoft HoloLens 2, HTC VIVE Pro 2, HTC VIVE Focus 3, Valve Index, Pimax Vision 8K X, and the PlayStation VR 2.

Meta Quest 2

The Facebook Oculus Quest is the predecessor of the Meta Quest 2. The Oculus Quest VR HMD was released by Facebook in May of 2019. The Quest HMD had 4 GB of random-access memory (RAM) and was equipped with a Qualcomm Snapdragon 835 system on chip (SoC) with four Kryo 280 Gold central processing unit (CPU) cores and an Adreno 540 graphics processing unit (GPU). The Quest’s introductory price was $399 for the 64 GB storage model and $499 for the 128 GB storage model (www.oculus.com/experiences/quest/).

In October of 2020, the Oculus Quest 2 was released, and following Facebook’s parent company name change to Meta, the device was rebranded as the Meta Quest 2 in November of 2021. The Quest 2 is a VR HMD equipped with a Qualcomm Snapdragon XR2 SoC with 6 GB RAM and an Adreno 650 GPU. The Quest 2’s initial price was $299 for the 64 GB storage model, which was later changed to 128 GB, and $399 for the 256 GB storage model.

The Meta Quest 2 software platform uses an Android operating system (OS). For a first-time Quest 2 device setup, a smartphone running the Oculus app is needed. The Quest 2 software stack is based on Android 10 and runs on the device’s Qualcomm Snapdragon XR2 SoC and Adreno 650 GPU, which delivers approximately 1.2 TFLOPS. The platform allows up to three additional accounts to log onto a single Quest 2 headset, and sharing software purchased from the online Oculus Store is permitted. The platform supports the Oculus Touch controllers and recently added experimental Air Link wireless streaming as well as experimental Passthrough APIs for AR applications (https://en.wikipedia.org/wiki/Oculus_Quest_2).

Compared to the Facebook Oculus Quest, the Meta Quest 2 is about 10% lighter and has more ergonomic controllers. The device is powered by a 3,640 mAh battery that supports approximately 2 hours of gaming or 3 hours of video playback. A full charge through the USB-C power adapter takes about 2.5 hours. The display has a 1832×1920 pixel resolution per eye. The device supports 6DoF inside-out tracking using four infrared cameras along with accelerometer and gyroscope inertial measurement units (IMUs). Metaverse, game, and multimedia applications run on the Quest OS, which is based on Android 10 source code. Hand tracking and gesture recognition are conducted by a deep learning system based on the four cameras. More details on deep learning hand recognition and gesture recognition are provided in Chapter 4.

Two grayscale photographs of the Meta Quest 2, a head-mounted device with sensors, and its controllers.

Figure 3-1

Meta Quest 2

Figure 3-1 presents the Meta Quest 2 headset (HMD) and controllers; on the HMD, the motion (hand) sensors near the bottom of the device (which look like two black dots) can be seen. Figure 3-2 presents the Meta Quest 2 inside view (based on SIMNET graphics), where the views of the two eyes are shown separately in the top picture and combined in the bottom picture.

Two photographs present the inside view of the Meta Quest 2, showing eye- and flame-like graphics with color gradients.

Figure 3-2

Meta Quest 2 Inside View. Eyes Separate (Top) and Together (Bottom)

Microsoft HoloLens 2

Microsoft HoloLens 2 was released in 2019. It is equipped with a Qualcomm Snapdragon 850 SoC with 4 GB of DRAM and 64 GB of storage. The device supports IEEE 802.11ac Wi-Fi, Bluetooth 5.0, and USB Type-C connectivity. The Microsoft HoloLens 2 uses a holographic processing unit with a holographic resolution of 2K (1440×936) at a 3:2 display ratio, supporting a holographic density above 2.5 k radiants (light points per radian). Head tracking is conducted through four visible light cameras, and two infrared (IR) cameras are used for eye tracking. The system is based on the Windows Holographic Operating System and Windows 10, with Microsoft Azure cloud computing support when needed. The two front-mounted cameras are used to scan the surrounding area and are supported by accelerometer, gyroscope, and magnetometer based IMUs to provide accurate 6DoF world-scale positional tracking. The onboard camera can capture 8 MP still pictures or 1080p (1920×1080) video at 30 frames/s. The battery enables 2 to 3 hours of active use with up to 2 weeks of standby time. The supplied charger is rated at 18 W (i.e., 9 V at 2 A), but a charger of at least 15 W can be used. The Microsoft HoloLens 2 includes a two-handed, fully articulated hand tracking system and supports real-time eye tracking. On-device voice command and control is based on Cortana natural language processing.

Two grayscale photographs of the Microsoft HoloLens 2, a head-mounted device with sensors and controls.

Figure 3-3

Microsoft HoloLens 2

Two grayscale photographs present a closer view of the Microsoft HoloLens 2 headset, whose visor displays the viewed objects in multiple colors.

Figure 3-4

Microsoft HoloLens 2 Inside View

A dark screen displays the selection menu of Microsoft HoloLens 2 in bright colors.

Figure 3-5

Microsoft HoloLens 2 Selection Menu Using AR

Figure 3-3 shows the Microsoft HoloLens 2 device’s front view (left picture) and top view (right picture), Figure 3-4 shows the Microsoft HoloLens 2 inside view with augmented objects imposed on the background view, and Figure 3-5 shows a Microsoft HoloLens 2 inside-view selection menu using AR (based on HanbitSoft graphics).

The Microsoft Windows MR platform is part of the Windows 10 OS that provides MR and AR services for compatible HMDs, which include the HoloLens, HoloLens 2, Lenovo Explorer, Acer AH101, Dell Visor, HP WMR headset, Samsung Odyssey, Asus HC102, Samsung Odyssey+, HP Reverb, Acer OJO, and the HP Reverb G2 (https://docs.microsoft.com/en-us/windows/mixed-reality/mrtk-unity/architecture/overview?view=mrtkunity-2021-05).

HTC VIVE Pro 2

HTC VIVE Pro 2 is a tethered (non-standalone) MR HMD developed by HTC VIVE that was released in June of 2021. As an MR device, it is equipped with both AR and VR technologies. The VIVE Pro 2 costs $799 for the headset only and $1,399 for the full kit, which includes the Base Station 2.0 and controllers (www.vive.com/kr/product/vive-pro2-full-kit/overview/). The HMD display has a 2448×2448 per-eye resolution that supports a FoV up to 120 degrees, and the device weighs 855 g. Tracking is conducted with SteamVR 2.0 based on its G-sensors, gyroscope, proximity, and interpupillary distance (IPD) sensors. IPD sensors are used to detect the distance between the centers of the pupils. The device uses a proprietary link box cable and a USB 3.0 cable and enables connectivity through USB Type-C interfaces and Bluetooth wireless networking. The 5K-resolution MR HMD is recommended to be driven by a Windows 10 computer with an Intel Core i5-4590 or AMD Ryzen 1500 CPU, an NVIDIA GeForce RTX 2060 or AMD Radeon RX 5700 GPU, and 8 GB RAM or more (www.tomshardware.com/reviews/htc-vive-pro-2-review). The Base Station 2.0 system in the VIVE Pro 2 Full Kit supports room-scale immersive presence VR applications; it works with the VIVE Pro series or Cosmos Elite headsets, and the controllers assist exact location tracking. As a non-standalone device, the VIVE Pro 2 needs to be tethered to a high-performance Windows OS computer, whose minimum requirements are a Windows 10 OS with an Intel Core i5-4590 or AMD Ryzen 1500 CPU, an NVIDIA GeForce GTX 1060 or AMD Radeon RX 480 GPU, and 8 GB RAM.

HTC VIVE Focus 3

HTC VIVE Focus 3 is a standalone VR HMD developed by HTC VIVE that was released in August of 2021. The HMD is equipped with a Qualcomm Snapdragon XR2 (Snapdragon 865) SoC and runs on an Android OS (www.vive.com/kr/product/vive-focus3/overview/). The HMD display has a 2448×2448 per eye resolution that supports a 120 degree FoV and is equipped with 8 GB RAM and 128 GB storage memory. Tracking is conducted with the HTC VIVE inside-out tracking system based on its Hall sensors, capacitive sensors, G-sensors, and gyroscope sensors (www.tomshardware.com/reviews/vive-focus-3-vr-headset).

Valve Index

Valve Index is a VR HMD developed by Valve that was first introduced in 2019 (www.valvesoftware.com/en/index). The HMD must be tethered (via a cable with DisplayPort 1.2 and USB 3.0) to a high-performance computer running a Windows or Linux OS that supports SteamVR. The device conducts motion tracking with SteamVR 2.0 sensors. The HMD weighs 809 g and costs $499 for the headset only and $999 for the full kit, which includes the headset, controllers, and Base Station 2.0 (www.valvesoftware.com/ko/index). The HMD display has a 1440×1600 per-eye resolution that supports an adjustable FoV up to 130 degrees, and the device is equipped with a 960×960 pixel camera (www.tomshardware.com/reviews/valve-index-vr-headset-controllers,6205.html).

Pimax Vision 8K X

The Pimax Vision 8K X is a high-resolution HMD developed by Pimax. The HMD was first introduced at CES 2020, weighs 997.9 g, and costs $1,387 (https://pimax.com/ko/product/vision-8k-x/). The HMD display has a 3840×2160 per-eye resolution in native mode and a 2560×1440 per-eye resolution in upscaled mode, and it supports a FoV up to 170 degrees. The device uses SteamVR tracking technology with a nine-axis accelerometer sensor (www.tomshardware.com/reviews/pimax-vision-8k-x-review-ultrawide-gaming-with-incredible-clarity). The Pimax Vision 8K X HMD needs to be tethered to a high-performance Windows OS computer, in which the USB 3.0 connection is used for data communication and the USB 2.0 connection can be used for power supply. The tethered computer needs to be equipped with a 64-bit Windows 10 Pro/Enterprise OS (version 1809 or higher), an Intel i5-9400 CPU (or higher), an NVIDIA GeForce RTX 2080 (or at least an RTX 2060), and at least 16 GB of RAM.

PlayStation VR 2

PlayStation VR 2 is an HMD developed by Sony Interactive Entertainment for VR games, especially for the PlayStation 5 VR Games (www.playstation.com/ko-kr/ps-vr2/). The device was first introduced at CES 2022. PlayStation VR 2 needs to be tethered (via USB Type-C wire connection) to a PlayStation 5 player. The HMD weighs less than 600 g. The PlayStation VR 2 has a 2000×2040 per eye panel resolution display with a FoV of approximately 110 degrees. Four cameras are used for tracking of the headset and controller, and IR cameras are used for eye tracking. The HMD can provide vibration haptic effects and is equipped with a built-in microphone and stereo headphones.

XR Operation Process

XR is a combination of MR, AR, VR, and haptic technologies. XR uses AR to enable virtual text or images to be superimposed on selected objects in the real-world view of the user, in which the superimposed virtual text or image is related to the selected object. AR technology is a special type of vision-based context-aware computing. For an AR user, the real-world and virtual information and objects need to coexist in the same view in harmony without disturbing the user’s view, comfort, or safety. However, AR technology is very difficult to implement because the computer-generated virtual text or image needs to be superimposed on the XR user’s real-world view within a few milliseconds. The details of AR processing are described in the following section. The AR process is complicated and time-consuming, so accurately completing it within a time limit of a few milliseconds is very difficult. If the XR user changes head or eye direction, the focus of sight changes, and the entire process has to be continuously updated and refreshed within the same time limit. This is why AR development is so much more difficult and slower compared to VR system development.

The AR operation process includes image acquisition, feature extraction, feature matching, geometric verification, associated information retrieval, and superimposing the associated information/object on the XR device’s display. These processes can be conducted without artificial intelligence (AI) technology (i.e., “non-AI” methods) or using AI technology. There are a variety of non-AI based AR feature detection and description algorithms, of which the SIFT, SURF, FAST, BRIEF, ORB, and BRISK schemes are described in this chapter. AI based AR technologies commonly use deep learning systems with convolutional neural network (CNN) image processing engines, which are described in Chapter 4.

AR user interface (UI) types include TV screens, computer monitors, helmets, facemasks, glasses, goggles, HMDs, windows, and windshields.

Among handheld AR displays, smartphones were the easiest platform on which to launch commercial services. Powerful computing capability, good cameras and displays, and portability make smartphones a great platform for AR. Figure 3-6 shows a smartphone-based AR display example, and Figure 3-7 presents an AR research and development environment in my lab at Yonsei University (Seoul, South Korea) connected to a smartphone.

AR eyeglass products include the Google Glass, Vuzix M100, Optinvent, Meta Space Glasses, Telepathy, Recon Jet, and Glass Up. However, the large amount of image and data processing that AR devices have to conduct requires a significant amount of electronic hardware to execute long software programs. As a result, AR eyeglasses pose a significant challenge because there is very little space to implement the AR system within the small eyeglass frame. In the near future, XR smart eyeglasses will become significantly popular products, but much more technical advancement is required to reach that level. Hopefully, this book will help you and your company reach this goal faster.

An AR HMD is a mobile AR device that can provide an immersive AR experience. The earliest examples include HMDs for aircraft maintenance and aviation assistance.

A photograph illustrates how the framed Mona Lisa painting is detected in the mobile interface, with details of its artist and painting overlaid.

Figure 3-6

Smartphone Based AR Process Example

Three photos of a computer with an AR cloud server, a smartphone, and power measurement equipment connected to a Wi-Fi access point, along with its front and rear views.

Figure 3-7

Example AR Development Setup in My Lab

An illustration of how the protocol interrelates the content provider server and the user through the software framework and an application program or browser.

Figure 3-8

AR Technological Components

AR Technology

As presented in Figure 3-8, AR services need to be supported by the user’s XR device and the content provider (CP) server. In standalone XR devices, the CP server is on the XR device. In tethered non-standalone XR devices, the CP server is commonly located in a local computer or edge cloud. The CP server may include 3D graphical assets, geographic information, text, image and movie libraries, an AR content container, and point of interest information. The user’s XR device contains recognition, visualization, and interaction functions that are based on the software framework, application programs, and browser. The recognition functions support computer vision, markers and features, sensor tracking, and the camera application programming interface (API). The visualization functions support the computer graphics API, video composition, and depth composition. The interaction functions support the user interface (UI), gestures, voice, and haptics.

An illustration of how the AR system detects, tracks, interacts, and renders visual feedback to the user, who then provides further input.

Figure 3-9

AR Workflow

Figure 3-9 shows the AR workflow, which starts with detection of the environment as well as the objects and people within the image. Based on the application’s objective, the AR system selects objects and people within the image and continuously tracks them. Even though the objects and people may not move, the XR device user’s view changes often (through head and eye movement), so the objects and people within the user’s view change, which is why tracking is always needed. Based on user interaction and the application’s logic, the augmented information/image needs to be generated and included in the AR user’s display image through the image rendering process. Visual feedback of the rendered image (which includes the augmented information/image) is displayed on the user’s screen (www.researchgate.net/publication/224266199_Online_user_survey_on_current_mobile_augmented_reality_applications).

AR Feature Detection and Description Technology

AR feature detection requires the characteristic primitives of an image to be accurately identified. The identified objects need to be highlighted using unique visual cues so the user can notice them. Various feature detection techniques exist, and their common objective is efficient and effective extraction of stable visual features. Factors that influence AR feature detection include the environment, changes in viewpoint, image scale, resolution, lighting, etc. AR feature detector requirements include robustness against changing imaging conditions and satisfaction of the application’s Quality of Service (QoS) requirements (www.researchgate.net/publication/220634788_Mobile_Visual_Search_Architectures_Technologies_and_the_Emerging_MPEG_Standard).

During the Interest Point Detection (IPD) process, the detected interest points can be used to obtain feature descriptors, in which circular or square regions centered at the detected interest points are added, where the size of the (circular or square) regions is determined by the scale. Figure 3-10 shows a feature detection and IPD example.

A grayscale photo of a Coca-Cola can and a color photo of a sunflower garden. Both pictures contain several small circles marking the detected objects.

Figure 3-10

AR Feature Detection and IPD Example
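To make this concrete, the following is a minimal sketch (in Python with OpenCV, which is assumed to be installed) that detects interest points and draws scale-sized circles around them, in the spirit of Figure 3-10; the image path "scene.jpg" is a placeholder.

```python
# Sketch: detect interest points and draw scale-sized circles around them,
# similar in spirit to the IPD visualization in Figure 3-10.
# Assumes OpenCV (cv2) is installed and "scene.jpg" is a placeholder image path.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

detector = cv2.ORB_create(nfeatures=300)   # any multi-scale detector works here
keypoints = detector.detect(gray, None)

# DRAW_RICH_KEYPOINTS draws a circle whose radius reflects the keypoint scale
# and a line indicating its orientation.
vis = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("interest_points.jpg", vis)
```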

Feature Detection and Feature Description

Feature detection is a process where features in an image with unique interest points are detected. The feature descriptors characterize an image’s features using a sampling region. Figure 3-10 shows a feature detection example in which a circular image patch is applied around detected interest points.

There are AR algorithms that are capable of feature detection or feature description, and some can do both. For example, binary robust independent elementary features (BRIEF) is a feature descriptor scheme, and features from accelerated segment test (FAST) is a feature detector scheme. Schemes that do both feature detection and description include scale-invariant feature transform (SIFT), speeded-up robust features (SURF), oriented FAST and rotated BRIEF (ORB), and binary robust invariant scalable keypoints (BRISK).

There are two methods to conduct feature detection without using AI technology. Method-1 uses spectra descriptors that are generated by considering the local image region gradients. A “gradient” is a multivariable (vector) generalization of the derivative, where derivatives are applied to a single variable (scalar) in a function. Method-1 based algorithms include SIFT and SURF. Method-2 uses local binary features that are identified using simple point-pair pixel intensity comparisons. Method-2 based algorithms include BRIEF, ORB, and BRISK.

AR schemes that mix and match feature detection and feature description algorithms are commonly applied. Feature detection results in a specific set of pixels at specific scales identified as interest points. Different feature description schemes can be applied to different interest points to generate unique feature descriptions. Scheme selection based on the application and the characteristics of the interest points is possible.
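As an illustration of such mixing and matching, the sketch below (assuming OpenCV is installed; "scene.jpg" is a placeholder path) uses FAST only for detection and ORB's rBRIEF descriptor only for description. This is one possible combination, not a prescribed pairing.

```python
# Sketch: mix-and-match example - FAST supplies the interest points and
# ORB's rBRIEF descriptor characterizes them.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)                # detection only (no descriptors)

orb = cv2.ORB_create()
keypoints, descriptors = orb.compute(gray, keypoints)   # description only

print(len(keypoints), "keypoints,", descriptors.shape, "binary descriptor matrix")
```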

AR System Process

The AR process consists of five steps: image acquisition, feature extraction, feature matching, geometric verification, and associated information retrieval, as presented in Figure 3-11. Image acquisition is the process of retrieving an image from the AR camera. Feature extraction is based on an initial set of measured data, where the extraction process generates informative, nonredundant values to facilitate the subsequent feature learning and generalization steps. Feature matching is the process of computing abstractions of the image information and making a local decision at every image point as to whether an image feature is present; feature matching needs support from the database. Geometric verification is the process of finding geometrically related images in the image dataset (which is a subset of the overall AR image database). Associated information retrieval is the process of searching and retrieving metadata, text, and content-based indexing information for the identified object. The associated information is then displayed on the XR device’s screen near (or overlapping) the corresponding object.

An illustration of how a picture of Seoul Tower taken with a mobile camera undergoes acquisition, extraction, and matching in an AR server, after which the associated data is obtained.

Figure 3-11

AR Process
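The following sketch illustrates the feature matching and geometric verification steps against a single reference image, assuming OpenCV and NumPy are installed; the image paths and the inlier threshold are illustrative, and a real AR system would match against a large database rather than one file.

```python
# Sketch: feature matching and geometric verification between a query frame
# and one database (reference) image.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)

query = cv2.imread("camera_frame.jpg", cv2.IMREAD_GRAYSCALE)     # image acquisition
ref = cv2.imread("database_object.jpg", cv2.IMREAD_GRAYSCALE)

kp_q, des_q = orb.detectAndCompute(query, None)                  # feature extraction
kp_r, des_r = orb.detectAndCompute(ref, None)

# Feature matching: Hamming distance for binary (ORB) descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_q, des_r), key=lambda m: m.distance)

# Geometric verification: a RANSAC homography rejects geometrically
# inconsistent matches; enough inliers means the object is recognized.
if len(matches) >= 4:
    src = np.float32([kp_q[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if inlier_mask is not None and inlier_mask.sum() > 20:       # illustrative threshold
        print("Object verified; retrieve and overlay its associated information")
```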

AR Cloud Cooperative Computation

As presented in the details above, the AR process is complicated and requires a large database and a lot of image and data processing. This is the same for both the AI based AR processes (using convolutional neural network (CNN) deep learning described in Chapter 4) and the non-AI based AR processes (e.g., SIFT, SURF, FAST, BRIEF, ORB, BRISK).

Some of the main issues with the AR process are listed in the following. First, not all object information can be stored on the XR device, which is why an external AR database is commonly needed. Second, XR requires a large amount of computation to complete its tasks, and the amount of XR processing directly influences the battery power consumption of the XR device. If more processing is conducted on the XR device, then the operation time of the device becomes shorter (www.itiis.org/digital-library/manuscript/1835). Third, more processing on the XR device causes the device to warm up faster, and a larger battery is needed, making the XR device heavier. Fourth, the device becomes much more expensive if more processing is conducted on the XR device, because a higher performing CPU and GPU and more memory are needed. Therefore, to share the processing load and overcome the database limitations of the XR device, XR cloud computation offloading can be used. Figure 3-12 presents an example of the AR process (the feature matching, geometric verification, and associated information retrieval parts) conducted using a cloud offloading server.

An illustration of how the Seoul Tower picture on a mobile device is processed between the AR smart device and the AR server via cloud offloading, with the output received over a wireless or wired network.

Figure 3-12

Example of an AR Process Using a Cloud Offloading Server

Cloud cooperative computation is also called “cloud offloading” because a part of the required computing process is offloaded to be computed by the main cloud server or the local edge cloud system (www.cs.purdue.edu/homes/bb/cs590/handouts/computer.pdf).

More details on cloud offloading, edge computing, and content delivery network (CDN) support for XR, metaverse, and multimedia streaming services are provided in Chapters 7 and 8.

Virtualization allows cloud server vendors to run arbitrary applications (from different customers) on virtual machines (VMs), in which VM processing is based on infrastructure as a service (IaaS) cloud computing operations (more details are provided in Chapter 8). Cloud server vendors provide computing cycles, which XR devices can use to reduce their computation load. Cloud computing (offloading) can help save energy and enhance the response speed of the XR service. However, unconditional offloading may result in excessive delay, which is why adaptive control is needed. To conduct adaptive control, monitoring of the cloud server load and network congestion status is needed. Adaptive cloud offloading parameters include the network condition, cloud server status, energy status of the XR device, and target Quality of Experience (QoE) level (www.itiis.org/digital-library/manuscript/1093).
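As a purely hypothetical illustration of such adaptive control, the sketch below encodes an offloading decision as a simple heuristic; the parameter names, thresholds, and queueing estimate are assumptions for illustration only and are not a standard API or a published algorithm.

```python
# Hypothetical sketch of an adaptive offloading decision. Parameter names and
# thresholds are illustrative assumptions, not a standard API.
def should_offload(network_rtt_ms, uplink_mbps, server_load, battery_pct,
                   target_latency_ms=50):
    """Return True if offloading feature matching to the cloud seems beneficial."""
    # Rough estimate of the extra delay introduced by offloading:
    # network round trip plus a queueing delay that grows with server load.
    est_offload_delay = network_rtt_ms + 100.0 * server_load   # assumed queueing model
    # Offload when the device battery is low, or when the estimated extra delay
    # still fits within the target QoE latency budget and the uplink is usable.
    if battery_pct < 20:
        return True
    return est_offload_delay < target_latency_ms and uplink_mbps > 5.0

# Example: congested server, healthy battery -> compute locally (prints False)
print(should_offload(network_rtt_ms=30, uplink_mbps=20, server_load=0.8, battery_pct=60))
```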

XR Detection Process

Feature extraction is the process of finding the interest points in an image or video. The system needs to compute descriptors from the interest points and compare the descriptors with data in the database. In the feature extraction process, descriptors are qualified based on their invariance to noise, scale, rotation, and other accuracy-degrading factors. The kinds of descriptors that are used include corner, blob, region, etc. Figure 3-13 shows an example of an AR description generation process.

A photograph of a Coca-Cola can shown at the grayscale, interest point, and descriptor stages of the process.

Figure 3-13

Example of an AR Description Generation Process

Among these methods, blob detection is very popular. Blob detection is the process of detecting blobs in an image. A blob is a region of an image that has constant (or approximately constant) image properties, and all the points in a blob are considered to be similar to each other. These image properties (i.e., brightness, color, etc.) are compared to those of the surrounding regions. Figure 3-14 presents a blob detection example conducted on a field of sunflowers. In mathematical terms, blob detection is based on the image’s second-order derivatives, which form the Hessian matrix H; the determinant of Hessian response is the determinant of H, and the Laplacian, used in the Laplacian of Gaussian (LoG), is the trace of H.

A photograph of a sunflower garden. The blobs are highlighted in circles.

Figure 3-14

Blob Detection Example
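The determinant and trace of the Hessian can be computed directly from second-order image derivatives. The sketch below (assuming OpenCV and NumPy; "sunflowers.jpg" is a placeholder path, and the smoothing scale and percentile are illustrative) shows one single-scale version of this idea.

```python
# Sketch: determinant and trace of the Hessian on a Gaussian-smoothed image,
# the two quantities the text relates to blob detection.
import cv2
import numpy as np

gray = cv2.imread("sunflowers.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float64)
smoothed = cv2.GaussianBlur(gray, (0, 0), sigmaX=2.0)       # Gaussian at one scale

Ixx = cv2.Sobel(smoothed, cv2.CV_64F, 2, 0, ksize=3)        # second-order derivatives
Iyy = cv2.Sobel(smoothed, cv2.CV_64F, 0, 2, ksize=3)
Ixy = cv2.Sobel(smoothed, cv2.CV_64F, 1, 1, ksize=3)

det_hessian = Ixx * Iyy - Ixy ** 2                          # determinant of H (DoH)
laplacian = Ixx + Iyy                                       # trace of H (LoG response)

# Strong DoH responses indicate blob-like regions at this scale.
blob_mask = det_hessian > np.percentile(det_hessian, 99.5)
print("candidate blob pixels:", int(blob_mask.sum()))
```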

Popular feature extraction techniques include the Haar feature, developed by P. Viola et al. in 2001; scale-invariant feature transform (SIFT), developed by D. G. Lowe in 2004; histogram of oriented gradients (HOG), developed by N. Dalal et al. in 2005; speeded-up robust features (SURF), developed by H. Bay et al. in 2006; and oriented FAST and rotated BRIEF (ORB), developed by E. Rublee et al. in 2011.

In the AR system process, feature extraction needs to be conducted accurately but is among the most difficult and computationally burdensome AR processes. AR feature extraction has a significant influence on the accuracy of the AR process, so more technical details are described in the following.

As presented in Figure 3-15, the AR feature extraction process is based on the following six steps: grayscale image generation (GIG), integral image generation (IIG), response map generation (RMG), interest point detection (IPD), orientation assignment (OA), and descriptor extraction (DE).

A process flow of feature extraction. It lists how the greyscale, integral, response map, interest point, orientation assignment, and descriptor images are generated.

Figure 3-15

AR Feature Extraction Process

A description of the AR feature extraction step procedures is provided in the following.
  1. Grayscale Image Generation (GIG): The original image captured by the AR device is converted into a grayscale image in order to make the process robust to color modifications.

  2. Integral Image Generation (IIG): An integral image is built from the grayscale image. This procedure enables fast calculation of summations over image subregions.

  3. Response Map Generation (RMG): In order to detect interest points (IPs) using the determinant of the image’s Hessian matrix H, the RMG process constructs the scale space of the image.

  4. Interest Point Detection (IPD): Based on the generated scale response maps, the maxima and minima (i.e., extrema) are detected and used as the IPs.

  5. Orientation Assignment (OA): Each detected IP is assigned a reproducible orientation to provide rotation invariance (i.e., invariance to image rotation).

  6. Descriptor Extraction (DE): Each IP is uniquely identified such that it is distinguished from other IPs.

XR Feature Extraction and Description

In this section, XR feature extraction and detection algorithms are described. Figures 3-16 through 3-24 present the process details of SIFT, SURF, FAST, BRIEF, ORB, and BRISK, and Tables 3-1 and 3-2 provide a comparison of AR feature detection and description methods (Y.-S. Park, Ph.D. Dissertation, Yonsei Univ., 2018). Table 3-1 presents an AR feature detection and description method comparison based on year, feature detector, spectra, and orientation. Table 3-2 summarizes the distance function, robustness, and pros and cons of the AR feature detection and description methods.
Table 3-1

AR feature detection and description method comparison based on proposed year, feature detector, spectra, and orientation

  • Year: SIFT 1999; SURF 2006; FAST 2006; BRIEF 2010; ORB 2011; BRISK 2011

  • Feature detector: SIFT Difference of Gaussian; SURF Fast Hessian; FAST binary comparison; BRIEF N/A; ORB FAST; BRISK FAST or AGAST

  • Spectra: SIFT local gradient magnitude; SURF integral box filter; FAST N/A; BRIEF local binary; ORB local binary; BRISK local binary

  • Orientation: SIFT yes; SURF yes; FAST N/A; BRIEF no; ORB yes; BRISK yes

  • Feature shape: SIFT square; SURF HAAR rectangles; FAST N/A; BRIEF square; ORB square; BRISK square

  • Feature pattern: SIFT square; SURF dense; FAST N/A; BRIEF random point-pair pixel compares; ORB trained point-pair pixel compares; BRISK trained point-pair pixel compares

Table 3-2

AR feature detection and description method comparison based on distance function, robustness, and pros and cons

  • Distance function: SIFT Euclidean; SURF Euclidean; FAST N/A; BRIEF Hamming; ORB Hamming; BRISK Hamming

  • Robustness: SIFT 6 (brightness, rotation, contrast, affine transform, scale, noise); SURF 4 (scale, rotation, illumination, noise); FAST N/A; BRIEF 2 (brightness, contrast); ORB 3 (brightness, contrast, rotation, limited scale); BRISK 4 (brightness, contrast, rotation, scale)

  • Pros: SIFT accurate; SURF accurate; FAST fast, real-time applicable; BRIEF fast, real-time applicable; ORB fast, real-time applicable; BRISK fast, real-time applicable

  • Cons: SIFT slow, compute-intensive, patented; SURF slow, patented; FAST large number of interest points; BRIEF not scale or rotation invariant; ORB less scale invariant; BRISK less scale invariant

SIFT

Scale-invariant feature transform (SIFT) is one of the first feature detector schemes that was proposed. The SIFT software source code and user guidelines are provided on GitHub (https://github.com/OpenGenus/SIFT-Scale-Invariant-Feature-Transform#:~:text=Scale%20Invariant%2DFeature%20Transform%20(SIFT),-This%20repository%20contains&text=D.,these%20features%20for%20object%20recognition). SIFT uses image transformations in the feature detection and matching process. SIFT is highly accurate; however, it has a high computational complexity, which limits its use in real-time applications. Figure 3-16 shows the SIFT feature extraction process.

A flow chart of how the SIFT features are obtained from the input image via DoG generation, keypoint detection, orientation assignment, and descriptor generation, with down-sampling to obtain all octaves.

Figure 3-16

SIFT Feature Extraction Process

SIFT processing begins with Difference of Gaussian (DoG) generation, which builds a scale space using a DoG-based approximation (https://en.wikipedia.org/wiki/Scale-invariant_feature_transform). The local extrema of the DoG images (at varying scales) are the selected interest points. DoG images are produced by convolving (blurring) the image with Gaussians at each octave of the scale space in the Gaussian pyramid. The Gaussian image (for a set number of octaves) is down-sampled after each octave. DoG is used because it is a Laplacian of Gaussian (LoG) approximation method that has low computational complexity compared to LoG and does not need the partial derivative computations that LoG requires. In addition, DoG obtains the local extrema of the images from the difference of the Gaussians.
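A minimal sketch of one DoG octave, assuming OpenCV and NumPy with a placeholder image path and an illustrative sigma schedule, is shown below.

```python
# Sketch: one octave of a difference-of-Gaussians (DoG) stack.
import cv2
import numpy as np

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

k = 2 ** 0.5                                   # scale multiplier between levels
sigmas = [1.6 * (k ** i) for i in range(5)]    # 5 Gaussian levels -> 4 DoG images
gaussians = [cv2.GaussianBlur(gray, (0, 0), s) for s in sigmas]
dog = [gaussians[i + 1] - gaussians[i] for i in range(len(gaussians) - 1)]

# The next octave would start from a 2x down-sampled image, for example:
next_octave_base = cv2.resize(gaussians[-1], None, fx=0.5, fy=0.5,
                              interpolation=cv2.INTER_NEAREST)
print(len(dog), "DoG images in this octave; next octave base:", next_octave_base.shape)
```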

Classical interest point detection (IPD) methods use LoG because it is scale invariant when applied at multiple image scales, which makes it a popular approach to improve performance. Gaussian scale-space pyramid and kernel techniques are frequently used in IPD. Figure 3-17 shows an example SIFT IPD process applying a Gaussian scale-space pyramid and kernels. SIFT uses an approximation of LoG because the Laplacian is a differential operator of a function on Euclidean space, and the second-order Gaussian scale-space derivatives are very sensitive to noise and require significant computation.

A pyramid model illustrates how an image of a building undergoes scaling and changes its size.

Figure 3-17

SIFT IPD Process Applying Gaussian Scale-Space Pyramid and Kernels

Keypoint detection starts with the keypoint localization process, in which each pixel in the DoG image is compared to its neighboring pixels. The comparison is processed on the current scale and the two scales above and below. The candidate keypoints are pixels that are local maxima or minima (i.e., extrema). The final set of keypoints excludes low-contrast points. Next, orientation assignment is conducted using the keypoint orientation determination process. The orientation of a keypoint is derived from the local image gradient histogram in the neighborhood of the keypoint, and the peaks in the histogram are selected as the dominant orientations.

The SIFT descriptor generation process computes the feature descriptor for each keypoint. The feature descriptor is a 128-element vector formed from a 4×4 grid of orientation histograms with 8 bins each. Figure 3-18 shows a SIFT descriptor generation example.

A diagram of image gradients that leads to a 4 by 4 grid of keypoint descriptors and bars of the 128 features for one keypoint.

Figure 3-18

SIFT Descriptor Generation Example
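For reference, OpenCV exposes SIFT directly (in the main module from version 4.4 onward); the following sketch, with a placeholder image path, extracts keypoints and their 128-dimensional descriptors.

```python
# Sketch: SIFT keypoints and 128-dimensional descriptors via OpenCV.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Each keypoint carries position, scale, and orientation; each descriptor row
# is the 128-element vector built from the 4x4 grid of 8-bin orientation histograms.
print(len(keypoints), "keypoints", descriptors.shape)   # e.g., (N, 128)
```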

SURF

Speeded-up robust features (SURF) uses approximation techniques to obtain faster yet similarly accurate IPD results compared to SIFT. The SURF software source code and user guidelines are provided on GitHub (https://github.com/herbertbay/SURF). SURF uses the determinant of Hessian (DoH) matrix in the IPD process, where box filters are used to approximate the DoH (https://en.wikipedia.org/wiki/Speeded_up_robust_features). The Hessian matrix is a square matrix with second-order partial derivative elements that characterize the level of surface curvature of the image, which is important in keypoint detection. Figure 3-19 presents the SURF feature extraction process.

A flow chart of how the SURF features are obtained from the input image via integral image and response map generation, interest point detection, orientation assignment, and descriptor extraction.

Figure 3-19

SURF Feature Extraction Process

Where SIFT uses DoG, SURF performs its approximation with a box filtering process. The square box filters are used instead of Gaussian averaging of the image because they enable fast approximation and fast box area calculation. An example of the box filters and an integral image used in SURF is shown in Figure 3-20.

An illustration of three heatmaps of how the second-order derivatives in the x, y, and xy directions are approximated by box filters, shown alongside an integral image example.

Figure 3-20

SURF Box Filter Approximation and Integral Image Example

Integral image generation produces integral images that are used for fast convolution computation. Multiple parallel box filtering processes on different scale images are used to approximate the LoG process. Then the IPD process uses Hessian matrix-based blob detection. The feature descriptor scheme processes the interest point’s neighboring pixels by dividing them into subregions. The SURF descriptor describes the pixel intensity distribution based on a scale-independent neighborhood, and each subregion’s wavelet response is used (e.g., regular 4×4 subregions).
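The integral image trick that makes SURF's box filtering fast can be sketched as follows, assuming OpenCV and NumPy and a placeholder image path; the region coordinates are illustrative.

```python
# Sketch: integral image and a constant-time box (rectangular region) sum,
# the mechanism behind SURF's fast box filtering.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
ii = cv2.integral(gray)      # shape (h + 1, w + 1); ii[y, x] = sum of gray[:y, :x]

def box_sum(integral_img, top, left, bottom, right):
    """Sum of pixels in gray[top:bottom, left:right] using only 4 lookups."""
    return (integral_img[bottom, right] - integral_img[top, right]
            - integral_img[bottom, left] + integral_img[top, left])

s = box_sum(ii, 10, 10, 40, 40)
print("box sum:", int(s), "check:", int(gray[10:40, 10:40].sum()))
```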

FAST

Features from accelerated segment test (FAST) has fast matching and low computation characteristics, which enable FAST to detect a greater number of interest points at a given scale compared to scale-space detector methods such as SIFT and SURF. The FAST software source code and user guidelines are provided on GitHub (https://github.com/soh534/fast). FAST uses a template-based corner detector scheme, in which a “corner” is a fixed-scale interest point. Figure 3-21 presents the FAST feature detection process.

A flow chart of how the FAST interest points are obtained from the input image via pixel selection, Bresenham circle computation, and threshold comparison.

Figure 3-21

FAST Feature Detection Process

In the Bresenham circle computations, each corner has a point with two dominant and distinct gradient orientations. The center pixel is compared with surrounding pixels in a circular pattern based on the Bresenham circle, and the circular template is used to detect corners based on intensity comparison. The size of the connected pixel region is commonly 9 (FAST9) or 10 (FAST10) out of a possible 16 pixels. Next, each pixel on the circle is compared against a preset threshold, which determines whether the pixel value is sufficiently greater or less than that of the center pixel. A decision tree is used to quickly determine the comparison order of the pixels. The FAST descriptor output has an adjoining bit vector form (https://en.wikipedia.org/wiki/Features_from_accelerated_segment_test).
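A minimal FAST detection sketch using OpenCV is shown below; the image path and the threshold value are illustrative.

```python
# Sketch: FAST corner detection with OpenCV.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)      # FAST is a detector only: no descriptors

print(len(keypoints), "corners; threshold =", fast.getThreshold())
```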

BRIEF

Binary robust independent elementary features (BRIEF) is a popular binary feature descriptor technology used in robotic applications. The BRIEF software source code and user guidelines are provided on GitHub (https://github.com/lafin/brief). BRIEF uses a randomly generated distribution pattern over a patch (e.g., 31×31 pixels), where an example distribution pattern could consist of 256 pixel point-pairs. The descriptors are generated from the binary comparison of the point-pairs. Figure 3-22 presents the BRIEF feature description process (www.cs.ubc.ca/~lowe/525/papers/calonder_eccv10.pdf).

A flow chart of how the BRIEF descriptor is obtained from the input image via patch generation, patch smoothing, pair selection, intensity comparison, and bit string generation.

Figure 3-22

BRIEF Feature Description Process

A general binary descriptor process applies a sampling pattern to sample points in the region around the keypoint, determines the keypoint orientation so that rotational changes can be compensated, forms sampling pairs, and builds the final descriptor by comparing the sampling pairs. BRIEF itself compares random point-pairs within the local region and does not use an intricate sampling pattern or a mechanism for orientation compensation.

In terms of noise sensitivity, BRIEF descriptors use information from single pixel locations, which makes them sensitive to noise. Gaussian pre-filtering of the patch is used to reduce this noise sensitivity.
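The core BRIEF idea of smoothing a patch and then comparing random point-pairs can be sketched in plain NumPy as follows; the patch size, pair count, sampling pattern, and keypoint location are illustrative assumptions and not the reference implementation.

```python
# Sketch of the BRIEF idea: smooth a patch, then compare random point-pairs
# to form a 256-bit binary descriptor.
import cv2
import numpy as np

rng = np.random.default_rng(0)
PATCH, N_PAIRS = 31, 256
# Random point-pairs inside the 31x31 patch (fixed once, reused for every keypoint).
pairs = rng.integers(0, PATCH, size=(N_PAIRS, 4))    # (y1, x1, y2, x2) per pair

def brief_like_descriptor(gray, cx, cy):
    """Binary descriptor for the patch centered at (cx, cy)."""
    half = PATCH // 2
    patch = gray[cy - half:cy + half + 1, cx - half:cx + half + 1].astype(np.float32)
    patch = cv2.GaussianBlur(patch, (0, 0), 2.0)      # reduce single-pixel noise
    bits = patch[pairs[:, 0], pairs[:, 1]] < patch[pairs[:, 2], pairs[:, 3]]
    return np.packbits(bits)                          # 256 bits -> 32 bytes

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
desc = brief_like_descriptor(gray, cx=100, cy=100)    # assumes the image is big enough
print(desc.shape, desc.dtype)                         # (32,) uint8
```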

ORB

Oriented FAST and rotated BRIEF (ORB) is a hybrid scheme that combines features of FAST and BRIEF for feature extraction and description. The ORB software source code and user guidelines are provided on GitHub (https://github.com/Garima13a/ORB). ORB has a fast computation speed, is highly efficient in memory usage, and has a high matching accuracy, which is why ORB is used instead of SIFT and SURF for feature extraction in many cases. Figure 3-23 presents the ORB feature extraction process (https://en.wikipedia.org/wiki/Oriented_FAST_and_rotated_BRIEF).

A flow chart of how the ORB features are obtained from the input image via a scale pyramid, FAST feature detection, Harris corner measure, non-max suppression, and oriented BRIEF.

Figure 3-23

ORB Feature Extraction Process

ORB uses a multi-scale feature-based image pyramid, where FAST and rBRIEF (a rotated version of BRIEF) are applied to each scale of the image pyramid. FAST feature detection uses a corner detector to detect keypoints, and the Harris corner measure is applied to select the top N keypoints with the strongest responses. The center of gravity (centroid) G of an image patch is computed with moments to improve rotation invariance. In the oriented BRIEF process, the orientation is computed based on the direction of the vector from the keypoint to G. The BRIEF features use this orientation information to become rotation invariant, and rBRIEF is used as the binary descriptor. The search and matching process is based on a correspondence search that uses multi-probe locality sensitive hashing (MP-LSH). When a match fails, neighboring buckets are searched for matches. MP-LSH uses fewer hash tables to save memory and generates more consistent hash bucket sizes compared to conventional LSH.
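The sketch below extracts ORB features from two images (placeholder paths) and matches them with the Hamming distance; a brute-force matcher with a ratio test stands in here for the MP-LSH search described above, and the ratio threshold is illustrative.

```python
# Sketch: ORB feature extraction and Hamming-distance matching between two images.
import cv2

img1 = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]           # Lowe-style ratio test
print(len(good), "good matches")
```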

BRISK

Binary robust invariant scalable keypoints (BRISK) is a low-complexity feature detector scheme that is known to provide better performance than SURF with comparable accuracy. The BRISK software source code and user guidelines are provided on GitHub (https://github.com/kornerc/brisk). Figure 3-24 shows the BRISK feature extraction process.

BRISK conducts candidate IPD using a corner detection algorithm called adaptive and generic corner detection based on the accelerated segment test (AGAST), which is an extension of the FAST algorithm. The BRISK sampling pattern is circular-symmetric, with 60 points forming point-pair line segments arranged in four concentric rings (www.researchgate.net/publication/221110715_BRISK_Binary_Robust_invariant_scalable_keypoints).

BRISK uses scale-space keypoint detection, a scale-space technique that creates octaves and intra-octaves, where each octave is half-sampled from the previous octave. The intra-octave is down-sampled to be placed in between octaves, and non-maximal suppression is conducted on each octave and intra-octave. The sub-pixel maximum is computed across the patch, and the continuous maximum is computed across scales.

BRISK also applies Gaussian smoothing to the patch area around each sampling point; in the sampling pattern illustration, the circles around the sample points represent the standard deviation of the Gaussian filter. In the BRISK pair generation process, point-pairs of pixels are separated into two groups: the long segment pairs are used at coarse resolutions, and the short segment pairs are used at fine resolutions. This separation of short and long segment pairs provides the scale invariance. The gradient is computed on the long segment pairs first to determine the feature orientation, and then on the short segment pairs to find the amount of rotation with reference to that orientation. The BRISK descriptor generation process builds a binary descriptor from the rotated short segment pairs.

A flow chart pointing toward a cluster of circles and a wheel-shaped structure of how the BRISK features are obtained from the input image via scale-space keypoint detection, sample point Gaussian smoothing, pair generation, gradient computation, and descriptor generation.

Figure 3-24

BRISK Feature Extraction Process
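For reference, a minimal BRISK sketch using OpenCV is shown below; the image path, detection threshold, and octave count are illustrative.

```python
# Sketch: BRISK keypoints and binary descriptors via OpenCV.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
brisk = cv2.BRISK_create(thresh=30, octaves=3)
keypoints, descriptors = brisk.detectAndCompute(gray, None)

# Binary descriptors are compared with the Hamming distance, as in Table 3-2.
print(len(keypoints), "keypoints", descriptors.shape)   # e.g., (N, 64) bytes per row
```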

Summary

This chapter provided details of XR devices and feature extraction and detection schemes, which include SIFT, SURF, FAST, BRIEF, ORB, and BRISK. Without feature extraction and description of sensor and image data, XR devices cannot provide any functions or features. Because these processes are very computationally burdensome as well as energy- and time-consuming, process offloading methods were also described. In Chapter 4, details of the artificial intelligence (AI) and deep learning technologies used in metaverses, XR, and multimedia systems are introduced.
