XR HMDs
For an XR experience, HMDs provide the best immersive presence effects, so XR HMDs play a significant role in metaverse and game services. In this section, some representative XR HMDs are presented: the Meta Quest 2, Microsoft HoloLens 2, HTC VIVE Pro 2, HTC VIVE Focus 3, Valve Index, Pimax Vision 8K X, and PlayStation VR 2.
Meta Quest 2
The Facebook Oculus Quest is the predecessor of the Meta Quest 2. In May of 2019, the Oculus Quest VR HMD was released by Facebook. The Quest HMD had 4 GB of random-access memory (RAM) and was equipped with a Qualcomm Snapdragon 835 system on chip (SoC) with four Kryo 280 Gold central processing unit (CPU) cores and an Adreno 540 graphics processing unit (GPU). The Quest’s introductory price was $399 for the 64 GB storage model and $499 for the 128 GB storage model (www.oculus.com/experiences/quest/).
In October of 2020, Oculus Quest 2 was released, and due to Facebook’s parent company name change to Meta, the device was rebranded as Meta Quest 2 in November of 2021. Quest 2 is a VR HMD equipped with a Qualcomm Snapdragon XR2 SoC with 6 GB RAM and an Adreno 650 GPU. Quest 2’s initial price was $299 for the 64 GB storage memory model, which was changed to 128 GB later, and $399 for the 256 GB storage memory model.
The Meta Quest 2 software platform uses an Android operating system (OS). For a first-time Quest 2 device setup, a smartphone running the Oculus app is needed. The Quest 2 software engine is based on Android 10 and is processed by the device’s Qualcomm Snapdragon XR2 SoC and Adreno 650 GPU, which has an approximate speed of 1.2 TFLOPS. The platform allows up to three additional accounts to log onto a single Quest 2 headset, and sharing purchased software from the online Oculus Store is permitted. The platform supports the Oculus Touch controllers and recently added experimental Air Link wireless streaming as well as experimental Passthrough APIs for AR applications (https://en.wikipedia.org/wiki/Oculus_Quest_2).
Microsoft HoloLens 2
Figure 3-3 shows the Microsoft HoloLens 2 device’s front view (left picture) and the top view (right picture), Figure 3-4 shows the Microsoft HoloLens 2 inside view that has augmented objects imposed on the background view, and Figure 3-5 shows a Microsoft HoloLens 2 inside view selection menu using AR (based on HanbitSoft graphics).
The Microsoft Windows MR platform is a part of the Windows 10 OS which provides MR and AR services for compatible HMDs, which include the HoloLens, HoloLens 2, Lenovo Explorer, Acer AH101, Dell Visor, HP WMR headset, Samsung Odyssey, Asus HC102, Samsung Odyssey+, HP Reverb, Acer OJO, and the HP Reverb G2 (https://docs.microsoft.com/en-us/windows/mixed-reality/mrtk-unity/architecture/overview?view=mrtkunity-2021-05).
HTC VIVE Pro 2
HTC VIVE Pro 2 is a tethered (non-standalone) MR HMD developed by HTC VIVE that was released in June of 2021. As an MR device, it is equipped with both AR and VR technologies. The VIVE Pro 2 costs $799 for the headset only and $1,399 for the full kit, which includes the Base Station 2.0 and controllers (www.vive.com/kr/product/vive-pro2-full-kit/overview/). The HMD display has a 2448×2448 per eye resolution that supports a FoV of up to 120 degrees, and the device weighs 855 g. Tracking is conducted with SteamVR 2.0 based on its G-sensors, gyroscope, proximity, and interpupillary distance (IPD) sensors, where the IPD sensors detect the distance between the centers of the pupils. The device uses a proprietary link box cable and USB 3.0 cable and enables connectivity through USB Type-C interfaces and Bluetooth wireless networking. The recommended host system for this 5K-resolution MR HMD is a Windows 10 OS computer with an Intel Core i5-4590 or AMD Ryzen 1500 CPU, an NVIDIA GeForce RTX 2060 or AMD Radeon RX 5700 GPU, and 8 GB RAM or more (www.tomshardware.com/reviews/htc-vive-pro-2-review). The Base Station 2.0 system in the VIVE Pro 2 Full Kit supports room-scale immersive presence VR applications; the system works with the VIVE Pro series or Cosmos Elite headsets, and the controllers assist exact location tracking. As a non-standalone device, the HMD needs to be tethered to a high-performance Windows OS computer, whose minimum requirements are a Windows 10 OS, an Intel Core i5-4590 or AMD Ryzen 1500 CPU, an NVIDIA GeForce GTX 1060 or AMD Radeon RX 480 GPU, and 8 GB RAM.
HTC VIVE Focus 3
HTC VIVE Focus 3 is a standalone VR HMD developed by HTC VIVE that was released in August of 2021. The HMD is equipped with a Qualcomm Snapdragon XR2 (Snapdragon 865) SoC and runs on an Android OS (www.vive.com/kr/product/vive-focus3/overview/). The HMD display has a 2448×2448 per eye resolution that supports a 120 degree FoV and is equipped with 8 GB RAM and 128 GB storage memory. Tracking is conducted with the HTC VIVE inside-out tracking system based on its Hall sensors, capacitive sensors, G-sensors, and gyroscope sensors (www.tomshardware.com/reviews/vive-focus-3-vr-headset).
Valve Index
Valve Index is a VR HMD developed by Valve that was first introduced in 2019 (www.valvesoftware.com/en/index). The HMD requires a high-performance computer equipped with Windows OS or Linux OS that supports SteamVR to be tethered to the device (tethered cable with DisplayPort 1.2 and USB 3.0). The device conducts motion tracking with SteamVR 2.0 sensors. The HMD weighs 809 g and costs $499 for the headset only and $999 for the full kit, which includes the headset, controller, and Base Station 2.0 (www.valvesoftware.com/ko/index). The HMD display has a 1440×1600 per eye resolution that supports an adjustable FoV up to 130 degrees and is equipped with a 960×960 pixel camera (www.tomshardware.com/reviews/valve-index-vr-headset-controllers,6205.html).
Pimax Vision 8K X
The Pimax Vision 8K X is a high-resolution HMD developed by Pimax. The HMD was first introduced at CES 2020, weighs 997.9 g, and costs $1,387 (https://pimax.com/ko/product/vision-8k-x/). The HMD display has a 3840×2160 per eye resolution for the native display and a 2560×1440 per eye resolution for the upscaled display that supports a FoV up to 170 degrees. The device uses SteamVR tracking technology with a nine-axis accelerometer sensor (www.tomshardware.com/reviews/pimax-vision-8k-x-review-ultrawide-gaming-with-incredible-clarity). The Pimax Vision 8K X HMD needs to be tethered to a high-performance Windows OS computer, in which the USB 3.0 connection is used for data communication and the USB 2.0 connection can be used for power supply. The tethered computer needs to be equipped with a Windows 10 Pro/Enterprise 64 bit (version 1809 or higher) OS supported by an Intel Core i5-9400 CPU (or higher) with an NVIDIA GeForce RTX 2080 (or at least RTX 2060 or higher) with at least 16 GB of RAM.
PlayStation VR 2
PlayStation VR 2 is an HMD developed by Sony Interactive Entertainment for VR games, especially for the PlayStation 5 VR Games (www.playstation.com/ko-kr/ps-vr2/). The device was first introduced at CES 2022. PlayStation VR 2 needs to be tethered (via USB Type-C wire connection) to a PlayStation 5 player. The HMD weighs less than 600 g. The PlayStation VR 2 has a 2000×2040 per eye panel resolution display with a FoV of approximately 110 degrees. Four cameras are used for tracking of the headset and controller, and IR cameras are used for eye tracking. The HMD can provide vibration haptic effects and is equipped with a built-in microphone and stereo headphones.
XR Operation Process
XR is a combination of MR, AR, VR, and haptic technologies. XR uses AR to superimpose virtual text or images on selected objects in the user’s real-world view, where the superimposed virtual content is related to the selected object. AR technology is a special type of visual context-aware computing. For an AR user, the real-world and virtual information and objects need to coexist in the same view in harmony without disturbing the user’s view, comfort, or safety. However, AR technology is very difficult to implement, as the computer-generated virtual text or image needs to be superimposed on the XR user’s real-world view within a few milliseconds. The AR process is complicated and its operation is time-consuming, so accurately completing it within this time limit is very difficult. In addition, whenever the XR user changes head or eye direction, the focus of sight changes, and the entire process has to be continuously updated and refreshed, again within a few milliseconds. This is why AR development is so much more difficult and slower than VR system development. The details of AR processing are described in the following section.
The AR operation process includes image acquisition, feature extraction, feature matching, geometric verification, associated information retrieval, and imposing the associated information/object on the XR device’s display. These processes can be conducted without artificial intelligence (AI) technology (i.e., “non-AI” methods) or using AI technology. There are a variety of non-AI based AR feature detection and description algorithms, in which SIFT, SURF, FAST, BRIEF, ORB, and BRISK schemes are described in this chapter. AI based AR technologies commonly use deep learning systems with convolutional neural network (CNN) image processing engines, which are described in Chapter 4.
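The stages named above can be sketched as a toy pipeline. This is a minimal illustration only, not a production AR engine: all function names, the toy "features" (bright-pixel coordinates), and the two-entry database are hypothetical stand-ins for the real detectors, descriptors, and retrieval systems described later in the chapter.

```python
# Illustrative sketch of the six AR pipeline stages from the text,
# modeled as placeholder functions over a tiny grayscale image.

def acquire_image():
    # Stand-in for a camera frame: a tiny grayscale image as a 2D list.
    return [[0, 0, 9, 0],
            [0, 9, 9, 0],
            [0, 0, 9, 0],
            [0, 0, 0, 0]]

def extract_features(img):
    # Toy "feature": coordinates of bright pixels (real systems use
    # SIFT/SURF/ORB-style detectors and descriptors).
    return {(y, x) for y, row in enumerate(img)
            for x, v in enumerate(row) if v > 5}

def match_features(feats, database):
    # Compare the query features against each stored object's features.
    return max(database, key=lambda name: len(feats & database[name]))

def geometric_verification(feats, candidate):
    # Placeholder: a real system fits a homography with RANSAC here.
    return candidate is not None

def retrieve_info(candidate):
    # Associated information retrieval for the verified object.
    return {"label": candidate}

def render_overlay(img, info):
    # Superimpose the retrieved label on the display frame.
    return {"frame": img, "overlay": info["label"]}

database = {"arrow": {(0, 2), (1, 1), (1, 2), (2, 2)},
            "blank": set()}

frame = acquire_image()
feats = extract_features(frame)
match = match_features(feats, database)
assert geometric_verification(feats, match)
result = render_overlay(frame, retrieve_info(match))
print(result["overlay"])  # the recognized object's label: "arrow"
```

Each function corresponds to one stage of the process; in a real system, every stage must complete within the few-millisecond budget discussed above.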
AR user interface (UI) types include TV screens, computer monitors, helmets, facemasks, glasses, goggles, HMDs, windows, and windshields.
Among handheld AR displays, smartphones were the easiest platform to begin commercial services. Powerful computing capability, good camera and display, and portability make smartphones a great platform for AR. Figure 3-6 shows a smartphone-based AR display example, and Figure 3-7 presents an AR research and development environment in my lab at Yonsei University (Seoul, South Korea) connected to a smartphone.
AR eyeglass products include the Google Glass, Vuzix M100, Optinvent, Meta Space Glasses, Telepathy, Recon Jet, and Glass Up. However, the large amount of image and data processing that AR devices must conduct requires a significant amount of electronic hardware to execute long software programs. As a result, AR eyeglasses are a significant challenge, as there is very little space to implement the AR system within the small eyeglass frame. In the near future, XR smart eyeglasses will become significantly popular products, but much more technical advancement is required to reach that level. Hopefully, this book will help you and your company reach this goal faster.
AR Technology
Figure 3-9 shows the AR workflow, which starts with detection of the environment as well as the objects and people within the image. Based on the application’s objective, the AR system will select objects and people within the image and continuously track them. Even though the objects and people may not move, the XR device user’s view changes often (through head and eye movement), so the objects and people within the user’s view will change, which is why tracking is always needed. Based on user interaction and the application’s logic, the augmented information/image needs to be generated and incorporated into the AR user’s display image through the image rendering process. Visual feedback of the rendered image (which includes the augmented information/image) is displayed on the user’s screen (www.researchgate.net/publication/224266199_Online_user_survey_on_current_mobile_augmented_reality_applications).
AR Feature Detection and Description Technology
AR feature detection requires the characteristic primitives of an image to be accurately identified. The identified objects need to be highlighted using unique visual cues so the user can notice them. Various feature detection techniques exist, where the common objective is to conduct efficient and effective extraction of stable visual features. AR feature detection influencing factors include the environment, changes in viewpoint, image scale, resolution, lighting, etc. AR feature detector requirements include robustness against changing imaging conditions and satisfaction of the application’s Quality of Service (QoS) requirements (www.researchgate.net/publication/220634788_Mobile_Visual_Search_Architectures_Technologies_and_the_Emerging_MPEG_Standard).
Feature Detection and Feature Description
Feature detection is a process where features in an image with unique interest points are detected. The feature descriptors characterize an image’s features using a sampling region. Figure 3-10 shows a feature detection example in which a circular image patch is applied around detected interest points.
There are AR algorithms that are capable of feature detection or feature description, and some can do both. For example, binary robust independent elementary features (BRIEF) is a feature descriptor scheme, and features from accelerated segment test (FAST) is a feature detector scheme. Schemes that do both feature detection and description include scale-invariant feature transform (SIFT), speeded-up robust features (SURF), oriented FAST and rotated BRIEF (ORB), and binary robust invariant scalable keypoints (BRISK).
There are two methods to conduct feature detection without using AI technology. Method-1 uses spectra descriptors that are generated by considering the local image region gradients. A “gradient” is a multivariable (vector) generalization of the derivative, where derivatives are applied on a single variable (scalar) in a function. Method-1 based algorithms include SIFT and SURF. Method-2 uses local binary features that are identified using simple point-pair pixel intensity comparisons. Method-2 based algorithms include BRIEF, ORB, and BRISK.
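The two method families can be contrasted on a tiny pixel patch. The patch values, the sample locations, and the point-pair list below are invented purely for illustration; real detectors operate on much larger smoothed regions.

```python
# Hypothetical illustration of the two non-AI styles: Method-1 computes
# local gradients (as in SIFT/SURF), Method-2 builds a binary code from
# point-pair intensity comparisons (as in BRIEF/ORB/BRISK).

patch = [[10, 10, 50, 90],
         [10, 20, 60, 90],
         [10, 30, 70, 90],
         [10, 40, 80, 90]]

# Method-1 style: gradient at an interior pixel via central differences.
def gradient(img, y, x):
    gx = (img[y][x + 1] - img[y][x - 1]) / 2.0  # horizontal derivative
    gy = (img[y + 1][x] - img[y - 1][x]) / 2.0  # vertical derivative
    return gx, gy

# Method-2 style: one bit per point-pair intensity comparison.
def binary_descriptor(img, pairs):
    bits = 0
    for (y1, x1), (y2, x2) in pairs:
        bits = (bits << 1) | (1 if img[y1][x1] < img[y2][x2] else 0)
    return bits

pairs = [((0, 0), (0, 3)), ((1, 1), (2, 2)), ((3, 3), (3, 0))]
gx, gy = gradient(patch, 1, 2)
desc = binary_descriptor(patch, pairs)
print(gx, gy, bin(desc))  # 35.0 10.0 0b110
```

The gradient captures magnitude and direction of intensity change (a "spectra" measurement), while the binary code only records which of two pixels is brighter, which is far cheaper to compute and compare.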
AR schemes that mix and match feature detection and feature description algorithms are commonly applied. Feature detection results in a specific set of pixels at specific scales identified as interest points. Different feature description schemes can be applied to different interest points to generate unique feature descriptions. Scheme selection based on the application and the characteristics of the interest points is possible.
AR System Process
AR Cloud Cooperative Computation
As presented in the details above, the AR process is complicated and requires a large database and a lot of image and data processing. This is the same for both the AI based AR processes (using convolutional neural network (CNN) deep learning described in Chapter 4) and the non-AI based AR processes (e.g., SIFT, SURF, FAST, BRIEF, ORB, BRISK).
Cloud cooperative computation is also called “cloud offloading” because a part of the required computing process is offloaded to be computed by the main cloud server or the local edge cloud system (www.cs.purdue.edu/homes/bb/cs590/handouts/computer.pdf).
More details on cloud offloading, edge computing, and content delivery network (CDN) support for XR, metaverse, and multimedia streaming services are provided in Chapters 7 and 8.
Virtualization allows cloud server vendors to run arbitrary applications (from different customers) on virtual machines (VMs), in which VM processing is based on the infrastructure as a service (IaaS) cloud computing operations (where more details are provided in Chapter 8). Cloud server vendors provide computing cycles, in which XR devices can use these computing cycles to reduce their computation load. Cloud computing (offloading) can help save energy and enhance the response speed of the XR service. However, unconditional offloading may result in excessive delay, which is why adaptive control is needed. To conduct adaptive control, the cloud server load and network congestion status monitoring is needed. Adaptive cloud offloading parameters include network condition, cloud server status, energy status of the XR device, and target Quality of Experience (QoE) level (www.itiis.org/digital-library/manuscript/1093).
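A minimal sketch of such an adaptive offloading decision, using the four parameters named above, might look as follows. The function name, thresholds, and scoring rules are all invented for illustration; real systems would calibrate these against measured network and server conditions.

```python
# A minimal sketch of an adaptive cloud-offloading decision, assuming
# the four parameters from the text (network condition, server status,
# device energy, target QoE). All thresholds are illustrative only.

def should_offload(network_rtt_ms, server_load, battery_pct, target_qoe):
    """Return True if offloading to the cloud/edge is likely beneficial."""
    # Offloading saves device energy, so a low battery favors it ...
    energy_pressure = battery_pct < 30
    # ... but a congested network or a loaded server adds latency that
    # can break the few-millisecond AR budget.
    link_ok = network_rtt_ms <= 20 and server_load <= 0.8
    if not link_ok:
        return False  # unconditional offloading risks excessive delay
    if energy_pressure:
        return True
    # Otherwise offload only when a high QoE target exceeds what local
    # computation alone can deliver.
    return target_qoe > 0.9

assert should_offload(10, 0.5, 20, 0.8) is True   # low battery, good link
assert should_offload(50, 0.5, 20, 0.8) is False  # congested network
```

The key design point, as the text notes, is that the decision is conditional: the same task may be computed locally or remotely depending on continuously monitored network and server state.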
XR Detection Process
Popular feature extraction techniques include the Haar feature that was developed by P. Viola et al. in 2001, scale-invariant feature transform (SIFT) was developed by D. G. Lowe in 2004, histogram of oriented gradient (HOG) was developed by N. Dalal et al. in 2005, speeded-up robust features (SURF) was developed by H. Bay et al. in 2006, and oriented FAST and rotated BRIEF (ORB) were developed by E. Rublee et al. in 2011.
In the AR system process, feature extraction needs to be conducted accurately but is among the most difficult and computationally burdensome AR processes. Because AR feature extraction has a significant influence on the accuracy of the overall AR process, more technical details are described in the following.
1. Grayscale Image Generation (GIG): The original image captured by the AR device is converted into a grayscale image in order to make the process robust to color modifications.
2. Integral Image Generation (IIG): An integral image is built from the grayscale image. This procedure enables fast calculation of summations over image subregions.
3. Response Map Generation (RMG): In order to detect interest points (IPs) using the determinant of the image’s Hessian matrix H, the RMG process constructs the scale space of the image.
4. Interest Point Detection (IPD): Based on the generated scale response maps, the maxima and minima (i.e., extrema) are detected and used as the IPs.
5. Orientation Assignment (OA): Each detected IP is assigned a reproducible orientation to provide rotation invariance (i.e., invariance to image rotation).
6. Descriptor Extraction (DE): Each IP is uniquely identified, such that it is distinguished from other IPs.
XR Feature Extraction and Description
AR feature detection and description method comparison based on proposed year, feature detector, spectra, and orientation
| Categories | SIFT | SURF | FAST | BRIEF | ORB | BRISK |
| --- | --- | --- | --- | --- | --- | --- |
| Year | 1999 | 2006 | 2006 | 2010 | 2011 | 2011 |
| Feature detector | Difference of Gaussian | Fast Hessian | Binary comparison | N/A | FAST | FAST or AGAST |
| Spectra | Local gradient magnitude | Integral box filter | N/A | Local binary | Local binary | Local binary |
| Orientation | Yes | Yes | N/A | No | Yes | Yes |
| Feature shape | Square | HAAR rectangles | N/A | Square | Square | Square |
| Feature pattern | Square | Dense | N/A | Random point-pair pixel compares | Trained point-pair pixel compares | Trained point-pair pixel compares |
AR feature detection and description method comparison based on distance function, robustness, and pros and cons
| Categories | SIFT | SURF | FAST | BRIEF | ORB | BRISK |
| --- | --- | --- | --- | --- | --- | --- |
| Distance function | Euclidean | Euclidean | N/A | Hamming | Hamming | Hamming |
| Robustness | Brightness, rotation, contrast, affine transform, scale, noise | Scale, rotation, illumination, noise | N/A | Brightness, contrast | Brightness, contrast, rotation, limited scale | Brightness, contrast, rotation, scale |
| Pros | Accurate | Accurate | Fast, real-time applicable | Fast, real-time applicable | Fast, real-time applicable | Fast, real-time applicable |
| Cons | Slow, compute-intensive, patented | Slow, patented | Large number of interest points | Not scale or rotation invariant | Less scale invariant | Less scale invariant |
SIFT
SIFT processing begins with Difference of Gaussian (DoG) generation, which builds a scale space using a DoG-based approximation (https://en.wikipedia.org/wiki/Scale-invariant_feature_transform). The local extrema of the DoG images (at varying scales) are selected as interest points. DoG images are produced by convolving (blurring) the image with Gaussians at each octave of the scale space in the Gaussian pyramid, and the Gaussian-blurred image is down-sampled after each octave (based on a set number of octaves). DoG is used because it approximates the Laplacian of Gaussian (LoG) with lower computational complexity and does not need the partial derivative computations that LoG requires. In addition, DoG obtains the local extrema of images from the difference of the Gaussians.
Keypoint detection starts with the keypoint localization process, in which each pixel in the DoG image is compared to its neighboring pixels. The comparison is processed on the current scale and the two scales above and below. The candidate keypoints are pixels that are local maxima or minima (i.e., extrema), and the final set of keypoints excludes low-contrast points. Next, orientation assignment is conducted using the keypoint orientation determination process: the orientation of a keypoint is derived from the local image gradient histogram in the neighborhood of the keypoint, and the peaks in the histogram are selected as the dominant orientations.
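The DoG idea can be illustrated compactly in one dimension: blur a toy signal with two Gaussians of different widths, subtract, and take local extrema of the response as candidate keypoints. This is a deliberately simplified sketch (SIFT works in 2-D across a full scale pyramid); the signal and sigma values are chosen for illustration only.

```python
# A 1-D sketch of the DoG response: narrow blur minus wider blur,
# with local extrema of the response taken as candidate keypoints.
import math

def gaussian_kernel(sigma, radius):
    k = [math.exp(-(i * i) / (2 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]  # normalized so the weights sum to 1

def convolve(signal, kernel):
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)  # edge clamp
            acc += w * signal[idx]
        out.append(acc)
    return out

signal = [0, 0, 0, 10, 0, 0, 0, 0]  # an isolated bright "blob"
dog = [a - b for a, b in zip(convolve(signal, gaussian_kernel(1.0, 3)),
                             convolve(signal, gaussian_kernel(1.6, 3)))]

# Local maxima of the DoG response mark the candidate keypoints.
extrema = [i for i in range(1, len(dog) - 1)
           if dog[i] > dog[i - 1] and dog[i] > dog[i + 1]]
print(extrema)  # the blob center (index 3) appears here
```

The blob center produces a strong positive DoG response because the narrow Gaussian preserves more of the peak than the wide one, which is exactly the blob-selective behavior that makes DoG a cheap LoG approximation.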
SURF
SURF first conducts integral image generation, where the integral images are used for fast convolution computation. Multiple parallel box filtering operations on different scale images are used to approximate the LoG process. Then the IPD process uses Hessian matrix based blob detection. The feature descriptor scheme processes the interest point’s neighboring pixels by dividing them into subregions (e.g., a regular 4×4 grid of subregions), where each subregion’s wavelet responses are used. The SURF descriptor thereby describes the pixel intensity distribution based on a scale-independent neighborhood.
FAST
FAST detects corners by comparing the center pixel with surrounding pixels in a circular pattern based on the Bresenham circle, where a corner is a point with two dominant and distinct gradient orientations. The circular template detects corners based on intensity comparisons: each circle pixel is compared against a preset threshold to determine whether it is sufficiently brighter or darker than the center pixel. The size of the connected pixel region is commonly 9 (FAST9) or 10 (FAST10) out of a possible 16 pixels. A decision tree is used to quickly determine the comparison order of the pixels. The FAST detector output has an adjoining bit vector form (https://en.wikipedia.org/wiki/Features_from_accelerated_segment_test).
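The segment test described above can be sketched as follows. This is a simplified illustration (real FAST adds the decision-tree ordering and non-maximal suppression); the circle offsets are the standard 16-pixel Bresenham circle of radius 3, while the test image, threshold, and function names are chosen for illustration.

```python
# A simplified FAST-style segment test: a pixel is a corner candidate
# if n contiguous circle pixels are all brighter (or all darker) than
# the center by more than a threshold.

# Offsets of the 16-pixel Bresenham circle (radius 3), clockwise.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2),
          (3, 1), (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3),
          (-2, -2), (-3, -1)]

def is_corner(img, y, x, threshold=20, n=9):
    center = img[y][x]
    # Mark each circle pixel: +1 brighter, -1 darker, 0 similar.
    marks = [1 if img[y + dy][x + dx] > center + threshold
             else -1 if img[y + dy][x + dx] < center - threshold
             else 0 for dy, dx in CIRCLE]
    # Look for n contiguous equal marks, wrapping around the circle.
    for sign in (1, -1):
        run = 0
        for m in marks * 2:  # doubled list handles the wrap-around
            run = run + 1 if m == sign else 0
            if run >= n:
                return True
    return False

# A bright square on a dark background (13x13 image, square at y,x >= 6).
img = [[200 if (y >= 6 and x >= 6) else 50 for x in range(13)]
       for y in range(13)]
print(is_corner(img, 6, 6))  # True: the square's corner
print(is_corner(img, 9, 9))  # False: interior of the square
```

With n = 9 this corresponds to the FAST9 variant named in the text; changing n to 10 gives FAST10.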
BRIEF
The BRIEF binary descriptor process applies a sampling pattern to sample points in the local region around a keypoint, and the final descriptor is a bit string formed by comparing the intensities of the sampling pairs. BRIEF compares random point-pairs within the local region and, unlike later binary descriptors, does not have an intricate sampling pattern or a mechanism for orientation compensation (i.e., compensation for rotational changes).
In terms of noise sensitivity, the BRIEF descriptors use information based on single pixel locations, which makes it sensitive to noise. Gaussian filters are used to relieve the noise sensitivity.
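A toy BRIEF-style descriptor with Hamming-distance matching can be sketched as follows. The patch contents and the fixed random seed are illustrative assumptions; a real BRIEF implementation pre-generates its point-pairs once and applies Gaussian smoothing to the patch first, as noted above, to reduce noise sensitivity.

```python
# A toy BRIEF-style binary descriptor with Hamming-distance matching.
import random

def brief_descriptor(patch, pairs):
    # One bit per pair: is the first sample point darker than the second?
    return [1 if patch[y1][x1] < patch[y2][x2] else 0
            for (y1, x1), (y2, x2) in pairs]

def hamming(d1, d2):
    # Number of differing bits; binary descriptors match by this distance.
    return sum(b1 != b2 for b1, b2 in zip(d1, d2))

rng = random.Random(0)  # fixed seed: a reproducible sampling pattern
pairs = [((rng.randrange(8), rng.randrange(8)),
          (rng.randrange(8), rng.randrange(8))) for _ in range(16)]

patch_a = [[(y * 8 + x) % 256 for x in range(8)] for y in range(8)]
patch_b = [[v + 40 for v in row] for row in patch_a]  # brightness shift

da = brief_descriptor(patch_a, pairs)
db = brief_descriptor(patch_b, pairs)
print(hamming(da, db))  # 0: point-pair comparisons ignore a uniform shift
```

The zero distance under a uniform brightness shift illustrates the brightness robustness listed for BRIEF in the comparison table, while the cheap bitwise Hamming distance is what makes binary descriptors real-time applicable.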
ORB
ORB uses a multi-scale feature-based image pyramid, where FAST and rBRIEF (which is a rotated version of BRIEF) are applied to each scale of the image pyramid. FAST feature detection uses a corner detector to detect keypoints. The Harris corner measure is applied on the keypoints to select the top N points with the strongest FAST responses. The center of gravity (centroid) G of an image patch is computed with moments to improve rotation invariance. In the oriented BRIEF process, the orientation is computed based on the direction of the vector from the keypoint to G. BRIEF features use the orientation information to be rotation-invariant, where rBRIEF is used as the binary descriptor. The search and matching process is based on a correspondence search that uses multi-probe locality sensitive hashing (MP-LSH). When a match failure occurs, neighboring buckets are searched for matches. MP-LSH uses fewer hash tables to save memory and generates more consistent hash bucket sizes compared to conventional LSH.
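The intensity-centroid orientation used by ORB can be sketched on a small patch. This is an illustrative version, assuming moments computed relative to the patch center; the patch values are invented, and a real implementation sums over a circular region around the keypoint.

```python
# A sketch of ORB's intensity-centroid orientation: moments m10 and m01
# locate the centroid, and the keypoint orientation is the angle of the
# vector from the patch center to that centroid.
import math

def orientation(patch):
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0  # patch center
    m01 = sum((y - cy) * v for y, row in enumerate(patch) for v in row)
    m10 = sum((x - cx) * v for row in patch for x, v in enumerate(row))
    return math.atan2(m01, m10)  # theta = atan2(m01, m10)

# Intensity mass concentrated on the right: centroid points along +x.
patch = [[0, 0, 0, 9],
         [0, 0, 0, 9],
         [0, 0, 0, 9]]
print(orientation(patch))  # 0.0 (orientation along the +x axis)
```

Rotating the BRIEF sampling pattern by this angle is what turns BRIEF into the rotation-aware rBRIEF descriptor described above.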
BRISK
Binary robust invariant scalable keypoints (BRISK) is a low computational complexity feature detection and description scheme that is known to be faster than SURF while providing comparable accuracy. The BRISK software source code and user guidelines are provided on GitHub (https://github.com/kornerc/brisk). Figure 3-24 shows the BRISK feature extraction process.
BRISK conducts candidate IPD using a corner detection algorithm called adaptive and generic corner detection based on the accelerated segment test (AGAST), which is an extension of the FAST algorithm. The BRISK descriptor uses a circular-symmetric sampling pattern of 60 points arranged in 4 concentric rings, from which point-pair line segments are formed (www.researchgate.net/publication/221110715_BRISK_Binary_Robust_invariant_scalable_keypoints).
BRISK uses scale-space keypoint detection, which is a scale-space technology that creates octaves and intra-octaves, where each octave is half-sampled from the previous octave. The intra-octave is down-sampled to be placed in between octaves, and non-maximal suppression is conducted on each octave and intra-octave. Sub-pixel maximum is computed across the patch, and continuous maximum is computed across scales.
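The octave construction described above can be sketched with simple 2× half-sampling. This is a toy illustration of the pyramid shape only: the base image is synthetic, and the intra-octaves (which sit between octaves at a 1.5× reduction) and the non-maximal suppression steps are not shown.

```python
# Toy sketch of BRISK-style octave construction: each octave is
# half-sampled from the previous one (intra-octaves not shown).

def half_sample(img):
    # Keep every other row and column: a 2x spatial reduction.
    return [[img[y][x] for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

base = [[x + y for x in range(8)] for y in range(8)]  # synthetic 8x8 image
octaves = [base]
for _ in range(2):
    octaves.append(half_sample(octaves[-1]))

print([len(o) for o in octaves])  # [8, 4, 2]: halved at each octave
```

Detecting extrema across this stack of progressively smaller images is what gives BRISK its scale invariance, with the sub-pixel and cross-scale refinement described above locating each keypoint precisely.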
Summary
This chapter provided details of XR devices and feature extraction and description schemes, which include SIFT, SURF, FAST, BRIEF, ORB, and BRISK. Without feature extraction and description of sensor and image data, XR devices cannot provide any functions or features. Because these processes are very computationally burdensome as well as energy- and time-consuming, process offloading methods were also described. In Chapter 4, which follows, details of artificial intelligence (AI) and deep learning technologies used in metaverses, XR, and multimedia systems are introduced.