© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
J.-M. Chung, Emerging Metaverse XR and Video Multimedia Technologies, https://doi.org/10.1007/978-1-4842-8928-0_5

5. XR and Multimedia Video Technologies

Jong-Moon Chung
Seoul, Korea (Republic of)
 
This chapter focuses on the core video encoding and decoding technologies used in XR services like Meta Quest 2 and Microsoft HoloLens 2 and the popular over-the-top (OTT) video services like Netflix, Disney+, and YouTube. The sections cover the history of Skype (as it is one of the earliest video conferencing and Voice over IP (VoIP) applications in the world) and YouTube and the evolution of their video and audio codec technologies. In addition, details of the H.264 Advanced Video Coding (AVC), H.265 High Efficiency Video Coding (HEVC), H.266 Versatile Video Coding (VVC) standards, and futuristic holography technology are introduced.
  • XR Device Video Codecs (Meta Quest 2, Pico G2, and DPVR P1)

  • Multimedia Codecs (Skype, YouTube, Netflix, and Disney+)

  • H.264 AVC Video Technology

  • H.265 HEVC Video Technology

  • H.266 VVC Video Technology

  • Holography Technology

XR Device Video Codecs

Among XR headsets and HMDs, the Meta Quest 2, Pico, and DPVR have the largest market share. These headsets all use the H.264 and H.265 video codecs in their core display technologies. As described in this chapter, H.264 and H.265 define many video profiles, and the XR device conducts auto detection to automatically select the highest-quality multimedia format based on the XR device’s hardware and software capabilities as well as the network conditions. The XR device’s capabilities depend on the CPU, GPU, SoC, battery status, memory, display resolution, and platform software type and condition. Network status factors like delay and throughput are also considered.

The Meta Quest 2’s maximum VR video resolutions all use the H.265 HEVC video codec: the monoscopic maximum (width×height) pixel resolution of 8192×4096 at 60 fps (fps = frames/s = frames per second), the stereoscopic maximum pixel resolution of 5760×5760 at 60 fps, and the 180 Side by Side maximum pixel resolution of 5760×5760 at 60 fps.

The Pico G2 4K maximum video resolutions use the H.264 AVC or the H.265 HEVC video codecs. The Pico G2’s 4K monoscopic highest pixel resolution 5760×2880 at 30 fps uses the H.264 codec, the stereoscopic highest pixel resolution 4096×4096 at 30 fps uses the H.264 or H.265 codec, and the 180 Side by Side highest pixel resolution 5760×2880 at 30 fps uses the H.264 codec (https://headjack.io/knowledge-base/best-video-resolution-for-pico-headsets/).

In addition, the DPVR P1 4K VR Glasses VR headsets support the 3GP and MP4 video formats with the H.264 and H.265 video codecs (www.estoreschina.com/dpvr-p1-4k-vr-glasses-virtual-reality-headset-milk-white-p-11477.html).

Multimedia Codecs

Skype, YouTube, Netflix, and Disney+ systems conduct auto detection and automatically select the highest-quality multimedia format based on the receiving OTT device’s capabilities and network conditions. OTT device capabilities include video codec and audio codec, processing capability, memory, and especially the display resolution and platform software. Network status factors like delay and throughput are considered.

Skype was founded in 2003, YouTube was founded in 2005, Netflix started multimedia streaming in 2007, and the Disney Streaming name was announced in 2021, so the multimedia codecs used by these companies are introduced in the following in this chronological order. By observing the video and audio codec changes made by these companies, we can easily see which options were best in each time period.

Skype Multimedia Codecs

Skype supports video chat and voice call services. The name “Skype” was derived from “sky” and “peer-to-peer.” Skype communication uses a microphone, speaker, video webcam, and Internet services. Skype was founded by Niklas Zennström (Sweden) and Janus Friis (Denmark) based on the software created by Ahti Heinla, Priit Kasesalu, and Jaan Tallinn (Estonia) in 2003. On August 29, 2003, the first public beta version of Skype was released. In June 2005, Skype made an agreement with the Polish web portal Onet.pl which enabled integration into the Polish market. On September 12, 2005, eBay, Inc. agreed to acquire Skype Technologies SA for approximately $2.5 billion, and on September 1, 2009, based on Skype’s value of $2.75 billion, eBay announced it was selling 65% of Skype (to Silver Lake, Andreessen Horowitz, and the Canada Pension Plan Investment Board) for $1.9 billion. By 2010, Skype had 663 million registered users. On May 10, 2011, Microsoft acquired Skype for $8.5 billion and began to integrate Skype services into Microsoft products. On February 27, 2013, Microsoft introduced Office 2013, which included 60 Skype world minutes per month in the Office 365 consumer plans for home and personal use as well as universities. From April 8 to 30, 2013, the Windows Live Messenger instant messaging service was phased out in favor of promoting Skype (but Messenger use continued in mainland China). On November 11, 2014, Microsoft announced that Lync would be replaced by Skype for Business in 2015, where later versions included combined features of Lync and the consumer Skype software. Organizations were given the option to switch their users from the default Skype for Business interface to the Skype for Business (Lync) interface (https://en.wikipedia.org/wiki/Skype).

Skype is based on a hybrid peer-to-peer model and uses a client and server system architecture. Supported applications include VoIP, video conferencing, instant messaging, and file transfer. Skype video conference calls can be made from various platforms using smartphones, tablets, and PCs. Supported desktop client operating systems (OSs) include Windows (and HoloLens) as well as other desktop OSs such as OS X (10.6 or newer) and Linux (Ubuntu and others).

Other mobile device OSs that support Skype include iOS (for Apple’s iPhone and iPad), Android (for various Android smart devices), BlackBerry 10, Fire OS, Nokia X, and many more (including support for selected Symbian and BlackBerry OS devices). Skype calls to other Skype users within the Skype services are free. Skype could also be used to make traditional phone calls to landline telephones and mobile phones using debit-based user account charges based on Skype Credit or by subscription. Similar technologies that support VoIP and video conferencing include Session Initiation Protocol (SIP) and H.323 based services which are used by companies like Linphone and Google Hangouts.

Before Skype 5.5, Skype used VP7 as its video codec, adopting the TrueMotion VP7 codec after 2005. In early 2011, Skype 5.5 moved to VP8. The TrueMotion VP7 and VP8 video codecs were developed by On2 Technologies, which was later acquired by Google. Skype 5.7 used VP8 for both group and one-on-one standard-definition video chatting. However, after Microsoft acquired Skype in 2011, the video codec was replaced with H.264.

The Skype video call quality is based on the mode being used. The standard mode has a video resolution of 320×240 pixels with 15 fps; high-quality mode has a resolution of 640×480 pixels with 30 fps. High definition (HD) mode has the highest resolution of 1280×720 pixels with 30 fps. Other H.264 codecs are also used in group and peer-to-peer (one-on-one) video chatting.

Skype initially used G.729 and SVOPC as its VoIP audio codecs. Skype then added SILK to Skype 4.0 for Windows and other Skype clients. SILK is a lightweight and embeddable audio codec created by Skype, and more details are provided in the following. Skype also uses Opus, an open source codec that integrates the SILK codec for voice transmission with the Constrained Energy Lapped Transform (CELT) codec for higher-quality audio transmissions (e.g., live music performances).

SILK is an audio compression format and audio codec that was developed by Skype Limited to replace the SVOPC codec. SILK was initially released in March 2009, and its latest software development kit (SDK) version 1.0.9 was released in 2012. SILK supports audio sampling frequencies of 8, 12, 16, and 24 kHz (i.e., up to 24,000 samples/s), resulting in a bitrate range of 6~40 kbps (bps = bits/s = bits per second). SILK has a low algorithmic delay of 25 ms, which is based on a 20 ms frame duration plus a 5 ms look-ahead duration. The reference implementation of SILK is written in the C language, and the codec is based on linear predictive coding (LPC).

Skype also uses Opus as its audio codec. Opus was developed by Xiph and is standardized as RFC 6716 by the Internet Engineering Task Force (IETF). It was designed as a lossy audio coding format to efficiently encode speech and general audio into a single format. Opus was designed to combine SILK with CELT, in which switching between or combining SILK and CELT can be conducted as needed for maximum efficiency. Opus maintains latency low enough for real-time interactive communications and was designed to have low complexity so it can run on low-end ARM processors.

CELT is an open, royalty-free, lossy audio compression format used by Skype. CELT has a very low algorithmic delay, which is very beneficial in supporting low-latency audio communications. CELT was designed using low-latency Modified Discrete Cosine Transform (MDCT) technology.

YouTube Multimedia Codecs

YouTube is a global video-sharing website that uses H.264/MPEG-4 AVC, WebM, and Adobe Flash Video technology. Three former PayPal employees (Chad Hurley, Steve Chen, and Jawed Karim) created YouTube in February 2005. Cofounder Karim said that the inspiration for YouTube first came from the difficulty of finding online videos of the Justin Timberlake and Janet Jackson exposure incident during the 2004 Super Bowl halftime performance and later of the 2004 Indian Ocean tsunami, which made the need for a video-sharing site clear. Cofounders Hurley and Chen said that the original idea for YouTube was an online dating service based on videos. On February 14, 2005, the domain name www.youtube.com was activated, and on April 23, 2005, the first YouTube video, titled “Me at the Zoo,” was uploaded, which shows cofounder Karim at the San Diego Zoo (https://en.wikipedia.org/wiki/YouTube).

In September 2005, a Nike advertisement featuring Ronaldinho reached one million views for the first time. From November 2005 to April 2006, YouTube received an $11.5 million venture investment from Sequoia Capital, and on December 15, 2005, the www.youtube.com site was officially launched. In July 2006, YouTube announced that more than 65,000 new videos were being uploaded every day. In November 2006, YouTube was bought by Google, Inc. for $1.65 billion. In 2010, the company headquarters was moved to San Bruno, California, USA. In April 2011, YouTube software engineer James Zern revealed that 30% of videos accounted for 99% of YouTube’s views. In 2014, YouTube announced that 300 hours of new videos were uploaded every minute. In 2015, YouTube was ranked as the third most visited website in the world (based on Alexa and SimilarWeb). In 2015, YouTube was attracting more than 15 billion visitors per month, making it the world’s top TV and video website (based on SimilarWeb). In January 2016, the San Bruno headquarters (CA, USA) was expanded to an office park that could support up to 2,800 employees based on a facility investment of $215 million.

YouTube is a representative Internet video OTT service, which enabled user-generated content (UGC) video uploading. UGC is also called user created content (UCC), which is a video that is made and uploaded mostly by amateurs. This broke the traditional video service model of professionally generated content (e.g., TV programs or videos made in studios and broadcasting companies) and direct movie/video sales (e.g., video streaming companies like Netflix, Disney+, etc.). Now, video streaming companies also use OTT service models.

YouTube video technology used to require an Adobe Flash Player plug-in to be installed in the user’s Internet browser to view YouTube videos. In January 2010, YouTube’s experimental site used the built-in multimedia capabilities of HTML5 web browsers to enable YouTube videos to be viewed without requiring an Adobe Flash Player or any other plug-in to be installed. On January 27, 2015, HTML5 was announced as the default playback method. HTML5 video streaming uses MPEG Dynamic Adaptive Streaming over HTTP (DASH), which supports adaptive bit-rate HTTP-based streaming by controlling the video and audio quality based on the status of the player device and network.
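
To make the adaptive bit-rate idea behind MPEG-DASH concrete, the following Python sketch selects the highest-bitrate representation that fits within the measured network throughput. The representation list, bitrates, and safety margin are illustrative assumptions, not the actual YouTube player logic.

# Minimal sketch of adaptive bit-rate representation selection (MPEG-DASH idea).
# The representation list and safety margin are illustrative assumptions.

representations = [  # (label, video bitrate in kbps), highest first
    ("1080p", 8000),
    ("720p", 5000),
    ("480p", 2500),
    ("360p", 1000),
]

def select_representation(measured_throughput_kbps, margin=0.8):
    """Pick the highest-bitrate representation that fits within a
    fraction (margin) of the measured network throughput."""
    budget = measured_throughput_kbps * margin
    for label, bitrate in representations:
        if bitrate <= budget:
            return label
    return representations[-1][0]  # fall back to the lowest quality

print(select_representation(7000))  # -> "720p"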

The preferred video format on YouTube is H.264 because older smart devices can decode H.264 videos more easily and with less power consumption. Some older smart devices may not have an H.265 decoder installed, and even if one is installed, the power consumption and processing load may be too burdensome for an older smart device. Therefore, if the wireless network (e.g., Wi-Fi, 5G, or LTE) is sufficiently good, H.264 could be the preferred video codec to use. This is why, in some cases, when a user tries to upload a high-quality H.265 video to YouTube, the YouTube upload interface may convert the H.265 video into a lower-quality H.264 format or multiple H.264 formats, which are used in MPEG-DASH. More details on MPEG-DASH are described in Chapter 6, and the technical details on YouTube’s recommended upload encoding settings can be found at https://support.google.com/youtube/answer/1722171?hl=en.

Netflix Multimedia Codecs

Netflix started multimedia streaming in 2007 using the Microsoft video codec VC-1 and audio codec Windows Media Audio (WMA). Currently, Netflix uses H.264 AVC and H.265 HEVC. H.264 AVC is used to support the CE3-DASH, iOS1, and iOS2 Netflix profiles. H.265 HEVC is used to support the CE4-DASH Netflix profile for UltraHD devices. CE3-DASH and CE4-DASH commonly use HE-AAC and Dolby Digital Plus for audio codecs, and iOS1 and iOS2 both use HE-AAC and Dolby Digital for audio codecs (https://en.wikipedia.org/wiki/Technical_details_of_Netflix).

Among existing Netflix encoding profiles, the following are deprecated (which means that although they are available, they are not recommended): CE1, CE2, and Silverlight use VC-1 for video and WMA for the audio codec; Link and Kirby-PIFF use H.263 for video and Ogg Vorbis for the audio codec; and Vega uses H.264 AVC for video and AC3 for the audio codec.

Table 5-1 summarizes the Netflix profile types (that are not deprecated) and corresponding video and audio codec types, multimedia container, digital rights management, and supported device types.
Table 5-1 Netflix profile types and video and audio codec types used

Netflix profile type | Video codec | Audio codec | Multimedia container | Digital rights management | Supported devices
CE3-DASH | H.264 AVC | HE-AAC, Dolby Digital Plus, Ogg Vorbis | Unmuxed FMP4 | PlayReady, Widevine | Android devices, Roku 2, Xbox, PS3, Wii, Wii U
iOS1 | H.264 AVC | HE-AAC, Dolby Digital | Muxed M2TS | PlayReady, NFKE | iPhone, iPad
iOS2 | H.264 AVC | HE-AAC, Dolby Digital | Unmuxed M2TS | PlayReady, NFKE | iPhone, iPad
CE4-DASH | H.265 HEVC, VP9 | HE-AAC, Dolby Digital Plus | Unmuxed FMP4 | PlayReady, Widevine | UltraHD devices

Netflix uses the multimedia container file formats of fragmented MP4 (FMP4) and the MPEG-2 transport stream (M2TS). Multimedia containers enable multiple multimedia data streams to be combined into a single file, which commonly includes the metadata that helps to identify the combined multimedia streams. This is why multimedia containers are also called “metafiles” or informally “wrappers” (https://en.wikipedia.org/wiki/Container_format). More details on containers, Docker, and Kubernetes will be provided in Chapter 8.

For digital rights management (DRM), CE3-DASH and CE4-DASH use PlayReady or Widevine, and iOS1 and iOS2 use PlayReady or NFKE. DRM enables control and policy enforcement of digital multimedia legal access and is based on licensing agreements and encryption technology. DRM technologies help monitor, report, and control use, modification, and distribution of copyrighted multimedia and software. There are various technological protection measures (TPM) that enable digital rights management for multimedia content, copyrighted works, and proprietary software and hardware (https://en.wikipedia.org/wiki/Digital_rights_management).

Disney+ Multimedia Technology

Disney Streaming started off as BAMTech in February 2015, which was founded through MLB Advanced Media. In August 2016, one-third of the company was acquired by the Walt Disney Company for $1 billion. In October 2018, BAMTech was internally renamed to Disney Streaming Services, and in August 2021, via Twitter, it was announced that the company was renamed to Disney Streaming (https://en.wikipedia.org/wiki/Disney_Streaming).

Disney+ makes automatic adjustments and delivers the best multimedia quality based on device capability and network conditions. Disney+ supports the video formats of Full HD, 4K Ultra HD, HDR10, Dolby Vision, and (limited) IMAX Enhanced (https://help.disneyplus.com/csp?id=csp_article_content&sys_kb_id=543d7f68dbbf78d01830269ed3961932).
  • Full HD video resolution is 1080p (higher than HD 720p).

  • 4K Ultra HD video resolution is 3840×2160 (higher than Full HD).

  • HDR10 provides a broader range of colors and brightness display and is the default High Dynamic Range (HDR) format.

  • Dolby Vision adds scene-based dynamic optimization of colors and brightness on top of HDR.

  • IMAX Enhanced supports the expanded 1.90:1 aspect ratio so the picture can cover the full height of the screen, but the other IMAX Enhanced features and functionalities are not supported yet.

H.264/MPEG-4 AVC Video Technology

H.264 is also called MPEG-4 Part 10 Advanced Video Coding (AVC), which was developed by the Joint Video Team (JVT) project partnership of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC JTC1 Moving Picture Experts Group (MPEG). Worldwide, H.264 and H.265 are the two most popular standards for video recording, compression, and distribution; details on H.265 are provided later in this chapter. The ITU-T standards for H.264 AVC for generic audiovisual services can be found at www.itu.int/rec/T-REC-H.264.

H.264 was designed to provide the same video quality at half (or less) of the bitrate of MPEG-2, H.263, or MPEG-4 Part 2 without increasing the complexity too much. H.264 uses block-oriented motion compensation compression, similar to former standards like H.262 and H.263. The increased computation complexity of H.264 reflects the improved processing capability of CPUs and GPUs on modern devices. H.264 is a family of standards that comprises several different video encoding profiles, but because an H.264 decoder installed on a device may not be able to decode all profiles, the decodable profiles need to be communicated through an exchange of information with the content provider. H.264 compression results in lower bitrates compared to earlier standardized video codecs. H.264 lossless encoding is possible but is rarely used. A list of the H.264 video encoding profiles is provided later in this chapter (https://doi.org/10.1007/978-3-030-62124-7_12).

Luma and Chroma

Since we are studying video technologies, it is necessary to understand the terms “luma” and “chroma.” In simplest terms, “luma” is used to represent the image’s brightness level, and “chroma” is used to represent the image’s color, which is combined through red (R), green (G), and blue (B) light sources. This is why luma and chroma are commonly paired together.

The general term “brightness” is commonly used as a perceived concept and is not objectively measurable. This is why “luminance” and “luma” are used in science and engineering. Luminance is defined in a way that is both quantifiable and accurately measurable. In technical terms, “luma” is different from “luminance.” Based on the International Commission on Illumination (CIE), application-wise, “luma” is used in video engineering (like the H.264, H.265, and H.266 codecs and standards), whereas “relative luminance” is used in color science.

Luminance represents the level of brightness of a beam of light traveling in a given direction. Luminance is expressed as luminous intensity per unit area using the unit candela per square meter (cd/m²), which is defined by the International System of Units. The name of the unit candela (cd) is the Latin word for “candle,” which is the basis of brightness for this unit. One candela (i.e., 1 cd) is approximately the luminous intensity of one (ordinary) wax candle.

In video engineering, “relative luminance” is sometimes used to represent a monitor’s brightness. Relative luminance is derived from a weighted sum of linear RGB components. On the other hand, luma uses gamma-compressed RGB components.

The gamma (γ) encoding applied in the luma computation helps to optimize the bit encoding efficiency of an image. Gamma encoding is based on the gamma correction nonlinear operation used to encode and decode luminance (Y) and luma (Y’) values. The basic equation has the form Y’ = A·Y^γ (other equations exist), where A = 1 is commonly used. If γ < 1, the encoding process is called gamma compression. If γ > 1, the encoding process is called gamma expansion (e.g., γ = 2.2). For the special case of γ = 1 and A = 1, there is no change, resulting in Y = Y’. The same rule can be applied to RGB as well using R’ = A·R^γ, G’ = A·G^γ, and B’ = A·B^γ (other equations exist).
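
A minimal Python sketch of this gamma relation is shown below, assuming A = 1 and input values normalized to the range 0 to 1:

# Minimal sketch of gamma encoding/decoding, Y' = A * Y**gamma with A = 1.
# Inputs are assumed to be normalized to the range [0, 1].

def gamma_encode(y, gamma=1/2.2, a=1.0):
    """Gamma compression when gamma < 1 (e.g., 1/2.2)."""
    return a * (y ** gamma)

def gamma_decode(y_prime, gamma=1/2.2, a=1.0):
    """Inverse operation (gamma expansion) recovering the linear value."""
    return (y_prime / a) ** (1.0 / gamma)

y = 0.18                      # linear luminance of mid-gray (illustrative)
y_prime = gamma_encode(y)     # about 0.46
assert abs(gamma_decode(y_prime) - y) < 1e-9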

Examples of how luma is computed using the weighted sum of gamma-compressed RGB video components are provided here (please note that the actual process is more complicated and other equations exist). In ITU-R Rec. BT.709, relative luminance (Y) is calculated based on a linear combination of the R, G, B pure colorimetric values using Y = 0.2126R + 0.7152G + 0.0722B. However, luma (Y’) is calculated in the Rec. 709 specs based on the gamma-compressed R, G, B components R’, G’, B’, respectively, using Y’ = 0.2126R’ + 0.7152G’ + 0.0722B’. For luma calculations, the commonly used digital standard CCIR 601 uses Y’ = 0.299R’ + 0.587G’ + 0.114B’, and the SMPTE 240M equation Y’ = 0.212R’ + 0.701R’ + 0.087B’ was used for transitional 1035i HDTV.
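
These weighted sums translate directly into code. The sketch below computes Rec. 709 relative luminance from linear RGB and Rec. 709 / CCIR 601 luma from gamma-compressed R’G’B’, assuming inputs normalized to the range 0 to 1:

# Sketch of the luminance/luma weighted sums given in the text.
# All inputs are assumed to be normalized to [0, 1].

def rec709_relative_luminance(r, g, b):
    """Y from linear RGB (Rec. 709 / BT.709 coefficients)."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def rec709_luma(rp, gp, bp):
    """Y' from gamma-compressed R'G'B' (same coefficients)."""
    return 0.2126 * rp + 0.7152 * gp + 0.0722 * bp

def ccir601_luma(rp, gp, bp):
    """Y' from gamma-compressed R'G'B' (CCIR 601 coefficients)."""
    return 0.299 * rp + 0.587 * gp + 0.114 * bp

print(rec709_luma(1.0, 1.0, 1.0))   # white -> 1.0
print(ccir601_luma(1.0, 0.0, 0.0))  # pure red -> 0.299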

Chroma is Greek for the word “color” and a shortened version of the word “chrominance.” The chroma components notation of Y′CbCr is commonly used to represent a family of color spaces, where the luma is Y′, the blue-difference is Cb, and red-difference is Cr (https://en.wikipedia.org/wiki/YCbCr).

For example, based on the ITU-R BT.601 conversion for the 8-bits-per-sample case, the digital Y′CbCr values are derived from analog R’G’B’ using the equations Y’ = 16 + (65.481R’ + 128.553G’ + 24.966B’), Cb = 128 + (-37.797R’ - 74.203G’ + 112B’), and Cr = 128 + (112R’ - 93.786G’ - 18.214B’).
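
These three equations can be written as a small conversion routine. The following sketch assumes gamma-compressed R’G’B’ values normalized to the range 0 to 1 and produces 8-bit digital Y′CbCr values as in the BT.601 case above:

# Sketch of the ITU-R BT.601 8-bit R'G'B' -> Y'CbCr conversion given above.
# R', G', B' are assumed to be gamma-compressed values in [0, 1].

def rgb_to_ycbcr_bt601(rp, gp, bp):
    y  = 16  + ( 65.481 * rp + 128.553 * gp +  24.966 * bp)
    cb = 128 + (-37.797 * rp -  74.203 * gp + 112.0   * bp)
    cr = 128 + (112.0   * rp -  93.786 * gp -  18.214 * bp)
    return round(y), round(cb), round(cr)

print(rgb_to_ycbcr_bt601(0.0, 0.0, 0.0))  # black -> (16, 128, 128)
print(rgb_to_ycbcr_bt601(1.0, 1.0, 1.0))  # white -> (235, 128, 128)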

Human vision is more sensitive to changes in brightness than to changes in color. This effect is exploited in “chroma subsampling” technology to reduce the encoded bitrate while maintaining the same perceived image quality to the human eye (which means that there is no visible difference in the quality of the video to the viewer). Therefore, in chroma subsampling technology, more bits are assigned to the luma (Y’) representation, and fewer bits are assigned to the encoding of the color difference values Cb and Cr (https://en.wikipedia.org/wiki/Chroma_subsampling).

Chroma subsampling schemes are classified using a three-number sequence S1:S2:S3 or a four-number sequence S1:S2:S3:S4. The chroma samples are defined with respect to a reference region that is two pixel rows high and S1 pixels wide (S1 is commonly 4). S2 represents the number of chroma samples (Cr, Cb) in the first row of the S1 pixels. S3 represents the number of changes of chroma samples (Cr, Cb) between the first and second rows of the S1 pixels. S4, when present, represents the horizontal sampling factor of the alpha channel (equal to S1). Figure 5-1 presents simplified examples of the popular chroma subsampling formats, which include 4:4:4, 4:2:2, 4:2:0, and 4:1:1.
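
To make the idea concrete, the sketch below performs a simple 4:2:0-style chroma downsampling by averaging each 2×2 block of a chroma plane into one sample. This is only an illustration; real encoders use standardized filters and chroma sample positions.

# Simplified 4:2:0 chroma subsampling: keep full-resolution luma, average each
# 2x2 block of a chroma plane into one sample. Real codecs use standardized
# filters and sample siting; this is only an illustration.

def subsample_420(chroma):
    """chroma: 2D list (height x width), both dimensions assumed even."""
    h, w = len(chroma), len(chroma[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            avg = (chroma[y][x] + chroma[y][x + 1] +
                   chroma[y + 1][x] + chroma[y + 1][x + 1]) / 4.0
            row.append(avg)
        out.append(row)
    return out

cb = [[100, 102, 110, 112],
      [101, 103, 111, 113]]
print(subsample_420(cb))  # [[101.5, 111.5]]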


Figure 5-1

Chroma Subsampling Examples

The H.264, H.265, and H.266 video codecs can code a picture using one or three color plane sample arrays at the given pixel sample bit depth. The luma color plane is the primary color plane, which is used to represent local brightness information. Chroma planes can be used to represent color hue and saturation. In the 4:4:4 format, all three color planes have the same resolution. In the 4:2:0 format, compared to the luma plane, the chroma planes have half the width and half the height. In the 4:2:2 format, compared to the luma plane, the chroma planes have the same height but half the width. The 4:4:4 format is used by computer desktop sharing and wireless display applications, and the 4:2:0 format is used by many consumer applications (https://ieeexplore.ieee.org/document/9503377).

H.264 Video Profiles

There are four H.264 video profile groups, which include the (1) non-scalable 2D video application profiles; (2) camcorders, editing, and professional application profiles; (3) scalable video coding (SVC) profiles; and (4) multiview video coding (MVC) profiles (www.itu.int/ITU-T/recommendations/rec.aspx?id=14659). As new updates to the standards are continuously published, the reader is recommended to recheck all standard updates before making a profile selection, as only the basic profiles are listed in the following.

Non-scalable 2D Video Application Profiles

  • Constrained Baseline Profile (CBP): For low-cost applications. Used in videoconferencing and mobile applications

  • Baseline Profile (BP): For low-cost applications that require additional data loss robustness. Used in videoconferencing and mobile applications

  • Extended Profile (XP): For streaming video. Has a relatively high compression capability with enhanced robustness to support data losses and server stream switching

  • Main Profile (MP): For standard-definition digital TV broadcasts that use the MPEG-4 format in the Digital Video Broadcasting (DVB) standard. Used for HDTV, but rarely used after the High Profile (HP) was introduced

  • High Profile (HP): For DVB HDTV broadcast and disc storage applications. Used as the Blu-ray Disc storage format and DVB HDTV broadcast services

  • Progressive High Profile (PHiP): Similar to HP without field coding features

  • Constrained High Profile: Similar to PHiP without B (bi-predictive) slices

  • High 10 Profile (Hi10P): Builds on HP with added support for up to 10 bits per sample of decoded picture precision

  • High 4:2:2 Profile (Hi422P): For professional applications that use interlaced video. Builds on Hi10P with added support for the 4:2:2 chroma subsampling format while using up to 10 bits per sample of decoded picture precision

  • High 4:4:4 Predictive Profile (Hi444PP): Builds on top of Hi422P, supporting up to 4:4:4 chroma sampling with up to 14 bits per sample. Supports efficient lossless region coding and coding of individual pictures as three separate color planes

Camcorders, Editing, and Professional Application Profiles

Four additional Intra-frame-only profiles are used mostly for professional applications involving camera and editing systems:
  • High 10 Intra Profile

  • High 4:2:2 Intra Profile

  • High 4:4:4 Intra Profile

  • Context-adaptive variable-length coding (CAVLC) 4:4:4 Intra Profile

Scalable Video Coding (SVC)

  • Scalable Constrained Baseline Profile: Primarily for real-time communication applications

  • Scalable High Profile: Primarily for broadcast and streaming applications

  • Scalable Constrained High Profile: Primarily for real-time communication applications

  • Scalable High Intra Profile: Primarily for production applications. Constrained to all-intra use

Multiview Video Coding (MVC)

  • Stereo High Profile: Profile for two-view stereoscopic 3D video

  • Multiview High Profile: Profile for two or more views using both inter-picture (temporal) and MVC inter-view prediction

  • Multiview Depth High Profile: Profile for 3D video content improved compression through depth map and video texture information joint coding

  • Enhanced Multiview Depth High Profile: Profile for enhanced combined multiview coding with depth information

H.264 AVC Encoder and Decoder

In this section, more details on the H.264 AVC encoder and decoder are provided (https://doi.org/10.1007/978-3-030-62124-7_12).

H.264 divides an image into sequential groups of macroblocks based on raster scan order, where each group is called a slice. Figure 5-2 presents an example of an image divided into slices.


Figure 5-2

Example of an Image Divided into Slices

A simplified H.264 AVC encoder is presented in Figure 5-3. The H.264 encoder includes integer transform, variable block-size motion compensation, quarter-pixel accuracy in motion vectors, multiple reference picture motion compensation, intra-frame directional spatial prediction, scaling, quantization, in-loop deblocking filter, context-adaptive variable length coding (CAVLC), context-adaptive binary arithmetic coding (CABAC), entropy coding, etc.


Figure 5-3

H.264 AVC Encoder Structure


Figure 5-4

H.264 AVC Decoder Structure

A simplified H.264 AVC decoder example is presented in Figure 5-4. The decoder operates in the reverse order of the encoder in reconstructing the original image from the compressed encoded bit stream. The decoder uses entropy decoding, inverse quantization, inverse transform of residual pixels, motion compensation, inter-frame prediction, intra-frame prediction, reconstruction, an in-loop deblocking filter, etc. in reconstructing the video images.

More details on the H.264 AVC encoder are provided in the following. File size compression is conducted using both intra-frame prediction and inter-frame prediction in H.264. Intra-frame prediction is used to compress the file size of an image using parts mostly within that individual frame. Inter-frame prediction is used to compress the file size of an image using parts mostly from other images (e.g., P and B frames), where more details on I, P, and B frames will be provided later in this chapter. In a video, motion changes take place over multiple images, so motion compensation must use inter-frame prediction algorithms. Since images of a video are interrelated, in-loop processing is used to provide the feedback loop operations needed to compress the video file (in the encoder) and decompress the video file (in the decoder) of the H.264 video codec system.

Integer Transform: After the macroblocks are obtained, the integer transform is applied. H.264 uses a simple 4×4 discrete cosine transformation (DCT) with integer precision. This is based on the fact that the residual pixels have a very low level of spatial correlation in the H.264 I and P frame prediction schemes. Earlier video codec standards used larger and more complex DCTs, which were a burden to the graphics processor and resulted in prediction shift errors due to the rounding errors that occurred in the floating-point calculations conducted in the DCT and inverse DCT (IDCT) transformation processes. An inverse integer transform is performed before the deblocking filter and is used in the feedback loop of the H.264 encoding process.

DCT is a process that converts a time-domain signal into frequency-domain signals, which is a spectral signal decomposition technique. DCT changes a time-domain signal into a sum of multiple frequency domain signals (that are represented with cosine signals) that have different frequencies. Integer transformation is an integer DCT process, which is an integer approximation of the DCT. The integer approximation process enables selectable block sizes to be used, which results in exactly matching integer computations. The integer approximation significantly helps to reduce the computation complexity, which results in low complexity and no drifting. DCT alone (without integer approximation) is known to create undesired prediction shifts due to the rounding errors in the floating-point calculations conducted in the DCT and IDCT processes. Integer transform helps to improve the intra-coding spatial prediction accuracy, which is followed by transform coding.
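
A minimal sketch of a 4×4 integer core transform of the kind used in H.264 is shown below. The post-scaling that the real codec folds into quantization is intentionally omitted, so the outputs are unscaled integer coefficients.

# Sketch of an H.264-style 4x4 integer core transform, W = C * X * C^T.
# The normalization/scaling factors that H.264 folds into quantization are
# omitted here, so the outputs are unscaled coefficients.

C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def integer_transform_4x4(residual_block):
    """Apply the integer core transform to a 4x4 block of residual pixels."""
    return matmul(matmul(C, residual_block), transpose(C))

x = [[5, 3, 1, 0],
     [2, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 0]]
print(integer_transform_4x4(x))  # all-integer coefficients, DC value in [0][0]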

Scaling and Quantization: After the integer transform, H.264 conducts a combined scaling and quantization process. The integer transform values go through a quantization process using a 4×4 quantization matrix that maps values to quantized levels. Then the scaling process is applied, which is a normalization process of values, where all values are mapped to values within 0 to 1. A scaling and inverse quantization process will be performed before the inverse integer transform is conducted in the feedback loop.
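
The following sketch illustrates only the quantization idea; it uses a single scalar step size rather than the 4×4 quantization matrices and combined scaling factors actually specified in H.264, so it is not the normative procedure.

# Illustrative scalar quantization/dequantization of transform coefficients.
# H.264 actually combines scaling matrices with a quantization parameter (QP);
# this simplified scalar version only shows the principle.

def quantize(coeffs, step):
    """Map each transform coefficient to a quantized level."""
    return [[round(c / step) for c in row] for row in coeffs]

def dequantize(levels, step):
    """Reconstruct approximate coefficients from the quantized levels."""
    return [[lvl * step for lvl in row] for row in levels]

coeffs = [[52, -14, 3, 0],
          [ 9,  -4, 1, 0],
          [ 2,   0, 0, 0],
          [ 0,   0, 0, 0]]
levels = quantize(coeffs, step=8)
print(levels)                      # most high-frequency coefficients become 0
print(dequantize(levels, step=8))  # approximate reconstruction of coeffs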

Entropy Coding: Entropy coding is a lossless data compression scheme that can approach the limit of data compression without losing any information or resolution of the image. H.264 uses three entropy coding schemes, which are the Exponential-Golomb (Exp-Golomb) code, context-adaptive variable length coding (CAVLC), and context-adaptive binary arithmetic coding (CABAC). H.264 entropy coding has two modes that are distinguished by the “entropy_coding_mode” parameter settings, which apply different encoding techniques. If entropy_coding_mode = 0, the header data, motion vectors, and non-residual data are encoded with Exp-Golomb, and the quantized residual coefficients are encoded using CAVLC. Exp-Golomb is a simpler scheme than CAVLC. If entropy_coding_mode = 1, then CABAC is applied, which is a combination of the binarization, context modeling, and binary arithmetic coding (BAC) processes. BAC is a scheme that results in a high data compression rate. Binarization converts all nonbinary data into a string of binary bits (bins) for the BAC process. Context modeling has two modes, the regular coding mode and the bypass coding mode. Regular coding mode executes the context model selection and access process, where the statistics of the context are used to derive probability models, which are used to build context models in which the conditional probabilities (for the bins) are stored. Bypass coding mode does not apply context modeling and is used to expedite the coding speed.

CAVLC has a relatively complex process of quantizing the residual coefficients. CABAC is very effective in compressing images that include symbols with a probability greater than 0.5, in which algorithms like CAVLC are inefficient. For example, H.264 main and high profiles use CABAC for compression of selected data parts as well as the quantized residual coefficients. This is why CABAC is continuously applied in H.265 and H.266 entropy coding, but CAVLC is not.
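
As an example of the simplest of the H.264 entropy codes, the sketch below encodes an unsigned integer with the order-0 Exp-Golomb code (used when entropy_coding_mode = 0 for header data and similar syntax elements):

# Order-0 exponential-Golomb (ue(v)) encoding: write M leading zeros, a 1,
# then the M-bit binary value of (v + 1 - 2**M), where M = floor(log2(v + 1)).

def exp_golomb_ue(v):
    code_num = v + 1
    m = code_num.bit_length() - 1          # floor(log2(v + 1))
    prefix = "0" * m + "1"
    suffix = format(code_num - (1 << m), "0{}b".format(m)) if m > 0 else ""
    return prefix + suffix

for v in range(6):
    print(v, exp_golomb_ue(v))
# 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101, 5 -> 00110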

Intra-frame Prediction: H.264 conducts a more sophisticated intra-frame directional spatial prediction process compared to earlier video codec standards (e.g., H.262 and H.263). Neighboring reconstructed pixels from intra-coded or inter-coded reconstructed pixels are used in the prediction process of intra-coded macroblocks. Each intra-coded macroblock can be formed using 16×16 or 4×4 intra prediction block size. Each intra-coded macroblock will be determined through a comparison of different prediction modes applied to the macroblock. The prediction mode that results in the smallest prediction error is applied to the macroblock. The prediction error is the difference (residual value of the subtraction process) between the predicted value and the actual value of the macroblock. The prediction errors go through a 4×4 integer transform coding process. In a macroblock, each 4×4 block can have a different prediction mode applied, which influences the level of the video encoding resolution in H.264.
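
The sketch below illustrates the mode-selection principle for a 4×4 block using only three simplified predictors (vertical, horizontal, and DC) and a sum-of-absolute-differences cost. The actual H.264 procedure defines nine 4×4 modes and uses reconstructed neighboring pixels; this is a reduced illustration of the same decision idea.

# Simplified intra-prediction mode decision for a 4x4 block: build candidate
# predictions from neighboring pixels, keep the mode with the smallest SAD.

def sad(block, pred):
    return sum(abs(block[i][j] - pred[i][j]) for i in range(4) for j in range(4))

def predict_modes(top, left):
    """top: 4 pixels above the block, left: 4 pixels to its left."""
    vertical = [top[:] for _ in range(4)]            # copy top row downwards
    horizontal = [[left[i]] * 4 for i in range(4)]   # copy left column rightwards
    dc = (sum(top) + sum(left) + 4) // 8             # rounded mean of neighbors
    return {"vertical": vertical, "horizontal": horizontal,
            "dc": [[dc] * 4 for _ in range(4)]}

def best_mode(block, top, left):
    candidates = predict_modes(top, left)
    return min(candidates.items(), key=lambda kv: sad(block, kv[1]))[0]

block = [[80, 82, 84, 86]] * 4          # each row repeats the top neighbors
print(best_mode(block, top=[80, 82, 84, 86], left=[60, 61, 62, 63]))  # vertical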

Deblocking Filter: Due to the macroblock-based encoding mechanism applied, undesirable block structures can become visible. The deblocking filter has two general types (deblocking and smoothing) which are applied to the 4×4 block edges to eliminate the visible undesirable block structures. The deblocking filter’s length, strength, and type are adjusted considering the macroblock coding parameters and edge detection based spatial activity. The macroblock coding parameters include the reference frame difference, coded coefficients, as well as the intra-coding or inter-coding method applied.

Variable Block-Size Motion Compensation: In H.264, the default macroblock size is 16×16. Alternatively, the 16×16 macroblock can be divided into four 8×8 partitions. Each macroblock or partition can be divided into smaller sub-partitions for more accurate motion estimation. Based on the 16×16 structure, there are four M-type options (sub-partitions) that can be used, which include the 16×16, two 16×8 sub-partitions, two 8×16 sub-partitions, and four 8×8 sub-partitions. Based on the 8×8 partitions, there are also four 8×8-type options (sub-partitions) that can be used, which include the 8×8, two 8×4 sub-partitions, two 4×8 sub-partitions, and four 4×4 sub-partitions. These eight type options are presented in Figure 5-5 and are used as the basic blocks in the H.264 luma image inter-frame motion estimation process. In H.264, the motion compensation accuracy is at a quarter-pixel precision level for luma images. This process is conducted using a six-tap filtering method applied to first obtain the half-pixel position values, and these values are then averaged with neighboring samples to obtain the quarter-pixel position values.
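
A minimal one-dimensional sketch of this interpolation is shown below. It applies the six-tap filter coefficients (1, -5, 20, 20, -5, 1) with division by 32, as used by H.264 for half-pixel luma interpolation, and averages a full-pixel and half-pixel sample for a quarter-pixel value; clipping to the valid pixel range and the full 2-D filtering are simplified away.

# 1-D sketch of H.264 luma sub-pixel interpolation: the six-tap filter
# (1, -5, 20, 20, -5, 1) / 32 produces the half-pel sample between p[i] and
# p[i+1]; a quarter-pel sample is the rounded average of a full-pel sample and
# a half-pel neighbor. Range clipping and 2-D filtering are omitted.

def half_pel(p, i):
    """Half-pixel value between p[i] and p[i+1]; uses samples p[i-2]..p[i+3]."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * p[i - 2 + k] for k, t in enumerate(taps))
    return (acc + 16) >> 5                    # divide by 32 with rounding

def quarter_pel(p, i):
    """Quarter-pixel value between p[i] and the half-pel sample to its right."""
    return (p[i] + half_pel(p, i) + 1) >> 1   # average with rounding

row = [10, 20, 30, 40, 50, 60, 70, 80]
print(half_pel(row, 3))     # half-pel between 40 and 50 -> 45
print(quarter_pel(row, 3))  # quarter-pel between 40 and the half-pel -> 43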

Using these H.264 video compression techniques, I, P, and B frames are formed. I frames are intra (I) coded independent pictures that serve as keyframes for the following P and B inter-frame prediction processes. P frames are predictive (P) coded pictures obtained from inter-frame prediction of previous I and P frames. B frames are bipredictive (B) coded pictures that are obtained from inter-frame prediction of I, P, and other B frames.


Figure 5-5

H.264 Macroblock Segmentation-Type Options (M Types and 8×8 Types) for Motion Estimation

Group of Pictures (GOP): Based on H.264 and related MPEG standards, the first and last frame in a GOP are I frames, and in between these I frames, there are P frames and B frames. The reference frames of a GOP can be an I frame or P frame, and a GOP can have multiple reference frames. Macroblocks in P frames are based on forward prediction of changes in the image. Macroblocks in B frames are predicted using both forward prediction and backward prediction combinations. In Figure 5-6, I0 is the reference frame for P1, I0 and P1 are the reference frames for P2, and I0 and P2 are the reference frames for P3. In addition, I0, P2, and P3 are the reference frames for B1.
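
The reference relations of the Figure 5-6 example can also be written out as a small data structure, which makes the forward and bidirectional dependencies (and a valid decoding order) explicit. The frame names follow the figure; the helper function is only an illustration.

# Reference-frame relations of the GOP example in Figure 5-6, expressed as a
# dictionary mapping each frame to the frames it is predicted from.
gop_references = {
    "I0": [],                  # intra-coded keyframe, no references
    "P1": ["I0"],              # forward prediction from I0
    "P2": ["I0", "P1"],
    "P3": ["I0", "P2"],
    "B1": ["I0", "P2", "P3"],  # bidirectional prediction
}

def decode_order(refs):
    """Return an order in which every frame appears after its references."""
    done, order = set(), []
    while len(order) < len(refs):
        for frame, deps in refs.items():
            if frame not in done and all(d in done for d in deps):
                done.add(frame)
                order.append(frame)
    return order

print(decode_order(gop_references))  # ['I0', 'P1', 'P2', 'P3', 'B1']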

H.264 can have a GOP structure that has no B frames, or use multiple reference frames, or apply hierarchical prediction structures.

A H.264 GOP with no B frames would reduce the compression rate, which means that the video file size would be larger. But not using B frames would enable the H.264 video playing device to use less memory (because B frame processing requires more memory to process) and require a lighter image processing load. A H.264 GOP with no B frames is presented in Figure 5-7, where there are four reference frames, which are I0, P1, P2, and P3.


Figure 5-6

H.264 GOP Formation Example Based on I, P, and B Frames


Figure 5-7

H.264 GOP Formation Process Example with No B Frames


Figure 5-8

H.264 Hierarchical Prediction Based GOP Formation Example

Hierarchical Prediction Structure: A H.264 hierarchical prediction example is presented in Figure 5-8, where the GOP starts with I0 and ends with I12, and I0 is the reference frame for P1. Layer 1 is formed with the three frames I0, P1, and I12. Using the layer 1 frames, the layer 2 frames B2 and B5 are formed. B2 is predicted using I0, P1, and I12. B5 is predicted using P1 and I12. Using the layer 1 and layer 2 frames, the layer 3 frames B1, B3, B4, and B6 are formed. B1 is predicted using I0, P1, and B2. B3 is predicted using I0 and B2. B4 is predicted using I0 and B5. B6 is predicted using I12, B2, and B5. For technical accuracy, please note that the “+” sign used in Figure 5-8 indicates a combination of pixel information of image parts used in the inter-prediction process, not that the actual image pixel values were simply added up. For example, “I12+B2+B5” means that B6 was predicted using the frames I12, B2, and B5. In addition, in order to control the video compression efficiency, larger quantization parameters can be applied to layers that are formed later down the hierarchy sequence.


Figure 5-9

H.264 MVC Process Example

H.264 supports multiview video coding (MVC), which enables users to select their preferred views (if this option is available). MVC is especially important for new XR systems and advanced multimedia applications. For example, Free Viewpoint Video (FVV) services use MVC. Figure 5-9 shows a H.264 MVC process example based on four views. MVC takes advantage of the hierarchical prediction structure and uses the inter-view prediction process based on the key pictures. Among the views, View 0 uses I frames as its reference frames, whereas View 1, View 2, and View 3 use P frames which are based on View 0’s I frames (and a sequence of P frames), which creates a vertical IPPP frame relation among the different views. Within each view, the same hierarchical prediction structure is applied in the example of Figure 5-9.

H.265/MPEG-H HEVC Video Technology

In this section, details on the H.265 video codec are provided (https://doi.org/10.1007/978-3-030-62124-7_12). H.265 is also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC), which is a successor of the H.264/MPEG-4 AVC video codec. The H.265 HEVC standards were jointly developed by the Joint Collaborative Team on Video Coding (JCT-VC), which consists of the ISO/IEC MPEG group and the ITU-T VCEG. In April 2013, H.265 standard version 1 was approved. HEVC was also classified in MPEG-H Part 2 by the ISO/IEC which was published in August 2020 in the standard document ISO/IEC 23008-2:2020 titled “Information Technology - High Efficiency Coding and Media Delivery in Heterogeneous Environments - Part 2: High Efficiency Video Coding” (www.iso.org/standard/75484.html).

The two main objectives of H.265 are to improve the compression ratio by 50% compared to H.264 (especially for higher video resolutions like 7680×4320 for 8K UHDTV) and to reduce the coding and decoding complexity and time (especially by using advanced parallel processing technologies). Several publications claim that H.265 has accomplished a 50% or higher compression improvement compared to H.264 at the same visual quality.

Extended Macroblock Size (EMS) structures are used in H.265 instead of the macroblocks used in H.264. These are formed using the Coding Tree Block (CTB) structure, which has a 64×64 pixel maximum block size, much larger than the 16×16 pixel macroblock size used in H.264. These larger block structures are used to form Coding Tree Units (CTUs), which allow a picture to be sub-partitioned into a wider range of block sizes.

CTB is the largest block of H.265, where the luma CTB size is N×N and the chroma CTB size is (N/2)×(N/2), where N=16, 32, or 64. CTUs consist of one luma CTB and two chroma CTBs. CTBs consist of coding blocks (CBs) which are square blocks in the quadtree. A luma CB is 8×8 or larger and a chroma CB is 4×4 or larger. A coding unit (CU) has one luma CB and two chroma CBs. To improve the prediction accuracy, a CB can be divided into multiple prediction blocks (PBs). Intra prediction uses the CBs and PBs as processing units. A luma or chroma CB can be divided into one, two, or four PBs that are rectangular but do not have to be a square. Prediction units (PUs) include the prediction syntax and the corresponding luma or chroma PUs. A CB can be divided into multiple transform blocks (TBs) of sizes from 4×4 up to 32×32 to minimize the residual errors of the transform coding process. TBs can span across multiple PB boundaries to increase the compression ratio. A transform unit (TU) contains the luma and chroma of the TBs.
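
The following sketch illustrates the recursive quadtree splitting idea for a luma CTB: a block is split into four equal sub-blocks whenever a simple detail measure (here, the sample value range) exceeds a threshold, down to the 8×8 minimum luma CB size mentioned above. The split criterion is an illustrative assumption; real HEVC encoders make this decision with rate-distortion optimization.

# Illustrative quadtree partitioning of a 64x64 CTB into coding blocks (CBs).
# A block is split into four quadrants while its sample range exceeds a
# threshold and it is larger than the minimum CB size (8x8 for luma).

def partition_ctb(samples, x, y, size, threshold=30, min_size=8):
    """Return a list of (x, y, size) coding blocks covering the region."""
    values = [samples[y + j][x + i] for j in range(size) for i in range(size)]
    if size > min_size and (max(values) - min(values)) > threshold:
        half = size // 2
        blocks = []
        for dy in (0, half):
            for dx in (0, half):
                blocks += partition_ctb(samples, x + dx, y + dy, half,
                                        threshold, min_size)
        return blocks
    return [(x, y, size)]

# A flat 64x64 CTB with one detailed 8x8 corner region.
ctb = [[128] * 64 for _ in range(64)]
for j in range(8):
    for i in range(8):
        ctb[j][i] = 128 + (i + j) * 10
print(len(partition_ctb(ctb, 0, 0, 64)))  # -> 10 (detail forces smaller CBs)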

The video image slices are also divided into tiles in H.265, so the video images can be processed in parallel using multiple threads. As mentioned above, H.265 was designed to be more efficient on parallel processing systems. The CTU rows can be processed in parallel by assigning them to multiple threads using the wavefront parallel processing (WPP) technology included in H.265.

Figure 5-10 presents an example of the H.265 CTB partitioning, where (a) shows the CTB partitioning in which the CB boundaries are represented with solid lines and the TB boundaries are represented with dotted lines and (b) presents the quadtree of the same CTB.


Figure 5-10

H.265 CTB Partitioning Example, Where (a) Shows the CTB Partitioning and (b) Presents the Quadtree of the Same CTB

Figure 5-11 presents the H.265 HEVC encoder structure and Figure 5-12 presents the H.265 HEVC decoder structure, where parts with major enhancements (compared to H.264) are marked in red. The encoder receives CTUs of the image as inputs to the system and converts the image frames into a compressed encoded bit stream (output). The decoder operates in the reverse order of the encoder to convert the H.265 compressed encoded bit stream back into video image frames.


Figure 5-11

H.265 HEVC Encoder Structure


Figure 5-12

H.265 HEVC Decoder Structure

H.265 uses many advanced techniques that make its video compression rate extremely high, such that the encoded file size using H.265 will be much smaller than the file size using H.264 encoding. Table 5-2 presents the average bit rate compression rate reduction obtained by H.265 compared to H.264 for the 480p, 720p, 1080p, and 2160p video profiles (https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding). The ITU-T standards for H.265 HEVC can be found at www.itu.int/rec/T-REC-H.265, and the conformance specifications H.265.1 can be found at www.itu.int/rec/T-REC-H.265.1.
Table 5-2 Average bitrate compression rate reduction obtained by H.265 compared to H.264

Video profile | 480p | 720p | 1080p | 2160p
H.265 bitrate reduction | 52% | 56% | 62% | 64%

The resulting smaller file size is very beneficial for storing and sending video files over the Internet, as a smaller file can be delivered more quickly while using less network bandwidth. In addition, H.265 profiles include improved video quality (e.g., reduced blocking artifacts and distortion, improved color and resolution) and support for videos on wider and more diverse screen types. However, H.265 requires much more processing compared to H.264, which is why many multimedia services still use H.264. There are many advanced technologies applied in H.265, a few of which are described in the following.

Large Transform Size (LTS): H.265 uses discrete cosine transform (DCT) and discrete sine transform (DST) with block sizes from 4×4 up to 32×32. This is much larger than the DCT block sizes of 4×4 and 8×8 used in H.264.

Internal Bit Depth Increase (IBDI): For each pixel of an image, the number of bits used to define its color is called the bit depth, color depth, or colour depth. H.265 uses IBDI to increase the bit depth so the rounding errors that occur in internal calculations can be reduced. IBDI is beneficial to the color quality representation and compression rate as well as helps to determine the output bit depth.

Sample Adaptive Offset (SAO): In-loop filtering is used to eliminate the blocking artifacts of the reconstructed picture. In H.264, only the deblocking filter (DF) was used, in a passive way. In H.265, in addition to using a DF, SAO is also applied in low complexity (LC) conditions. For H.265 high efficiency (HE) conditions, in addition to using DF and SAO, an adaptive loop filter (ALF) is also applied. For neighboring pixels near block edges that have large levels of distortion, a compensation offset is adaptively applied using the SAO filter. ALFs have been used in many video codecs, where the optimal coefficients are derived to minimize the difference between the original image and the decoded image from the compressed video file.
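
A much simplified sketch of the sample adaptive offset idea is shown below: each reconstructed sample is classified by comparing it with its two horizontal neighbors (a 1-D reduction of SAO edge offset), and a per-category offset is added. The categories, the single direction, and the offset values used here are illustrative assumptions; the standard defines several edge directions, band offsets, and encoder-signaled offset values.

# Simplified 1-D version of the SAO edge-offset idea: classify each sample by
# comparing it with its left and right neighbors, then add the offset for that
# category. Categories and offsets below are illustrative only.

def classify(left, cur, right):
    if cur < left and cur < right:
        return "local_min"      # valley
    if cur > left and cur > right:
        return "local_max"      # peak
    return "other"

def sao_edge_offset_1d(samples, offsets):
    out = list(samples)
    for i in range(1, len(samples) - 1):
        cat = classify(samples[i - 1], samples[i], samples[i + 1])
        out[i] = samples[i] + offsets.get(cat, 0)
    return out

recon = [100, 96, 104, 99, 101, 100]
offsets = {"local_min": +2, "local_max": -2}   # smooth peaks and valleys
print(sao_edge_offset_1d(recon, offsets))      # [100, 98, 102, 101, 99, 100]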

CABAC: CABAC is an entropy coding engine used in H.264 that had some throughput bottleneck issues that were improved in H.265. Entropy coding is used in H.264 and H.265 to compress syntax elements during the video encoding process final steps. The encoded and compressed syntax elements include motion vectors, prediction modes, and coefficients. Since entropy coding is used in the final steps of the encoding process, entropy decoding is used in the first step of the video decoding process. Many improvements in H.265 were made to CABAC, which include reduction in context coded bins, reduction in total bins, and grouping of bypass bins (www.researchgate.net/publication/261416853_A_comparison_of_CABAC_throughput_for_HEVCH265_vs_AVCH264).

H.265 profile version 1 was released in 2013, version 2 was released in 2014, version 3 was released in 2015, and version 4 was released in 2016. As new updates to the standards are continuously published, the reader is recommended to recheck all standard updates before making a profile selection, as only the basic profiles are listed in the following:
  • Version 1 (2013)
    • Main: For 8-bit colors per sample with 4:2:0 chroma sampling

    • Main 10: For 10-bit colors per sample

    • Main Still Picture: Allows for a single still picture to be encoded

  • Version 2 (2014)
    • Main 12: For 8 bits to 12 bits per sample with 4:0:0 and 4:2:0 chroma sampling

    • Main 4:4:4 Intra: Supports 4:0:0, 4:2:2, 4:4:4 chroma sampling. To be used by intra (I) frames only

    • High Throughput 4:4:4 16 Intra: For 8 bits to 16 bits per sample with 4:0:0, 4:2:0, 4:2:2, and 4:4:4 chroma sampling. To be used for high end professional content creation

    • Scalable Main: Appropriate for Internet streaming and broadcasting

    • Monochrome: For black and white video with 4:0:0 chroma sampling

    • Multiview (MV) Profile: For multiview video image synthesizing taken from multiple cameras based on various viewpoints

  • Version 3 (2015)
    • High Throughput 4:4:4: For 8 bits to 10 bits per sample. Provides a maximum bit rate six times higher than the Main 4:4:4 profile

    • 3D Main: For three-dimensional video

    • Screen-Extended Main: Suitable for graphic and animation video

    • Screen-Extended High Throughput 4:4:4: Combines the characteristics of screen-extended main profile and high throughput profile

    • Scalable Monochrome: Combines monochrome profile for black and white video and scalable profile for streaming

  • Version 4 (2016)
    • Screen Content Coding (SCC) Extensions: For screen content video which contains text and graphics, adaptive color transform, and adaptive motion vector transform

    • Scalable Range Extensions: Scalable Monochrome profiles, Scalable Main 4:4:4

H.266 VVC Video Technology

This section describes the key parts of the H.266 video codec (https://doi.org/10.1007/978-3-030-62124-7_12). H.266 Versatile Video Coding (VVC) is the successor of H.265 HEVC. VVC was created by the Joint Video Experts Team (JVET), composed of MPEG and VCEG. H.266 provides support for diverse video resolutions from SD and HD up to 4K, 8K, and 16K, as well as high dynamic range (HDR) video. H.266 is extremely effective in video compression while preserving the video’s original quality. In comparison with H.265, the encoding complexity of H.266 is expected to be about ten times higher, and the decoding complexity is expected to be about two times higher. The ITU-T standards for H.266 VVC can be found at www.itu.int/rec/T-REC-H.266-202204-I/en.

The core structure of the H.266 VVC system uses the block-based hybrid video coding process, transport interface, parameter sets, and network abstraction layer (NAL) unit based bitstream structure that are similar to H.264 AVC and H.265 HEVC. Among the many enhancements included in H.266 VVC (compared to H.264 AVC and H.265 HEVC), the essential ones are summarized in the following and marked in red in Figure 5-13, which presents the H.266 VVC encoder block diagram. Some of the H.266 VVC encoder differences compared to the H.265 HEVC encoder include improved luma mapping with chroma scaling, a combined inter-prediction and intra-picture prediction option, and advanced in-loop filters (https://ieeexplore.ieee.org/document/9503377). The corresponding new parts in the H.266 decoder are marked in red in Figure 5-14.

The main H.266 codec functions include block partitioning, intra-picture prediction, inter-picture prediction, transforms and quantization, entropy coding, in-loop filters, and screen content coding tools. These seven enhancement areas of H.266 are summarized in the following, including the three major encoder improvement technologies (i.e., luma mapping with chroma scaling, the combined inter-prediction and intra-picture prediction option, and advanced in-loop filters).

Block Partitioning: H.266 can conduct more flexible partitioning while supporting larger block sizes than in H.265 because the H.266 CTU quadtree partitioning has been extended. As a result, H.266 supports recursive non-square splits and separate partitioning for luma and chroma for video encoding enhancements. H.266 block partitioning technology includes use of Quadtree Plus Multi-Type Tree (QT+MTT), Chroma Separate Tree (CST), and Virtual Pipeline Data Units (VPDUs).

Intra-picture Prediction: H.266 uses advanced intra-picture prediction techniques based on DC mode and planar modes by using 93 angles in angular prediction (H.265 uses 33 angles). In addition, H.266 uses more advanced luma matrix-based prediction modes and chroma cross-component prediction modes. H.266 intra-picture prediction technology includes use of finer-granularity angular prediction, wide-angle intra prediction (WAIP), four-tap fractional sample interpolation filters, position-dependent prediction combination (PDPC), multiple reference lines (MRL), matrix-based intra-picture prediction (MIP), intra sub-partition (ISP) mode, cross-component linear model (CCLM), and extended most probable mode (MPM) signaling.

Inter-picture Prediction: H.266 inter-picture prediction is improved by the following four factors. First, H.266 includes many new coding tools to improve the representation efficiency, provide more accurate prediction and coding of motion compensation control information, and enhance the motion compensation process. H.266 inter-picture prediction is improved due to more advanced coding motion information by using history-based MV prediction (HMVP), symmetric MVD (SMVD), adaptive MV resolution (AMVR), pairwise average MV merge candidate, and merge with MVD (MMVD) technologies. Second, the CU level motion compensation is enhanced in H.266 by using more flexible weights on prediction signals and by including a combination of inter-picture prediction and intra-picture prediction with signaling of biprediction weights at the CU level. By adding the ability to predict non-rectangular partitions inside a CU by applying weighting matrices on prediction signals, the motion compensation performance is enhanced. CU level motion compensation technologies include geometric partitioning mode (GPM), combined intra-/inter-picture prediction (CIIP), and biprediction with CU-level weights (BCW). Third, H.266 uses refined subblock-based motion compensation which helps in motion representation with higher accuracy. The essential techniques include subblock-based temporal MV prediction (SBTMVP), prediction refinement with optical flow (PROF), decoder MV refinement (DMVR), bidirectional optical flow (BDOF), and affine motion which can help to reduce high-order deformation due to non-translational motion (e.g., zooming and rotation) as well as improve representation of translational motion between the reference picture and other pictures. Fourth, H.266 uses horizontal wraparound motion compensation to provide improved support for specific immersive video projection formats.

Transforms and Quantization: H.266 applies an integer transform to the prediction residual and then quantizes the transform coefficients using the following tools: larger and non-square transforms, multiple transform selection (MTS), the low-frequency non-separable transform (LFNST), the subblock transform (SBT) mode, extended quantization control, adaptive chroma QP offset, dependent quantization (DQ), and joint coding of chroma residuals (JCCR).
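A minimal sketch of the transform-and-quantize step on a residual block follows, using a floating-point DCT-II as a stand-in for the integer DCT/DST transform cores and plain uniform quantization instead of dependent quantization; block size and quantization step are arbitrary illustrative values.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, so the inverse transform is simply the transpose."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def transform_quantize(residual, qstep):
    d = dct_matrix(residual.shape[0])
    coeffs = d @ residual @ d.T              # 2D separable transform
    return np.round(coeffs / qstep)          # uniform scalar quantization

def dequantize_inverse(levels, qstep):
    d = dct_matrix(levels.shape[0])
    return d.T @ (levels * qstep) @ d        # inverse quantization and inverse transform

residual = np.array([[4.0, 3.0, 2.0, 1.0]] * 4)
levels = transform_quantize(residual, qstep=2.0)
print(np.round(dequantize_inverse(levels, qstep=2.0), 1))
```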

Entropy Coding: The entropy coder used in H.266 is CABAC, enhanced with improved coefficient coding and high-accuracy multi-hypothesis probability estimation.
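The sketch below illustrates the multi-hypothesis probability estimation idea: two estimates with different adaptation rates (a fast and a slow window) are kept per context and combined. The window shifts and floating-point state used here are illustrative assumptions, not the standard's fixed-point state representation.

```python
# Minimal sketch of adaptive binary probability estimation behind CABAC;
# window sizes and the averaging rule are illustrative, not normative.

class BinProbModel:
    def __init__(self, p_init=0.5, shift_fast=4, shift_slow=7):
        self.p_fast = p_init          # quickly adapting estimate
        self.p_slow = p_init          # slowly adapting estimate
        self.shift_fast = shift_fast
        self.shift_slow = shift_slow

    def prob_one(self):
        """Combined probability estimate that the next bin equals 1."""
        return 0.5 * (self.p_fast + self.p_slow)

    def update(self, bin_value):
        """Move both estimates toward the observed bin value."""
        target = 1.0 if bin_value else 0.0
        self.p_fast += (target - self.p_fast) / (1 << self.shift_fast)
        self.p_slow += (target - self.p_slow) / (1 << self.shift_slow)

model = BinProbModel()
for b in (1, 1, 0, 1, 1, 1, 0, 1):
    model.update(b)
print(round(model.prob_one(), 3))      # drifts above 0.5 toward the observed bias
```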

In-Loop Filters: The in-loop filters in H.266 that are applied to the reconstructed video signal include luma mapping with chroma scaling (LMCS), long deblocking filters, luma-adaptive deblocking, adaptive loop filter (ALF), and cross-component ALF (CC-ALF).
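The following is a minimal sketch of the luma-mapping half of LMCS: luma code values are forward-mapped through a piecewise-linear curve so the dynamic range is used more evenly, and the mapping is inverted on the decoder side. The 4-segment curve and 8-bit range are made-up illustrative values; the standard derives and signals a finer piecewise-linear model from the content.

```python
import numpy as np

def piecewise_lut(x_points, y_points, max_val=255):
    """Build a lookup table for a monotonic piecewise-linear mapping."""
    xs = np.arange(max_val + 1)
    return np.round(np.interp(xs, x_points, y_points)).astype(np.int32)

# Illustrative 4-segment forward curve and its inverse (swap the breakpoints)
forward_lut = piecewise_lut([0, 64, 128, 192, 255], [0, 40, 128, 216, 255])
inverse_lut = piecewise_lut([0, 40, 128, 216, 255], [0, 64, 128, 192, 255])

luma = np.array([10, 64, 100, 200, 250])
mapped = forward_lut[luma]        # encoder-side forward mapping
restored = inverse_lut[mapped]    # decoder-side inverse mapping
print(mapped, restored)
```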

Screen Content Coding Tools: H.266 includes screen content coding tools that increase the coding efficiency for content other than camera-captured video, such as screen sharing and gaming applications. Applied technologies include palette mode, adaptive color transform (ACT), intra-picture block copy (IBC), block-based differential pulse-code modulation (BDPCM), and transform skip residual coding (TSRC).
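As a minimal sketch of the palette-mode idea, a block containing only a few distinct sample values can be represented by a small palette plus an index map; escape samples, palette prediction from neighboring blocks, and index run-length coding are omitted.

```python
import numpy as np

def palette_encode(block, max_palette=8):
    """Build a palette of the most frequent values and map each sample to an index."""
    values, counts = np.unique(block, return_counts=True)
    palette = values[np.argsort(-counts)][:max_palette]
    indices = np.abs(block[..., None] - palette[None, None, :]).argmin(axis=-1)
    return palette, indices

def palette_decode(palette, indices):
    return palette[indices]

block = np.array([[0,   0,   255, 255],
                  [0,   0,   255, 255],
                  [128, 128, 0,   0],
                  [128, 128, 0,   0]])
palette, indices = palette_encode(block)
print(palette)
print(np.array_equal(palette_decode(palette, indices), block))   # lossless in this example
```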

A block diagram of the H.266 VVC encoder. It shows how the input video is encoded through luma mapping, coder control, chroma scaling, and filter control analysis.

Figure 5-13

H.266 VVC Encoder Structure

A block diagram of the H.266 VVC decoder. It shows how the encoded bit stream is decoded via chroma residual scaling, inverse luma mapping, and in-loop adaptive filtering.

Figure 5-14

H.266 VVC Decoder Structure

Figure 5-14 presents a simplified H.266 VVC decoder structure. The decoder processes the H.266 encoded bit stream in the reverse order of the encoder. CTUs consist of one CTB if the video signal is monochrome and three CTBs if the video signal is based on three color components. H.266 decoding begins with entropy decoding using CABAC. Next, inverse quantization and inverse transformation are conducted, outputting the decoded residual. After luma mapping with chroma scaling (LMCS), the residuals are combined with the prediction signals, which are based on intra-picture prediction, inter-picture prediction, and combined inter-/intra-picture prediction (CIIP). LMCS increases coding efficiency by remapping the luma and rescaling the chroma so that the signal range is used more effectively. In the decoder, LMCS-based inverse luma mapping is conducted before deblocking because the deblocking filter was designed around subjective criteria in the original sample domain. Blocking artifacts are minimized by the deblocking filter. Ringing artifacts are then reduced, and local average intensity variations are corrected, by the sample adaptive offset (SAO) process. Next, adaptive loop filtering (ALF) and cross-component ALF (CC-ALF) correct the signal through linear and adaptive filtering. Compared to H.265, H.266 adds the ALF, CC-ALF, and LMCS processes, which are its new in-loop filtering technologies (https://ieeexplore.ieee.org/document/9399506).
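The stage ordering described above can be sketched as a simple pipeline. Every stage below is an identity stub standing in for the real process; the point is only the sequence: entropy decoding, inverse quantization and transform, reconstruction with the prediction signal, inverse luma mapping (LMCS), deblocking, SAO, and then ALF/CC-ALF.

```python
import numpy as np

# Identity stubs representing the decoder stages (illustrative placeholders only)
def entropy_decode(bits):            return bits      # CABAC in the real decoder
def inverse_quantize(levels):        return levels
def inverse_transform(coeffs):       return coeffs
def inverse_luma_mapping(samples):   return samples   # LMCS inverse mapping
def deblocking_filter(samples):      return samples
def sample_adaptive_offset(samples): return samples
def adaptive_loop_filter(samples):   return samples   # ALF and CC-ALF

def decode_block(coded_levels, prediction):
    residual = inverse_transform(inverse_quantize(entropy_decode(coded_levels)))
    recon = prediction + residual                      # intra/inter/CIIP prediction + residual
    for stage in (inverse_luma_mapping, deblocking_filter,
                  sample_adaptive_offset, adaptive_loop_filter):
        recon = stage(recon)
    return recon

print(decode_block(np.ones((4, 4)), np.full((4, 4), 100.0)))
```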

Holography Technology

Holographic displays control light diffraction to create a virtual 3D object using a special display system (e.g., a special glass display, 3D monitor, particle chamber, projection device, etc.), where the generated 3D image is called a hologram. Due to this characteristic, holographic displays differ from conventional 3D displays (https://en.wikipedia.org/wiki/Holographic_display). There are many methods to generate a hologram, but currently there are few commercial products that can be used for video services.

There are AR systems that can display holographic images, but further technical development and standardization of holographic display technology are needed before it can be incorporated into successful consumer electronics (CE) products.

Hologram Generation Technologies

Hologram generation methods include laser transmission, electroholographic displays, parallax techniques, microelectromechanical system (MEMS) methods, and laser plasma displays, which are described in the following, along with holographic television and touchable holograms.

Laser Transmission Holograms

Lasers are commonly used as the coherent light source of the hologram, and two laser beams are typically used for hologram generation. A hologram is formed of numerous points located in a 3D domain that have different illumination colors and brightness levels. At each of these 3D points, the two laser beams intersect to generate the modeled interference pattern, which becomes the holographic 3D image. As presented in Figure 5-15, the laser beam is divided into an object beam and a reference beam (using mirrors and lenses), where the object beam expands through a lens and is used to illuminate the modeled 3D object. By illuminating the hologram plate with the same reference beam, the modeled object's wavefront is reconstructed. The hologram is reconstructed through diffraction of the laser beam at the modeled object's 3D formation points contained in the interference pattern (www.researchgate.net/figure/How-reflection-hologram-works-B-Transmission-holograms-Transmission-holograms-Fig-2_fig1_336071507).

A diagram showing how the user sees the hologram image, tracing the path from the laser through the shutter, beam splitter, mirrors, and diverging lenses to the reference beam and the object beam reflected from the real object.

Figure 5-15

Laser Transmission Hologram Generation Example
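The recording step can be illustrated numerically: the intensity pattern stored on the plate is the interference of the object beam (here, a spherical wave from a single object point) with a tilted plane reference beam. The wavelength, distances, tilt angle, and grid size below are arbitrary illustrative values, not parameters from the figure.

```python
import numpy as np

wavelength = 633e-9                 # example: 633 nm helium-neon laser
k = 2 * np.pi / wavelength
n = 512
extent = 2e-3                       # 2 mm x 2 mm hologram plate
xs = np.linspace(-extent / 2, extent / 2, n)
X, Y = np.meshgrid(xs, xs)

# Object beam: spherical wave from an object point 5 cm behind the plate
z_obj = 0.05
r = np.sqrt(X**2 + Y**2 + z_obj**2)
object_beam = np.exp(1j * k * r) / (r / z_obj)      # amplitude normalized near 1

# Reference beam: plane wave arriving at a 2 degree tilt
theta = np.deg2rad(2.0)
reference_beam = np.exp(1j * k * np.sin(theta) * X)

# Recorded interference pattern (the fringes that later diffract the
# reference beam to reconstruct the object wavefront)
intensity = np.abs(object_beam + reference_beam) ** 2
print(intensity.shape, float(intensity.min()), float(intensity.max()))
```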

Electroholography Holograms

Electroholographic displays use electromagnetic resonators to create the holographic image on a red-green-blue (RGB) color laser monitor. The digital hologram image file is sent to the electromagnetic resonator, which generates a signal that enters an acousto-optic modulator, where it is converted into a signal that can be displayed on the color laser monitor. Please note that the actual system operations of electroholographic displays are much more complex, and this is only an abstract description. Electroholographic technology is known to provide higher picture accuracy and a wider range of displayable colors.

Parallax Holograms

Many holographic displays work well when the user is within a certain viewing angle but may show a distorted image when viewed from other angles. Full parallax holography has the benefit of generating a holographic image that presents the correct perspective of a scene from any viewing angle. Full parallax systems achieve this by displaying optical information in both the X and Y directions, so every viewer receives the proper perspective of the scene regardless of viewing position. However, full parallax systems require significant processing and are very complex and expensive, so reduced versions have been proposed: horizontal parallax only (HPO) and vertical parallax only (VPO) displays. HPO and VPO systems present parallax information in a single direction; they perform well for a user within the proper viewing angle but show distortion beyond this range. Considering that human eyes are positioned side by side, horizontal-mode HPO displays are commonly used. HPO and VPO systems require much less processing, can be implemented with much less complexity, and may result in less expensive CE products compared to full parallax systems.

MEMS Holograms

MEMS is a micrometer-level (10^-6 m = μm = micron) miniature integrated system consisting of electronic and mechanical components. It is worth noting that the smallest object recognizable to the naked human eye is in the range of 55~75 microns and that a single human hair has an average diameter of 20~40 microns (https://en.wikipedia.org/wiki/Naked_eye). It is therefore easy to see how MEMS devices with mechanically controllable reflectors/mirrors can be used to create a holographic image. For example, the piston holographic display uses MEMS-based micropistons with reflectors/mirrors attached at each pixel position in the display, so a sharp 3D holographic image can be generated.

The first MEMS-based micromagnetic piston display was invented in 2011 by IMEC in Belgium, and a successful commercial product release is expected in the future. The MEMS microscopic pistons located at each pixel position are controlled to change the light/laser reflection that forms the hologram image's shape, color, brightness, and texture. Possible limitations of micromagnetic piston displays include high complexity and cost, as each pixel needs to be a controllable MEMS device. This may also make system repair challenging when needed.

Laser Plasma Holograms

Laser plasma displays can generate a hologram image in thin air, without using a screen or refractive display medium, but the resolution and picture quality of the hologram are relatively low. Laser plasma technology was invented in 2005 at the University of Texas using multiple focused lasers that are powerful enough to create plasma excitation of oxygen and nitrogen molecules in the air at the modeled locations. Laser plasma-based holograms are very bright and visible, but details of the generated object may be difficult to express due to the low resolution.

Holographic Television

Although the technical implementation details differ, a holographic television display combines full parallax holography and laser transmission holography in a unique, advanced way. The first holographic television display was invented in 2013 at MIT by Michael Bove. In this system, a Microsoft Kinect camera was used to detect the 3D environment and objects, and multiple laser diodes were used to generate a 3D hologram viewable from 360 degrees. Holographic television display CE products are expected to be released in the future.

Touchable Holograms

Touchable holograms add touch sensors and response technology so that the act of touching results in a response in the displayed hologram. In addition, some systems use ultrasonic air blasts to give the user's hand a haptic feeling of its interaction with the hologram. Touchable holograms first originated in Japan and were further improved by Intel in the United States. Using a touchable hologram, keyboards and buttons can be implemented as virtual holograms without any physical interface hardware. Especially in times like the COVID-19 pandemic, hologram keyboards can help prevent the spread of disease, as there is no physically shared hardware between users. Expectations for future touchable holograms are significant, as many examples have been demonstrated in sci-fi movies, including Star Wars and Star Trek.

Summary

This chapter provided details of the video technologies used in metaverse XR devices (e.g., Meta Quest 2 and Microsoft HoloLens 2) and popular OTT multimedia services (e.g., Netflix, Disney+, and YouTube), which include the H.264 Advanced Video Coding (AVC), H.265 High Efficiency Video Coding (HEVC), and H.266 Versatile Video Coding (VVC) standards. In addition, holography technologies were described. H.264, H.265, and H.266 are used in the recording, compression, and distribution of a diverse range of video profiles. Their high compression ratios make video files small enough for efficient storage and network transfer. Chapter 6 introduces the Moving Picture Experts Group Dynamic Adaptive Streaming over HTTP (MPEG-DASH) technology, along with its Media Presentation Description (MPD) and Group of Pictures (GOP) structures. In addition, a YouTube MPEG-DASH operation example and the MPEG-DASH features are described.
