CHAPTER 14
H.264

H.264 has entered into the public consciousness only in the last few years, but the effort actually started back in 1997, as H.263+ was finishing up. At that point, it was becoming clear to some that adding new layers on top of H.263 was running out of steam, yielding bigger increases in complexity than in efficiency. What was then called “H.26L” was formally launched in the ITU’s Video Coding Experts Group (VCEG) in 1999, chaired by my now-colleague Gary Sullivan.

By eschewing the “let’s enhance H.263 again!” trend and building a new codec from the ground up, the group was free to consider all kinds of proposed tools to improve compression efficiency, even if they broke the traditional models of codec design. The goal was to at least double the efficiency of other standard codecs.

As it became clear that MPEG-4 Part 2 wasn’t scaling as well as hoped, MPEG put out a call for proposals for new codecs. H.26L was proposed, and the ITU and MPEG agreed to collaborate on its development, forming the Joint Video Team (JVT).

The H.264 spec was delivered in 2003, with the Fidelity Range Extensions (FRExt), including the broadly used High Profile, coming in 2004.

The effort paid off, and H.264 is clearly the most efficient codec on the market, particularly as the bits per pixel drop down to very low levels. However, there’s a price to pay: decoding the more complex profiles of H.264 requires more MIPS in software or more silicon in hardware than past codecs did. Still, even the Baseline profile offered notable efficiency improvements over older complexity-constrained codecs, and has quickly taken over the market for lower-power media players.

The Many Names of H.264

The codec now called H.264 has been known by many other names. There are people who have political feelings about one name or another. The names you might hear or have heard it called include the following:

•  H.264. This is the ITU designation, following H.261, H.262 (aka MPEG-2), and H.263 (aka MPEG-4 part 2 short header). It has clearly emerged as the most commonly used name for the codec, probably because that’s what Apple called it in QuickTime.

•  MPEG-4 part 10. This is the MPEG designation, from the section of the MPEG-4 specification where it is defined.

•  Advanced Video Coding (AVC). This has been proposed as a politically neutral alternative to H.264 and MPEG-4 part 10. However, it has disambiguation issues with other AVC acronyms.

•  H.26L. This was the working name for H.264.

•  JVT. The codec was designed by the Joint Video Team, composed of codec experts from MPEG and the ITU.

Why H.264?

Compression Efficiency

H.264’s biggest draw has been its industry-leading compression efficiency. Its novel design enabled the bitrate requirements for common delivery scenarios to drop substantially compared to past codecs as H.264 encoders matured.

While we’ll certainly have even better codecs in the next few years, right now H.264 High Profile is the undisputed king of low-bitrate quality, with plenty of room for further optimization in current profiles.

Ubiquity

Nothing begets success like success. With a big technical advantage over MPEG-2 and MPEG-4 part 2, H.264 has quickly been added to new products that used the older technologies. It’s an ongoing process, but H.264 is certainly becoming a given in any consumer electronics device that does any form of video playback.

And since it’s an international standard, everyone has equal access to the specification and licensing, and can compete on implementation while participating in a big interoperable ecosystem.

Why Not H.264?

Decoder Performance

The biggest downside to H.264, particularly when using the more advanced settings of the more advanced profiles, is decoder performance. It’s no big deal when GPU acceleration or an ASIC decoder is available, but in pure software, H.264 can require 2–4x the MIPS/pixel of simpler codecs.

Of course, there are optional features like CABAC (defined later in this chapter) that can be turned off to improve decoder performance, albeit with a reduction in compression efficiency.

Older Windows Out of the Box

Windows only started shipping with H.264 in-box with Windows 7, and even then not in the very low end Starter and Home Basic editions. For content targeting Windows machines, WMV is the only format that absolutely will play on every one, particularly on slow-to-update corporate desktops. Flash or Silverlight are likely to be installed as well, but they can’t play a double-clicked media file.

Profile Support

While H.264 High Profile is a clear efficiency winner, that’s not the profile supported everywhere. In particular, many devices support both H.264 Baseline and VC-1 Main, and in that matchup VC-1 can offer superior detail retention at moderate bitrates.

Licensing Costs

Lastly, there has been some concern about MPEG-LA licensing fees. I am not a lawyer, but my reading of the published fees is that anyone with a business big enough to be impacted would have a business big enough to pay the fees. And these are nothing new; MPEG-2 and MPEG-4 part 2 had licensing fees, and VC-1 has similar ones today. However, this is the first intersection of the web world with standards-based codecs, so it’s new to a lot of folks. Concerns about H.264 patent licensing have been the major source of interest in the Theora and Dirac codecs.

What follows is my personal summary of a PDF from MPEG-LA’s web site; if you have legal questions about this, seek real legal advice.

The agreements renew every five years, with prices rising no more than 10 percent at each renewal.

Branded encoders and decoders (Built-in decoders are covered by Apple and Microsoft)

•  Up to 100 K/year: No fee

•  $0.20 per unit more than 100 K/year up to 5 M

•  $0.10 per unit over 5 M

•  Maximum fee per company per year: $5 M
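As a rough illustration of the per-unit tiers above (my own sketch of the arithmetic, not legal advice and not MPEG-LA’s wording):

```python
def encoder_unit_royalty(units):
    """Sketch of the branded encoder/decoder royalty tiers described
    above (illustrative only): free to 100K units, $0.20/unit to 5M,
    $0.10/unit beyond, capped at $5M per company per year."""
    CAP = 5_000_000  # $5M annual cap per company
    fee = 0.0
    if units > 100_000:
        # $0.20 per unit from 100K up to 5M units
        fee += 0.20 * (min(units, 5_000_000) - 100_000)
    if units > 5_000_000:
        # $0.10 per unit beyond 5M units
        fee += 0.10 * (units - 5_000_000)
    return min(fee, CAP)

print(encoder_unit_royalty(100_000))     # → 0.0 (under the free tier)
print(encoder_unit_royalty(1_000_000))   # → 180000.0 (900K units x $0.20)
```

At 10 million units the tiers yield $1.48M, and the $5M cap only kicks in for truly enormous shippers.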

Video content and service providers

End user pays directly on a title-by-title basis

•  No royalty for titles of 12 minutes or less

•  The lesser of $0.02 or 2 percent of price paid to licensee per title over 12 minutes

End user pays on subscription basis (yearly fee)

•  Fewer than 100 K subscribers per year: No fee

•  100 K to 250 K subscribers per year: $25 K

•  250 K to 500 K subscribers per year: $50 K

•  500 K to 1 M subscribers per year: $75 K

•  More than 1 M subscribers per year: $100 K

Remuneration from other sources, like free TV, not paid by the end user (yearly fee); may apply to Internet broadcasting in 2010 and beyond

•  Choice of $2500 per broadcast encoder, or

•  Broadcast market under 100 K: No fee

•  Broadcast market between 100 K and 500 K: $2500

•  Broadcast market between 500 K and 1 M: $5000

•  Broadcast market more than 1 M: $10 K

Annual cap for service provider enterprises

•  $4.25 M in 2009

•  $5 M in 2010

What’s Unique About H.264?

If VC-1 can be thought of as the culmination of the H.261/MPEG-1/MPEG-2/H.263 lineage, keeping what worked (IBP structure, 8 × 8 as the “default” block size, field/frame adaptive interlace coding) and addressing what didn’t (using an integer transform, adding in-loop deblocking, offering more flexibility and simplicity in block types per frame type), H.264 changed many of the fundamentals of how codecs work, from DCT and blocks all the way up to I-, B-, and P-frames.

4 × 4 blocks

Perhaps the most startling thing about H.264 was replacing the long-standard 8 × 8 block size with the much smaller 4 × 4 (although 8 × 8 was added later as an option in High Profile). With smaller block sizes, ringing and blocking artifacts from high QP blocks are smaller and hence less noticeable. Plus, edges go through fewer blocks, and so fewer pixels need higher bitrates to encode well.

4 × 4 blocks were designed along with the in-loop deblocking filter, so that the loss of efficiency from having only 16 samples instead of 64 samples per block had less of an impact, reducing the visual and predictive impact of higher quantization.

The return of 8 × 8 blocks

While 4 × 4 blocks worked well for the lower resolutions that H.264 was originally tested at, they didn’t scale as well to larger frame sizes, particularly with film content, where there can be large areas that are relatively flat except for grain texture. In the initial DVD Forum testing to select an HD codec for HD DVD, several MPEG-2 and early H.264 encoders were tested along with VC-1, and to broad surprise, VC-1 won the quality shootout, followed by MPEG-2, with the H.264 encoders in last place. That result likely had more to do with the relative refinement of each codec’s implementation. Before long, the High Profiles were launched, including the optional 8 × 8 block type, and H.264 became a more competitive HD codec.

Strong In-Loop Deblocking

Probably the single most valuable feature added in H.264 is its strong, adaptive in-loop deblocking filter. The goal of in-loop deblocking is obvious; soften the edges between highly quantized blocks in order to remove the blocking artifacts. And not just in postprocessing, but on reference frames, keeping block artifacts from propagating forward.

But H.264 was the first to allow the filter to be so strong (in terms of how much it softened) and broad (in terms of how many pixels it touches) as to really be able to eliminate blocking artifacts outright. As you crank up the compression with H.264, the image gets softer and softer, but doesn’t get the classic DCT sharp-edged blocks. Some H.264 encoders rely on this too much, in my opinion; it allows some sloppiness in encodes, throwing away more detail than necessary. Even if viewers don’t realize it’s due to compression, not a soft source, it’s still throwing away important visual information. But done correctly, this makes H.264 less vulnerable to QP spikes and distractingly poor quality, particularly with 1-pass CBR encoding.

The downside of in-loop deblocking is that it can also soften the image somewhat, even at moderate QPs. Fortunately, it can be tuned down or even turned off in many encoders. Typically Blu-ray H.264 encodes won’t use in-loop deblocking for most scenes, reserving it for those high-motion/complexity scenes that would otherwise have artifacts (see Figures 14.1 and 14.2).

Figure 14.1 An H.264 encode with in-loop deblocking off.

image

Figure 14.2 Same source with in-loop deblocking on. Deblocking not only reduces blocking, but improves overall image quality by improving reference frames.

image

Variable Block-Size Motion Compensation

H.264 supports a wide variety of block sizes for motion compensation, compared to the simple 16 × 16 and 16 × 8 options in MPEG-1/2. Each macroblock can use 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, or 4 × 4 partitions.

Quarter-Pixel Motion Precision

H.264 has a more efficient implementation of quarter-pixel motion estimation than MPEG-4 part 2’s, delivering a good efficiency improvement.

Multiple Reference Frames

In the classic codecs, there would be at most two reference frames at a time: a B-frame would reference the previous and next I- or P-frame. H.264 radically expanded the concept of a reference frame by allowing any frame to have up to sixteen reference frames it could access. And that goes for P-frames, not just B-frames. Essentially, the decoder can be required to cache up to 16 past frames, any or all of which can be referenced in a motion vector to construct a future frame.

H.264 doesn’t require that decoded and displayed order match either. It’s perfectly legal for a P-frame to reference a later P-frame in display order, although it’d have to come first in encode order.

And frames don’t even need to be all P or B. Each frame can be made up of I-, P-, and B-slices (see Figure 14.3). What we think of as an I-frame in most codecs is called an IDR in H.264—Instantaneous Decoder Refresh. All frames that reference each other must reference the same single IDR frame. Non-IDR I-frames are also allowed in H.264 (like an Open GOP to an IDR’s Closed GOP).

Figure 14.3 One possible structure of slices in a four-slice encode.

image

While awesome-sounding, and awesomely complex-sounding, it’s not as dramatically hard to implement or as dramatically useful as it appears at first blush. The main value of multiple reference frames is when a part of the image is occluded (covered) in the previous reference frame, but visible in a frame before that. Imagine the classic shot through a spinning fan. At any given point, the fan’s blades are covering a third of the screen. So frame 100 might include a part of the image obscured in frames 99 and 98, but visible in frame 97. If the encoder were using three reference frames, frame 100 could pull a motion vector from frame 97, instead of having to use intra blocks to encode that as a new part of the image.

The number of allowed reference frames is constrained by a maximum number of macroblocks for the given profile@level. Thus smaller frame sizes can support more reference frames than ones near the max capacity of the level. That said, it’s rare for the full 16 to be materially better than 3–4 for most film/video content. Cel animation or motion graphics can take better advantage of lots of reference frames.
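That macroblock budget is easy to sketch. The decoded picture buffer limit of 32,768 macroblocks for Level 4.1 comes from the spec’s level table (check the spec for other levels); the function itself is just my illustration of the constraint:

```python
import math

def max_ref_frames(width, height, max_dpb_mbs):
    """How many reference frames fit in the decoded picture buffer
    for a given frame size (sketch of the profile@level constraint)."""
    mbs_per_frame = math.ceil(width / 16) * math.ceil(height / 16)
    return min(16, max_dpb_mbs // mbs_per_frame)

# Level 4.1 allows 32,768 macroblocks in the DPB (per the spec's level table)
print(max_ref_frames(1920, 1088, 32768))  # → 4: 1080p is near the level's capacity
print(max_ref_frames(640, 360, 32768))    # → 16: small frames get the full count
```

This is why a 1080p Blu-ray encode tops out at four reference frames, while a small web encode at the same level could legally use all sixteen.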

All of this information can seem conceptually confusing, but the details aren’t really exposed to a compressionist. Normally, the choices are about how many reference frames to use. The rest follows from that.

While software decoders typically don’t enforce limits on maximum reference frames, hardware decoders very often will. This includes GPU decoder technologies like DXVA in Windows. So stick to the constraints of your target profile@level.

Pyramid B-Frames

Pyramid B-frames are another concept that can blow the mind of a classical compressionist.

It’s long been a matter of definition that a B-frame is never a reference frame, and is discarded as soon as it’s displayed. Thus a B-frame only needs as many bits as it takes to look decent, without any concern for its use as a reference frame.

But with pyramid B-frames, a B-frame can be a reference frame for another layer of B-frames. And that layer itself could be used as a reference for yet another layer of B-frames.

Typically the first tier of B-frames is written as “B” while a second tier that can reference those is written as “b.” So instead of the classic IBBPBBPBBP pattern, you could get IbBbPbBbPbBbP, with the “B” frames based on I- and P-frames, and the “b” frames referencing the I-, P-, and B-frames (see Figure 14.4).

Figure 14.4 Pyramid B-frame structure.

image
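The “pyramid” falls out of a simple hierarchy: between two already-decoded anchors, the middle “B” is decoded first, then the “b” frames that lean on it. A minimal sketch of that scheduling (my own hypothetical helper, not any particular encoder’s logic):

```python
def pyramid_decode_order(lo, hi, order=None):
    """Recursively schedule the B-frame pyramid between two anchor
    frames at display positions lo and hi (both already decoded):
    the middle "B" comes first, then each half's "b" frames."""
    if order is None:
        order = []
    if hi - lo < 2:
        return order  # no frames left between the anchors
    mid = (lo + hi) // 2
    order.append(mid)                     # the "B" tier
    pyramid_decode_order(lo, mid, order)  # "b" frames before it
    pyramid_decode_order(mid, hi, order)  # "b" frames after it
    return order

# Anchors at display positions 0 (I) and 4 (P); frames 1-3 form the pyramid
print(pyramid_decode_order(0, 4))  # → [2, 1, 3]: "B" first, then the two "b"s
```

Note how decode order diverges from display order, which is exactly the reordering flexibility described earlier in this chapter.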

These complex structures, when coupled with an encoder that can dynamically adjust its pattern to the content, can further improve efficiency of encoding with motion graphics, cel animation, rapid or complex editing patterns, and lots of flash frames. It doesn’t make that much of a difference with regular content, but is still worthwhile.

Weighted Prediction

Weighted prediction allows for whole-frame adjustments in prediction. This is mainly useful for efficient encoding of fades, both to/from a flat color and cross-fades. Coupling weighted prediction with multiple reference frames and B-frames can finally make that kind of challenging content encode well.

Logarithmic Quantization Scale

The scaling factor in H.264’s quantizer isn’t linear as in other codecs, but logarithmic, so going up one QP has less visual impact at the lower end of the scale. While QP 10 is typically pretty ugly in MPEG-2 or VC-1, QP 10 in H.264 is around the threshold of visually lossless.
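Concretely, H.264’s quantizer step size doubles every 6 QP steps. The 0.625 base value at QP 0 is an approximation for illustration; the spec tabulates exact per-QP values:

```python
def approx_qstep(qp):
    """Approximate H.264 quantizer step size: doubles every 6 QP.
    (0.625 at QP 0 is approximate; the spec defines exact values.)"""
    return 0.625 * 2 ** (qp / 6)

# Each +6 QP doubles the step, so low QPs are very finely spaced
print(round(approx_qstep(10) / approx_qstep(4), 2))   # → 2.0
print(round(approx_qstep(51) / approx_qstep(10), 1))  # ~114x coarser at max QP
```

That geometric spacing is why H.264’s QP 10 sits near visually lossless while the same number is ugly in a codec with a linear scale.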

Flexible Interlaced Coding

I have mixed feelings about H.264’s interlaced support. I’m impressed it was done so well. But I worry that it gives the industry one more excuse to keep the interlaced millstone around our necks for another decade.

There are two basic levels of interlaced in H.264.

Macroblock adaptive field-frame

Macroblock Adaptive Field-Frame (MBAFF) is conceptually similar to classic MPEG-2 and VC-1 interlaced coding, where individual macroblocks can be either interlaced or progressive, with some clever tweaks around the edges for in-loop deblocking and the like.

Picture adaptive field-frame

Picture Adaptive Field-Frame (PAFF) allows individual frames to be MBAFF (good for interlaced frames with mixed areas of motion and static elements), or fields separately encoded (for when everything is moving).

So far, interlaced H.264 has been mainly the domain of broadcast, and hasn’t been supported in QuickTime, Silverlight, Flash, or many devices. (Well, Flash can display interlaced H.264 bitstreams, but it leaves them as interlaced, so you should deinterlace on encode anyway.)

CABAC Entropy Coding

Entropy coding has been pretty similar for most codecs, with some flavor of Huffman coding with variable length tables. And H.264 has a variant of that with its CAVLC mode—Context Adaptive Variable Length Coding. It’s a fine, fast implementation of entropy coding.

What’s new in H.264 is CABAC—Context Adaptive Binary Arithmetic Coding. As you may remember from the section on data compression, Huffman coding can’t be more efficient than one bit per symbol, but arithmetic coding can use fractional bits and thus get below one. Thus, CABAC can be quite a bit more efficient than CAVLC; up to a 10–20 percent improvement in efficiency at high compression, decreasing with less compression. With visually lossless coding it is more like 3 percent more efficient. CABAC is probably second only to in-loop deblocking in importance to H.264’s efficiency advantage. But it comes with some limitations as well:

•  CABAC is a lot slower, and not parallelizable within a slice. CABAC can increase decoder requirements up to 40 percent. That speed hit is proportional to bitrate, so it’s less of a problem at the lower bitrates where CABAC is most useful. Keeping peak bitrate as low as is feasible while still providing adequate quality can help significantly. Each slice in the video is independently decodable, so using more slices can increase decoder performance on multicore machines, and is required in some cases, like Blu-ray at Level 4.1.

•  CABAC is also slower to encode, and requires a feedback mechanism that can add some latency to the encoding process.

•  CABAC’s speed and complexity excludes it from the Baseline profile, and hence it’s not available for portable media device content.
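The reason arithmetic coding wins is visible with a little information theory. For a skewed source, like the binary decisions CABAC actually codes, the entropy is well below one bit per symbol, which Huffman-style VLCs can’t reach (the probabilities here are hypothetical, just to show the gap):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A skewed binary source, like the bins CABAC codes (illustrative numbers)
h = entropy_bits([0.9, 0.1])
print(round(h, 3))  # → 0.469 bits/symbol
# A Huffman-style VLC can't spend less than 1 bit per symbol;
# arithmetic coding can approach the 0.469-bit entropy.
print(f"savings vs 1 bit/symbol: {(1 - h) * 100:.0f}%")  # → 53%
```

As the text notes, the gain shrinks at low compression: near-lossless coding produces less skewed statistics, so the entropy sits closer to what VLCs already achieve.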

Differential Quantization

H.264 can specify the quantization parameter per macroblock, enabling lots of perceptual optimization. Improved use of DQuant has been a big area of H.264 advancement in recent encoders.

Quantization Weighting Matrices

This is a High Profile–only feature brought over from MPEG-2. Custom quantization matrices allow tuning of which frequencies are retained and discarded.

The popular x264 encoder recommends a flat matrix, relying instead on aggressive use of Differential Quantization.

Modes Beyond 8-bit 4:2:0

Relatively unusual for an interframe codec, H.264 has profiles that support greater-than-8-bit-per-channel luma, and both 4:2:2 and 4:4:4 sampling. H.264 also supports using a full 0–255 luma range instead of the classic 16–235, making round-trip conversion to/from RGB much easier.

None of those modes are widely supported in decoders yet, though. H.264 is still mainly used at good old 16–235 8-bit 4:2:0.

H.264 Profiles

Baseline

The Baseline profile is a highly simplified mode that discards many of the computationally expensive features of H.264 for easy implementation on low-power devices. Generally, if it runs on a battery or fits in your hand (like an iPod, Zune, or phone), it likely will only have Baseline support. Baseline leaves out several prominent features:

•  B-frames

•  CABAC

•  Interlaced coding

•  8 × 8 blocks

So, Baseline requires quite a few more bits to deliver the same quality as the more advanced profiles, particularly High. But it works on cheap, low-power decoder ASICs.

While Baseline is certainly easier on software decoders than a full-bore High encode, that absolutely doesn’t mean Baseline should be used for software decoding whenever decode complexity is a constraint. In particular, B-frames and 8 × 8 blocks offer big improvements in efficiency with only minor increases in decoder complexity. In software, they’ll outperform Baseline at the same quality.

Constrained baseline

There’s an ongoing effort to add a new “Constrained Baseline” profile to document what’s emerged as the de facto Baseline standard. It turns out that there are some required features in Baseline that aren’t used by Baseline encoders, and hence aren’t tested and don’t work in existing “Baseline” decoders. Constrained Baseline would simply codify what is currently implemented as Baseline.

Extended

The Extended profile targets streaming applications, with error resiliency and other features useful in that market. However, it hasn’t been implemented widely in encoders or decoders. Its most prominent appearance has been as a mysteriously grayed-out checkbox in QuickTime since H.264’s addition in 2005.

Main

Main Profile was intended to be the primary H.264 codec for general use, and includes all the useful features for 8-bit 4:2:0 except for 8 × 8 block sizes. However, adding 8 × 8 turns out to help compression a lot without adding significant complexity, so any new decoder that does Main also does the superior High.

The primary reason to use Main today is if you’re targeting legacy players that do it but not High. Silverlight, Flash, and WMP all included High in their first H.264 implementations.

QuickTime was Baseline/Main only in QuickTime 7.0 and 7.1, with High decode added in 7.2.

Many Main decoders can’t decode interlaced content, although that’s a required feature of the Main Profile spec.

High

High Profile adds three primary features to Main:

•  8 × 8 block support!

•  Monochrome video support, so black-and-white content can be encoded as just Y′ without CbCr.

•  Adaptive Quantization Tables.

8 × 8 blocks are quite useful in terms of improving efficiency, particularly with film content and higher resolutions.

Monochrome is used much less in practice (and it’s not like coding big matrices where Cb and Cr = 128 took a lot of bits), and decoder support is hit-and-miss.

Adaptive Quantization Tables have been used to good success in some recent encoders, and are becoming another important High advantage.

High 10

After the initial launch of H.264, attention quickly turned to the Fidelity Range Extensions—FRExt—to get beyond 8-bit luma coding. However, it was the addition of 8 × 8 blocks to the basic 8-bit High Profile that was broadly adopted. High 10 is great, but so far isn’t used for content distribution. Having 10-bit would require less dithering in authoring. However, as the end-to-end ecosystem is all based around 8-bit sampling, it would be a slow transition at best.

The only widely supported player that decodes High 10 is Flash. Adobe doesn’t support it in their compression tools, however, and Flash itself has an 8-bit rendering model that doesn’t take advantage of the additional precision.

High 4:2:2

High 4:2:2 simply adds 4:2:2 chroma sampling with 10-bit luma. It would be most useful for interlaced content.

High 4:4:4 Predictive

High 4:4:4 Predictive (normally just called high 4:4:4) goes further to 14-bit 4:4:4. Again, points for awesome, but it is much more likely to be used as an intermediate codec than for content delivery for many years to come.

Intra Profiles

H.264 supports intra-only implementations for production use (like an updated take on MPEG-2 High Profile I-frame-only). These are subsets of the following profiles:

•  High 10 Intra

•  High 4:2:2 Intra

•  High 4:4:4 Intra

•  CAVLC 4:4:4 Intra

Really, intra codecs should use CAVLC anyway. CABAC is worth around only 3 percent extra efficiency at these high bitrates, while these same high bitrates make the decode complexity of CABAC much more painful.

Scalable Video Coding profiles

I can’t decide if Scalable Video Coding (SVC) is going to be the Next Big Thing in IP-based video delivery, or just something neat used in video conferencing.

The idea of scalable coding has been around for quite a while. The concept is to use prediction not just for future frames, but to enhance a base stream. B-frames are actually scalable video already, and could be thought of as a temporal enhancement layer. Since no frames are based on B-frames (putting pyramid B aside for a moment), a file could have all its B-frames stripped out and still be playable, or have new B-frames inserted to increase frame rate without changing the existing frames.

Temporal scalability

For example, take a typical 30-fps encode at IBPBPBP. A lower-bitrate, lower-decode complexity version could be made by taking out B-frames and leaving a 15 fps IPPP. Or extra B-frames (pyramid b, typically) could be added making it IbBbPbBbPbBbP and 60 fps (see Table 14.1).

Table 14.1 Temporal Scalability à la Pyramid B-Frames. The reference frames are at 15 fps, the first layer of B-frames brings it to 30p, and the second layer of pyramid b-frames to 60p.

Sec.  1/60  2/60  3/60  4/60  5/60  6/60  7/60  8/60  9/60  10/60  11/60  12/60  13/60
15p   I                       P                       P                          P
30p   I           B           P           B           P            B             P
60p   I     b     B     b     P     b     B     b     P     b      B      b      P

It’s conceptually simple. However, only the original 30 fps encode would have the best coding efficiency. If you knew you wanted 15 fps or 60 fps all along, you could have used the optimal bitrate and frame structure to encode that band. But where this really shines is when you don’t know how much bitrate is going to be available; instead the video can be encoded once (as IbBbP in the example) and depending on bitrate and/or processing power, just the IP, IP + B, or IP + B + b frames could be delivered on the fly. Each layer of extra data that can be added is called an enhancement layer. For our example here, the I-and P-frames are the base layer, the B-frames are the first enhancement layer, and the b-frames are the second enhancement layer.

And thus the basic concept of scalable video: trading some compression efficiency for increased flexibility.
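A sketch of how a server could drop enhancement layers on the fly, tagging each frame with the layer names used above (the frame representation here is hypothetical, purely for illustration):

```python
# Frame pattern from the example: base layer (I/P), first enhancement
# layer ("B"), second enhancement layer ("b")
gop = list("IbBbPbBbPbBbP")

def keep_layers(frames, max_layer):
    """Return only frames at or below max_layer:
    0 = base (I/P), 1 = adds "B", 2 = adds "b"."""
    layer_of = {"I": 0, "P": 0, "B": 1, "b": 2}
    return [f for f in frames if layer_of[f] <= max_layer]

print("".join(keep_layers(gop, 0)))  # → IPPP (15 fps base layer)
print("".join(keep_layers(gop, 1)))  # → IBPBPBP (30 fps)
print("".join(keep_layers(gop, 2)))  # → IbBbPbBbPbBbP (full 60 fps)
```

Because no lower layer ever references a higher one, any prefix of layers is a valid, decodable stream, which is the whole trick.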

Spatial scalability

Another place where scalable coding has long been used is “progressive” JPEG and GIF files, where a lower resolution of the image is transmitted and displayed first, with a higher resolution version then predicted from that. This is of course the basic mode of operation of wavelet codecs, and wavelet codecs can be thought of as being natively scalable, with each band serving as an enhancement layer.

H.264 SVC goes well beyond that simplified example in spatial scalability, but the concept is the same—predicting a higher-resolution frame from a lower-resolution frame. The key innovation in H.264 prediction is adaptive mixing of both spatial prediction (of the same frame in a lower band) and a temporal prediction (using another frame of the same resolution).

Quality scalability

The last form of scalability in SVC is quality scalability, which adds extra data to improve the quality of a frame. Essentially, the enhancement layers add additional coefficients back to the matrix. Imagine them as the difference between the coefficients of the base stream’s matrix QP and a less coarse QP.

SVC profiles

There are three profiles for SVC today, mapping to the existing profiles in use:

•  Scalable Baseline

•  Scalable High

•  Scalable High Intra

Note the lack of Main; there’s no point in using it instead of High, since SVC requires a new decoder anyway. High is getting by far the most attention, as it has the best compression efficiency. High Intra could be of use for workflow products, to enable remote proxy editing of HD sources.

Why SVC?

Looking at SVC, we can see that it has the potential to offer better scalability than classic stream switching or adaptive streaming. This is because stream switching needs to make sure that there’s some video to play, and so it can’t be very aggressive at grabbing higher bitrates with low latencies. With stream switching, if a 600 Kbps stream only gets 500 Kbps of bandwidth, there’s a risk of running out of buffered frames to decode and the video pausing. But with SVC, each enhancement layer can be pulled down one-by-one—a 100 Kbps base layer can be delivered, and then additional 100 Kbps layers could be downloaded. If a player tried to pull down the 600 Kbps layer but didn’t get it all, the layers at 500 Kbps and below are still there, and there’s no risk of a loss of video playback.

Buffering could be prioritized per stream, with the base stream always buffering ahead several minutes, with relatively less buffer the higher the enhancement layer. And with this buffering comes greater responsiveness. With stream switching, it takes a while to request a new stream; if background CPU activity causes dropped frames, it may be a few seconds before a new stream can be selected. But SVC can turn layers on and off on the fly without having to request anything new from the server.

Lastly, SVC can save a lot of server storage space. 4 Mbps across 10 layers is a lot smaller than a 4 Mbps stream + 3 Mbps stream + 2 Mbps + 1.5 Mbps…

Why not SVC?

There are a few drawbacks for SVC, of course. First, decoders aren’t widely deployed, nor are encoders widely available. It’s essentially a new codec that hasn’t solved the chicken-and-egg problem of being used enough to drive decoders, or being in enough decoders to drive use. This is particularly an issue with hardware devices.

SVC also has more decoder overhead; 720p24 4 Mbps with eight bands requires more power to play back than 720p24 4 Mbps encoded as a single layer.

There’s also an efficiency hit, particularly when spatial scalability is used. The highest layer of a low-bitrate-to-HD layer set might need a 10–15 percent higher bitrate than a single-layer HD stream. So, to get the quality of a 3000 Kbps single stream, SVC with eight layers may need 3300–3450 Kbps.

When SVC?

SVC is more interesting the less predictable the delivery environment is. It makes little sense for file- and physical-based playback, since the available bandwidth is already known. And for IPTV-like applications with fixed provisioned bandwidth, it’s more efficient to just encode at the desired bitrate.

But when bandwidth is highly variable, like with consumer Internet use, SVC could shine. And the lower the broadcast delay required, the harder it is to use stream switching.

Thus, the first place we’re seeing SVC catch on is in videoconferencing, where the broadcast delay is a fraction of a second.

Where H.264 Is Used

QuickTime

QuickTime was the first major platform to support H.264, introduced with QuickTime 7 in 2005. Initially only Baseline and Main were supported for decode. QuickTime 7.2 added High Profile decoding in 2007; it’s safe to assume nearly all Mac and iTunes users would have upgraded by now. QuickTime doesn’t support interlaced decode at all.

QuickTime’s decoder is decent, albeit software-only and not very fast.

Flash

Flash added H.264 decode in version 9.115. They have one of the deeper implementations (based on the Main Concept decoder), and can decode three profiles:

•  Baseline

•  Main

•  High (including High 10 and High 4:2:2)

Table 14.2 H.264 Levels.

Level  Max MB/sec  Max MB/frame  Max bitrate      Max bitrate  Max 4:3    Max 16:9   Max 4:3    Max 16:9   Max 4:3    Max 16:9
                                 (Baseline/Main)  (High)                              24p        24p        30p        30p
1      1485        99            64 Kbps          80 Kbps      176×144    208×128    144×112    160×96     128×96     144×96
1b     1485        99            128 Kbps         160 Kbps     176×144    208×128    144×112    160×96     128×96     144×96
1.1    3000        396           192 Kbps         240 Kbps     368×272    432×240    208×160    240×128    192×128    208×128
1.2    6000        396           384 Kbps         480 Kbps     368×272    432×240    288×224    336×192    256×208    304×176
1.3    11880       396           768 Kbps         960 Kbps     368×272    432×240    416×304    480×272    368×272    432×240
2      11880       396           2 Mbps           2.5 Mbps     368×272    432×240    416×304    480×272    368×272    432×240
2.1    19800       792           4 Mbps           5 Mbps       512×400    608×336    528×400    608×352    480×352    544×304
2.2    20250       1620          4 Mbps           5 Mbps       736×560    864×480    528×416    624×352    480×368    560×304
3      40500       1620          10 Mbps          12.5 Mbps    736×560    864×480    752×576    880×496    672×512    784×448
3.1    108000      3600          14 Mbps          14 Mbps      1104×832   1280×720   1232×928   1424×816   1104×832   1280×720
3.2    216000      5120          20 Mbps          25 Mbps      1328×992   1520×864   1744×1328  2016×1136  1568×1168  1808×1024
4      245760      8192          20 Mbps          25 Mbps      1664×1264  1936×1088  1872×1408  2160×1216  1664×1264  1936×1088
4.1    245760      8192          50 Mbps          62.5 Mbps    1664×1264  1936×1088  1872×1408  2160×1216  1664×1264  1936×1088
4.2    522240      8704          50 Mbps          62.5 Mbps    1728×1296  1984×1120  2720×2048  3152×1760  2432×1840  2816×1584
5      589824      22080         135 Mbps         168.75 Mbps  2736×2064  3168×1792  2896×2176  3344×1888  2592×1936  2992×1680
5.1    983040      36864         240 Mbps         300 Mbps     3536×2672  4096×2304  3728×2816  4096×2304  3344×2512  3856×2176

Flash can decode interlaced H.264, unlike QuickTime and Silverlight. However, it doesn’t have any deinterlacing support, so both fields will be painfully visible on playback.

Silverlight

Silverlight 3 introduced MPEG-4 and H.264 support, and it supports the same H.264 profiles as QuickTime:

•  Baseline

•  Main

•  High

Smooth streaming with H.264

With Silverlight 3, H.264 is now supported in Smooth Streaming. While tool support for that combination was just emerging as this book was being written, support was forthcoming in Expression Encoder 3, plus products from Inlet, Telestream, Envivio, and others.

The Smooth Streaming Encoder SDK used for VC-1 Smooth Streaming encoding provides resolution switching to maintain minimum quality. There’s less need for resolution switching with H.264 due to its strong in-loop deblocking.

One nice feature of Smooth Streaming is that different versions of the content can be encoded at the same bitrate. So codecs can be mixed and matched; for example, a single player could link to different 300 Kbps streams containing:

•  VC-1 Advanced Profile stream for desktop playback on computers

•  H.264 Baseline for mobile devices

•  H.264 High Profile for consumer electronics devices

Windows 7

While there have been third-party DirectShow H.264 decoders for years, and hardware acceleration support for H.264 since Vista, Windows 7 is the first version of Windows with out-of-the-box support for the MPEG-4 formats, including H.264.

Windows 7 has a deep implementation, including interlaced support and hardware acceleration via the Windows 7 Media Foundation API, including HD H.264 playback on the next generation of netbooks.

Portable Media Players

H.264 Baseline has become the dominant codec for portable media players and mobile phones. We’ll give specific examples in Chapter 25.

Consoles

Both the Xbox 360 and PlayStation 3 can play back H.264 files and streams from media services. There is not an equivalent for the Wii (which has much more limited hardware capabilities).

Settings for H.264 Encoding

Profile

Profile is your first question, and largely dependent on the playback platform. This will almost always be Baseline or High. Even if you’re concerned about decode complexity with a software player, use High if you can, while turning off the expensive-to-decode features. Features like 8 × 8 blocks improve compression without hurting decode complexity.

Level

Level defines the constraints of the encode to handle particular decoders. The level is stored in the bitstream, and decoders that can’t play back that level may not even try. It’s important to always have the level set and specified, even if implicitly (some encoders figure out the lowest Level compatible with your encoding settings). It’s bad practice to do 640 × 480 encoding with Level left at 5.1; use the lowest Level your content needs in order to deliver broader compatibility. QVGA 30p encodes never need more than 2.1, SD encodes never need more than 3.1, and 1080p24 encodes never need more than 4.1.
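Picking the lowest compatible level is a lookup against Table 14.2: compute the macroblock counts implied by your frame size and frame rate, then take the first level whose limits all hold. A minimal sketch in Python, using a subset of the table and ignoring interlacing and buffer details (the function itself is illustrative, not a normative check):

```python
# Subset of Table 14.2: (level, max MB/sec, max MB/frame,
# max Baseline/Main bitrate in Kbps).
LEVELS = [
    ("1",   1485,   99,   64),
    ("1.1", 3000,   396,  192),
    ("1.2", 6000,   396,  384),
    ("1.3", 11880,  396,  768),
    ("2",   11880,  396,  2000),
    ("2.1", 19800,  792,  4000),
    ("2.2", 20250,  1620, 4000),
    ("3",   40500,  1620, 10000),
    ("3.1", 108000, 3600, 14000),
    ("3.2", 216000, 5120, 20000),
    ("4",   245760, 8192, 20000),
    ("4.1", 245760, 8192, 50000),
]

def macroblocks(width, height):
    # Macroblocks are 16 x 16; partial blocks round up.
    return ((width + 15) // 16) * ((height + 15) // 16)

def lowest_level(width, height, fps, bitrate_kbps):
    mb_frame = macroblocks(width, height)
    mb_sec = mb_frame * fps
    for name, max_mb_sec, max_mb_frame, max_kbps in LEVELS:
        if (mb_sec <= max_mb_sec and mb_frame <= max_mb_frame
                and bitrate_kbps <= max_kbps):
            return name
    return "4.2+"

# A 640 x 360 30p encode at 1000 Kbps fits Level 3 -- no need for 5.1.
print(lowest_level(640, 360, 30, 1000))  # -> 3
```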

Bitrate

Compared to older codecs, a well-tuned H.264 encode can deliver great quality at lower bitrates. However, that comes at the cost of extra decoder complexity; encoding H.264 at MPEG-2 bitrates is a waste of bits and a waste of MIPS. Make sure that your average (and, if VBR, peak) bitrates aren’t higher than needed for high quality. There’s no need to spend bits beyond the point where the video stops looking better.

Conversely, H.264 isn’t magic, and if you’re used to just watching out for blocking artifacts when compressing, you’ll need to take a more detailed look with H.264 since in-loop deblocking yields softness instead of blockiness when overcompressed. Use enough bits to get an image that looks good, not just one that’s not obviously bad.

Entropy Coding

The two entropy coding modes in H.264 Main and High are CAVLC and CABAC. CAVLC is faster and less efficient, CABAC is slower to encode and decode, but more efficient.

Most of the time, you’ll want to use CABAC; CAVLC only makes sense when you’re worried about decode complexity.

Since CABAC’s decode hit is proportional to bitrate, watch out for performance with VBR and high peak bitrates. Make sure the hardest part of the video plays well enough on the target platform. A peak of 1.5x the average bitrate is a good starting point; don’t raise the peak beyond the point where it still contributes meaningfully to quality.
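The rule of thumb above is simple arithmetic; as a sketch, starting the peak at 1.5x average and sizing the VBV buffer as a couple of seconds at peak (the two-second figure is an illustrative assumption, not a spec):

```python
# Hedged starting points for VBR caps: peak = 1.5 x average bitrate,
# buffer = a couple of seconds at peak. Tune both against real playback.
def starting_peak_kbps(average_kbps, ratio=1.5):
    return int(average_kbps * ratio)

def vbv_buffer_kbits(peak_kbps, seconds=2):
    return peak_kbps * seconds

peak = starting_peak_kbps(1000)  # 1000 Kbps average -> 1500 Kbps peak
buf = vbv_buffer_kbits(peak)     # -> 3000 Kbits of buffer
```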

Slices

One way to allow entropy decoding to parallelize is to encode with slices, breaking the frame into (typically) horizontal stripes. Each slice is treated somewhat like an independent frame, with separate entropy coding.

This can improve playback on some multicore software decoders, and is required by the Blu-ray spec for encodes using Level 4.1. There’s a slight loss in efficiency with slices (perhaps 1 percent per slice as a ballpark). The hit is bigger the smaller the frame; a good rule of thumb is for each slice to be at least 64 pixels tall.
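That rule of thumb can be sketched as a small helper: cap the slice count at one per core, but never let slices get shorter than 64 pixels (the function is illustrative; encoders apply their own defaults):

```python
# Cap slice count: at most one slice per core, and each horizontal
# slice should stay at least 64 pixels tall per the rule of thumb.
def slice_count(frame_height, cores):
    return max(1, min(cores, frame_height // 64))

print(slice_count(1080, 8))  # -> 8 (1080 allows up to 16 slices)
print(slice_count(240, 8))   # -> 3 (small frames take the bigger hit)
```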

It’s generally a lot more bitrate efficient to encode CABAC with slices than CAVLC without slices.

The Main Concept and Microsoft encoders, among others, also use slices for threaded encoding; the default is one slice per physical core in the encoding machine. However, the trend in H.264 is definitely single-slice encoding for maximum efficiency.

Number of B-frames

Two B-frames is a good default for most non-pyramid H.264 encodes. Most H.264 encoders treat the B-frame number as a maximum value when B-frame adaptation is turned on, with Microsoft’s being an exception.

Pyramid B-frames

Pyramid B-frames can be a welcome, if not huge, boost to encoding efficiency. You’ll need to be using at least 2 B-frames to use pyramid, obviously. Pyramid should be used with an adaptive B-frame placement mode for best results.
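The advice above maps to x264-style flags (the flag names are x264’s; other encoders label these controls differently), as in this illustrative sketch:

```python
# Build x264-style frame-type flags: up to two B-frames, adaptive
# placement, and pyramid referencing when enough B-frames are allowed.
def bframe_args(bframes=2, pyramid=True):
    args = ["--bframes", str(bframes), "--b-adapt", "2"]  # adaptive placement
    if pyramid and bframes >= 2:  # pyramid needs at least 2 B-frames
        args += ["--b-pyramid", "normal"]
    return args

print(" ".join(bframe_args()))
```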

Number of Reference Frames

As mentioned above, large numbers of reference frames rarely pay off for film/video content, although they can be quite useful with cel animation and motion graphics. Four is generally sufficient. Using the full 16 often has little impact on decode performance during playback, but it can slow down random access and delay the start of playback.

Strength of In-Loop Deblocking

The default in-loop deblocking filter in H.264 is arguably tuned for video and lower bitrates more than for film grain and higher bitrates, and may soften detail more than is required. For tools that offer finer-grained control over it, turning it down can increase detail. Beyond on/off, there are two parameters that control behavior of in-loop deblocking:

•  Alpha determines the strength of deblocking; higher values produce more blurring of edges.

•  Beta modifies that strength for blocks that are more uniform.

It’d be rare to increase the strength of the filters. The range for each is –6 to +6, with 0 being the default. If adjusted, the values are almost always negative; typically –1 or –2 is sufficient.
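In x264, these two offsets are set together via the `--deblock` flag (alpha:beta); other encoders expose them under different names. A sketch that forms the flag and clamps both offsets to the legal range:

```python
# Form an x264-style in-loop deblocking override, clamping the alpha
# and beta offsets to the legal -6..+6 range.
def deblock_arg(alpha_offset, beta_offset):
    clamp = lambda v: max(-6, min(6, v))
    return "--deblock %d:%d" % (clamp(alpha_offset), clamp(beta_offset))

print(deblock_arg(-2, -1))  # -> --deblock -2:-1
```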

Some Blu-ray encoders default to having in-loop deblocking off entirely in order to preserve maximum detail. It seems unlikely to me that the job couldn’t be accomplished by just using lower settings.

H.264 Encoders

Main Concept

There’s a huge number of H.264 encoders out there, but just a few that are used across a variety of products you’ll see again and again. Main Concept is the most broadly licensed H.264 encoder SDK, and provides support in video products from companies including the following:

•  Adobe

•  Sorenson

•  Rhozet

•  Inlet

The Main Concept SDK has seen progressive improvement in speed, quality, and configurability over the years. At any given time, most products will be using the same implementation; differences are mainly in what controls are exposed and what defaults are used.

In general, Main Concept provides the same settings as other encoders (Figure 14.5). One difference is that it doesn’t have an explicit Complexity control, but rather a series of options with a Fast/Complex switch. Generally the Fast modes are faster than they are bad, but the Complex modes will eke out a little extra quality if needed. Unfortunately, Main Concept’s documentation has infamously lacked useful settings guidelines for users and licensees, so we’re often in the dark as to the impact of any given option. However, a few comments:

•  Differential Quantization can be a big help in preserving background detail. I find using Complexity: -50 is a good default and really improves many encodes.

•  Search shape of 8 × 8 is more precise than 16 × 16. This presumably specifies partition size.

•  2-pass VBR is way better than 1-pass VBR, particularly for long content.

•  Hadamard Transform is also a slower/higher quality setting.

Figure 14.5 The Main Concept encoding controls as presented in their own Reference product. I just wish they had more documentation on the impact of these settings.

image

All that said, the general defaults are pretty good in most products these days. Perhaps the biggest criticism of Main Concept is that it’s less tuned for detail preservation than other implementations, particularly x264.

x264

x264 (Figure 14.6) is an open source H.264 implementation following the successful model of LAME for MP3 and Xvid for MPEG-4 Part 2. It’s absolutely competitive with commercial implementations, has an active and innovative developer community, and is used in a wide variety of non-commercial products. Some of its cool features:

Figure 14.6 MeGUI settings. A good front-end to the great x264, but I can’t say it does a very good job of guiding users to the settings that are tweakable and the ones that should really be left alone. But I can always find out exactly what each option does.

image

Great performance

Using defaults, it can easily encode 720p in real-time on a decent machine, and is still quite fast even with every reasonable quality-over-speed option selected.

Great quality

x264 has done great work with classic rate distortion optimizations, and is able to perform rate distortion down to the macroblock level. But perhaps even more impressive is its perceptual optimizations. Its Psy-RDO mode does a pretty incredible job of delivering more psychovisually consistent encodes, reducing blurring, retaining detail, and maintaining perceptual sharpness and texture. It’s my go-to encoder these days when I need to get a good looking image at crazy-low bitrates.

MB-tree

The new macroblock rate control tree mode optimizes quality based on how future frames reference different macroblocks of the current frame. That lets it shift bits toward parts of the image that have a long-term impact on quality, and away from transient details that may last for just a frame. This yields a massive efficiency improvement in CGI and cel animation, and a useful improvement for film/video content.

Single-slice multithreading

It can use 16 cores with a single-slice encode for maximum efficiency. This is a significant advantage over the Main Concept and Microsoft implementations.

CRF mode

This is a perceptually tuned variant of constant quantizer encoding, allowing a target quality and maximum bitrate to be set, and the encode will be as small as it can be while hitting the visual quality target and maintaining VBV. It’s a great feature for file-based delivery, where we care more about “good enough and small enough” than any particular bitrate.
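As an illustration of "quality target plus VBV cap," here is the same idea expressed as an ffmpeg/libx264 command line, built in Python. The flag names assume ffmpeg’s x264 wrapper (`-crf`, `-maxrate`, `-bufsize`); the file names are placeholders:

```python
# Sketch: CRF encode constrained by a VBV. The encode will be as small
# as it can be while hitting the quality target and respecting the cap.
def crf_command(src, dst, crf=20, peak_kbps=1500, buffer_seconds=2):
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-crf", str(crf),                                  # quality target
        "-maxrate", "%dk" % peak_kbps,                     # VBV peak bitrate
        "-bufsize", "%dk" % (peak_kbps * buffer_seconds),  # VBV buffer
        dst,
    ]

cmd = crf_command("master.mov", "out.mp4")
print(" ".join(cmd))
```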

Rapid development

There can be new builds with new features or tweaks quite often. On the other hand, these rapid changes can be difficult to stay on top of.

Telestream

Telestream’s H.264 codec (Figure 14.7) is derived from PopWire’s Compression Master, and so has many advanced features of use in the mobile and broadcast industries, a la Main Concept. It includes some unique features:

•  Can set buffer as both VBV bytes and bitrate peak.

•  Lookahead-based 2-pass mode makes for faster encoding but somewhat weaker rate control for long content.

Figure 14.7 (a) Episode has all the right basic controls, including a choice of specifying buffer as VBV or bitrate. Oddly, its multiple reference frames control maxes out at 10. (b) Episode uses its usual Lookahead-based 2-pass mode for H.264 as well. It can also do High 4:2:2, unlike many other encoders. (c) Episode also has some more unusual settings that allow streams to meet some unusual requirements, like easy switching between bitrates. It uses multithreaded slicing.

image

It had some quality limitations in older versions, but as of Episode 5.1 is competitive.

QuickTime

Apple’s encoder is relatively weak, particularly beyond Baseline. It is Baseline/Main only and its implementation lacks three key features:

•  CABAC

•  Multiple reference frames

•  8 × 8 blocks

In fact, Apple’s Main is pretty much Baseline + B-frames.

There’s very little control in QuickTime’s export dialog (Figure 14.8), but here are a few tips:

•  Set Size and Frame Rate to current if you’re not doing any preprocessing.

•  If you’ve got some time on your hands, use the “Best quality (Multi-pass)” Encoding mode. It’ll take as many passes as the codec thinks are necessary per section of the video. I’ve counted up to seven passes in some encodes.

•  Don’t drive yourself crazy trying to activate the “Extended Profile” checkbox. It has been grayed out in every version of QuickTime with H.264. Extended is a streaming-tuned mode that hasn’t been used significantly in real-world applications.

Figure 14.8 QuickTime’s export settings. It’s the same underlying codec, but you get 14.8a and 14.8b when exporting to an MPEG-4 file (I hate modal dialogs), but a nice unified one in 14.8c when exporting to .mov. Note the CD-ROM/Streaming/Download control is only available in multi-pass mode. We lack good documentation on the difference between CD-ROM and Download; I presume Streaming is CBR.

image

In general, QuickTime’s exporter is a fine consumer-grade solution, but I can’t imagine using it for quality-critical content.

Microsoft

Microsoft first shipped an H.264 encoder with Expression Encoder 2 SP1, which was a quite limited implementation just for devices. Windows 7 adds much improved H.264 support, although the implementation in Windows itself is focused on device support.

Windows 7 also supports hardware encoding, which can make transcoding for devices as fast as a file copy.

A fuller software implementation with a bevy of controls is included in Expression Encoder 3, for both .mp4 and Smooth Streaming .ismv files.

EEv3’s H.264 (Figure 14.9) is quite a bit better than QuickTime’s, although it’s also missing a few quite useful features:

•  8 × 8 blocks (it’s Baseline/Main only)

•  Pyramid B-frames

•  Adaptive frame type decisions

Figure 14.9 Expression Encoder 3’s H.264 controls. I like having a single-pane nonmodal control like this.

image

However, unlike QuickTime, it does support multiple reference frames, peak buffer size, and CABAC. It also has higher complexity modes that can deliver quite good quality within those limitations. I look forward to a future version with High Profile, of course.

Tutorial: Broadly Compatible Podcast File

Scenario

We’re a local community organization that wants to put out a regular video magazine piece about industry events. Our audience is a broad swath of folks, many of whom are not very technical. So we want to make something that’ll work well on a variety of systems and devices with minimal additional software installs required.

We want to make a simple H.264 file that’ll play back on a wide variety of devices.

The content is a standard def 4:3 DV file. It’s a mix of content, some handheld, some tripod, some shot well, some less so.

The Three Questions

What Is My Content?

A reasonably well-produced 4:3 480i DV clip. We’ll yell at them next time about shooting progressive, but what we’ve got is what we’ve got.

Who Is My Audience?

The people in the community, everyone from tech-savvy teenagers to retired senior citizens. We want something that’ll be very easy for them to use without installing anything. We want a single file that can work on whatever they’ve got, and will offer multiple ways to consume it from our web site.

•  Via a podcast subscription to iTunes, Zune client, or any other compatible podcasting application.

•  Embedded in the web page via a Flash player.

•  As a download for desktop, or portable media player playback.

Users will have a wide range of computer configurations and performance. While most will have broadband, we’ll still have some potential modem users out there who mainly use their computers for email; the file shouldn’t be too big a download for them.

What Are My Communications Goals?

We want to share all the cool things we’re doing with our community. We want the content to look good, but that’s just a part of the whole experience: it must be easy to find, play back well, and leave them excited about downloading the next episode. We want viewers to feel more connected to their community and more excited about engaging in it.

Tech Specs

We’ll try both QuickTime and Sorenson Squeeze for this encode. Squeeze is one of the most approachable compression tools, and a good first choice for someone more interested in doing some compression without becoming a compressionist. Squeeze includes good deinterlacing and a good Main Concept implementation.

We’ve got a lot of constraints to hit simultaneously. In general, we can aim for the iPod 5G specs; other popular media players do at least that or better. Thus, we can use the following parameters:

•  H.264 Baseline Profile Level 3.0

•  Up to 640 × 480, 30 fps, 1.5 Mbps

•  Up to 160 Kbps 16-bit stereo 48 kHz audio

Reasonably straightforward. The one thing we’ll want to keep our eyes on is that bitrate; that’ll have to be our peak.

Encoding Settings

QuickTime iPod encoding is dead simple (Figure 14.10). You just export the file as “iPod” and you’re done. No options, no tuning, and it’ll look great on the iPod’s small screen. However, it’ll also use more bits than needed for its quality, and can have some quality issues when scaled up to full screen (although Apple’s Baseline encoding is much more competitive than their Main).

Figure 14.10 The optionless iPod export in QuickTime.

image

Alas, sometimes with great simplicity comes great lameness. QuickTime exports the 480i DV assuming it’s progressive square-pixel, and so we get horrible fields and the image scaled to 640 × 426 (720 × 480 proportionally scaled to 640 width). This will not stand!

Squeeze comes with an “iPod_Lg” preset that sounds exactly like what we’re looking for. Applying it to the source file, the automatic preset modes of Auto Crop and Auto Deinterlace are applied. Those filters are much improved from previous versions of Squeeze in quality.

Look carefully at the settings, though; “iPod_Lg” is Sorenson MPEG4 Pro—actually their part 2 implementation. It gives us the choice of Sorenson, Main Concept, or Apple’s H.264 encoders. Main Concept is the only one which allows a 2-pass VBR with both average and peak bitrates, and it’s also the highest-quality one; we’ll pick that. The default settings aren’t bad, but we’ll make a few changes, also shown in Figure 14.11:

•  Switch Method from 1-Pass VBR to 2-pass VBR; important to get the bits where we need them for optimal file size.

•  Lower data rate to 1000 Kbps to reduce our download size for modem users.

•  Reduce sample rate to 44.1. That’s slightly more compatible on very old computers, and will sound just as good.

•  Set Max Data Rate to 150 percent, giving us our 1500 Kbps peak for compatibility.

•  Encoding Effort (Complexity) is already on Best, so we’ll leave it there.

•  Slices defaults to 0. That’s probably fine; we can’t use CABAC in Baseline anyway.
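The settings above can also be expressed as a two-pass ffmpeg/libx264 command pair, for readers without Squeeze. This is a hedged sketch: the flag names assume ffmpeg’s x264 wrapper, `yadif` stands in for Squeeze’s Auto Deinterlace, and the file names are placeholders:

```python
# Sketch: the tutorial's encode as ffmpeg arguments, one list per pass.
def ipod_pass(src, dst, pass_num):
    args = [
        "ffmpeg", "-y", "-i", src,
        "-vf", "yadif,scale=640:480",               # deinterlace, 4:3 square pixels
        "-c:v", "libx264",
        "-profile:v", "baseline", "-level", "3.0",  # iPod 5G compatibility
        "-b:v", "1000k",                            # average bitrate
        "-maxrate", "1500k", "-bufsize", "3000k",   # 150 percent peak
        "-pass", str(pass_num),
        "-c:a", "aac", "-ar", "44100",              # 44.1 kHz audio
    ]
    # The first pass only writes stats, so its output can be discarded.
    return args + (["-f", "mp4", "/dev/null"] if pass_num == 1 else [dst])

passes = [ipod_pass("show.dv", "show.mp4", n) for n in (1, 2)]
```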

Figure 14.11 Our final Squeeze settings. It also gets points for a single-pane interface, although there are plenty of options I wish were exposed.

image

And the results are pretty decent. If we weren’t worried about download time, we would have been a little better off using the full 1500 Kbps. But the hard parts of the video were already going to be at the 1500 Kbps peak, so it wouldn’t have helped all that much. At 1000 Kbps, the Squeeze output (Figure 14.12) looks a lot better than the QuickTime (Figure 14.13) at 1500 Kbps.

Figure 14.12 The Squeeze encode is well preprocessed, which makes for much better quality even at a third lower bitrate.

image

Figure 14.13 QuickTime’s output with this source is a good refresher of what not to do from the Preprocessing chapter. It has interlaced coded as progressive, anamorphic coded as square-pixel, and a reduced luma range, due to a wrong 16–235 to 0–255 conversion.

image

H.265 and Next-Generation Video Codecs

Of course, H.264 isn’t the end of codec development by any means. Its replacement is already in early stages. Recognizing that many of H.264’s most computationally expensive features weren’t the source of the biggest coding efficiency gains, a new codec could provide a mix of better compression and easier implementation. VCEG has ambitious targets:

•  Allow a 25 percent reduction in bitrate for a given quality level with only 50 percent of H.264 decode complexity

•  A 50 percent reduction in bitrate without decode complexity constraints

It’s too early to say if either or both will be met, but getting close to either should be as disruptive to H.264 as H.264 was to MPEG-2. These improvements may be addressed via an extension to H.264 or an all-new H.265. Both are very much research projects at this point. An H.264 extension could potentially emerge in 2010, but a full H.265 is unlikely before 2012 at the earliest.
