CHAPTER 11
MPEG-4

After the resounding success of MPEG-2, the Moving Picture Experts Group (MPEG) started work on MPEG-4 in 1993. Version 1 of the specification was approved in 1998, version 2 followed in 2000, and the all-important H.264 (MPEG-4 Part 10) was unveiled in 2003.

In general terms, MPEG-4 is a standard way to define, encode, and play back time-based media. In practice, MPEG-4 can be used in radically different ways in radically different situations. MPEG-4 was designed to be used for all sorts of things, including delivering 2D still images, animated 3D facial models with lip sync, two-way video conferencing, and streaming video on the web. Many of those features have never been used in practice, and the particular features supported in any given industry or player vary widely.

Like MPEG-1 before it, MPEG-4 fell far short of the initial expectations many had for it. In particular, ambiguity around licensing terms slowed adoption around the turn of the millennium, the original MPEG-4 Part 2 video codec had practical limitations that kept it from becoming a viable replacement for MPEG-2, and by the time it was widely available it was outperformed by the available proprietary codecs. And some of the most-hyped features—in particular, the Binary Format for Scene (BIFS) rich presentation layer—never made it past cool demos into real deployments.

But as we saw with MPEG-1 and MPEG-2, different markets found different early uses for the MPEG-4 technologies:

•  “DivX” files commonly used for pirated content were the biggest area of use for MPEG-4 Part 2, and led to lots of development for (largely noncommercial) encoders and decoders.

•  The post-MP3 audio format for Apple’s iTunes and iPods, .m4a, is AAC-LC in an MPEG-4 file.

•  Early, pre-H.264 podcasting.

•  QuickTime 6-era video files.

•  3GP/3GPP video on cell phones.

But the real explosion in MPEG-4 has happened since H.264 came out:

•  The first broadly supported media format since MPEG-1! All major players and devices—QuickTime 7+, Windows Media Player 12, iPod, Zune, Xbox 360, PlayStation 3, Flash, Silverlight, etc.—support H.264.

•  The broadcast industry is finally shifting from MPEG-2 to H.264.

•  HE AAC streams are becoming the new digital radio and audio webcasting standard.

The momentum of MPEG-4 and its component technologies is inarguable.

MPEG-4 Architecture

MPEG-4’s official designation as a standard is ISO/IEC-14496. MPEG-4 picks up where MPEG-1 and MPEG-2 left off, defining methods for encoding, storing, transporting, and decoding multimedia objects on a variety of playback devices. Whereas MPEG-1 and -2 were all about compressing and decompressing video and audio, MPEG-4’s original aims were “low bitrate audio/video coding” and defining “audio-visual or media objects” that can have very complex behaviors.

There is no such thing as a generic MPEG-4 player or encoder. Because the specification is so huge and flexible, subsets of it have been defined for various industries. A file intended for delivery to a digital film projector isn’t meant to work on a cell phone! Thus MPEG-4 has a more complex set of Profiles and Levels than MPEG-2. It also includes “Parts,” independent sections of the spec that enable new aspects of the format. H.264 (MPEG-4 Part 10) is the biggest of these.

MPEG-4 File Format

The base MPEG-4 file format was closely based on the QuickTime file format, and will feel very familiar to old QuickTime hands. They’re so close that the same parser is often used to read both .mov and .mp4 files, as in Flash, Silverlight, and Windows Media Player 12.

The MPEG-4 file format is extensible. There are many mappings to the file format for codecs not formally part of MPEG-4, including MP3 and VC-1.

Boxes

A box is the basic unit of an MPEG-4 file, inherited from QuickTime (which calls them atoms). A box can contain other boxes, which can themselves contain further boxes. Each box is identified by a four-character code (4CC). The most important is the movie header, the “moov” box.
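To make that structure concrete, here’s a minimal sketch in Python of a top-level box walker. The file name is hypothetical, and real parsers handle more edge cases, but the size-plus-4CC layout it reads is the heart of the format.

import struct

def iter_boxes(f):
    """Yield (fourcc, offset, size) for each top-level box in an MP4/QuickTime file."""
    f.seek(0, 2)
    file_end = f.tell()
    offset = 0
    while offset + 8 <= file_end:
        f.seek(offset)
        size, fourcc = struct.unpack(">I4s", f.read(8))  # 32-bit size, then the 4CC
        if size == 1:                                    # 64-bit "largesize" follows the 4CC
            size = struct.unpack(">Q", f.read(8))[0]
        elif size == 0:                                  # box extends to the end of the file
            size = file_end - offset
        yield fourcc.decode("latin-1"), offset, size
        offset += size

with open("movie.mp4", "rb") as f:   # hypothetical file name
    for fourcc, offset, size in iter_boxes(f):
        print(fourcc, offset, size)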

Tracks

As in QuickTime, media in an MPEG-4 file lives in tracks, each normally a sequence of video or audio. A single .m4a file will have a single track of AAC-LC audio. A simple podcasting file would have a track of AAC-LC audio and a track of H.264 Baseline video.

But more complex combinations are possible. A file could have multiple audio tracks for different languages, with the player picking which one to play.
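As a sketch of how a player finds those tracks, the snippet below recurses into the “moov,” “trak,” and “mdia” container boxes and prints each track’s handler type: “vide” for video, “soun” for audio. The file name is hypothetical.

import struct

CONTAINERS = {b"moov", b"trak", b"mdia"}

def list_track_types(path):
    """Print each track's handler type ('vide', 'soun', 'text', ...)."""
    data = open(path, "rb").read()

    def walk(start, end):
        offset = start
        while offset + 8 <= end:
            size, fourcc = struct.unpack(">I4s", data[offset:offset + 8])
            header = 8
            if size == 1:     # 64-bit "largesize" follows the 4CC
                size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
                header = 16
            elif size == 0:   # box extends to the end of this range
                size = end - offset
            if fourcc in CONTAINERS:
                walk(offset + header, offset + size)
            elif fourcc == b"hdlr":
                # The handler type sits 8 bytes into the payload, after the
                # version/flags word and the pre_defined field.
                print(data[offset + header + 8:offset + header + 12].decode("latin-1"))
            offset += size

    walk(0, len(data))

list_track_types("podcast.mp4")   # prints e.g. vide, then soun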

Hint tracks

MPEG-4 files meant for real-time streaming via the standard Real Time Streaming Protocol (RTSP) require hint tracks. A hint track is an extra track (one per track to be streamed) containing an index that shows how to packetize the media for real-time delivery. Hint tracks are easy to make during encoding or afterward. They shouldn’t be included in files meant for local playback only, as the extra data increases file size. Figures 11.1, 11.2, and 11.3 show hint tracks in various implementations.

Figure 11.1 Tracks in a hinted file. They increase the file size a little more than 3 percent in this case.


Figure 11.2 QuickTime has pretty basic hinting options. The defaults are fine unless you’re targeting a specific network that is best with a different packet size. The “optimize for server” mode hasn’t been useful for ages; it just makes the file much bigger without actually helping anything.


Figure 11.3 Episode Pro started as an encoder for mobile devices, and continues to have a very deep feature set for that market. It has much more configurability for hinting than most applications. This allows encoding to be tuned for particular devices or networks.


Figure 11.4 The structure of a fragmented MPEG-4 file (fMP4). Each fragment is self-contained, with its own video, audio, and index needed for playback.


Flash Media Server (FMS) doesn’t require hint tracks for its proprietary RTMP (Real Time Messaging Protocol) protocol, but there’s no downside to including them either, as they aren’t transmitted by FMS even if they’re there.
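When hint tracks are needed, they’re typically added after encoding with a hinting tool. As a sketch, assuming GPAC’s MP4Box is installed and on the PATH, its -hint option adds RTP hint tracks in place (and -unhint strips them back out of a copy meant for local playback):

import subprocess

# Add RTP hint tracks in place; -mtu sets the target packet size in bytes.
subprocess.run(["MP4Box", "-hint", "-mtu", "1450", "podcast.mp4"], check=True)

# Remove hint tracks from a copy meant for local playback only.
subprocess.run(["MP4Box", "-unhint", "podcast.mp4"], check=True)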

Text tracks

Text tracks are how MPEG-4 stores subtitles, synced URLs, and other time-based data. They have seen significant use only recently, with different player implementations expecting somewhat different versions of subtitle presentation.

Fast-Start

Traditionally, MPEG-4 content, including all .mp4/.m4a files, has been nonfragmented, following the classic QuickTime model. In a nonfragmented file, the video and audio tracks are continuous, and to jump to a particular frame, the player needs to know where in the file that frame and its associated audio are. That information is stored in an index: the movie header, a “moov” box.

The header needs to record the location of every sample in the media tracks, so it can’t be created until encoding is done. The simplest approach is to append the header at the end of the file, with an offset value at the front indicating where it is. That works great when random access is possible. But it’s a poor experience when the content is accessed serially, as in classic progressive download without byte-range access: all the media has to be downloaded before anything can be played.

Hence the “Fast Start” file, which moves the movie header to the beginning of the file.

This requires that the file be resaved after the initial write, but that’s plenty fast on modern machines. Most MPEG-4 tools automatically make the files Fast Start.
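You can check whether a file is Fast Start by looking at the order of its top-level boxes: if “moov” comes before “mdat,” playback can begin while the download continues. A minimal sketch (file name hypothetical):

import struct

def is_fast_start(path):
    """True if the 'moov' box precedes the 'mdat' box in the file."""
    order = []
    with open(path, "rb") as f:
        f.seek(0, 2)
        file_end = f.tell()
        offset = 0
        while offset + 8 <= file_end:
            f.seek(offset)
            size, fourcc = struct.unpack(">I4s", f.read(8))
            if size == 1:      # 64-bit "largesize" follows the 4CC
                size = struct.unpack(">Q", f.read(8))[0]
            elif size == 0:    # box extends to the end of the file
                size = file_end - offset
            order.append(fourcc)
            offset += size
    if b"moov" not in order or b"mdat" not in order:
        return False           # fragmented or otherwise unusual layout
    return order.index(b"moov") < order.index(b"mdat")

print(is_fast_start("movie.mp4"))   # hypothetical file name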

Fragmented MPEG-4 files

The other way to handle random access into an MPEG-4 file is to use the fragmented subtype. In fragmented MPEG-4 (fMP4), instead of one big movie header covering all the samples, there are periodic partial headers that each cover one portion of the content. This type of header is a “moof” box, and together with its associated video and audio content it is called a fragment (see Figure 11.4).

Historically, fMP4 was mainly used for digital video recorders (losing power during recording leaves the file playable except for the final partial fragment). However, it’s now used as the file format for Smooth Streaming, and in other major implementations not yet announced.
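A quick way to tell the two layouts apart is to count top-level “moof” boxes; a nonfragmented file has none. Reusing the iter_boxes sketch from the Boxes section, and a hypothetical .ismv file name:

with open("stream.ismv", "rb") as f:   # hypothetical Smooth Streaming file
    fragments = sum(1 for fourcc, _, _ in iter_boxes(f) if fourcc == "moof")
print(fragments, "fragments")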

The tragedy of BIFS

The big thing MPEG-4 promised beyond the QuickTime file format and then-existing technologies was BIFS—the Binary Format for Scenes. It was a derivative of VRML (Virtual Reality Modeling Language, another mid-1990s rich media disappointment), and provided control over the video presentation, including synchronized text, multiple video windows, integrated user interfaces with audio feedback, and so on.

It could have changed everything.

Except it didn’t. It was never supported by QuickTime, RealPlayer, or Windows Media Player. And while there were some interesting authoring tools, from iVast and Envivio, it never got enough traction in players to drive content, or enough traction in content to drive adoption by players. In the end, MPEG-4 was mainly important as a way to get much better standards-based video codecs. The format itself didn’t really enable anything in interactivity that MPEG-1 couldn’t.

In the end, the promise of BIFS was finally realized by Flash, and later, by Silverlight. But in those cases it was about embedding media in a rich player instead of having the media itself host the interactivity. HTML5’s <video> tag has been proposed as another way to address this market, but that so far lacks any codec or format common denominator.

MPEG-4 Streaming

MPEG-4 uses RTSP for streaming. RTSP is a broadly used protocol; RealPlayer and Windows Media have their own implementations as well. RTSP delivers media over RTP, typically on UDP, as we’ll discuss in more detail in Chapter 23.

RTSP is a pretty canonical streaming protocol with all the features one would expect, like:

1.  Multicast support

2.  Random access in on-demand content

3.  Live support

4.  Session Description Protocol (SDP) as metafile format (a minimal sample follows this list)
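As a sketch of what that metafile looks like, here’s a minimal, hypothetical SDP description for a session with one H.264 video track and one AAC audio track. Real SDPs carry additional fmtp parameters, and all values here are illustrative.

v=0
o=- 0 0 IN IP4 198.51.100.1
s=Hypothetical MPEG-4 session
t=0 0
m=video 0 RTP/AVP 96
a=rtpmap:96 H264/90000
a=control:trackID=1
m=audio 0 RTP/AVP 97
a=rtpmap:97 mpeg4-generic/48000/2
a=control:trackID=2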

That said, pure MPEG-4 RTSP hasn’t been that widely used for PC delivery; RealMedia, Windows Media, and Flash have long dominated that market. Apple keynotes and presentations have probably been its most prominent use. RTSP has been much more used in device streaming.

MPEG-4 Players

While there is a reference player available, MPEG-4 player implementation is left open to interpretation by the many different vendors. Different players can support different profiles. They can also support different levels of decompression quality within a single profile, which can result in lower-quality playback on lower-speed devices. For example, postprocessing (deblocking and deringing) may not be applied—these tools are not required in the standard.

This implementation dependence is the source of much confusion among end users and developers alike. There’s no way to tell in advance whether any given .mp4 file is playable in a given player, and clear documentation can be hard to come by, particularly for devices.

MPEG-4 Profiles and Levels

When someone asks whether a piece of technology is “MPEG-4-compatible,” the real question is, “Which MPEG-4 profiles and levels does it support?” As in MPEG-2, a profile is a subset of the MPEG-4 standard that defines a set of features needed to support a particular delivery medium, like particular codec features. A level specifies particular constraints within a profile, such as maximum resolution or bitrate. Combined, these are described as profile@level, e.g., Advanced Simple Visual@Level 3. Compliance with profiles is an all-or-nothing proposition for a decoder: if a device can handle Advanced Simple Visual Profile Level 3, it has to be able to play any file that meets that specification. However, many players don’t conform to the official descriptions; Apple made up their own “H.264 Low Profile” for the initial video iPods.
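For H.264, the declared profile and level travel with the file in the “avcC” box inside the video track’s sample description. Here’s a minimal sketch that decodes the first four bytes of an avcC payload into the profile@level notation used above; locating the box itself is omitted, and the helper name and sample bytes are illustrative.

import struct

PROFILES = {66: "Baseline", 77: "Main", 88: "Extended", 100: "High"}

def describe_avcc(payload):
    """Turn the start of an 'avcC' payload into profile@level notation."""
    version, profile_idc, compat, level_idc = struct.unpack(">4B", payload[:4])
    profile = PROFILES.get(profile_idc, f"profile_idc {profile_idc}")
    return f"{profile}@L{level_idc / 10:.1f}"

print(describe_avcc(bytes([1, 66, 0xC0, 30])))   # -> Baseline@L3.0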

MPEG-4 Video Codecs

So far, there are two primary video codecs in MPEG-4, along with other codecs that can be carried in MPEG-4 files but aren’t part of the formal standard.

MPEG-4 Part 2

The original MPEG-4 codec was MPEG-4 Part 2. It’s also known as DivX and Xvid, after its most common implementations, and as ASP, after its most popular profile (Advanced Simple Profile). It’s a big topic, and hence the subject of the next chapter.

H.264

MPEG-4 Part 2 was a relative disappointment in the market, but H.264 is the real deal, and is rapidly becoming the leading codec in the industry. It’s the topic of Chapter 14.

VC-1

VC-1 is the SMPTE standardized version of the Windows Media Video 9 codec. A mapping for VC-1 into MPEG-4 has existed for a while, but it has only come into common use with Smooth Streaming. The default .ismv file format for Smooth Streaming is a fragmented MPEG-4 file containing either VC-1 or H.264 as the video codec. Smooth Streaming itself is covered in Chapter 27, and VC-1 in Chapter 17.

MPEG-4 Audio Codecs

MPEG-4 has a rich set of audio features. Like most platforms, it provides separate codecs for low bitrate speech and general-purpose audio.

Advanced Audio Coding (AAC)

Following in the footsteps of MP3 (a.k.a. MPEG-1 Layer III audio), AAC, born of MPEG-2, has become a dominant audio codec in the MP4 era.

It’s covered in Chapter 13.

Code-Excited Linear Prediction (CELP)

Code-Excited Linear Prediction (CELP) is a low-bitrate speech technology (related to the old ACELP.net codec in Windows Media) used in the 3GPP2 MPEG-4 implementation for mobile devices. It supports bitrates from 4 to 24 Kbps at 8 kHz or 16 kHz mono by spec, but most implementations just offer 6.8 and 14 Kbps versions. If available, you’d use 12 Kbps HE AAC over 14 Kbps CELP. QuickTime includes encode and decode.

Adaptive Multi-Rate (AMR)

Adaptive Multi-Rate (AMR) is another low-bitrate speech codec, also used in 3GPP and 3GPP2 devices. It offers from 1.8 to 12.2 Kbps, depending on implementation. QuickTime and Episode can both encode it, and QuickTime can play it back.
