CHAPTER 3
Fundamentals of Compression

This chapter is about the fundamental techniques used to get from uncompressed digital video and audio to compressed video and audio. There’s a lot of math implied in this chapter, but I’ll keep the equations to a minimum and describe what they mean for the more narratively inclined. While it’s not essential to master every detail of how codecs work, a working knowledge of what’s happening under the hood helps you understand the results you’re getting, spot problem areas, and plan strategies for overcoming challenges.

Even if you skip this chapter the first time around, you might want to try again while you’re waiting for that big encode to finish.

Compression Basics: An Introduction to Information Theory

The fundamentals of information theory were thought up almost entirely by a single guy—Bell Labs researcher Claude E. Shannon. His “A Mathematical Theory of Communication” remains the cornerstone of the entire field. Published in the Bell System Technical Journal in 1948, it addressed an important question: “How much information can we actually send over these telephone and telegraph lines?” and then gave a mathematical definition of information and its limits. Shannon might not be a household name like Edison or Einstein, but the field he founded is as central to our modern world as indoor lighting and nuclear physics.

Tempted as I am to deliver a book-length discourse on information theory, I’ll just summarize a few critical insights underlying video and audio compression.

Any Number Can Be Turned Into Bits

This is an obvious point, but bears repeating. The bit, a single 0 or 1, is the basis of all math on computers, and of information theory itself. Any number can be turned into a series of bits, be it an integer like “3,” a fraction like 7/9ths, or an irrational constant like pi (3.1415926535…). This means we can use bits as the basis for measuring everything that can be sampled and quantized.
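
As a tiny illustration, here’s a Python sketch turning an integer and a fraction into bits (the 16-bit cutoff for the fraction is an arbitrary choice for the example):

# Any number ends up as bits inside the machine.
print(format(3, "b"))        # the integer 3 -> "11"
print(format(200, "08b"))    # an 8-bit pixel value -> "11001000"

# A fraction like 7/9 has no finite binary form, so we approximate it;
# here are its first 16 fractional bits.
frac, bits = 7 / 9, ""
for _ in range(16):
    frac *= 2
    bits += str(int(frac))
    frac -= int(frac)
print("0." + bits)           # 0.1100011100011100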

The More Redundancy in the Content, the More It Can Be Compressed

English in plain text is rather predictable. For example, I’m much more likely to use the word “codec” at the end of this sentence than XYZZY. The more you understand the distribution of elements within the information you need to compress, the more efficient the compression can be. For example, the distribution of letters in English isn’t random: the letter “e” appears a lot more often than “q,” so it is more efficient to design a system where “e” can be coded in fewer bits than “q.” A table that translates from the original characters to shorter symbols is called a codebook. Shannon’s grad student David A. Huffman devised a method for generating an optimal codebook for a particular set of content (for a term paper, no less). Huffman coding uses shorter codes for more common symbols and longer codes for less common ones. Different codebooks work better or worse for different kinds of content. A codebook for English won’t work well for Polish, which has a very different distribution of letters. With an English codebook, Polish text would require more bits to encode, possibly even more bits than the uncompressed original file. The more you know about what’s going to be compressed, the better a codebook can be generated.
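
Here’s a minimal Python sketch of the idea, building a toy Huffman codebook from one short made-up sentence (not the table any real format uses): common symbols come out with short codes, rare ones with long codes.

# A minimal Huffman-coding sketch: merge the two least-frequent subtrees
# until one tree remains, growing each symbol's code as we go.
import heapq
from collections import Counter

def huffman_codebook(text):
    heap = [[weight, [symbol, ""]] for symbol, weight in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]      # left branch gets a 0 prefix
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]      # right branch gets a 1 prefix
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heapq.heappop(heap)[1:])

text = "the quick brown fox jumps over the lazy dog the end"
codes = huffman_codebook(text)
encoded = "".join(codes[c] for c in text)
print(f"{len(text) * 8} bits uncompressed, {len(encoded)} bits Huffman-coded")
print("e ->", codes["e"], " z ->", codes["z"])  # common 'e' never gets a longer code than rare 'z'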

Taken to the next level, the order of letters is also not random. For example, “ee” isn’t as common a sequence in English as “th,” even though “t” and “h” individually appear less often than “e.” So a coding scheme can be expanded to take into account the probability of a particular letter given the previous letters, assign symbols to a series of letters instead of a single one, or switch codebooks based on past letters.

Codebooks can be agreed upon in advance, so the decoder knows how to best decode the content in question. Multiple codebooks can be used, with a code at the start of the file or even changing throughout the file to indicate which codebook is to be used. Or a codebook can be dynamically created while the file is being compressed.

The converse of “more redundant content compresses better” is that less redundant content doesn’t compress as well. Truly random data won’t compress at all; no value is any more probable than another, so there isn’t a useful Huffman code. Thus the codebook plus the compressed data can wind up larger than the original data. Modern compressors leave that kind of content uncompressed, so the output file won’t be larger than the input file.

In fact, we use randomness as a measure of compressibility. Compression is sometimes called “entropy coding,” since what you’re really saving is the entropy (randomness) in the data, while the stuff that could be predicted from that entropy is what gets compressed away to be reconstructed on decode.

The More Efficient the Coding, the More Random the Output

Using a codebook makes the file smaller by reducing redundancy. Because there is less redundancy, there is by definition less of a pattern to the data, and hence the output looks random. You can look at the first few dozen characters of a text file, and immediately see what language it’s in. Look at the first few dozen characters of a compressed file, and you’ll have no idea what it is.

Data Compression

Data compression is compression that works on arbitrary content, like computer files, without having to know much in advance about their contents.

There have been many different compression algorithms used over the past few decades. Ones that are currently available use different techniques, but they share similar properties.

The most-used data compression technique is Deflate, which originated in PKWare’s .zip format and is also used in .gz files, .msi installers, HTTP compression, and many, many other places. Deflate was even used in writing this book—Microsoft Word’s .docx format (along with all Microsoft Office “.???x” formats) is really a directory of files that are then Deflated into a single file.

For example, the longest chapter in my current draft (“Production, Post, and Acquisition”) is 78,811 bytes. Using Deflate, it goes down to 28,869 bytes. And if I use an advanced text-tuned compressor like PPMd (included in the popular 7-Zip tool), it can get down to 22,883 bytes. But that’s getting pretty close to the theoretical lower limit for how much this kind of content can be compressed. That’s called the Shannon limit, and data compression is all about getting as close to that as possible.

Well-Compressed Data Doesn’t Compress Well

As mentioned earlier, an efficiently encoded file looks pretty much random. This means there isn’t any room for further compression, assuming the initial compression was done well. All modern codecs already apply data compression as part of the overall compression process, so compressed formats like MP3 or H.264 typically won’t compress further. The big exception is with authoring formats, especially uncompressed formats. You might get up to 2:1 compression with an uncompressed video file, or with an intraframe codec where several frames in a row are identical.

A time-honored Internet scam/hoax is the compression algorithm that can reduce the size of random data, and thus even its own output. Information theory demonstrates that such systems are all bunk.

General-Purpose Compression Isn’t Ideal

Generating a codebook for each compressed file expands the size of the file and adds to compression time. Ideally, a compression technology is tuned in advance to the structure of the data it will get. This is why lossless still image compression will typically make the file somewhat smaller than running general-purpose data compression on the same uncompressed source file, and will do it faster as well. We saw the same thing with the text compression example, where the text-tuned PPMd beat the general-purpose Deflate.

Small Increases in Compression Require Large Increases in Compression Time

There is a fundamental limit to how small a given file can be compressed, called the Shannon limit. For random data, the limit is the same as the size of the source file. For highly redundant data, the limit can be tiny. A file that consists of the pattern “01010101” repeated a few million times can be compressed down to a tiny percentage of the original data. However, real-world applications don’t get all the way to the Shannon limit, since it requires an enormous amount of computer horsepower, especially as the files get larger. Most compression applications have a controlling tradeoff between encoding speed and compression efficiency. In essence, these controls expand the amount of the file that is being examined at any given moment, and the size of the codebook that is searched for matches. However, doubling compression time doesn’t cut file size in half! Doubling compression time might only get you a few percentage points closer to the Shannon limit for the file. Getting a file 10 percent smaller might take more than 10 times the processing time, or be flat-out impossible.

Lossy and Lossless Compression

Lossless compression codecs preserve all of the information contained within the original file. Lossy codecs, on the other hand, discard some data contained in the original file during compression. Some codecs, like PNG, are always lossless. Others like VC-1 are always lossy. Others still may or may not be lossy depending on how you set their quality and data rate options. Lossless algorithms, by definition, might not be able to compress the file any smaller than it started. Lossy codecs generally let you specify a target data rate, and discard enough information to hit that data rate target. This really only makes sense with media—we wouldn’t want poems coming out with different words after compression!

Spatial Compression Basics

I’m going to spend a lot of time on spatial compression, because it forms the basis of all video compression.

Spatial compression starts with uncompressed source in the native color space, typically 4:2:0 at 8 bits per channel. Each channel is typically compressed by itself, so you can think of it as three independent 8-bit bitmaps being compressed in parallel. For simplicity, I’ll mainly be talking about single-channel compression and hence grayscale (Y’-only) images.

Spatial Compression Methods

There are a ton of spatial compression algorithms and refinements to those algorithms. In the following, I briefly touch on the most common and best-known. Most (but not all) modern video codecs use Discrete Cosine Transformation (DCT) for images, so we’ll focus on its description, and talk about wavelets some as well. First, though, a quick look at the lossless compression techniques used in codecs.

Run-Length Encoding

The simplest technique used to compress image data is Run-Length Encoding (RLE). In essence, the data is stored as a series of pairs of numbers, the first giving the color of the run and the second indicating how many pixels long that run is. This works very well for content like screenshots of Windows 2000 era operating systems, with long runs of the same value (although modern antialiasing, textures, and rich backgrounds make such content rarer). It can also work well for text, since a sharp edge isn’t any harder to encode than a soft one, although a soft, antialiased edge can require several samples. However, RLE is terrible at handling natural images such as photographs (or a modern OS like Vista/Windows 7’s Aero Glass or Mac OS X). Just a slight amount of random noise can cause the image to be as large as an uncompressed file, because if each pixel is even slightly different from its neighbor, there aren’t any runs of the same value. With a seemingly simple “01010101” repeating pattern, RLE wouldn’t be able to compress the file at all, since there is never more than one identical value in a row.
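
Here’s a minimal Python sketch of the idea (the pair format is made up for illustration), showing both the best case and the failure case:

# Run-length encode one row of pixel values into (value, count) pairs.
def rle_encode(row):
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

flat = [255] * 60 + [0] * 4           # a flat, screenshot-like row: just 2 runs
noisy = [200, 201, 200, 199, 201, 200, 202, 199]  # slight noise: no runs at all
print(rle_encode(flat))               # [[255, 60], [0, 4]]
print(rle_encode(noisy))              # one pair per pixel, so no savings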

Advanced Lossless Compression with LZ77 and LZW

As mentioned earlier, the more the structure of data is predictable, the more efficient the compression can be.

The first highly compressed file formats were based on either the LZ77 (Lempel Ziv 1977) or the LZW (Lempel-Ziv-Welch) algorithms (Abraham Lempel and Jacob Ziv have had a huge impact on digital media, and Terry Welch is no slouch either). I’ll spare the details, but they’re both essentially combinations of advanced RLE with Huffman coding. Both build a codebook as they go, and then look for repeated strings of the same symbols and assign those codes as well.
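
Here’s a minimal Python sketch of the dictionary-building half of LZW (the entropy-coding stage and every real format’s details are left out): repeated strings earn their own codes as the data is read.

def lzw_encode(data):
    codebook = {chr(i): i for i in range(256)}   # start with single characters
    next_code, current, output = 256, "", []
    for ch in data:
        if current + ch in codebook:
            current += ch                         # keep growing the match
        else:
            output.append(codebook[current])      # emit the longest known string
            codebook[current + ch] = next_code    # and learn a new, longer one
            next_code += 1
            current = ch
    if current:
        output.append(codebook[current])
    return output

text = "blah blah blah blah blah"
codes = lzw_encode(text)
print(f"{len(codes)} codes for {len(text)} characters")  # repeats collapse into few codes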

LZW was adopted by Adobe for TIFF and CompuServe for GIF (predating JPEG). Both were pretty straightforward implementations; they just started in the upper-left corner and encoded pixel-by-pixel. But after some years Unisys, which had wound up with the LZW patents, started to demand patent licensing fees for anyone using GIF files, which was pretty much the entire web at the time.

The PNG image format was born out of this kerfuffle, as an alternate image format based on the older LZ77, for which all patents had since expired. PNG offers dramatically better compression than either LZW-based format, not because the fundamental compression algorithm is that much more efficient, but because the format supports lossless filtering to better arrange image data for more efficient compression. Specifically, for each line, each pixel can be described as the difference from the pixel above, the pixel to the left, or the average of those two pixels. Unlike RLE, which relies on identical pixels, PNG is then able to use a few bits to encode slight differences. And the resulting pattern itself can have a lot of redundancy (every pixel in a black line under a white line will have the same difference, so the final lossless encode will be very efficient).
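
Here’s a rough Python sketch of the idea, applying a PNG-style “difference from the pixel to the left” filter to a made-up noisy gradient and using zlib’s Deflate as the lossless stage. Exact sizes will vary; the point is that the filtered data compresses better than the raw pixels.

import random, zlib

random.seed(1)
width, height = 64, 64
image = [[max(0, min(255, x * 4 + random.randint(-2, 2))) for x in range(width)]
         for _ in range(height)]

raw = bytes(v for row in image for v in row)

filtered = bytearray()
for row in image:
    previous = 0
    for v in row:                        # store small left-differences instead of pixels
        filtered.append((v - previous) % 256)
        previous = v

print("unfiltered:", len(zlib.compress(raw, 9)), "bytes,",
      "filtered:", len(zlib.compress(bytes(filtered), 9)), "bytes")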

This core concept of optimizing the structure of the data to make the final lossless encode more efficient is a critical one used in every delivery codec.

Arithmetic Coding

We have one last lossless technique to introduce: arithmetic coding. With a traditional Huffman codebook, the best you can hope for is for one symbol to be encoded in one bit. But each symbol requires a whole number of bits. Arithmetic coding allows for the average number of bits per symbol to be a fraction of a bit, reflecting their statistical probability.

The math is awesome but obtuse; the core fact here is arithmetic encoding allows for more efficient lossless encoding, getting much closer to the Shannon limit.
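
A quick Python sketch of why that matters: for a heavily skewed two-symbol source, Huffman can’t spend less than one whole bit per symbol, while the theoretical minimum (which arithmetic coding can approach) is far lower.

from math import log2

p = 0.95                               # probability of the common symbol (made-up example)
entropy = -(p * log2(p) + (1 - p) * log2(1 - p))
print("Huffman floor : 1.000 bits/symbol")
print(f"Entropy       : {entropy:.3f} bits/symbol")   # about 0.286 bits/symbol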

The most common use of arithmetic coding for lossless spatial image encoding is the Lagarith codec. Compared to the older Huffyuv lossless codec, Lagarith can do 30 percent better with typical video content and better yet with less noisy content by:

•  Encoding each channel independently (the Y’ values have a lot more in common with their neighbors than the Cb/Cr values of the same pixel)

•  Applying arithmetic encoding to the result

With arithmetic coding of video, we can sometimes get down to 4:1 lossless compression! That’s getting pretty close to the Shannon limit for typical film/video content with some noise. And that noise is the biggest part of that limit. With 8-bit video, if the last bit is essentially random, the best possible compression would still leave 12.5 percent of the original size (1 bit out of 8), even if the image is otherwise a completely flat color.

Still, 4:1 compression isn’t much. At DVD data rates (up to 9 Mbps peak for video), that would only give us enough bits for a 416 × 240 image. But DVD regularly uses half that as an average bitrate, and with 350 percent as many pixels. And modern codecs can do 3–4x better than MPEG-2.
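
Here’s the back-of-the-envelope version of that math in Python (assuming 8-bit 4:2:0 at 29.97 frames per second; treat the numbers as rough):

width, height, fps = 416, 240, 29.97
bits_per_pixel = 8 * 1.5                      # 4:2:0 = 1.5 samples per pixel
uncompressed = width * height * bits_per_pixel * fps
print(f"uncompressed : {uncompressed / 1e6:.1f} Mbps")       # ~35.9 Mbps
print(f"4:1 lossless : {uncompressed / 4 / 1e6:.1f} Mbps")   # ~9 Mbps, DVD's peak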

Lossless image compression doesn’t get us near where we need to be for video.

Discrete Cosine Transformation (DCT)

And so we leave the world of mathematically lossless compression of pixels and enter the world of frequency transformation and lossy compression, where we get the huge compression ratios required for modern media delivery.

The most common basic technique used for lossy image compression is DCT. While there have been plenty of other techniques proposed, nothing has come close to matching DCT and its derivatives for video encoding, and it’s extremely competitive for still images as well.

Before we dive into the details, the core concept behind DCT comes straight from Chapter 1—it’s the edges we see, not the flat area in between. Since it’s the changes between pixels that matter, DCT rearranges the image so that bits get spent on those changes instead of the pixels themselves, and so that those changes are what’s best preserved as lossy compression is applied.

But how?

Let’s start with a sketch of how classic JPEG compression works.

Fourier

The basic mathematical idea behind DCT was developed by Joseph Fourier a couple of centuries ago (he also discovered how greenhouse gases influence global temperature). Fourier proved that any series of numbers can be produced by a sufficiently complex equation. So for any sequence of pixel values, there’s an equation that can describe that sequence. This was a revolutionary concept leading to many further discoveries, including all digital video and audio processing.

The DCT

DCT is an implementation of Fourier’s discovery, and is a simple way to take a series of numbers and calculate coefficients for a series of cosines that produce the input series. A cosine itself is a simple oscillating wave: a frequency like the classic illustration of a radio wave. The coefficients of the DCT measure how tall that wave is (amplitude) and how quickly it goes from one peak to the next (frequency).

In classic JPEG and other codecs, the basic unit of compression is the 8 × 8 block. Figure 3.1 shows how a few different 8 × 8 blocks turn into frequencies.

Figure 3.1 A variety of source blocks and their resulting quant matrices and images using light compression.

image

Mathematically, you get one coefficient on output for each sample pixel, so putting 64 pixels of source video into the DCT will output 64 coefficients. The coefficients that result won’t fall into the 0–255 8-bit range the source was coded in; larger numbers can exist, requiring higher precision. So, just converting from pixel values to cosine coefficients didn’t yield any compression at all; in fact, we’ve lost ground so far.
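
For the curious, here’s a minimal Python sketch of the 8 × 8 forward DCT written straight from the cosine definition (real codecs use fast integer approximations of the same transform). The inputs are made-up blocks.

import numpy as np

def dct_8x8(block):
    n = 8
    result = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            cu = np.sqrt(1 / n) if u == 0 else np.sqrt(2 / n)
            cv = np.sqrt(1 / n) if v == 0 else np.sqrt(2 / n)
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x, y]
                              * np.cos((2 * x + 1) * u * np.pi / (2 * n))
                              * np.cos((2 * y + 1) * v * np.pi / (2 * n)))
            result[u, v] = cu * cv * total
    return result

# A flat gray block: all the energy lands in the single DC coefficient.
flat = np.full((8, 8), 128.0)
print(np.round(dct_8x8(flat)))          # 1024 in the corner, zeros elsewhere

# A smooth ramp: a few low-frequency AC coefficients, little else.
ramp = np.tile(np.arange(0, 256, 32, dtype=float), (8, 1))
print(np.round(dct_8x8(ramp)))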

But we have set the stage for the magic to happen.

As we look at the resulting DCT matrices, there are a few things to note:

•  The coefficient in the upper-left corner is the DC coefficient: that’s the base value for the whole matrix. A flat 8 × 8 block would have the luma of the block as the coefficient there, and all other values 0. The other values are AC* coefficients, and record frequencies across the block.

•  The further we get away from the DC* coefficient, the higher frequency is being coded. The upper-right coefficient is the highest horizontal frequency, and the lower-left coefficient is the highest vertical frequency.

•  The coefficients not in the top row or left column are frequencies that run at a diagonal to the image.

•  The coefficient value in each position is the amplitude of that frequency. A 0 means that there’s none of that frequency.

Since the human eye looks for changes and notices edges, the most important parts of the image have the highest amplitude (biggest changes). But the bigger those changes, the less the precise value of the change is critical. High frequencies (how quickly the changes happen) are very visible as well, but precise sharpness of the edge isn’t that big a factor; the difference between lines 1.1 pixels and 1.2 pixels wide is too subtle for us to notice in a moving image.

For a number of our sample matrices, we see bigger numbers in the upper left, going down as we head to the lower right. This reflects that most images don’t have worst case detail. And since that’s a predictable pattern, we’re now able to apply some data compression on this structure.

First, we need to go from the 2D array to a single series of values. Starting from the upper left, samples are pulled out of the matrix in a zigzag pattern (Figure 3.2).

Figure 3.2 The zigzag scan pattern with a DCT on top. Note how the scan pattern results in efficient RLE, with most of the last half of samples 0.

image

In JPEG, simple RLE compression is then applied (not very effective in most of these cases, given there’s almost always a little detail in the real-world images, but it’ll become more useful in the next step). And then Huffman coding is applied. The distribution of frequency coefficients is more predictable than for the original pixels, which should yield an improvement in compression.

But we also went from our original 8-bit values to higher precision, so we might not have gained anything in net. There are two obvious ways to further improve compression in this scheme:

•  Find a way to get back to 8-bit or even lower precision

•  Get more of the coefficients to have the same value, in order for the RLE to be more effective

But we must also make sure that we’re preserving the visually important detail as much as we can.

And this is where we move into lossy compression, and the point of this exercise. We can now start reducing the bitrate required.

DCT quantization

Now we want to start eliminating the data we don’t need visually, so that we’re only spending bits on “differences that make a difference.”

The first thing we’re going to do is reduce the precision of our matrix so we need fewer bits to store each value. Also, there are frequencies with such low amplitude that they aren’t visible; we can eliminate them as well. But the relative value of different frequencies can vary quite a bit:

•  The DC coefficient is really important.

•  The higher the frequency, the less important small differences in amplitude are.

•  We’re visually drawn to right angles, so frequencies that are fully horizontal or vertical matter more than diagonals.

Our tool here is the quantization matrix.

The classic JPEG matrix is shown in Figure 3.3.

Figure 3.3 The standard JPEG quant matrix.

image

The least-lossy compression in JPEG is simply to divide the coefficients in the source matrix by the value in the same position in the quant matrix, rounding to the nearest integer. A few things to note here:

•  The DC coefficient is 16. This compensates for the increase in precision past 8-bit yielded by the DCT. Using a coefficient of 16 will take 12-bit precision back down to 8-bit, as 2^4 = 16, and there are 4 fewer bits in 8 than 12. So this wouldn’t have a net impact on image fidelity.

•  The other quant matrix values increase with increasing frequency, and on the diagonal.

•  The 99 in the bottom right is lower than the surrounding quant values. This allows a sharp diagonal line to retain a little more accuracy.

•  Within those general rules, the numbers can seem pretty random. These tables are the result of a whole lot of simulation, and wind up being what works best on the content they were tested with more than any abstract theory. Don’t sweat the why of any particular value too much.
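
Here’s a minimal Python sketch of that divide-and-round step, using the widely published standard JPEG luminance table (which should match Figure 3.3) on some made-up coefficients:

import numpy as np

quant = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(coefficients, scale=1.0):
    return np.round(coefficients / (quant * scale)).astype(int)

def dequantize(levels, scale=1.0):
    return levels * (quant * scale)

# Hypothetical coefficients: strong low frequencies, weak high frequencies.
rng = np.random.default_rng(0)
coefficients = rng.normal(0, 1, (8, 8)) * (200 / (1 + np.arange(8)[:, None] + np.arange(8)))
levels = quantize(coefficients)
print("zero coefficients after quantization:", int(np.sum(levels == 0)))
print("worst reconstruction error:", float(np.max(np.abs(dequantize(levels) - coefficients))))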

Figure 3.1 shows what happens to the image with lightweight JPEG compression, and what the resulting coefficients turn out as.

Note how many of the low amplitude samples wind up as 0, which are obviously very easy to compress. The range of values that turn into a 0 in the output is called the “dead zone,” which is half the quant matrix coefficient for that point.

And we get more samples in a row, so RLE encoding is more efficient. And with the smaller range of more consistent numbers, the Huffman coding is more efficient as well.

And the images still look pretty much the same. Even though we’ve removed a whole lot of the mathematical detail, most of that wasn’t visual detail.

Figure 3.4 Comparing a simple gradient with and without noise over a range of quantization levels.

image

Scale factor and the quantization parameter

So we had some real gains with our initial quantization step. But what if the file still wasn’t small enough (and it probably isn’t)? We can simply scale up the values in the quant matrix. This is controlled by a “scale factor” which multiplies all the values in the quant matrix by a particular value.

The control for the scale factor is called the Quantization Parameter, or QP, which I’ll be talking about a few more hundred times in this book. Any time you’ve got a “quality” control in a codec, be it JPEG or VC-1, what it really controls is QP, and hence the scale factor used in the quant matrix. The classic “0–100” quality slider used in JPEG encoders the world over is setting the quantization parameter, with “100” the least quantized and “0” the most.

As the scale factor gets bigger, several things happen:

•  The dead zone gets bigger, so there are more zeros.

•  The remaining values wind up more “coarsely” quantized, with a wider range of different input amplitudes winding up with the same value.

•  Smaller numbers mean more will be identical, for better RLE, and the number of different possible values goes down, making for more efficient Huffman encoding.

•  Quality will start visibly degrading past a certain point, as the differences between the input and output frequencies, and the pixels that come out, grow bigger.
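
A tiny follow-on sketch (with made-up numbers) shows the first of those effects, the growing dead zone: for a quant value q and scale factor s, any coefficient whose magnitude is below q * s / 2 rounds to zero.

amplitudes = [3, 9, 20, 45, 80]        # some hypothetical AC coefficient amplitudes
q = 24                                 # one mid-frequency entry of a quant matrix
for s in (1, 2, 4, 8):
    zeroed = [a for a in amplitudes if a < q * s / 2]
    print(f"scale {s}: dead zone < {q * s / 2:>5.1f}, zeroed: {zeroed}")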

So, how does different content respond to different scale factors in compression efficiency and output quality?

Ringing and blocking

A gradient that varied in luma exactly like a cosine would be very easy to encode, since a single coefficient would do the job. But real-world complex images are more difficult to compress, because they require more coefficients and have a broader distribution of detail.

When quantization is pushed too high, there are two main types of artifacts—visible distortions—that can affect the image.

Blocking is just that–the 8 × 8 block structure (sometimes called a “basis pattern”) becomes visible, because the decoded values on the edges of adjoining blocks don’t match. This is because the quantization in the blocks changed or removed the frequencies that would have left the pixels at the edge of the block the same as the adjacent block. At high quantizations, you might wind up with adjoining blocks that are each just the flat shade of their DC coefficient. Unfortunately, that kind of sharp straight edge is exactly the kind of high-frequency, high-amplitude detail that we’d expect to be very visible to human vision. And it is. Minimizing blocking is one of the core goals of the compressionist.

Ringing (Figure 3.5) is an artifact at edges. A sharp edge is actually the hardest thing to encode with a DCT, since it’s a high-frequency, high-amplitude change. And a high-amplitude coefficient at a high frequency to capture that edge will oscillate over the whole block—remember, these are cosines. At low quantization, the codec uses coefficients at other frequencies to balance out those high frequencies. But as quantization goes up, the balancing frequencies become less accurate, and can even fall into the dead zone. Thus you’ll see weird distortions around sharp edges, like text. This is also called “mosquito noise.”

Figure 3.5 A sharp diagonal line is extremely hard to compress.

image

Chroma Coding and Macroblocks

So far we’ve really just been talking about the luma channel, but most content has color as well. Our visual acuity for chroma is less than for luma, and even subsampled chroma normally has much lower frequencies in general. So a different, flatter quant matrix is normally used for chroma; the most widely used is shown in Figure 3.6.

Figure 3.6 The standard JPEG chroma quant matrix.

image

Note that the chroma matrix is heavily weighted towards lower frequencies. Chroma and luma will share the same scale factor, and with this matrix, relatively more of chroma will fall into the dead zone than luma. Both chroma channels together add up to only half as many samples as luma with 4:2:0 subsampling, but chroma will wind up with an even smaller portion of bits in the encoded file due to more aggressive quantization and lower complexity.

Because color channels are subsampled, in 4:2:0 their 8 × 8 blocks actually cover a 16 × 16 area of the original image. This is called a “macroblock” (Figure 3.7), which is a combination of four 8 × 8 luma blocks and the corresponding 8 × 8 chroma blocks (one each for Cb and Cr).

Figure 3.7 The layout of blocks and samples within a macroblock. The dotted lines indicate the position of the four luma blocks in the macroblock. The subsampled chroma blocks cover the entire 16 × 16 area.

image

This macroblock structure is why codecs compress in blocks of 16 × 16 pixels. Formats that allow arbitrary resolutions encode full macroblocks, and just pad the pixels at the edge with blank data. So, a 240 × 180 file is actually coded as 240 × 192, with the bottom 12 pixels filled with bogus data and not drawn on playback. Ideally, keep resolution divisible by 16 to squeeze out the maximum bang for the bit.
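
The padding arithmetic is simple enough to sketch in a couple of lines of Python:

def padded(dimension, block=16):
    return -(-dimension // block) * block    # round up to the next multiple of 16

for w, h in [(240, 180), (320, 240), (1280, 720)]:
    print(f"{w} x {h} is coded as {padded(w)} x {padded(h)}")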

Finishing the Frame

And that’s how spatial compression is applied to a single block and then assembled into a macroblock. To make a full image, we just make an array of macroblocks.

And that, dear reader, is your introduction to JPEG and still image coding.

There are a few takeaways you might mull over:

•  DCT is all about putting the bits where they make a difference: frequencies.

•  Some frequencies matter more than others, hence the quant matrix.

•  More compression yields fewer and less accurate frequencies.

•  The classic visual distortions of DCT compression are blocking and ringing.

•  Easier-to-encode content can use a lower QP to hit the same file size compared to more difficult content, and hence will have higher quality.

For another taste of this in action, see Color Figure C.12, which compares two different images encoded with a low QP (high quality), a high QP (low quality), and QPs that yield the same file size.

Temporal Compression

So, we’ve got a frame. But we want a lot of frames. How?

The simplest approach is just to make a file where each frame is an independently coded still image. And that’s pretty much Motion JPEG right there.

But we still need more compression, and there’s an obvious next area of redundancy to target: most of the time, most of a frame is like the frame before it.

Prediction

Imagine a little two-frame video. The first frame used classic JPEG-style compression. We call that an intra-coded frame, or I-frame. An I-frame is self-contained and isn’t based on any other frames.

We need to start with an I-frame, but they’re not very efficient, so next we want to make a new frame based on (predicted from) the I-frame. And that predicted frame is called a P-frame.

To take the absurdly simplest case of prediction, a macroblock in the P-frame that’s exactly the same as the same macroblock on the I-frame can simply be coded as a “skip”—just copy the same pixels over.

In a more common case, the previous macroblock can be very similar, but have a slightly different noise pattern or something (Figure 3.8). In that case, the P-frame can encode just the difference between the I-frame’s macroblock and its own version. If they’re close, it’ll be a pretty good match, and the difference will have lower frequencies and hence will encode better.

Figure 3.8 In low–motion sequences, most parts of the image are pretty similar. Even with high motion, much of the frame matches the previous one.

image

It’s also possible that there’s just no good match at all. In that case, the P-frame can use an intra-coded macroblock, just like an I-frame.
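
Here’s a minimal Python sketch of that difference coding with made-up blocks: a near-match leaves only a small residual to encode, and a perfect match could simply be coded as a skip.

import numpy as np

rng = np.random.default_rng(3)
reference = rng.integers(0, 256, (16, 16)).astype(int)                 # stand-in I-frame block
current = np.clip(reference + rng.integers(-3, 4, (16, 16)), 0, 255)   # nearly identical block

residual = current - reference
print("pixel range in the block :", int(current.min()), "to", int(current.max()))
print("residual range to encode :", int(residual.min()), "to", int(residual.max()))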

Motion Estimation

Simple prediction is a start, but it’s too basic to be used in mainstream codecs. In real-world video, there’s often a ton of motion. If the camera panned left 10 pixels between the I- and P-frame, none of the macroblocks would be good matches as is, even though most of the P-frame’s image is seen in the I-frame.

Enter the motion vector. A motion vector specifies a block of pixels in the reference frame that’s copied into the P-frame; see Color Figure C.13. So in the case where the whole frame moved 10 pixels, each macroblock can grab the pixels 10 to the left in the reference frame, and have a very good match.

And since each P-frame macroblock can have its own motion vector, there’s no need for the whole frame to move at once. Multiple objects can be moving in multiple directions, with each part of the frame having an appropriate motion vector.

Of course, this all presumes that the codec knows where the matches are. And that’s harder than it sounds; human vision is really adept at picking out motion, but a computer has to do it one little bit of math at a time. Trying to find good motion vectors is where most of the CPU cycles of compression go, and the accuracy of that effort is the big difference between fast and slow-but-good encoding.
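
Here’s a minimal Python sketch of the brute-force version of that search, using the sum of absolute differences (SAD) as the match metric on a synthetic pan (real encoders use much smarter search patterns and metrics):

import numpy as np

def best_motion_vector(reference, current, bx, by, search=8, size=16):
    target = current[by:by + size, bx:bx + size].astype(int)
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + size > reference.shape[0] or x + size > reference.shape[1]:
                continue                                  # candidate falls off the frame
            candidate = reference[y:y + size, x:x + size].astype(int)
            sad = int(np.abs(target - candidate).sum())
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# A synthetic pan: the current frame is the reference shifted five pixels,
# so the match sits five pixels to the right in the reference.
rng = np.random.default_rng(7)
reference = rng.integers(0, 256, (64, 64)).astype(np.uint8)
current = np.roll(reference, -5, axis=1)
print(best_motion_vector(reference, current, bx=24, by=24))   # should report ((5, 0), 0)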

There’s one big downside of P-frames: to jump to a particular frame you’ll need to decode all the P-frames between it and the previous I-frame. Imagine an encode with an I-frame every 100 frames; frames 1 and 101 are I-frames, while frames 2–100 are P-frames. If you want to play just frame 78, it’s predicted from frame 77, so you’ll need to decode that. And frame 77 is predicted from 76, and 76 is predicted from 75, all the way back to 1. So doing random access to a particular frame can be a lot slower, particularly when the I-frames are far apart.

Bidirectional Prediction

So, I- and P-frames are all that are needed for interframe compression, but it was discovered in the early days that having a frame based on the previous and next frame added further efficiency.

And thus the B-frame. A B-frame can be based on the previous and the next frame. And it can use motion vectors from either. That yields better efficiency, which is a big reason B-frames are used.

Note that since a B-frame is based on the next reference frame, the display order is different than the encoded order. So if the video is displayed as:

image

it would actually be encoded, transmitted, and decoded as:

image

So the P-frame following a B-frame is encoded and then decoded before the B-frame, even though the B-frame comes first in display order.
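
Here’s a minimal Python sketch of that reordering, assuming a simple repeating I/B/B/P pattern (real codecs vary the pattern):

display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

coded_order, pending_b = [], []
for frame in display_order:
    if frame.startswith("B"):
        pending_b.append(frame)          # hold B-frames until their next reference arrives
    else:
        coded_order.append(frame)        # send the reference (I or P) first...
        coded_order.extend(pending_b)    # ...then the B-frames that depend on it
        pending_b = []
coded_order.extend(pending_b)

print("display:", display_order)
print("coded  :", coded_order)           # I0 P3 B1 B2 P6 B4 B5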

Another property of B-frames is that they’re not actually needed for playback. That means they can be dropped from transmission under bandwidth stress, dropped from decode under CPU stress, or skipped when doing random access. Also, since B-frames aren’t reference frames, they can be given just enough bits to not have distracting artifacts without worrying about being a good basis for future frames. This saves bits that can be used to improve the reference frames, which in turn gives the B-frames more accurate references and thus higher quality (Figure 3.9).

Figure 3.9 An example of GOP structure. I-frames are self-contained, P-frames reference the previous P- or I-frame, and a B-frame can reference the previous and next I- or P-frame, but nothing references them.

image

And that’s it for our high-level overview of interframe compression. Again, a few take-homes:

•  The more motion in the video is on a 2D plane, the more efficient motion vectors are.

•  Noise will make motion estimation less efficient, since the matches will be less accurate.

•  Random access is an interplay between frequency of I-frames and B-frames; the more B-frames you have, the longer you can go without I-frames.

•  B-frames are also very helpful for scalability.

Rate Control

There’s one last part to our fundamental video coding exercise: rate control. So far, we’ve talked about how quantization controls frame size, but only indirectly. Very complex frames will use a lot more bits than very simple frames, even with the same QP.

The simplest style of rate control would be to have every frame be the same size, varying the QP. But this could yield really horrible quality: the most interesting frames would look the worst, while we might use more bits than needed on the easy ones. What we want is to distribute the bits so that quality is consistent, and as consistently high as possible.

So, with rate control, you’ll see the I-frames being biggest, P-frames smaller, and B-frames smaller yet. And within the allowed buffer of the encoding mode (more about those in Chapter 7), harder frames will get more bits than easier frames.
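
Here’s a deliberately crude Python sketch of the idea (a toy model, not any real encoder’s algorithm): nudge QP up when a frame overshoots its share of the bit budget, and down when it undershoots.

def simulate(frame_complexities, target_bits):
    qp = 26                                    # arbitrary starting QP
    for complexity in frame_complexities:
        bits = complexity / qp                 # toy model: bits spent fall as QP rises
        if bits > 1.2 * target_bits:           # overshot the budget: quantize harder
            qp = min(51, qp + 2)
        elif bits < 0.8 * target_bits:         # undershot: spend more on quality
            qp = max(1, qp - 2)
        print(f"complexity {complexity:>7}  spent {bits:>7.0f} bits  next frame qp {qp}")

# A stretch of easy frames, a burst of hard ones, then easy again.
simulate([50_000, 50_000, 300_000, 300_000, 300_000, 50_000, 50_000], target_bits=5_000)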

There’s a whole academic discipline around how to do this, called Rate Distortion Theory. Yeah, Claude Shannon thought of this too. Fortunately, you don’t really need to tune this yourself; PhDs bake it into the encoders. But good rate control is one of the things that makes a good codec; two implementations can seem generally the same, but the one that spends the bits where they can best be spent will make much better-looking video.

An example of more advanced rate control is when a codec discovers that spending more bits on a particular I-frame lets it be a better match for following frames, reducing the bitrate required for those. Simple constant quality would have wound up being less efficient, with the I-frame lower quality and later frames spending more bits to get more accurate matches.

Beyond MPEG-1

The previous discussion was a very high-level description of a very basic encoder like that used in MPEG-1. The ones you’ll use in the real world will have many more features and unique twists. Each description of the major codecs will include an overview of its critical differences and enhancements from the description in this chapter.

Perceptual Optimizations

But there’s one last critical concept to introduce, even though it’s not really applicable to the simple model we’ve described so far: perceptual optimizations.

A perceptual optimization is any adjustment to improve the perceived quality of the image instead of its mathematical accuracy. A simple example used in some advanced codecs is Differential Quantization (DQuant), which allows different blocks in the same frame to use different quantization parameters. From a pure math perspective, giving every block the same quantization is the most efficient approach, as it minimizes the total difference between source and output. But visually, some differences matter a lot more than others! So, as shown in Color Figure C.14, DQuant would pick out blocks that include flatter gradients or sharper edges, and give them more bits and less quantization. The flip side of that is more textured blocks will get fewer bits and more quantization. But as we saw in our DCT examples, quantization impacts the appearance of those sharp edges a lot more than it does noisy texture detail.

Alternate Transforms

I’m pretty well sold on DCT-like transforms as fundamentally a lot better than any of the alternative technologies once interframe coding comes into play.

The other transforms, particularly wavelets, are being put to good use in still image formats.

Wavelet Compression

Wavelet compression uses a very different transform from DCT that’s popular for still image compression (notably JPEG2000), and is used in some experimental video codecs.

The fundamental idea behind wavelets is similar to DCT—turn a sequence of samples into an equation—but that’s where the similarities stop. Wavelet images are built out of a series of images, starting small and doubling in height and width until the final image is built at the file’s resolution. Each successively higher resolution is predicted from the previous, storing the additional detail of that band. The space savings from this technique can be tremendous—each band is predicted from the previous sub-band, so bits are only spent where there’s detail not in the sub-band. Thus for flat areas or gradients, like clouds or ripples in water, the later bands need to add very little data. Conversely, more detailed images, such as faces in a crowd or film grain, might require a lot of detail in each new band.

One useful aspect of these bands is only as many bands as desired need to be decoded; the process can stop at any band. So a low-resolution proxy view is pretty much a free feature.
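
Here’s a minimal Python sketch of one decomposition step using the simple Haar transform (real wavelet codecs use longer filters, but the band structure is the same): a half-resolution base band plus three detail bands, and for smooth content nearly all the energy stays in the base band.

import numpy as np

def haar_step(image):
    a = image[0::2, 0::2].astype(float)      # the four pixels of each 2 x 2 block
    b = image[0::2, 1::2].astype(float)
    c = image[1::2, 0::2].astype(float)
    d = image[1::2, 1::2].astype(float)
    low = (a + b + c + d) / 4                # half-resolution "thumbnail" band
    horizontal = (a - b + c - d) / 4         # detail bands: what the thumbnail
    vertical = (a + b - c - d) / 4           # is missing in each direction
    diagonal = (a - b - c + d) / 4
    return low, horizontal, vertical, diagonal

# A smooth gradient: the detail bands cost nearly nothing to store.
gradient = np.add.outer(np.arange(64), np.arange(64))
low, h, v, d = haar_step(gradient)
print("low band energy   :", int((low ** 2).sum()))
print("detail band energy:", int((h ** 2).sum() + (v ** 2).sum() + (d ** 2).sum()))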

Good wavelet implementations don’t have nearly the problems encoding text as do classic JPEG-like DCT codecs (although more modern DCT-derived codecs like H.264 do much better in that regard). They also don’t block in the same way, as there’s no 8 × 8 basis pattern. Thus wavelets tend to have less objectionable artifacts at lower rates, though the images can still get quite soft.

Alas, wavelets aren’t nearly so promising for use in interframe codecs, because there’s not really an efficient way to use motion vectors to predict a future wavelet. There have been many attempts at making wavelet interframe video codecs over the years, and several are ongoing like the BBC’s Dirac, but none have ever been competitive.

So far, the primary uses of wavelets have been in JPEG2000 and Cineform, with interest in the “VC-2” I-frame-only subset of Dirac. None use interframe encoding.

Fractal Compression

Fractals were the rotary engine of the 1990s—the technology that promised to make everything wonderful, but never actually turned into mainstream delivery technologies. The idea behind fractals is well-known to anyone who breathlessly read James Gleick’s Chaos: Making a New Science (Penguin, 1988). In its classic definition, a fractal image is able to reuse parts of itself, transformed in many ways, in other parts of the image. This concept is based on self-similar objects, so in fractal compression, things like ferns or trees can compress very well. Fractal compression isn’t so useful with people—our fingers don’t have smaller fingers hanging off their ends. Fractal compression can be thought of as extending wavelet style sub-band prediction to rotation, scaling, and other kinds of transformations. This mathematical approach is also called an Iterated Function System, after which Iterated Systems, the company that tried to bring the technology to market, is named.

Figure 3.10 A sample wavelet encode, showing how progressive bands improve quality, and (in the upper – right corner) what detail gets added in each band.

image

The problem is that actually compressing a fractal image is hugely difficult. A fractal codec needs to see whether one part of an image matches another after being rotated, scaled, flipped, and otherwise looked at from every which way. All of which requires an insane amount of CPU power, and led to the concept of the Graduate Student Algorithm, wherein a grad student is locked in a room with a workstation for a few days to find the best compression for the image. Needless to say, the Graduate Student Algorithm wasn’t very practical in the real world.

Iterated Systems did ship a fractal video codec called ClearVideo; it was never particularly successful, and was eclipsed by Sorenson Video. Iterated’s still image fractal technology, FIF, has gone through a couple of corporate owners, and is now mainly pitched as an algorithm that offers better scaling.

The real advantage of these fractal compressors was that they offered resolution independence (the image could be decompressed at a higher resolution with fewer artifacts than other algorithms), and that a lower-quality image could be generated from the beginning of the file, à la wavelets.

Is a Picture Worth a Thousand Words?

We’ve all heard the old saying that it is. And now, with the power of compression, we check to see how good a picture we can make with the information in a thousand words. First, I’m going to take the first thousand words of this chapter, and save it as a text file. To make it a fair comparison, I’m going to take the redundancy out of the file via Deflate at default compression. This gives a file size of 2793 bytes. Not a lot to play with, but we can definitely make some kind of image with it.

Then I’ll use the same number of bytes to create a grayscale JPEG instead. (I’m using grayscale because “a picture is worth a thousand words” was generally associated with newspapers back in the day, when newspapers were black-and-white.)

So, is Figure 3.11 worth a thousand words?

Figure 3.11 The picture worth a thousand words…worth of bytes.

image

Audio Compression

Audio compression uses a lot of the same principles as video compression, but with twists, mainly due to the differences in how we hear instead of see.

Sub-Band Compression

In the same way that DCT video compression turns samples into frequencies and removes frequencies that are redundant or hard to see, sub-band codecs convert samples to frequencies and remove frequencies that are redundant or hard to hear.

First the audio is broken up into discrete “frames” of between 20 and 500 per second. (These don’t have any relationship to video frames the audio might be attached to.) Then, each frame is turned into anywhere from two to more than 500 frequency bands. Different bands have varying complexity, and lower complexity bands take up less space; thus, a lossless compression stage is able to take good advantage of redundancy and predictability.

Psychoacoustic models can then remove any masked sounds. As discussed in the previous chapter, you can’t hear a quiet sound next to a loud sound of the same frequency. Nor will you hear a quiet sound of the same frequency as a loud sound that precedes it closely in time. Using these psychoacoustic techniques, the codec is able to remove further unnecessary information.

Another important technique is used with stereo audio. Left and right channels almost always contain redundant information, so it would be inefficient to encode them separately. For example, the left and right channels can be mixed together and encoded, with the difference between them encoded separately. This substantially reduces the data rate requirements for stereo audio.
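
Here’s a minimal Python sketch of that mid/side idea with synthetic tones: the difference signal is tiny compared to either channel, and the decoder can rebuild both channels exactly.

import numpy as np

t = np.linspace(0, 1, 48_000, endpoint=False)
left = np.sin(2 * np.pi * 440 * t)                        # a 440 Hz tone...
right = 0.98 * left + 0.02 * np.sin(2 * np.pi * 523 * t)  # ...nearly the same on the right

mid = (left + right) / 2                                  # the mono mix
side = (left - right) / 2                                 # what differs between channels

print("right channel RMS:", round(float(np.sqrt((right ** 2).mean())), 4))
print("side channel RMS :", round(float(np.sqrt((side ** 2).mean())), 4))
# The side signal is far smaller, so it needs far fewer bits;
# the decoder rebuilds left = mid + side and right = mid - side exactly.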

There are a lot of variations, tweaks, and flavors of these techniques. Most general-purpose audio codecs are based on them.

Of course, beyond a certain point, audio compression degradation will become audible. At that point, the codec tries to provide the best possible quality given the available bits.

Percussive sounds, which have relatively random waveforms, and hence sub-bands with rapidly changing values, are the hardest for most codecs to compress. Cymbals and hi-hats are typically the most difficult to encode, being random and high-pitched. Sometimes very clean, well-produced audio is difficult to compress as well, since even slight distortions are audible.

Audio Rate Control

Unlike video codecs, most audio codecs are used as constant bitrate (CBR)—each frame takes up the same amount of bandwidth. Some codecs like MP3 allow a bit-buffer, where extra bits from previous frames can be used in later ones. There are also many codecs with quality-based variable bitrate (VBR) modes, where a fixed quantization is used, and bitrate will vary as needed.

There are plenty of VBR 1-pass audio codecs. LAME for MP3 and Ogg Vorbis are probably the best-known of these. The VBR mode of MP3 means the data rate of each frame can vary within the allowed range of 8 Kbps to 320 Kbps. So, when the audio is silent or extremely simple in nature, the frame will be 8 Kbps, with the data rate rising as the content becomes more complex. Apple’s AAC-LC includes a pretty good 1-pass VBR mode as well. So far, Microsoft’s WMA codecs have the only broadly used full rate-controlled 2-pass VBR modes that deliver predictable file sizes.

* DC comes from “Direct Current” since the concept was originally developed in talking about electrical circuits. DC power, like from a battery, is continuous, like the flat color of the block. Similarly, AC comes from “Alternating Current.” AC power alternates its voltage in a cosine pattern, like the cosines that the AC coefficients define.
