CHAPTER 2
Uncompressed Video and Audio:
Sampling and Quantization

This chapter is about how we turn light and sound into numbers. Put that way, it sounds simple, but there are a fair number of technical details and even some math-with-exponents to follow. Although it may seem that all the action happens in the codecs, a whole lot of the knowledge and assumptions about human seeing and hearing we talked about in the last chapter get baked into uncompressed video and audio well before it hits a codec.

Sampling and Quantization

In the real world, light and sound exist as continuous analog values. Those values in visual terms make up an effectively infinite number of colors and details; in audio terms, they represent an effectively infinite range of amplitude (loudness) and frequency. But digital doesn’t do infinite. When analog signals are digitized, the infinite continuous scales of the analog world must be reduced to a finite range of discrete bits and bytes. This is accomplished via processes known as sampling and quantization. Sampling defines the discrete points or regions that are going to be measured. Quantization defines the actual value recorded.

Sampling Space

Nature doesn’t have a resolution. Take a standard analog photograph (one printed from a negative on photographic paper, not one that was mass-produced on newsprint or magazine paper stock via a printing press). When you scan that photo with a scanner, you need to specify a resolution in dots per inch (dpi). But given a good enough scanning device, there is no real maximum to how much detail you could go into. If you had an electron microscope, you could scan at 1,000,000 dpi or more, increasing resolution far past the point where the individual grains of the photographic emulsion are visible. Beyond a certain point, you won’t get any more visual detail from the image, just grain patterns.

Sampling is the process of breaking up an image into discrete pieces, or samples, each of which represents a single point in 2D space. Imagine spreading a sheet of transparent graph paper over a picture. A sample is one square on the graph paper. The smaller the squares, the more samples. The number of squares is the resolution of the image. Each square is a picture element or pixel (or in MPEG-speak, a pel).

Most web codecs use square pixels, in which the height and width of each sample are equal. However, capture formats such as DV and ITU-R BT.601 often use nonsquare pixels, which are rectangular. Lots more on that later.

Color Figure C.9 demonstrates the effect of progressively coarser sampling on final image quality.

Sampling Time

The previous section describes how sampling works for a single frame of video. But video doesn’t have just a single frame, so you need to sample temporally as well. Temporal sampling is normally described in terms of fps or Hertz (cycles per second, abbreviated Hz). It’s relatively simple, but explains why video is so much harder to compress than stills; we’re doing a bunch of stills many times a second. We sample time more accurately as we capture more frames per second. We generally need at least 15 fps for motion to seem like motion and not a slide show, at least 24 fps for video and audio to appear in sync, and at least 50 fps for fast motion like sports to be clear.

Sampling Sound

Digitized audio, like video, is sampled. Fortunately, audio data rates are lower, and so it’s a lot easier to store audio in a high-quality digital format. Thus, essentially all professional audio is uncompressed.

Video has two spatial dimensions (height and width), but audio has only one: loudness, which is essentially the air pressure at any given moment. So audio is stored as a series of measurements of loudness. In the analog world, like when sound is transmitted via speaker cables, those changes are changes in the voltage on the wire. The frequency at which loudness changes are sampled is called the sampling rate. For CD-quality audio, that rate is 44.1 kHz (44,100 times per second). Other common sampling rates include 22.05 kHz, 32 kHz, 48 kHz, 96 kHz, and 192 kHz (96 kHz and 192 kHz are almost exclusively used for authoring, not for delivery). Consumer audio is almost never delivered at more than 48 kHz.
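To get a feel for the numbers, here’s the arithmetic for uncompressed CD audio as a quick Python sketch (just multiplying out the figures above; 16-bit quantization is covered under “Quantizing Audio” later in this chapter):

    # Uncompressed data rate of CD audio: sample rate x bits per sample x channels
    sample_rate = 44100      # samples per second, per channel
    bits_per_sample = 16     # CD quantization (see "Quantizing Audio" below)
    channels = 2             # stereo

    bits_per_second = sample_rate * bits_per_sample * channels
    print(bits_per_second / 1000)        # 1411.2 kbps
    print(bits_per_second / 8 / 1024)    # about 172 KB per second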

Nyquist Frequency

A concept crucial to all sampling systems is the Nyquist theorem. Harry Nyquist was an engineer at Bell Labs (which will come up more than once in the history of digital media). Back in the 1920s, he proved that to reproduce a signal accurately, you need a sampling rate of more than twice the highest frequency in that signal. The frequency and signal can be spatial, temporal, or audible.

The classic example of the Nyquist theorem is train wheels. I’m sure you’ve seen old movies where the wheels on a train look like they’re going backwards. As the train speeds up, the wheels go faster and faster, but after a certain point, they appear to be going backwards. If you step through footage of a moving train frame by frame, you’ll discover the critical frequency at which the wheel appears to start going backwards is 12 Hz (12 full rotations per second)—half the speed of the film’s 24 fps.

Figure 2.1 shows what’s happening. Assume a 24 fps film recording of that train wheel. I’ve put a big white dot on the wheel so you can more easily see where in the rotation it is. In the first sequence, the wheel is rotating at 6 Hz. At 24 fps, this means we have four frames for each rotation. Let’s increase the wheel speed to 11 Hz. We have a little more than two frames per rotation, but the motion is still clear. Let’s go up to 13 Hz. With fewer than two frames per rotation, the wheel actually looks like it’s going backwards! Once you get past 24 Hz, the wheels look like they’re going forward again, but very slowly.

Figure 2.1 The infamous backwards-train-wheel illusion illustrated.


Alas, there is no way to correct for a sampling rate that’s too low to accommodate the Nyquist frequency after the fact. If you must have train wheels that appear to be going forward, you need slow wheels or a camera with a fast frame rate!

Another way to think about this is with a sine wave. If you sample a sine wave, as long as the source frequency is half or less of the sampling frequency, you can easily draw a new sine wave between the samples. However, if the sample rate is less than twice the frequency of the sine wave, what looks correct turns out to be completely erroneous.
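You can watch this happen numerically. The little Python sketch below (illustrative only) samples a wave at 24 Hz—the film rate from the train example—once at 11 Hz and once at 13 Hz; the 13 Hz wave, being past the 12 Hz Nyquist limit, produces exactly the same samples as the 11 Hz one.

    import math

    fs = 24.0                       # sampling rate, like 24 fps film
    times = [n / fs for n in range(8)]

    def sample_wave(freq):
        # measure a cosine wave of the given frequency at our sample times
        return [round(math.cos(2 * math.pi * freq * t), 6) for t in times]

    print(sample_wave(11.0))   # below the 12 Hz Nyquist limit
    print(sample_wave(13.0))   # above it: identical samples, aliased to 11 Hz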

You’ll encounter the same issue when working with images—you can’t show spatial information that changes faster than half the spatial resolution. So, if you have a 640-pixel-wide screen, you can have at most 320 variations in the image without getting errors. Figure 2.2 shows what happens to increasing numbers of vertical lines. Going from 320 × 240 to 256 × 192 with a 3-pixel-wide detail is okay; 256/320 × 3 = 2.4, still above the Nyquist limit of 2. But a 2-pixel-wide detail gives 256/320 × 2 = 1.6—below the Nyquist limit, so we get an image that’s quite different from the source.

Figure 2.2 Nyquist illustrated. The far left-hand detail is a little over the Nyquist limit after scaling; the far right-hand detail is below it after scaling, resulting in a pattern unrelated to the source.

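The same check from the figure can be written as a couple of lines of arithmetic (a sketch of the scaling math above, nothing more):

    def scaled_detail_width(detail_px, src_width, dst_width):
        # how wide a repeating detail ends up after scaling, in output pixels
        return detail_px * dst_width / src_width

    print(scaled_detail_width(3, 320, 256))   # 2.4 -- at least 2 pixels, survives
    print(scaled_detail_width(2, 320, 256))   # 1.6 -- under the Nyquist limit, aliases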

This problem comes up in scaling operations in which you decrease the resolution of the image, because reducing the resolution means frequencies that worked in the source won’t work in the output. Good scaling algorithms automatically filter out frequencies that are too high for the destination output; algorithms that don’t can cause serious quality problems. You can sometimes see the result when thumbnail-sized pictures look unnaturally sharp and blocky. There’s a lot more on this topic in the preprocessing chapter.

Quantization

You’ve learned that sampling converts analog sounds and images into discrete points of measurement. The process of assigning discrete numeric values to the theoretically infinite possible values of each sample is called quantization. We speak of quantization that has a broader range of values available as being finer, and of that with a smaller range of values available as being coarser. This isn’t a simple matter of finer being better than coarser; the finer the quantization the more bits required to process and store it. So a lot of thought and testing has gone into the ideal quantization for different kinds of samples to accurately capture the range we care about without wasting bits on differences too small to make an actual visible or audible difference.

In quantizing, we’re making everything into numbers. And because a computer of some flavor is doing all the math (even if it’s just a tiny chip in a camera), that math is fundamentally binary math, made of bits. A bit can have one of two values—one or zero. Each additional bit doubles the number of values that can be represented. For example, a two-bit sample can represent four possible values; three bits gives eight possible values; eight bits provides 256 possible values. The more bits, the finer the quantization.
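The doubling is easy to see in a line of Python:

    for bits in (1, 2, 3, 8, 10, 16):
        print(bits, "bits =", 2 ** bits, "possible values")   # 2, 4, 8, 256, 1024, 65536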

Bits themselves are combined into bytes, which are 8 bits each. Processing on computers is done in groups of bytes, themselves going up by powers of two. So, typically processing is done in groups of 1 byte (8-bit), 2 bytes (16-bit), 4 bytes (32-bit), 8 bytes (64-bit), et cetera. Something that requires 9 bits is really going to be processed as 16-bit, and so can require twice as much CPU power and memory to process. So, you’ll see lots of 8-bit, 16-bit, and 32-bit showing up in our discussion of quantization. We’ll also see intermediate sizes like 10-bit, but those are more about saving storage space compared to the full 16-bit size; 10-bit is still as slow as 16-bit in terms of doing the calculations.

Quantizing video

Basic 8-bit quantization, and why it works

Continuing our graph paper metaphor, consider a single square (representing a single sample). If it’s a solid color before sampling, it’s easy to represent that solid color as a single number. But what if that square isn’t a single solid color? What if it’s partly red and partly blue? Because a sample is a single point in space, by definition, it can have only one color. The general solution is to average all the colors in the square and use that average as the sample’s color. In this case, the sample will be purple, even if no purple appeared in the original image. See Color Figure C.10 for an example of this in practice.

The amount of quantization detail is determined by bit depth. Most video codecs use 8 bits per channel. Just as with 8-bit audio systems, 8-bit video systems provide a total of 256 levels of brightness from black to white, expressed in values between 0 and 255. Unlike 8-bit audio (a lousy, buzzy-sounding mess with no dynamic range, and which younger readers may have been fortunate enough never to hear), 8-bit video can be plenty for an extremely high-quality visual presentation. Note, however, that Y′CbCr (also called, frequently but inaccurately, YUV) color space video systems (more on what that means in a moment) typically use only the range from 16 to 235 for the luminance of typical content, while RGB color space computer systems operate across the full 0 to 255 range. So, RGB and Y′CbCr are both quantized to an 8-bit number, but in practice the range of RGB has 256 values and the range of Y′CbCr is 219.

In the previous chapter, we discussed the enormous range of brightness the human eye is capable of perceiving—the ratio between the dimmest and brightest light we can see is about one trillion to one. Fortunately, we don’t need to code for that entire range. To begin with, computer monitors and televisions aren’t capable of producing brightness levels high enough to damage vision or deliver accurate detail for dim images. In practice, even a good monitor may only have a 4000:1 ratio, and even that may be in limited circumstances. Also, when light is bright enough for the cones in our eyes to produce color vision, we can’t see any detail in areas dark enough to require rods for viewing. In a dark room, such as a movie theater, we can get up to about a 100:1 ratio between the brightest white and the darkest gray that is distinguishable from black. In a more typical environment containing more ambient light, this ratio will be less. We’re able to discriminate intensity differences down to about 1 percent for most content (with a few exceptions). Thus, at least 100 discrete steps are required between black and white; our 256 or 219 don’t sound so bad.

Note that when I say we can distinguish differences of about 1 percent, I’m referring to differences between two light intensity levels. So, if the background has 1,000 lux of light, we can tell the difference between that 1,000 and an object at 990 (1 percent less) or 1,010 (1 percent more). But if the background is 100 lux, we can distinguish between 99 and 101. It’s the ratio between the light levels that’s important, not the actual amount of light itself. So, the brighter the light, the bigger a jump in absolute light levels we need to perceive a difference.

We take advantage of this by making a quantization scale that uses fixed ratios instead of fixed light levels, which we call “perceptually uniform.” In a perceptually uniform scale, we code the differences in terms of how they would be perceived, not how they’re produced. With a perceptually uniform luma scale, the percentage (and thus the perceived) jump in luminance looks the same from 20 to 21 as from 200 to 201, and the difference from 20 to 30 the same as from 200 to 210. This means that the actual light emitted increases exponentially as we step up through the values. But it makes the math a lot simpler and enables us to get good results in our 219 values between 16 and 235, which is good; this means we can do our video in 8 bits, or one byte, which is the optimal size to process and store.

The exponential increase used to make the scale perceptually uniform is called gamma, after the Greek letter γ. The nominal gamma for video these days is a uniform 2.2. This was a hard-won battle; in the first edition of this book I had to talk about codecs that used gammas from 1.8 to 2.5, and video often had to be encoded differently for playback on Mac and Windows computers! It was a pleasure to delete most of that material for this edition.
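As a rough sketch of how a perceptually uniform code value gets made, here’s a simplified pure power-law model in Python (real transfer functions such as BT.709 add a small linear segment near black, which this ignores); it maps normalized linear light onto the 16–235 video range:

    def encode_luma(linear_light, gamma=2.2):
        # linear_light: 0.0 (black) to 1.0 (reference white)
        perceptual = linear_light ** (1.0 / gamma)     # gamma correction
        return round(16 + perceptual * (235 - 16))     # map onto video range

    for light in (0.0, 0.01, 0.1, 0.5, 1.0):
        print(light, "->", encode_luma(light))   # 16, 43, 93, 176, 235

Notice how the bottom 1 percent of linear light gets 27 of the code values; that’s the perceptual weighting at work.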

One cool (and sadly historical) detail is that good old CRT displays had essentially the same 2.2 relationship between input power and displayed brightness as does the human eye. This meant that the actual electrical signal was itself perceptually uniform! And thus it was really easy to convert from light to signal and back to light without having to worry.

In the LCD/plasma/DLP era, the underlying displays have very different behavior, but they all emulate the old CRT levels for backwards compatibility. Unfortunately, they can’t all do it equally well, particularly at low black levels, and particularly with LCDs. We’ll talk about the implications of that when we get to preprocessing.

Gradients and Beyond 8-bit

With sufficiently fine quantization, a gradient that starts as total black and moves to total white should appear completely smooth. However, as we’ve learned, digital systems do not enjoy infinite resolution. It’s possible to perceive banding in 8-bit per channel systems, especially when looking at very gradual gradients. For example, going from Y’ 20 to Y’ 25 across an entire screen, you can typically see the seams where the values jump from one value to another. But you aren’t actually seeing the difference between the two colors; cover the seam with a little strip of paper, and the two colors look the same. What you’re seeing is the seam—revealed by our visual system’s excellent ability to detect edges. We solve that problem in the 8-bit space by using dithering (more on that in the discussion on preprocessing). But dithering is a lossy process, and only should be applied after all content creation is done. It’s also why I didn’t include a visual image of the previous example; the halftone printing process is an intense variety of dithering, and so would obscure the difference.

This is why most high-end professional video formats and systems process video luma in 10 bits. Although those two extra bits increase the luma data rate by only 25 percent, their addition quadruples the number of steps from 256 to 1,024 (and makes processing more computationally intensive).

We also see some high-end 12-bit systems used professionally. And in Hollywood-grade film production, perceptually uniform gamma can be left behind for logarithmic scales that better mimic the film process. And we have reasonably priced tools now that use 32-bit floating-point math to avoid any dithering. We’ll cover those subjects in the chapters on video formats and preprocessing.

But that’s all upstream of compression; there are no delivery codecs using more than 8 bits in common use today. The H.264 spec does include 10- and 12-bit modes, which we’ll discuss more in the H.264 chapter. However, that’s not likely to matter until consumer displays and connections that support more than 8-bit become more common. The decades of development around digital video delivery we have today are all based around 8-bit, so it would be a slow and gradual process for more-than-8-bit delivery to come to the fore.

Color Spaces

So far, we’ve been describing sampling and quantization of luminance, but haven’t mentioned color—which, as you’re about to discover, is a whole new jumbo-sized can of worms.

It’s not a given that chroma and luma get quantized and sampled the same way. In fact, in most video formats they’re handled quite differently. These different methods are called subsampling systems or color spaces.

As discussed in the previous chapter, all visible colors are made up of varying amounts of red, green, and blue. Thus, red, green, and blue can be sampled separately and combined to produce any visible color. Three different values are required for each sample to store color mathematically. And as we also discussed, human visual processing is very different for the black-and-white (luma) part of what we see than for the color (chroma) part of what we see. Specifically, our luma vision is much more detailed and able to pick up fast motion, while our chroma vision isn’t great at motion or detail, but can detect very slight gradations of hue and saturation.

Back in the 1930s, the CIE determined that beyond red, green, and blue, other combinations of three colors can be used to store images well. We call these triads of values “channels,” which can be considered independent layers, or planes, of the image. But these channels don’t have to be “colors” at all; a channel of luma and two of “color difference” work as well. Because luma is mainly green, the color difference channels measure how much red and blue there is relative to gray. This allows us to sample and quantize differently based on the different needs of the luma and chroma vision pathways, and allows many awesome and wonderful things to happen.

So, let’s examine all the major color spaces in normal use one by one and see how they apply to real-world digital image processing.

RGB

RGB (Red, Green, Blue) is the native color space of the eye, with each channel corresponding to one of the three types of cones. RGB is an additive color space, meaning that white is obtained by mixing the maximum value of each color together, gray is obtained by having equal values of each color, and black by having none of any color.

All video display devices are fundamentally RGB devices. The CCDs in cameras are RGB, and displays use red, green, and blue phosphors in CRTs, planes in LCDs, or light beams in projection displays. Computer graphics are almost always generated in RGB. All digital video starts and ends as RGB, even if it’s almost never stored or transmitted as RGB.

This is because RGB is not ideal for image processing. Luminance is the critical aspect of video, and RGB doesn’t encode luminance directly. Rather, luminance is an emergent property of RGB values, meaning that luminance arises from the interplay of the three channels. This complicates calculations enormously because color and brightness can’t be adjusted independently. Increasing contrast requires changing the value of each channel. A simple hue shift requires decoding and re-encoding all three channels via substantial mathematics. Nor is RGB conducive to efficient compression because luma and chroma are mixed in together. Theoretically, you might allocate fewer samples to blue or maybe red, but in practice this doesn’t produce enough efficiency to be worth the quality hit.

So, RGB is mainly seen either in very high-bitrate production formats, particularly for doing motion graphics with tools like Adobe After Effects, or in still images for similar uses.

RGBA

RGBA is simply RGB with an extra channel for alpha (transparency) data, which is used in authoring, but content isn’t shot or compressed with an alpha channel. Codecs with “alpha channel support” really just encode the video data and the alpha channel as different sets of data internally.

Y′CbCr

Digital video is nearly always stored in what’s called Y′CbCr or a variant thereof. Y′ stands for luminance, Cb for blue minus luminance, and Cr for red minus luminance. The ′ after the Y is a prime, and indicates that this is nonlinear luma; remember, this is gamma corrected to be perceptually uniform, and we don’t want it to be linear to actual light levels.

Confusing? Yes, it is. Most common first question: “where did green go?” Green is implicit in the luminance channel!

Let’s expand the luminance equation from the previous chapter into the full formula for Y′CbCr.

Y′ = 0.299Red + 0.587Green + 0.114Blue
Cb = −0.147Red − 0.289Green + 0.436Blue
Cr = 0.615Red − 0.515Green − 0.100Blue

So, Y′ mainly consists of green, some red, and a little blue. Because Cb and Cr both subtract luminance, both include luminance in their equations; they’re not pure B or R either. But this matches the sensitivities of the human eye, and lets us apply different compression techniques based on that. Because we see so much more luma than chroma detail, separating out chroma from luma allows us to compress two of our three channels much more. This wouldn’t work in RGB, which is why we bother with Y′CbCr in the first place.
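Here’s a minimal Python sketch of that conversion using the coefficients above, on normalized 0.0–1.0 gamma-corrected RGB values (real implementations also scale and offset into the 8-bit ranges discussed below; this just shows the channel mixing):

    def rgb_to_ycbcr(r, g, b):
        # r, g, b are gamma-corrected values from 0.0 to 1.0
        y  =  0.299 * r + 0.587 * g + 0.114 * b
        cb = -0.147 * r - 0.289 * g + 0.436 * b
        cr =  0.615 * r - 0.515 * g - 0.100 * b
        return y, cb, cr

    print(rgb_to_ycbcr(1.0, 1.0, 1.0))   # white: luma 1.0, color channels essentially 0
    print(rgb_to_ycbcr(0.0, 1.0, 0.0))   # pure green: most of the luma, negative Cb and Cr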

The simplest way to reduce the data rate of a Y′CbCr signal is to reduce the resolution of the Cb and Cr channels. Going back to the graph paper analogy, you might sample, say, Cb and Cr once every two squares, while sampling Y′ for every square. Other approaches are often used—one Cb and Cr sample for every four Y′s, one Cb and Cr sample for every two Y′ and so on.

Formulas for Y′CbCr sampling are notated in the format x:y:z. The first number establishes the relative number of luma samples. The second number indicates how many chroma samples there are relative to the number of luma samples on every other line starting from the first, and the third represents the number of chroma samples relative to luma samples on the alternating lines. So, 4:2:0 means that for every four luma samples, there are two chroma samples on every other line from the top, and none on the remaining lines. This means that chroma is stored in 2 × 2 blocks, so for a 640 × 480 image, there will be a 640 × 480 channel for Y′, and 320 × 240 channels for each of Cb and Cr.
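A quick sketch of what that does to the sample counts:

    def plane_sizes_420(width, height):
        luma_samples = width * height
        chroma_samples = (width // 2) * (height // 2)   # each of Cb and Cr
        return luma_samples, chroma_samples

    y, c = plane_sizes_420(640, 480)
    print(y, c)                    # 307200 luma samples, 76800 per chroma plane
    print((y + 2 * c) / (3 * y))   # 0.5 -- half the samples of 4:4:4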

The quantization for Y′ runs as already described—8-bit and perceptually uniform, with 0 to 255 available and video content using 16 to 235. Color is different. Cb and Cr are still 8-bit, but they’re signed values centered on zero (stored with an offset, so that a stored 128 means zero color difference), giving a range of −128 to +127. If both Cb and Cr are 0, the image is monochrome. The further the values are from 0 in either direction, the more saturated the color. The actual hue of the color itself is determined by the relative amounts of Cb versus Cr. In the same way that the range of Y′ in video is 16 to 235, the range of Cb and Cr in video is limited to −112 to +112 (16 to 240 as stored).

Terminology

In deference to Charles M. Poynton, who has written much more and much better about color than I have, this book uses the term Y′CbCr for what other sources call YUV. YUV really only refers to the native mode of NTSC television, with its specific gamma and other attributes, and for which broadcast has now ended. Y′U′V′ is also used sometimes, but isn’t much more relevant to digital video. When used properly, U and V refer to the subcarrier modulation axes in NTSC color coding, but YUV has become shorthand for any luma-plus-subsampled chroma color spaces. Most of the documentation you’ll encounter uses YUV in this way. It may be easier to type, but I took Dr. Poynton’s excellent color science class at the Hollywood Postproduction Alliance Tech Retreat, and would rather avoid his wrath and disappointment and instead honor his plea for accurate terminology.

4:4:4 Sampling

This is the most basic form of Y′CbCr, and the least seen. It has one Cb and one Cr for each Y′ value. Though not generally used as a storage format, it’s sometimes used for internal processing. In comparison, RGB is always 4:4:4. (See Figure 2.3 for illustrations of 4:4:4, 4:2:2, 4:2:0, 4:1:1, and YUV-9 sampling.)

Figure 2.3 Where the chroma subsamples live in four common sampling systems, plus the highly misguided YUV-9. The gray boxes indicate chroma samples, and the black lines the luma samples.


4:2:2 Sampling

This is the subsampling scheme most commonly used in professional video authoring: one Cb and one Cr for every two Y′ horizontally. 4:2:2 is used in tape formats such as D1, Digital Betacam, D5, and HDCAM-SR. It’s also the native format of the standard codecs of most editing systems. It looks good, particularly for interlaced video, but is only a third smaller than 4:4:4.

So, if 4:2:2 results in a compression ratio of 1.5:1, why is 4:2:2 called “uncompressed”? It’s because we don’t take sampling and quantization into account when talking about compressed versus uncompressed, just as audio sampled at 44.1 kHz isn’t said to be “compressed” compared to audio sampled at 48 kHz. So we have uncompressed 4:2:2 codecs and compressed 4:2:2 codecs.

4:2:0 Sampling

Here, we have one pixel per Y′ sample and four pixels (a 2 × 2 block) per Cb and Cr sample. This yields a 2:1 reduction in data before further compression. 4:2:0 is the ideal color space for compressing progressive scan video, and is used in all modern delivery codecs, whether for broadcast, DVD, Blu-ray, or the web. It’s often called “YV12” as well, referring to the specific way the samples are arranged in memory.

4:2:0 is also used in some consumer/prosumer acquisition formats, including PAL DV, HDV, and AVCHD.

4:1:1 Sampling

This is the color space of NTSC DV25 formats, including DV, DVC, DVCPRO, and DVCAM, as well as the (thankfully rare) “DVCPRO 4:1:1” in PAL. 4:1:1 uses Cb and Cr samples four pixels wide and one pixel tall. So it has the same number of chroma samples as 4:2:0, but in a different (nonsquare) shape.

So why use 4:1:1 instead of 4:2:0? Because of the way interlaced video is put together. Each video frame is organized into two fields that can contain different temporal information—more about this later. By having chroma sampling be one line tall, it’s easier to keep colors straight between fields. However, having only one chroma sample every 4 pixels horizontally can cause some blocking or loss of detail along the edges of saturated colors. 4:1:1 is okay to shoot with, but you should never actually encode to 4:1:1; it’s for acquisition only.

YUV-9 Sampling

YUV-9 was a common color space in the Paleolithic days of compression, with one chroma sample per 4 × 4 block of pixels. A break from the 4:x:x nomenclature, YUV-9’s name comes from its having an average of 9 bits per pixel. (Thus 4:2:0 would be YUV-12, hence YV12.) 4 × 4 chroma subsampling wasn’t enough for much beyond talking-head content, and sharp edges in highly saturated imagery look extremely blocky in YUV-9.

This was the Achilles heel of too many otherwise great products, including the Indeo series of codecs and Sorenson Video 1 and 2. Fortunately, all modern codecs have adopted 4:2:0.

Lab

Photoshop’s delightful Lab color mode is a 4:4:4 Y′CbCr variant. Its name refers to Luminance, chroma channel A, and chroma channel B. Luma/chroma separation is as good for still processing as it is for video; I’ve been able to wow more than one old-school Photoshop whiz by jumping into Lab mode and fixing problems originating in the chroma channels of a JPEG file.

CMYK Color Space

CMYK isn’t used in digital video, but it’s the other color space you’re likely to have heard of. CMYK stands for cyan, magenta, yellow, and black (yep, that final letter is for blacK).

CMYK is used in printing and specifies the placement and layering of inks on a page, not the firing of phosphors. Consequently, CMYK has some unique characteristics. The biggest one is adding a fourth channel, black. This is needed because with real-world pigments, mixing C + M + Y yields only a muddy brown, so a separate black channel is necessary.

In contrast to additive color spaces like RGB, CMYK is subtractive; white is obtained by removing all colors (which leaves white paper).

There are some expanded versions of CMYK that add additional inks to further improve the quality of reproduction, like CcMmYK, CMYKOG, and the awesomely ominous sounding (if rarely seen) CcMmYyKkBORG.

Color spaces for printing is a deep and fascinating topic that merits its own book. Fortunately, this is not that book. CMYK is normally just converted to RGB and then on to Y′CbCr in the compression workflow; all the tricky stuff happens in the conversion to RGB. And as CMYK is only ever used with single images, this is really part of editing and preprocessing, well before compression. In general it’s a lot easier to convert from CMYK than to it; getting good printable images out of a video stream can be tricky.

Quantization Levels and Bit Depth

Now that you’ve got a good handle on the many varieties of sampling, let’s get back to quantization. There are a number of quantization options out in the world, although only 8-bit per channel applies to the codecs we’re delivering in today.

8-bit Per Channel

This is the default depth for use in compression, for both RGB and YUV color spaces, with our canonical 8 bits per channel. This works out to 24 bits per pixel with 4:4:4, 16 bits per pixel with 4:2:2, and 12 bits per pixel with 4:2:0. 8-bit per channel quantization is capable of excellent reproduction of video without having to carry around excess bits. And because our entire digital video ecosystem has been built around 8-bit compression, transmission, and reproduction, more bits would be hard to integrate.
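Those bits-per-pixel figures fall straight out of the subsampling ratios; a small sketch:

    def bits_per_pixel(bits_per_channel, chroma_samples_per_4_pixels):
        # one luma sample per pixel, plus Cb and Cr at the subsampled rate
        luma = bits_per_channel
        chroma = 2 * bits_per_channel * chroma_samples_per_4_pixels / 4
        return luma + chroma

    print(bits_per_pixel(8, 4))   # 4:4:4 -> 24 bits per pixel
    print(bits_per_pixel(8, 2))   # 4:2:2 -> 16 bits per pixel
    print(bits_per_pixel(8, 1))   # 4:2:0 -> 12 bits per pixel

    # One uncompressed 640 x 480 frame of 8-bit 4:2:0:
    print(640 * 480 * bits_per_pixel(8, 1) / 8 / 1024)   # 450 KB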

Almost all of the examples in this book wind up in 8-bit 4:2:0. I document the other quantization methods, since they’ll come up in source formats or legacy encodes.

1-bit (Black and White)

In a black-and-white file, every pixel is either black or white. Any shading must be done through patterns of pixels. 1-bit quantization isn’t used in video at all. For the right kind of content, such as line art, 1-bit can compress quite small, of course. I mainly see 1-bit quantization in older downloadable GIF coloring pages from kids’ web sites.

Indexed Color

Indexed color is an odd beast. Instead of recording the values for each channel individually, an index specifies the R, G, and B values for each “color,” and only the index code is stored for each pixel. The number of items in the index is normally 8-bit, so up to 256 unique colors are possible. The selection of discrete colors is called a palette, Color Look Up Table (or CLUT, which eventually stops sounding naughty after frequent use, or eighth grade), or index.
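A tiny sketch of the indirection, with a made-up three-color palette:

    # index -> (R, G, B); a real palette would have up to 256 entries
    palette = [(0, 0, 0), (255, 255, 255), (200, 30, 30)]

    # the image stores one small index per pixel instead of three full channels
    pixel_indices = [1, 1, 2, 0]

    decoded = [palette[i] for i in pixel_indices]
    print(decoded)   # [(255, 255, 255), (255, 255, 255), (200, 30, 30), (0, 0, 0)]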

8-bit color is never used in video production, and it is becoming increasingly rare on computers as well. Back in the Heroic Age of Multimedia (the mid-1990s), much of your media mojo came from the ability to do clever and even cruel things with palettes. At industry parties, swaggering pixel monkeys would make extravagant claims for their dithering talents. Now we live in less exciting times, with less need for heroes, and can do most authoring with discrete channels. But 8-bit color still comes up from time to time. Indexed color’s most common use remains in animated and still GIF files. And I even use it for PNG screenshots of grayscale user interfaces (like the Expression products) on my blog, as it’s more efficient than using full 24- or 32-bit PNG. And it somehow crept into Blu-ray! BD discs authored with the simpler HDMV mode (but not the more advanced BD-J) are limited to 8-bit indexed PNG for all graphics. Go figure.

Depending on the content, 8-bit can look better or worse than 16-bit. With 8-bit, you can be very precise in your color selection, assuming there isn’t a lot of variation in the source. 16-bit can give you a lot more banding (only 32 levels between black and white on Mac), but can use all the colors in different combinations.

Making a good 8-bit image is tricky. First, the ideal 8-bit palette must be generated (typically with a tool such as Equilibrium’s DeBabelizer). Then the colors in the image must be reduced to only those appearing in that 8-bit palette. This is traditionally done with dithering, where the closest colors are mixed together to make up the missing color. This can yield surprisingly good results, but makes files that are very hard to compress. Alternatively, the color can be flattened—each color is just set to the nearest color in the 8-bit palette without any dithering. This looks worse for photographs, but better for synthetic images such as screen shots. Not dithering also improves compression.

Many formats can support indexed colors at lower bit depths, like 4-color (2-bit) or 16-color (4-bit). These operate in the same way except with fewer colors (see Color Figure C.11).

8-bit Grayscale

A cool variant of 8-bit is 8-bit grayscale. This doesn’t use an index; it’s just an 8-bit Y′ channel without chroma channels. I used 8-bit grayscale compression for black-and-white video often—Cinepak had a great-looking 8-bit grayscale mode that could run 640 × 480 on an Intel 486 processor. You can also force an 8-bit indexed codec to do 8-bit grayscale by giving it an index that’s just 0 to 255 with R = G = B. The same effect can be had in Y′CbCr codecs by setting the Cb and Cr planes to zero color difference. H.264 has an explicit Y′-only mode as well, although it isn’t broadly supported yet.

16-bit Color (High Color/Thousands of Colors/555/565)

These are quantization modes that use fewer than 8 bits per channel. I can’t think of any good reason to make them anymore, but you may come across legacy content using them. There are actually two different flavors of 16-bit color. Windows uses 6 bits in G and 5 bits each in R and B. “Thousands mode” on the Mac is 5 bits each for R, G, and B, with 1 bit reserved for a rarely used alpha channel.

16-bit obviously had trouble reproducing subtle gradients. Even with a 6-bit green (which is most of luminance) channel, only 64 gradations exist between black and white, which can cause visible banding.

No modern video codecs are natively 16-bit, although they can display to the screen in 16-bit. The ancient video codecs of Apple Video (code named Road Pizza–its 4CC is “rpza”) and Microsoft Video 1 were 16-bit. But hey, they didn’t have data rate control either, and ran on CPUs with a tenth the power of the cheapest cell phone today. You could probably decode them with an abacus in a pinch.

The Delights of Y′CbCr Processing

Warning: High Nerditry ahead.

One of the behind-the-scenes revolutions in editing tools in the last decade has been the move from RGB-native processing to Y′CbCr-native processing for video filters.

But though mainstream editing tools like Adobe Premiere and Apple Final Cut Pro are now Y′CbCr native, compositing tools (I’m looking at you, After Effects!) retain an RGB-only color model. One can avoid rounding errors from excess RGB-to-Y′CbCr conversion by using 16-bit or even 32-bit float per channel, but that’s paying a big cost in speed when all you want to do is apply a simple Levels filter to adjust luma.

There are three reasons why Y′CbCr processing rules over RGB.

Higher Quality

The gamut of RGB is different from that of Y′CbCr, which means that there are colors in Y′CbCr that you don’t get back when converting to RGB and back to Y′CbCr (and vice versa; Y′CbCr can’t do a 100-percent saturated blue like RGB can). Every time a color space conversion is done, some rounding error is added, which can accumulate and cause banding. This can be avoided by using 32-bit floating-point per channel in RGB space, but processing that many more bits has an impact on performance.

Higher Performance

With common filters, Y′CbCr’s advantages get even stronger. Filters that directly adjust luminance levels—like contrast, gamma, and brightness—only need to touch the Y′ channel in Y′CbCr, compared to all three channels in RGB, so in 8-bit per channel that’s 8 bits versus 24 bits in RGB. And if we’re comparing 10-bit versus 32-bit float, that’s 10 bits versus 96 bits. Chroma-only adjustments like hue or saturation only need to touch the subsampled Cb and Cr channels. So in 8-bit, that’d be 4 bits per pixel for Y′CbCr compared to 24 bits for RGB, and in 10-bit Y′CbCr versus 32-bit float, that’s 8 bits against 96 bits.

What’s worse, for a number of operations, RGB must be converted into a luma/chroma-separated format like Y′CbCr anyway, then converted back to RGB, causing that much more processing.

Channel-specific Filters

One of my favorite ways to clean up bad video only works in Y′CbCr mode. Most ugly video, especially coming from old analog sources like VHS and Umatic, is much uglier in the chroma channels than in luma. You can often get improved results by doing an aggressive noise reduction on just the chroma channels. Lots of noise reduction adds a lot of blur to the image, which looks terrible in RGB mode, because it’s equally applied to all colors. But when the chroma channels get blurry, the image can remain looking nice and sharp, while composite artifacts seem to just vanish. Being able to clean up JPEG sources like this is one big reason I love Photoshop’s Lab mode, and I wish I had Lab in After Effects and other compositing/motion graphics tools.

10-bit 4:2:2

Professional capture codecs and transports like SDI and HD SDI use Y′CbCr 4:2:2 with 10 bits in the luma channel (the chroma channels remain 8-bit). This yields 1,024 levels of gradation compared to 8-bit’s 256. While there’s nothing about 10-bit which requires 4:2:2, all 10-bit luma implementations are 4:2:2.

I’d love to see 10 bits per channel become a minimum for processing within video software. It breaks my heart to know those bits are in the file but are often discarded. Even if you’re delivering 24-bit color, the extra two bits per channel can make a real difference in reducing banding after multiple passes of effects processing. While the professional nonlinear editing systems (NLEs) now can do a good job with 10-bit, few if any of our mainstream compression tools do anything with that extra image data.

12-bit 4:4:4

Sony’s HDCAM-SR goes up to RGB 4:4:4 with 12 bits per channel at its maximum bitrate mode. While certainly a lot of bits to move around and process, that offers a whole lot of detail and dynamic range for film sources, allowing quite a lot of color correction without introducing banding. 12-bit 4:4:4 is mainly used in the high-end color grading/film timing workflow. We video guys normally get just a 10-bit 4:2:2 output after all the fine color work is done.

16-bit Depth

16-bit per channel was introduced in After Effects 5, and was the first step towards reducing the RGB/Y′CbCr rounding issues. However, it didn’t do anything to deal with the gamut differences.

32-bit Floating-point

So far, all the quantization methods we’ve talked about have been integer; the value is always a whole number. Because of that limitation we use perceptually uniform values so we don’t wind up with much bigger steps between values near black, and hence more banding. But floating-point numbers are actually made of three components. The math-phobic may tune out for this next part, if they’ve made it this far, but I promise this is actually relevant.

A Brief Introduction to Floating-Point

The goal of floating-point numbers is to provide a constant level of precision over a very broad range of possible values.

First off, let’s define a few terms of art.

•  Sign: whether the number is positive or negative (1-bit)

•  Exponent, which determines how big the number is (8 bits in 32-bit float)

•  Mantissa, which determines the precision of the number (23 bits in 32-bit float)

For example, take Avogadro’s number, which is how many atoms of carbon there are in 12 grams. As an integer, it is awkward:

•  602,200,000,000,000,000,000,000

That’s a whole lot of zeros at the end that don’t really add any information, but which would take a 79-bit binary number! Which is why it’s normally written in scientific notation as

•  6.022 × 10²³

Here, 6.022 is the mantissa and 23 is the exponent. This is much more compact, and it provides a clear indication of the number’s precision (how many digits in the mantissa) and its magnitude (the value of the exponent) as separate values.
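If you want to see those three components directly, Python’s standard library can pull a 32-bit float apart (a quick sketch):

    import struct

    def float32_fields(x):
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        sign     = bits >> 31            # 1 bit
        exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
        mantissa = bits & 0x7FFFFF       # 23 bits of precision
        return sign, exponent, mantissa

    print(float32_fields(6.022e23))   # huge positive number: big exponent field
    print(float32_fields(-0.5))       # sign bit set; exponent 126 encodes 2 ** -1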

With floating-point numbers, we get 23 bits’ worth of precision, but we can be that precise over a much broader range of numbers. And we can even go negative, which isn’t useful for delivery, of course, but can be very useful when stacking multiple filters together—a high-contrast filter can send some pixels below 0 luma, and then a brightness filter can bring them back up again without losing detail.

Floating-point numbers are generally not perceptually uniform either. To make the calculations easier, they actually use good old Y, not Y′, and calculate based on actual light intensity values.

Modern CPUs are very fast at 32-bit float operations, and by not having to deal with being perceptually uniform, 32-bit float may not be any slower than 16-bit per channel. And 32-bit float can always return exactly the same Y′CbCr values where those values weren’t changed by any filters. For those reasons, I normally just use 32-bit float over 16-bit per channel.

Quantizing Audio

So, given all that, how do we quantize audio? As with sampling, things in the audio world are simpler.

Each bit of resolution adds about six decibels (dB) of dynamic range (the difference between the loudest and softest sounds that can be reproduced). The greater the dynamic range, the better the reproduction of very quiet sounds, and the better the signal-to-noise ratio.
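The rule of thumb comes from 20 × log10(2) ≈ 6.02 dB per bit; tabulated quickly:

    import math

    db_per_bit = 20 * math.log10(2)   # about 6.02 dB
    for bits in (8, 16, 20, 24):
        print(bits, "bits ->", round(bits * db_per_bit), "dB")   # 48, 96, 120, 144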

8-bit Audio

Early digital audio sampling devices operated at 8-bit resolution, and it sounded lousy. And to make it even passably lousy instead of utterly incomprehensible required specialized software or hardware.

16-bit Audio

Audio CD players operate at 16-bit resolution, providing 65,536 possible values. And using our 6 dB-per-bit rule, we get 96 dB of dynamic range. This is substantially higher than even the best consumer audio equipment can deliver, or that would be comfortable to listen to; if it was loud enough to hear the quietest parts, it’d be painfully loud at the loudest parts.

There are many near-religious debates on the topic of how much quantization is too much for audio. My take is that 16-bit audio is like 8-bit video; I’d like to start and work with more than that, but properly processed, it can produce very high-quality final encodes.

And as with making good 8-bit video, dithering is the key to 16-bit audio, softening the edges of fast transitions. I’m of the opinion that the CD’s 44.1 kHz 16-bit stereo is quite sufficient for high-quality reproduction on good systems for people with good hearing, assuming the CD is well mastered. The main advantage to “HD audio” is in adding additional channels, not any increase in sample rate or bit depth.

20-bit Audio

20-bit offers only 4 more bits than 16-bit, but that’s another 24 dB, which is more than a hundredfold increase in dynamic range. At 20-bit, dithering becomes a nonissue, and even an industry-recognized “golden ear” listening on the finest audio reproduction system can’t tell the difference between 20-bit and higher in an A/B comparison. If you’re looking for overkill, 20-bit is sufficiently murderous for any final encode.

24-bit Audio

Codecs like DD+ and WMA 10 Pro offer 24-bit modes. These make for trouble-free conversion from high-bit sources without having to sweat dithering. They’re in practice the same quality as 20-bit. It’s 24-bit, since that’s three bytes (3 × 8), the next computer-friendly jump from 16-bit (2 × 8).

24-bit is also often used for recording audio, where all that headroom can be handy. Still, even 16-bit offers far more dynamic range than even the best microphone in the world can deliver. 24-bit and higher really pays off when mixing dozens of audio channels together in multiple generations.

32-bit and 64-bit Float

Floating-point is also available for audio production, although nothing is ever delivered in float. It offers the same advantages of high precision across a huge range that 32-bit float offers in video.

Quantization Errors

Quantization (quant) errors arise from having to choose the nearest equivalent to a given number from among a limited set of choices. Digitally encoding colors inevitably introduces this kind of error, since there’s never an exact match between an analog color and its nearest digital equivalent. The more gradations available in each channel—that is, the higher the color depth—the closer the best choice is likely to be to the original color.

For 8-bit channels that don’t need much processing, quantization error doesn’t create problems. But quant errors add up during image processing. Every time you change individual channel values, they get rounded to the nearest value, adding a little bit of error. With each layer of processing, the image tends to drift farther away from the best digital equivalent of the original color.

Imagine a Y′ value of 101. If you divide brightness in half, you wind up with 51 (50.5 rounded up—no decimals allowed with integers). Later, if you double brightness, the new value is 102. Look at Figure 2.4, which shows the before (left) and after (right) of some typical gamma and contrast filtering. Note how some values simply don’t exist in the output due to rounding errors, while adjacent values are doubled up.

Figure 2.4 Even a slight filter can cause spikes and valleys in the histogram, with some values missing from the output and others overrepresented.

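The halve-then-double example is easy to reproduce, and it shows where the histogram’s gaps and spikes come from (a minimal sketch with 8-bit integer values, rounding halves up as described above):

    def halve(value):
        return int(value / 2 + 0.5)      # integer result, 0.5 rounds up

    def double(value):
        return min(255, value * 2)       # clip to the 8-bit maximum

    for y in (100, 101, 102, 103):
        print(y, "->", double(halve(y)))
    # 100 -> 100, 101 -> 102, 102 -> 102, 103 -> 104
    # 101 and 103 never appear in the output, while 102 shows up twice as often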

Quant errors also arise from color space conversion. For example, most capture formats represent color as Y′CbCr 4:1:1 or 4:2:2, but After Effects filters operate in RGB.

Transferring from tape to tape via After Effects generally involves converting from Y′CbCr to RGB and back to Y′CbCr—at the very least. What’s more, filters and effects often convert color spaces internally. For instance, using Hue, Saturation, and Lightness (HSL) filters involves a trip through the HSL color space. Going between Y′CbCr and RGB also normally involves remapping from the 16–235 to the 0–255 range and back again each time, introducing further rounding errors.

Another trouble with color space conversions is that similar values don’t always exist in all color spaces. For example, in Y′CbCr, it’s perfectly possible to have a highly saturated blue-tinted black (where Y′ = 16 and Cb = 112). However, upon conversion to RGB, the information in the Cb and Cr channels disappears—black in RGB means R, G, and B are all 0; no color information at all. If the RGB version is converted back to Y′CbCr, the original values in the Cb and Cr channels can’t be recovered; as Y′ gets near black or white, the Cb and Cr values approach zero.

Conversely, RGB has colors that can’t be reproduced in Y′CbCr: for example, pure R, G, or B (like R = 255, G = 0, B = 0). Since each of Y′, Cb, and Cr is made of a mix of R, G, and B, it simply can’t have that degree of separation at the extremes. It can be “quite blue” but there’s a range of “super blues” that simply can’t be quantized in video.

And that’s how we get video and audio into numbers we can process on computers. And while we may think of uncompressed as being “exactly the same as the source”, a whole lot of tuning for the human visual and auditory systems is implicit in the quantization and sampling modes chosen.

Now, on to compression itself.
