MPEG video compression
In this chapter the principles of video compression are explored, leading to descriptions of MPEG-1, MPEG-2 and MPEG-4. MPEG-4 Part 10, also known as H.264 or AVC, is also covered, and for simplicity it will be referred to throughout as AVC.
MPEG-1 supports only progressively scanned images, whereas MPEG-2 and MPEG-4 support both progressive and interlaced scan. MPEG uses the term ‘picture’ to mean a full-screen image of any kind at one point on the time axis. This could be a field or a frame in interlaced systems but only a frame in non-interlaced systems. The terms field and frame will be used only when the distinction is important. MPEG-4 introduces object coding that can handle entities that may not fill the screen. In MPEG-4 the picture becomes a plane in which one or more video objects can be displayed, hence the term video object plane (VOP).
5.1 The eye
All imaging signals ultimately excite some response in the eye and the viewer can only describe the result subjectively. Familiarity with the functioning and limitations of the eye is essential to an understanding of image compression. The simple representation of Figure 5.1 shows that the eyeball is nearly spherical and is swivelled by muscles. The space between the cornea and the lens is filled with transparent fluid known as aqueous humour. The remainder of the eyeball is filled with a transparent jelly known as vitreous humour. Light enters the cornea, and the amount of light admitted is controlled by the pupil in the iris. Light entering is involuntarily focused on the retina by the lens in a process called visual accommodation. The lens is the only part of the eye which is not nourished by the bloodstream and its centre is technically dead. In a young person the lens is flexible and muscles distort it to perform the focusing action. In old age the lens loses some flexibility and causes presbyopia or limited accommodation. In some people the length of the eyeball is incorrect resulting in myopia (short-sightedness) or hypermetropia (long-sightedness). The cornea should have the same curvature in all meridia, and if this is not the case, astigmatism results.
The retina is responsible for light sensing and contains a number of layers. The surface of the retina is covered with arteries, veins and nerve fibres and light has to penetrate these in order to reach the sensitive layer. This contains two types of discrete receptors known as rods and cones from their shape. The distribution and characteristics of these two receptors are quite different. Rods dominate the periphery of the retina whereas cones dominate a central area known as the fovea outside which their density drops off. Vision using the rods is monochromatic and has poor resolution but remains effective at very low light levels, whereas the cones provide high resolution and colour vision but require more light. Figure 5.2 shows how the sensitivity of the retina slowly increases in response to entering darkness. The first part of the curve is the adaptation of the cone or photopic vision. This is followed by the greater adaptation of the rods in scotopic vision. At such low light levels the fovea is essentially blind and small objects which can be seen in the peripheral rod vision disappear when stared at.
The cones in the fovea are densely packed and directly connected to the nervous system allowing the highest resolution. Resolution then falls off away from the fovea. As a result the eye must move to scan large areas of detail. The image perceived is not just a function of the retinal response, but is also affected by processing of the nerve signals. The overall acuity of the eye can be displayed as a graph of the response plotted against the degree of detail being viewed. Detail is generally measured in lines per millimetre or cycles per picture height, but this takes no account of the distance from the eye. A better unit for eye resolution is one based upon the subtended angle of detail as this will be independent of distance. Units of cycles per degree are then appropriate. Figure 5.3 shows the response of the eye to static detail. Note that the response to very low frequencies is also attenuated. An extension of this characteristic allows the vision system to ignore the fixed pattern of shadow on the retina due to the nerves and arteries.
The retina does not respond instantly to light, but requires between 0.15 and 0.3 second before the brain perceives an image. The resolution of the eye is primarily a spatio-temporal compromise. The eye is a spatial sampling device; the spacing of the rods and cones on the retina represents a spatial sampling frequency.
The measured acuity of the eye exceeds the value calculated from the sample site spacing because a form of oversampling is used. The eye is in a continuous state of subconscious vibration called saccadic motion. This causes the sampling sites to exist in more than one location, effectively increasing the spatial sampling rate provided there is a temporal filter which is able to integrate the information from the various different positions of the retina. This temporal filtering is responsible for ‘persistence of vision’. Flashing lights are perceived to flicker only until the critical flicker frequency (CFF) is reached; the light appears continuous for higher frequencies. The CFF is not constant but changes with brightness (see Figure 5.4). Note that the field rate of European television at 50 fields per second is marginal with bright images.
Figure 5.5 shows the two-dimensional or spatio-temporal response of the eye. If the eye were static, a detailed object moving past it would give rise to temporal frequencies, as Figure 5.6(a) shows. The temporal frequency is given by the detail in the object, in lines per millimetre, multiplied by the speed. Clearly a highly detailed object can reach high temporal frequencies even at slow speeds, yet Figure 5.5 shows that the eye cannot respond to high temporal frequencies.
However, the human viewer has an interactive visual system which causes the eyes to track the movement of any object of interest. Figure 5.6(b) shows that when eye tracking is considered, a moving object is rendered stationary with respect to the retina so that temporal frequencies fall to zero and much the same acuity to detail is available despite motion. This is known as dynamic resolution and it describes how humans judge the detail in real moving pictures.
5.2 Dynamic resolution
As the eye uses involuntary tracking at all times, the criterion for measuring the definition of moving-image portrayal systems has to be dynamic resolution, defined as the apparent resolution perceived by the viewer in an object moving within the limits of accurate eye tracking. The traditional metric of static resolution in film and television has to be abandoned as unrepresentative of the subjective results.
Figure 5.7(a) shows that when the moving eye tracks an object on the screen, the viewer is watching with respect to the optic flow axis, not the time axis, and these are not parallel when there is motion. The optic flow axis is defined as an imaginary axis in the spatio-temporal volume which joins the same points on objects in successive frames. Clearly when many objects move independently there will be one optic flow axis for each.
The optic flow axis is identified by motion-compensated standards convertors to eliminate judder and also by MPEG compressors because the greatest similarity from one picture to the next is along that axis. The success of these devices is testimony to the importance of the theory.
According to sampling theory, a sampling system cannot properly convey frequencies beyond half the sampling rate. If the sampling rate is considered to be the picture rate, then no temporal frequency of more than 25 or 30 Hz can be handled (12 Hz for film). With a stationary camera and scene, temporal frequencies can only result from the brightness of lighting changing, but this will not approach the limit. However, when there is relative movement between camera and scene, detailed areas develop high temporal frequencies, just as was shown in Figure 5.6(a) for the eye. This is because relative motion results in a given point on the camera sensor effectively scanning across the scene. The temporal frequencies generated are beyond the limit set by sampling theory, and aliasing should take place.
However, when the resultant pictures are viewed by a human eye, this aliasing is not perceived because, once more, the eye tracks the motion of the scene.1
Figure 5.8 shows what happens when the eye follows correctly. Although the camera sensor and display are both moving through the field of view, the original scene and the retina are now stationary with respect to one another. As a result the temporal frequency at the eye due to the object being followed is brought to zero and no aliasing is perceived by the viewer due to sampling in the image-portrayal system.
Whilst this result is highly desirable, it does not circumvent sampling theory because the effect only works if several assumptions are made, including the requirement for the motion to be smooth.
Figure 5.7(b) shows that when the eye is tracking, successive pictures appear in different places with respect to the retina. In other words if an object is moving down the screen and followed by the eye, the raster is actually moving up with respect to the retina. Although the tracked object is stationary with respect to the retina and temporal frequencies are zero, the object is moving with respect to the sensor and the display and in those units high temporal frequencies will exist. If the motion of the object on the sensor is not correctly portrayed, dynamic resolution will suffer. Dynamic resolution analysis confirms that interlaced television and conventionally projected cinema film are both seriously sub-optimal. In contrast, progressively scanned television systems have no such defects.
In real-life eye tracking, the motion of the background will be smooth, but in an image-portrayal system based on periodic presentation of frames, the background will be presented to the retina in a different position in each frame. The retina separately perceives each impression of the background leading to an effect called background strobing.
The criterion for the selection of a display frame rate in an imaging system is sufficient reduction of background strobing. It is a complete myth that the display rate simply needs to exceed the critical flicker frequency.
Manufacturers of graphics displays which use frame rates well in excess of those used in film and television are doing so for a valid reason: it gives better results!
The traditional argument against high picture rates is that they require excessive bandwidth. In PCM form this may be true, but there are major exceptions. First, in the MPEG domain, it is the information that has to be sent, not the individual picture. Raising the picture rate does not raise the MPEG bit rate in proportion. This is because pictures which are closer together in time have more redundancy and smaller motion distances. Second, the display rate and the transmission rate need not be the same in an advanced system. Motion-compensated up-conversion at the display can correctly render the background at intermediate positions to those transmitted.
5.3 Contrast
The contrast sensitivity of the eye is defined as the smallest brightness difference which is visible. In fact the contrast sensitivity is not constant, but increases proportionally to brightness. Thus whatever the brightness of an object, if that brightness changes by about 1 per cent it will be equally detectable.
The true brightness of a television picture can be affected by electrical noise on the video signal. As contrast sensitivity is proportional to brightness, noise is more visible in dark picture areas than in bright areas. In practice the gamma characteristic of the CRT is put to good use in making video noise less visible. Instead of having linear video signals which are subjected to an inverse gamma function immediately prior to driving the CRT, the inverse gamma correction is performed at the camera. In this way the video signal is non-linear for most of its journey.
Figure 5.9 shows a reverse gamma function. As a true power function requires infinite gain near black, a linear segment is substituted. It will be seen that contrast variations near black result in larger signal amplitude than variations near white. The result is that noise picked up by the video signal has less effect on dark areas than on bright areas. After the gamma of the CRT has acted, noise near black is compressed with respect to noise near white. Thus a video transmission system using gamma correction at source has a better perceived noise level than if the gamma correction is performed near the display.
In practice the system is not rendered perfectly linear by gamma correction and a slight overall exponential effect is usually retained in order further to reduce the effect of noise in darker parts of the picture. A gamma correction factor of 0.45 may be used to achieve this effect. If another type of display is to be used with signals designed for CRTs, the gamma characteristic of that display will probably be different and some gamma conversion will be required.
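As a rough illustration of the camera-side transfer function described above, the sketch below applies a power law with a linear segment substituted near black. The breakpoint and scaling constants are those of ITU-R BT.709 and are shown only as an example; the text does not prescribe particular values.

    def gamma_correct(v, exponent=0.45, breakpoint=0.018, gain=4.5):
        """Map linear light v (0..1) to a gamma-corrected signal (0..1).
        A true power law would need infinite gain at black, so a linear
        segment is used below the breakpoint (constants as in BT.709)."""
        if v < breakpoint:
            return gain * v                       # linear segment near black
        return 1.099 * v ** exponent - 0.099      # power-law segment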
Sensitivity to noise is also a function of spatial frequency. Figure 5.10 shows that the sensitivity of the eye to noise falls with frequency from a maximum at zero. Thus it is vital that the average brightness should be correctly conveyed by a compression system, whereas higher spatial frequencies can be subject to more requantizing noise. Transforming the image into the frequency domain allows this characteristic to be explored.
5.4 Colour vision
Colour vision is due to the cones on the retina which occur in three different types, responding to different colours. Figure 5.11 shows that human vision is restricted to a range of light wavelengths from 400 nanometres to 700 nanometres. Shorter wavelengths are called ultraviolet and longer wavelengths are called infra-red. Note that the response is not uniform, but peaks in the area of green. The response to blue is very poor and makes a nonsense of the use of blue lights on emergency vehicles which owe much to tradition and little to psycho-optics.
Figure 5.11 shows an approximate response for each of the three types of cone. If light of a single wavelength is observed, the relative responses of the three sensors allow us to discern what we call the colour of the light. Note that at both ends of the visible spectrum there are areas in which only one receptor responds; all colours in those areas look the same. There is a great deal of variation in receptor response from one individual to the next and the curves used in television are the average of a great many tests. In a surprising number of people the single receptor zones are extended and discrimination between, for example, red and orange is difficult.
The triple receptor characteristic of the eye is extremely fortunate as it means that we can generate a range of colours by adding together light sources having just three different wavelengths in various proportions. This process is known as additive colour matching which should be clearly distinguished from the subtractive colour matching that occurs with paints and inks. Subtractive matching begins with white light and selectively removes parts of the spectrum by filtering. Additive matching uses coloured light sources which are combined.
5.5 Colour difference signals
An effective colour television system can be made in which only three pure or single wavelength colours or primaries can be generated. The primaries need to be similar in wavelength to the peaks of the three receptor responses, but need not be identical. Figure 5.12 shows a rudimentary colour television system. Note that the colour camera is in fact three cameras in one, where each is fitted with a different coloured filter. Three signals, R, G and B, must be transmitted to the display which produces three images that must be superimposed to obtain a colour picture.
A monochrome camera produces a single luminance signal Y whereas a colour camera produces three signals, or components, R, G and B which are essentially monochrome video signals representing an image in each primary colour. RGB and Y signals are incompatible, yet when colour television was introduced it was a practical necessity that it should be possible to display colour signals on a monochrome display and vice versa. Creating or transcoding a luminance signal from R, G and B is relatively easy. Figure 5.11 showed the spectral response of the eye which has a peak in the green region. Green objects will produce a larger stimulus than red objects of the same brightness, with blue objects producing the least stimulus. A luminance signal can be obtained by adding R, G and B together, not in equal amounts, but in a sum which is weighted by the relative response of the eye. Thus:
Y = 0.3R + 0.59G + 0.11B
If Y is derived in this way, a monochrome display will show nearly the same result as if a monochrome camera had been used in the first place. The results are not identical because of the non-linearities introduced by gamma correction.
As colour pictures require three signals, it should be possible to send Y and two other signals which a colour display could arithmetically convert back to R, G and B. There are two important factors which restrict the form which the other two signals may take. One is to achieve reverse compatibility. If the source is a monochrome camera, it can only produce Y and the other two signals will be completely absent. A colour display should be able to operate on the Y signal only and show a monochrome picture. The other is the requirement to conserve bandwidth for economic reasons.
These requirements are met by sending two colour difference signals along with Y. There are three possible colour difference signals, R–Y, B–Y and G–Y. As the green signal makes the greatest contribution to Y, then the amplitude of G–Y would be the smallest and would be most susceptible to noise. Thus R–Y and B–Y are used in practice as Figure 5.13 shows.
R and B are readily obtained by adding Y to the two colour difference signals. G is obtained by rearranging the expression for Y above such that:
G = (Y – 0.3R – 0.11B)/0.59
If a colour CRT is being driven, it is possible to apply inverted luminance to the cathodes and the R–Y and B–Y signals directly to two of the grids so that the tube performs some of the matrixing. It is then only necessary to obtain G–Y for the third grid, using the expression:
G–Y = –0.51(R–Y) – 0.186(B–Y)
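The matrixing described in this section can be summarized in a short sketch. It assumes R, G and B are gamma-corrected signals scaled 0 to 1, and simply applies the weighted sum for Y and its rearrangement to recover G; the function names are illustrative only.

    def rgb_to_ydiff(r, g, b):
        y = 0.3 * r + 0.59 * g + 0.11 * b        # luminance weighted by eye response
        return y, r - y, b - y                   # Y, R-Y, B-Y

    def ydiff_to_rgb(y, r_y, b_y):
        r = y + r_y
        b = y + b_y
        g = (y - 0.3 * r - 0.11 * b) / 0.59      # rearranged from Y = 0.3R + 0.59G + 0.11B
        return r, g, b

    # The G-Y expression above follows from the same rearrangement, since
    # 0.3/0.59 = 0.51 and 0.11/0.59 = 0.186 (to two or three figures).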
If a monochrome source having only a Y output is supplied to a colour display, R–Y and B–Y will be zero. It is reasonably obvious that if there are no colour difference signals the colour signals cannot be different from one another and R = G = B. As a result the colour display can produce only a neutral picture.
The use of colour difference signals is essential for compatibility in both directions between colour and monochrome, but it has a further advantage that follows from the way in which the eye works. In order to produce the highest resolution in the fovea, the eye will use signals from all types of cone, regardless of colour. In order to determine colour the stimuli from three cones must be compared. There is evidence that the nervous system uses some form of colour difference processing to make this possible. As a result the full acuity of the human eye is only available in monochrome. Differences in colour cannot be resolved so well. A further factor is that the lens in the human eye is not achromatic and this means that the ends of the spectrum are not well focused. This is particularly noticeable on blue.
If the eye cannot resolve colour very well there is no point in expending valuable bandwidth sending high-resolution colour signals. Colour difference working allows the luminance to be sent separately at a bit rate which determines the subjective sharpness of the picture. The colour difference signals can be sent with considerably reduced bit rate, as little as one quarter that of luminance, and the human eye is unable to tell the difference.
The overwhelming advantages obtained by using downsampled colour difference signals mean that in broadcast and production facilities their use has become almost universal. The technique is equally attractive for compression applications and is retained in MPEG. The outputs from the RGB sensors in most cameras are converted directly to Y, R–Y and B–Y in the camera control unit and output in that form. Whilst signals such as Y, R, G and B are unipolar or positive only, it should be stressed that colour difference signals are bipolar and may meaningfully take on levels below zero volts.
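As a minimal sketch of the downsampling idea (not the filtering any particular standard specifies), the following averages each 2 × 2 group of colour difference samples, quartering the chroma data relative to luminance:

    import numpy as np

    def downsample_2x2(chroma):
        """Average each 2 x 2 block of a colour difference plane.
        A crude box filter stands in for the better filters a real system would use."""
        h, w = chroma.shape
        c = chroma[:h - h % 2, :w - w % 2].astype(np.float32)   # trim to even dimensions
        return (c[0::2, 0::2] + c[0::2, 1::2] +
                c[1::2, 0::2] + c[1::2, 1::2]) / 4.0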
The downsampled colour formats used in MPEG were described in section 2.9.
5.6 Progressive or interlaced scan?
Analog video samples in the time domain and vertically down the screen so a two-dimensional vertical/temporal sampling spectrum will result. In a progressively scanned system there is a rectangular matrix of sampling sites vertically and temporally. The rectangular sampling structure of progressive scan is separable which means that, for example, a vertical manipulation can be performed on frame data without affecting the time axis. The sampling spectrum will be obtained according to section 2.6 and consists of the baseband spectrum repeated as sidebands above and below harmonics of the two-dimensional sampling frequencies. The corresponding spectrum is shown in Figure 5.14. The baseband spectrum is in the centre of the diagram, and the repeating sampling sideband spectrum extends vertically and horizontally. The vertical aspects of the star-shaped spectrum result from vertical spatial frequencies in the image. The horizontal aspect is due to image movement. Note that the star shape is rather hypothetical; the actual shape depends heavily on the source material. On a still picture the horizontal dimensions collapse to a line structure. In order to return a progressive scan video signal to a continuous moving picture, a two-dimensional low-pass filter having a rectangular response is required. This is quite feasible as persistence of vision acts as a temporal filter and by sitting far enough away from the screen the finite acuity of the eye acts as a spatial reconstruction filter.
Interlace is actually a primitive form of compression in which the system bandwidth is typically halved by sending only half the frame lines in the first field, with the remaining lines being sent in the second field. The use of interlace has a profound effect on the vertical/temporal spectrum. Figure 5.15 shows that the lowest sampling frequency on the time axis is the frame rate, and the lowest sampling frequency on the vertical axis is the number of lines in a field. The arrangement is called a quincunx pattern because of its similarity to the pattern of five spots on a die. The triangular passband has exactly half the area of the rectangular passband of Figure 5.14 illustrating that half the information rate is available.
As a consequence of the triangular passband of an interlaced signal, if the best vertical frequency response is to be obtained, no motion is allowed. It should be clear from Figure 5.16 that a high vertical spatial frequency resulting from a sharp horizontal edge in the picture is only repeated at frame rate, resulting in an artifact known as interlace twitter. Conversely, to obtain the best temporal response, the vertical resolution must be impaired. Thus interlaced systems have poor resolution on moving images, in other words their dynamic resolution is poor.
In order to return to a continuous signal, a quincuncial spectrum requires a triangular spatio-temporal low-pass filter. In practice no such filter can be realized. Consequently the sampling sidebands are not filtered out and are visible, particularly the frame rate component. This artifact is visible on an interlaced display even at distances so great that resolution cannot be assessed.
Figure 5.17(a) shows a dynamic resolution analysis of interlaced scanning. When there is no motion, the optic flow axis and the time axis are parallel and the apparent vertical sampling rate is the number of lines in a frame. However, when there is vertical motion, (b), the optic flow axis turns. In the case shown, the sampling structure due to interlace results in the vertical sampling rate falling to one half of its stationary value.
Consequently interlace does exactly what would be expected from a half-bandwidth filter. It halves the vertical resolution when any motion with a vertical component occurs. In a practical television system, there is no anti-aliasing filter in the vertical axis and so when the vertical sampling rate of an interlaced system is halved by motion, high spatial frequencies will alias or heterodyne causing annoying artifacts in the picture. This is easily demonstrated.
Figure 5.17(c) shows how a vertical spatial frequency well within the static resolution of the system aliases when motion occurs. In a progressive scan system this effect is absent and the dynamic resolution due to scanning can be the same as the static case.
This analysis also illustrates why interlaced television systems must have horizontal raster lines. This is because in real life, horizontal motion is more common than vertical. It is easy to calculate the vertical image motion velocity needed to obtain the half-bandwidth speed of interlace, because it amounts to one raster line per field. In 525/60 (NTSC) there are about 500 active lines, so motion as slow as one picture height in 8 seconds will halve the dynamic resolution. In 625/50 (PAL) there are about 600 lines, so the half-bandwidth speed falls to one picture height in 12 seconds. This is one reason why NTSC, despite its lower line count and bandwidth, does not look as much softer than PAL as the static figures suggest: its dynamic resolution is impaired less readily by motion.
The situation deteriorates rapidly if an attempt is made to use interlaced scanning in systems with a lot of lines. In 1250/50, the resolution is halved at a vertical speed of just one picture height in 24 seconds. In other words on real moving video a 1250/50 interlaced system has the same dynamic resolution as a 625/50 progressive system. By the same argument a 1080I system has the same performance as a 480P system.
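The arithmetic behind these figures is simple enough to set out explicitly; the half-bandwidth condition is a vertical motion of one raster line per field, and the active line counts below are the approximate values quoted above.

    def half_bandwidth_seconds(active_lines, field_rate_hz):
        """Seconds to traverse one picture height when moving one raster line per field."""
        return active_lines / field_rate_hz

    print(half_bandwidth_seconds(500, 60))    # 525/60: about 8 seconds per picture height
    print(half_bandwidth_seconds(600, 50))    # 625/50: 12 seconds
    print(half_bandwidth_seconds(1200, 50))   # 1250/50: 24 seconds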
Interlaced signals are not separable and so processes that are straightforward in progressively scanned systems become more complex in interlaced systems. Compression systems should not be cascaded indiscriminately, especially if they are different. As digital compression techniques based on transforms are now available, it makes no sense to use an interlaced, i.e. compressed, video signal as an input.
Interlaced signals are harder for MPEG to compress.2 The confusion of temporal and spatial information makes accurate motion estimation more difficult and this reflects in a higher bit rate being required for a given quality. In short, how can a motion estimator accurately measure motion from one field to another when differences between the fields can equally be due to motion, vertical detail or vertical aliasing?
Computer-generated images and film are not interlaced, but consist of discrete frames spaced on a time axis. As digital technology is bringing computers and television closer the use of interlaced transmission is an embarrassing source of incompatibility. The future will bring image delivery systems based on computer technology and oversampling cameras and displays which can operate at resolutions much closer to the theoretical limits.
Interlace was the best that could be managed with thermionic valve technology sixty years ago, and we should respect the achievement of its developers at a time when things were so much harder. However, we must also recognize that the context in which interlace made sense has disappeared.
5.7 Spatial and temporal redundancy in MPEG
Chapter 1 introduced these concepts in a general sense and now they will be treated with specific reference to MPEG. Figure 5.18(a) shows that spatial redundancy is redundancy within a single picture or object, for example repeated pixel values in a large area of blue sky. Temporal redundancy (b) exists between successive pictures or objects.
In MPEG, where temporal compression is used, the current picture/object is not sent in its entirety; instead the difference between the current picture/object and the previous one is sent. The decoder already has the previous picture/object, and so it can add the difference, or residual image, to make the current picture/object. A residual image is created by subtracting every pixel in one picture/object from the corresponding pixel in another. This is trivially easy when pictures are restricted to progressive scan, as in MPEG-1, but MPEG-2 had to develop greater complexity (continued in MPEG-4) so that this can also be done with interlaced pictures. The handling of interlace in MPEG will be detailed later.
A residual is an image of a kind, although not a viewable one, and so should contain some kind of spatial redundancy. Figure 5.18(c) shows that MPEG takes advantage of both forms of redundancy. Residual images are spatially compressed prior to transmission. At the decoder the spatial compression is decoded to re-create the residual, then this is added to the previous picture/object to complete the decoding process.
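Before motion compensation is considered, the residual idea amounts to nothing more than a pixel-by-pixel subtraction and a matching addition at the decoder, as the sketch below shows (8-bit pictures assumed for illustration):

    import numpy as np

    def make_residual(current, previous):
        """Subtract every pixel of the previous picture from the current one."""
        return current.astype(np.int16) - previous.astype(np.int16)

    def decode_with_residual(previous, residual):
        """The decoder already holds the previous picture and adds the residual to it."""
        return np.clip(previous.astype(np.int16) + residual, 0, 255).astype(np.uint8)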
Whenever objects move they will be in a different place in successive pictures. This will result in large amounts of difference data. MPEG overcomes the problem using motion compensation. The encoder contains a motion estimator which measures the direction and distance of any motion between pictures and outputs this as vectors which are sent to the decoder. When the decoder receives the vectors it uses them to shift data in a previous picture to more closely resemble the current picture. Effectively the vectors are describing the optic flow axis of some moving screen area, along which axis the image is highly redundant. Vectors are bipolar codes which determine the amount of horizontal and vertical shift required. The shifts may be in whole pixels or some binary fraction of a pixel. Whole pixel shifts are simply achieved by changing memory addresses, whereas sub-pixel shifts require interpolation. Chapter 3 showed how interpolation of this kind can be performed.
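The decoder-side use of a vector can be sketched as follows. The vector is taken in half-pixel units; whole-pixel displacements are plain address changes, and the half-pixel case is shown with simple bilinear averaging standing in for the better interpolators of Chapter 3. Border handling is ignored and the function name is illustrative.

    import numpy as np

    def fetch_predicted_block(ref, y, x, vy, vx, size=16):
        """Fetch a size x size block for the macroblock at (y, x) in the current
        picture, taking pixels from the reference picture displaced by (vy, vx)
        half-pels. Assumes the displaced block lies wholly inside the picture."""
        iy, fy = divmod(vy, 2)                    # whole-pel and half-pel parts
        ix, fx = divmod(vx, 2)
        blk = ref[y + iy:y + iy + size + 1,
                  x + ix:x + ix + size + 1].astype(np.float32)
        blk = (blk[:-1, :] + blk[1:, :]) / 2 if fy else blk[:-1, :]   # vertical half-pel
        blk = (blk[:, :-1] + blk[:, 1:]) / 2 if fx else blk[:, :-1]   # horizontal half-pel
        return blk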
In real images, moving objects do not necessarily maintain their appearance as they move. For example, objects may move into shade or light. Consequently motion compensation can never be ideal and it is still necessary to send a residual to make up for any shortcomings in the motion compensation.
Figure 5.19 shows how this works in MPEG. In addition to the motion-encoding system, the coder also contains a motion decoder. When the encoder outputs motion vectors, it also uses them locally in the same way that a real decoder will, and is able to produce a predicted picture based solely on the previous picture shifted by motion vectors. This is then subtracted from the actual current picture to produce a prediction error or residual which is an image of a kind that can be spatially compressed. The decoder takes the previous picture, shifts it with the vectors to re-create the predicted picture and then decodes and adds the prediction error to produce the actual picture. Picture data sent as vectors plus prediction error are said to be P coded.
The concept of sending a prediction error is a useful approach because it allows both the motion estimation and compensation to be imperfect. An ideal motion compensation system will send just the right amount of vector data. With insufficient vector data, the prediction error will be large, but transmission of excess vector data will also cause the bit rate to rise. There will be an optimum balance which minimizes the sum of the prediction error data and the vector data.
In MPEG-1 and MPEG-2 the balance is obtained by dividing the screen into areas called macroblocks which are 16 luminance pixels square. Each macroblock is associated with a vector. The vector has horizontal and vertical components so that, used in combination, pixel data may be shifted in any direction. The locations of the boundaries of a macroblock are fixed with respect to the display screen and so clearly the vector cannot move the macroblock. Instead the vector tells the decoder where to look in another picture to find pixel data which will be fetched to the macroblock. Figure 5.20(a) shows this concept. MPEG-2 vectors have half-pixel resolution.
MPEG-1 and MPEG-2 also use a simple compression scheme on the vectors. Figure 5.21 shows that in the case of a large moving object, all the vectors within the object will be identical or nearly so. A horizontal run of macroblocks can be associated to form a structure called a slice (see section 5.14). The first vector of a slice is sent as an absolute value, but subsequent vectors will be sent as differences. No advantage is taken of vertical vector redundancy.
Real moving objects will not coincide with macroblocks and so the motion compensation will not be ideal but the prediction error makes up for any shortcomings. Figure 5.20(b) shows the case where the boundary of a moving object bisects a macroblock. If the system measures the moving part of the macroblock and sends a vector, the decoder will shift the entire block making the stationary part wrong. If no vector is sent, the moving part will be wrong. Both approaches are legal in MPEG-1 and MPEG-2 because the prediction error compensates for the incorrect values. An intelligent coder might try both approaches to see which required the least prediction error data.
The prediction error concept also allows the use of simple but inaccurate motion estimators in low-cost systems. The greater prediction error data are handled using a higher bit rate. On the other hand if a precision motion estimator is available, a very high compression factor may be achieved because the prediction error data are minimized. MPEG does not specify how motion is to be measured; it simply defines how a decoder will interpret the vectors. Encoder designers are free to use any motion-estimation system provided that the right vector protocol is created. Chapter 3 contrasted a number of motion estimation techniques.
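Since MPEG leaves the measurement method open, the sketch below uses the simplest possibility: an exhaustive block-matching search that returns the whole-pixel displacement with the smallest sum of absolute differences (SAD). Real encoders use far more efficient strategies, as contrasted in Chapter 3.

    import numpy as np

    def estimate_vector(cur_block, ref, y, x, search=8):
        """Try every displacement within +/-search pixels and keep the best match."""
        size = cur_block.shape[0]
        best_sad, best_v = None, (0, 0)
        for vy in range(-search, search + 1):
            for vx in range(-search, search + 1):
                ty, tx = y + vy, x + vx
                if ty < 0 or tx < 0 or ty + size > ref.shape[0] or tx + size > ref.shape[1]:
                    continue                       # candidate falls outside the picture
                cand = ref[ty:ty + size, tx:tx + size]
                sad = np.abs(cur_block.astype(np.int32) - cand.astype(np.int32)).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_v = sad, (vy, vx)
        return best_v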
Figure 5.22(a) shows that a macroblock contains both luminance and colour difference data at different resolutions. Most of the MPEG-2 Profiles use a 4:2:0 structure which means that the colour is downsampled by a factor of two in both axes. Thus in a 16 × 16 pixel block, there are only 8 × 8 colour difference sampling sites. MPEG-2 is based upon the 8 × 8 DCT (see section 3.7) and so the 16 × 16 block is the screen area which contains an 8 × 8 colour difference sampling block. Thus in 4:2:0 in each macroblock there are four luminance DCT blocks, one R–Y DCT block and one B–Y DCT block, all steered by the same vector. In the 4:2:2 Profile of MPEG-2, shown in Figure 5.22(b), the chroma is not downsampled vertically, and so there is twice as much chroma data in each macroblock which is otherwise substantially the same.
In MPEG-4 a number of refinements are made to motion compensation in addition to the MPEG-2 tools which are still available. One of these tools is a more complex form of vector compression which will be described in section 5.21. This cuts down the bit rate needed for a given set of vector data and has a number of benefits. If the existing macroblock size is retained, a higher compression factor results from the reduced vector data. However, the opportunity also arises to increase the vector density. Optionally, MPEG-4 allows one vector to be associated with an 8 × 8 DCT block rather than with a 16 × 16 macroblock. This allows much more precise prediction of complex motion, especially near the edge of objects, and will result in a reduced residual bit rate. On a macroblock-by-macroblock basis an MPEG-4 coder can decide whether to code four vectors per macroblock or only one. When the compressed vector data and residual are combined, one of these approaches will result in the least data. In AVC the vector density may be higher still.
5.8 I and P coding
Predictive (P) coding cannot be used indefinitely, as it is prone to error propagation. A further problem is that it becomes impossible to decode the transmission if reception begins part-way through. In real video signals, cuts or edits can be present across which there is little redundancy and which make motion estimators throw up their hands.
In the absence of redundancy over a cut, there is nothing to be done but to send the new picture information in absolute form. This is called I coding where I is an abbreviation of intra-coding. As I coding needs no previous picture for decoding, then decoding and/or error recovery can begin at I coded information.
MPEG is effectively a toolkit. The bigger the MPEG-number, the more tools are available. However, there is no compulsion to use all the tools available. Thus an encoder may choose whether to use I or P coding, either once and for all or dynamically on a macroblock-by-macroblock basis. For practical reasons, an entire frame should be encoded as I macroblocks periodically. This creates a place where the bitstream might be edited or where decoding could begin.
Figure 5.23 shows a typical application of the Simple Profile of MPEG-2. Periodically an I picture is created. Between I pictures are P pictures which are based on the picture before. These P pictures predominantly contain macroblocks having vectors and prediction errors. However, it is perfectly legal for P pictures to contain I macroblocks. This might be useful where, for example, a camera pan introduces new material at the edge of the screen which cannot be created from an earlier picture.
Note that although what is sent is called a P picture, it is not a picture at all. It is a set of instructions to convert the previous picture into the current picture. If the previous picture is lost, decoding is impossible. An I picture together with all of the pictures before the next I picture form a group of pictures (GOP). There is no requirement for GOPs to be of constant size and indeed the size may vary dynamically.
5.9 Bidirectional coding
Motion-compensated predictive coding is a useful compression technique, but it does have the drawback that it can only take data from a previous picture.
Where moving objects reveal a background this is completely unknown in previous pictures and forward prediction fails. However, more of the background is visible in later pictures. Figure 5.24 shows the concept. In the centre of the diagram, a moving object has revealed some background. The previous picture can contribute nothing, whereas the next picture contains all that is required.
Bidirectional coding is shown in Figure 5.25. A bidirectional or B macroblock can be created using a combination of motion compensation and the addition of a prediction error. This can be done by forward prediction from a previous picture or backward prediction from a subsequent picture. It is also possible to use an average of both forward and backward prediction. On noisy material this may result in some reduction in bit rate. The technique is also a useful way of portraying a dissolve.
The averaging process in MPEG-1 and -2 is a simple linear interpolation which works well when only one B picture exists between the reference pictures before and after. A larger number of B pictures would require weighted interpolation but only AVC supports this.
Typically two B pictures are inserted between P pictures or between I and P pictures. As can be seen, B pictures are never predicted from one another, only from I or P pictures. A typical GOP for broadcasting purposes might have the structure IBBPBBPBBPBB. Note that the last B pictures in the GOP require the I picture in the next GOP for decoding and so the GOPs are not truly independent. Independence can be obtained by creating a closed GOP which may contain B pictures but which ends with a P picture. It is also legal to have a B picture in which every macroblock is forward predicted, needing no future picture for decoding.
In AVC, bidirectional coding may take picture data from more than one picture either side of the target picture.
Bidirectional coding is very powerful. Figure 5.26 is a constant quality curve showing how the bit rate changes with the type of coding. On the left, only I or spatial coding is used, whereas on the right an IBBP structure is used. This means that there are two bidirectionally coded pictures in between a spatially coded picture (I) and a forward predicted picture (P). Note how for the same quality the system which only uses spatial coding needs two and a half times the bit rate that the bidirectionally coded system needs.
Clearly information in the future has yet to be transmitted and so is not normally available to the decoder. MPEG gets around the problem by sending pictures in the wrong order. Picture reordering requires delay in the encoder and a delay in the decoder to put the order right again. Thus the overall codec delay must rise when bidirectional coding is used. This is quite consistent with Figure 1.6 which showed that as the compression factor rises the latency must also rise.
Figure 5.27 shows that although the original picture sequence is IBBPBBPBBIBB…, this is transmitted as IPBBPBBIBB… so that the future picture is already in the decoder before bidirectional decoding begins. Note that the I picture of the next GOP is actually sent before the last B pictures of the current GOP.
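The reordering can be expressed compactly: each run of B pictures is held back until the reference picture that follows it has been sent. The sketch below works on a string of picture types in display order and is purely illustrative.

    def display_to_transmission(display_order):
        """Reorder picture types (e.g. 'IBBPBBPBBIBB') so every B picture is
        preceded in the bitstream by both of the reference pictures it needs."""
        out, pending_b = [], []
        for p in display_order:
            if p == 'B':
                pending_b.append(p)
            else:                                  # I or P: send it, then the held-back Bs
                out.append(p)
                out.extend(pending_b)
                pending_b = []
        out.extend(pending_b)                      # these Bs really follow the next GOP's I
        return ''.join(out)

    print(display_to_transmission('IBBPBBPBBIBB'))   # -> 'IPBBPBBIBBBB'; compare Figure 5.27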
Figure 5.27 also shows that the amount of data required by each picture is dramatically different. I pictures have only spatial redundancy and so need a lot of data to describe them. P pictures need fewer data because they are created by shifting the I picture with vectors and then adding a prediction error picture. B pictures need the least data of all because they can be created from I or P.
With pictures requiring a variable length of time to transmit, arriving in the wrong order, the decoder needs some help. This takes the form of picture-type flags and time stamps which will be described in section 6.2.
5.10 Coding applications
Figure 5.28 shows a variety of possible GOP (group of pictures in MPEG-1 and MPEG-2) or GOV (group of video object planes in MPEG-4) structures. The simplest is the III…sequence in which every picture (or object in MPEG-4) is intra-coded. These can be fully decoded without reference to any other picture or object and so editing is straightforward. However, this approach requires about two-and-one-half times the bit rate of a full bidirectional system.
Bidirectional coding is most useful for final delivery of post-produced material either by broadcast or on prerecorded media as there is then no editing requirement. As a compromise the IBIB … structure can be used which has some of the bit rate advantage of bidirectional coding but without too much latency. It is possible to edit an IBIB stream by performing some processing. If it is required to remove the video following a B picture, that B picture could not be decoded because it needs I pictures either side of it for bidirectional decoding. The solution is to decode the B picture first, and then re-encode it with forward prediction only from the previous I picture. The subsequent I picture can then be replaced by an edit process. Some quality loss is inevitable in this process but this is acceptable in applications such as ENG and industrial video.
5.11 Intra-coding
Intra-coding or spatial compression in MPEG is used in I pictures on actual picture data and in P and B pictures on prediction error data. MPEG-1 and MPEG-2 use the discrete cosine transform described in section 3.13. In still pictures, MPEG-4 may also use the wavelet transform described in section 3.14.
Entering the spatial frequency domain has two main advantages. It allows dominant spatial frequencies which occur in real images to be efficiently coded, and it allows noise shaping to be used. As the eye is not uniformly sensitive to noise at all spatial frequencies, dividing the information up into frequency bands allows a different noise level to be produced in each.
The DCT works on blocks and in MPEG these are 8 × 8 pixels. Section 5.7 showed how the macroblocks of the motion-compensation structure are designed so they can be broken down into 8 × 8 DCT blocks. In a 4:2:0 macroblock there will be six DCT blocks whereas in a 4:2:2 macroblock there will be eight.
Figure 5.29 shows the table of basis functions or wave table for an 8 × 8 DCT. Adding these two-dimensional waveforms together in different proportions will give any original 8 × 8 pixel block. The coefficients of the DCT simply control the proportion of each wave which is added in the inverse transform. The top-left wave has no modulation at all because it conveys the DC component of the block. This coefficient will be a unipolar (positive only) value in the case of luminance and will typically be the largest value in the block as the spectrum of typical video signals is dominated by the DC component.
Increasing the DC coefficient adds a constant amount to every pixel. Moving to the right the coefficients represent increasing horizontal spatial frequencies and moving downwards the coefficients represent increasing vertical spatial frequencies. The bottom-right coefficient represents the highest diagonal frequencies in the block. All these coefficients are bipolar, where the polarity indicates whether the original spatial waveform at that frequency was inverted.
Figure 5.30 shows a one-dimensional example of an inverse transform. The DC coefficient produces a constant level throughout the pixel block. The remaining waves in the table are AC coefficients. A zero coefficient would result in no modulation, leaving the DC level unchanged. The wave next to the DC component represents the lowest frequency in the transform which is half a cycle per block. A positive coefficient would make the left side of the block brighter and the right side darker whereas a negative coefficient would do the opposite. The magnitude of the coefficient determines the amplitude of the wave which is added. Figure 5.30 also shows that the next wave has a frequency of one cycle per block, i.e. the block is made brighter at both sides and darker in the middle. Consequently an inverse DCT is no more than a process of mixing various pixel patterns from the wave table where the relative amplitudes and polarity of these patterns are controlled by the coefficients. The original transform is simply a mechanism which finds the coefficient amplitudes from the original pixel block.
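As an illustration of that mixing process, the sketch below builds each member of the 8 × 8 wave table and sums them in the proportions given by a coefficient block. It is a direct, slow statement of the inverse DCT using the usual orthonormal scaling, not the optimized form a real decoder would use.

    import numpy as np

    def dct_basis(u, v, n=8):
        """One entry of the n x n wave table: the 2-D cosine pattern selected by (v, u)."""
        y, x = np.mgrid[0:n, 0:n]
        return (np.cos((2 * x + 1) * u * np.pi / (2 * n)) *
                np.cos((2 * y + 1) * v * np.pi / (2 * n)))

    def inverse_dct(coeffs):
        """Rebuild a pixel block by adding the basis patterns, each weighted by its
        coefficient; the top-left (DC) coefficient sets the overall level."""
        n = coeffs.shape[0]
        block = np.zeros((n, n))
        for v in range(n):
            for u in range(n):
                cu = np.sqrt(1.0 / n) if u == 0 else np.sqrt(2.0 / n)
                cv = np.sqrt(1.0 / n) if v == 0 else np.sqrt(2.0 / n)
                block += cu * cv * coeffs[v, u] * dct_basis(u, v, n)
        return block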
It should be noted that the DCT is not a true spectral analysis like the DFT. If a cosine waveform enters a DCT the behaviour is DFT-like, but the DCT handles a sine waveform by generating a larger number of coefficients. This is shown in Figure 5.31.
The DCT itself achieves no compression at all. Sixty-four pixels are converted to sixty-four coefficients. However, in typical pictures, not all coefficients will have significant values; there will often be a few dominant coefficients. The coefficients representing the higher two-dimensional spatial frequencies will often be zero or of small value in large areas, due to blurring or simply plain undetailed areas before the camera.
Statistically, the further from the top-left corner of the wave table the coefficient is, the smaller will be its magnitude. Coding gain (the technical term for reduction in the number of bits needed) is achieved by transmitting the low-valued coefficients with shorter wordlengths. The zero-valued coefficients need not be transmitted at all. Thus it is not the DCT which compresses the data, it is the subsequent processing. The DCT simply expresses the data in a form that makes the subsequent processing easier.
Higher compression factors require the coefficient wordlength to be further reduced using requantizing. Coefficients are divided by some factor which increases the size of the quantizing step. The smaller number of steps which results permits coding with fewer bits, but of course with an increased quantizing error. The coefficients will be multiplied by a reciprocal factor in the decoder to return to the correct magnitude.
Inverse transforming a requantized coefficient means that the frequency it represents is reproduced in the output with the wrong amplitude. The difference between original and reconstructed amplitude is regarded as a noise added to the wanted data. Figure 5.10 showed that the visibility of such noise is far from uniform. The maximum sensitivity is found at DC and falls thereafter. As a result the top-left coefficient is often treated as a special case and left unchanged. It may warrant more error protection than other coefficients.
Transform coding takes advantage of the falling sensitivity to noise. Prior to requantizing, each coefficient is divided by a different weighting constant as a function of its frequency. Figure 5.32 shows a typical weighting process. Naturally the decoder must have a corresponding inverse weighting. This weighting process has the effect of reducing the magnitude of high-frequency coefficients disproportionately. Clearly different weighting will be needed for colour difference data as colour is perceived differently.
P and B pictures are decoded by adding a prediction error image to a reference image. That reference image will contain weighted noise. One purpose of the prediction error is to cancel that noise to prevent tolerance build-up. If the prediction error were also to contain weighted noise this result would not be obtained. Consequently prediction error coefficients are flat weighted.
When forward prediction fails, such as in the case of new material introduced in a P picture by a pan, P coding would set the vectors to zero and encode the new data entirely as an unweighted prediction error. In this case it is better to encode that material as an I macroblock because then weighting can be used and this will require fewer bits.
Requantizing increases the step size of the coefficients, but the inverse weighting in the decoder results in step sizes which increase with frequency. The larger step size increases the quantizing noise at high frequencies where it is less visible. Effectively the noise floor is shaped to match the sensitivity of the eye. The quantizing table in use at the encoder can be transmitted to the decoder periodically in the bitstream.
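The weighting and requantizing round trip can be sketched as below. The weight matrix shown simply grows with distance from the DC term and is illustrative only; MPEG defines default matrices and allows others to be sent in the bitstream, and the real arithmetic includes scaling details omitted here.

    import numpy as np

    # Illustrative weights rising with horizontal-plus-vertical frequency.
    weights = 8 + 2 * np.add.outer(np.arange(8), np.arange(8))

    def requantize(coeffs, quant_scale):
        """Encoder side: divide by the weights and the quantizer scale, then round."""
        return np.round(coeffs / (weights * quant_scale)).astype(np.int32)

    def dequantize(levels, quant_scale):
        """Decoder side: multiply back; the rounding error left behind is quantizing
        noise whose level rises with frequency, where the eye is least sensitive."""
        return levels * weights * quant_scale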
5.12 Intra-coding in MPEG-1 and MPEG-2
Study of the signal statistics gained from extensive analysis of real material is used to measure the probability of a given coefficient having a given value. This probability turns out to be highly non-uniform suggesting the possibility of a variable-length encoding for the coefficient values. On average, the higher the spatial frequency, the lower the value of a coefficient will be. This means that the value of a coefficient tends to fall as a function of its radius from the DC coefficient. DC coefficients also have certain characteristics. As they represent the average brightness of a block, in many picture areas, adjacent blocks will have similar DC coefficient values. This can be exploited using differential coding which will be dealt with in section 5.14. However, errors in the magnitude of DC coefficients cause wide area flicker in the picture and so the coding must be accurate.
Typical material often has many coefficients which are zero valued, especially after requantizing. The distribution of these also follows a pattern. The non-zero values tend to be found in the top-left-hand corner of the DCT block, but as the radius increases, not only do the coefficient values fall, but it becomes increasingly likely that these small coefficients will be interspersed with zero-valued coefficients. As the radius increases further it is probable that a region where all coefficients are zero will be entered.
MPEG uses all these attributes of DCT coefficients when encoding a coefficient block. By sending the coefficients in an optimum order, by describing their values with Huffman coding and by using run-length encoding for the zero-valued coefficients it is possible to achieve a significant reduction in coefficient data which remains entirely lossless. Despite the complexity of this process, it does contribute to improved picture quality because for a given bit rate lossless coding of the coefficients must be better than requantizing, which is lossy. Of course, for lower bit rates both will be required.
It is an advantage to scan in a sequence where the largest coefficient values are scanned first. Then the next coefficient is more likely to be zero than the previous one. With progressively scanned material, a regular zig-zag scan begins in the top-left corner and ends in the bottom-right corner as shown in Figure 5.33. Zig-zag scanning means that significant values are more likely to be transmitted first, followed by the zero values. Instead of coding these zeros, a unique ‘end of block’ (EOB) symbol is transmitted.
As the zig-zag scan approaches the last finite coefficient it is increasingly likely that some zero value coefficients will be scanned. Instead of transmitting the coefficients as zeros, the zero-run-length, i.e. the number of zero-valued coefficients in the scan sequence, is encoded into the next non-zero coefficient which is itself variable-length coded. This combination of run-length and variable-length coding is known as RLC/VLC in MPEG.
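A sketch of the scan and run-length stage follows. The ordering function generates the classic zig-zag sequence of Figure 5.33, and the coder emits (run, level) pairs for the AC coefficients followed by an end-of-block marker; the Huffman stage that follows in MPEG is omitted here.

    import numpy as np

    def zigzag_order(n=8):
        """Coefficient positions sorted along anti-diagonals, alternating direction."""
        return sorted(((v, u) for v in range(n) for u in range(n)),
                      key=lambda p: (p[0] + p[1],
                                     p[0] if (p[0] + p[1]) % 2 else -p[0]))

    def run_length_code(block):
        """Scan the AC coefficients in zig-zag order and emit (zero-run, level) pairs,
        with the trailing zeros replaced by a single EOB symbol."""
        scan = [int(block[v, u]) for v, u in zigzag_order()][1:]   # DC handled separately
        out, run = [], 0
        for level in scan:
            if level == 0:
                run += 1
            else:
                out.append((run, level))
                run = 0
        out.append('EOB')
        return out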
The DC coefficient is handled separately because it is differentially coded and this discussion relates to the AC coefficients. Three items need to be handled for each coefficient: the zero-run-length prior to this coefficient, the wordlength and the coefficient value itself. The wordlength needs to be known by the decoder so that it can correctly parse the bitstream. The wordlength of the coefficient is expressed directly as an integer called the size.
Figure 5.34(a) shows that a two-dimensional run/size table is created. One dimension expresses the zero-run-length; the other the size. A run length of zero is obtained when adjacent coefficients are non-zero, but a code of 0/0 has no meaningful run/size interpretation and so this bit pattern is used for the end-of-block (EOB) symbol.
In the case where the zero-run-length exceeds 14, a code of 15/0 is used signifying that there are fifteen zero-valued coefficients. This is then followed by another run/size parameter whose run-length value is added to the previous fifteen.
The run/size parameters contain redundancy because some combinations are more common than others. Figure 5.34(b) shows that each run/size value is converted to a variable-length Huffman codeword for transmission. As was shown in section 1.5, the Huffman codes are designed so that short codes are never a prefix of long codes so that the decoder can deduce the parsing by testing an increasing number of bits until a match with the look-up table is found. Having parsed and decoded the Huffman run/size code, the decoder then knows what the coefficient wordlength will be and can correctly parse that.
The variable-length coefficient code has to describe a bipolar coefficient, i.e. one which can be positive or negative. Figure 5.34(c) shows that for a particular size, the coding scale has a certain gap in it. For example, all values from −7 to +7 can be sent by codes of size 3 or smaller, so a size 4 code only has to send the values of −15 to −8 and +8 to +15. The coefficient code is sent as a pure binary number whose value ranges from all zeros to all ones where the maximum value is a function of the size. The number range is divided into two, the lower half of the codes specifying negative values and the upper half specifying positive.
In the case of positive numbers, the transmitted binary value is the actual coefficient value, whereas in the case of negative numbers a constant must be subtracted which is a function of the size. In the case of a size 4 code, the constant is 15 (decimal). Thus a size 4 parameter of binary 0111 (decimal 7) would be interpreted as 7 − 15 = −8. A size 5 code has a constant of 31, so a transmitted code of binary 01010 (decimal 10) would be interpreted as 10 − 31 = −21.
This technique saves a bit because, for example, the 63 values from −31 to +31 can be conveyed by codes of no more than five bits, even though five bits offer only 32 combinations. This is possible because the extra bit is effectively encoded into the run/size parameter.
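The size and value arithmetic just described can be sketched directly; the Huffman table that carries the run/size pair is standardized in MPEG and is not reproduced here.

    def code_value(value):
        """Return (size, bits) for a non-zero coefficient. 'size' is the wordlength;
        negative values have (2**size - 1) added so they occupy the lower half."""
        size = abs(value).bit_length()
        bits = value if value > 0 else value + (1 << size) - 1
        return size, bits

    def decode_value(size, bits):
        """Invert code_value: codes in the upper half are positive, lower half negative."""
        return bits if bits >= (1 << (size - 1)) else bits - (1 << size) + 1

    assert code_value(-8) == (4, 7)      # 0111 in a size 4 field, as in the text
    assert decode_value(5, 10) == -21    # 01010 in a size 5 field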
Figure 5.35 shows the entire MPEG-1 and MPEG-2 spatial coding subsystem. Macroblocks are subdivided into DCT blocks and the DCT is calculated. The resulting coefficients are multiplied by the weighting matrix and then requantized. The coefficients are then reordered by the zig-zag scan so that full advantage can be taken of run-length and variable-length coding. The last non-zero coefficient in the scan is followed by the EOB symbol.
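The zig-zag reordering and run-length stage just described can be sketched as follows. The sketch is illustrative only; real coders use standardized scan tables and Huffman tables, and the DC coefficient is handled separately.

    # Sketch of the zig-zag reordering and run-length stage for one 8 x 8 block.

    def zigzag_order(n=8):
        """Generate the classic zig-zag scan order for an n x n block."""
        return sorted(((r, c) for r in range(n) for c in range(n)),
                      key=lambda rc: (rc[0] + rc[1],
                                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    def run_length_symbols(block):
        """Turn an 8 x 8 coefficient block into (run, size, value) symbols ending with 'EOB'."""
        scan = [block[r][c] for r, c in zigzag_order()][1:]   # DC coefficient handled separately
        symbols, run = [], 0
        for coeff in scan:
            if coeff == 0:
                run += 1                       # count zero-valued coefficients
            else:
                symbols.append((run, abs(coeff).bit_length(), coeff))
                run = 0
        symbols.append('EOB')                  # trailing zeros are replaced by end-of-block
        return symbols

    block = [[0] * 8 for _ in range(8)]
    block[0][0], block[0][1], block[2][0] = 90, 12, -3
    print(run_length_symbols(block))           # [(0, 4, 12), (1, 2, -3), 'EOB']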
In predictive coding, sometimes the motion-compensated prediction is nearly exact and so the prediction error will be almost zero. This can also happen on still parts of the scene. MPEG takes advantage of this by sending a code to tell the decoder there is no prediction error data for the macroblock concerned.
The success of temporal coding depends on the accuracy of the vectors. Trying to reduce the bit rate by reducing the accuracy of the vectors is false economy as this simply increases the prediction error. Consequently, for a given GOP structure, it is only in the spatial coding that the overall bit rate can be adjusted. The RLC/VLC coding is lossless and so its contribution to the compression cannot be varied. If the bit rate is too high, the only option is to increase the size of the coefficient-requantizing steps. This has the effect of shortening the wordlength of large coefficients and rounding small coefficients to zero, so that the bit rate goes down. Clearly if taken too far the picture quality will also suffer because at some point the noise floor will become visible as some form of artifact.
5.13 A bidirectional coder
MPEG does not specify how an encoder is to be built or what coding decisions it should make. Instead it specifies the protocol of the bitstream at the output. As a result the coder shown in Figure 5.36 is only an example. Conceptually MPEG-2 Main Profile coding with a progressive scan input is almost identical to that of MPEG-1 as the differences are to be found primarily in the picture size and the resulting bit rate. The coder shown here will create MPEG-1 and MPEG-2 bidirectionally coded bitstreams from progressively scanned inputs. It also forms the basis of interlaced MPEG-2 coding and for MPEG-4 texture coding, both of which will be considered later.
Figure 5.36(a) shows the component parts of the coder. At the input is a chain of picture stores which can be bypassed for reordering purposes. This allows a picture to be encoded ahead of its normal timing when bidirectional coding is employed.
At the centre is a dual-motion estimator which can simultaneously measure motion between the input picture, an earlier picture and a later picture. These reference pictures are held in frame stores. The vectors from the motion estimator are used locally to shift a picture in a frame store to form a predicted picture. This is subtracted from the input picture to produce a prediction error picture which is then spatially coded. The bidirectional encoding process will now be described. A GOP begins with an I picture which is intra-coded. In Figure 5.36(b) the I picture emerges from the reordering delay. No prediction is possible on an I picture so the motion estimator is inactive. There is no predicted picture and so the prediction error subtractor is set simply to pass the input. The only processing which is active is the forward spatial coder which describes the picture with DCT coefficients. The output of the forward spatial coder is locally decoded and stored in the past picture frame store.
The reason for the spatial encode/decode is that the past picture frame store now contains exactly what the decoder frame store will contain, including the effects of any requantizing errors. When the same picture is used as a reference at both ends of a differential coding system, the errors will cancel out.
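The need for the local decode can be shown with a deliberately simplified sketch, in which ‘pictures’ are single numbers and the spatial coder is a scalar quantizer. Because the encoder predicts from its own decoded output, exactly as the decoder will, the requantizing errors cancel rather than accumulate. This is a toy model, not MPEG code.

    # Toy sketch (1-D, scalar 'pictures') of closed-loop prediction.

    def quantize(x, step=4):
        return round(x / step) * step          # models the lossy requantizing stage

    def encode(sequence):
        reference = 0                          # what the decoder's store will hold
        transmitted = []
        for value in sequence:
            residual = quantize(value - reference)   # prediction error, requantized
            transmitted.append(residual)
            reference = reference + residual         # local decode: mirror the decoder
        return transmitted

    def decode(transmitted):
        reference, out = 0, []
        for residual in transmitted:
            reference = reference + residual
            out.append(reference)
        return out

    source = [10, 23, 37, 52, 61]
    print(decode(encode(source)))    # tracks the source within half a quantizing step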
Having encoded the I picture, attention turns to the P picture. The input sequence is IBBP, but the transmitted sequence must be IPBB. Figure 5.36(c) shows that the reordering delay is bypassed to select the P picture. This passes to the motion estimator which compares it with the I picture and outputs a vector for each macroblock. The forward predictor uses these vectors to shift the I picture so that it more closely resembles the P picture. The predicted picture is then subtracted from the actual picture to produce a forward prediction error. This is then spatially coded. Thus the P picture is transmitted as a set of vectors and a prediction error image.
The P picture is locally decoded in the right-hand decoder. This takes the forward-predicted picture and adds the decoded prediction error to obtain exactly what the decoder will obtain.
Figure 5.36(d) shows that the encoder now contains an I picture in the left store and a P picture in the right store. The reordering delay is reselected so that the first B picture can be input. This passes to the motion estimator where it is compared with both the I and P pictures to produce forward and backward vectors. The forward vectors go to the forward predictor to make a B prediction from the I picture. The backward vectors go to the backward predictor to make a B prediction from the P picture. These predictions are simultaneously subtracted from the actual B picture to produce a forward prediction error and a backward prediction error. These are then spatially encoded. The encoder can then decide which direction of coding resulted in the best prediction, i.e. the smallest prediction error.
In the encoder shown, motion estimation sometimes takes place between an input picture and a decoded picture. With increased complexity, the original input pictures may be stored for motion-estimation purposes. As these contain no compression artifacts the motion vectors may be more accurate.
Not shown in the interests of clarity is a third signal path which creates a predicted B picture from the average of forward and backward predictions. This is subtracted from the input picture to produce a third prediction error. In some circumstances this prediction error may use fewer data than either forward or backward prediction alone.
As B pictures are never used to create other pictures, the decoder does not locally decode the B picture. After decoding and displaying the B picture the decoder will discard it. At the encoder the I and P pictures remain in their frame stores and the second B picture is input from the reordering delay.
Following the encoding of the second B picture, the encoder must reorder again to encode the second P picture in the GOP. This will be locally decoded and will replace the I picture in the left store. The stores and predictors switch designation because the left store is now a future P picture and the right store is now a past P picture. B pictures between them are encoded as before.
5.14 Slices
In MPEG-1 and MPEG-2 I pictures and in MPEG-4 I-VOPs, the DC coefficient describes the average brightness of an entire DCT block. In real video the DC component of adjacent blocks will be similar much of the time. A saving in bit rate can be obtained by differentially coding the DC coefficient.
In P and B pictures this is not done because these are prediction errors, not actual images, and the statistics are different. However, P and B pictures send vectors, and instead the redundancy in these is exploited. In a large moving object, many macroblocks will be moving at the same velocity and their vectors will be the same. Thus differential vector coding will be advantageous.
As has been seen above, differential coding cannot be used indiscriminately as it is prone to error propagation. Periodically absolute DC coefficients and vectors must be sent and the slice is the logical structure which supports this mechanism. In I coding, the first DC coefficient in a slice is sent in absolute form, whereas the subsequent coefficients are sent differentially. In P or B coding, the first vector in a slice is sent in absolute form, but the subsequent vectors are differential.
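The slice mechanism for differential values can be sketched as follows. The code is illustrative only; the ‘values’ stand for DC coefficients in I coding or for vectors in P and B coding, and the slice length is arbitrary.

    # Sketch of differential coding with an absolute restart at each slice boundary.

    def encode_slices(values, slice_length):
        coded = []
        for i, v in enumerate(values):
            if i % slice_length == 0:
                coded.append(('abs', v))                    # first value in a slice: absolute
            else:
                coded.append(('diff', v - values[i - 1]))   # otherwise: difference
        return coded

    def decode_slices(coded):
        out, previous = [], None
        for kind, v in coded:
            previous = v if kind == 'abs' else previous + v
            out.append(previous)
        return out

    dc = [120, 122, 125, 60, 61, 63]
    assert decode_slices(encode_slices(dc, 3)) == dc

A transmission error corrupts decoding only until the next absolute value, which is why shorter slices give faster recovery at the cost of more overhead.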
Slices are horizontal picture strips which are one macroblock (16 pixels) high and which proceed from left to right across the screen and may run on to the next horizontal row. The encoder is free to decide how big slices should be and where they begin and end.
In the case of a central dark building silhouetted against the bright sky, there would be two large changes in the DC coefficients, one at each edge of the building. It may be advantageous to the encoder to break the width of the picture into three slices, one each for the left and right areas of sky and one for the building. In the case of a large moving object, different slices may be used for the object and the background.
Each slice contains its own synchronizing pattern, so following a transmission error, correct decoding can resume at the next slice. Slice size can also be matched to the characteristics of the transmission channel. For example, in an error-free transmission system the use of a large number of slices in a packet simply wastes data capacity on surplus synchronizing patterns. However, in a non-ideal system it might be advantageous to have frequent resynchronizing. In DVB, for example, additional constraints are placed on the slice size.
5.15 Handling interlaced pictures
MPEG-1 does not support interlace, whereas MPEG-2 and MPEG-4 do. Spatial coding, predictive coding and motion compensation can still be performed using interlaced source material at the cost of considerable complexity. Despite that complexity, MPEG cannot be expected to perform as well with interlaced material.
Figure 5.37 shows that in an incoming interlaced frame there are two fields each of which contains half of the lines in the frame. In MPEG-2 and MPEG-4 these are known as the top field and the bottom field. In video from a camera, these fields represent the state of the image at two different times. Where there is little image motion, this is unimportant and the fields can be combined, obtaining more effective compression. However, in the presence of motion the fields become increasingly decorrelated because of the displacement of moving objects from one field to the next.
This characteristic means that MPEG must be able to handle fields either independently or together. This dual approach permeates all aspects of MPEG and affects the definition of pictures, macroblocks, DCT blocks and zig-zag scanning.
Figure 5.37 also shows how MPEG-2 designates interlaced fields. In picture types I, P and B, the two fields can be superimposed to make a frame-picture or the two fields can be coded independently as two field-pictures. As a third possibility, in I pictures only, the bottom field-picture can be predictively coded from the top field-picture to make an IP frame picture.
A frame-picture is one in which the macroblocks contain lines from both field types over a picture area 16 scan lines high. Each luminance macroblock contains the usual four DCT blocks but there are two ways in which these can be assembled. Figure 5.38(a) shows how a frame is divided into frame DCT blocks. This is identical to the progressive scan approach in that each DCT block contains eight contiguous picture lines. In 4:2:0, the colour difference signals have been downsampled by a factor of two and shifted as was shown in Figure 2.18. Figure 5.38(a) also shows how one 4:2:0 DCT block contains the chroma data from 16 lines in two fields.
Even small amounts of motion in any direction can destroy the correlation between odd and even lines and a frame DCT will result in an excessive number of coefficients. Figure 5.38(b) shows that instead the luminance component of a frame can also be divided into field DCT blocks. In this case one DCT block contains odd lines and the other contains even lines. In this mode the chroma still produces one DCT block from both fields as in Figure 5.38(a).
When an input frame is designated as two field-pictures, the macroblocks come from a screen area which is 32 lines high. Figure 5.38(c) shows that the DCT blocks contain the same data as if the input frame had been designated a frame-picture but with field DCT. Consequently it is only frame-pictures which have the option of field or frame DCT. These may be selected by the encoder on a macroblock-by-macroblock basis and, of course, the resultant bitstream must specify what has been done.
In a frame which contains a small moving area, it may be advantageous to encode as a frame-picture with frame DCT except in the moving area where field DCT is used. This approach may result in fewer bits than coding as two field-pictures.
In a field-picture and in a frame-picture using field DCT, a DCT block contains lines from one field type only and this must have come from a screen area 16 scan lines high, whereas in progressive scan and frame DCT the area is only eight scan lines high. A given vertical spatial frequency in the image is sampled at points twice as far apart which is interpreted by the field DCT as a doubled spatial frequency, whereas there is no change in the horizontal spectrum.
Following the DCT calculation, the coefficient distribution will be different in field-pictures and field DCT frame-pictures. In these cases, the probability of coefficients is not a constant function of radius from the DC coefficient as it is in progressive scan, but is elliptical where the ellipse is twice as high as it is wide.
Using the standard 45° zig-zag scan with this different coefficient distribution would not have the required effect of putting all the significant coefficients at the beginning of the scan. To achieve this requires a different zig-zag scan, which is shown in Figure 5.39. This scan, sometimes known as the Yeltsin walk, attempts to match the elliptical probability of interlaced coefficients with a scan slanted at 67.5° to the vertical.
Motion estimation is more difficult in an interlaced system. Vertical detail can result in differences between fields and this reduces the quality of the match. Fields are vertically subsampled without filtering and so contain alias products. This aliasing will mean that the vertical waveform representing a moving object will not be the same in successive pictures and this will also reduce the quality of the match.
Even when the correct vector has been found, the match may be poor so the estimator fails to recognize it. If it is recognized, a poor match means that the quality of the prediction in P and B pictures will be poor and so a large prediction error or residual has to be transmitted. In an attempt to reduce the residual, MPEG-2 allows field-pictures to use motion-compensated prediction from either the adjacent field or from the same field type in another frame. In this case the encoder will use the better match. This technique can also be used in areas of frame-pictures which use field DCT.
The motion compensation of MPEG-2 has half-pixel resolution and this is inherently compatible with interlace because an interpolator must be present to handle the half-pixel shifts. Figure 5.40(a) shows that in an interlaced system each field contains half of the frame lines, so interpolating half-way between lines of one field type actually creates values lying on the sampling structure of the other field type. Thus it is equally possible for a predictive system to decode a given field type based on pixel data from the other field type or from the same field type.
If when using predictive coding from the other field type the vertical motion vector contains a half-pixel component, then no interpolation is needed because the act of transferring pixels from one field to another results in such a shift.
Figure 5.40(b) shows that a macroblock in a given P field-picture can be encoded using a vector which shifts data from the previous field or from the field before that, irrespective of which frames these fields occupy. As noted above, field-picture macroblocks come from an area of screen 32 lines high and this means that the vector density is halved, resulting in larger prediction errors at the boundaries of moving objects.
As an option, field-pictures can restore the vector density by using 16 × 8 motion compensation where separate vectors are used for the top and bottom halves of the macroblock. Frame-pictures can also use 16 × 8 motion compensation in conjunction with field DCT. Whilst the 2 × 2 DCT block luminance structure of a macroblock can easily be divided vertically in two, in 4:2:0 the same screen area is represented by only one chroma macroblock of each component type. As it cannot be divided in half, this chroma is deemed to belong to the luminance DCT blocks of the upper field. In 4:2:2 no such difficulty arises.
MPEG supports interlace simply because interlaced video exists in legacy systems and there is a requirement to compress it. However, where the opportunity arises to define a new system, interlace should be avoided. Legacy interlaced source material should be handled using a motion-compensated de-interlacer prior to compression in the progressive domain.
5.16 MPEG-1 and MPEG-2 coders
Figure 5.41 shows a complete coder. The bidirectional coder outputs coefficients and vectors, and the quantizing table in use. The vectors of P and B pictures and the DC coefficients of I pictures are differentially encoded in slices and the remaining coefficients are RLC/VLC coded. The multiplexer assembles all these data into a single bitstream called an elementary stream. The output of the encoder is a buffer which absorbs the variations in bit rate between different picture types. The buffer output has a constant bit rate determined by the demand clock. This comes from the transmission channel or storage device. If the bit rate is low, the buffer will tend to fill up, whereas if it is high the buffer will tend to empty. The buffer content is used to control the severity of the requantizing in the spatial coders. The more the buffer fills, the bigger the requantizing steps get.
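The feedback from buffer occupancy to requantizing severity might be sketched as below. The control law and all the numbers are invented for illustration; real encoders use far more sophisticated rate control.

    # Minimal sketch of rate control: the fuller the buffer, the coarser the requantizing.

    def next_quantizer_step(buffer_occupancy, buffer_size, min_step=2, max_step=62):
        """Map buffer fullness (in bits) to a requantizer step size."""
        fullness = buffer_occupancy / buffer_size          # 0.0 (empty) .. 1.0 (full)
        return round(min_step + fullness * (max_step - min_step))

    def simulate(picture_bits, channel_bits_per_picture, buffer_size):
        occupancy = buffer_size // 2
        for bits in picture_bits:
            step = next_quantizer_step(occupancy, buffer_size)
            occupancy += bits - channel_bits_per_picture   # encoder-side buffer model
            occupancy = max(0, min(occupancy, buffer_size))
            print(f'step {step:2d}  occupancy {occupancy}')

    # A large I picture followed by smaller P and B pictures at a fixed channel rate.
    simulate([400000, 120000, 80000, 80000], 150000, 1835008)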
The buffer in the decoder has a finite capacity and the encoder must model the decoder’s buffer occupancy so that it neither overflows nor underflows. An overflow might occur if an I picture is transmitted when the buffer content is already high. The buffer occupancy of the decoder depends somewhat on the memory access strategy of the decoder. Instead of defining a specific buffer size, MPEG-2 defines the size of a particular mathematical model of a hypothetical buffer. The decoder designer can use any strategy which implements the model, and the encoder can use any strategy which doesn’t overflow or underflow the model. The elementary stream has a parameter called the video buffer verifier (VBV) which defines the minimum buffering assumptions of the encoder.
As was seen in Chapter 1, buffering is one way of ensuring constant quality when picture entropy varies. An intelligent coder may run down the buffer contents in anticipation of a difficult picture sequence so that a large amount of data can be sent.
MPEG-2 does not define what a decoder should do if a buffer underflow or overflow occurs, but since both irrecoverably lose data it is obvious that there will be more or less of an interruption to the decoding. Even a small loss of data may cause loss of synchronization and in the case of long GOP the lost data may make the rest of the GOP undecodable. A decoder may choose to repeat the last properly decoded picture until it can begin to operate correctly again.
Buffer problems occur if the VBV model is violated. If this happens then more than one underflow or overflow can result from a single violation. Switching an MPEG bitstream can cause a violation because the two encoders concerned may have radically different buffer occupancy at the switch.
5.17 The elementary stream
Figure 5.42 shows the structure of the elementary stream from an MPEG-2 encoder. The structure begins with a set of coefficients representing a DCT block. Six or eight DCT blocks form the luminance and chroma content of one macroblock. In P and B pictures a macroblock will be associated with a vector for motion compensation. Macroblocks are associated into slices in which DC coefficients of I pictures and vectors in P and B pictures are differentially coded. An arbitrary number of slices forms a picture and this needs I/P/B flags describing the type of picture it is. The picture may also have a global vector which efficiently deals with pans.
Several pictures form a group of pictures (GOP). The GOP begins with an I picture and may or may not include P and B pictures in a structure which may vary dynamically.
Several GOPs form a sequence which begins with a sequence header containing important data to help the decoder. It is possible to repeat the header within a sequence, and this helps lock-up in random access applications. The sequence header describes the MPEG-2 profile and level, whether the video is progressive or interlaced, whether the chroma is 4:2:0 or 4:2:2, the size of the picture and the aspect ratio of the pixels. The quantizing matrix used in the spatial coder can also be sent. The sequence begins with a standardized bit pattern which is detected by a decoder to synchronize the deserialization.
5.18 An MPEG-2 decoder
The decoder is only defined by implication from the definitions of syntax and any decoder which can correctly interpret all combinations of syntax at a particular profile will be deemed compliant however it works. Compliant MPEG-2 decoders must be able to decode an MPEG-1 elementary stream. This is not particularly difficult as MPEG-2 uses all the tools of MPEG-1 and simply adds more of its own. Consequently an MPEG-1 bitstream can be thought of as a subset of an MPEG-2 bitstream.
The first problem a decoder has is that the input is an endless bitstream which contains a huge range of parameters many of which have variable length. Unique synchronizing patterns must be placed periodically throughout the bitstream so that the decoder can identify known starting points. The pictures which can be sent under MPEG are so flexible that the decoder must first find a sequence header so that it can establish the size of the picture, the frame rate, the colour coding used, etc.
The decoder must also be supplied with a 27 MHz system clock. In a DVD player, this would come from a crystal, but in a transmission system this would be provided by a numerically locked loop running from a clock reference parameter in the bitstream (see Chapter 6). Until this loop has achieved lock the decoder cannot function properly.
Figure 5.43 shows a bidirectional decoder. The decoder can only begin decoding with an I picture and as this only uses intra-coding there will be no vectors. An I picture is transmitted as a series of slices which begin with subsidiary synchronizing patterns. The first macroblock in the slice contains an absolute DC coefficient, but the remaining macroblocks code the DC coefficient differentially, so the decoder must add each differential value to the previous value to obtain the absolute value.
The AC coefficients are sent as Huffman coded run/size parameters followed by coefficient value codes. The variable-length Huffman codes are decoded by using a look-up table and extending the number of bits considered until a match is obtained. This allows the zero-run-length and the coefficient size to be established. The right number of bits is taken from the bitstream corresponding to the coefficient code and this is decoded to the actual coefficient using the size parameter.
If the correct number of bits has been taken from the stream, the next bit must be the beginning of the next run/size code and so on until the EOB symbol is reached. The decoder uses the coefficient values and the zero-run-lengths to populate a DCT coefficient block following the appropriate zig-zag scanning sequence. Following EOB, the bitstream then continues with the next DCT block. Clearly this Huffman decoding will work perfectly or not at all. A single bit slippage in synchronism or a single corrupted data bit can cause a spectacular failure.
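The parsing strategy can be sketched as follows. The tiny code table is invented for illustration; the real run/size Huffman tables are defined in the standard, and after each run/size symbol the decoder would take ‘size’ further bits for the coefficient value.

    # Sketch of parsing variable-length codes by extending the bit count until a match is found.

    TOY_TABLE = {            # bit pattern -> symbol; '10' plays the part of EOB here
        '0':    (0, 1),
        '10':   'EOB',
        '110':  (0, 2),
        '1110': (1, 1),
    }

    def parse(bits):
        symbols, window = [], ''
        for b in bits:
            window += b
            if window in TOY_TABLE:          # keep extending until a codeword matches
                symbols.append(TOY_TABLE[window])
                window = ''
                if symbols[-1] == 'EOB':
                    break
        return symbols

    print(parse('0 110 1110 10'.replace(' ', '')))   # [(0, 1), (0, 2), (1, 1), 'EOB']

Because the codes are prefix-free, a correct bitstream parses unambiguously; a single corrupted bit, however, can send the parser down a completely wrong path, which is the failure mode described above.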
Once a complete DCT coefficient block has been received, the coefficients need to be inverse quantized and inverse weighted. Then an inverse DCT can be performed and this will result in an 8 × 8 pixel block. A series of DCT blocks will allow the luminance and colour information for an entire macroblock to be decoded and this can be placed in a frame store. Decoding continues in this way until the end of the slice when an absolute DC coefficient will once again be sent. Once all the slices have been decoded, an entire picture will be resident in the frame store.
The amount of data needed to decode the picture is variable and the decoder simply keeps going until the last macroblock is found, obtaining data from the input buffer as required. In a constant bit rate transmission system, the decoder will remove more data to decode an I picture than has been received in one picture period, leaving the buffer emptier than it began. Subsequent P and B pictures need far fewer data and allow the buffer to fill again.
The picture will be output when the time stamp (see Chapter 6) sent with the picture matches the state of the decoder’s time count.
Following the I picture may be another I picture or a P picture. Assuming a P picture, this will be predictively coded from the I picture. The P picture will be divided into slices as before. The first vector in a slice is absolute, but subsequent vectors are sent differentially. However, the DC coefficients are not differential.
Each macroblock may contain a forward vector. The decoder uses this to shift pixels from the I picture into the correct position for the predicted P picture. The vectors have half-pixel resolution and where a half-pixel shift is required, an interpolator will be used.
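Half-pixel prediction by averaging neighbouring reference pixels may be sketched as follows; edge handling and the exact rounding rules are simplified for illustration.

    # Sketch of half-pixel motion-compensated prediction for one pixel.

    def predict_pixel(ref, x, y, vx_half, vy_half):
        """Fetch a predicted pixel for a vector given in half-pixel units."""
        ix, iy = vx_half // 2, vy_half // 2            # integer part of the shift
        fx, fy = vx_half % 2, vy_half % 2              # half-pixel remainders
        a = ref[y + iy][x + ix]
        b = ref[y + iy][x + ix + fx]
        c = ref[y + iy + fy][x + ix]
        d = ref[y + iy + fy][x + ix + fx]
        return (a + b + c + d + 2) // 4                # averages 1, 2 or 4 pixels with rounding

    ref = [[10, 20], [30, 40]]
    print(predict_pixel(ref, 0, 0, 1, 1))              # half-pel in both axes -> 25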
The DCT data are sent much as for an I picture. It will require inverse quantizing, but not inverse weighting because P and B coefficients are flat-weighted. When decoded this represents an error-cancelling picture which is added pixel-by-pixel to the motion-predicted picture. This results in the output picture.
If bidirectional coding is being used, the P picture may be stored until one or more B pictures have been decoded. The B pictures are sent essentially as a P picture might be, except that the vectors can be forward, backward or bidirectional. The decoder must take pixels from the I picture, the P picture, or both, and shift them according to the vectors to make a predicted picture. The DCT data decode to produce an error-cancelling image as before.
In an interlaced system, the prediction mechanism may alternatively obtain pixel data from the previous field or the field before that. Vectors may relate to macroblocks or to 16 × 8 pixel areas. DCT blocks after decoding may represent frame lines or field lines. This adds up to a lot of different possibilities for a decoder handling an interlaced input.
5.19 MPEG-4 and AVC
As was seen in Chapter 1, MPEG-4 advances the coding art in a number of ways. Whereas MPEG-1 and MPEG-2 were directed only to coding the video pictures which resulted after shooting natural scenes or from computer synthesis, MPEG-4 also moves further back in the process of how those scenes were created. For example, the rotation of a detailed three-dimensional object before a video camera produces huge changes in the video from picture to picture which MPEG-2 would find difficult to code. Instead, if the three-dimensional object is re-created at the decoder, rotation can be portrayed by transmitting a trivially small amount of vector data.
If the above object is synthetic, effectively the synthesis or rendering process is completed in the decoder. However, a suitable if complex image processor at the encoder could identify such objects in natural scenes. MPEG-4 objects are defined as a part of a scene that can independently be accessed or manipulated. An object is an entity that exists over a certain time span. The pictures of conventional imaging become object planes in MPEG-4. Where an object intersects an object plane, it can be described by the coding system using intra-coding, forward prediction or bidirectional prediction.
Figure 5.44 shows that MPEG-4 has four object types. A video object is an arbitrarily shaped planar pixel array describing the appearance or texture of part of a scene. A still texture object or sprite is a planar video object in which there is no change with respect to time. A mesh object describes a two- or three-dimensional shape as a set of points. The shape and its position can change with respect to time. Using computer graphics techniques, texture can be mapped onto meshes, a process known as warping, to produce rendered images.
Using two-dimensional warping, a still texture object can be made to move. In three-dimensional graphic rendering, mesh coding allows an arbitrary solid shape to be created which is then covered with texture. Perspective computation then allows this three-dimensional object to be viewed in correct perspective from any viewpoint. MPEG-4 provides tools to allow two- or three-dimensional meshes to be created in the decoder and then oriented by vectors. Changing the vectors then allows realistic moving images to be created with an extremely low bit rate.
Face and body animation is a specialized subset of three-dimensional mesh coding in which the mesh represents a human face and/or body. As the subject moves, carefully defined vectors carry changes of expression which allow rendering of an apparently moving face and/or body which has been almost entirely synthesized from a single still picture.
In addition to object coding, MPEG-4 refines the existing MPEG tools by increasing the efficiency of a number of processes using lossless prediction. AVC extends this concept further still. This improves the performance of both the motion compensation and coefficient coding allowing either a lower bit rate or improved quality. MPEG-4 also extends the idea of scaleability introduced in MPEG-2. Multiple scaleability is supported, where a low-bit-rate base-level picture may optionally be enhanced by adding information from one or more additional bitstreams. This approach is useful in network applications where the content creator cannot know the bandwidth which a particular user will have available. Scaleability allows the best quality in the available bandwidth.
Although most of the spatial compression of MPEG-4 is based on the DCT as in earlier MPEG standards, MPEG-4 also introduces wavelet coding of still objects. Wavelets are advantageous in scaleable systems because they naturally decompose the original image into various resolutions.
In contrast to the rest of MPEG-4, AVC is intended for use with entire pictures and as such is more of an extension of MPEG-2. AVC adds refinement to the existing coding tools of MPEG and also introduces some new ones. The emphasis is on lossless coding to obtain similar performance to MPEG-2 at around half the bit rate.
5.20 Video objects
Figure 5.45 shows an example of a video object intersecting video object planes, or VOPs. At each plane, the shape and the texture of the object must be portrayed. Figure 5.46 shows this can be done using appropriate combinations of intra- and inter-coding as described for the earlier standards. This gives rise to I-VOPs, P-VOPs and B-VOPs. A group of VOPs is known as a GOV.
Figure 5.47 shows how video objects are handled at the decoder which effectively contains a multi-layer compositing stage. The shape of each object is transmitted using alpha data which is decoded to produce a key signal. This has the effect of keying the texture data into the overall picture. Several objects can be keyed into the same picture. The background may be yet another video object or a still texture object which can be shifted by vectors in the case of pans etc.
Figure 5.48 shows that MPEG-4 codes each arbitrarily shaped video object by creating a bounding rectangle within which the object resides. The shape and the texture of each video object are re-created at the decoder by two distinct but interrelated processes. The bounding rectangle exists on macroblock boundaries and can change from VOP to VOP. At each VOP, the bitstream contains horizontal and vertical references specifying the position of the top-left corner of the bounding rectangle and its width and height.
Within the bounding rectangle are three kinds of macroblock. Transparent macroblocks are entirely outside the object and contain no texture data. The shape data are identical over the entire block and result in a key signal which deselects this block in the compositing process. Such shape data are trivial to compress. Opaque macroblocks are entirely within the object and are full of texture data. The shape data are again identical over the entire block and result in the block being keyed at the compositor. Boundary macroblocks are blocks through which the edge of the object passes. They contain essentially all the shape information and somewhat fewer texture data than an opaque macroblock.
5.21 Texture coding
In MPEG-1 and MPEG-2 the only way of representing an image is with pixels, so that representation needed no special name. MPEG-4 has several types of image-description tool and it becomes necessary to give the pixel representation of the earlier standards a name. This is texture coding, the part of MPEG-4 which operates on pixel-based areas of image.
Coming later than MPEG-1 and MPEG-2, the MPEG-4 and AVC texture-coding systems can afford additional complexity in the search for higher performance. Figure 5.49 contrasts MPEG-2, MPEG-4 and AVC. Figure 5.49(a) shows the texture decoding system of MPEG-2 whereas (b) shows MPEG-4 and (c) shows AVC. The latter two are refinements of the earlier technique. These refinements are lossless in that the reduction in bit rate they allow does not result in a loss of quality.
When inter-coding, there is always a compromise needed over the quantity of vector data. Clearly if the area steered by each vector is smaller, the motion compensation is more accurate, but the reduction in residual data is offset by the increase in vector data. In MPEG-1 and MPEG-2 only a small amount of vector compression is used. In contrast, MPEG-4 and AVC use advanced forms of lossless vector compression which can, without any bit rate penalty, increase the vector density to one vector per DCT block in MPEG-4 and to one vector per 4 × 4 pixel block in AVC. AVC also allows quarter-pixel accurate vectors. In inter-coded pictures the prediction of the picture is improved so that the residual to be coded is smaller.
When intra-coding, MPEG-4 looks for further redundancy between coefficients using prediction. When a given DCT block is to be intra-coded, certain of its coefficients will be predicted from adjacent blocks.
The choice of the most appropriate block is made by measuring the picture gradient, defined as the rate of change of the DC coefficient. Figure 5.50(a) shows that the three adjacent blocks, A, B and C, are analysed to decide whether to predict from the DCT block above (vertical prediction) or to the left (horizontal prediction). Figure 5.50(b) shows that in vertical prediction the top row of coefficients is predicted from the block above so that only the differences between them need to be coded. Figure 5.50(c) shows that in horizontal prediction the left column of coefficients is predicted from the block on the left so that again only the differences need be coded.
Choosing the blocks above and to the left is important because these blocks will already be available in both the encoder and decoder. By making the same picture gradient measurement, the decoder can establish whether vertical or horizontal prediction has been used and so no flag is needed in the bitstream.
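A sketch of the gradient decision is shown below. The exact comparison is defined in the MPEG-4 standard; this simply captures the idea that prediction follows the axis of least change and that only data already available to both encoder and decoder are used.

    # Sketch of the picture-gradient decision for intra coefficient prediction.

    def prediction_direction(dc_left, dc_above_left, dc_above):
        """Return 'vertical' to predict from the block above, 'horizontal' for the left block."""
        vertical_change = abs(dc_left - dc_above_left)     # change going up the left side
        horizontal_change = abs(dc_above_left - dc_above)  # change going along the top
        return 'vertical' if vertical_change < horizontal_change else 'horizontal'

    # Both encoder and decoder evaluate this from decoded data, so no flag is transmitted.
    print(prediction_direction(dc_left=100, dc_above_left=102, dc_above=160))   # 'vertical'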
Some extra steps are needed to handle the top row and the left column of a picture or object where true prediction is impossible. In these cases both encoder and decoder assume standardized constant values for the missing prediction coefficients.
The picture gradient measurement determines the direction in which there is the least change from block to block. There will generally be fewer DCT coefficients present in this direction. There will be more coefficients in the other axis where there is more change. Consequently it is advantageous to alter the scanning sequence so that the coefficients which are likely to exist are transmitted earlier in the sequence.
Figure 5.51 shows the two alternate scans for MPEG-4. The alternate horizontal scan concentrates on horizontal coefficients early in the scan and will be used in conjunction with vertical prediction. Conversely the alternate vertical scan concentrates on vertical coefficients early in the scan and will be used in conjunction with horizontal prediction. The decoder can establish which scan has been used in the encoder from the picture gradient.
Coefficient prediction is not employed when inter-coding because the statistics of residual images are different. Instead of attempting to predict residual coefficients, in inter-coded texture, pixel-based prediction may be used to reduce the magnitude of texture residuals. This technique is known as overlapped block motion compensation (OBMC) which is only used in P-VOPs. With only one vector per DCT block, clearly in many cases the vector cannot apply to every pixel in the block. If the vector is considered to describe the motion of the centre of the block, the vector accuracy falls towards the edge of the block. A pixel in the corner of a block is almost equidistant from a vector in the centre of an adjacent block.
OBMC uses vectors from adjacent blocks, known as remote vectors, in addition to the vector of the current block for prediction. Figure 5.52 shows that the motion-compensation process of MPEG-1 and MPEG-2 which uses a single vector is modified by the addition of the pixel prediction system which considers three vectors. A given pixel in the block to be coded is predicted from the weighted sum of three motion-compensated pixels taken from the previous I- or P-VOP. One of these pixels is obtained in the normal way by accessing the previous VOP with a shift given by the vector of this block. The other two are obtained by accessing the same VOP pixels using the remote vectors of two adjacent blocks.
The remote vectors which are used and the weighting factors are both a function of the pixel position in the block. Figure 5.52 shows that the block to be coded is divided into quadrants. The remote vectors are selected from the blocks closest to the quadrant in which the pixel resides. For example, a pixel in the bottom-right quadrant would be predicted using remote vectors from the DCT block immediately below and the block immediately to the right.
Not all blocks can be coded in this way. In P-VOPs it is permissible to have blocks which are not coded or intra-blocks which contain no vector. Remote vectors will not all be available at the boundaries of a VOP. In the normal sequence of macroblock transmission, vectors from macroblocks below the current block are not yet available. Some additional steps are needed to handle these conditions. Adjacent to boundaries where a remote vector is not available it is replaced by a copy of the actual vector. This is also done where an adjacent block is intra-coded and for blocks at the bottom of a macroblock where the vectors for the macroblocks below will not be available yet. In the case of non-coded blocks the remote vector is set to zero.
Figure 5.53(a) shows that the weighting factors for pixels near the centre of a block favour that block. In the case of pixels at the corner of the block, the weighting is even between the value obtained from the true vector and the sum of the two pixel values obtained from remote vectors.
The weighted sum produces a predicted pixel which is subtracted from the actual pixel in the current VOP to be coded to produce a residual pixel. Blocks of residual pixels are DCT coded as usual. OBMC reduces the magnitude of residual pixels and gives a corresponding reduction in the number or magnitude of DCT coefficients to be coded.
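The weighted-sum prediction may be sketched as follows for a single pixel. The weights and the remote-vector selection rules are tabulated in the standard; the weights used here are illustrative only.

    # Sketch of the OBMC prediction of one pixel from three motion-compensated values.

    def obmc_pixel(previous, x, y, own_vector, remote_v1, remote_v2,
                   weights=(4, 2, 2)):        # example weights summing to 8
        """Predict pixel (x, y) of the current VOP from the previous VOP using three vectors."""
        def fetch(vec):
            dx, dy = vec
            return previous[y + dy][x + dx]
        w_own, w1, w2 = weights
        total = (w_own * fetch(own_vector) +
                 w1 * fetch(remote_v1) +
                 w2 * fetch(remote_v2))
        return (total + 4) // 8               # rounded division by the weight sum

    previous = [[i * 10 + j for j in range(8)] for i in range(8)]
    print(obmc_pixel(previous, 3, 3, (1, 0), (0, 0), (1, 1)))   # weighted prediction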
OBMC is lossless because the decoder already has access to all the vectors and knows the weighting tables. Consequently the only overhead is the transmission of a flag which enables or disables the mechanism.
MPEG-4 also has the ability to downsample prediction error or residual macroblocks which contain little detail. A 16 × 16 macroblock is downsampled to 8 × 8 and flagged. The decoder will identify the flag and interpolate back to 16 × 16.
In vector prediction, each macroblock may have either one or four vectors as the coder decides. Consequently the prediction of a current vector may have to be made from either macroblock or DCT block vectors. In the case of predicting one vector for an entire macroblock, or the top-left DCT block vector, the process shown in Figure 5.52(b) is used. Three earlier vectors, which may be macroblock or DCT block vectors as available, are used as the input to the prediction process. In the diagram the large squares show the macroblock vectors to be selected and the small squares show the DCT block vectors to be selected. The three vectors are passed to a median filter which outputs the vector in the centre of the range unchanged.
A median filter is used because the same process can be performed in the decoder with no additional data transmission. The median vector is used as a prediction, and comparison with the actual vector enables a residual to be computed and coded for transmission. At the decoder the same prediction can be made and the received residual is added to recreate the original vector.
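A sketch of the median prediction is shown below; here the median is taken component-wise, and only the residual would be transmitted.

    # Sketch of median vector prediction and residual coding.

    def median(a, b, c):
        return sorted((a, b, c))[1]

    def predict_vector(candidates):
        """candidates: three (vx, vy) vectors from previously coded blocks."""
        xs, ys = zip(*candidates)
        return (median(*xs), median(*ys))

    def vector_residual(actual, candidates):
        px, py = predict_vector(candidates)
        return (actual[0] - px, actual[1] - py)      # this residual is what gets transmitted

    candidates = [(4, -2), (5, -1), (12, 0)]
    print(predict_vector(candidates))                # (5, -1)
    print(vector_residual((6, -1), candidates))      # (1, 0)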
The remaining parts of Figure 5.52(b) show how the remaining three DCT block vectors are predicted from adjacent DCT block vectors. If the relevant block is only macroblock coded, that vector will be substituted.
5.22 Shape coding
Shape coding is the process of compressing alpha or keying data. Most objects are opaque and so a binary alpha signal is adequate for the base-level shape system. For each texture pixel, an alpha bit exists forming a binary alpha map which is effectively a two-dimensional mask through which the object can be seen. At the decoder, binary alpha data are converted to levels of 0 or 255 in an eight-bit keying system.
Optionally the object can be faded in the compositing process by sending a constant alpha value at each VOP. As a further option, variable transparency can be supported by sending alpha texture data. This is coded using the usual MPEG spatial coding tools such as DCT and scanning etc.
Binary data such as alpha data do not respond well to DCT coding, and another compression technique has been developed for binary alpha blocks (babs). This is known as context-based coding and it effectively codes the location of the boundary between alpha-one bits and alpha-zero bits. Clearly, once the boundary is located, the values of all remaining alpha bits are obvious.
The babs are raster scanned into a serial bitstream. Context coding works by attempting to predict the state of the current bit from a set of bits which have already been decoded. Figure 5.54(a) shows the set of bits, known as a context, used in an intra-coded VOP. There are ten bits in the context and so there can be 1024 different contexts. Extensive analysis of real shapes shows that for each context there is a certain probability that the current bit will be zero. This probability exists as a standardized look-up table in the encoder and decoder.
Figure 5.54(b) shows that the encoder compares the probability with a parameter called the arithmetic code value to predict the value of the alpha bit. If the prediction is correct, the arithmetic code value needs no change and nothing needs to be transmitted. However, if the predicted bit is incorrect, a change to the arithmetic code value must be transmitted, along with the position of the bit in the scan to which it applies. This change is calculated to be just big enough to make the prediction correct. In this way only prediction errors, which represent the boundary location where alpha changes, are transmitted and the remaining bits self-predict.
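Context formation can be sketched as follows. The ten template positions and the probability table are standardized; the template offsets used here are placeholders for illustration only, and the resulting index would address the standardized probability table.

    # Sketch of building the ten-bit context from already-decoded alpha bits.

    TEMPLATE = [(-1, -2), (0, -2), (1, -2),          # (dx, dy) of already-decoded neighbours
                (-2, -1), (-1, -1), (0, -1), (1, -1), (2, -1),
                (-2, 0), (-1, 0)]

    def context_index(alpha, x, y):
        """Build a 10-bit context number from neighbouring alpha bits (0 outside the map)."""
        index = 0
        for dx, dy in TEMPLATE:
            bit = 0
            if 0 <= y + dy < len(alpha) and 0 <= x + dx < len(alpha[0]):
                bit = alpha[y + dy][x + dx]
            index = (index << 1) | bit
        return index                                  # 0 .. 1023, indexes the probability table

    alpha = [[0, 0, 1, 1],
             [0, 1, 1, 1],
             [1, 1, 1, 1]]
    print(context_index(alpha, 2, 2))                 # a value in 0..1023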
Figure 5.55 shows that the spatial extent of the context requires alpha bits from adjacent babs to code the present bab. This results in the concept of the bordered bab, which incorporates a two-pixel-deep border of alpha bits above and to each side. In the case of babs adjacent to the top or sides of the bounding rectangle, some of the border bits lie outside that rectangle and their value is then always zero.
Shape data can also be inter-coded. This is handled using a context which spreads across two VOPs. As Figure 5.56 shows, four context bits are in the current VOP, whereas five context bits are in a previous VOP. This nine-bit context requires a different probability table, but otherwise coding proceeds as before. If the shape changes from one VOP to another, motion compensation can be used. A shape vector is transmitted. It will be seen from Figure 5.56(b) that this vector shifts the context relative to the previous VOP, thus altering (hopefully improving) the chances of correctly predicting the current bit.
Within a boundary block, shape vectors are highly redundant with the motion vectors of texture and texture vectors may be used to predict shape vectors.
It might be thought that binary shape data would result in objects having ratcheted edges because the key signal can only exist in pixel steps. However, texture coding can overcome this problem. If a binary keyed image is used as a predictor, the texture prediction error will deliver the necessary accuracy in the final image.
5.23 Padding
The peculiar conditions in boundary macroblocks, where some of the block is inside the object and some outside, require special handling and this is the function of padding. There are two types of padding which perform distinct functions: texture padding and motion compensation (MC) padding.
Texture padding makes the spatial coding of boundary blocks more efficient. As the frequency coefficients of the DCT relate to the entire block area, detail outside the object can result in coefficients which are not due to the object itself. Texture padding consists of replacing those pixels in a boundary block which are outside the object with pixel values that are selected to minimize the amount of coefficient data. The exact values are unimportant as they will be discarded in the decoder.
When objects move they may also change their shape. Motion compensation will be used whereby texture and shape from another VOP is used to predict texture and shape in the current VOP. If there is a small error in the boundary data, background pixels could leak into the object causing a large prediction error. MC padding is designed to prevent this. In the case of VOPs used as anchors for inter-coding, in boundary blocks, after decoding, texture padding is replaced with MC padding. In MC padding, pixel values at the very edge of the object are copied to pixels outside the object so that small shape errors can be corrected with minor residual data.
5.24 Video object coding
Figure 5.57 shows the video object coding process. At the encoder, image analysis dissects the images into objects, in this case one small object (1) and a background object (2). The picture and shape data for object 1 are used to create a bounding rectangle whose co-ordinates are coded. Pixel values outside the boundary are replaced with texture padding. The texture macroblocks are then DCT coded and the shape babs are context coded. This can be on an intra-coded or motion-compensated inter-coded basis according to the type of VOP being coded. Background data are coded by another process and multiplexed into the bitstream as another object.
At the decoder the texture and shape data are decoded, and in anchor VOPs the shape data are used to strip off the texture padding and replace it with MC padding. The MC-padded object data are then used as the predictors for inter-coded objects. Finally the shape data key the object into the background.
Figure 5.58 shows an object coder. The object analysis of the incoming image provides texture and texture motion as well as object shape and shape motion. Texture coding takes place in much the same way as MPEG picture coding. Intra-VOPs are spatially coded, whereas P- and B-VOPs use motion-compensated prediction and a spatially coded residual is transmitted. The vectors in MPEG-4 are subject to a prediction process not found in MPEG-2. Note the use of texture padding before the coder. As the decoder will use MC padding in the course of prediction, the encoder must also use it to prevent drift.
The shape data are context coded in I-VOPs and the coding will be motion compensated in P- and B-VOPs. However, the shape vectors can be predicted from other shape vectors or from the texture motion vectors where these exist. These shape vectors are coded as prediction residuals. Figure 5.59 shows how the shape vector prediction works. Around the bab to be coded are three designated shape vectors and three designated texture vectors. The six designated vectors are scanned in the order shown and the first one which is defined will be used as the predictor. Consequently if no shape vectors are defined, the texture vectors will be used automatically since these are later in the scan.
Shape babs can exist in a variety of forms as Figure 5.48 showed. Opaque and transparent blocks are easily and efficiently coded. Intra babs have no shape vector but need a lot of data. Inter babs may or may not contain a shape vector according to object motion and only a residual is sent. On the other hand, under certain object conditions, sufficient shape accuracy may result by sending only the shape vector.
5.25 Two-dimensional mesh coding
Mesh coding was developed in computer graphics to aid the rendering of synthetic images with perspective. As Figure 5.60 shows, as a three-dimensional body turns, its two-dimensional appearance changes. If the body has a flat side square-on to the optical axis whose texture is described by uniformly spaced pixels, after a rotation the pixels will no longer be uniformly spaced. Nearly every pixel will have moved, and clearly an enormous amount of data would be needed to describe the motion of each one. Fortunately this is not necessary: if the geometry of the object is correctly sampled before and after the motion, then the shift of every pixel can be computed.
The geometric sampling results in a structure known as a mesh. Effectively a mesh is a sampled surface. The smoother the surface, the fewer samples are necessary because the remainder of the surface can be interpolated. In MPEG-4, meshes can be two or three dimensional. Figure 5.61 shows a two-dimensional mesh which is a set of points known as nodes which must remain in the video object plane. Figure 5.62 shows a three-dimensional mesh which is a set of nodes describing a non-flat surface.
Like other MPEG-4 parameters, meshes can be scaleable. A base mesh has relatively few nodes; adding enhancement data will increase the number of nodes so that the shape is better described.
Figure 5.63 shows an example of how two-dimensional meshes are used to predict an object from a previous VOP. The first object may be sent as an I-VOP, including texture and alpha outline data, both of which will be intra-coded. The texture pixels in the I-VOP are uniformly spaced and so a uniform mesh can be sent as a further part of the same I-VOP. Uniform meshes require very few data. Figure 5.64(a) shows that little more than the co-ordinates of the top-left corner of the mesh and the mesh pitch need be sent. There are also some choices of how the mesh boxes are divided into triangles.
As an alternative to a regular mesh, an irregular or Delaunay mesh may be created by specifying the location of each vertex. This will require more data initially, but may prove to be more efficient in the case of an irregular object which exists for a long time.
Figure 5.64(b) shows how the shape of the mesh can be changed by transmitting mesh vectors in P-VOPs. These specify the new position of each vertex with respect to the original position. The mesh motion vectors are predictively coded as shown in Figure 5.65. In two adjacent triangles there are two common nodes. If the first triangle has been fully coded, only one node of the second triangle remains to be coded. The vector for the remaining node is predicted by taking the average of the two common nodes and rounding to half-pixel accuracy. This prediction is subtracted from the actual vector to produce a residual which is then variable-length coded for transmission. As each node is coded, another triangle is completed and a new pair of common nodes become available for further prediction.
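The node-vector prediction can be sketched as follows; the rounding behaviour is simplified and the numbers are illustrative only.

    # Sketch of mesh motion-vector prediction from the two nodes shared with an
    # already-coded triangle; only the residual would be transmitted.

    def round_half_pel(v):
        """Round a value to the nearest multiple of 0.5."""
        return round(v * 2) / 2

    def predict_node_vector(shared_a, shared_b):
        """shared_a, shared_b: (vx, vy) vectors of the two common nodes."""
        return (round_half_pel((shared_a[0] + shared_b[0]) / 2),
                round_half_pel((shared_a[1] + shared_b[1]) / 2))

    actual = (3.0, -1.5)
    prediction = predict_node_vector((2.5, -1.0), (3.0, -2.5))
    residual = (actual[0] - prediction[0], actual[1] - prediction[1])
    print(prediction, residual)          # prediction (3.0, -2.0), residual (0.0, 0.5)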
Figure 5.66 shows an object coding system using meshes. At (a), following the transmission of the I-VOP containing shape, outline and a regular mesh, the I-VOP and the next VOP, which will be a P-VOP, are supplied to a motion estimator. The motion estimator will determine the motion of the object from one VOP to the next. As the object moves and turns, its shape and perspective will change. The new shape may be encoded using differential alpha data which the decoder uses to change the previous shape into the current shape. The motion may also result in a perspective change. The motion estimator may recognize features in the object which are at a new location and this information is used to create mesh vectors.
At the encoder, mesh vectors are used to warp the pixels in the I-VOP to predict the new perspective. The differential shape codes are used to predict the new shape of the object. The warped, reshaped object is then compared with the actual object, and the result is a texture residual which will be DCT coded.
At the decoder, (b), the I-VOP is in memory and the mesh motion, the shape difference and the texture residual arrive as a P-VOP. The mesh motion is decoded and used to warp the object pixels taken from the I-VOP memory. The shape difference is decoded to produce a new shape around the warped pixels, resulting in a predicted object. The texture residual is decoded and added to the predicted object to produce the final decoded object which is then composited into the output picture along with other objects.
It should be clear that by transmitting warping vectors the pixels from the I-VOP can be used to create a very accurate prediction of the texture of the P-VOP. The result is that the texture residual can be encoded with a very small number of bits.
5.26 Sprites
Sprites are images or parts of images which do not change with time. Such information only needs to be sent once and is obviously attractive for compression. However, it is important to appreciate that it is only the transmitted sprite which doesn’t change. The decoder may manipulate the sprite in a different way for each picture, giving the illusion of change.
Figure 5.67 shows the example of a large sprite used as a background. The pixel array size exceeds the size of the target picture. This can be sent as an I-VOP (intra-coded video object plane). Once the texture of the sprite is available at the decoder, vectors can be sent to move or warp the picture. These instructions are sent as S-VOPs (static video object planes). Using S-VOP vectors, a different part of the sprite can be delivered to the display at each picture, giving the illusion of a moving background. Other video objects can then be keyed over the background.
Sending a large background sprite in one I-VOP may cause difficulty at high compression factors because the one-off image distorts the bit rate. The solution here is low-latency sprite coding. Although the sprite is a single still texture object, it need not all be transmitted at once. Instead the sprite is transmitted in pieces. The first of these is an object piece, whereas subsequently update pieces may be sent.
Update pieces may coincide spatially with the object piece and result in a refinement of image quality, or may be appended spatially to increase the size of a sprite. Figure 5.68 shows the example of an object piece which is transmitted once but used in a series of pictures. In Figure 5.68(a) an update piece is transmitted which improves the quality of the sprite so that a zoom-in can be performed.
Figure 5.68(b) shows a large sprite sent to create a moving background. Just before the edge of the sprite becomes visible, the sprite is extended by sending an update piece. Update pieces can be appended to pixel accuracy because they are coded with offset parameters which show where the update fits with respect to the object piece.
Figure 5.69 shows a sprite decoder. The basic sprite is decoded from an I-VOP and placed in the sprite buffer. It does not become visible until an S-VOP arrives carrying shifting and warping instructions. Low-latency decoding sends sprite pieces as S-VOPs which are assembled in the buffer as each sprite piece arrives.
5.27 Wavelet-based compression
MPEG-4 introduces the use of wavelets in still picture object coding. Wavelets were introduced in Chapter 3 where it was shown that the transform itself does not achieve any compression. Instead the picture information is converted to a form in which redundancy is easy to find. Figure 5.70 shows three of the stages in a wavelet transform. Each of these represents more detail in the original image, assuming, of course, that there was any detail there to represent. In real images, the highest spatial frequencies may not be present except in localized areas such as edges or in detailed objects. Consequently many of the values in the highest resolution difference data will be zero or small. An appropriate coding scheme can achieve compression on data of this kind. As in DCT coding, wavelet coefficients can be quantized to shorten their word-length, advantageous scanning sequences can be employed and arithmetic coding based on coefficient statistics can be used.
MPEG-4 provides standards for scaleable coding. Using scaleable coding means that images can be reproduced both at various resolutions and with various noise levels. In one application a low-cost simple decoder will decode only the low-resolution noisy content of the bitstream whereas a more complex decoder may use more or all of the data to enhance resolution and to lower the noise floor. In another application where the bit rate is severely limited, the early part of a transmission may be decoded to produce a low-resolution image whose resolution increases as further detail arrives.
The wavelet transform is naturally a multi-resolution process and it is easy to see how a resolution scaleable system could be built from Figure 5.70 by encoding the successive difference images in separate bitstreams. As with DCT-based image compression, the actual bit rate reduction in wavelet compression is achieved by a combination of lossy and lossless coding. The lossless coding consists of scanning the parameters in a manner which reflects their statistics, followed by variable-length coding.
The lossy coding consists of quantizing the parameters to reduce their wordlength. Coarse quantization may be employed on a base-level bitstream, whereas an enhancement bitstream effectively transmits the quantizing error of that coarse quantization.
The wavelet transform produces a decomposition of the image which is ideal for compression. Each sub-band represents a different band of spatial frequencies, but in the form of a spatial array. In real images, the outline of every object consists of an edge which will contain a wide range of frequencies. There will be energy in all sub-bands in the case of a sharp edge. As each sub-band is a spatial array, edge information will be found in the same place in each sub-band. For example, an edge placed one quarter of the way across the picture will result in energy one quarter of the way across each sub-band.
This characteristic can be exploited for compression if the sequence in which the coefficients are coded starts at a given coefficient in the lowest sub-band and passes through each sub-band in turn. Figure 5.71 shows that for each coefficient in a given sub-band there will be four children in the next, sixteen descendants in the next and so on, in a tree structure. In the case of a sharp edge, all the relevant coefficients will generally lie on a single route through the tree. The corollary of this characteristic is that in plain areas containing only low frequencies, if a given coefficient has value zero, then all of the coefficients in higher sub-bands will frequently also have value zero, producing a zerotree.3 Zerotrees are easy to code because they simply need a bit pattern which is interpreted as meaning there are no further non-zero coefficients up this tree.
Figure 5.72 shows how zerotree coding works. Zero-frequency coefficients are coded separately using predictive coding, so zerotree coding starts with coefficients in the lowest AC sub-band. Each coefficient represents the root of a tree and leads to four more nodes in the next sub-band and so on. At each node, one of four codes may be applied. ZTR means that this node is at the bottom of a zerotree and that all further coefficients up the tree are zero and need not be coded. VZTR means that this coefficient is non-zero but all its descendants are zero; in this case only the valued coefficient at the foot of the tree is sent. IZ means that this coefficient is zero but it has non-zero descendants and the scan must continue. Finally VAL means a non-zero coefficient with non-zero descendants; this coefficient and subsequent ones must be coded.
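A minimal sketch of this classification step is shown below, assuming the wavelet coefficients have already been gathered into parent–child trees; the function and data layout are invented for illustration and are not MPEG-4 syntax.

```python
def classify(node, descendants):
    """Assign one of the four zerotree symbols to a coefficient node.

    node        : the coefficient value at this tree position
    descendants : all coefficient values in the sub-tree above it
                  (higher sub-bands at the same spatial location)
    """
    all_zero = all(c == 0 for c in descendants)
    if node == 0 and all_zero:
        return 'ZTR'   # zerotree root: nothing further up the tree is coded
    if node != 0 and all_zero:
        return 'VZTR'  # valued zerotree root: send this value only
    if node == 0:
        return 'IZ'    # isolated zero: keep scanning upwards
    return 'VAL'       # non-zero with non-zero descendants: code and continue

print(classify(0, [0, 0, 0, 0]))   # ZTR
print(classify(5, [0, 0, 0, 0]))   # VZTR
print(classify(0, [0, 3, 0, 0]))   # IZ
```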
Zerotree coding is efficient because in real images the energy is concentrated at low frequency and most trees do not need to be coded as far as the leaf node.
It is important to appreciate that the scanning order in which zerotree coding takes place must go from the lowest AC sub-band upwards. However, when the coefficients have been designated by the zerotree coder into the four categories above, they can be transmitted in any order which is convenient. MPEG-4 supports either tree depth scanning or band-by-band scanning. Figure 5.73(a) shows that in tree-depth scanning the coefficients are transmitted one tree at a time from the lowest spatial frequency to the highest. The coding is efficient but the picture will not be available until all the data have been received. Figure 5.73(b) shows that in band-by-band scanning the coefficients are transmitted in order of ascending spatial frequency. At the decoder a soft picture becomes available first, and the resolution gradually increases as more sub-bands are decoded. This coding scheme is less efficient but the decoding latency is reduced which may be important in image database browsing applications.
Figure 5.74(a) shows a simple wavelet coder operating in single-quant mode. After the DWT, the DC or root coefficients are sent to a predictive coder where the values of a given coefficient are predicted from adjacent coefficients. The AC coefficients are zerotree coded. Coefficients are then quantized and arithmetic coded. The decoder is shown in (b). After arithmetic decoding, the DC coefficients are obtained by inverse prediction and the AC coefficients are re-created by zerotree decoding. After inverse quantizing the coefficients control an inverse DWT to re-create the texture.
MPEG-4 supports scaleable wavelet coding in multi-quant and bilevel-quant modes. Multi-quant mode allows resolution and noise scaleability. In the first layer, a low-resolution and/or noisy texture is encoded. The decoder will buffer this texture and optionally may display it. Higher layers may add resolution by delivering higher sub-band coefficients or reduce noise by delivering the low-order bits removed by the quantizing of the first layer. These refinements are added to the contents of the buffer.
In bilevel-quant mode the coefficients are not sent as binary numbers, but are sent bitplane-by-bitplane. In other words the MSBs of all coefficients are sent first followed by all the second bits and so on. The decoder would reveal a very noisy picture at first in which the noise floor falls as each enhancement layer arrives. With enough enhancement layers the compression could be lossless.
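The principle can be sketched as below, ignoring coefficient signs and any entropy coding of the planes; the function names are invented for the illustration.

```python
def bitplanes(coeffs, wordlength=8):
    """Split coefficient magnitudes into bitplanes, MSB first."""
    return [[(c >> bit) & 1 for c in coeffs]
            for bit in reversed(range(wordlength))]

def reconstruct(planes, wordlength=8):
    """Rebuild coefficients from however many planes have arrived."""
    coeffs = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        bit = wordlength - 1 - i
        coeffs = [c | (b << bit) for c, b in zip(coeffs, plane)]
    return coeffs

coeffs = [200, 13, 0, 97]
planes = bitplanes(coeffs)
print(reconstruct(planes[:3]))  # coarse, noisy values after three planes
print(reconstruct(planes))      # lossless once all planes have arrived
```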
5.28 Three-dimensional mesh coding
In computer-generated images used, for example, in simulators and virtual reality, the goal is to synthesize the image which would have been seen by a camera or a single eye at a given location with respect to the virtual objects. Figure 5.75 shows that this uses a process known as ray tracing. From a fixed point in the virtual camera, a ray is projected outwards through every pixel in the image to be created. These rays will strike either an object or the background. Rays which strike an object must result in pixels which represent that object. Objects which are three dimensional will reflect the ray according to their geometry and reflectivity, and the reflected ray has to be followed in case it falls on a source of light which must appear as a reflection in the surface of the object. The reflection may be sharp or diffuse according to the surface texture of the object. The colour of the reflection will be a function of the spectral reflectivity of the object and the spectrum of the incident light.
If the geometry, surface texture and colour of all objects and light sources are known, the image can be computed in a process called rendering. If any of the objects are moving, or if the viewpoint of the virtual camera changes, then a new image will need to be rendered for each video frame. If such synthetic video were to be compressed by, say, MPEG-2, the resultant bit rate would reflect the large number of pixels changing from frame to frame. However, the motion of virtual objects can be described by very few bits. A solid object moves in three dimensions and can rotate in three axes and this is trivial to transmit. A flexible object such as a jellyfish requires somewhat more information to describe how its geometry changes with time, but still much less than describing the changes in the rendered video.
Figure 5.76 shows that in MPEG-4, the rendering process is transferred to the decoder. The bitstream describes objects to the decoder which then renders the virtual image. The bitstream can then describe the motion of the objects with a very low bit rate, and the decoder will render a new image at each frame. In interactive applications such as simulators, CAD and video games, the user may change the viewpoint of the virtual camera and/or the position of virtual objects, resulting in a unique rendering. This does not require any communication between the user and the encoder because the user is only interacting with the rendering process.
Transmitting raw geometry requires a large amount of data. Space is quantized in three axes and for every location in, say, the x-axis, the y and z co-ordinates of the surface must be transmitted. Three-dimensional mesh coding is an efficient way of describing the geometry and properties of bodies in a compressed bitstream. Audio and video compression relies upon the characteristics of sounds, images and the human senses, whereas three-dimensional mesh coding relies on the characteristics of topology.4
Figure 5.77(a) shows a general body in which the curvature varies. Sampling this body at a constant spacing is obviously inefficient because if the samples are close enough together to describe the sharply curved areas, the gently curved areas will effectively be heavily oversampled. Three-dimensional mesh coding uses irregular sampling locations called vertices which can be seen in Figure 5.77(b). A vertex is a point in the surface of an object which has a unique position in x, y and z. Shape coding consists of efficiently delivering the co-ordinates of a number of vertices to the decoder. Figure 5.77(c) shows that the continuous shape can then be reconstructed by filtering as all sampled processes require.
Filtering of irregularly spaced samples is more complex than that of conventional reconstruction filters and uses algorithms called splines.5 Failure correctly to filter the shape results in synthetic objects which appear multi-faceted instead of smoothly curved.
Figure 5.78(a) shows that a set of three vertices may be connected by straight lines called edges to form a triangle. In fact this is how the term ‘vertex’ originates. Note that these edges are truly straight and are thus not necessarily on the surface of the object except at the vertices. There are a number of advantages to creating these triangles. Wherever the vertices are, a triangle is always flat, which is not necessarily the case for shapes with four or more sides. A flat surface needs no further description because it can be interpolated from the vertices. Figure 5.78(b) shows that by cutting along certain edges and hinging along others, a triangulated body can be transformed into a plane. (A knowledge of origami is useful here.) Figure 5.78(c) shows that once a triangle has been transmitted to the decoder, it can be used as the basis for further information about the surface geometry. In a process known as a forest split operation, new vertices are located with respect to the original triangle.6
Forest split operations are efficient because the displacements of the new vertices from the original triangle plane are generally small. Additionally a forest split is only needed in triangles in the vicinity of sharp changes. Forest splits can be cascaded allowing the geometry to be reconstructed to arbitrary accuracy. By encoding the forest split data in different layers, a scaleable system can be created where a rough shape results from the base layer which can be refined by subsequent layers.
Figure 5.79 shows an MPEG-4 three-dimensional mesh coding system at high level. The geometry of the object is coded by the vertices which are points, whereas the properties are coded with respect to the triangles between the vertices. The dual relationship between the triangles and the vertices which they share is known as the connectivity. Geometry is compressed using topological principles and properties are compressed by quantizing. These parameters are then subject to entropy (variable-length) coding prior to multiplexing into a bitstream. At the decoder the three information streams are demultiplexed and decoded to drive the rendering and composition process. The connectivity coding is logical and must be lossless whereas the vertex positions and triangle properties may be lossy or lossless according to the available bit rate.
Generic geometry coding is based on meshes using polygons. Each polygon is defined by the position of each vertex and its surface properties. MPEG-4 breaks such polygons down into triangles for coding the vertices and associates the properties with the correct triangles at the decoder using the connectivity data so that the polygons can be re-created. The geometry coding is based on triangles because they have useful properties.
Figure 5.80 shows that triangles are fundamentally binary. Digital transmission is serial, and compression works by finding advantageous sequences in which to send information, for example the scanning of DCT coefficients. When scanning triangles, if a scan enters via one edge, there are then only two sides by which to exit. This decision can be represented by a single bit for each triangle, known as a marching bit.
Figure 5.81(a) shows a representative object, in this case a spheroid. Its surface has been roughly sampled to create twelve vertices, and edges are drawn between the vertices to create triangles. The flat triangles form an icosahedron, and this may later be refined with a forest split operation to approximate the spheroid more closely. For the moment the goal is to transmit the vertices so that the triangles can be located. As adjacent triangles share vertices, there are fewer vertices than triangles. In this example there are twenty triangles but only twelve vertices.
When a vertex is transmitted, the information may be required by more than one triangle. Transferring this information to the next triangle in the scan sequence is straightforward, but where the scan branches it is not, and a structure known as the vertex loop look-up table is needed to connect vertex data. Only one vertex needs to have its absolute co-ordinates transmitted. This is known as the root vertex. The remaining vertices can be coded by prediction from earlier vertices so that only the residual needs to be sent.
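The shape of this process can be sketched as follows. For simplicity the prediction here is just the previously decoded vertex, whereas the real coder forms a better prediction from several already-decoded neighbours; everything in the sketch is illustrative rather than MPEG-4 syntax.

```python
def encode_vertices(vertices):
    """Send the root vertex absolutely, then residuals against a prediction.

    Here the prediction is simply the previously decoded vertex; the real
    coder predicts from several neighbouring vertices instead.
    """
    root = vertices[0]
    residuals = []
    prev = root
    for v in vertices[1:]:
        residuals.append(tuple(a - b for a, b in zip(v, prev)))
        prev = v
    return root, residuals

def decode_vertices(root, residuals):
    out = [root]
    for r in residuals:
        out.append(tuple(a + b for a, b in zip(out[-1], r)))
    return out

verts = [(10, 4, 0), (12, 5, 1), (13, 5, 3), (11, 7, 4)]
root, res = encode_vertices(verts)
assert decode_vertices(root, res) == verts
print(res)   # residuals are small and cheap to entropy code
```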
In general it is impossible for a continuous scan sequence to visit all of the triangles. Figure 5.81(b) shows that if cuts are made along certain edges, the triangles can be hinged at their edges. This is continued until they all lie in the same plane to form a polygon (c). This process has created a binary tree along which the triangles can be scanned, but it has also separated some of the vertices. Figure 5.81(c) apparently has 22 vertices, but ten of these are redundant because they simply repeat the locations of the real vertices. Figure 5.81(d) shows how the vertex loop look-up process can connect a single coded vertex to all its associated triangles.
In graphics terminology, the edges along which the polyhedron was cut form a rooted vertex tree. Each run of the vertex tree terminates at a point where a cut ended, and the extremities of the tree correspond to the concavities in the polygon. In the flattened polygon the cut edges appear twice, forming its boundary, known as the vertex loop. When the polygon is folded back into a solid body the vertex loop collapses to zero area and becomes the vertex tree once more.
A solid body such as a spheroid is described as manifold because its surface consists of a continuous layer of faces with a regular vertex at each corner. However, some bodies which need to be coded are not manifold. Figure 5.82 shows a non-manifold body consisting of a pair of intersecting planes. The two planes share an edge joining shared vertices. The two planes can be coded separately and joined using stitches which indicate common vertices.
Stitches can also be used to transmit meshes in discrete stages known as partitions. Partitions can be connected together by stitches such that the result is indistinguishable from a mesh coded all at once. There are many uses of mesh partitions, but an example is in the coding of a rotating solid body. Only the part of the body which is initially visible needs to be transmitted before decoding can begin. Extra partitions can be transmitted which describe that part of the surface of the body that will come into view as the rotation continues. This approach is the three-dimensional equivalent of transmitting static sprites as pieces; both allow the bit rate to remain more constant.
Figure 5.83 shows that when scanning, it is possible to enter four types of triangle. Having entered by one edge, each of the other two edges will be either an internal edge, leading to another triangle, or a boundary edge. This can be described with a two-bit code: one bit for the left edge and one bit for the right. The scanning process is most efficient if it can create long runs, which are linear sequences of triangles between nodes. Starting at a boundary edge, the run will continue as long as type 01 or 10 triangles are found. If a type 11 triangle is found, there are two runs from this node.
The encoder must remember the existence of nodes in a stack, and pick one run until it ends at a type 00 triangle. It must then read the stack and encode the other run. When the current run has reached a 00 triangle and the stack is empty, all the triangles must have been coded.
There is redundancy in the two-bit code describing the triangles. Except for the end triangles, the triangles in a run will have the value 01 or 10. By transmitting the run length, the decoder knows where the run ends. As only two values are possible in the run, 01 can be coded as 0 and 10 can be coded as 1 and only this needs to be coded for each triangle. This bit is, of course, the marching bit which determines whether the run leaves the triangle by the left edge or the right. Clearly in type 00 or 11 triangles there is no marching bit.
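A small sketch of how one run might be encoded follows, assuming the two-bit triangle types are already known from the tree traversal; it shows only the reduction from two bits per triangle to a run length plus marching bits.

```python
def encode_run(types):
    """Encode one run of triangle types ('01', '10', '11', '00').

    A run is a linear sequence of '01'/'10' triangles ended by a node
    ('11', two further runs follow) or a leaf ('00', run finishes).
    Only the run length and one marching bit per '01'/'10' triangle
    need to be sent; the end triangle carries no marching bit.
    """
    run_length = len(types)
    marching_bits = [0 if t == '01' else 1 for t in types if t in ('01', '10')]
    return run_length, marching_bits

# A run of five triangles ending at a leaf: four marching bits suffice,
# because the decoder knows from the run length where the run ends.
print(encode_run(['01', '01', '10', '01', '00']))   # (5, [0, 0, 1, 0])
```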
Figure 5.84 shows how the triangles are coded. There are seven runs in the example shown, and so seven run lengths are transmitted. Marching bits are only transmitted for type 01 or 10 triangles and the decoder can determine how the sequence of marching bits is to be decoded using the run lengths.
Figure 5.85 shows a three-dimensional mesh codec. This decoder demultiplexes the mesh object bitstream into data types and reverses the arithmetic coding. The first item to be decoded is the vertex graph which creates the vertex loop look-up table. If stitches are being used these will also be decoded so that in conjunction with the vertex graph look-up table a complete list of how triangles are to be joined to create the body will result.
The triangle topology is output by the triangle tree decoder. This decodes the run lengths and the marching bits of the runs which make up the polygon. Finally the triangle data are decoded to produce the vertex co-ordinates. The co-ordinates are linked by the connectivity data to produce the finished mesh (or partition).
The faces of the mesh then need to be filled in with properties so that rendering can take place.
5.29 Animation
In applications such as videophones and teleconferencing the images to be coded frequently comprise the human face in close-up or the entire body. Video games may also contain virtual human actors. Face and body animation (FBA) is a specialized subset of mesh coding which may be used in MPEG-4 for coding images of, or containing, humans at very low bit rate.
In face coding, the human face is constructed in the decoder as a threedimensional mesh onto which the properties of the face to be coded are mapped. Figure 5.86 shows a facial animation decoder. As all faces have the same basic layout, the decoder contains a generic face which can be rendered immediately. Alternatively the generic face can be modified by facial definition parameters (FDP) into a particular face. The data needed to do this can be moderate.
The FDP decoder creates what is known as a neutral face; one which carries no expression. This face is then animated to give it dynamically changing expressions by moving certain vertices and re-rendering. It is not necessary to transmit data for each vertex. Instead face-specific vectors known as face animation parameters (FAPs) are transmitted. As certain combinations of vectors are common in expressions such as a smile, these can be coded as visemes which are used either alone or as predictions for more accurate FAPs. The resulting vectors control algorithms which are used to move the vertices to the required places. The data rate needed to do this is minute; 2–3 kilobits per second is adequate.
In body animation, the three-dimensional mesh created in the decoder is a neutral human body, i.e. one with a standardized posture. All the degrees of freedom of the body can be replicated by transmitting codes corresponding to the motion of each joint.
In both cases if the source material is video from a camera, the encoder will need sophisticated algorithms which recognize the human features and output changing animation parameters as expression and posture change.
5.30 Scaleability
MPEG-4 offers extensive support for scaleability. Given the wide range of objects which MPEG-4 can encode, offering scaleability for virtually all of them must result in considerable complexity. In general, scaleability requires the information transmitted to be decomposed into a base layer along with one or more enhancement layers. The base layer alone can be decoded to obtain a picture of a certain quality. The quality can be improved in some way by adding enhancement information.
Figure 5.87 shows a generic scaleable decoder. There is a decoder for the base layer information and a decoder for the enhancement layer. For efficient compression, the enhancement layer decoder will generally require a significant number of parameters from the base layer to assist with predictively coded data. These are provided by the mid-processor. Once the base and enhancement decoding is complete, the two layers are combined in a post-processor.
Figure 5.88 shows a spatial scaleability system in which the enhancement improves resolution. Spatial scaleability may be applied to rectangular VOPs or to arbitrarily shaped objects using alpha data. In the case of irregular objects the alpha data will also need to be refined. The mid-processor will upsample the base layer images to provide base pixel data on the new, higher-resolution sampling grid. With respect to a sampling grid upsampled by a factor of two, the base layer motion vectors have only half the required accuracy and there are only one quarter as many as are needed. Consequently, to improve prediction in the enhancement layer, the motion compensation must be made more accurate.
The output of the base level decoder is considered to be a spatial prediction by the enhancement process. This may be shifted by vectors from the enhancement bitstream. Figure 5.89 shows that a base level macroblock of 16 × 16 pixels predicted with one vector is upsampled to four macroblocks, each of which can be shifted with its own enhancement vector. Upsampled base layer pixels which have been shifted by enhancement vectors form a new high-resolution prediction. The encoder transmits the spatial residual from this prediction.
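A simplified sketch of the prediction step is given below; it uses crude nearest-neighbour upsampling and ignores edge clipping, so it illustrates the idea rather than any normative process.

```python
import numpy as np

def upsample2(picture):
    """Nearest-neighbour upsampling by two; a real coder uses a proper filter."""
    return np.repeat(np.repeat(picture, 2, axis=0), 2, axis=1)

def spatial_prediction(base_picture, mb_row, mb_col, enh_vectors):
    """Predict one 32x32 enhancement area from the upsampled base layer.

    base_picture   : decoded base layer picture
    mb_row, mb_col : position of the 16x16 base layer macroblock
    enh_vectors    : four (dy, dx) enhancement vectors, one per 16x16
                     quadrant of the upsampled area
    """
    up = upsample2(base_picture)
    top, left = mb_row * 32, mb_col * 32
    pred = np.zeros((32, 32), dtype=up.dtype)
    for i, (dy, dx) in enumerate(enh_vectors):
        r, c = (i // 2) * 16, (i % 2) * 16
        pred[r:r+16, c:c+16] = up[top + r + dy : top + r + dy + 16,
                                  left + c + dx : left + c + dx + 16]
    return pred   # the encoder transmits the residual from this prediction

base = np.arange(36 * 44, dtype=np.int32).reshape(36, 44) % 256
pred = spatial_prediction(base, mb_row=1, mb_col=1,
                          enh_vectors=[(0, 0), (1, -2), (0, 3), (-1, 1)])
print(pred.shape)   # (32, 32)
```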
Temporally scaleable decoding allows new VOPs to be created in between the VOPs of the base layer to increase the frame rate. Figure 5.90 shows that enhancement I-VOPs are intra-coded and so are decoded without reference to any other VOP just like a base I-VOP. P- and B-VOPs in the enhancement layer are motion predicted from other VOPs. These may be in the base layer or the enhancement layer according to prediction reference bits in the enhancement data. Base layer B-VOPs are never used as enhancement references, but enhancement B-VOPs can be references for other enhancement VOPs.
Figure 5.91 shows the example of an enhancement B-VOP placed between two VOPs in the base layer. Enhancement vectors are used to shift base layer pixels to new locations in the enhancement VOP. The enhancement residual is then added to these predictions to create the output VOP.
5.31 Advanced Video Coding (AVC)
AVC (Advanced Video Coding), or H.264, is intended to compress moving images that take the form of eight-bit 4:2:0 coded pixel arrays. Although complex, AVC offers between two and two and a half times the compression factor of MPEG-2 for the same quality.7
As in MPEG-2, these may be progressively scanned frames or fields from an interlaced signal. AVC does not support object-based coding. Incoming pixel arrays are subdivided into 16 × 16 pixel macroblocks as in previous MPEG standards. In those previous standards, macroblocks were transmitted only in a raster scan fashion. Whilst this is fine where the coded data are delivered via a reliable channel, AVC is designed to operate with imperfect channels that are subject to error or packet loss. One mechanism that supports this is known as FMO (Flexible Macroblock Ordering).
When FMO is in use, the picture can be divided into different areas along horizontal or vertical macroblock boundaries. Figure 5.92(a) shows an approach in which macroblocks are chequerboarded. If the shaded macroblocks are sent in a different packet to the unshaded macroblocks, the loss of a packet will result in a degraded picture rather than no picture. Figure 5.92(b) shows another approach in which the important elements of the picture are placed in one area and less important elements in another. The important data may be afforded higher priority in a network. Note that when interlaced input is being coded, it may be necessary to constrain the FMO such that the smallest element becomes a macroblock pair in which one macroblock is vertically above the other.
In FMO these areas are known as slice groups that contain integer numbers of slices. Within slice groups, macroblocks are always sent in raster scan fashion with respect to that slice group. The decoder must be able to establish the position of every received macroblock in the picture. This is the function of the macroblock to slice group map which can be deduced by the decoder from picture header and slice header data.
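A chequerboard macroblock-to-slice-group map of the kind shown in Figure 5.92(a) can be sketched very simply; the function below is illustrative only and does not follow the standard’s map-type syntax.

```python
def checkerboard_map(mb_width, mb_height):
    """Build a macroblock-to-slice-group map for a two-group chequerboard.

    Macroblocks whose row + column is even go to group 0, the rest to
    group 1, so the loss of one group's packets still leaves half the
    macroblocks of every picture area intact for concealment.
    """
    return [[(row + col) % 2 for col in range(mb_width)]
            for row in range(mb_height)]

for row in checkerboard_map(6, 4):
    print(row)
```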
Another advantage of AVC is that the bitstream is designed to be transmitted or recorded in a greater variety of ways, having distinct advantages in certain applications. AVC wraps the output of the Video Coding Layer (VCL) in a Network Abstraction Layer (NAL) that formats the data in an appropriate manner. This chapter covers the Video Coding Layer, whereas the AVC NAL is considered in Chapter 6.
AVC is designed to operate in multi-rate systems where the same program material is available at different bit rates depending on the bandwidth available at individual locations. Frequently the available bandwidth varies, and the situation will arise where a decoder needs to switch to a lower or higher bit rate representation of the same video sequence without a visible disturbance. In MPEG-2 such switching could only be done at an I picture in the incoming bitstream, yet the frequent insertion of such I pictures would reduce the quality of that bitstream. AVC overcomes the problem by introducing two new slice coding types, SI and SP.8 These are switching slices that use intra- or inter-coding. Switching slices are inserted by the encoder that is creating both the high and low bit rate streams to facilitate switching between them. With switching slices, switching can take place at an arbitrary location in the GOP without waiting for an I picture or artificially inserting one. Ordinarily such switching would be impossible because of temporal coding: lacking the history of the bitstream, a decoder switching into the middle of a GOP would not be able to function. The function of the switching slices is now clear. When a switch is to be performed, the decoder adds the decoded switching slice data to the last decoded picture of the old bitstream, and this converts the picture into what it would have been if the decoder had been decoding the new bitstream since the beginning of the GOP. Armed with a correct previous picture, the decoder simply decodes the MC and residual data of the new bitstream and produces correct output pictures.
In previous MPEG standards, prediction was used primarily between pictures. In MPEG-2 I pictures the only prediction was in DC coefficients whereas in MPEG-4 some low frequency coefficients were predicted. In AVC, I pictures are subject to spatial prediction and it is the prediction residual that is transform coded, not pixel data.
In I PCM mode, the prediction and transform stages are both bypassed and actual pixel values enter the remaining stages of the coder. In non-typical images such as noise, PCM may be more efficient. In addition, if the channel bit rate is high enough, a truly lossless coder may be obtained by the use of PCM.
Figure 5.93 shows that the encoder contains a spatial predictor that is switched in for I pictures whereas for P and B pictures the temporal predictor operates. The predictions are subtracted from the input picture and the residual is coded.
Spatial prediction works in two ways. In featureless parts of the picture, the DC component, or average brightness is highly redundant. Edges between areas of differing brightness are also redundant. Figure 5.94(a) shows that in a picture having a strong vertical edge, rows of pixels traversing the edge are highly redundant, whereas (b) shows that in the case of a strong horizontal edge, columns of pixels are redundant. Sloping edges will result in redundancy on diagonals.
According to picture content, spatial prediction can operate on 4 × 4 pixel blocks or 16 × 16 blocks. Figure 5.95(a) shows eight of the nine spatial prediction modes for 4 × 4 blocks. Mode 2, not shown, is the DC prediction that is directionless. Figure 5.95(b) shows that in 4 × 4 prediction, up to 13 pixel values above and to the left of the block will be used. This means that these pixel values are already known by the decoder because of the order in which decoding takes place. Spatial prediction cannot take place between different slices because the error recovery capability of a slice would be compromised if it depended on an earlier one for decoding.
Figure 5.95(c) shows that in vertical prediction (Mode 0), four pixel values above the block are copied downwards so that all four rows of the predicted block are identical. Figure 5.95(d) shows that in horizontal prediction (Mode 1) four pixel values to the left are copied across so that all four columns of the predicted block are identical. Figure 5.95(e) shows how in diagonal prediction (Mode 4) seven pixel values are copied diagonally. Figure 5.95(f) shows that in DC prediction (Mode 2) pixel values above and to the left are averaged and the average value is copied into all 16 predicted pixel locations.
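A sketch of three of the 4 × 4 modes follows, assuming the reconstructed pixels above and to the left of the block are available; boundary availability rules and the remaining directional modes are omitted.

```python
import numpy as np

def predict_4x4(mode, above, left):
    """Form a 4x4 intra prediction from neighbouring reconstructed pixels.

    above : the four reconstructed pixels immediately above the block
    left  : the four reconstructed pixels immediately to its left
    """
    if mode == 0:                       # vertical: copy the row above downwards
        return np.tile(above, (4, 1))
    if mode == 1:                       # horizontal: copy the left column across
        return np.tile(np.array(left).reshape(4, 1), (1, 4))
    if mode == 2:                       # DC: one average value fills the block
        dc = (sum(above) + sum(left) + 4) // 8
        return np.full((4, 4), dc)
    raise ValueError("only modes 0, 1 and 2 are sketched here")

above = [100, 102, 104, 106]
left = [100, 90, 80, 70]
residual = np.array([[100] * 4] * 4) - predict_4x4(0, above, left)
print(residual)   # small values where the vertical prediction succeeds
```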
With 16 × 16 blocks, only four modes are available: vertical, horizontal, DC and plane. The first three of these are identical in principle to Modes 0, 1 and 2 with 4 × 4 blocks. Plane mode is a refinement of DC mode. Instead of setting every predicted pixel in the block to the same value by averaging the reference pixels, the predictor looks for trends in changing horizontal brightness in the top reference row and similar trends in vertical brightness in the left reference column and computes a predicted block whose values lie on a plane which may be tilted in the direction of the trend.
Clearly it is necessary for the encoder to have circuitry or software that identifies edges and their direction (or the lack of them) in order to select the appropriate mode. The standard does not suggest how this should work; only how its outputs should be encoded. In each case the predicted pixel block is subtracted from the actual pixel block to produce a residual.
Spatial prediction is also used on chroma data.
When spatial prediction is used, the statistics of the residual will be different from the statistics of the original pixels. When the prediction succeeds, the lower frequencies in the image are largely taken care of and so only higher frequencies remain in the residual. This suggests the use of a smaller transform than the 8 × 8 transform of previous systems. AVC uses a 4 × 4 transform. It is not, however, a DCT, but a DCT-like transform whose coefficients are integers. This gives the advantages that coding and decoding require only shifting and addition, and that the transform is perfectly reversible even when limited wordlength is used. Figure 5.96 shows the transform matrix of AVC.
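A sketch of the forward core transform is given below, using the widely published 4 × 4 integer matrix; the scaling and quantization that AVC folds together are omitted, and the simple matrix inverse shown is only to demonstrate reversibility, not the standard’s actual inverse transform.

```python
import numpy as np

# The widely published AVC 4x4 integer core transform matrix; the scaling
# that makes it DCT-like is folded into the quantizer and is omitted here.
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_core_transform(residual_4x4):
    """Integer-only 2-D transform of a 4x4 prediction residual block."""
    X = np.asarray(residual_4x4, dtype=np.int64)
    return C @ X @ C.T       # realizable with shifts and additions only

def inverse_for_illustration(Y):
    """Mathematical inverse used here only to show reversibility; the
    standard defines its own integer inverse with different scaling."""
    Ci = np.linalg.inv(C)
    return Ci @ Y @ Ci.T

res = np.arange(16).reshape(4, 4) - 8
Y = forward_core_transform(res)
print(np.round(inverse_for_illustration(Y)).astype(int))   # recovers res
```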
One of the greatest deficiencies of earlier coders was blocking artifacts at transform block boundaries. AVC incorporates a de-blocking filter. In operation, the filter examines sets of pixels across a block boundary. If it finds a step in value, this may or may not indicate a blocking artifact. It could be a genuine transition in the picture. However, the size of pixel value steps can be deduced from the degree of quantizing in use. If the step is bigger than the degree of quantizing would suggest, it is left alone. If the size of the step corresponds to the degree of quantizing, it is filtered or smoothed.
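The decision principle can be sketched for a single pair of pixels as below; the real AVC filter examines several pixels each side of the boundary and has graded filter strengths, so this is an illustration only, with the threshold value assumed rather than derived from the standard’s tables.

```python
def deblock_boundary(p, q, step_threshold):
    """Decide whether to smooth one pair of pixels across a block boundary.

    p, q           : reconstructed pixel values either side of the boundary
    step_threshold : the largest step the current quantizer step size could
                     plausibly have created (derived from QP in AVC)

    A step larger than the threshold is taken to be genuine picture detail
    and is left alone; a smaller step is assumed to be a blocking artifact
    and is softened.
    """
    if abs(p - q) > step_threshold:
        return p, q                          # real edge: do not touch
    avg = (p + q) // 2
    return (p + avg) // 2, (q + avg) // 2    # artifact: pull towards the mean

print(deblock_boundary(120, 124, 8))   # small step: smoothed
print(deblock_boundary(120, 200, 8))   # large step: preserved
```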
The adaptive de-blocking algorithm is deterministic and must be the same in all decoders. This is because the encoder must also contain the same de-blocking filter to prevent drift when temporal coding is used. This is known as in-loop de-blocking. In other words when, for example, a P picture is being predicted from an I picture, the I picture in both encoder and decoder will have been identically de-blocked. Thus any errors due to imperfect de-blocking are cancelled out by the P picture residual data. De-blocking filters are modified when interlace is used because the vertical separation of pixels in a field is twice as great as in a frame.
Figure 5.97 shows an in-loop de-blocking system. The I picture is encoded and transmitted and is decoded and de-blocked identically at both encoder and decoder. At the decoder the de-blocked I picture forms the output as well as the reference with which a future P or I picture can be decoded. Thus when the encoder sends a residual, it will send the difference between the actual input picture and the de-blocked I picture. The decoder adds this residual to its own de-blocked I picture and recovers the actual picture.
5.32 Motion compensation in AVC
AVC has a more complex motion-compensation system than previous standards. Smaller picture areas are coded using vectors that may have quarter pixel accuracy. The interpolation filter for sub-pixel MC is specified so that the same filter is present in all encoders and decoders. The interpolator is then effectively in the loop like the de-blocking filter. Figure 5.98 shows that more than one previous reference picture may be used to decode a motion-compensated picture. A larger number of future pictures are not used as this would increase latency. The ability to select a number of previous pictures is advantageous when a single non-typical picture is found inserted in normal material. An example is the white frame that results from a flashgun firing. MPEG-2 deals with this poorly, whereas AVC could deal with it well simply by decoding the picture after the flash from the picture before the flash. Bidirectional coding is enhanced because the weighting of the contribution from earlier and later pictures can now be coded. Thus a dissolve between two pictures could be coded efficiently by changing the weighting. In previous standards a B picture could not be used as a basis for any further decoding but in AVC this is allowed.
AVC macroblocks may be coded with between one and sixteen vectors. Prediction using 16 × 16 macroblocks fails when the edge of a moving object intersects the macroblock as was shown in Figure 5.20. In such cases it may be better to divide the macroblock up according to the angle and position of the edge. Figure 5.99(a) shows the number of ways a 16 × 16 macroblock may be partitioned in AVC for motion-compensation purposes. There are four high-level partition schemes, one of which is to use four 8 × 8 blocks. When this mode is selected, these 8 × 8 blocks may be further partitioned as shown. This finer subdivision requires additional syntactical data to specify to the decoder what has been done. It will be self-evident that if more vectors have to be transmitted, there must be a greater reduction in the amount of residual data to be transmitted to make it worthwhile. Thus the encoder needs intelligently to decide the partitioning to be used. Figure 5.99(b) shows an example. There also needs to be some efficient vector coding scheme.
In P coding, vectors are predicted from those in macroblocks already sent, provided that slice independence is not compromised. The predicted vector is the median of those on the left, above and above right of the macroblock to be coded. A different prediction is used if 16 × 8 or 8 × 16 partitions are used. Only the prediction error needs to be sent. In fact if nothing is sent, as in the case of a skipped block, the decoder can predict the vector for itself.
In B coding, the vectors are predicted by inference from the previous P vectors. Figure 5.100 shows the principle. In order to create the P picture from the I picture, a vector must be sent for each moving area. As the B picture is at a known temporal location with respect to these anchor pictures, the vector for the corresponding area of the B picture can be predicted by assuming the optic flow axis is straight and performing a simple linear interpolation.
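A sketch of this inference is shown below, using simple proportional scaling of the co-located P vector; the standard’s exact rounding and reference indexing are not reproduced.

```python
def direct_mode_vectors(p_vector, t_b, t_p):
    """Infer forward and backward B vectors from the co-located P vector.

    p_vector : (dx, dy) vector that carried the area from the I picture
               to the P picture
    t_b      : temporal distance from the I picture to the B picture
    t_p      : temporal distance from the I picture to the P picture

    Assuming the optic flow axis is straight, the motion is divided in
    proportion to the temporal distances.
    """
    dx, dy = p_vector
    forward = (dx * t_b // t_p, dy * t_b // t_p)    # from the I picture
    backward = (forward[0] - dx, forward[1] - dy)   # from the P picture
    return forward, backward

print(direct_mode_vectors((9, -6), t_b=1, t_p=3))   # ((3, -2), (-6, 4))
```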
5.33 An AVC codec
Figure 5.101 shows an AVC coder–decoder pair. There is a good deal of general similarity with the previous standards. In the case of an I picture, there is no motion compensation and no previous picture is relevant. However, spatial prediction will be used. The prediction error will be transformed and quantized for transmission, but is also locally inverse quantized and inverse transformed prior to being added to the prediction to produce an unfiltered reconstructed macroblock. Thus the encoder has available exactly what the decoder will have and both use the same data to make predictions to avoid drift. The type of intra-prediction used is determined from the characteristics of the input picture.
The locally reconstructed macroblocks are also input to the de-blocking filter. This is identical to the decoder’s de-blocking filter and so the output will be identical to the output of the decoder. The de-blocked, decoded I picture can then be used as a basis for encoding a P picture. Using this architecture the de-blocking is in-loop for intercoding purposes, but does not interfere with the intra-prediction.
Operation of the decoder should be obvious from what has gone before as the encoder effectively contains a decoder.
Like earlier formats, AVC uses lossless entropy coding to pack the data more efficiently. However, AVC takes the principle further: entropy coding is used to compress syntax data as well as coefficients. Syntax data take a variety of forms: vectors, slice headers and so on. A common Exp-Golomb variable-length code is used for all syntax data, and the different types of data are mapped appropriately for their statistics before that code is applied. Coefficients are coded using a system called CAVLC (context-adaptive variable-length coding).
Optionally, a further technique known as CABAC (context adaptive binary arithmetic coding) may be used in some profiles. This is a system which adjusts the coding dynamically according to the local statistics of the data instead of relying on statistics assumed at the design stage. It is more efficient and allows a coding gain of about 15 per cent with more complexity.
CAVLC performs the same function as RLC/VLC in MPEG-2 but it is more efficient. As in MPEG-2 it relies on the probability that coefficient values fall with increasing spatial frequency and that at the higher frequencies coefficients will be spaced apart by zero values. The efficient prediction of AVC means that coefficients will typically be smaller than in earlier standards. It becomes useful to have specific means to code coefficients of value ±1 as well as zero. These are known as trailing ones (T1s). Figure 5.102 shows the parameters used in CAVLC.
The coefficients are encoded in the reverse order of the zig-zag scan. The number of non-zero coefficients N and the number of trailing ones are encoded into a single VLC symbol. The TotalZeros parameter defines the number of zero coefficients lying between the start of the scan and the last non-zero coefficient. Together with N, this tells the decoder where the highest-frequency non-zero coefficient lies and how many zeros are interleaved with the transmitted coefficients, but not where they are. This is the function of the RunBefore parameter which is sent prior to any coefficient that is preceded by zeros in the transmission sequence. If N is sixteen, TotalZeros must be zero and will not be sent, and RunBefore parameters will not occur.
Coefficient values for trailing ones need only a single bit to denote the sign. Values above 1 embed the polarity into the value input to the VLC.
CAVLC obtains extra coding efficiency because it can select different codes according to circumstances. For example, if in a sixteen-coefficient block N is 7, then TotalZeros must have a value between zero and nine. The encoder selects a VLC table optimized for nine values. The decoder can establish which table has been used by subtracting N from 16, so no extra data need be sent to switch tables. The N and T1s parameters can be coded using one of four tables selected using the values of N and T1s in nearby blocks. Six code tables are available for adaptive coefficient encoding.
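The derivation of these parameters from a zig-zag scanned block can be sketched as below; only the parameter extraction is shown, not the code tables themselves, and the example block is invented.

```python
def cavlc_parameters(zigzag_coeffs):
    """Derive the main CAVLC parameters from a zig-zag scanned 4x4 block.

    Returns the number of non-zero coefficients N, the number of trailing
    +/-1s (T1s, at most three), TotalZeros and the RunBefore values that
    accompany coefficients in the reverse-scan transmission order.
    """
    nonzero = [i for i, c in enumerate(zigzag_coeffs) if c != 0]
    n = len(nonzero)
    if n == 0:
        return 0, 0, 0, []
    last = nonzero[-1]
    total_zeros = last + 1 - n       # zeros before the last non-zero coefficient

    # Trailing ones: +/-1 values at the high-frequency end of the scan.
    t1s = 0
    for pos in reversed(nonzero):
        if abs(zigzag_coeffs[pos]) == 1 and t1s < 3:
            t1s += 1
        else:
            break

    # RunBefore: zeros immediately preceding each coefficient, listed in
    # reverse scan order.
    run_before = []
    prev = last
    for pos in reversed(nonzero[:-1]):
        run_before.append(prev - pos - 1)
        prev = pos
    return n, t1s, total_zeros, run_before

# A typical low-energy block: 0, 3, 0, 1, -1, -1, 0, 1, then zeros.
block = [0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8
print(cavlc_parameters(block))   # (5, 3, 3, [1, 0, 0, 1])
```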
5.34 AVC profiles and levels
Like earlier standards, AVC is specified at various profiles and levels and these terms have the same meaning as they do elsewhere. There are three profiles: Baseline, Main and Extended. The Baseline profile supports all of the features of AVC except for bidirectional coding, interlace, weighted prediction, SP and SI slices and CABAC. FMO is supported, but the number of slice groups per picture cannot exceed eight. The Main profile supports all features except FMO and SP and SI slices. The Extended profile supports all features except CABAC and macroblock-adaptive switching between field and frame coding.
AVC levels are defined by various requirements for decoder memory and processing power and are defined as sets of maxima such as picture size, bit rate and vector range. Figure 5.103 shows the levels defined for AVC.
5.35 Coding artifacts
This section describes the visible results of imperfect coding. This may be where the coding algorithm is sub-optimal, where the coder latency is too short or where the compression factor in use is simply too great for the material.
Figure 5.104 shows that all MPEG coders can be simplified to the same basic model in which the predictive power of the decoder is modelled in the encoder. The current picture/object is predicted as well as possible, and the prediction error or residual is transmitted. As the decoder can make the same prediction, if the residual is correctly transmitted, the decoded output will be lossless.
The MPEG coding family contains a large number of compression tools and many of these are lossless. Lossless coding tools include the motion compensation, zig-zag scanning and run-length coding of MPEG-1 and MPEG-2 and the context coding and various types of predictive coding used in MPEG-4 and AVC. It is a simple fact that the more sophisticated the prediction, the fewer data will be needed to transmit the residual.
However, in many, perhaps most, applications of MPEG, the bit rate is not decided by the encoder, but by some other decision process which may be economically biased. All MPEG coders contain an output buffer, and if the allocated bit rate is not adequate, this will start to fill up. To prevent data loss and decoder crashes, the encoder must reduce its bit rate.
It cannot make any economy in the lossless coding. Reducing the losslessly coded information would damage the prediction mechanism and increase the residual. Reducing the accuracy of vectors, for example, is counterproductive. Consequently all that can be done is to reduce the bit rate of the residual. This must be done where the transform coefficients are quantized, because this is the only variable mechanism at the encoder’s disposal. When in an unsympathetic environment, the coder simply has to reduce the bit rate as best it can.
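The feedback can be sketched as a toy proportional controller acting on buffer occupancy; real rate-control algorithms are considerably more elaborate and are not specified by MPEG, so every number below is an assumption.

```python
def update_quantizer(qp, buffer_fullness, target=0.5, gain=8, qp_min=1, qp_max=51):
    """Adjust the quantizer from output buffer occupancy.

    buffer_fullness : current occupancy as a fraction of buffer size
    target          : desired occupancy

    A fuller buffer than intended means too many bits are being produced,
    so the quantizer is coarsened; an emptier buffer allows it to be relaxed.
    """
    qp += round(gain * (buffer_fullness - target))
    return max(qp_min, min(qp_max, qp))

print(update_quantizer(26, 0.80))   # buffer filling: quantize more coarsely
print(update_quantizer(26, 0.25))   # buffer draining: quantize more finely
```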
The result is that the output pictures are visually different from the input pictures. The difference between the two is classified as coding noise. This author has great difficulty with this term, because true noise should be decorrelated from the message whereas the errors of MPEG coders are signal dependent and the term coding distortion is more appropriate. This is supported by the fact that the subjective visual impact of coding errors is far greater than would be caused by the same level of true noise. Put another way, quantizing distortion is irritating and pretending that it is noise gives no assistance in determining how bad it will look.
In motion-compensated systems such as MPEG, the use of periodic intra-coding means that the coding error may vary from picture to picture and this may be visible as noise pumping. The designer of the coder is in a difficult position as the user reduces the bit rate. If the I data are excessive, the P and B data will have to be heavily quantized, resulting in errors. However, if the data representing the I pictures/objects are reduced too much, P and B pictures may look better than the I pictures. Consequently it is necessary to balance the requantizing so that the level of visible artifacts remains roughly the same in the I, P and B pictures/objects.
Noise pumping may also be visible where the amount of motion changes. If a pan is observed, as the pan speed increases the motion vectors may become less accurate and reduce the quality of the prediction processes. The prediction errors will get larger and will have to be more coarsely quantized. Thus the picture gets noisier as the pan accelerates and the noise reduces as the pan slows down. The same result may be apparent at the edges of a picture during zooming. The problem is worse if the picture contains fine detail. Panning on grass or trees waving in the wind taxes most coders severely. Camera shake from a hand-held camera also increases the motion vector data and results in more noise, as does film weave.
Input video noise or film grain degrades inter-coding as there is less redundancy between pictures and the difference data become larger, requiring coarse quantizing and adding to the existing noise.
Where a codec is really struggling, the quantizing may become very coarse and as a result the video level at the edge of one DCT block may not match that of its neighbour. As a result the DCT block structure becomes visible as a mosaicking or tiling effect. MPEG-4 introduces some decoding techniques to filter out blocking effects and in principle these could be applied to MPEG-1 and MPEG-2 decoders. AVC has the best solution, namely to insert the de-blocking filter into the encoding loop to prevent drift, as was described above. Consequently the blocking performance of AVC is exceptional.
Coarse quantizing also causes some coefficients to be rounded up and appear larger than they should be. High-frequency coefficients may be eliminated by heavy quantizing and this forces the DCT to act as a steep-cut low-pass filter. This causes fringing or ringing around sharp edges and extra shadowy edges which were not in the original and is most noticeable on text.
Excess compression may also result in colour bleed where fringing has taken place in the chroma or where high-frequency chroma coefficients have been discarded. Graduated colour areas may reveal banding or posterizing as the colour range is restricted by requantizing. These artifacts are almost impossible to measure with conventional test gear.
Neither noise pumping nor blocking are visible on analog video recorders and so it is nonsense to liken the performance of a codec to the quality of a VCR. In fact noise pumping is extremely objectionable because, unlike steady noise, it attracts attention in peripheral vision and may result in viewing fatigue.
In addition to highly detailed pictures with complex motion, certain types of video signal are difficult for MPEG to handle and will usually result in a higher level of artifacts than usual. Noise has already been mentioned as a source of problems. Timebase error from, for example, VCRs is undesirable because this puts successive lines in different horizontal positions. A straight vertical line becomes jagged and this results in high spatial frequencies in the transform process. Spurious coefficients are created which need to be coded.
Much archive video is in composite form and MPEG can only handle this after it has been decoded to components. Unfortunately many general-purpose composite decoders have a high level of residual subcarrier in the outputs. This is normally not a problem because the subcarrier is designed to be invisible to the naked eye. Figure 5.105 shows that in PAL and NTSC the subcarrier frequency is selected so that a phase reversal is achieved between successive lines and frames.
Whilst this makes the subcarrier invisible to the eye, it is not invisible to an MPEG decoder. The subcarrier waveform is interpreted as a horizontal frequency, the vertical phase reversals are interpreted as a vertical spatial frequency and the picture-to-picture reversals increase the magnitude of the prediction errors. The subcarrier level may be low but it can be present over the whole screen and require an excess of coefficients to describe it.
Composite video should not in general be used as a source for MPEG encoding, but where this is inevitable the standard of the decoder must be much higher than average, especially in the residual subcarrier specification. Some MPEG preprocessors support high-grade composite decoding options.
Judder from conventional linear standards convertors degrades the performance of MPEG. The optic flow axis is corrupted and linear filtering causes multiple images which confuse motion estimators and result in larger prediction errors. If standards conversion is necessary, the MPEG system must be used to encode the signal in its original format and the standards convertor should be installed after the decoder. If a standards convertor has to be used before the encoder, then it must be a type which has effective motion compensation.
Film weave causes movement of one picture with respect to the next and this results in more vector activity and larger prediction errors. Movement of the centre of the film frame along the optical axis causes magnification changes which also result in excess prediction error data. Film grain has the same effect as noise: it is random and so cannot be compressed.
Perhaps because it is relatively uncommon, MPEG-2 cannot handle image rotation well because the motion-compensation system is only designed for translational motion. Where a rotating object is highly detailed, such as in certain fairground rides, the motion-compensation failure requires a significant amount of prediction error data and if a suitable bit rate is not available the level of artifacts will rise. The additional coding tools of MPEG-4 allow rotation to be handled more effectively. Meshes can be rotated by repositioning their vertices with a few vectors and this improves the prediction of rotations dramatically.
Flashguns used by still photographers are a serious hazard to MPEG, especially when long GOPs are used. At a press conference where a series of flashes may occur, the resultant video contains intermittent white frames. These are easy enough to code, but a huge prediction error is then required to turn the white frame back into the following picture. The output buffer fills and heavy requantizing is employed. After a few flashes the picture has generally gone to tiles. AVC can handle a single flash by using an earlier picture as the basis for decoding, thereby bypassing the white picture.
5.36 MPEG and concatenation
Concatenation loss occurs when the losses introduced by one codec are compounded by a second codec. All practical compressors, MPEG included, are lossy because what comes out of the decoder is not bit-identical to what went into the encoder. The bit differences are controlled so that they have minimum visibility to a human viewer.
MPEG is a toolbox which allows a variety of manipulations to be performed in both the spatial and temporal domains. MPEG-2 has more tools than MPEG-1 and MPEG-4 has more still. There is a limit to the compression which can be used on a single picture/object, and if higher compression factors are needed, temporal coding will have to be used. The longer the run of time considered, the lower the bit rate needed, but the harder it becomes to edit.
The most editable form of MPEG is to use I data only. As there is no temporal coding, pure cut edits can be made between pictures. The next best thing is to use a repeating IB structure which is locked to the odd/even field structure. Cut edits cannot be made as the B pictures are bidirectionally coded and need data from both adjacent I pictures for decoding. The B picture has to be decoded prior to the edit and re-encoded after the edit. This will cause a small concatenation loss.
Beyond the IB structure processing gets harder. If a long GOP is used for the best compression factor, an IBBPBBP… structure results. Editing this is very difficult because the pictures are sent out of order so that bidirectional decoding can be used. MPEG allows closed GOPs where the last B picture is coded wholly from the previous pictures and does not need the I picture in the next GOP. The bitstream can be switched at this point but only if the GOP structures in the two source video signals are synchronized (makes colour framing seem easy). Consequently in practice a long GOP bitstream will need to be decoded prior to any production step. Afterwards it will need to be re-encoded.
This is known as naive concatenation and an enormous pitfall awaits. Unless the GOP structure of the output is identical to and synchronized with the input the results will be disappointing. The worst case is where an I picture is encoded from a picture which was formerly a B picture. It is easy enough to lock the GOP structure of a coder to a single input, but if an edit is made between two inputs, the GOP timings could well be different.
As there are so many structures allowed in MPEG, there will be a need to convert between them. If this has to be done, it should only be in the direction which increases the GOP length and reduces the bit rate. Going the other way is inadvisable. The ideal way of converting from, say, the IB structure of a news system to the IBBP structure of an emission system is to use a recompressor. This is a kind of standards convertor which will give better results than a decode followed by an encode.
The DCT part of MPEG itself is lossless. If all the coefficients are preserved intact an inverse transform yields the same pixel data. Unfortunately this does not yield enough compression for many applications. In practice the coefficients are made less accurate by removing bits starting at the least significant end and working upwards. This process is weighted, or made progressively more aggressive as spatial frequency increases.
Small-value coefficients may be truncated to zero and large-value coefficients are most coarsely truncated at high spatial frequencies where the effect is least visible.
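The following Python sketch shows the principle; the 4 × 4 block and weighting table are invented for brevity (real MPEG uses 8 × 8 blocks and its own default matrices). Each coefficient is divided by a step size that grows with spatial frequency and then rounded, so small coefficients fall to zero and high-frequency coefficients are truncated most coarsely.

def requantize(coefficients, weights, quantizer_scale):
    """Weighted requantizing: divide each DCT coefficient by a step size
    that grows with spatial frequency, then round to the nearest level."""
    return [[round(c / (w * quantizer_scale)) for c, w in zip(c_row, w_row)]
            for c_row, w_row in zip(coefficients, weights)]

def dequantize(levels, weights, quantizer_scale):
    """Inverse: reconstruct approximate coefficients from the levels."""
    return [[l * w * quantizer_scale for l, w in zip(l_row, w_row)]
            for l_row, w_row in zip(levels, weights)]

# Invented 4 x 4 example block (DC at top left) and weighting table:
coeffs  = [[620, 40, 9, 2],
           [ 35, 12, 4, 1],
           [  8,  5, 2, 1],
           [  3,  1, 1, 0]]
weights = [[  8, 16, 24, 32],
           [ 16, 24, 32, 40],
           [ 24, 32, 40, 48],
           [ 32, 40, 48, 56]]

levels = requantize(coeffs, weights, quantizer_scale=1)
print(levels)                            # most high-frequency entries are now zero
print(dequantize(levels, weights, 1))    # the coarsened coefficients the decoder sees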
Figure 5.106(a) shows what happens in the ideal case where two identical coders are put in tandem and synchronized. The first coder quantizes the coefficients to finite accuracy and causes a loss on decoding. However, when the second coder performs the DCT calculation, the coefficients obtained will be identical to the quantized coefficients in the first coder, and so, if the second weighting and requantizing step is identical, the same truncated coefficient data will result and there will be no further loss of quality.9
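The idempotence behind this ideal case is easy to demonstrate in a stand-alone Python sketch: requantizing an already-quantized value with the identical step size changes nothing.

def quantize(value, step):
    """Truncate a coefficient to the nearest multiple of 'step'."""
    return round(value / step) * step

first_pass = quantize(123.0, 8)          # 123 -> 120: loss introduced here
# A second, identical coder recovers exactly the quantized coefficient,
# so applying the same weighting and step size is a no-op.
second_pass = quantize(first_pass, 8)    # 120 -> 120: no further loss
print(first_pass, second_pass, first_pass == second_pass)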
In practice this ideal situation is elusive. If the two DCTs become non-identical for any reason, the second requantizing step will introduce further error in the coefficients and the artifact level goes up. Figure 5.106(b) shows that non-identical concatenation can result from a large number of real-world effects.
An intermediate processing step such as a fade will change the pixel values and thereby the coefficients. A DVE resize or shift will move pixels from one DCT block to another. Even if there is no processing step, this effect will also occur if the two codecs disagree on where the MPEG picture boundaries are within the picture. If the boundaries are correct there will still be concatenation loss if the two codecs use different weighting.
One problem with MPEG is that the compressor design is unspecified. Whilst this has advantages, it does mean that the chances of finding identical coders are minute, because each manufacturer will have its own views on the best compression algorithm. In a large system it may be worth obtaining the coders from a single supplier.
It is now increasingly accepted that concatenation of compression techniques is potentially damaging, and results are worse if the codecs are different. Clearly feeding a digital coder such as MPEG-2 with a signal which has been subject to analog compression comes into the category of worse. Using interlaced video as a source for MPEG coding is sub-optimal and using decoded composite video is even worse.
One way of avoiding concatenation is to stay in the compressed data domain. If the goal is just to move pictures from one place to another, decoding to traditional video so that an existing router can be used is not ideal, although it is substantially better than going through the analog domain.
Figure 5.107 shows some possibilities for picture transport. Clearly, if the pictures exist as a compressed file on a server, a file transfer is the right way to move them: the data are copied bit for bit, so there is no concatenation and no possibility of loss. File transfer is also quite indifferent to the picture format. It doesn’t care about the frame rate, whether the pictures are interlaced or not, or whether the colour is 4:2:0 or 4:2:2.
Decoding to SDI (serial digital interface) standard is sometimes done so that existing serial digital routing can be used. This is concatenation and has to be done carefully. The video must then be interlaced with non-square pixels, and the colour coding has to be 4:2:2, because SDI allows nothing else. If a compressed file uses 4:2:0, the chroma has to be interpolated up to 4:2:2 for the SDI transfer and then subsampled back to 4:2:0 at the second coder, and this will cause generation loss. An SDI transfer can also only be performed in real time, negating one of the advantages of compression. In short, traditional SDI is not really at home with compression.
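The chroma round trip is easy to demonstrate. The Python sketch below uses crude two-tap averaging in place of the real interpolation and decimation filters, so the numbers are only illustrative, but it shows why the 4:2:0 to 4:2:2 to 4:2:0 conversion is not transparent:

def chroma_420_to_422(column):
    """Vertically interpolate one 4:2:0 chroma column up to 4:2:2 by
    inserting a sample mid-way between each pair of original samples
    (a crude stand-in for a real interpolation filter)."""
    out = []
    for i, a in enumerate(column):
        b = column[i + 1] if i + 1 < len(column) else a
        out += [a, (a + b) / 2]          # original sample, then interpolated one
    return out

def chroma_422_to_420(column):
    """Subsample 4:2:2 chroma back to 4:2:0 with a simple two-tap average
    (again a stand-in for a real decimation filter)."""
    return [(column[i] + column[i + 1]) / 2 for i in range(0, len(column), 2)]

original   = [100, 104, 96, 90]          # one chroma column of a 4:2:0 picture
round_trip = chroma_422_to_420(chroma_420_to_422(original))
print(original)      # [100, 104, 96, 90]
print(round_trip)    # [101.0, 102.0, 94.5, 90.0] - generation loss

The reconstructed column differs from the original even though no production step intervened; better filters reduce the error, but in practice the round trip is never quite transparent.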
As 4:2:0 progressive scan gains popularity and video production moves steadily towards non-format-specific hardware using computers and data networks, use of the serial digital interface will eventually decline. In the short term, if an existing SDI router has to be used, one solution is to produce a bitstream which is sufficiently similar to SDI that a router will pass it. One example of this is known as SDTI. The signal level, frequency and impedance of SDTI are pure SDI, but the data protocol is different so that a bit-accurate file transfer can be performed. This has two advantages over SDI. First, the compressed data format can be anything appropriate, and non-interlaced and/or 4:2:0 material can be handled in any picture size, aspect ratio or frame rate. Second, a faster-than-real-time transfer can be used, depending on the compression factor of the file.
Equipment which allows this is becoming available, and its use allows the full economic life of an SDI routing installation to be obtained.
An improved way of reducing concatenation loss has emerged from the ATLANTIC research project.10 Figure 5.108 shows that the second encoder in a concatenated scheme does not make its own decisions from the incoming video, but is instead steered by information from the first bitstream. As the second encoder has less intelligence, it is known as a dim encoder.
The information bus carries all the structure of the original MPEG-2 bitstream which would be lost in a conventional decoder. The ATLANTIC decoder does more than decode the pictures. It also places on the information bus all parameters needed to make the dim encoder re-enact what the initial MPEG-2 encode did as closely as possible.
The GOP structure is passed on so that pictures are re-encoded as the same type. Positions of macroblock boundaries become identical so that DCT blocks contain the same pixels and motion vectors relate to the same screen data. The weighting and quantizing tables are passed so that coefficient truncation is identical. Motion vectors from the original bitstream are passed on so that the dim encoder does not need to perform motion estimation. In this way predicted pictures will be identical to the original prediction and the prediction error data will be the same.
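The kind of side information involved can be pictured as one record per picture. The sketch below is in Python with invented field names; it is not the actual ATLANTIC information-bus format, merely a summary of the parameters described above.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PictureSideInfo:
    """Illustrative per-picture record for a 'dim' re-encoder (field names
    are hypothetical, not the ATLANTIC bus definition)."""
    picture_type: str                      # 'I', 'P' or 'B' - keep the GOP structure
    macroblock_origin: Tuple[int, int]     # so DCT blocks cover the same pixels
    quantizer_scale: int                   # so coefficient truncation matches
    weighting_matrix: List[List[int]]      # intra/inter weighting tables
    motion_vectors: List[Tuple[int, int]]  # reused instead of re-estimated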
One application of this approach is in recompression, where an MPEG-2 bitstream has to have its bit rate reduced. This has to be done by heavier requantizing of coefficients, but if as many other parameters as possible, such as motion vectors, can be kept the same, the degradation will be minimized. In a simple recompressor, just requantizing the coefficients means that the predictive coding will be impaired. In a proper encode, the quantizing error due to coding, say, an I picture is removed from the P picture by the prediction process: the prediction error of P is obtained by subtracting the decoded I picture rather than the original I picture.
In simple recompression this does not happen and there may be a tolerance build-up known as drift.11 A more sophisticated recompressor will need to repeat the prediction process using the decoded output pictures as the prediction reference.
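Drift can be shown with a toy example. In the Python sketch below one number stands for a whole picture, the step sizes are invented and, for brevity, only the I data is requantized; the point is that the stored P residuals were computed against the original decoded reference, so the error introduced by the coarser requantizing is carried into every predicted picture instead of being cancelled.

def quantize(value, step):
    """Round a value to the nearest multiple of 'step'."""
    return round(value / step) * step

# One invented value per picture: an I picture followed by two P pictures.
originals = [123.0, 125.0, 128.0]
FIRST_STEP, RECOMPRESS_STEP = 4, 16

# Proper first-generation encode: P residuals are formed against the
# decoded reference, so quantizing error does not accumulate.
refs, residuals = [], []
ref = None
for value in originals:
    if ref is None:                          # I picture: code the value itself
        level = quantize(value, FIRST_STEP)
        residuals.append(level)
    else:                                    # P picture: code error vs decoded ref
        level = ref + quantize(value - ref, FIRST_STEP)
        residuals.append(level - ref)
    refs.append(level)
    ref = level

# Naive recompression: the I data is requantized more coarsely but the
# existing P residuals are simply reused.
drifted = []
ref = None
for residual in residuals:
    if ref is None:
        ref = quantize(residual, RECOMPRESS_STEP)   # coarser I picture
    else:
        ref = ref + residual                        # residual no longer matches this ref
    drifted.append(ref)

for orig, proper, naive in zip(originals, refs, drifted):
    print(orig, proper, naive)    # the error from the coarser I picture propagates

A proper recompressor re-forms the predictions from its own decoded pictures and recodes the residuals, which is exactly the extra work referred to above.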
MPEG-2 bitstreams will often be decoded for the purpose of switching. Local insertion of commercial breaks into a centrally originated bitstream is one obvious requirement. If the decoded video signal is switched, the information bus must also be switched. At the switch point identical re-encoding becomes impossible because prior pictures required for predictive coding will have disappeared. At this point the dim encoder has to become bright again because it has to create an MPEG-2 bitstream without assistance.
It is possible to encode the information bus into a form which allows it to be invisibly carried in the serial digital interface. Where a production process such as a vision mixer or DVE performs no manipulation, i.e. becomes bit transparent, the subsequent encoder can extract the information bus and operate in ‘dim’ mode. Where a manipulation is performed, the information bus signal will be corrupted and the encoder has to work in ‘bright’ mode. The encoded information signal is known as a ‘mole’12 because it burrows through the processing equipment!
There will be a generation loss at the switch point because the re-encode will be making different decisions in bright mode. This may be difficult to detect because the human visual system is slow to react to a vision cut and defects in the first few pictures after a cut are masked.
In addition to the video computation required to perform a cut, the process has to consider the buffer occupancy of the decoder. A downstream decoder has finite buffer memory, and individual encoders model the decoder buffer occupancy to ensure that it neither overflows nor underflows. At any instant the decoder buffer can be nearly full or nearly empty without a problem provided there is a subsequent correction. An encoder which is approaching a complex I picture may run down the buffer so it can send a lot of data to describe that picture. Figure 5.109(a) shows that if a decoder with a nearly full buffer is suddenly switched to an encoder which has been running down its buffer occupancy, the decoder buffer will overflow when the second encoder sends a lot of data.
An MPEG-2 switcher will need to monitor the buffer occupancy of its own output to avoid overflow of downstream decoders. Where this is a possibility the second encoder will have to recompress to reduce the output bit rate temporarily. In practice there will be a recovery period where the buffer occupancy of the newly selected signal is matched to that of the previous signal. This is shown in Figure 5.109(b).
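Buffer occupancy at a switch point can also be illustrated with a toy model. The Python sketch below uses invented numbers, a constant-rate channel and one picture removed per period; it shows that a schedule which is perfectly legal from the occupancy its encoder assumed can overflow the decoder buffer when it is joined from a stream that left the buffer much fuller.

BUFFER_SIZE       = 1_000_000   # decoder buffer capacity in bits (invented)
BITS_PER_INTERVAL =   200_000   # channel delivery per picture period (invented)

def occupancy_trace(start_occupancy, picture_sizes):
    """Very simplified decoder-buffer model: each picture period the channel
    delivers a fixed number of bits and the decoder removes one coded
    picture.  Returns the occupancy after each period."""
    occupancy, trace = start_occupancy, []
    for size in picture_sizes:
        occupancy += BITS_PER_INTERVAL      # bits arriving from the channel
        if occupancy > BUFFER_SIZE:
            trace.append('OVERFLOW')
            break
        occupancy -= size                   # decoder removes one picture
        trace.append(occupancy)
    return trace

# The new stream sends small pictures while building up to a complex I picture.
new_stream = [50_000, 50_000, 50_000, 800_000]

# Joined from the occupancy its own encoder assumed: no problem.
print(occupancy_trace(200_000, new_stream))   # [350000, 500000, 650000, 50000]

# Joined after a stream that left the decoder buffer nearly full: overflow.
print(occupancy_trace(600_000, new_stream))   # [750000, 900000, 'OVERFLOW']

The recovery period described above is the interval over which the switcher forces the two occupancy trajectories to converge.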
References
1. Kelly, D.H., Visual processing of moving stimuli. J. Opt. Soc. America, 2, 216–225 (1985)
2. Uyttendaele, A., Observations on scanning formats. Presented at HDTV ’97, Montreux (June 1997)
3. Shapiro, J.M., Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. SP, 41, 3445–3462 (1993)
4. Taubin, G. and Rossignac, J., Geometric compression through topological surgery. ACM Trans. on Graphics, 88–115 (April 1998)
5. de Boor, C., A Practical Guide to Splines. Berlin: Springer (1978)
6. Taubin, G., Guezic, A., Horn, W.P. and Lazarus, F., Progressive forest split compression. Proc. Siggraph ’98, 123–132 (1998)
7. ITU-T Recommendation H.264, Advanced video coding for generic audiovisual services.
8. Karczewicz, M. and Kurceren, R., The SP and SI frames design for H.264/AVC. IEEE Trans. on Circuits and Systems for Video Technology (July 2003)
9. Stone, J. and Wilkinson, J., Concatenation of video compression systems. Presented at 37th SMPTE Tech. Conf., New Orleans (1995)
10. Wells, N.D., The ATLANTIC project: models for programme production and distribution. Proc. Euro. Conf. Multimedia Applications Services and Techniques (ECMAST), 243–253 (1996)
11. Werner, O., Drift analysis and drift reduction for multiresolution hybrid video coding. Image Communication, 8, 387–409 (1996)
12. Knee, M.J. and Wells, N.D., Seamless concatenation – a 21st century dream. Presented at Int. Television Symp., Montreux (1997)