The wide range of formats used to archive and to display images constitutes an obstacle to their use by the general public. While the JPEG format has become widely used by both specialist and general users of photography, and is widely considered to represent a standard, a variety of other formats are available, and often produce better results.
The choice of format is determined by differing and often conflicting aims: reduction of image volume and/or maximum respect of properties (dynamics, chromatic palette, line geometry, quality of solid colors, etc.). Certain formats are best suited for archival purposes, while others are better for transmission, or for interoperability across a range of different treatment or visualization programs.
Images are used in a broad and varied range of applications. Some of these applications require documents to be transmitted at high speeds, while others are more concerned with quality, which may be judged according to different domain-specific criteria: representation of nuances or fine details, line sharpness, color reproduction, text readability, capacity for integration into a complex composition, etc.
Nowadays, even the most “ordinary” photograph is made up of at least ten million pixels, each often represented by 3 bytes, one per chromatic channel (red, green or blue). If the signal is not compressed in any way, each image requires several tens of millions of bytes of storage space in both the camera and computer memory. Moreover, hundreds, or even thousands, of photographs may now be taken in the course of an event. Professional and enthusiastic amateur photographers use hundreds of gigabytes of memory for image storage alone. In addition to this archiving cost, image consultation and transfer are also slower and more costly. The formats described in this chapter have been developed in response to these requirements. A more detailed technical analysis of each format may be found in [BAR 03b, WOO 11], for example.
Before considering questions of user access to digital photographs, we must examine the way in which these images are represented for computing purposes.
For a black and white image containing nl lines of np points, where each pixel is coded by a single byte, the simplest method is to record the sequence of N = np × nl bytes in a single file. This native format is not completely free from ambiguity, as there are generally multiple ways of constructing a rectangular image from N pixels. To avoid sacrificing levels of gray in order to mark the ends of lines, we must therefore also transmit the values np and nl. These values np and nl constitute the simplest elements of the metadata which needs to accompany an image to ensure readability; many other elements become necessary as we consider different types of images, including the way in which bytes are converted into gray levels (natural code, Gray code, or codes using look-up tables (LUTs)1), the way in which color images are represented (by association of 3 image planes or by interweaving the 3 bytes of each pixel along a line), the type of coding used, the number of channels per pixel (for multi-spectrum images) and the number of bytes per pixel (for high dynamic images), the number of dimensions (for medical applications, for example), etc. Nowadays, metadata also includes identification and property marking elements, along with information on capture conditions, and on the treatments applied to images over the course of their lifetime. For photographers, metadata constitutes an obvious location for specifying capture parameters, allowing them to get the best out of their image.
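The scheme described above can be sketched in a few lines of Python. The file layout (two little-endian 32-bit dimensions as minimal metadata, followed by the pixel bytes) and the file name are illustrative assumptions, not any particular camera's format:

```python
import struct

def write_raw(path, pixels, np_cols, nl_rows):
    """Store an nl x np gray-level image: the two dimensions first
    (the minimal metadata), then the N = np * nl pixel bytes."""
    assert len(pixels) == np_cols * nl_rows
    with open(path, "wb") as f:
        f.write(struct.pack("<II", np_cols, nl_rows))   # little-endian header
        f.write(bytes(pixels))

def read_raw(path):
    """Read the dimensions back first: they remove the ambiguity in how
    the N pixels should be arranged into a rectangle."""
    with open(path, "rb") as f:
        np_cols, nl_rows = struct.unpack("<II", f.read(8))
        return list(f.read(np_cols * nl_rows)), np_cols, nl_rows

img = [0, 36, 73, 109, 146, 182, 219, 255]   # a 4 x 2 gradient
write_raw("demo.raw", img, 4, 2)
restored, w, h = read_raw("demo.raw")
print(w, h, restored == img)   # 4 2 True
```

Without the two dimension values in the header, the eight bytes could equally be read as an 8 × 1, 2 × 4 or 1 × 8 image.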
Metadata is sometimes placed in a second (format) file, used alongside the image file. This solution is still used in certain conditions, where a large quantity of metadata which does not concern the image itself is required (for example patient information in the context of medical imaging, or mission information in astronomic satellite imaging). Separation of the image and metadata files enables greater confidentiality in the treatment of these files. It also reduces the risk that an error made in handling one of the files will cause significant damage to the other file.
This solution is not generally used in mass-market digital photography, where metadata and image data are combined into a single file, containing information of very different types: some binary, other symbolic. Metadata is usually placed in a header above the image data, using a syntactically-defined format identified by the file extension (.jpg, .tif or .bmp). Reading this metadata provides the keys needed to correctly read and display the image field. It also enables users to make use of information included by the camera manufacturer or by previous users of an image, in order to exploit image properties to the full.
The variety and complexity of available image formats are essentially due to the varying importance of different property types for different groups of users, requiring the inclusion of varying amounts of detail in the metadata.
Saving a signal in the way in which it is recorded by the sensor seems logical, if sufficient care is taken to record the parameters required for future treatment alongside the image. For this reason, more bits than are strictly necessary will be assigned to the signal, the original sensor dynamics will be retained, and image reconstruction elements will be recorded, including the dark current and amplifier gains up to saturation values. The geometry imposed by the objective and by mosaicing must also be specified, along with capture parameters, exposure, aperture, other relevant settings, and the specific colorimetric properties of the chromatic filters, in order to permit subsequent corrections and produce an image of the highest possible quality. This information is placed in the metadata field.
This principle is used in native, or RAW, format. Strictly speaking, it does not constitute a true format, as it is not unique and cannot be shared. However, we will continue to use the term, according to common usage, in cases where it does not cause ambiguity.
Native format is used by camera manufacturers as a way of recording signals immediately from sensor output; this first step needs to be carried out at speed, without considering the operations needed for visualization or transmission to other users. The file obtained is a faithful representation of the measurements received from the sensor (see Figure 8.1). Later treatment, either within the camera or using a separate computer, allows full exploitation of this data. However, the RAW format is highly dependent on manufacturers, and images saved in this way were not initially intended to be conserved in this format outside of the camera production line (whether for embedded or remote processing). The popularity of this “format” is a sign of its success, but it remains subject to a number of faults associated with its intended usage.
An ISO standard does exist (discussed in section 8.2.4), and aims to enable the transmission of these files between users; however, manufacturers continue to use their own versions, which are modified in line with their own specific requirements. Native files are marked by a specific file extension (over one hundred types currently exist2), and not by the extension .raw, as we might wish in the case of a shared format. These formats cannot generally be read using ordinary display tools (some examples are even encrypted), but a large number of image processing libraries recognize the most widespread formats, and updates to reading algorithms are made available online as these formats evolve.
There are many reasons why these formats are not universal. First, they reflect the geometry of the photodetector (variable, as seen in section 5.4.2); second, they are tailored to specific architectures in order to take account of the precise noises affecting the signal (dark current, reset current, transfer transistor noise, presence of shared amplifiers, etc.) and provide the correct parameters for noise-reducing filtering equations (such as equation [7.10]). Colorimetric operations are expressed differently as a function of specific chromatic filters, and, finally, each format has a different way of treating known faults, blind pixels, geometric aberrations, vignetting, etc.
The transformation of images from native format to a more widespread format (such as JPEG) occurs during the transfer of the image from the camera memory to a computer, in a process designed by device manufacturers. This operation is also often required for images to be displayed on the camera screen. The solution implemented within the camera or by the host computer using a specific program is often the best available option, as the camera designer is aware of the exact parameters required to best exploit the signal, and knows where to find the necessary parameters in the ancillary data associated with the image. Within the camera, however, the time available for corrections is limited, and more complex correction processes may need to be applied later using a computer. It is also possible for users to override manufacturer settings, which may involve treatments which the user does not require (sharpness reinforcement or color accentuation, for example). In this case, a specific treatment process needs to be applied to the native image, whether using DCraw3, a widely available free program which provides a decoding tool for all commercially-available cameras, or using more general toolkits (openware products such as RawTherapee4, or commercial products such as Photoshop, Lightroom, DxO Labs, etc.), giving users access to a wide range of treatments.
Image processing libraries offer different algorithms to those supplied by manufacturers, and have the potential to produce different reconstructions. These may be better, as more complex algorithms can be used, for example for colorimetrics, noise filtering or demosaicing. Furthermore, these libraries may include treatments for effects which cameras do not always take into account: optical vignetting, pixel vignetting, geometric aberrations etc. In these cases, the reconstruction is better than that obtained using the manufacturer’s methods.
The native RAW format needs to include the elements which differentiate different sensors. It generally includes the following elements:
As many metadata elements are shared by all images, typical formats for these elements are used (for example Exif); these are discussed below. However, the native format file also contains sensor-specific information (particularly in relation to the precise parameters governing electronic elements: thermal noise, positions of dead pixels, non-homogeneity in gains, corrections for pixel vignetting) which manufacturers do not always wish to make public. For this reason, the metadata of native files is sometimes encrypted in order to prevent them from being read by software outside of the manufacturer’s control.
Finally, note that certain manufacturers use native formats in which image files undergo lossless coding (Lempel-Ziv type).
As we have stated, the native format is bulky, and this represents one of its major drawbacks. The header file takes up a few thousand bytes, generally negligible in relation to the size of the image file. For an image with n pixels, each pixel is represented by a single signal in one of the R, G or B channels, and this signal is generally coded using 2 bytes. Without compression, the whole file will therefore take up around 2n bytes. Using a lossless compression algorithm, we can obtain a reduction of the order of 2; a RAW file takes up around n bytes7. This compression, if selected by the manufacturer, will be applied automatically. Native image files generally take up between 10 and 50 MB, around ten times more than the corresponding JPEG file.
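As a quick sanity check on these orders of magnitude (the 24 Mpixel sensor size is an assumption chosen for the example):

```python
# Order-of-magnitude storage cost of a native (RAW) file
n = 24_000_000                   # hypothetical 24 Mpixel sensor
uncompressed = 2 * n             # one 2-byte sample per pixel -> ~2n bytes
compressed = uncompressed // 2   # lossless coding gains roughly a factor of 2
print(uncompressed // 10**6, compressed // 10**6)   # 48 24 (MB)
```

Both figures fall in the 10 to 50 MB range quoted above, roughly ten times the size of the corresponding JPEG.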
The quality of the signal captured by a camera is best preserved using the native RAW format. Moreover, the accompanying metadata provides all of the elements required for efficient image processing.
This metadata allows us to carry out all of the operations needed to reconstruct an image signal over 8 or 16 bits for any applications, even those which are demanding in terms of quality, such as printing, publication, photocomposition, video integration, etc. JPEG type output is still possible for everyday uses, but if volume is not an issue, formats with a higher dynamic, which better respect the fine details of images, are often preferable: TIFF, DNG, PNG, etc.
Use of the native format is essential in many advanced applications, such as those described in Chapter 10: high dynamic rendering, panoramas, and increasing resolution and focus.
It is often better to use separate computer software for reconstruction processes. These programs, whether free or commercial, give us the opportunity to carry out post-processing as required by specific applications. Post-processing is more effective if the source signal is used than if the signal has already been processed for general usage.
Whilst native formats provide the most faithful rendering of images, they are also subject to a number of faults:
The format standardized by the ISO (ISO 12234-2 [ISO 10]) is known as Tagged Image File Format – Electronic Photography (TIFF/EP) and derives from the TIFF 6.0 format (see section 8.2.4.1), which is more widespread, but under the control of Adobe. It aims to constitute a subset of image formats8.
Unlike TIFF, TIFF/EP is fully accessible to users. Like TIFF products9, it is particularly aimed at applications in graphic imaging. For this reason, particular attention has been paid to document compression aspects, as these processes produce much higher gains in the context of graphic images than for photographic images. Three types of lossless coding may be used: Huffman coding by areas, Huffman-type coding by blocks, or Lempel–Ziv coding, now that this method is in the public domain (see section 8.4). The only lossy compression method currently available in practice is JPEG/DCT.
In the example given above of a native file produced by a Bayer sensor, reconstructed then transcoded into TIFF, coding is carried out in TIFF/EP using 2 bytes per channel and per pixel, i.e. 6 bytes per pixel and 6n bytes for the full image (120 MB for a 20 Mpixel image). This is the largest volume this image may take up. Lossless compression is therefore recommended in order to obtain a more reasonable file size; even using lossless methods, significant levels of compression may be attained, as the dynamic does not generally use the full 16 bits.
The TIFF/EP format includes a metadata file relatively similar to the Exif file, which will be discussed below, but based on the TIFF format metadata file, with the exclusion of a certain number of fields. Most of the Exif fields are included in TIFF/EP, but arranged using a different structure and with slight changes in definition in certain cases.
The TIFF/EP format can also be used for tiled representation of images (see section 8.7). It is therefore suitable for use with very large images.
The ISO standard is not respected by most manufacturers, who continue to use their own formats. However, the formats used are often derived from, and constitute extensions to, the TIFF/EP standard.
The lossless coding aspect of TIFF/EP is notably compatible with the DNG format, proposed by Adobe but available in open access, and accompanied by public processing libraries.
Adobe offers another universal format for transportation of RAW files, independent of the proprietary format used in their creation [ADO 12]. This format, DNG (Digital NeGative), is fairly widely recognized by image processing software, and uses the extension .dng. Openware programs are available to convert any RAW image into a DNG file, but few manufacturers use this format, preferring their own native formats. However, DNG is widely used in image processing and exploitation software.
DNG is an extension of TIFF 6.0 and uses the same formatting rules.
DNG also uses a metadata file, using EXIF, TIFF/EP or other formats. It retains the proprietary data of these files, offering mechanisms to enable them to be read by other programs. It also transmits specific information from the capture system, concerning the illuminant, calibration, color conversion matrices, representation formats, noise levels, dead pixels, etc. DNG includes fields for post-acquisition processing (floating representation, transparent pixels, etc.). It supports JPEG compression (with or without loss) and predictive lossless compression (using DEFLATE, see section 8.4). DNG is able to use images represented using RGB or YCrCb (see section 5.2.4).
We have seen that metadata is crucial in allowing optimal usage of digital images, and we have considered two file types used for this purpose: TIFF/EP and Exif files. The Exif format is the most widely used in photography and will be discussed in detail further on. First, let us consider the efforts which have been made to standardize this file type.
An ISO standard (ISO 16684-1:2012) has been created regarding the description of these data, based on the XMP (extensible metadata platform) profile, proposed by Adobe. In theory, this format allows conservation of existing metadata alongside an XMP profile; in practice, however, conversion of all metadata into XMP format is highly recommended. Metadata is grouped into mutually independent “packets”, which may potentially be processed by separate applications. Users may create their own packets and develop new applications to use these packets, but care is required to avoid affecting other packets.
The resource description framework (RDF) language, popularized by the semantic web, forms the basis for packet description, and XML (extensible markup language) syntax is generally used.
The XMP format is widely used in artistic production, photocomposition, the creation of synthetic images and virtual reality. It has been used in a number of commercial developments, in professional and semi-professional processing tools, and has been progressively introduced into photographic treatment toolkits. However, users are generally unfamiliar with this format and rarely possess the tools required to use it themselves, unlike Exif files, which are much more accessible.
Exif (exchangeable image file format) was designed as a complete sound and image exchange format for audiovisual use, made up of two parts: compressed data on the one hand, and metadata on the other. Nowadays, only the metadata file accompanying an image is generally used. For the signal compression aspect, JPEG, TIFF or PNG are generally preferred (but the Exif exchange standard includes these compression modes in a list of recommended algorithms). For metadata, however, elements of the Exif file form the basis for a number of other standards: TIFF, JPEG, PNG, etc.
Created by Japanese manufacturers in 1995, the Exif format is used in camera hardware, and, while it has not been standardized or maintained by a specific organization (the latest version was published in 2010), it still constitutes a de facto standard. Unfortunately, however, it has tended to drift away from initial principles, losing its universal nature.
The metadata part of Exif (now referred to simply as Exif) is broadly inspired by the structure of TIFF files. The file is nested within a JPEG or native image file, and contains both public data for general usage and manufacturer-specific data, which is not intended for use outside of image programs. Unfortunately, while image handling programs generally recognize Exif files within images and use certain data from these files, the file is not always retransmitted in derivative images, or is only partially transmitted, leading to a dilution of Exif information and de facto incompatibility. The Exif file can be read easily using simple character sequence recognition tools (for example using PERL), and a wide range of applications for decoding these files are available online.
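As an illustration of how Exif data is nested in a JPEG file, the following Python sketch scans the JPEG marker segments for the APP1 segment carrying the "Exif" signature. The JPEG fragment is hand-built for the example rather than taken from a real camera:

```python
def find_exif(jpeg_bytes):
    """Walk the JPEG marker segments and return the TIFF-structured
    payload of the APP1/Exif segment, or None if absent."""
    i = 2  # skip the SOI marker (FF D8)
    while i + 4 <= len(jpeg_bytes) and jpeg_bytes[i] == 0xFF:
        marker = jpeg_bytes[i + 1]
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        segment = jpeg_bytes[i + 4:i + 2 + length]
        if marker == 0xE1 and segment.startswith(b"Exif\x00\x00"):
            return segment[6:]  # strip the "Exif\0\0" signature
        i += 2 + length
    return None

# Minimal hand-built JPEG fragment: SOI + one APP1/Exif segment
payload = b"II*\x00\x08\x00\x00\x00"  # little-endian TIFF header
segment = b"Exif\x00\x00" + payload
app1 = b"\xff\xe1" + (len(segment) + 2).to_bytes(2, "big") + segment
jpeg = b"\xff\xd8" + app1
print(find_exif(jpeg))   # b'II*\x00\x08\x00\x00\x00'
```

A real decoder would go on to parse the TIFF-like directory structure inside the payload; the point here is only that the Exif block can be located with simple byte-sequence matching, as noted above.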
The data contained within an Exif file relates to several areas:
Exif files constitute a valuable tool for rational and automated use of images. However, they are subject to a number of drawbacks:
Lossless compression is a reversible operation which reduces the volume occupied by images, without affecting their quality. Lossless compression algorithms look for duplication within an image file, and use Shannon’s information theory to come close to the entropic volume limit [GUI 03a, PRO 03].
Black and white images generally do not present high levels of duplication when examined pixel by pixel. Their entropy10 is often greater than 6 bits/pixel, meaning that the possible gain is of less than 2 bits/pixel; this is not sufficient to justify the use of a complex coder. Color images generally include higher levels of duplication, and it is often possible to reduce the size of the image file by half with no loss of information.
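The entropy figure quoted above can be computed directly from the pixel-value histogram; a minimal Python sketch (the sample signals are invented):

```python
import math
from collections import Counter

def entropy_bits_per_pixel(pixels):
    """Shannon entropy of the pixel-value histogram, in bits/pixel: the
    lower bound on the mean code length of any pixel-by-pixel lossless coder."""
    n = len(pixels)
    return sum((c / n) * math.log2(n / c) for c in Counter(pixels).values())

flat = [128] * 1000              # a uniform area: fully redundant
noisy = list(range(256)) * 4     # every gray level equally likely
print(entropy_bits_per_pixel(flat))    # 0.0 -> maximal possible gain
print(entropy_bits_per_pixel(noisy))   # 8.0 -> no gain possible
```

An 8-bit image with entropy above 6 bits/pixel sits between these two extremes, leaving less than 2 bits/pixel to be gained, as stated above.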
It is also possible to look for duplication between pixels, coding pixel groups (run lengths or blocks), exploiting spatial or chromatic correlation properties, which are often strong. The coding required is more complex and the process is longer both in terms of coding and decoding, and a significant amount of memory is required for maintaining code words. The tables used also need to be transmitted or recorded, and the cost involved drastically reduces the benefits obtained from compression, except in the case of very large images.
The algorithms used for lossless compression include general examples, widely used in computer science as a whole, and more specific examples, developed for graphic or imaging purposes.
These coding techniques make use of duplications between strings of bits, without considering the origin of these strings. General forms are included in computer operating systems [GUI 03a].
All of these coding methods are asymptotically very close to the Shannon entropy limit for random sequences. The size of the headers involved means that they are not worth using for short messages, but image files rarely fall into this category. These methods are not particularly efficient for images, as duplication statistics are generally relatively low in a pixel flux representation. This duplication is concealed within the image structure, something which hierarchical and JPEG-type coding methods are designed to exploit.
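The point about duplication being concealed in the image structure can be illustrated with a general-purpose coder (here zlib, a Deflate implementation): a smooth synthetic signal compresses far better once a simple pixel-to-pixel prediction has exposed its redundancy. The signal itself is invented for the example:

```python
import random
import zlib

random.seed(0)
# A smooth synthetic scan line: strong pixel-to-pixel correlation,
# but few literally repeated byte strings for a dictionary coder to find
vals = [128]
for _ in range(9999):
    vals.append((vals[-1] + random.randint(-2, 2)) % 256)
pixels = bytes(vals)

# Simple prediction: transmit the differences between successive pixels,
# which concentrate on a handful of small values
diffs = bytes((pixels[i] - pixels[i - 1]) % 256 for i in range(1, len(pixels)))

direct = len(zlib.compress(pixels, 9))
predicted = len(zlib.compress(diffs, 9))
print(direct, predicted)   # the differentiated signal compresses far better
```

This is precisely the gap that the image-specific, prediction-based methods of the next section are designed to close.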
Specific algorithms have been developed for image coding, based on the specific two-dimensional (2D) properties of images. They have enjoyed a certain level of success in areas such as medical imaging, where image degradation must be avoided at all costs. They are also widely used in graphics, where scanning images are used in conjunction with text and drawings [GUI 03b].
The most widespread approaches are based on predicting the value of a pixel X at position (i, j) using neighbors for which values have already been transmitted and are known to the receiver, specifically those denoted A(i, j − 1), B(i − 1, j), C(i − 1, j − 1) and D(i − 1, j + 1) (Figure 8.3, left). The difference between the predicted and actual values is often very low, and often follows a Gaussian distribution, which may be conveniently coded using the Huffman approach.
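A minimal sketch of this neighbor-based prediction, using one of the standard predictor choices (the planar predictor A + B − C); the test image is an invented gradient:

```python
def predict(img, i, j):
    """Predict pixel X(i, j) from already-transmitted neighbors:
    A = (i, j-1), B = (i-1, j), C = (i-1, j-1), combined here as
    A + B - C; borders fall back to a single neighbor or zero."""
    if i == 0 and j == 0:
        return 0
    if i == 0:
        return img[i][j - 1]                    # only A is available
    if j == 0:
        return img[i - 1][j]                    # only B is available
    return img[i][j - 1] + img[i - 1][j] - img[i - 1][j - 1]

# A smooth gradient: the residuals (actual - predicted) are almost all zero
img = [[10 * r + 3 * c for c in range(5)] for r in range(5)]
residuals = [img[i][j] - predict(img, i, j)
             for i in range(5) for j in range(5)]
print(residuals)
```

On this locally planar image every interior residual is exactly zero; on a real photograph the residuals cluster tightly around zero, which is what makes Huffman coding of the differences effective.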
The lossless JPEG standard, published in 1993, proposes an algorithm for the prediction of X conditional on the presence of a horizontal or vertical contour. There are eight possible situations in this case, leading to eight different decisions. The prediction residual is coded using the Huffman method. Compression rates of up to 3 have been obtained in medical imaging using this standard, but results are often lower in natural imaging, and lossless JPEG has not been particularly successful in this area.
A second type of lossless JPEG coder has since been proposed. JPEG-LS is based on a prediction algorithm, LOCO-I, which uses a predictor known as the median edge detector and models the residual over a set of context configurations. There are 365 possible configurations, and predictions are made using the hypothesis that the error follows a Laplace distribution. Golomb-Rice coding is then used to compress the transmitted sequence (this type of coding is suitable for messages with a high probability of low values). It is optimal for geometric sequences and performs better than lossless JPEG, with a compression rate of up to 3 for natural images (a comparison of coder performances is given in [GUI 03b]). JPEG-LS has been standardized under the reference ISO-14495-1.
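Golomb-Rice coding itself is simple to sketch. The version below (quotient in unary, a terminating 0, then k remainder bits) handles non-negative values only; real coders such as JPEG-LS first map signed residuals to non-negative integers:

```python
def rice_encode(value, k):
    """Rice code (Golomb code with divisor 2**k) of a non-negative integer:
    the quotient in unary (a run of 1s), a terminating 0, then k remainder bits.
    Small values -- the common case for prediction residuals -- get short codes."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

for v in [0, 1, 2, 3, 8]:
    print(v, rice_encode(v, 2))
# 0 -> 000, 1 -> 001, 2 -> 010, 3 -> 011, 8 -> 11000
```

Note the geometric structure: each doubling of the quotient adds one bit, which is why this family of codes is optimal for geometrically distributed sources.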
The JPEG 2000 standard includes another lossless coding method (sometimes denoted JPEG 2000R, with R reflecting the fact that the method is reversible). This method uses biorthogonal 5/3 wavelet coding (Le Gall wavelets with a low-pass filter using 5 rational coefficients and a high-pass filter with 3 rational coefficients). This third form of lossless JPEG coding presents the advantage of being scalable11, progressive12 and compatible with the image handling facilities of JPEG 2000 (see section 8.6.2). Unlike JPEG-LS, it uses the same principle for lossy and lossless coding, but it is slower, and the implementation is more complex. In terms of compression, it does not perform better than JPEG-LS.
The PNG (portable network graphics) format was designed for graphic coding in scanning (rather than vector) mode and for lossless compression. It was developed as an alternative to GIF (graphics interchange format), created by CompuServe, which was not free to access for a considerable period of time (the relevant patents are now in the public domain).
The PNG format offers full support for color images coded over 24 bits in RGB (unlike GIF, which only codes images over 8 bits), alongside graphics coded using palettes of 24 or 32 bits. It is also able to handle transparency information associated with images (via channel α, which varies from 0 to 1, where 0 = completely transparent and 1 = opaque). However, unlike GIF, PNG is not able to support animated graphics. For images, PNG conversion must occur after the application of corrections (demosaicing, white balancing, etc.) and improvement processes (filtering, enhancement, etc.). Treatment information and details contained in the Exif file are not transmitted, and are therefore lost.
PNG has been subject to ISO standardization (ISO/IEC 15948:2004).
PNG files contain a header, for identification purposes, followed by a certain number of chunks, some compulsory, others optional, used in order to read the image. Chunks are identified by a 4-symbol code, and protected by an error protection code. The compulsory elements are the IHDR (image header), IDAT (image data) and IEND (end-of-file) chunks, along with PLTE (palette) for palette-based images.
Optional chunks may then be used to provide additional information concerning the image background, color space, white corrections, gamma corrections, the homothety factor for display purposes, and even information relating to stereoscopic viewing. PNG has been progressively extended to include color management operations in the post-production phase, i.e. control of the reproduction sequence from the image to the final product.
PNG is widely used in the graphic arts, enabling transparent and economical handling of graphics or binary texts, with graphics defined using up to 16 bits per channel over 4 channels (images in true color + channel α). The main attractions of this format do not, therefore, reside in its compression capacity. However, PNG images are compressed using a combination of predictive coding (particularly useful for color blocks, which have high levels of duplication) and a Deflate coder. The predictive coder is chosen by the user from a list of five options: no prediction, A, B, (A + B)/2, or whichever of A, B and C is closest to A + B − C (see diagram in Figure 8.3).
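The fifth option in that list is the Paeth predictor defined by the PNG specification; a direct Python transcription:

```python
def paeth(a, b, c):
    """PNG Paeth predictor: pick A (left), B (above) or C (upper left),
    whichever is closest to the initial estimate p = A + B - C,
    breaking ties in the order A, B, C as the PNG specification requires."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c

print(paeth(10, 20, 10))   # across a vertical edge: predicts 20 (above)
print(paeth(20, 10, 10))   # across a horizontal edge: predicts 20 (left)
```

By switching between neighbors, the predictor adapts to local horizontal or vertical structure, which is what keeps the residuals small on graphics and flat-color areas.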
The TIFF format has already been mentioned in connection with the EP extension, which allows it to handle native formats. In this section, we will consider TIFF itself. Owned by Adobe, TIFF was specifically designed for scanned documents used in graphic arts, which evolved progressively from handling binary documents to handling high-definition color documents. Intended as a “container” format for use in conserving documents of varying forms (vector graphics, text, scanned images coded with or without loss, etc.), TIFF is an ideal format for archiving and file exchange, guaranteeing the quality of source documents used with other files to create a final document.
TIFF files are made up of blocks of code, identified by tags, where each block specifies data content. Tags are coded on 16 bits. Above 32,000, tags are assigned to companies or organizations for collective requirements. Tags from 65,000 to 65,535 may be specified by users for private applications. There are therefore a number of variations of TIFF, not all of which are fully compatible. The basic version available in 2015, version 6.0, was created over 20 years ago; this version indicates essential points which must be respected in order to ensure inter-system compatibility. TIFF allows very different structures to be contained within a single image: subfiles, bands or tiles, each with their own specific characteristics to enable coexistence.
Algorithms used to read TIFF files must be able to decode lossless Huffman coding by range and JPEG coding by DCT, and, in some cases, other compression methods (such as LZW and wavelet JPEG), as specific flags and tags are used for these compression types.
TIFF recognizes color images created using RGB, YCbCr or CIE L*a*b*, coded on 24 bits, or of the CMYK type, coded on 32 bits (see Chapter 5). Extensions have been developed for images using more than 8 bits per channel and for very large images.
As a general rule, the TIFF file associated with an image is relatively large. In the case of an image of n pixels in sRGB (3 bytes per pixel), a TIFF file using exactly 1 byte per channel will take up 3n bytes before lossless compression, and around half this value after compression; however, compression is only carried out if requested by the user, making the method slower.
The GIF format is also lossless, and also a commercial product, maintained by CompuServe. While designed for use with graphics, it may also be used to transport images under certain conditions.
GIF is an 8 bit coding method for graphics coded using a color palette, using Lempel–Ziv–Welch (LZW) coding. Black and white images are therefore transmitted with little alteration using a transparent palette (although a certain number of gray levels are often reserved for minimum graphic representation), but with a low compression rate. Color images must be highly quantized (in which case lossless compression is impossible) or split into sections, each with its own palette, limited to 256 colors. Performance in terms of compression is lower in the latter case. Another solution uses the dithering technique, which replaces small uniform blocks of pixels with a mixture of pixels taken from the palette.
As we have seen, the space gained using lossless image compression is relatively limited. In the best cases, the gain is a factor of 2 for a black and white image, a factor of 3 for a color image, or a factor of between 5 and 10 for an image coded on 3 × 16 bits.
However, images can be subjected to minor modifications, via an irreversible coding process, without significant changes to their appearance. This approach exploits the ability of the visual system to tolerate reproduction faults in cases where the signal is sufficiently complex (via the masking effect), i.e. in textured areas and around contours.
This idea formed the basis for the earliest lossy coding approaches, which use the same methods described above: predictive coding or coding by run lengths or zones. In this case, however, a certain level of approximation is permitted, allowing us to represent a signal using longer strings. Although these approaches are sophisticated, they do not provide sufficiently high compression rates. As a result, new methods have been developed, which make use of the high levels of spatial duplication across relatively large image blocks, projecting images onto bases of specially-created functions in order to group the energy into a small number of coefficients. These bases include the Fourier transformation (especially the DCT variation) and wavelet transformation [BAR 03b].
For any type of lossy coding, users have access to a parameter α which may be adjusted to specify a desired quality/compression compromise. This compromise is highly approximate, and highly dependent on the image in question. The objective criteria used to judge the effects of compression processes were discussed in detail in Chapter 6. Taking I(i, j) to represent the image before coding and I′(i, j) the image after coding, the most widespread measures are the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR).
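These two classic measures can be computed as follows; a minimal sketch assuming 8-bit grayscale images (peak value 255), with illustrative function names.

```python
import math

def mse(img, img2):
    """Mean squared error between two equally-sized grayscale images
    (lists of rows), i.e. the average of (I - I')**2 over all pixels."""
    h, w = len(img), len(img[0])
    return sum((img[i][j] - img2[i][j]) ** 2
               for i in range(h) for j in range(w)) / (h * w)

def psnr(img, img2, peak=255):
    """Peak signal-to-noise ratio in dB; higher means less degradation."""
    e = mse(img, img2)
    return float("inf") if e == 0 else 10 * math.log10(peak ** 2 / e)

original = [[100, 100], [100, 100]]
coded    = [[101,  99], [100, 102]]
print(round(psnr(original, coded), 1))
```

Identical images yield an infinite PSNR; values above roughly 40 dB are usually considered visually transparent for photographic content.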
In the 1980s, a group of researchers began work on the creation of an effective standard for lossy compression. This goal was attained in the 1990s, and the JPEG (Joint Photographic Experts Group) format was standardized in 1992, under the name ISO/IEC 10918-1. The standard leaves software developers a considerable amount of freedom, and variations of the coding program exist. The only constraint is that decoding programs must be able to read a prescribed set of test images [GUI 03b].
JPEG encoding is a six-step process, involving:
1) conversion of the image from the RGB space into a luminance/chrominance space;
2) subsampling of the chrominance channels;
3) division of each channel into blocks of 8 × 8 pixels;
4) application of the discrete cosine transform (DCT) to each block;
5) quantization of the DCT coefficients;
6) entropy coding of the quantized coefficients, after zigzag scanning.
Each of these steps is reversed during the decoding process, with the exception of steps 2 and 5, which involve irreversible information loss.
The discrete cosine transform (DCT), defined by the matrix in Table 8.1, is a variation of the Fourier transform which is better suited to real, positive signals, such as images, than the Fourier transform itself [AHM 74]. The DCT is an exact, reversible 2D transformation of a scalar signal, which creates a real signal with as many samples as the input signal.
Let J(i, j) be a window of N × N pixels (where N = 8, for example) of the channel being transmitted (the luminance or chrominance channel of the original image). The DCT K_J of J is given by:

\[ K_J(k, l) = \frac{2}{N}\,C(k)\,C(l) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} J(i, j)\, \cos\frac{(2i+1)k\pi}{2N}\, \cos\frac{(2j+1)l\pi}{2N} \tag{8.1} \]

where \(C(0) = 1/\sqrt{2}\) and \(C(i) = 1\) for \(i \neq 0\).
The reverse transformation:

\[ J(i, j) = \frac{2}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} C(k)\,C(l)\,K_J(k, l)\, \cos\frac{(2i+1)k\pi}{2N}\, \cos\frac{(2j+1)l\pi}{2N} \]

allows us to retrieve the exact value J(i, j), as long as K_J(k, l) has not been subjected to quantization.
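This reversibility can be checked numerically with a direct, if inefficient, implementation of the transform pair (a sketch; production codecs use fast factorizations of the DCT rather than the quadruple sum below).

```python
import math

N = 8
C = lambda k: 1 / math.sqrt(2) if k == 0 else 1.0  # normalization factor

def dct2(J):
    """Direct 2D DCT of an N x N block (equation [8.1])."""
    return [[(2 / N) * C(k) * C(l) * sum(
                J[i][j] * math.cos((2 * i + 1) * k * math.pi / (2 * N))
                        * math.cos((2 * j + 1) * l * math.pi / (2 * N))
                for i in range(N) for j in range(N))
             for l in range(N)] for k in range(N)]

def idct2(K):
    """Inverse 2D DCT: exact as long as K has not been quantized."""
    return [[(2 / N) * sum(
                C(k) * C(l) * K[k][l]
                * math.cos((2 * i + 1) * k * math.pi / (2 * N))
                * math.cos((2 * j + 1) * l * math.pi / (2 * N))
                for k in range(N) for l in range(N))
             for j in range(N)] for i in range(N)]

block = [[(i * 8 + j) % 256 for j in range(N)] for i in range(N)]
restored = idct2(dct2(block))
print(max(abs(block[i][j] - restored[i][j])
          for i in range(N) for j in range(N)) < 1e-9)
```

For a constant block of value v, the only non-zero coefficient is the continuous component K_J(0, 0) = N·v, which is the mean value to within factor N, as noted below.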
Variables k and l are spatial frequencies, as in the case of a Fourier transform. The pair (k = 0, l = 0) has a frequency of zero (i.e. the mean value of J(i, j) over the window, to within factor N). As k increases, the horizontal frequencies increase; as l increases, so do the vertical frequencies; and all pairs (k, l) where k ≠ 0 and l ≠ 0 correspond to frequencies which are inclined in relation to the axes 0x and 0y.
The term K_J(0, 0), known as the continuous component, is always much larger than the others. It is treated separately in the DCT coding, and often transmitted without alteration in order to provide exact information regarding the mean value of the block.
Table 8.1. DCT filter coefficients for N = 8. Lines: i, columns: k (equation [8.1])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
1 | 0.981 | 0.831 | 0.556 | 0.195 | -0.195 | -0.556 | -0.831 | -0.981 |
2 | 0.924 | 0.383 | -0.383 | -0.924 | -0.924 | -0.383 | 0.383 | 0.924 |
3 | 0.831 | -0.195 | -0.981 | -0.556 | 0.556 | 0.981 | 0.195 | -0.831 |
4 | 0.707 | -0.707 | -0.707 | 0.707 | 0.707 | -0.707 | -0.707 | 0.707 |
5 | 0.556 | -0.981 | 0.195 | 0.831 | -0.831 | -0.195 | 0.981 | -0.556 |
6 | 0.383 | -0.924 | 0.924 | -0.383 | -0.383 | 0.924 | -0.924 | 0.383 |
7 | 0.195 | -0.556 | 0.831 | -0.981 | 0.981 | -0.831 | 0.556 | -0.195 |
The other terms are subject to quantization, to a greater or lesser extent depending on the desired compression rate. We begin by defining a quantizer q(k, l) and an approximation \(\hat{K}_J(k, l)\) of coefficient K_J(k, l) such that:

\[ \hat{K}_J(k, l) = q(k, l)\left[\frac{K_J(k, l)}{q(k, l)}\right] \]

where \([x]\) denotes the nearest integer to x.
The set of coefficients q(k, l) thus forms a quantization matrix which is applied to the whole image. This matrix is transmitted with the associated coefficients in order to enable reconstruction of the image. Coders differ in the way in which they use chrominance subsampling and in the choice of quantization matrices.
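The quantization step can be sketched as follows; the 2 × 2 coefficient and quantizer values here are arbitrary illustrations, not a standard JPEG table.

```python
def quantize(K, q):
    """Round each coefficient K(k, l) to the nearest multiple of the quantizer
    step q(k, l): the reconstructed value q * round(K / q) is all the decoder
    can recover, so information is irreversibly lost."""
    return [[q[k][l] * round(K[k][l] / q[k][l])
             for l in range(len(q[0]))] for k in range(len(q))]

K = [[1450, 65], [131, 7]]    # illustrative DCT coefficients
q = [[16, 11], [12, 40]]      # illustrative quantizer steps
print(quantize(K, q))         # small coefficients collapse to zero
```

Note how the large continuous component survives with a small relative error while the smallest coefficient is annihilated; increasing q(k, l) increases both the compression rate and the degradation.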
As a general rule, high frequency terms can be shown to contain very little energy, and these terms are therefore of limited importance in terms of the perceived quality of the image. These terms are set to zero during quantization (see Figure 8.4 and Table 8.1).
Table 8.2. DCT coefficients of thumbnails 1 (left) and 2 (right), after setting all coefficients with an absolute value less than 30 to zero. Note the importance of the term K_J(0, 0), corresponding to the mean value of the block, and the amplitude of the vertical (in 1) and horizontal (in 2) spatial frequencies corresponding to the periodicities observed in these images
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ||
0 | 1450 | 65 | 0 | 0 | 0 | 0 | 0 | 0 | 980 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 200 | 0 | 0 | 0 | 0 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
5 | 131 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The coefficients are then transmitted by zigzag scanning, which places frequencies in increasing 2D order (as far as possible) after the continuous component (see Figure 8.3, right). After a short period, the coefficient string will only contain zeros, which will be compressed effectively using either Lempel-Ziv or run-length and Huffman techniques.
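The zigzag scan order can be generated programmatically; a minimal sketch for an N × N block:

```python
def zigzag_order(N=8):
    """Enumerate (k, l) pairs anti-diagonal by anti-diagonal, alternating the
    scan direction on each diagonal, as in JPEG zigzag ordering of DCT
    coefficients: frequencies appear in (roughly) increasing 2D order."""
    order = []
    for s in range(2 * N - 1):               # s = k + l indexes the anti-diagonal
        diag = [(k, s - k) for k in range(N) if 0 <= s - k < N]
        order.extend(diag if s % 2 else list(reversed(diag)))
    return order

zz = zigzag_order()
print(zz[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Reading the quantized coefficients of Table 8.2 in this order groups the surviving low-frequency terms at the head of the string and leaves a long run of zeros at the tail, which entropy coding then compresses very effectively.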
JPEG images are decompressed by reversing all of the steps described above. Coefficient quantization can highlight faults in individual blocks which are characteristic of the JPEG format. Certain reconstruction algorithms therefore include a low-pass filtering stage which reduces these faults. The subsampling of chromatic components may also require an additional interpolation stage, which may also have damaging effects on the image.
JPEG compression has become the most widely accepted means of transmitting images of reasonable quality and reasonable volume. For color images, a compression rate of 10 is generally considered to cause very little degradation to the image, practically imperceptible to the naked eye, especially for very large image files and when viewed without excessive magnification.
Compression rates of 100 are also widely used, particularly online or when sending images by email. Image quality is highly compromised in this case, and the images in question cease to be suitable for any use except rapid visualization due to the presence of annoying artifacts.
Note that all encoders include a parameter to control compression rates, and this parameter is often unique. However, this indicator is not formally connected to compression gains or to image quality. Applied to two different images, the same index may produce two very different results. For a single image, as the index decreases, the level of degradation and the compression rate increase. This index generally takes a value of 100 in the case of lossless coding, and may be reduced to a minimum value of 1.
Note, however, that lossless JPEG does not use the general principle of the DCT for encoding purposes, as it is based on predictive methods.
JPEG may be used in a progressive manner for transmission purposes: in this case, low frequency terms are transmitted first, and the image is then refined by the transmission of higher frequencies.
A version of the image made up of continuous components alone, 64 times smaller than the original, is often used for display functions or to enable rapid database searches.
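Such a reduced version can be computed directly from the per-block means (which are the continuous components to within factor N); a sketch, assuming the image dimensions are multiples of the block size:

```python
def dc_thumbnail(img, N=8):
    """Thumbnail built from the mean of each N x N block: one pixel per block,
    i.e. 64 times fewer pixels for N = 8."""
    h, w = len(img) // N, len(img[0]) // N
    return [[sum(img[N * b + i][N * c + j] for i in range(N) for j in range(N)) / N ** 2
             for c in range(w)] for b in range(h)]

# 16 x 16 gradient image -> 2 x 2 thumbnail of block means
img = [[float(i + j) for j in range(16)] for i in range(16)]
print(dc_thumbnail(img))
```

Because the continuous components are available without any inverse DCT, such thumbnails can be produced directly from the compressed stream, which is what makes them attractive for rapid database searches.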
The JPEG 2000 standard is recognized by the ISO under the name ISO/IEC 15444-1, and is a relatively general image exchange standard, offering a variety of functions (archiving, internet and video). In this context, we will focus on the aspects relating to the image format.
JPEG 2000 makes use of the fact that wavelet decomposition offers better performance than the DCT in terms of image compression: this method provides a better quality image than the DCT at the same compression rate, and higher compression rates for the same image quality [BAR 03a].
JPEG 2000 is also scalable: the same codestream may be decoded at several different resolutions or quality levels, depending on the needs of the application.
The format also uses a highly developed arithmetic encoder, EBCOT (embedded block coding with optimized truncation) which makes use of the 2D tree structure of the wavelet decomposition [TAU 00].
Finally, JPEG 2000 gives users the possibility to describe image content at a higher level of the pyramid; this is useful in facilitating intelligent use of image collections, and for optimized transmission over new-generation networks. To do this, the format makes use of the progressive transmission capabilities offered by wavelet scalability, via a codestream transmission structure, and the ability to define regions of interest (ROI) which are transmitted first and with higher quality levels. It explicitly accounts for the possible presence of text or graphics in compound structures. The format is also designed for application to very large images (up to \(2^{32}\) pixels per line), images with a large number of channels (up to \(2^{14}\)) and images with high-dynamic pixels (up to 38 bits per sample).
As with many other standards, JPEG 2000 is open and does not explicitly specify a compression mode, leaving developers the freedom to make improvements over time. The encoder is obliged to produce a binary string which conforms to the standard, but any strategy may be used to do this. Decoders must simply prove their ability to decode the example files included with the standard.
As a transmission standard, JPEG 2000 is highly suited to professional requirements. It has had less success with the general public; its requirements in terms of calculation power and decoding software are higher than those of JPEG, but this is of minor importance compared to the effects of inertia in the general market, where users are not overly concerned with obtaining high quality/flux ratios.
Like JPEG, JPEG 2000 is able to transform color images created using the RGB space into a luminance/chrominance space (YUV or Y CRCB). Two types of transformations are available, one reversible, the other irreversible (but more effective in terms of compact representation). The chromatic components are then processed separately.
JPEG 2000 then operates either on the image as a whole, or on tiles, identically-sized sections of the image (with the exception of the edge tiles). These tiles are decomposed hierarchically by a bank of 2D, bi-orthogonal wavelet filters [BAR 03a].
The standard allows the use of two types of filters: a reversible filter with integer coefficients (the 5/3 filter), which enables lossless coding, and an irreversible filter with real coefficients (the Daubechies 9/7 filter, see Table 8.3), which is more effective for lossy compression.
These filters are bi-orthogonal, and defined by a low-pass filter, used to calculate images with increasingly low resolutions, starting from the original level of the tile, and a high-pass filter, used to define the details characterizing the specific level in question. These details are contained within the coefficients of the wavelet decomposition, a_n(i, j).
Filtering can be carried out in two ways. The first method, using convolution, is that generally used in signal processing as a whole. The second method, which constitutes a major advantage of the hierarchical wavelet approach, involves the iterative application of two filters, one low-pass, the other high-pass; these filters are identical for all levels of the pyramid. This is known as the lifting process. Using this process (see Figure 8.6), the odd coefficients of a decomposition are expressed using the value of the even coefficients with the application of a corrective term. This corrective term constitutes the specific detail image for the level in question [SWE 96].
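The lifting process can be illustrated with the simplest possible case, the Haar wavelet (a minimal sketch; JPEG 2000 actually uses the 5/3 and 9/7 filters, whose lifting steps involve more neighbors):

```python
def haar_lift(x):
    """One lifting step: predict each odd sample from its even neighbor
    (the prediction error d is the detail signal), then update the even
    samples so that s holds local averages (the coarse approximation)."""
    even, odd = x[0::2], x[1::2]
    d = [o - e for o, e in zip(odd, even)]       # predict: detail coefficients
    s = [e + di / 2 for e, di in zip(even, d)]   # update: pairwise means
    return s, d

def haar_unlift(s, d):
    """Exact inverse: undo the update, then the prediction."""
    even = [si - di / 2 for si, di in zip(s, d)]
    odd = [di + e for di, e in zip(d, even)]
    x = [None] * (2 * len(s))
    x[0::2], x[1::2] = even, odd
    return x

s, d = haar_lift([1, 3, 2, 2, 5, 1, 0, 4])
print(s, d)
print(haar_unlift(s, d))  # the original signal is recovered exactly
```

Each lifting step is trivially invertible by construction, which is what guarantees perfect reconstruction regardless of the filters used; iterating on the approximation s produces the successive levels of the pyramid.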
The basic wavelet transformation is one-dimensional (1D); it is thus applied once to the lines and once to the columns of the area of interest. This results in the production of 4 images (see Figure 8.7): an approximation image, low-pass filtered in both directions, and three detail images containing the horizontal, vertical and diagonal details respectively.
In a carefully-chosen wavelet base, these coefficients present a number of interesting statistical properties; they are highly concentrated around zero, and can be efficiently modeled using a generalized Gaussian law, allowing efficient quantization:

\[ p(x) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\!\left(-\left(\frac{|x|}{\alpha}\right)^{\beta}\right) \]
Quantization tables are defined for each resolution level, and calculated in such a way as to provide a wide range of compression levels.
These quantized coefficients, transformed into bit planes and grouped into subbands, are then transmitted to the encoder. As we have seen, the EBCOT coder uses a high-performance contextual encoding process involving a wavelet coefficient tree, predicting each bit using the same bit from the previous resolution level and its immediate neighbors, starting with those bits with the heaviest weighting. The encoder processes subblocks, which are relatively small (with a maximum 64 × 64); the precise size of these subblocks is defined by the user.
The power of the coding method is based on the domain of hierarchical analysis of the wavelet decomposition. We will consider neighborhoods of 6 or 27 connections within a three-dimensional (3D) pyramid of resolution levels. The majority pixels (identified by the bit from the previous level) are not encoded; the others are coded using their sign and deviation from the previous level, if this is significant. Finally, we consider situations where there is no dominant configuration. The resulting tree structure is encoded using an algebraic code.
Table 8.3. Coefficients of Daubechies 9/7 wavelet filters for image analysis and reconstruction (based on [BAR 03a]). Note that other formulas exist, which differ by a factor √2. The analysis and reconstruction filters need to be symmetrical in relation to order 0 to avoid moving image contours
order of filter | 0 | ± 1 | ± 2 | ± 3 | ± 4 |
analysis: low-pass filter | 0.602949 | 0.266864 | -0.078223 | -0.016864 | 0.026749 |
analysis: high-pass filter | 0.557543 | 0.295636 | -0.028772 | -0.045636 | 0 |
synthesis: low-pass filter | 0.602949 | -0.266864 | -0.078223 | 0.016864 | 0.026749 |
synthesis: high-pass filter | 0.557543 | -0.295636 | -0.028772 | 0.045636 | 0 |
The compression rates obtained using JPEG 2000 are around 20% higher than those of JPEG for the transmission of fixed images of identical quality. Subjectively speaking, the artifacts of JPEG 2000 are less problematic than those encountered using JPEG, as the systematic block effect no longer occurs. JPEG 2000 is particularly effective for very large images, along with images containing few very high frequencies, and produces satisfactory results even at high compression rates (of the order of 100). The standard is also highly adapted for remote sensing images, due to its ability to process large images using a mode compatible with the tiling approach needed for high-resolution navigation, and to the possibility of specifying zones of interest, which are then subject to greater attention.
To complete our discussion of fixed image coding, let us consider the case of very large images. This situation is not yet widely encountered in photography, but with the development of sensors of 50 megapixels and more, these formats may be used by certain devices in the relatively near future.
Tiled formats are used for very large image files (from 10 megapixels up to several gigapixels) and are designed to allow rapid consultation of these images. The images in question are decomposed into tiles (with a default size of 64 × 64 pixels in FlashPix, the most common form, and 256 × 256 pixels in the IVUE format). They also use a pyramid shaped, hierarchical binary description of the image, moving from full resolution (several hundred tiles) to a single, coarse image contained in a single tile. A pixel at any level of the pyramid is obtained by averaging the four pixels from the previous level which it is intended to represent. The use of a tree structure allows rapid determination of the level of representation required in response to a user request (for example to display a zone of a given size on a screen of a given size). The use of this tree, along with pre-calculated representation structures, allows very significant gains to be made in terms of image manipulation, both for input/output purposes and for processing, for example when scrolling across a screen.
The payoff for this gain is an increase in the required storage space, as the original image (split into small tiles) needs to be conserved alongside versions at 1/2, 1/4, 1/8 resolution, etc.; this gives a total storage space of \(1 + 1/4 + 1/16 + \dots = 4/3\) times the size of the image, ignoring possible truncation errors, i.e. a 33% increase in memory requirements.
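The 33% overhead follows from the geometric series of quarter-resolution levels; a sketch computing per-level tile counts for a hypothetical 4096 × 4096 image with 64 × 64 tiles (the FlashPix default):

```python
import math

def pyramid_tiles(width, height, tile=64):
    """Number of tiles at each pyramid level, halving the resolution
    until the whole image fits in a single tile."""
    levels = []
    w, h = width, height
    while True:
        tw, th = math.ceil(w / tile), math.ceil(h / tile)
        levels.append(tw * th)
        if tw == 1 and th == 1:
            break
        w, h = math.ceil(w / 2), math.ceil(h / 2)
    return levels

counts = pyramid_tiles(4096, 4096)
print(counts)
print(sum(counts) / counts[0])  # storage overhead, close to 4/3
```

Each level needs a quarter of the tiles of the previous one, so the total converges quickly towards 4/3 of the full-resolution storage.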
The FlashPix format is able to handle tiles of any type: 8 or 16 bits, 1, 3 or 4 channels, coded using lossy or lossless methods. A description of the FlashPix representation (tile size and number) is given in the associated EXIF file.
Plane representations are replaced by spherical forms in the context of very large datasets, such as those used in astronomy and remote sensing. In this case, a specific tiled representation of the sphere is required. The best-known example, HEALPix (hierarchical equal area isolatitude pixelization of the sphere) [HEA 14], decomposes the sphere by projection onto 12 squares of a plane, set out in a staggered formation, using different projections according to the position of the zones in question. These squares are oriented at π/2 in relation to the equator, and centered on either (0, kπ/2) or (kπ/4, ±π/4). The resulting squares are then split into 4, following a classic hierarchical approach.
A more intuitive form of tiled projection, hierarchical triangular mesh (HTM) [KUN 01] decomposes the sphere into triangles along large circular planes of the sphere; another format, QuadTree Cube (Q3C) simply uses the quadratic hierarchical structure [KOP 06].
These tiled representations of spherical forms are particularly useful in photographic assemblies and in panoramas with very wide fields (see section 10.1.8).
The domain of video encoding lies outside the scope of this work; however, an increasing number of cameras allow users to record both fixed images and moving sequences. These cameras include video encoding elements which compress the image flux. As in the case of fixed images, these encoders exploit the duplications within each image (intra-image coding), but also duplications between successive images (inter-image coding).
Standardization work has been carried out by the ITU, producing recommendations H.120, H.261, H.263 and H.264. The H.263 standard was the first to include an inter-image mode; H.264 serves as the basis for MPEG-4.
These recommendations have also been used outside of the field of telecommunications, for applications such as archiving, video games and private usage. The moving picture expert group (MPEG) committee, a joint venture by the ISO and the IEC, has worked on creating consensus solutions for the purposes of standardization [NIC 03].
The first recommendation, MPEG-1, concerned low bit rate wired transmission in the period preceding ADSL (using packets of 64 kbit/s, giving a maximum rate of around 1.5 Mbit/s). This very low rate was only suitable for video transmission of very small images (352 × 288 pixels per image at a rate of 25 images/s). MPEG-1 was widely used in the context of video-CD archiving.
The MPEG-2 standard was designed for television, and specifically for high quality transmission (contribution coding). It is particularly suitable for broadcast transmission, whether terrestrial, via cable or over ADSL. The standard allows transmission rates of between 2 and 6 Mbit/s for standard video and of 15 to 20 Mbit/s for HDTV.
MPEG-4 is a multimedia transmission standard, part of which deals specifically with video images. It is particularly concerned with low-rate transmission (around 2 Mbit/s) and uses more elaborate compression mechanisms than those found in MPEG-2; these mechanisms are suitable for use with highly compressed files (although a significant reduction in quality is to be expected). MPEG-4 makes use of the scalability properties included in recommendation H.263. It forms the basis for easy communications with the field of computer science, particularly with regard to copyright protection.
Developments MPEG-7, MPEG-21 and MPEG-x (with x taking a value from A to V) are designed for multimedia applications of video sequences, and define specific functions for archiving, database searches, production (particularly the integration of a variety of sources), interaction, diffusion and content protection.
The video coding standards contained in MPEG-2 and MPEG-4 are currently used in many cameras. Architectures designed for vector-based treatment and integrated into the camera are used to carry out weighty calculation processes, using many of the elements of these standards [NIC 03, PER 02].
We will begin by considering the implementation of MPEG-2. A color image, generally acquired in a RGB space, is converted into a {luminance, chrominance} space, and the chrominance channels are generally subsampled using a factor of 2 for both lines and columns, as in the case of JPEG.
Intra-image coding is carried out by subdividing the image into groups of blocks (GOB), constructed from macroblocks (MB). A macroblock is itself made up of 4 blocks of 8 × 8 pixels of luminance and, generally, 1 block per chrominance channel (4 : 2 : 0 chroma subsampling) or 4 blocks per channel (4 : 4 : 4 encoding, without subsampling). The binary string is therefore made up of an image layer (I), composed of group-of-blocks (GOB) layers, themselves made up of macroblock (MB) layers. A macroblock layer typically includes the 4 luminance blocks followed by 2 chrominance blocks, one for the red-difference channel, one for the blue-difference channel.
These blocks are coded using a discrete cosine transformation (DCT), quantized, then subjected to entropy coding (run-length and Huffman coding or, better, arithmetic coding) (Figure 8.8).
There are two possible modes of inter-image encoding:
– prediction, in which each macroblock of an image noted P is predicted by motion compensation from the closest preceding I or P image, so that only the motion vector and the prediction residue need to be transmitted;
– bidirectional interpolation, which uses both the preceding and the following I or P images to reconstruct intermediate images noted B (see Figure 8.9) using a very small quantity of additional transmission space (only elements essential for positioning the macroblock in B are added, in cases where this cannot be deduced by simple interpolation of that of P).
However, this second operating mode results in a delay in image reconstruction; for this reason, it is mostly used for archiving purposes, where real-time operation is not essential, or in applications where a certain level of latency can be tolerated.
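The motion compensation underlying inter-image coding rests on block matching; a minimal exhaustive-search sketch (real encoders use much faster search strategies and sub-pixel refinement):

```python
def best_match(ref, cur, bi, bj, B=8, search=4):
    """Find the motion vector (dy, dx) minimizing the sum of absolute
    differences (SAD) between the B x B block of the current image at
    (bi, bj) and a displaced block of the reference image."""
    h, w = len(ref), len(ref[0])
    block = [row[bj:bj + B] for row in cur[bi:bi + B]]
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = bi + dy, bj + dx
            if not (0 <= y <= h - B and 0 <= x <= w - B):
                continue  # displaced block must stay inside the reference
            sad = sum(abs(block[i][j] - ref[y + i][x + j])
                      for i in range(B) for j in range(B))
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    return best

# current image = reference shifted down by 1 line and right by 2 columns
ref = [[i * 16 + j for j in range(16)] for i in range(16)]
cur = [[ref[i - 1][j - 2] if i >= 1 and j >= 2 else 0 for j in range(16)]
       for i in range(16)]
print(best_match(ref, cur, 4, 4))  # SAD of 0 at motion vector (-1, -2)
```

When the match is good (low SAD), only the vector and a small residue are transmitted, which is where the large inter-image gains come from.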
This coding mode generates significant block effects in the case of high compression rates. A number of improvements have been proposed to eliminate this issue: multiplication of movement vectors for each macroblock, block layering when identifying movement, a posteriori filtering, etc.
In MPEG-4, most of the functions of MPEG-2 described above have been improved: better prediction of DCT coefficients, better movement prediction, resynchronization of the data flux, variable length entropic coding, error correction, etc. Part 10 of the standard introduces a new object, the VOP (video object plane), a box surrounding an object of interest. A VOP may be subjected to a specific compression mode different to that used for the rest of the image. A VOP is made up of a number of whole macroblocks, and will be identified in the data flux. For example, a participant in a television program, speaking against a fixed background, may be placed into a VOP with high definition and a high refresh frequency, while the studio background will be coded at a very low rate. VOPs, generally described on an individual basis by users, are not currently used in most cameras (which tend to focus on part 2 of the MPEG-4 recommendations). However, certain cameras now include very similar functions (for example higher-quality coding in a rectangular zone defined by facial recognition).
The use of MPEG-4 coders in cameras and their ability to encode high definition videos (HDTV format in 2015, with the 4K format available on certain prototypes for durations of a few seconds) represents a major development in camera technology, as this application currently determines the peak processing power, and defines the processor and signal processing architecture, the data bus and the buffer memory.
No chapter on image encoding and compression would be complete without at least a rapid overview of compressed sensing, which, in 2015, shows considerable promise in reducing the volume of image signals beyond the possibilities offered by JPEG 2000. However, compressed sensing is still a long way from constituting a standard or an image format, and does not yet constitute a reasonable alternative to the standards discussed above [ELD 12].
The basic idea in compressed sensing is the reconstruction of images using a very small number of samples, significantly lower than the number required by the Shannon sampling theorem [DON 06, BAR 07, CAN 08]. The proposed approaches consist of acquiring a small number of relevant and complementary measurements, then reconstructing the image using sophisticated techniques which exploit the sparsity of image signals, essentially through reconstructions minimizing the ℓ0 or ℓ1 norm. Sparsity is the capacity of a signal to be described in an appropriate base using only a small number of non-null coefficients. Image sparsity cannot be demonstrated in general, and is not easy to establish. The hypothesis is supported by strong arguments and convincing experimental results, but still needs to be reinforced and defined on a case-by-case basis.
Compressed sensing places image acquisition at the heart of the compression problem; this approach is very different to that used in sensors which focus on regular and dense sampling of whole images, using steps of identical size. However, this “Shannon style” sampling is not necessarily incompatible with all potential developments in compressed sensing. The onboard processor is left to calculate the compressed form in these cases, something which might be carried out directly by sophisticated sensors (although this remains hypothetical in the context of everyday photography).
Presuming that an image can be decomposed using a very small number of coefficients in an appropriate base, we must now consider the best strategy for identifying active coefficients. Candès and Tao have shown [CAN 06] that images should be analyzed using the most “orthogonal” measurement samples available for the selected base functions, so that even a small number of analysis points is guaranteed to demonstrate the effects of the base functions. As we do not know which base functions are active, Candès suggests using a family of independent, random samples [CAN 08]. Elsewhere, Donoho specified conditions for reconstruction in terms of the number of samples required as a function of signal sparsity [DON 09].
In terms of base selection, no single optimal choice currently exists, but there are a certain number of possibilities: Fourier bases, the DCT (as in JPEG), wavelets (as in JPEG 2000) and the wavelet packets used in image processing (curvelets, bandlets, etc.). Reconstruction with a given basis is then conceptually straightforward, but involves cumbersome optimization calculations under a sparsity constraint, for example by searching for the most relevant base function from those available, removing the contribution it explains from the sample, and repeating the process (the classic matching pursuit algorithm [MAL 93]), or by convex optimization.
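The matching pursuit loop described above can be sketched as follows, assuming an arbitrary dictionary of unit-norm atoms (here a trivial one, for illustration only):

```python
def matching_pursuit(signal, dictionary, n_iter=10):
    """Greedy matching pursuit [MAL 93]: repeatedly pick the unit-norm atom
    most correlated with the residual, record its coefficient, and subtract
    its contribution from the residual."""
    residual = list(signal)
    chosen = []
    for _ in range(n_iter):
        scores = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        coef = scores[k]
        if abs(coef) < 1e-12:   # residual fully explained
            break
        residual = [r - coef * a for r, a in zip(residual, dictionary[k])]
        chosen.append((k, coef))
    return chosen, residual

# a sparse signal in a toy 4-atom orthonormal dictionary
atoms = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(matching_pursuit([0, 3, 0, -2], atoms))  # two atoms suffice, residual is zero
```

With a well-chosen dictionary, a few iterations explain most of a sparse signal's energy; convex (ℓ1) optimization is the main alternative when greedy selection is too crude.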
If an image is principally intended for transmission, immediate reconstruction is not required; we therefore need to consider the transmission of the random samples themselves. The random character of these samples does not lend itself to efficient transmission, and imposes serious limits on the possible performance of the compressed sensing process, unless certain well-mastered quantization and coding forms are used [GOY 08]. Other types of representation would therefore be useful, and their development is important in order to derive full profit from compressed sensing techniques; however, a great deal of work remains to be done in this area.
Current studies of compressed sensing have yet to produce results of any significance for mass market photography. However, an innovative prototype of a single-pixel camera has been developed, tested [DUA 08] and marketed by InView. This prototype captures images using a matrix of mirrors, individually oriented along both the x and y axes, in order to observe a different aspect of the scene at each instant. All of the signals sent by the mirrors are received by a single sensor, which produces a single pixel for any given instant, made up of a complex mixture of all of these signals. The matrix is then modified following a random law and a new pixel is measured. The flux of successive pixels, measured at a rate of around 1,000 pixels per second, is subjected to a suitable wavelet-based reconstruction process in order to recreate the scene, using around one tenth of the samples previously required. This technology is particularly promising for specific wavelengths for which sensors are extremely expensive (for example infra-red, a main focus of the InView marketing strategy).
Compressed sensing applications are particularly widely used in the field of instrumentation, where they have provided interesting solutions to complex 3D measurement problems in medical imaging (axial tomography, magnetic resonance etc. [LUS 08]), in microscopy for biological purposes [LEM 12], in holography and for physical instrumentation [DEN 09], etc.
For photographers, compressed sensing is a promising area, but considerable developments are required before these methods are suitable for general use.