Chapter 12

High Dynamic Range Imaging with JPEG XT

T. Richter    University of Stuttgart, Stuttgart, Germany

Abstract

JPEG XT (ISO/IEC 18477), the latest standardization initiative of the JPEG (ISO SC29WG01) committee, defines an image compression standard backward compatible with the well-known JPEG standard (ISO/IEC 10918-1). JPEG XT extends JPEG by features such as coding of images of higher bit depth, coding of floating point image formats representing high dynamic range images, lossless compression, and coding of alpha channels. All extensions are compatible with the legacy JPEG standard, and always allow the reconstruction of JPEG XT codestreams to images of eight bits per pixel. This chapter discusses the history and motivation for JPEG XT, provides insight into its design principles, and provides results on its performance.

Keywords

JPEG; High dynamic range; JPEG XT

12.1 The JPEG XT Standard

The ITU Recommendation T.81 and ISO/IEC Standard 10918-1 for still image coding, commonly known as JPEG (Wallace, 1992; Pennebaker and Mitchell, 1992; ISO/ITU, 1992), is still the dominant codec used for lossy image coding. However, as the quality of sensors has improved and market demands have changed in recent years, JPEG no longer fully addresses all needs of the digital photography market. While JPEG offers a lossy 12 bits per sample mode and also includes a lossless coding mode, both modes are incompatible with the popular eight-bit mode and are rarely implemented for this reason; most decoders found on the market will not be able to decode such images. Furthermore, standards such as JPEG 2000 (Boliek, 2000; Taubman, 2000) and JPEG XR (Srinivasan et al., 2007), while addressing such needs, were successful only in niche markets and never found acceptance as wide as that of JPEG. While the list of available features of these newer codecs is certainly long, they require a completely new image processing chain that is hard to establish in the consumer photography market. In short, both standards had limited success in the consumer market because of their lack of backward compatibility.

To modernize the old standard and address the changed needs of the market, the JPEG committee started a new work item on a fully backward compatible image compression scheme based on 10918-1 at its Paris 2012 meeting (ISO SC 29 (WG 1), 2012). The design goals for this compression scheme were to offer a JPEG-compatible, low-complexity codec for high dynamic range (HDR), large color gamut data. In later meetings, it became apparent that extensions in additional directions, such as lossless compression or coding of alpha channels, would be desirable, and that the definition of the term “high dynamic range” (HDR) itself required clarification; see Section 12.2. Hence, JPEG XT was designed as a multipart standard, where each subsequent part adds additional features, all integrated into a common framework, allowing individual requirements to be addressed by selecting the necessary coding features from the parts. The current architecture of the JPEG XT standard is outlined in Section 12.4.

Beyond all extensions, one leading principle holds all parts together, and that is backward compatibility with the popular, widespread eight-bit Huffman coding mode of JPEG. Backward compatibility here means that legacy decoders are able to reconstruct a lossy low dynamic range (LDR), standard color gamut version of the encoded image, while the complete (full range, lossless, etc.) image data are available only to decoders compliant with the new standard. The common minimum subset of all JPEG XT parts — namely, the eight-bit mode of the legacy JPEG standard plus some widely known extensions defining color spaces and subsampling — became its own part, ISO/IEC 18477-1. Compatibility with other JPEG modes, such as arithmetic coding or predictive lossless coding, was of no importance as these modes are rarely used in the photography market. Moreover, the extensions made within JPEG XT are all derived from known JPEG algorithms, such that implementation of JPEG XT in hardware requires little more than two readily available JPEG chips plus pixelwise postprocessing. Hence, JPEG XT was designed to ease implementation on existing hardware.

12.2 Problem Definition

While the definition of lossless coding or coding of opacity information is certainly obvious, the term “high dynamic range” (HDR) is only vaguely defined, and it is the purpose of this section to introduce the definitions the JPEG committee used to structure its work.

For this, a couple of definitions have to be made. An f-stop is commonly known as the logarithm to the base 2 of the quotient of the lightest and the darkest (physical) intensity in a given scene; a scene with a contrast ratio of 1000:1, for example, spans about 10 f-stops. Hence, the number of f-stops describes the dynamic range.1

An LDR scene has a dynamic range of at most 10 f-stops, an intermediate dynamic range (IDR) scene has between 10 and 16 f-stops, and everything above is called high dynamic range (HDR).

Related to that is the representation of a scene as a digital image. An LDR representation is defined here, for purely practical purposes, as an image represented by one or three channels of eight-bit integer samples. This is also sometimes called standard dynamic range (SDR) and covers the dynamic range that the popular modes of the legacy JPEG can represent. SDR images are, as in the case of JPEG, also restricted in their color gamut. In digital photography, SDR images are generated by a tone-mapping operation that is applied to the raw sensor signals, compressing the output dynamic range down to 256 possible intensity values.

An IDR representation still uses integer samples, but requires more than eight-bit resolution. This is sometimes also called extended dynamic range.2 There is currently no widely accepted standard for IDR image representations for digital photography, although camera vendors advertise proprietary “raw” formats that typically encode integer sensor signals of more than eight-bit precision and thus can be classified as IDR representations. Despite the name, “raw” formats are typically not as “raw” as one might believe: some elementary processing is usually applied to the sensor signals before storage. However, what is common to these proprietary representations is that the final tone-mapping step from the intermediate signal representation to the LDR output is skipped, and has to be performed by the user after the images have been downloaded from the camera, similar to the “printing” of analog negatives. The actual encoding of the preprocessed sensor data is typically lossless. IDR representations are, hence, considered as “digital negatives.”

HDR representations are those that require floating point samples because of a dynamic range that is too large to allow representation by integer samples. Since the dynamic range of today’s sensors falls well within the range that can be represented by at most 16-bit integers, larger dynamic ranges can be acquired only if several shots with different exposure times are combined into a single digital image.

The classifications of scenes and representations are not identical, but related. An LDR scene can surely be represented by an HDR image encoding, but the reverse requires an additional tone-mapping step that alters the content and generates loss. Furthermore, the relation between the physical intensities (ie, radiance) and the sample values differs depending on the representation. SDR images are typically gamma-corrected (ie, the sample values are not proportional to the physical intensities, but are instead related to them by a power law). This gamma correction stems originally from analog TV, where it became a necessity because of the nonlinearity of the components used for recording and reproduction of signals (Netravali and Haskell, 1995), but it can also be seen as a very basic tone-mapping procedure (Reinhard et al., 2010). IDR and HDR representations typically use a linear gamma encoding (ie, image sample values are proportional to physical intensities). Depending on whether the proportionality constant is known, one also calls these absolute or relative radiance representations.
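
To make the power-law relation concrete, the following sketch shows gamma correction as a basic tone mapping. The exponent 2.2 is a common choice assumed here for illustration; it is not prescribed by the text.

```python
import numpy as np

# Gamma correction as a power law: SDR codes relate to linear radiance by
# V = L^(1/2.2); the exponent 2.2 is an illustrative assumption.

def gamma_encode(linear):
    """Map linear radiance in [0, 1] to eight-bit SDR sample values."""
    return np.round(255.0 * np.clip(linear, 0.0, 1.0) ** (1.0 / 2.2)).astype(np.uint8)

def gamma_decode(code):
    """Map eight-bit SDR sample values back to approximate linear radiance."""
    return (code / 255.0) ** 2.2
```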

12.3 The History of JPEG XT

The original design goal of JPEG XT, as discussed in the initiating 2012 Paris meeting of the JPEG committee, was the backward-compatible representation of HDR images. At that time, the requirements had not been fully settled, and the importance of distinguishing between floating point and integer representations, and between HDR and IDR, became evident only several meetings later.

As an answer to this call, the JPEG committee received five proposals at its Shanghai meeting: from the École Polytechnique Fédérale de Lausanne in Switzerland, the Vrije Universiteit Brussel in Belgium, the UK-based company Trellis Management, Dolby in the United States, and the University of Stuttgart in Germany. All proposals expect, as input to the encoder, the IDR or HDR image to be coded, and, sometimes optionally, an additional LDR image which then becomes the SDR image visible to legacy decoders. An additional side stream carries all the information necessary to reconstruct the image to full precision. In principle, this side channel could be coded by any suitable method; however, the committee decided to restrict JPEG XT to entropy coding mechanisms that are as close to JPEG encoding as possible to minimize hardware costs and to reuse as many parts of existing implementations as possible.

The proposals differed in which representations they addressed: the Dolby and Trellis Management proposals covered the HDR case, whereas the University of Stuttgart proposal addressed only IDR use cases and the Vrije Universiteit Brussel proposal addressed only lossless encoding. The École Polytechnique Fédérale de Lausanne proposed a rather generic framework to cover HDR use cases. It took almost one year to fit all proposals into one common architecture and to structure JPEG XT into parts. Part 3 defines a common file format based on a box-type structure very much like JPEG 2000. The box-based file format provides the syntax to express all other parts; it does not define a decoder, but rather the syntactical elements used by the standard as a whole. Part 6 defines IDR coding, part 7 defines coding of HDR sample representations, and part 8 defines lossless coding. Depending on the nature of the proposals, they were integrated into parts 6, 7, or 8, where each part is an extension of the previous part. Part 2 plays a special role: it expresses part 7 profile A, in a technically equivalent but legacy syntax.

The extension for encoding of opacity channels, while already proposed as part 9 at the 2014 Valencia meeting, finally came to life at the 2015 Sydney meeting. Future meetings will possibly define additional parts, with such topics as privacy protection of images or regions of images, or encoding of plenoptic images.

More on the encoding algorithms and the joint decoder design of part 6 is discussed in Section 12.4.1, and the HDR extensions are introduced in Section 12.4.3. Lossless coding and coding of opacity data, covered in parts 8 and 9, are beyond the scope of this chapter.

12.4 Coding Technology

While JPEG defines multiple coding tools, including lossless coding, arithmetic coding, and pyramidal coding, only the discrete cosine transform (DCT)-based Huffman coding tools have found widespread use. Even more so, some elements of JPEG as it is understood and used today are surprisingly not part of the original standard at all. The YCbCr color space was first described in JFIF, the “JPEG File Interchange Format,” and was standardized as ISO/IEC 10918-5 as late as 2011 (Brower et al., 2011). Even though the legacy JPEG standard specifies component subsampling, the upsampling procedure and the alignment of components were originally not part of JPEG, but again were described in JFIF. In fact, the original JPEG standard specifies only a codestream that reconstructs samples from data but leaves the interpretation of such samples or their relation to images or color to other standards.

With all the differences between JPEG as it is applied in practical applications and the original ISO standard in mind, the JPEG committee decided to base its new JPEG XT coding technology on the firm grounds of a new standard spelling out all the accepted conventions, and to release this as JPEG XT part 1, ISO/IEC 18477-1. Codestreams following this specification will be reconstructed correctly by all existing JPEG decoders, despite dependence on conventions not covered in the original JPEG specifications (Fig. 12.1).

Figure 12.1 Overview of the parts of JPEG XT and their relations. Parts 4 and 5 define conformance testing and the reference software and are not included here.

Because legacy JPEG decoders usually support only the eight-bit integer mode described in JPEG XT part 1, JPEG XT part 6 now introduces two orthogonal coding mechanisms to extend the dynamic range: refinement coding and residual coding. The former method increases the precision of the DCT coefficients, and hence operates in the DCT domain. It is quite similar to the legacy 12-bit mode, although it allows arbitrary bit depths and is, unlike the legacy 12-bit mode, fully backward compatible with the eight-bit mode. The latter method, residual coding, operates entirely in the spatial domain; it extends the bit precision of the base layer defined by the legacy codestream elements and the refinement scan by including a second, independently coded extension layer. The extension layer coding adopts coding mechanisms that are either directly identical to or closely related to legacy JPEG coding modes.

A backward-compatible signaling mechanism based on the ISO media file format specified in part 3 instructs the decoder how to merge refinement, extension, and base layers into one final image. Metadata and residual and refinement coded data are here embedded into so-called boxes, a syntax element JPEG XT has in common with JPEG 2000 and MPEG standards. The box-based syntax allows future extensions of JPEG XT toward applications such as JPIP — an interactive image browsing protocol similar to the popular Google Maps service — that depend on such box structures. The boxes themselves are hidden from legacy decoders by their encapsulation in application markers, a generic extension mechanism already defined in the legacy standard. Parts 7 and 8 extend and use these extension mechanisms of part 6 to enable coding of HDR data and lossless coding.

The merging mechanism for the base and extension layers is a common superset of all proposals received by the JPEG committee. It is built from two elementary operations: a linear transformation from $\mathbb{R}^3$ to $\mathbb{R}^3$ representing decorrelation transformations or color-space conversions, and a one-dimensional nonlinear transformation, implementing approximate inverse tone-mapping, inverse gamma correction, or other nonlinear operations. Both types of functions operate on a pixel-per-pixel basis; their combination, in proper order, defines the entire universe of LDR/residual merging operations, which includes all original proposals.

The joint decoder architecture of JPEG XT is outlined in Fig. 12.2. This figure shows all components present in a decoder supporting all profiles and parts 6–8, including integer and lossless coding. Rounded boxes apply pointwise nonlinear transformations separately to each component, while square boxes represent linear matrix transformations. Thick lines transport three channels and thin lines transport a single channel. The dotted boxes are not required for part 6 (ie, IDR coding). Inverse QNT is the dequantization procedure (ie, multiplication of the decoded bucket indices by the quantizer bucket sizes in the JPEG quantization matrix). Rounded boxes with the term NLT run a pointwise nonlinear transformation, typically an inverse gamma correction or a scale adjustment. Square transformation boxes multiply triples of samples with a linear matrix and serve the purpose of an inverse component decorrelation or color space transformation. Coding of alpha channels (ie, JPEG XT part 9) adds a similar figure for a one-component opacity channel, which is not discussed here.

Figure 12.2 Decoder design of JPEG XT, merging the functionality of all parts except part 9. For the definition of and motivation for the various boxes, see the text.

Before we discuss the features of this decoder and the components of the JPEG XT standard, we will make a couple of observations. First, the legacy JPEG standard covers only the first three top-left boxes in Fig. 12.2, denoted as “T.81 10918-1 Decoder,” “FDCT or IDCT,” and “Inverse QNT.” JPEG, as it is in use today, and as standardized in ISO/IEC 18477-1, adds the two additional boxes to the right: chroma upsampling and transformation from YCbCr to RGB. The lower-left box labeled “Residual Image” represents the decoder for the extension layer, and everything to the right of this box and below the five boxes in the top row implement the merging operations that compute an HDR or IDR image from base layer (top row) and extension layer (bottom row). These operations are all done in the spatial domain, on a pixel-per-pixel basis. The only extension that has been made in the DCT domain is refinement coding, indicated by the small boxes that point into the base image and residual image decoder.

The following sections discuss all these extensions, from IDR coding to lossless coding, first introducing the extension mechanisms for IDR coding that are then, in part 7, put to use for HDR coding as well.

12.4.1 IDR Coding

As introduced above, part 6 defines two orthogonal extension mechanisms, one in the DCT domain denoted as refinement coding and one in the spatial domain, called residual coding. Both possibilities have already been discussed in the video coding community. Residual coding is similar in nature to the HDR extension for MPEG video proposed by Mantiuk et al. (2007, 2006), and refinement coding is not unlike the MPEG extension proposed by Zhang et al. (2011). Both mechanisms can be combined if needed, including the possibility to extend residual coding by refinement coding.

Refinement coding will be discussed first. For that, it is helpful to recapitulate how the progressive coding mode of legacy JPEG works (Pennebaker and Mitchell, 1992). Image quality is here progressively improved in two possible directions: the spectral selection mechanism allows the encoder to select parts of the frequencies in the JPEG zigzag scan pattern to be encoded first, with additional frequency components to be included in later scans if desirable. The successive approximation mechanism improves the bit precision of the DCT coefficients by first coding a subset of their most significant bits, then allowing the encoder to include additional lower-order bits in later scans.

It is important to note that the first scan of a successive approximation scan pattern uses a coding mechanism very similar to regular (sequential) coding. Progressive coding first uses a regular sequential scan — with only very minor extensions to skip over empty blocks quickly — to encode the most significant bits, and all subsequent least significant bits are encoded by an alternative, successive approximation entropy coder. If the spectral selection includes all frequencies3 and block skipping is not used, the entropy coding mechanism of the first scan of a successive approximation scan pattern is identical to that of the sequential Huffman scan.

Refinement coding makes use of this identity by splitting the coding of a high-bit-depth DCT block into two parts: a legacy sequential or progressive coding part that includes all the necessary bits required to reconstruct an eight-bit approximate image, and refinement bits using the successive approximation scan of the progressive coding mode to extend the bit precision as required. The difference between refinement coding and the legacy progressive coding mode (Fig. 12.3) is that the latter signals the number of least significant bits, by which the decoder has to upshift the reconstructed DCT samples output from the first scan, in its syntax elements (the start-of-scan marker), whereas the former hides this information in the syntax elements (ie, boxes) of JPEG XT that remain invisible to legacy implementations. In other words, legacy applications see a syntactically correct eight-bit stream, whereas extended applications increase the bit precision by the known successive approximation mechanism already specified in the legacy standard.

Figure 12.3 Relation between the successive approximation mechanism of legacy JPEG (left) and refinement coding (right), including four refinement bits for a total 12-bit spatial sample precision.

If one wants to limit the number of bits required for the DCT coefficients to 16 and the number of bits required to perform an integer-to-integer DCT approximation to 32, it is not hard to compute that at most four refinement scans can be included; this constraint is included for purely practical purposes in the current committee draft of the standard, although extensions to more bits would be straightforward.
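
As a toy illustration of this split, consider a single 12-bit DCT coefficient: the eight most significant bits go into the scan visible to legacy decoders, and the four least significant bits into hidden refinement scans. (Real DCT coefficients are signed and the first scan also codes the sign; the positive example value here is a simplification.)

```python
# Hypothetical 12-bit DCT coefficient, split as a successive approximation
# scan would: 8 legacy-visible most significant bits, 4 hidden refinement bits.
coeff = 0x9A7
base = coeff >> 4          # coded by the legacy-compatible scan
refinement = coeff & 0xF   # coded by four refinement scans in JPEG XT boxes
assert (base << 4) | refinement == coeff
```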

To introduce residual coding, one should first model the merging step of base and extension layer as a similar extension of bit depth through least significant bits, though this time in the spatial domain. In a very simple application of residual coding, the eight most significant bits stem from the legacy JPEG codestream and provide an approximation of the IDR image, whereas the least significant bits are represented in a side channel making up the residual image. IDR images would then be reconstructed by upshifting the samples of the base layer and adding the least significant bits from the residual image. These two operations are done by the addition “+” on the right-hand side in Fig. 12.2 and in the box denoted as “Base NLT Point Trafo.” The dotted boxes are not used, the box denoted as “Output Conversion” clamps the reconstructed sample values to their valid range, and the box denoted as “Color Transformation” is the identity.

This elementary approach does not, however, withstand a closer analysis of the requirements. IDR images are typically encoded without gamma correction (ie, sample values are proportional to physical intensity), whereas LDR images are gamma-corrected, tone-mapped versions of them. A simple downshift to generate an LDR image from the most significant bits of the IDR image would not work because of the lack of a gamma correction. However, if for image reconstruction purposes the upshift by the LDR image bits is replaced by a simple lookup table that provides a global, simple approximation of the tone mapping applied at the encoder for the generation of the LDR image, both a useful LDR image and an IDR image can be carried in the same file. In such a case, the lookup table (the box denoted as “Base NLT Point Trafo”) has 256 entries if the legacy image does not make use of refinement coding and hence has eight-bit precision; any type of error due to the approximation of the tone mapping is then captured by the extension layer. In the case of refinement coding, the lookup table is $2^{h+8}$ entries long, where h is the number of refinement scans “hidden” from the legacy decoder. Here a finer table allows a more precise approximation of the IDR image, and precise reconstruction with refinement coding alone is possible only if the original tone mapping for generation of the LDR image was a simple global operation. Otherwise, additional errors remain that require correction by a residual scan.

Leaving color issues aside, the reconstruction algorithm for IDR grayscale images defined in part 6 of the JPEG XT standard is hence

$\mathrm{IDR}(x,y) = \Phi(\mathrm{LDR}(x,y)) + \mathrm{RES}(x,y) - 2^{R_b-1},$   (12.1)

where IDR is the reconstructed IDR image, LDR is the base image visible to legacy decoders, RES is the residual image hidden in application markers, and $2^{R_b-1}$ is an offset creating a residual signal from RES(x,y) that is symmetric around zero. The constant $R_b$ is the bit precision of the generated IDR image. In the presence of refinement coding, the LDR signal will have a bit depth of 8 + h instead of 8, where h is the number of refinement scans.
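
A minimal sketch of Eq. (12.1) for a grayscale image follows. The inverse gamma used to fill the lookup table, the upshift used as the residual point transformation, and the final clamping are illustrative assumptions; the standard lets the encoder configure all of them.

```python
import numpy as np

R_B = 12  # assumed bit precision of the reconstructed IDR image

# 256-entry lookup table Phi: an assumed inverse of a gamma-2.2 tone mapping.
phi_lut = np.round(((np.arange(256) / 255.0) ** 2.2) * (2**R_B - 1)).astype(np.int32)

def reconstruct_idr(ldr8, res8):
    """Merge an 8-bit base layer and an 8-bit residual layer per Eq. (12.1)."""
    res = res8.astype(np.int32) << (R_B - 8)   # residual NLT: a plain upshift
    idr = phi_lut[ldr8] + res - 2**(R_B - 1)   # Phi(LDR) + RES - 2^(Rb - 1)
    return np.clip(idr, 0, 2**R_B - 1)         # the "Output Conversion" clamp
```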

Let us briefly discuss the encoding of the extension layer. Since the extension layer is never seen by a legacy decoder, it is not necessary to constrain entropy coding to the legacy eight-bit Huffman mode of JPEG. If its resolution is not sufficient to reach the desired quality, one can either use the 12-bit mode of JPEG or encode the data with the eight-bit Huffman mode and extend the resolution of the extension layer by refinement scans, as one can for the base layer. In principle, even other entropy coding methods could have been considered, but the committee decided to constrain JPEG XT to algorithms that are as close as possible to existing JPEG technology to ease their implementation.

12.4.2 Enlarging the Color Gamut

Even though the original JPEG standard did not define a color space and only specified a mechanism for how to encode sample values into a codestream, JFIF — later standardized as ISO/IEC 10918-5 — did (Brower et al., 2011). JFIF selected ITU Recommendation BT.601, originally a TV standard, both as a color space and as a decorrelation transformation from an RGB-type description into an opponent chrominance/luminance description. Unfortunately, even this selection is not fully consistent with typical uses of JPEG, where images are represented in sRGB rather than Recommendation BT.601. Both color spaces are related, but not exactly identical.

For reasons of backward compatibility, part 1 of JPEG XT also specifies the legacy ITU color space. Its color gamut is, however, quite limited, and it is often desirable to encode images expressed in larger color spaces. Simply mapping the primaries of the larger color space onto the Recommendation BT.601 primaries may be an option in some simple cases, but this will typically not map the white points onto each other; in other words, the colors of a legacy image encoded with the wrong primaries and interpreted as Recommendation BT.601 colors may be off, creating an overall undesirable image. Such color alterations are especially irritating in the reproduction of skin tones and should be avoided. They can, however, be corrected by use of the extension layer: any undesirable color shift can be compensated by the addition of a suitable error signal that shifts the reconstructed sample values to their desired color position. From the perspective of coding efficiency, this approach should nevertheless be avoided, as the error signal can become large enough to compromise the coding performance.

To address this predicament, JPEG XT offers three options. First, the LDR primaries can be used consistently for the base image and the full IDR image, which then requires the use of negative color coordinates for out-of-gamut colors. Because of the offset shift in Eq. (12.1), such sample values are representable by a minimum LDR value and a small, but positive, residual error.

Second, the legacy transformation between YCbCr and RGB can be replaced by another transformation that maps the legacy samples into a larger gamut color space. This generalized transformation is denoted as “Base Transformation” in Fig. 12.2; it can be considered as the combination of two subsequent linear transformations. The first transforms the Recommendation BT.601 YCbCr opponent sample values into an RGB color space, with color primaries defined as in the mentioned ITU recommendation; the second maps the coordinates of the ITU Recommendation BT.601 color space into the coordinates relative to the target color space. From a color science perspective this transformation is not strictly correct. The transformation is applied in the nonlinear (gamma-corrected) Recommendation BT.601 color space, hence creating errors in the chroma reproduction; the larger the nonlinearity in the inverse tone-mapping process, the larger the errors. Again, such errors can be corrected by an appropriate extension layer, and in the worst case compromise the coding performance, but not the quality of the image.

The third option overcomes this problem by introducing an additional linear transformation after inverse tone mapping, denoted as “Color Transformation” in Fig. 12.2. Here, the decoder first performs the legacy YCbCr to RGB transformation in the nonlinear color space as required by JFIF, and then maps the samples into the (typically) linear target color space of the IDR image by the base nonlinear point transformation, before it finally applies a linear transformation to map the colors to the target space.

While the third approach seems to be ideal from a color science point of view, it comes with problems of its own. None of the transformations can be implemented with infinite precision, and the second color transformation creates additional numerical errors that also require correction. Astonishingly, it is sometimes beneficial to use the second method instead of the third to maximize the coding performance. JPEG XT thus allows the use of all three approaches, even combined, and leaves it to the encoder to signal the desired mechanism.

The color decorrelation in the extension layer is much less critical. At first sight, it seems plausible also to transform the residual data first from YCbCr to RGB with Recommendation BT.601 primaries, and then further transform the data into the target color space. While such transformations are allowed and even present in some profiles (see Section 12.4.3), they not only require one additional matrix transformation, but also add further loss due to numerical inaccuracies. IDR coding as defined in part 6 does not make use of this transformation. Instead, one can take advantage of the fact that the residual image is never displayed directly and does not need to be backward compatible with legacy JPEG applications. Residual coding can be considered as a compressor of three sample values per pixel, without any color space information. While error residuals could be compressed by the residual stream directly without any transformation, it turned out that a decorrelation transformation identical to the Recommendation BT.601 RGB to YCbCr transformation improves compression efficiency here as well. This transformation is applied in the box denoted by “Residual Transformation” in Fig. 12.2. The residual nonlinear point transformation is typically only a scalar multiplication or an upshift, necessary to use the full range of eight bits to represent the error signal. It serves no additional purpose, and is even absent for lossless coding.

12.4.3 From IDR to HDR: HDR Coding

The coding mechanisms introduced above are, within the limits the current draft of the standard sets, able to encode images consisting of integer samples of up to 16-bit resolution. Part 7 of the JPEG XT standard is an extension of the coding tools of part 6 to allow reconstruction of images to floating point samples, as required for the representation of HDR images. Similarly to part 6, these extensions are based on two elementary types of operations: a pixel-based nonlinearity, implemented either by a lookup table or by a parameterized curve, and a linear transformation defined by a 3 × 3 matrix.

These tools, while being part of a common decoder framework, enable several encoder architectures to tackle the problem of HDR coding. As for many other standards, the set of decoding tools is structured into profiles, each of which defines a set of allowable decoder tools for codestreams conforming to the corresponding profile. Even though the corresponding encoder architecture is not specified by the standard, it is still helpful to start the discussion with typical encoder designs.

Profiles A and C make use of the approximately logarithmic dependency between the physical luminance (stimulus) and the corresponding response of the human visual system, which has been known for a long time as “Weber’s law” (Netravali and Haskell, 1995). Profile C represents images in the logarithmic domain directly, using a piecewise linear approximation of the logarithm that is exactly invertible and hence enables lossless coding, which is, however, standardized in a separate part. The coding algorithm in the logarithmic domain is then identical to part 6 (ie, profile C of part 7 is a minimal extension of part 6).

Profile A applies the logarithmic map only to the luminance channel; that is, the HDR image is represented as the product of an LDR base image and a luminance scale factor μ that is encoded logarithmically (see Fig. 12.2). The relation to profile C becomes more apparent if one recalls that this multiplication in the image domain performed by profile A is equivalent to an addition in the logarithmic domain as performed by profile C decoders (ie, the equivalent to the multiplication step by μ in profile A is the addition step immediately before it in profile C; see Fig. 12.2). Profile A also makes use of this addition step, although here to correct small residual errors in the chrominance channels.

Unlike profiles A and C, profile B encoders split the HDR signal along the luminance axis. Samples below a luminance threshold are represented in the base layer, and samples above the threshold are represented in the extension layer. The composition of the base and extension layers in the decoder is performed by a division operation; this division is represented by an addition and logarithmic and exponential preprocessing and postprocessing. Decoders do not need to go through these processing steps, of course, and may apply the division immediately when they detect a profile B codestream. From the perspective of the architecture of the standard, the addition step is here the same addition as applied in part 6, and the step involving multiplication by μ is absent as in profile C.

A somewhat more formal and more precise discussion of the profiles is given now. Profile A, allowing both the prescaling and postscaling nonlinearity but no refinement or secondary nonlinear transformations (Fig. 12.2), reimplements a decoder similar to JPEG-HDR, a backward-compatible JPEG extension originally proposed by Ward and Simmons (2005) and Dolby. The reconstruction algorithm in this specific case is given by

$\mathrm{HDR}(x,y) = \mu(\mathrm{RES}_0(x,y))\left( C\Phi(\mathrm{LDR}(x,y)) + \nu\big(S\,C\Phi(\mathrm{LDR}(x,y))\big)\, R\,\mathrm{RES}(x,y) \right).$   (12.2)

In Eq. (12.2), μ is the postscaling factor depending on the luminance signal l of the extension layer. The function μ(l) relating the luminance of the extension layer to the scaling factor is typically an exponential ramp function of the form $\mu(l) = \exp(al + b)$ with suitable constants a and b (ie, the extension layer luminance l depends logarithmically on the scale factor μ). This nonlinearity is, from the perspective of the standard, implemented by the postscaling nonlinear transformation.

The transformation S is a 3 × 3 matrix that extracts the luma signal from the RGB base image, and ν adds an offset shift to the base layer luma signal extracted through S. The matrix C finally maps the primaries of the Recommendation BT.601 color space of the base layer onto the primaries of the larger HDR gamut.

The offset shift through the function ν is represented in the standard decoder architecture by the prescaling nonlinear transformation. As said, one typically has $\nu(y) = y + N_0$, where $N_0$ is the noise floor. It allows for chroma variations due to camera noise even in the absence of a luma signal.

R is an affine (and not linear) transformation that converts the chrominance signal of the extension layer to an additive RGB residual. It does not take the luminance channel of the extension layer into account, which is instead split off beforehand to compute the scale signal μ.

Because of postscaling by μ, the nonlinearity Φ in the base layer, represented by “Base NLT Point Trafo” in Fig. 12.2, does not need to compensate for the effects of tone mapping and is here typically an inverse gamma transformation. It maps the gamma-compensated base image back into a linear radiance space.
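
A per-pixel sketch of the profile A reconstruction of Eq. (12.2) is given below. All operators are configurable in the standard; the gamma exponent for Φ, the Rec. 601 luma weights for S, the identity matrix for C, the ramp constants of μ, the noise floor of ν, and the toy chroma matrix for R are assumptions made for illustration only.

```python
import numpy as np

GAMMA = 2.2                          # assumed base gamma for Phi
C = np.eye(3)                        # assumed primary mapping (identity)
S = np.array([0.299, 0.587, 0.114])  # assumed Rec. 601 luma extraction
A_RAMP, B_RAMP = 8.0, -4.0           # assumed constants of mu(l) = exp(a*l + b)
N0 = 1e-4                            # assumed noise floor of nu(y) = y + N0

def phi(ldr):
    """Base nonlinearity: inverse gamma back to linear radiance."""
    return np.asarray(ldr, dtype=np.float64) ** GAMMA

def mu(l):
    return np.exp(A_RAMP * l + B_RAMP)

def r_chroma(cb, cr):
    """Toy affine chroma-to-RGB residual R (Rec. 601-style inverse)."""
    return np.array([1.402 * cr, -0.344 * cb - 0.714 * cr, 1.772 * cb])

def reconstruct_pixel(ldr_rgb, res_ycc):
    """Eq. (12.2): HDR = mu(RES_0) * (C Phi(LDR) + nu(luma) * R RES)."""
    base = C @ phi(ldr_rgb)
    chroma = (S @ base + N0) * r_chroma(res_ycc[1] - 0.5, res_ycc[2] - 0.5)
    return mu(res_ycc[0]) * (base + chroma)
```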

Encoding in profile A with prescaling and postscaling transformations enabled is somewhat more complex, and we will next derive the encoder algorithm. The input of the encoder is an LDR-HDR image pair from which a suitable extension layer must be computed. For that, first note that $R\,\mathrm{RES}(x,y)$ is, by design choice, a pure chrominance residual and does not include any luminance information. Thus, if we denote by $P_L$ the projection onto the luminance space, we find that

$P_L\left( C\Phi(\mathrm{LDR}(x,y)) + \nu(S\,\mathrm{LDR}(x,y))\, R\,\mathrm{RES}(x,y) \right) = P_L\, C\Phi(\mathrm{LDR}(x,y)).$

This allows the encoder to determine first $\mu(\mathrm{RES}_0(x,y))$ and then, as μ is a known function, $\mathrm{RES}_0(x,y)$ by the quotient of the original image luminance and the reconstructed base image luminance:

$\mu(\mathrm{RES}_0(x,y)) = \frac{P_L\,\mathrm{HDR}(x,y)}{P_L\, C\Phi(\mathrm{LDR}(x,y))}.$

In the last step, the chroma residuals can be computed:

$R\,\mathrm{RES}(x,y) = \frac{1}{\nu(S\,\mathrm{LDR}(x,y))}\left( \frac{\mathrm{HDR}(x,y)}{\mu(\mathrm{RES}_0(x,y))} - C\Phi(\mathrm{LDR}(x,y)) \right).$

The reader may now verify that the right-hand side lies indeed in the chroma-subspace of the color space — that is, the luma signal is zero (or close to zero, because of numerical inaccuracies). The luminance channel of the residual image contains instead the luminance scale factor.
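
The corresponding encoder steps can be sketched as follows, continuing the assumptions of the decoder sketch above (same C, S, Φ, μ, and noise floor, with the luma inside ν taken from the base image as in Eq. (12.2)). The result still has to be mapped through the inverse of R and quantized before entropy coding; both steps are omitted here.

```python
import numpy as np

GAMMA, A_RAMP, B_RAMP, N0 = 2.2, 8.0, -4.0, 1e-4  # as in the decoder sketch
C = np.eye(3)
S = np.array([0.299, 0.587, 0.114])

def phi(ldr):
    return np.asarray(ldr, dtype=np.float64) ** GAMMA

def encode_pixel(hdr_rgb, ldr_rgb):
    base = C @ phi(ldr_rgb)                    # C Phi(LDR)
    scale = (S @ hdr_rgb) / (S @ base)         # mu(RES_0): the luminance quotient
    res0 = (np.log(scale) - B_RAMP) / A_RAMP   # invert the known ramp mu
    rgb_res = (hdr_rgb / scale - base) / (S @ base + N0)  # the R RES chroma term
    return res0, rgb_res
```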

While profile A uses an explicit multiplication operation to scale the base image luminance to the desired target range, encoding and reconstruction are relatively elaborate, and the standard allows simpler alternatives to achieve similar goals in profiles B and C. Both make use of the functional equation of the logarithm to map the addition in Fig. 12.2 that merges the base layer with the extension layer into a multiplication or division — that is,

$\mathrm{HDR}(x,y)_i = \exp\!\big( \log(C\Phi(\mathrm{LDR}(x,y))_i) + \log(\mathrm{RES}(x,y))_i - 2^{R_b-1} \big) = \exp\!\big({-2^{R_b-1}}\big)\, C\Phi(\mathrm{LDR}(x,y))_i\, \mathrm{RES}(x,y)_i \quad (i = 0,1,2),$   (12.3)

where in Eq. (12.3) $\exp$ and $\log$ are applied component-wise on the components of the vectorial image data. Similarly, a subtraction in the logarithmic domain becomes a division for the HDR image signal. The reader may want to compare Eq. (12.3) with Eq. (12.1) used for IDR reconstruction in part 6.

This transformation of a multiplication into an addition in the logarithmic domain is the motivating idea for the secondary base and secondary residual nonlinear point transformations in Fig. 12.2; the logarithm can here also be understood as a transformation into an approximately perceptually uniform domain based on Weber’s law.

The Trellis Management XDepth proposal reconstructs the HDR image as the quotient of an inversely gamma-corrected LDR signal reconstructed from the legacy JPEG stream and a residual included in a side channel. This quotient can be rewritten as the difference of two terms in the logarithmic domain, where the $\log$ and $\exp$ functions necessary for conversion between the linear and the logarithmic domain are expressed in the standard decoder design shown in Fig. 12.2 by the secondary base and secondary residual nonlinear transformations and the output conversion:

$\mathrm{HDR}(x,y)_i = \sigma \exp\!\big( \log(C\Phi(\mathrm{LDR}(x,y))_i) - \log(\Psi(\rho(R\,\mathrm{RES}(x,y))) + \epsilon)_i \big) = \sigma\, \frac{C\Phi(\mathrm{LDR}(x,y))_i}{\Psi(\rho(R\,\mathrm{RES}(x,y)))_i + \epsilon} \quad (i = 0,1,2).$   (12.4)

The “Output Conversion” in Fig. 12.2 includes here, compared with Eq. (12.3), an additional scale factor σ whose purpose will be described below. As in profile A, the base layer nonlinearity Φ maps the gamma-corrected LDR image back into the linear domain, and Ψ is an additional power map applied in the residual to compress its dynamic range; that is, Ψ(x) = x^β, where β is a data-dependent exponent selected by the encoder. As in profile A, R is a decorrelation transformation from YCbCr to RGB, although the luma signal is this time not split off but is included in the transformation. Finally, ρ, represented in the standard decoder architecture by the “Intermediate NLT Point Trafo,” is an affine scaling transformation that maps the RGB signals of the extension layer into a suitable range. It is described next.

Encoding in profile B is much simpler than in profile A as it requires neither prescaling by ν nor postscaling by μ:

$\rho(R\,\mathrm{RES}(x,y))_i = \Psi^{-1}\!\left( \sigma\, \frac{C\Phi(\mathrm{LDR}(x,y))_i}{\mathrm{HDR}(x,y)_i} \right).$

In general, the extension layer computed by the right-hand side will not be within the range of the native JPEG standard. The transformation $R^{-1}$, however, is already present there as the RGB to YCbCr decorrelation transformation. The purpose of $\rho^{-1}$ is now to perform the necessary scaling and to map the output of $\Psi^{-1}$ into this interval. It is a data-dependent transformation that must be computed by the encoder.

While the profile B algorithm allows for arbitrary LDR images and hence for arbitrary tone mapping of the HDR content, the original XDepth proposal considered only one special choice for the map Φ, the color transformation C, and the LDR image — namely,

$\mathrm{LDR}(x,y)_i = \min\!\left( \Phi^{-1}\big(\sigma^{-1}\,\mathrm{HDR}(x,y)\big)_i,\ 1 \right),$

where $\Phi^{-1}$ is the gamma correction and C is the identity. Hence, the HDR image is first inversely scaled with σ, then gamma-corrected and clamped to the range [0,1]. One now observes that this has the following consequences for the residual. As long as the sample value of the HDR image is below σ, no clamping occurs and the quotient defining the extension layer

$\Psi(\rho(R\,\mathrm{RES}(x,y)))_i = \sigma\, \frac{\Phi\big(\Phi^{-1}(\sigma^{-1}\,\mathrm{HDR}(x,y))\big)_i}{\mathrm{HDR}(x,y)_i} = 1$

is identically 1. For larger HDR sample values, the LDR image is saturated, and the residual carries all the image information of the overexposed region. The purpose of σ now becomes obvious: it determines which image luminances go into the LDR image and which become part of the extension layer. Using the analog printing process as a metaphor, σ controls the “exposure” of the (now digital) negative HDR(x,y) into the print LDR(x,y); in this process, the overexposed regions are preserved in the extension layer. Hence the name “exposure value” for the inverse of σ.
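
The following sketch illustrates this profile B split for a single channel. The gamma exponent stands in for Φ⁻¹ and C is taken as the identity; both, as well as the exposure parameter σ, are illustrative assumptions.

```python
import numpy as np

GAMMA = 2.2  # assumed gamma for Phi^-1

def split_profile_b(hdr, sigma):
    """Split a linear, positive HDR signal at the exposure threshold sigma."""
    ldr = np.minimum((hdr / sigma) ** (1.0 / GAMMA), 1.0)  # clamped base layer
    residual = sigma * (ldr ** GAMMA) / hdr                # quotient layer
    return ldr, residual

hdr = np.array([0.01, 0.5, 2.0, 50.0])
ldr, res = split_profile_b(hdr, sigma=1.0)
print(res)  # 1.0 where hdr <= sigma; sigma/hdr in the overexposed region
```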

While profile A and profile B decoding algorithms — with or without prescaling and postscaling transformation — are clearly invertible as mathematical operations and hence suitable encoding algorithms are readily derived, they are not invertible without loss as numerical algorithms. This is because both of them depend on floating point representations and use multiplications or transcendental functions, which all introduce (numerical) loss when the (mathematical) inverse is taken. Profile C avoids these problems and extends the decoding algorithm from part 6 (Eq. 12.1) to HDR without depending on floating point arithmetic.

To understand its design, recall the IEEE encoding of floating point numbers (IEEE Computer Society, 2008). In its binary encoding, an IEEE floating point number consists of a sign bit s, a number k of exponent bits e, and a number l of mantissa bits m. The floating point number f is then given by

$f = (-1)^s\, 2^{e-b} \left( 1 + 2^{-l} m \right),$

where b is an exponent bias that depends on the precision of the floating point number. This number f is now stored in memory as the concatenation of the s, e, and m bits; the very same bit pattern can also be interpreted as an integer number i:

$i = 2^{k+l} s + 2^l e + m.$
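
Both formulas can be checked directly, as in the sketch below for single precision (k = 8, l = 23, bias b = 127); the example value is arbitrary.

```python
import struct

def float_fields(f: float):
    """Return the raw bit pattern and the (s, e, m) fields of a float32."""
    i = struct.unpack('<I', struct.pack('<f', f))[0]
    return i, i >> 31, (i >> 23) & 0xFF, i & 0x7FFFFF

i, s, e, m = float_fields(-6.5)
assert (-1)**s * 2.0**(e - 127) * (1 + m / 2**23) == -6.5  # the formula for f
assert i == 2**31 * s + 2**23 * e + m                       # the formula for i
```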

The trick is now that the reinterpretation of f as i, which is a null operation for a computer since it changes only the interpretation of the bits rather than their values, is for s = 0 an approximation of a logarithm to the base 2, here written as $\psi_{\log}$:

$i = \psi_{\log}(f) = 2^l (\log_2 f + b) - 2^l \log_2\!\big(1 + 2^{-l} m\big) + m \approx 2^l (\log_2 f + b),$   (12.5)

where one uses that $\log_2(1+x) \approx x$ for x between 0 and 1. In fact, the error

$\epsilon(x) = x - \log_2(1+x)$

vanishes at the edges x = 0 and x = 1.
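
In code, the pseudologarithm is literally a reinterpretation of the bit pattern, as the sketch below shows for nonnegative single-precision values (l = 23, b = 127); the test value is arbitrary.

```python
import math
import struct

def psi_log(f: float) -> int:
    """Bit pattern of a nonnegative float32, read as an unsigned integer."""
    return struct.unpack('<I', struct.pack('<f', f))[0]

def psi_exp(i: int) -> float:
    """Exact inverse: the integer bit pattern read back as a float32."""
    return struct.unpack('<f', struct.pack('<I', i))[0]

f = 12.375
print(psi_log(f) / 2**23 - 127, math.log2(f))  # approximation per Eq. (12.5)
assert psi_exp(psi_log(f)) == f                # the round trip is lossless
```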

This observation can now be applied to Eq. (12.3), the multiplicative decoding algorithm, by replacement of the logarithm by $\psi_{\log}$, defined by reinterpretation of bit patterns, and its inverse, denoted by $\psi_{\exp}$. This change implements the reconstruction algorithm as

$\mathrm{HDR}(x,y)_i = \psi_{\exp}\!\big( \psi_{\log}(C\Phi(\mathrm{LDR}(x,y))_i) + \psi_{\log}(\mathrm{RES}(x,y))_i - 2^{R_b-1} \big) \approx \psi_{\exp}\!\big({-2^{R_b-1}}\big)\, C\Phi(\mathrm{LDR}(x,y))_i\, \mathrm{RES}(x,y)_i \quad (i = 0,1,2),$   (12.6)

which carries over to lossless compression and is a natural extension of the coding algorithm of part 6. Since $\psi_{\exp}$ and $\psi_{\log}$ require only a reinterpretation of bits, this coding algorithm is also of very low complexity.

Eq. (12.6) can be simplified even further. First of all, if we ignore the color transformation matrix C for a moment, the base layer contributes to the overall HDR image through

$\psi_{\log}\big(\Phi(\mathrm{LDR}(x,y))\big).$

Now, both operations — the nonlinearity Φ and the pseudologarithm mapping floating point values to integers — can be merged into a single operation that consists only of a table lookup, as the combined operation maps integers to integers. Second, since the residual image is never displayed on a monitor for observation, it does not matter whether RES or $\psi_{\log}(\mathrm{RES})$ is encoded. The latter has the advantage that it already consists of integers suitable for JPEG encoding. With these changes, one gets a completely integer-based and numerically invertible reconstruction algorithm:

$\mathrm{HDR}(x,y)_i = \psi_{\exp}\!\big( \hat{\Phi}(\mathrm{LDR}(x,y))_i + \mathrm{RES}(x,y)_i - 2^{R_b-1} \big) \quad (i = 0,1,2).$   (12.7)

An additional color transformation can now either be applied directly on the LDR image in the gamma-corrected space, or be applied outside in the logarithmic space after inverse tone mapping and pseudologarithmic mapping by $\hat{\Phi}$. Since Φ and $\hat{\Phi}$ are nonlinear in general, neither of these operations is identical to the color transformation C in the linear space, and hence additional errors are introduced that can be compensated by the residual image. Hence, coding gain is lost by transformation to an alternative color space. A second strategy, outlined in Section 12.4.2, leaves the color space untouched, but uses negative sample values to represent out-of-gamut colors. Clearly, the original logarithmic and exponential map of Eq. (12.3) cannot express negative values, although an extension of $\psi_{\log}$ and $\psi_{\exp}$ to the negative axis is straightforward by extension of both functions in an (almost) point-symmetric way around the origin x = 0:

$\psi_{\log}(x) := -\psi_{\log}(-x) - 1 \quad \text{for } x \le 0.$   (12.8)

The additional subtraction of 1 seems curious, but it allows one to distinguish between the two representations of zero the IEEE floating point format has to offer — namely, +0 and −0. Note that one cannot implement this extension of the pseudologarithm by reinterpreting the bit pattern of IEEE floating point numbers; its implementation requires a conversion from a sign-magnitude representation to a two's complement representation. This conversion, however, is also lossless.
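
A sketch of the signed extension of Eq. (12.8) follows; note that $-\psi_{\log}(-x) - 1$ is the bitwise complement of $\psi_{\log}(-x)$, which is exactly the sign-magnitude to two's complement conversion mentioned above.

```python
import math
import struct

def psi_log(f: float) -> int:          # as in the previous sketch, f >= 0
    return struct.unpack('<I', struct.pack('<f', f))[0]

def psi_log_signed(x: float) -> int:
    """Eq. (12.8): psi_log(x) := -psi_log(-x) - 1 for x <= 0."""
    if x > 0 or (x == 0 and math.copysign(1.0, x) > 0):
        return psi_log(x)
    return -psi_log(-x) - 1            # == ~psi_log(-x): two's complement flip

print(psi_log_signed(0.0), psi_log_signed(-0.0))  # 0 and -1: +0 and -0 differ
```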

12.5 Hardware Implementation

Currently, most JPEG XT implementations are in software, with the exception of the profile stemming from JPEG-HDR, now profile A in part 7; see Section 12.4.3. Even here, the hardware consists of two standard JPEG chips that encode the base layer and the extension layer, plus one embedded processor implementing the operations to compute both layers from the HDR input signal. This setup is a rather typical choice for JPEG XT though. The entire design of the standard enables simple implementations on the basis of existing technology and does not require hardware implementations of algorithms.

All remaining lossy coding parts (ie, part 6 and all other profiles in part 7) can be realized in the very same way as long as refinement coding is not used and eight-bit encoding of the residual image provides sufficient quality.

Refinement coding is an extension based on the progressive scan type (see Section 12.4.1 and Fig. 12.3) and cannot be implemented by postprocessing alone. It will hence not be possible on hardware encoders that support only the sequential Huffman scan type of the legacy JPEG standard. Besides this limitation, the DCT has to be performed with higher precision; more details on this are given later.

Lossless coding, as specified in JPEG XT part 8, also fixes the implementation of the DCT and the YCbCr to RGB transformation, and additionally requires a new entropy coding mode for the extension layer. Hardware encoders may or may not follow these specifications, and are typically not able to support them out of the box.

Finally, we will provide a brief analysis of the required implementation precision and bit depths of the signal paths. The bit precision of the base layer in the spatial domain is at most 12 bits, eight bits from the legacy process and four bits from refinement coding. By analyzing the input/output gain of the DCT, one can see that a 12-bit resolution in the spatial domain requires a 16-bit resolution in the DCT domain. The DCT in part 8 is now specified in fixed point precision such that 32 bits, including integer and fractional bits, are always sufficient for the coefficients and all intermediate results. The part 8 design uses four fractional bits to represent the output of the YCbCr to RGB transformation, and nine fractional bits for the internal representation of the coefficients of the DCT. Clearly, the DCT can at most expand its input by a factor of eight (ie, by three bits); it reaches this maximum expansion gain for the DC coefficient if the input is constant. In total, this makes 16 + 3 + 4 + 9 = 32 bits.

The lookup table in the base layer in part 7 profile C and parts 6 and 8 takes the reconstructed sample values of the legacy codestream as input. Tables are therefore 256 entries long for eight-bit images, or 4096 entries long if the range of the legacy image is extended to 12 bits by means of the refinement coding path.

The same considerations also hold for the residual image. If the DCT is enabled for lossy coding, image coefficients in DCT space are at most 16 bits wide, or 12 bits in the spatial domain, again requiring at most 32 bits for intermediate results within the DCT. In the DCT bypass mode used for lossless coding, coefficients are always 16 bits wide and are encoded directly in the spatial domain. No further lookup table is needed in part 8 as the residual coefficients represent the error signal directly.

The remaining decoder logic of the extension layer can be implemented in 16-bit integer logic as the standard requires wraparound (ie, modulo) arithmetic. One minor glitch remains — namely, that the DCT bypass signal of part 8 may create the exceptional amplitude $-2^{15}$, which is the only DCT amplitude that cannot be represented by the 12-bit Huffman coding mode. Part 8 includes an “escape code” for this exceptional case. Similarly, if the residual layer uses the lossless integer-to-integer DCT for part 8 coding, the amplitude of the DCT coefficients may grow as large as $2^{16+4} = 2^{20}$, and an extended Huffman alphabet becomes necessary to allow lossless encoding. For parts 6 and 7, this is never an issue since even the eight-bit sequential Huffman scan type usually provides sufficient quality. Smart encoders may avoid the large amplitudes or escape codes of part 8 completely, either by using a progressive mode or refinement coding in the extension layer, or by ensuring that the quality of the base layer is high enough to avoid error signals overrunning the dynamic range of the legacy coding modes. In the latter case, the extended scan types fall back to the known sequential Huffman scan.

12.6 Coding Performance

Evaluation of the performance of JPEG XT and its profiles is unfortunately not straightforward. First, the practical experience in objective and subjective evaluation of IDR and HDR image compression is still limited, and unlike LDR coding, objective quality indices and subjective evaluation procedures are much less tested and understood. More details on IDR image and especially HDR image quality evaluation are given in Chapter 17. Second, the design of JPEG XT itself as a two-layered codec complicates matters because the configuration space of an encoder — base and extension layer quality — is two-dimensional, and the choice of the base image influences the performance of the overall encoder.

12.6.1 IDR Coding Performance

IDR image representations are based on integer samples, and thus evaluation by established indices such as the peak signal-to-noise ratio (PSNR) is, at least in principle, possible. Nevertheless, some caution in interpreting such simple indices is certainly required, even if IDR imaging does not cover the same dynamic range as HDR imaging. This is because IDR images are typically not gamma-corrected, unlike LDR images; while gamma correction in the LDR regime creates at least an approximately perceptually uniform space, this is no longer given for IDR and HDR coding. For simplicity, and because of the lack of any better known alternative, only PSNR and multiscale structural similarity (SSIM) (Wang et al., 2003) figures will be presented, despite their known limited correlation with subjective quality. Multiscale SSIM is an objective “top-down” quality index that first separates reference and distorted images into several scales, then segments each scale into small blocks and measures luminance, contrast, and structure deviations within each block before pooling all scores together. It was found that the correlation of SSIM with subjective scores is better than that of PSNR, despite its relatively low complexity. For details, see Wang et al. (2003).

One typically obtains IDR images by demosaicing raw sensor data and either keeping them in the camera color space or transforming them into a scene-referred reference color space. Unlike HDR images, IDR images are typically obtained in a single exposure, although, because of improved sensor technology, at bit depths beyond eight bits per sample. The relation of IDR sample values to physical intensities is here given by the response curve of the camera, which is, however, typically unknown. The test images in Fig. 12.4 were obtained in such a way (ie, by the demosaicing of “raw” camera streams and their transformation into the 16-bit linear scRGB color space). While this color space defines the same primaries as sRGB, it still allows an extended gamut by including an additive offset in the sample representation. Out-of-gamut colors are hence represented by sample values below the offset, and in-gamut colors are represented by coordinates larger than the offset. Thus, the offset shifts color coordinates which would otherwise be negative into the positive semiaxis.

Figure 12.4 The images DSC_218 (left) and DSC_532 (right) (both courtesy of Jack Holm) here mapped from scRGB to sRGB to facilitate printing.

The images in Fig. 12.4 were then compressed by JPEG XT part 6, then expanded again, and the PSNR and multiscale SSIM (Wang et al., 2003) were measured in the IDR regime and plotted against the total bit rate (Fig. 12.5). Nothing new can be learned about the quality of the base layer alone as this is simply defined by the legacy JPEG standard.

Figure 12.5 PSNR (top) and SSIM (bottom) in the IDR domain for DSC_218 (left) and DSC_532 (right) in various configurations.

Two additional codecs have been included in the benchmark — namely, JPEG 2000 and JPEG XR, the former in PSNR-optimal and visually optimal modes. JPEG XT part 6 also allows various configurations, three of which have been tested here. For that, recall that JPEG XT requires two images as input, an IDR image and an LDR image. In the first experiment, the corresponding LDR images forming the base layer were obtained by conversion of the scRGB images with a color management system to sRGB. When used in its “perceptual” setting, the color management system used here — namely TinyCMS — includes a simple automatic tone mapping to generate perceptually “convincing” results. The base and extension layer quality settings of the encoder were then chosen such that the target quality was maximized for a given bit rate constraint. The resulting curve is denoted as “JPEG XT” in Fig. 12.5.

Leaving the quality parameter of the legacy JPEG codestream unconstrained, as done above, may, however, result in unpleasing LDR images, an effect that was avoided in the second experiment by disallowing quality parameters below 75 for the LDR codestream. This shrinks the parameter space for optimization, creates additional overhead, and hence lowers the corresponding PSNR and SSIM scores of the HDR image. This curve is denoted by “JPEG XT > 75” in Fig. 12.5.

For the third experiment, the simple scRGB to sRGB map was replaced by the global Reinhard tone-mapping operator to study the dependency of the quality on the tone-mapping algorithm. The base quality is here again unconstrained. As seen from the curve “JPEG XT (Reinhard),” this has only little impact on the overall performance.

Compared with more advanced technologies such as JPEG 2000 or JPEG XR, JPEG XT is, of course, at a disadvantage owing to its simpler coding technology and the inclusion of a base channel. One should additionally keep in mind that the two other codecs are usually run in PSNR-optimized mode, whereas the JPEG quantization matrix already implements a form of visual weighting by quantizing higher-frequency bands more aggressively than the more visible lower-frequency bands. The SSIM scores of JPEG XT are therefore usually closer to those of the more advanced technologies, although the difference is also image dependent.

The dependency of the overall quality on the base and extension layer quality parameters is seen much better in a three-dimensional plot (Fig. 12.6, where PSNR is plotted against the base layer and extension layer rates). The tone mapper is again the one defined by TinyCMS. The plots show a curious difference between the two images. For DSC_218, the base and the extension layer contribute approximately equally to the overall PSNR, whereas for DSC_532 the bit rate must be invested mostly in the extension layer, and the overall quality depends hardly at all on the base layer rate. The difference lies in the nature of the images: DSC_532 is an almost bitonal image with very light and very dark regions, requiring an extreme tone mapping to convert it to LDR, whereas DSC_218 is comparably moderate in nature.

Figure 12.6 PSNR in the IDR domain for DSC_218 (left) and DSC_532 (right), parameterized by base and extension layer rate.

12.6.2 Coding Performance on HDR Images

Even leaving the relevance of such quality indices aside, PSNR and SSIM cannot be carried over directly to the HDR regime as was possible for IDR, simply because the definitions of both indices depend on a natural maximum value or maximum brightness, which does not exist for HDR. Furthermore, neither index is scale independent, as required for relative radiance formats (ie, multiplying all sample values by the same scale alters the value of the index).

A seemingly obvious replacement for the PSNR is the signal-to-noise ratio (SNR), defined as

$$\mathrm{SNR} := 10\log_{10}\frac{\sum_i x_i^2}{\sum_i \left(x_i - y_i\right)^2}, \tag{12.9}$$

because it does not depend on a scale. In Eq. (12.9), $x_i$ is the sample value (grayscale value) of the original image and $y_i$ is the sample value of the reconstructed (distorted) image. Unfortunately, this definition brings a number of problems of its own. While the PSNR has some known cases of failing to evaluate image defects according to their visibility, it is even easier to construct defects in HDR images that are misclassified by the SNR. A small relative error in a very light image region causes an almost invisible defect, yet lowers the SNR significantly. Conversely, a small-amplitude defect in a dark image patch may give rise to a significant image defect that is obvious to human observers; the SNR will, however, hardly change, because the low-amplitude defect contributes little to the error term in the denominator.
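
A toy example makes this failure mode concrete. The luminance levels and error magnitudes below are arbitrary, chosen only to contrast a barely visible relative error in a bright region against an obvious defect in a dark one.

```python
import numpy as np

def snr_db(x, y):
    """Scale-independent SNR of Eq. (12.9), in decibels."""
    return 10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))

bright = np.full(10_000, 1000.0)  # very light region
dark = np.full(10_000, 1.0)       # dark region
scene = np.concatenate([bright, dark])

y1 = scene.copy(); y1[:10_000] *= 1.01  # 1% error in the light region
y2 = scene.copy(); y2[10_000:] *= 1.50  # 50% error in the dark region

print(snr_db(scene, y1))  # ~40 dB: heavily penalized, though barely visible
print(snr_db(scene, y2))  # ~66 dB: barely penalized, though clearly visible
```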

A better option for making traditional metrics available for HDR image coding is to first map the absolute luminances, as given by the sample values of the original and the reconstructed image, into a perceptually uniform space, and then apply the known metrics in this space instead. If the input scene is described in relative radiance, a suitable scale factor for deriving absolute radiance values must first be estimated, though. One candidate for such a map is the PU-Map of Aydin et al. (2008); an even simpler map is given by the logarithm, followed by clamping. The motivation for the latter is Weber’s law (Netravali and Haskell, 1989), which states that the sensitivity of the human visual system is approximately inversely proportional to the amplitude of the input stimulus.
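
A minimal sketch of the logarithmic variant follows; the clamping bounds are illustrative assumptions, standing in for the luminance range the display or observer can resolve.

```python
import numpy as np

def log_psnr_db(x, y, lum_min=1e-4, lum_max=1e4):
    """PSNR computed after a clamped log10 map, a crude stand-in for a
    perceptually uniform space (lum_min/lum_max are illustrative bounds)."""
    lx = np.log10(np.clip(x, lum_min, lum_max))
    ly = np.log10(np.clip(y, lum_min, lum_max))
    peak = np.log10(lum_max) - np.log10(lum_min)  # dynamic range in log units
    return 10 * np.log10(peak**2 / np.mean((lx - ly)**2))
```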

While this approach may seem somewhat ad hoc, one should recall that a similar map is, implicitly, also applied in LDR imaging. Image data there are usually gamma-corrected and hence recorded in a nonlinear space. The gamma correction, even though historically motivated differently, can be understood as a comparable approximation of the map from the physical space to an approximately uniform perceptual space.

Despite these attempts to carry known expressions over to the HDR regime, a couple of quality indices have been designed explicitly for HDR imaging, and the area of HDR image and video quality metrics is discussed in much more detail in Chapter 17. For the purpose of this chapter, three quality indices will be used. One of these is Daly’s visual difference predictor (Daly, 1992; Mantiuk et al., 2005), extended and enhanced in a later version by Mantiuk et al. (2011); another is the mean relative square error (MRSE) (Richter, 2009); the third, PU-PSNR, is introduced below. Unfortunately, only limited data currently exist to assess the performance of these indices themselves — that is, there is only limited evidence that they correlate well with image quality as observed in subjective experiments. The reason is that subjective evaluation of HDR image quality is a relatively young field, and not much is known about suitable test conditions and test protocols, partially because equipment to reproduce such images has become available only recently. Unlike in the evaluation of LDR data, the response curve and the properties of the particular monitor used for image reproduction must be taken into consideration, as they are another source of nonlinearities. Calibration of the overall signal path from sample values to observable radiance is nontrivial and beyond the scope of this section.

As in Section 12.6.1, the HDR image quality obtained after JPEG XT compression, as estimated by an appropriate index, is a function of two parameters: the base layer bit rate and the extension layer bit rate. The bit rate itself is controlled by a JPEG “quality” parameter, which acts as a scaling factor on the default quantization matrix suggested as an example in the JPEG standard. This choice of quantization matrix is not necessarily the best for all profiles, although research results on better alternatives are not yet available.
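
To make this rate control concrete: the scaling law itself is not specified by the standard, so the sketch below follows the widely used libjpeg convention (an assumption here), applied to the example luminance quantization matrix from Annex K of ISO/IEC 10918-1.

```python
import numpy as np

# Example luminance quantization matrix from Annex K of ISO/IEC 10918-1.
ANNEX_K_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def scaled_quant_matrix(quality):
    """Scale the example matrix by a 'quality' in 1..100, following the
    common libjpeg convention (the standard itself mandates no scaling law)."""
    q = min(max(int(quality), 1), 100)
    scale = 5000 // q if q < 50 else 200 - 2 * q
    return np.clip((ANNEX_K_LUMA * scale + 50) // 100, 1, 255)
```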

Three objective quality indices have been tried here — namely, PU-PSNR, MRSE, and HDR-VDP2. PU-PSNR first maps the data into a perceptually uniform space with the PU2-Map defined in Aydin et al. (2008) and then measures the PSNR there. The second, the MRSE, is measured directly in the image domain and is defined as

$$\mathrm{MRSE} := 10\log_{10}\sum_i \frac{\left(x_i - y_i\right)^2}{x_i^2 + y_i^2}. \tag{12.10}$$

As a Taylor expansion shows, this is approximately identical to measuring the PSNR after application of a logarithmic mapping: writing $y_i = x_i(1+\varepsilon_i)$ for small relative errors $\varepsilon_i$, each term in Eq. (12.10) reduces to approximately $\varepsilon_i^2/2$, that is, half the squared error in the logarithmic domain. The third quality index is HDR-VDP2, defined in Mantiuk et al. (2011), which incorporates an extensive model of the human visual system: a map to a perceptual luminance space, separation into perceptually defined frequency bands, masking, and error pooling. More details on HDR-VDP2 are given in Daly (1992) and Mantiuk et al. (2005, 2011).
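
Eq. (12.10) is straightforward to implement. In the sketch below, the small epsilon guarding sample pairs that are both zero is an implementation choice, not part of the definition; the second print verifies the Taylor argument numerically against a log-domain proxy.

```python
import numpy as np

def mrse_db(x, y, eps=1e-30):
    """MRSE of Eq. (12.10); eps guards sample pairs that are both zero."""
    rel = (x - y)**2 / (x**2 + y**2 + eps)
    return 10 * np.log10(np.sum(rel) + eps)

# For y = x * (1 + e) with small e, each term is approximately e^2 / 2,
# ie, half the squared log-domain error, so the two values nearly agree.
x = np.array([1.0, 10.0, 100.0])
y = x * 1.001
print(mrse_db(x, y))                                # via Eq. (12.10)
print(10 * np.log10(np.sum(np.log(x / y)**2) / 2))  # log-domain proxy
```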

The HDR images for this study were taken from the HDR Photographic Survey (Fairchild, 2015), which contains 106 images obtained through a multiple-exposure process: photographs of the same scene were taken multiple times with different exposure times to cover a large range of luminances, and then computationally merged into a single image per scene. Such computational methods are beyond the scope of this chapter; in the absence of additional calibration, their outputs are relative radiance images, that is, the floating point sample values obtained are (approximately) proportional to the luminance of the original scene. For details, see Reinhard et al. (2010). Some (tone-mapped) example images from this test set are shown in Fig. 12.7.

Figure 12.7 Two images used for the HDR compression performance. Courtesy of Mark Fairchild’s HDR Photographic Survey: the BloomingGorse2 image (left) and the 507 image (right). Both images have been tone-mapped for printing.

Comparison with subjective quality scores, even though not performed in this chapter, requires an additional step — namely, the conversion of the relative or absolute radiance data into an absolute radiance, display-referred color space that describes the characteristics of the monitor used in the test sessions. Such monitors are already available on the market as professional equipment. In this particular evaluation, a maximum luminance of 4000 cd/m² was assumed to define the mapping.
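
One simple way to perform such a conversion is sketched below. Anchoring the scene maximum at the display peak is an assumption made for illustration; other anchors (eg, diffuse white) are equally plausible.

```python
import numpy as np

def to_display_referred(relative, peak_cd_m2=4000.0):
    """Scale relative radiance so its maximum hits the display peak
    luminance in cd/m^2 (maximum anchoring is an illustrative choice)."""
    relative = np.asarray(relative, np.float64)
    return relative * (peak_cd_m2 / relative.max())
```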

The performance of a JPEG XT encoder depends not only on the HDR input image but also on the selection of the LDR image encoded in the base layer. Here, the base images were taken from the internal JPEG XT “validation test,” which was run to check the suitability and correctness of the various proposals. In particular, the tone-mapped LDR images were prepared with Photoshop in a semiautomatic process in which parameters were adjusted by a human observer to generate “pleasing” LDR images. While somewhat arbitrary, this procedure mimics the typical HDR workflow for which JPEG XT has been designed. The difference between the test here and the JPEG XT validation test is that this test uses absolute radiance, output-referred images, whereas the latter used the original relative radiance, scene-referred images.

Each HDR-LDR image pair was compressed with each of the three profiles of part 7 (profiles A, B, and C), varying both the base layer and the extension layer quality and measuring the base and extension layer bit rates along with the image quality, as approximated by PU-PSNR, MRSE, and HDR-VDP2. Fig. 12.8 plots HDR-VDP2 against the base layer and the extension layer bit rate for four images. BloomingGorse2 is a structurally very complex image with a medium dynamic range; 507 contains many flat areas but is almost bitonal and has a higher dynamic range than BloomingGorse2; McKeesPub and MtRushmore2 lie somewhere between these extremes. Figs. 12.9 and 12.10 show the performance of the same profiles on the very same images, but measured by the MRSE and the PU-PSNR, respectively. As can be seen from the figures, profile A performs quite well on complex images and shows an advantage at medium to low bit rates, whereas profile C performs better in the high-quality regime on relatively flat images. What cannot be observed from the plots is that the quality of profile C does not saturate toward high bit rates (ie, profile C implements scalable lossy-to-lossless compression). Profiles A and B level out toward higher bit rates, and increasing the rate beyond a threshold does not improve the performance any further.

Figure 12.8 HDR-VDP2 versus extension and base layer quality for all profiles of part 7. From left to right, top to bottom: BloomingGorse2, 507, McKeesPub, MtRushmore2.
Figure 12.9 MRSE versus extension and base layer quality for all profiles of part 7. From left to right, top to bottom: BloomingGorse2, 507, McKeesPub, MtRushmore2.
Figure 12.10 PU2-PSNR versus extension and base layer quality for all profiles of part 7. From left to right, top to bottom: BloomingGorse2, 507, McKeesPub, MtRushmore2.

As can be seen, the relative performance of the three profiles is quite consistent across the metrics, except for the complex BloomingGorse2 image, where the advantage goes to either profile A or profile C depending on the quality index used. Subjective tests have shown that HDR-VDP2 usually correlates better with subjective scores than the far simpler indices do.

A somewhat more conventional one-dimensional plot of quality, as measured by HDR-VDP2, versus the bit rate is given in Fig. 12.11. In this case, an extensive search of the two-dimensional base/extension layer rate parameter space was made to obtain the best possible quality under a given rate constraint. The same trend as in Fig. 12.8 can be observed, showing advantages for profiles A and B at low bit rates and for profile C at high bit rates. In this test, the base rate is unconstrained and hence possibly very low; the LDR quality is therefore uncontrolled and may be too low to provide an image of reasonable quality.
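
Such a search can be organized as a brute-force sweep over an encode grid. The sketch below assumes a list of (base rate, extension rate, quality) triples has already been measured; all names and values are hypothetical.

```python
def best_quality_per_budget(grid, budgets):
    """grid: measured (base_rate, ext_rate, quality) triples from an encode
    sweep. For each total-rate budget, return the best quality over all
    feasible base/extension combinations (None if no combination fits)."""
    results = {}
    for budget in budgets:
        feasible = [q for base, ext, q in grid if base + ext <= budget]
        results[budget] = max(feasible) if feasible else None
    return results

# Hypothetical usage: three grid points, budgets in bits per pixel.
grid = [(0.2, 0.5, 30.1), (0.5, 0.5, 31.4), (0.2, 1.0, 33.0)]
print(best_quality_per_budget(grid, [0.8, 1.2]))  # {0.8: 30.1, 1.2: 33.0}
```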

Figure 12.11 VDP scores as two-dimensional plots, selecting the best possible base layer-extension layer rate combination to maximize quality. From left to right, top to bottom: BloomingGorse2, 507, McKeesPub, MtRushmore2.

Fig. 12.12 shows the results when the base layer quality is constrained to be at least 75, resulting in a good-quality LDR preview. Here, profile C falls even further behind at low bit rates. However, it is the only profile whose quality does not saturate with increasing bit rate; it remains the best choice if scalable lossy-to-lossless coding is required or very high qualities must be reached.

Figure 12.12 VDP scores as two-dimensional plots, selecting the best possible base layer-extension rate combination to maximize quality under the constraint of a base image quality of at least 75. From left to right, top to bottom: BloomingGorse2, 507, McKeesPub, MtRushmore2.

12.7 Conclusions

JPEG XT is a very flexible and rich image compression framework, allowing lossy and lossless compression of IDR and HDR data, including the coding of opacity information. JPEG XT addresses the needs of a “modernized JPEG,” filling the gaps the legacy JPEG standard left: scalable lossy-to-lossless compression, support for bit depths between eight and 16 bits, support for floating point samples, and coding of opacity data. Unlike other standardization initiatives, JPEG XT is fully backward compatible with legacy applications and always includes a lossy eight-bit version of the image in its codestream, which is visible to legacy applications. Clearly, carrying the legacy of the more than 20-year-old JPEG coding scheme costs performance, and JPEG XT cannot match more modern compression algorithms such as JPEG 2000 or JPEG XR. However, JPEG XT can be easily implemented on the basis of two readily available JPEG intellectual property cores and a single digital signal processor for postprocessing. Unlike more modern architectures, JPEG XT keeps the legacy JPEG tool chain working, allowing existing software and hardware architectures to access JPEG XT images without the compatibility problems of other image compression formats.

References

Aydin T.U., Mantiuk R., Seidel H.P. Extending quality metrics to full luminance range images. In: Proceedings of Human Vision and Electronic Imaging XIII (SPIE 2008). 2008.

Boliek M. Information Technology — The JPEG2000 Image Coding System: Part 1, ISO/IEC IS 15444-1. 2000.

Brower B., Clark R., Hinds A., Lee D., Sullivan G. JPEG File Interchange Format, ISO/IEC 10918-5. 2011.

Daly S. The visible differences predictor: an algorithm for the assessment of image fidelity. In: SPIE Vol. 1666: Human Vision, Visual Processing, and Digital Display. 1992.

Fairchild M. The HDR Photographic Survey. 2015. Available online at http://rit-mcsl.org/fairchild/HDR.html.

IEEE Computer Society. IEEE Standard for Floating Point Arithmetic, IEEE Std 754-2008. New York: IEEE; 2008.

ISO SC 29 (WG 1). Proposal for a New Work Item — JPEG Extensions. 2012. Available on www.jpeg.org as Document wg1n6164.

ISO/ITU. Information Technology — Digital Compression and Coding of Continuous-Tone Still Images — Requirements and Guidelines. 1992. Published by ISO/IEC as 10918-1 and by ITU as T.81.

Mantiuk R., Daly S., Myszkowski K., Seidel H.P. Predicting visible differences in high dynamic range images — model and its calibration. In: Proceedings of Human Vision and Electronic Imaging X, IS&T/SPIE’s 17th Annual Symposium on Electronic Imaging, 2005. 2005:204–214.

Mantiuk R., Efremov A., Myszkowski K., Seidel H. Backward compatible high dynamic range MPEG video compression. In: Proceedings of SIGGRAPH 2006 (Special Issue of ACM Transactions on Graphics). 2006.

Mantiuk R., Krawczyk G., Myszkowski K., Seidel H. High dynamic range image and video compression — fidelity matching human visual performance. In: Proceedings of the 2007 IEEE International Conference on Image Processing (ICIP). 2007.

Mantiuk R., Kim K.J., Rempel A.G., Heidrich W. HDR-VDP-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Trans. Graph. 2011;30(4).

Netravali A., Haskell B. Digital Pictures: Representation, Compression and Standards. second ed. Berlin: Springer; 1995.

Netravali A.N., Haskell B.G. Digital Pictures: Representation and Compression. New York and London: Plenum Press; 1989.

Pennebaker W.B., Mitchell J.L. JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold; 1992.

Reinhard E., Ward G., Pattanaik S., Debevec P., Heidrich W., Myszkowski K. High Dynamic Range Imaging. second ed. Boston/Amsterdam: Kaufmann/Elsevier; 2010.

Richter T. Evaluation of floating point image compression. In: Proceedings of the International Workshop on Quality of Multimedia Experience (QoMEX 2009). 2009.

Srinivasan S., Tu C., Zhou Z., Ray D., Regunathan S., Sullivan G. An Introduction to the HDPhoto Technical Design. JPEG document WG1N4183. 2007.

Taubman D. High performance scalable image compression with EBCOT. IEEE Trans. Image Process. 2000;9(7):1151–1170.

Wallace G. The JPEG still picture compression standard. IEEE Trans. Consum. Electron. 1992;38(1):xviii–xxxiv.

Wang Z., Simoncelli E., Bovik A. Multi-scale structural similarity for image quality assessment. In: IEEE Asilomar Conference on Signals, Systems and Computers. 2003.

Ward G., Simmons M. JPEG-HDR: a backwards compatible, high dynamic range extension to JPEG. In: International Conference on Computer Graphics and Interactive Techniques (ACM SIGGRAPH 2005). 2005.

Zhang Y., Reinhard E., Bull D. Perception-based high dynamic range video compression with optimal bit-depth transformation. In: Proceedings of the 2011 IEEE International Conference on Image Processing (ICIP). 2011.


1 Note that this measure is undefined if the intensity should ever be zero (ie, no light at all). This rarely, if ever, happens in nature.

2 Confusingly, the MPEG community already denotes this as “HDR.”

3 Actually this is not a valid progressive coding mode because of an artificial restriction introduced in legacy JPEG.
