Chapter 7

Evaluation of Tone Mapping Operators for HDR Video

G. Eilertsen*; J. Unger*; R.K. Mantiuk†
*Linköping University, Norrköping, Sweden
†University of Cambridge, Cambridge, United Kingdom

Abstract

Tone mapping of high dynamic range video is a challenging problem. It is highly important to develop a framework for the evaluation and comparison of tone mapping operators to assess their relative performance and expected quality. This chapter gives an overview of different approaches to conducting evaluations of tone mapping operators, including experimental setups, the choice of input data, the choice of tone mapping operators, and the importance of tone mapping parameter adjustment for fair comparisons. The chapter also gives examples of previous evaluations, with a focus on the results from the most recent evaluation, conducted by Eilertsen et al. (2013). This results in a classification of current video tone mapping operators and an overview of their performance and possible artifacts.

Keywords

Subjective evaluation of tone mapping operators; Tone mapping operator evaluation; Video tone mapping operator evaluation; Video tone mapping; Subjective quality assessment methodology; Tone mapping operator evaluation meta-analysis; Tone mapping operator evaluation classification

7.1 Introduction

Tone mapping of high dynamic range (HDR) video is a challenging problem. Since a large number of tone mapping operators (TMOs) have been proposed in the literature, it is highly important to develop a framework for evaluation, comparison, and categorization of individual operators so that users can find and assess which is most suitable for a specific problem.

TMO evaluation has received significant attention for more than a decade, but there is still no standard method for conducting TMO evaluation studies. Since the performance of a TMO is highly subjective, perceptual studies are an important tool in this process. This chapter gives an overview of different approaches to conducting evaluations of TMOs, including experimental setups, the choice of input data, the choice of TMOs, and the importance of parameter adjustment for fair comparisons. The chapter also discusses and categorizes previous evaluations, with a focus on the results from the most recent evaluation, conducted by Eilertsen et al. (2013). This results in a classification of current video TMOs, an overview of their performance and possible artifacts, and guidelines regarding the problems a video TMO needs to address.

Ideally, one would use a direct, computational comparison between the result of a TMO and a ground truth image, leaving out all the possible error sources associated with a subjective comparison experiment. There are also metrics for automatic visibility and quality prediction of HDR images (Mantiuk et al., 2011). However, since comparisons of TMOs occur after transformation through the human visual system (HVS), the ground truth and the tone-mapped image cannot be compared numerically in a direct manner. Another method, not involving simulation of the HVS, was suggested by Yeganeh and Wang (2013). The method estimates the structural fidelity and the naturalness of a tone-mapped image, and combines these measures to form a quality index. Although this quality index was shown to correlate with data from subjective evaluations, the heuristics involved in the quality prediction cannot account for all the complex processes involved in a subjective comparison. As of today, there are no models comprehensive enough to entirely replace the human eye in a comparison, taking into account all the different aspects of the HVS. Furthermore, if we seek the most preferred method in terms of some subjective measure, this involves not only the HVS but also high-level functions of the human brain, functions that are individual and depend on the specific criteria used for comparison.

7.2 Subjective Quality Assessment Method

Performing a perceptual evaluation study is a complicated task, and a range of parameters affect the outcome of the experiment. If we closely define and control the different parts of an evaluation, the outcome is more reliable and consistent. However, it is important to appreciate that the results are fully valid only under the specific conditions used during the evaluation, and may not always generalize well to other situations. This is also reflected in previous studies reported in the literature, where the outcome of a perceptual evaluation may vary substantially depending on the experimental setup and evaluation criteria used.

Important aspects in the comparison are, for example, what ground truth, if any, is used (evaluation method), and under which criteria the comparison is made (intent of TMO). It does not make sense to ask which tone reproduction looks best if we want to determine which one is closest to how the depicted scene would be perceived in real life, as these criteria are conceptually different (like comparing apples and oranges). At the same time, it is important to recognize that a TMO intended for a certain purpose can potentially produce results that compare favorably with other methods even when the methods are judged under a different criterion.

Furthermore, even if the actual comparisons and their purpose are closely defined, there are numerous accompanying aspects that may affect the outcome of an experiment. For example, a small difference in how the experiment is explained to the observers can potentially impact their image assessments. Environmental factors also need to be considered: the experiment should be designed to be as similar as possible to the typical environment in which the methods are supposed to be used. Here it is also important to calibrate the algorithms to the specific conditions, for example, with correct adjustment for the particular screen used.

In what follows, we try to categorize some of the many parameters associated with subjective evaluations. These are considered to be among the most influential for a robust and valid evaluation, and should be carefully considered when one is designing a TMO comparison study. Using the categorization, we provide a brief survey in the next section, which includes the most relevant subjective studies of tone mapping methods to date.

It should be noted, though, that not only is the final outcome of an evaluation sensitive to the experimental design — the data produced also need to be examined properly. One crucial aspect is to investigate not just the mean performance of the different methods considered; there should also be a thorough analysis of the variance of the results, in order to determine whether there is actual statistical evidence for the differences (Mantiuk et al., 2012). Even when a clear distinction appears in the scores or ratings averaged over the participating observers, this does not imply statistically significant differences. To reduce the uncertainty, a large number of measurements (observers) may be needed, which in many situations is unfeasible.
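As an illustration of why averaged scores alone are not enough, the following minimal sketch tests whether a majority preference in a pairwise experiment is statistically significant. The counts are made up for illustration, and the sketch assumes SciPy (1.7 or later) for scipy.stats.binomtest; it is not taken from any of the studies discussed in this chapter.

```python
from scipy.stats import binomtest

votes_for_a = 14   # observers preferring TMO A over TMO B (made-up counts)
votes_total = 20

# Null hypothesis: the two operators are equally preferred (p = 0.5).
result = binomtest(votes_for_a, votes_total, p=0.5)
print(f"A preferred by {votes_for_a}/{votes_total}, p = {result.pvalue:.3f}")
# 14/20 is a clear majority on average, yet p is about 0.115 -- not
# significant at the 0.05 level; more observers would be needed.
```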

7.2.1 TMO Intent

It is important to recognize that different TMOs try to achieve different goals (McCann and Rizzi, 2012); for example, perceptually accurate reproduction, faithful reproduction of colors, or the most preferred reproduction. In an evaluation this particular intent is critical for how the different methods are compared. First, the design of the experiment completely relies on how the algorithms are supposed to be compared, where, for instance, the evaluation method used needs to be in line with the intent of the TMOs. As an example, it does not make sense to compare a TMO with a real-world scene if the evaluation is supposed to be in terms of the subjectively most preferred reproduction, since this could deviate substantially from the real scene. Also, an important aspect of a study is how the experiment is explained to the observer, making sure that the intent of the tone reproduction is clear.

After analyzing the intents of existing operators, we can distinguish three main classes:

Visual system simulators try to simulate the limitations and properties of the HVS. For example, a TMO can add glare, simulate the limitations of human night vision, or reduce colorfulness and contrast in dark scene regions. Another example is the adjustment of images for the difference between the adaptation conditions of real-world scenes and the viewing conditions (including chromatic adaptation).

Scene reproduction operators attempt to preserve the original scene appearance, including contrast, sharpness, and colors, when an image is shown on a device with reduced color gamut, contrast, and peak luminance.

Best subjective quality operators are designed to produce the most preferred images or video in terms of subjective preference. This is a relatively wide term, since the best subjective quality may depend on the situation, and it could be further specified to be in line with a certain artistic goal.

7.2.2 Evaluation Method

One of the central decisions when one is designing a subjective comparison experiment is what ground truth we are trying to reproduce. How should this, if possible, be displayed? Should it be displayed at all? The choice depends not only on the application and what is relevant to the study, but also on physical limitations. In Fig. 7.1, the different steps of HDR capture and reproduction are outlined, highlighting possible locations suited for comparison. Referring to Fig. 7.1, we distinguish the following evaluation methods:

Figure 7.1 Evaluation methods in the context of the tone mapping pipeline. The comparisons depicted are made on the basis of subjective metrics and correspond to the evaluation method used. LDR, low dynamic range.

 Fidelity with reality methods, where a tone-mapped image is compared with a physical scene. This type of study is challenging to execute, in particular for video, because it involves displaying both a tone-mapped image and the corresponding physical scene in the same experimental setup. Furthermore, the task is very difficult for the observers, as the displayed scenes differ from real scenes not only in dynamic range: they also lack depth cues and have a restricted field of view and a limited color gamut. These factors usually cannot be controlled or eliminated. Moreover, this setup does not capture the actual intent when the content is meant to be enhanced rather than faithfully reproduced.
Despite these issues, the method has been used in a number of studies (see Table 7.1). It directly tests one of the main intents of tone mapping — namely, its ability to reproduce a convincing picture of a real-world scene as it would be perceived by a human observer.

Table 7.1

Categorization of TMO Evaluations

Each row marks (×) the categories that apply, ordered as follows. Evaluation method: Fidelity With Reality, Fidelity With HDR, Nonreference; experimental setup: Rating, Ranking, Pairwise Comparison; TMO intent: Visual System Simulator, Best Subjective Quality; parameter settings: Default, Tuned by Experts, Pilot Study; input material: Image, Video. The last three columns give the numbers of TMOs, participants, and scenes.

Drago et al. (2002): × × × × ×(a) ×; 7 TMOs, 11 participants, 4 scenes
Kuang et al. (2004): × × × × ×; 8 TMOs, 33 participants, 10 scenes
Yoshida et al. (2005): × × × × ×; 7 TMOs, 14 participants, 2 scenes
Delahunt et al. (2005): × × × × ×; 4 TMOs, 20 participants, 25 scenes
Ledda et al. (2005): × × × × ×; 6 TMOs, 48 participants, 23 scenes
Ashikhmin and Goyal (2006): × × × × × × ×(b) ×; 5 TMOs, 15 participants, 4 scenes
Yoshida et al. (2006): × × Tuning(c) × × Tuning(c) ×; 1(c) TMO, 15 participants, 25 scenes
Kuang et al. (2007): × × × × × × × ×; 7 TMOs, 33 participants, 12 scenes
Akyüz et al. (2007): × × × NA ×; 3(d) TMOs, 22 participants, 10 scenes
Park and Montag (2007): × × ×(e) × ×; 9 TMOs, 25 participants, 6 scenes
Akyüz and Reinhard (2008): LDR reference(f) × ×(f) × ×; 7 TMOs, 13 participants, 1(f) scene
Čadík et al. (2008): × × × × × × ×; 14 TMOs, 20 participants, 3 scenes
Villa and Labayrade (2010): × × × × NA ×; 6 TMOs, 50 participants, 5 scenes
Kuang et al. (2010): × × × × × ×(g) ×; 7 TMOs, 23 participants, 4 scenes
Petit and Mantiuk (2013): × × × × × ×(h) ×; 4 TMOs, 10 participants, 7 scenes
Eilertsen et al. (2013): × × × × × ×; 11 TMOs, 18 participants, 5 scenes

Many studies involve multiple experiments, using different methods and criteria. For the numbers listed, TMOs refer to the total set of operators considered in the complete study, while participants and scenes are the ones reported for the largest comparative experiment performed in the study.

LDR, low dynamic range; NA, not applicable.

(a) TMOs were also matched to have similar overall brightness.

(b) Parameters tuned by the TMO authors where default values could not be obtained.

(c) Used a generic operator, and the experiment was performed as a tuning of its parameters.

(d) In addition to the TMOs, the original HDR image and single exposures were included.

(e) Used a “scientific usefulness” subjective criterion for evaluation.

(f) Evaluated the perceived contrast through the Cornsweet-Craik-O’Brien illusion.

(g) Two Photoshop TMOs were calibrated by an expert.

(h) Five different settings were used, treated as separate operators for evaluation.

 Fidelity with HDR reproduction methods, where content is matched against a reference shown on an HDR display. Although HDR displays offer a potentially large dynamic range, some form of tone mapping, such as absolute luminance adjustment and clipping, is still required to reproduce the original content. This introduces imperfections in the displayed reference content. For example, an HDR display will not evoke the same sensation of glare in the eye as the actual scene. However, the approach has the advantage that the experiments can be run in a well-controlled environment and, given the reference, the task is easier.

 Nonreference methods, where observers are asked to evaluate operators without being shown any reference. In many applications there is no need for fidelity with “perfect” or “reference” reproduction. For example, consumer photography is focused on making images look as good as possible on a device or print alone, as most consumers will rarely judge the images by comparing them with real scenes. Although the method is simple and targets many applications, it carries the risk of running a “beauty contest” (McCann and Rizzi, 2012), where the criteria of evaluation are very subjective. In the nonreference scenario, it is commonly assumed that tone mapping is also responsible for performing color editing and enhancement. But since people differ a lot in their preference for enhancement (Yoshida et al., 2006), such studies lead to very inconsistent results. In some situations though, the purely subjective preference, with no connection to the real scene, is the criterion we want to assess. Many practical situations do not really aim for a result as close to the pictured scene as possible, but aim rather for an aesthetically pleasing outcome given some artistic preferences. In these situations a nonreference method is often needed.
If the factors affecting assessment are well controlled and the evaluation criteria are clearly described, the method provides a convenient way to test TMO performance against user expectations. It has therefore been used in most of the previous studies (see Table 7.1).

 Appearance match methods, which compare the appearance of certain attributes in both the original scene and its reproduction (McCann and Rizzi, 2012). For example, the brightness of square patches can be measured in a physical scene and on a display by magnitude estimation methods. The best tone mapping is then the one that provides the closest match between the measured perceptual attributes. Even though such direct comparison seems to be a very precise method of evaluation, it poses a number of problems. Firstly, measuring appearance for complex scenes is challenging: while measuring the brightness of uniform patches is tractable, there is no easy method to measure the appearance of all the different image attributes, such as gloss, gradients, textures, and complex materials. Secondly, a match of sparsely measured perceptual attributes does not guarantee an overall match of image appearance.

7.2.3 Parameter Settings

Most tone mapping algorithms include free parameters that can be tuned to achieve different results. To ensure a fair comparison between different TMOs in an evaluation, it is therefore necessary to perform a thorough adjustment of the parameters. This is, however, often a challenge in itself. The best results are achieved if each algorithm is tweaked independently for each input sequence used in the experiment. However, in this way the operators are tested not as automatic algorithms but rather as editing tools that depend on the skills of the individual performing the adjustment. The problem of TMO parameter adjustment is therefore to find a general set of parameters that is optimal for all intended material. Below, we present a summary and categorization of the different adjustment methods used in the evaluation literature:

 Use of default parameters is the easiest and most widely used approach. However, even though default parameters are simple to use, they are unlikely to produce the best output for a given TMO; see, for example, (Yoshida et al., 2006; Petit and Mantiuk, 2013; Eilertsen et al., 2013). In many cases, different intents of the tone mapping will benefit from different parameter settings. If an operator is used in a context which differs from the one used during development and tuning, the default parameters may be invalid and may need to be changed. For example, if the parameters have been set to give the best rendering as perceived by the HVS, another calibration may yield a subjectively more preferred result. Finally, another common problem in the literature is that some parameters of certain operators have not been reported with default values.

 Tuned by experts could, for example, include parameter adjustment by the experts conducting the experiment or by individual experts in the field. However, this may lead to a bias toward their personal preferences and is not guaranteed to generalize well. Furthermore, since many algorithms are computationally expensive and include a large number of parameters, tweaking by trial and error often means that only a relatively small number of adjustments can be included. That is, only a very sparse sampling of the parameter space is tested, which is unlikely to include the optimal point.

 Pilot study, where parameters are adjusted in a separate experiment, designed to be as similar as possible to the following main experiment. While this method has the highest potential, it is also challenging because many TMOs have high-dimensional parameter spaces and may be time consuming to adjust.
One example of a parameter adjustment experiment was reported by Yoshida et al. (2005). They designed a pilot study where a set of experienced observers were assigned to choose the best image from a selection of tone mappings with different parameter settings. However, similarly to the previous method, this means that a very limited number of possible parameter values were tested, and the optimal set was most likely not included. Another experiment was performed by Eilertsen et al. (2013), where a sparse sampling of the parameter space was also used, but where the values in-between were approximated by interpolation. To find the optimal set of parameters, a set of expert users explored the parameter space using a conjugate gradient method to find a local minimum (Eilertsen et al., 2014). More details on the method are given in Section 7.6.

7.2.4 Input Material

Intuitively, to generalize the results of a subjective study, the selection of the input material used for evaluation should allow a sampling spanning as large a portion of the general population of images as possible. This is to ensure that trends extracted from the evaluation results can be expected to generalize also to other situations. However, if the goal is to find more than general trends — for example, the average performance of a certain operator — the differences between input images make it difficult to draw definite conclusions from the evaluation results. To estimate the expected value of the results, we need to know the individual weights of the images used, drawn from the probability distribution of possible images. For example, if one scene is twice as common as a second scene in the general population of scenes, the results obtained with the first scene should be weighted twice as important in the expected quality value. It is thus difficult to make statistically sound generalizations of the results from an evaluation unless the input footage is selected with care.
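To make the weighting explicit: if the per-scene quality scores are treated as samples from the population of possible scenes, the expected quality is a probability-weighted mean. The following formulation is our own illustration of the point made above, not notation taken from the studies themselves.

```latex
% Expected quality over the population of scenes: each per-scene score Q_i
% is weighted by the probability w_i of encountering such a scene.
\mathbb{E}[Q] = \sum_{i=1}^{n} w_i \, Q_i , \qquad \sum_{i=1}^{n} w_i = 1 .
% In the example above, a scene twice as common as another gets twice the
% weight: w_1 = 2\,w_2 .
```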

The selection of input material is also important to make it possible to distinguish between different operators. For example, a simple TMO in comparison with more advanced algorithms may be able to produce comparable results in basic scenes but fail in more difficult lighting situations (Petit and Mantiuk, 2013).

To categorize, we make a distinction between static and dynamic material:

 HDR images are most often produced from a set of differently exposed images (Debevec and Malik, 1997). This imposes limitations on the scene, where, for example, people and other moving objects are difficult to capture, and where common image artifacts such as camera noise are suppressed in the reconstruction.

 HDR video has just recently become readily available through the development of HDR video capturing systems (Tocci et al., 2011; Kronander et al., 2013; Froehlich et al., 2014). Previously, HDR video was limited to panning in static HDR panoramas, static scenes with changes in lighting, and computer graphics renderings. The capturing systems provide a set of new challenges, such as complicated transitions in intensity over time, skin colors, and camera noise.

7.2.5 Experimental Setup

Once the evaluation method and the intent of the TMOs to be evaluated have been specified, another decision is how to present the information to the observers and how they should register their preferences. The decision should be based not only on what information the experiment is supposed to reveal, but also on the human aspects of the design. For example, the difficulty in making the assessments has a direct influence on the accuracy and precision of the result. Also, an experiment that lasts too long for each observer can have implications such as fatigue, which may affect the outcome.

If there is a well-advised and consistent experimental design, different methods for subjective quality assessment are expected to correlate. This has also been demonstrated in different evaluations. Kuang et al. (2007) showed a correlation between pairwise comparisons and rating, but with improved accuracy and precision for pairwise comparisons at the cost of increased experiment length. Čadík et al. (2008) detected no statistical difference between rating and ranking, concluding that for comparison of TMOs, ranking without reference is enough. Mantiuk et al. (2012) showed that four different methods for image quality assessment resulted in the same overall outcome, but differed in precision and experiment length.

Inspecting previous evaluations of TMOs, we find there are three main categories for quality assessment (mentioned in the preceding discussion):

Rating can include assessment of many different attributes at the same time, to measure differences more closely. The measures are either absolute or relative, given as the difference between a pair of displayed images. Since different observers tend to apply different absolute and relative scales, these need to be compensated for. Although the method can provide much information from few displayed conditions, the data tend to have high variance compared with, for example, pairwise comparison. It is a difficult perceptual task to assess the quality level of a certain attribute, compared with selecting the one of two images that best fits a certain criterion.

Ranking for TMO evaluation is generally implemented by displaying all TMOs for one scene, and letting the observer sort them according to the evaluation criteria. Although the task is relatively simple, it is difficult to assess all the sequences simultaneously. Practically, the task is most easily performed through a comparison of two images at a time, which in the end is conceptually similar to pairwise comparison.

Pairwise comparison is the perceptually simplest task, where one of two images is selected on the basis of closeness to some reference (physical or in memory). For N conditions, N(N − 1)/2 comparisons are needed to provide all intercondition relations, which would require a lengthy experiment for large N. However, if dynamic algorithms, such as quicksort (Silverstein and Farrell, 2001), are used, the number of comparisons can be reduced to about N log N; a sketch of this idea follows the list. Also, since the visual task is simple, the comparisons are generally quick, and overall pairwise comparison can be both faster and more precise than rating methods (Mantiuk et al., 2012).
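The sketch below shows how a comparison sort can stand in for exhaustive pairwise testing. The function ask_observer is a hypothetical stand-in for the experimental interface (here it simply picks randomly); the cited studies used quicksort specifically, whereas Python's sorted uses a different comparison sort, but the number of comparisons scales as N log N either way.

```python
# Minimal sketch: reducing pairwise comparisons from N(N - 1)/2 toward
# about N log N by letting a sorting algorithm choose which pairs to show.
import random
from functools import cmp_to_key

def ask_observer(tmo_a, tmo_b):
    """Hypothetical stand-in for the experimental interface: display two
    tone-mapped sequences and return -1 if the observer prefers tmo_a,
    or 1 if the observer prefers tmo_b. Here the choice is random."""
    return -1 if random.random() < 0.5 else 1

conditions = [f"TMO-{i}" for i in range(1, 8)]  # N = 7 conditions

# The sort queries the observer only for the pairs it needs, instead of
# all 21 pairs required for a full round-robin at N = 7.
ranking = sorted(conditions, key=cmp_to_key(ask_observer))
print(ranking)
```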

7.3 Survey of TMO Evaluation Studies

There has been a lot of research on the quality assessment of TMOs, at least in terms of comparisons on static HDR images. Referring to the categorization in the previous section, Table 7.1 shows a selection of the most relevant studies in the literature, from the first by Drago et al. (2002) to the most recent video TMO evaluation by Eilertsen et al. (2013). From this large body of research, one might expect a clear indication of which operators perform better than others. However, as outlined in the previous section, many factors affect the evaluation of TMOs, and one study can lead to a completely different result from a similar study with different experimental conditions. Instead, the information from previous studies has to be viewed in a wider perspective. The evaluations do not provide a unanimous answer as to which operator performs best, but rather give insight into their different properties, showing tendencies that can inform future research and development.

Analyzing the TMO evaluation literature, we find there is a slight tendency to favor global TMOs over local ones. This is the case when we make comparisons in terms of fidelity with both reality (Čadík et al., 2008; Eilertsen et al., 2013) and subjective preference (Akyüz et al., 2007). In fact, in some situations a single exposure directly from the camera, tone-mapped only by the camera response curve, has been shown to outperform sophisticated TMOs (Akyüz et al., 2007). However, in more complicated situations — for example, where important information lies in different intensity regions of the image — a more complex tone curve could be expected to adapt better and render a preferable result (Petit and Mantiuk, 2013). The different results for global and local operators could possibly also be explained by insufficient parameter tuning because global operators are, in general, more robust than local operators. The quality may also be masked by spatial inconsistencies, or artifacts, which are more common when a local tone compression is applied.

In many cases, certain image features have been shown to correlate with the subjective quality, both in perceptual comparisons (Kuang et al., 2007; Čadík et al., 2008) and in comparisons with numerical measures (Delahunt et al., 2005; Yoshida et al., 2006; Petit and Mantiuk, 2013). However, even if single image features can be shown to agree with the overall quality as perceived by an observer, it is difficult to draw any general conclusions about the performance of the evaluated TMO. Hence, the human observer’s assessments will likely always be the most reliable method of measuring the quality of TMOs.

Many of the factors defining a subjective study have been shown to correlate strongly. Regarding the evaluation method, some studies demonstrate approximately similar results with reference and nonreference methods (Kuang et al., 2007; Čadík et al., 2008), and with an HDR display versus a real-world scene (Kuang et al., 2010). Ashikhmin and Goyal (2006), however, did not find such a relationship between reference and nonreference conditions, but showed that when no reference was given, there was no evidence of a difference between comparisons made under the preferred and the most realistic criteria (the best subjective quality and visual system simulator intents, respectively); this finding was further supported by Petit and Mantiuk (2013). For different experimental setups, a correlation in the results is highly expected, and this has been demonstrated in TMO evaluations (Kuang et al., 2007; Čadík et al., 2008) as well as in metastudies (Mantiuk et al., 2012).

Despite the many demonstrated correlations, a comparative study is still very sensitive to the choice of experimental setup, which may lead to important differences in the final outcome; correlations demonstrated in certain situations are not enough to draw definite conclusions. The choice of setup affects not only the averaged results of a study, but also the variance of the results and the practical constraints of the physical setup. Furthermore, there is not yet any support for correlations in the choice of input material (images or video), and the parameter settings remain a problematic and highly influential aspect (Petit and Mantiuk, 2013; Eilertsen et al., 2013). The challenges involved in an evaluation are also reflected in the diverse set of results in previous work — the outcome depends strongly on how the experiment is constructed. This further illustrates the importance of the clear definitions and restrictions described in the categorization in Section 7.2.

7.4 Evaluation Studies for Video TMOs

Although the area of tone mapping evaluation is quite extensively researched, there are very few studies using HDR video material for the comparisons (see the input material in Table 7.1). This can be explained by the limited availability of HDR video material: HDR video capturing systems appeared only a few years ago (Tocci et al., 2011; Kronander et al., 2013; Froehlich et al., 2014). The question, then, is how to evaluate tone mapping in this context using video material. Are there any differences compared with making comparisons on single HDR frames? It would be easiest to infer the results from the many existing evaluations on static images. However, even though the outcome of such evaluations can be expected to be approximately the same for many images, it is very unlikely to generalize to video material. This is due both to differences in the actual tone reproduction as the input signal changes over time and to the altered subjective impression of material that is moving.

Nontrivial TMOs always rely, at least to some extent, on the input signal to form a tone curve. That is, the mapping f of a pixel value $I_{x,y}$ in the image I is formulated as $f(I_{x,y}, I)$. The signal dependence could, for example, include the image histogram, the image mean or maximum value, or other statistical measures. Since these statistics can change rapidly from frame to frame in an image sequence, the mapping can cause flickering and other temporal incoherences. To temporally stabilize the tone reproduction, a model for adaptation over time is needed in most situations. This also makes it pointless to compare TMOs for static images without temporal adaptation: the artifacts caused by temporal inconsistencies will likely be by far the most prominent feature, making the evaluation result difficult to interpret.
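To make the signal dependence concrete, the sketch below implements a toy global operator whose anchor statistic, the log-average luminance, is smoothed over time with an exponential filter. This is a minimal illustration of temporal adaptation, assuming NumPy; it is not the implementation of any particular published operator.

```python
import numpy as np

def tone_map_sequence(frames, alpha=0.05, key=0.18):
    """Tone map an iterable of linear HDR luminance frames (values > 0).

    The anchor statistic (log-average luminance) is smoothed over time with
    an exponential filter, so abrupt changes in frame statistics do not
    translate directly into flicker in the output."""
    adapted = None
    for frame in frames:
        logmean = np.exp(np.mean(np.log(frame + 1e-6)))  # per-frame statistic
        # Exponential smoothing: the tone curve reacts gradually over time,
        # mimicking temporal adaptation.
        adapted = logmean if adapted is None else adapted + alpha * (logmean - adapted)
        scaled = key * frame / adapted        # map the smoothed mean toward `key`
        yield scaled / (1.0 + scaled)         # simple sigmoid compression to [0, 1)

# Synthetic usage: three frames with a sharp jump in overall intensity.
frames = (np.full((4, 4), 10.0 ** k) for k in range(3))
ldr_frames = list(tone_map_sequence(frames))
```

With alpha well below 1, a sudden jump in the frame statistic affects the tone curve only gradually; setting alpha = 1 reverts to purely per-frame statistics and reintroduces the flicker described above.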

Another aspect of temporal tone mapping is the visibility and impact of certain image features. Spatial artifacts can, for example, become visually more prominent when the processing is extended to the time domain (temporally inconsistent behavior is often very easy to perceive). For example, local tone mapping, where the tone curve varies over the image, often gives rise to a spatially varying tone reproduction that is inconsistent with the original HDR input (eg, around edges in the image). Although these artifacts may not be noticeable in a static image, their dependence on the local image statistics can cause fluctuations over time that are visually disturbing.

In comparison with evaluations of TMOs for images, the inclusion of the time domain leads to a significant increase in the information to be assessed in the subjective judgments. The way a human observer perceives a signal that changes over time compared with a static signal is different, in terms of both how the HVS is stimulated and how the higher-level assessments are performed. It is evident that evaluations of TMOs for video have requirements different from those of TMOs for images and that a well-defined experimental setup is even more important (Section 7.2).

7.5 Video TMO Evaluation Study I

One of the first evaluations to actually use HDR video sequences for the comparison of tone mapping algorithms was performed by Petit and Mantiuk (2013). This study was not targeted at just providing a ranked list of operators, sorted according to their estimated quality in reproduction. The aim was rather to reveal some of the differences in using a simple S-shaped camera tone curve compared with more sophisticated TMOs. Furthermore, as two different criteria for evaluation were used, and different parameter settings were included, the study also aimed to show what differences this makes.

One motivation for the evaluation was the increasing complexity of the algorithms used for tone mapping. Is this added complexity really necessary in the commonest situations? Most consumer cameras already provide simple tone manipulation, built into the camera hardware as a postprocessing step. This usually takes the form of an S-shaped mapping function that globally compresses the dynamic range of the input. If such methods were used in the same way in HDR capturing systems, what differences could be expected in terms of subjective quality compared with the more sophisticated solutions?

With reference to the preceding evaluation categorization (see Section 7.2), the setup of the evaluation study was as follows:

 The study was carried out to compare tone mappings with a nonreference method, motivated by the difficulties in setup and the fact that the impact of a reference is disputable (Čadík et al., 2008).

 The evaluation used a best subjective quality criterion as well as a visual system simulator criterion.

 Five different settings of each TMO were used for the experiments, and they were treated separately as different TMOs.

 The operators were compared with use of HDR video material, created from panning in static HDR panoramas and from computer-generated scenes.

 For the assessment a rating method was used, mainly since pairwise comparison would demand a lengthy procedure owing to the different parameter settings.

As mentioned previously, three of the central questions of the evaluation were (1) to determine if there is a difference between tone mapping algorithms that simulate the HVS and those that try to achieve the subjectively most preferred result, (2) to investigate the differences between default parameters and other possible settings, and (3) to see if sophisticated tone mapping algorithms show measurable improvement over an S-shaped camera tone curve. The study in terms of these questions is described in further detail in the following sections.

7.5.1 TMO Intent

The evaluation results for different TMOs could potentially depend on what particular criterion is used for the image assessments. To investigate this possibility, and to give a hint regarding the general difference in making comparisons with different criteria, the evaluation used both a preference criterion (best subjective quality) and a fidelity with real-world experience criterion (visual system simulators). Such comparisons of evaluation criteria have previously been shown to yield approximately the same results when no reference is used (Ashikhmin and Goyal, 2006).

In the end, the experiment showed no statistical difference between the two criteria. The results suggest either that there are no major differences in preferred reproduction and reproduction closest to real life or that the differences are small with no reference provided. That is, without a physical reference for direct comparison, the experienced similarity with the real world is biased toward subjective preferences. Indeed, the memorized image looks nothing like the retinal image (Stone, 2012), and a reference comparison could potentially yield a different result (Ashikhmin and Goyal, 2006).

7.5.2 Parameter Settings

If a comparison reveals differences in subjective quality — for example, between the camera curve and more sophisticated TMOs — these differences could stem from the particular parameter settings of the operators rather than from the operators themselves. To cover this possibility, five different settings of each TMO were provided for the experiments and treated as separate TMOs. Although this subset is very small compared with the complete set of possible parameter combinations, it provided a way to determine whether the default parameters always perform best. The parameters for the different settings were tuned to produce acceptable and visually different results. In terms of sampling the parameter space, this can be described as an attempt to provide a uniform sampling of the subspace spanned by acceptable values of the altered parameters, in a perceptually linear frame of reference.

The outcome showed that fine-tuning the parameters can improve quality. Although most cases did not show statistically significant differences between settings, in some cases a particular setting produced an unacceptable or an exceptionally good result.

7.5.3 S-Shaped Curve Performance

In the evaluation, four different TMOs were compared: those of Irawan et al. (2005), Kuang et al. (2007), and Mantiuk et al. (2008), as well as an S-shaped camera response curve similar to those found in consumer-grade cameras.

In the end, the differences in performance between the operators were mostly small, with no statistical difference between the three best performing ones. This can, at least to some extent, be explained by the tuned parameters; default parameters could have caused some operators to perform significantly worse. Thus, an important conclusion was that the operator parameters need thorough analysis when one initiates a comparative study. Furthermore, the results obtained from investigating the different TMOs and parameter settings using certain isolated image attributes also showed a correlation between the quality of the tone reproduction and overall brightness, as well as chroma.

For the camera curve, statistical differences in quality appeared only when the scene was more complicated — for example, with higher dynamic range and different properties in different parts of the scene. The camera curve is not flexible enough for such situations, where important parts of the image differ significantly in luminance. This highlights the need to select the evaluation input material carefully, so that the advantages of sophisticated algorithms can be distinguished and detected.

Although the study revealed a substantial amount of interesting and relevant information, there was no particular emphasis on the differences between tone mapping of static images and of video material. Another limitation was that the material used did not capture some important aspects of video. Also, the study used a low-sensitivity direct rating method, so the actual differences captured were difficult to establish statistically.

7.6 Video TMO Evaluation Study II

The most comprehensive evaluation to date, targeted specifically at HDR video tone mapping, was performed by Eilertsen et al. (2013). The study involved 11 TMOs, all with explicit treatment of the tone mapping over time. This was the first time that material from actual HDR video capturing systems was used in an evaluation, revealing some of the most important problems a video TMO needs to address.

With reference to the categorization in Section 7.2, the setup of the evaluation study was as follows:

 The study applied a nonreference method, motivated by the fact that most applications will require the best match to a memorized scene rather than a particular reference.

 The evaluation was targeted at visual system simulators, which is one of the commonest types of tone mapping algorithms.

 Parameters were carefully studied, and if they were not defined with default values or were subject to substantial improvement, they were set in a pilot study.

 HDR video material was provided from a multisensor HDR camera setup (Kronander et al., 2013), from a RED EPIC camera set to HDR-X mode, and from a computer-generated sequence.

 First, a qualitative rating experiment was performed, followed by a pairwise comparison study.

The survey and evaluation were done as follows: appropriate TMOs were selected and a parameter calibration experiment was performed. The actual evaluation included a qualitative rating experiment and a pairwise comparison. Finally, from the results a set of important conclusions was drawn. These steps are described in further detail in the following sections.

7.6.1 TMO Selection

Even though the evaluation was targeted at visual system simulators, operators with other intents were also included, since they potentially could produce results matching the performance of operators from the targeted class. Eleven operators were considered:

Visual adaptation TMO (Ferwerda et al., 1996), a global visual system simulator operator that uses data from psychophysical experiments to simulate adaptation over time, and effects such as color appearance and visual acuity. The visual response model is based on measurements of threshold visibility as in Ward (1994).

Time-adaptation TMO (Pattanaik et al., 2000), a global visual system simulator based on published psychophysical measurements (Hunt, 1995). Static responses are modeled separately for cones and rods, and are complemented with exponential smoothing filters to simulate adaptation in the time domain. A simple appearance model is also included.

Local adaptation TMO (Ledda et al., 2004), a local visual system simulator with a temporal adaptation model based on experimental data. It operates on a local level using a bilateral filter.

Maladaptation TMO (Irawan et al., 2005), a global visual system simulator based on the work of Ward Larson et al. (1997) for tone mapping and Pattanaik et al. (2000) for adaptation over time. It also extends the threshold visibility concepts to include maladaptation.

Virtual exposures TMO (Bennett and McMillan, 2005), a local best subjective quality operator, applying a bilateral filter both spatially for local tone manipulation and separately in the time domain for temporal coherence and noise reduction.

Cone model TMO (Van Hateren, 2006), a global visual system simulator, using a dynamic model of cones in the retina. A model of primate cones is used, based on actual retina measurements.

Display adaptive TMO (Mantiuk et al., 2008), a global scene reproduction operator, performing a display adaptive tone mapping, where the goal is to preserve the contrasts within the input (HDR) as well as possible given the characteristics of an output display. Temporal variations are handled through a low-pass filtering of the tone curves over time.

Retina model TMO (Benoit et al., 2009), a local visual system simulator with a biological retina model, where the time domain is used in a spatiotemporal filtering for local adaptation levels. The spatiotemporal filtering, simulating cellular interactions, yields an output with whitened spectra that is temporally smoothed for improved temporal stability and noise reduction.

Color appearance TMO (Reinhard et al., 2012), a local visual system simulator operator, performing a display and environment adapted image appearance calibration, with localized calculations through the median cut algorithm.

Temporal coherence TMO (Boitard et al., 2012), a local scene reproduction operator, with a postprocessing algorithm to ensure temporal stability for static TMOs applied to video sequences. Boitard et al. (2012) used mainly the photographic tone reproduction of Reinhard et al. (2002), for which the algorithm is most developed. Therefore, the version used in the survey also used the photographic operator.

Camera TMO, a global best subjective quality operator, representing the S-shaped tone curve which is used by most consumer-grade cameras to map the sensor-captured values to the color gamut of a storage format. The curves applied were measured for a Canon 500D DSLR camera, with measurements conducted for each channel separately. To achieve temporal coherence, the exposure settings were anchored to the mean luminance filtered over time with an exponential filter.

7.6.2 Parameter Selection Experiment

Finding the optimal calibration of an operator, measured by some subjective preference, is an inherently difficult problem because of the many possible parameter combinations to be tested in an experimental setup. To overcome the problem, without resorting to testing only a few selected points of the parameter space, a new method for perceptual optimization (Eilertsen et al., 2014) was used to tune the parameters of each TMO included in the evaluation. The key idea is that the characteristics of an operator can be accurately described by interpolation between a small set of precomputed parameter settings. This allows the user to interactively explore the interpolated parameter space to find the optimal point in a robust way. The method is particularly efficient when dealing with video sequences; since the result of a parameter change needs to be assessed over time, a large number of frames have to be regenerated when a parameter value is changed.

Direct interpolation between a sparse set of points will in many cases generate unacceptable errors because of nonlinear behavior. To prevent this, a linearization of the parameter space was performed, for each operator, before sampling and interpolation. The linearization was formulated by analysis of the image changes E(p) as a certain parameter p was altered. By quantifying the changes in this way over the range of parameter values, and requiring that E(p) be constant over all p, the normalized inverse $f^{-1}(p)$ of the cumulative sum f(p) of E(p) could subsequently be used as a transformation to a space of linear image changes. Experiments showed that the absolute difference averaged over the image pixels worked well in most situations as a simple measure of image change. That is, E(p) and f(p) are defined as follows:

$$E(p) = \frac{1}{N} \sum \left| \frac{\mathrm{d}I_p}{\mathrm{d}p} \right|, \tag{7.1a}$$

$$f(p) = \int_{p_{\min}}^{p} E(r) \, \mathrm{d}r, \tag{7.1b}$$

where $I_p$ is the input image I processed with an operator at the parameter value p, and the summation goes over all N pixels in the image. An example linearization, obtained with a simple gamma correction operation for demonstration, is shown in Fig. 7.2. The plots in Fig. 7.2A and B correspond to Eqs. (7.1a) and (7.1b), respectively. Finally, the plot in Fig. 7.2C is of the inverted and normalized cumulative sum $f^{-1}(p)$ that is used for the transformation to a linear parameter space.

Figure 7.2 The mean absolute difference is measured over parameter changes (A), and the cumulative sum (B) is used to create a linearization transform (C). The example uses a simple gamma mapping value as a parameter, $I_p = I^{1/p}$.
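The procedure in Eq. (7.1) can be sketched in a few lines: sample the parameter, measure the mean absolute image change between adjacent samples, accumulate, normalize, and invert by interpolation. The sketch below uses the gamma mapping $I_p = I^{1/p}$ from Figs. 7.2 and 7.3 and assumes NumPy; the function and variable names are our own, not from the original implementation.

```python
import numpy as np

def linearize_parameter(image, p_values):
    """Build a lookup from a perceptually linearized coordinate t in [0, 1]
    back to the original parameter p, following Eq. (7.1)."""
    outputs = [image ** (1.0 / p) for p in p_values]   # the operator I_p = I^(1/p)
    # E(p): mean absolute image change between adjacent parameter samples,
    # a finite-difference estimate of (1/N) * sum |dI_p / dp|  (Eq. 7.1a)
    E = [np.mean(np.abs(b - a)) for a, b in zip(outputs, outputs[1:])]
    # f(p): cumulative sum approximating the integral of E(r) from p_min to p
    # (Eq. 7.1b), normalized so that f spans [0, 1]
    f = np.concatenate([[0.0], np.cumsum(E)])
    f /= f[-1]
    # The normalized inverse f^{-1}: interpolate p as a function of f
    return lambda t: np.interp(t, f, p_values)

image = np.random.rand(256, 256)  # stand-in for a normalized input image
to_p = linearize_parameter(image, np.linspace(1.0, 5.0, 50))
# Uniform steps in t now give approximately uniform image changes, so a few
# sample points suffice for interpolation (cf. the three points in Fig. 7.3).
three_samples = [to_p(t) for t in np.linspace(0.0, 1.0, 3)]
```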

In Fig. 7.3, the linearization of the gamma correction operation is demonstrated on an image. With the transformation it is possible to use only three sample points to compute an interpolated approximation of the parameter changes, with small interpolation errors. Not only does the linearization reduce interpolation errors, but it also makes adjustments easier and more intuitive as it provides a parameter that varies much more closely to what is perceived as linear.

Figure 7.3 Example linearization and interpolation, using a simple gamma mapping value as a parameter, $I_p = I^{1/p}$. Without linearization, the interpolation shows large errors (B) compared with interpolation after linearization (D). (A) Linear parameter changes. (B) Interpolation from three sample points. (C) Linearized changes. (D) Interpolation from three linearized sample points.

Although the interpolation strategy enables interactive exploration of the parameters of an operator, actually finding the perceptual optimum in a high-dimensional space remains a difficult problem. The search was therefore formulated as a general optimization problem in which the objective function was determined by the user's image assessments. The user was given the task of searching the parameter space in one direction at a time, choosing directions according to Powell's conjugate gradient method (Powell, 1964). The method finds a minimum of a nondifferentiable objective function (in this case the user's assessments) in a high-dimensional parameter space. An example of a search in a two-dimensional parameter space using conjugate gradients is illustrated in Fig. 7.4. The iterative nature of the approach makes it robust against errors: if inaccuracies are introduced by the user, the solution converges more slowly, but additional iterations ensure convergence toward a perceptual minimum.

Figure 7.4 Left: Example of a search in a two-dimensional space. Right: If errors are introduced in the search, the minimum can still be found with additional iterations.
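As a rough illustration of this kind of search, the sketch below runs Powell's method on a synthetic stand-in for the user's assessments. It assumes SciPy; in the real experiment the objective is evaluated by a person rating images along each search direction, not by a function call, and the optimum is of course unknown in advance.

```python
import numpy as np
from scipy.optimize import minimize

def user_assessment(params):
    """Stand-in for the subjective objective (lower is better). In the real
    experiment this value comes from a person assessing tone-mapped video
    along the current search direction, not from a formula."""
    optimum = np.array([0.4, 0.7])          # hypothetical perceptual optimum
    return float(np.sum((params - optimum) ** 2))

# Powell's method is derivative free, so it works even when the "function"
# being minimized is a sequence of human judgments.
result = minimize(user_assessment, x0=np.zeros(2), method="Powell")
print("Estimated optimal parameters:", result.x)   # converges near [0.4, 0.7]
```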

Even though this method provides a robust way of finding a locally optimal set of parameters, the global perceptual minimum is not guaranteed. Also, performing the calibration procedure for all operators and parameters would mean an extensive pilot study, which would take too long to perform in one sitting for each observer. To provide the best compromise between default and calibrated parameters, the default values were carefully studied, and the calibration experiment was performed on the parameters whose default values either did not provide a reasonable result or were not defined. For each parameter, the linearization function (Eq. 7.1) was evaluated over a range of different sequences and then used as a general linearization across all sequences selected for the calibration experiment. The experiment involved four expert observers, and the final parameters were calculated as their averaged result.

7.6.3 Qualitative Evaluation Experiment

Initial observations on the behavior of the considered TMOs, when exposed to sharp intensity transitions over time, revealed that some have serious temporal coherency problems. In many cases, severe flickering or ghosting artifacts were visible. This is illustrated in Fig. 7.5, where the tone mapping responses of four operators are plotted over time. The outputs are fetched from two different pixel locations; the first (left) is a location where the input is relatively constant over time, and the second (right) is a location with a rapid temporal transition. From the plots, it is clear that the tone-mapped values contain flickering or ghosting artifacts.

Figure 7.5 (A) An HDR video with a set of frames, where the two points indicated are analyzed. (B) and (C) The tone-mapped output of four operators at the positions, plotted over time.

From an evaluation standpoint it is problematic to assess the overall quality of a TMO exhibiting severe temporal artifacts. To quantify the different artifacts and tabulate the strengths and weaknesses of each operator included in the evaluation, a qualitative analysis was performed before the actual comparison experiment. This was done as a rating experiment, where five expert observers were assigned the task of rating different image attributes and artifacts. To measure appearance reproduction, overall brightness, overall contrast, and overall color saturation were included in these ratings. Furthermore, to assess the presence of artifacts, additional measurements were defined: temporal color consistency (objects should retain the same hue, chroma, and brightness), temporal flickering, ghosting, and excessive noise.

Fig. 7.6 shows the results of the qualitative evaluation for four TMOs. The figure is divided into ratings of (A) artifacts and (B) color-rendition attributes. From the experiment it was clear that the most salient artifacts were flickering and ghosting. While some inconsistencies in colors and visible noise were accepted as tolerable, even minor flickering or ghosting was deemed unacceptable. It was therefore decided that all operators in which either of these artifacts was visible in at least three scenes should be eliminated from further analysis. In the end, four operators were removed by this criterion. They are all shown in Fig. 7.6, and from the plots it is clear that all four show problems with either flickering or ghosting. This type of analysis was not only helpful in removing operators unsuitable for comparison, but also provides insight into many of the challenges involved in tone mapping.

Figure 7.6 Example ratings of artifacts (top) and color-rendition problems (bottom). The bars for each attribute represent different sequences.

7.6.4 Pairwise Comparison Experiment

One of the goals of the evaluation was to reveal which operator could be expected to generate, on average, the best result. In this context, best means the tone reproduction closest to what is perceived by the human eye, since the evaluation was performed with visual system simulators (or other TMOs that potentially could deliver a convincing result in line with this criterion).

The evaluation was performed as a direct comparison experiment with the seven operators remaining after the qualitative rating experiment. Tone-mapped sequences were shown sequentially in random order, and in each trial the participants were shown two videos of the same HDR scene tone-mapped with two different operators. The observers were then asked to choose the sequence that to them looked closest to how they imagined the real scene would look. In total, 18 observers completed the experiment, each evaluating the seven TMOs using five different HDR video sequences. If all pairs were compared, this would mean $5 \times \frac{1}{2} \times 7 \times 6 = 105$ comparisons, which is time-consuming and potentially exhausting for the observer. To avoid fatigue, the number of comparisons was reduced with the quicksort algorithm (Silverstein and Farrell, 2001), which chooses each comparison on the basis of previous outcomes. The algorithm puts more effort into distinguishing closer samples, and the sorting procedure provides a complete relative quality estimate for N conditions with about N log N comparisons.

The outcome of the pairwise comparison experiment is shown in Fig. 7.7. The results are plotted per sequence and are scaled in units of just noticeable differences (JND). When 75% of the observers select one condition over another, the quality difference between them is defined to be 1 JND. To scale the results in JND units, the Bayesian method of Silverstein and Farrell (2001) was used. In brief, the method finds a quality value for each condition that maximizes the probability of the collected data under the Thurstone case V assumptions, with the probability modeled by the binomial distribution. Unlike standard scaling procedures, the Bayesian approach is robust to unanimous answers, which are common when a large number of conditions are compared.

Figure 7.7 Results of the pairwise comparison experiment scaled in JND units (the higher, the better) under Thurstone case V assumptions, where 1 JND corresponds to 75% discrimination threshold. Note that absolute JND values are arbitrary and only relative differences are meaningful. The error bars denote 95% confidence intervals computed by bootstrapping.
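Under the Thurstone case V assumptions, an observed preference probability maps to a quality difference through the cumulative normal distribution, with 1 JND anchored at 75% preference. The snippet below shows only this pairwise relation, assuming SciPy; the full scaling described above jointly optimizes one quality value per condition rather than converting each pair independently.

```python
from scipy.stats import norm

def preference_to_jnd(p_prefer):
    """Map an observed preference fraction to a quality difference in JNDs,
    with 1 JND anchored at a 75% preference rate."""
    sigma = 1.0 / norm.ppf(0.75)   # scale chosen so that P = 0.75 gives 1 JND
    return norm.ppf(p_prefer) * sigma

print(preference_to_jnd(0.75))  # 1.0 JND, by definition
print(preference_to_jnd(0.90))  # about 1.9 JNDs
```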

Although the differences for some sequences are small, with confidence intervals that do not always indicate significant differences, two operators seem to be consistently preferred: the display adaptive TMO and the maladaptation TMO. Also, the simple camera curve — the camera TMO — shows good relative performance, and actually came out on top for one of the sequences. Another observable pattern is the tendency to reject the cone model TMO and the time-adaptation TMO.

The scores should not be interpreted as an absolute ranking of the individual TMOs. There are uncertainties in the scores, and the results are not guaranteed to generalize to the entire population of possible sequences. However, they provide an indication of how the operators can be expected to perform in a variety of situations.
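The error bars in Fig. 7.7 were obtained by bootstrapping. A sketch of the general procedure is given below, assuming one count matrix per observer and reusing scale_thurstone_v from the earlier sketch; the study's exact resampling unit is not stated here, and resampling observers is one common choice:

```python
import numpy as np

def bootstrap_jnd_intervals(per_observer_counts, scale_fn,
                            n_boot=500, alpha=0.05):
    """Bootstrap confidence intervals for JND quality scores.

    per_observer_counts: array of shape (observers, n, n) holding one
    pairwise count matrix per observer. scale_fn maps a pooled count
    matrix to quality scores (e.g., scale_thurstone_v above).
    """
    rng = np.random.default_rng(0)
    n_obs = per_observer_counts.shape[0]
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_obs, size=n_obs)       # resample observers
        pooled = per_observer_counts[idx].sum(axis=0)  # pooled count matrix
        samples.append(scale_fn(pooled))
    samples = np.asarray(samples)
    lo = np.percentile(samples, 100 * alpha / 2, axis=0)
    hi = np.percentile(samples, 100 * (1 - alpha / 2), axis=0)
    return lo, hi  # lower and upper bounds per condition
```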

7.6.5 Conclusion

One important outcome of the evaluation was due to the use of material captured with an HDR video camera, which contains complicated transitions in intensity both spatially and, most importantly, over time. There are substantial differences between camera-captured HDR video and artificially created material, such as computer-generated scenes, panning in HDR panoramas, and static scenes with changing illumination. This provided genuinely challenging conditions under which the operators under consideration had not yet been tested, and many TMOs showed artifacts in at least some of the situations.

As a consequence, there was a tendency in the pairwise comparisons to favor the more straightforward, global tone reproductions. In general, the more complicated, local transformations appeared to be more sensitive and prone to flickering and ghosting artifacts. Since these artifacts were established as perceptually most salient, such operators often produced mappings that were unsatisfactory or even unacceptable.

However, the results should not be interpreted as a guideline favoring global processing as the method of choice for tone transformation. For example, a global strategy is unable to capture important local transitions that need to be preserved in order to maintain an overall level of local contrast corresponding to the original HDR input. Rather, the results show that there are still issues to be addressed before high-quality tone reproduction can be achieved for general HDR video material. To aid future development of such work, the survey and evaluation conclude with a set of guidelines pointing out the properties a video TMO should possess:

• Temporal model free from artifacts such as flickering, ghosting, and disturbing (too noticeable) temporal color changes (a simple automatic check is sketched after this list).

• Local processing to achieve sufficient dynamic range compression in all circumstances while maintaining a good level of detail and contrast.

• Efficient algorithms, since a large amount of data needs processing and turnaround times should be kept as short as possible.

• No, or very limited, requirement of parameter tuning.

• Capability of generating high-quality results for a wide range of video inputs with highly different characteristics.

• Explicit treatment of noise and color.
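As a rough illustration of how the first guideline could be screened for automatically, consider the sketch below. This heuristic is our own, not a metric from the chapter: it flags large frame-to-frame jumps in the mean log-luminance of a tone-mapped sequence as potential global flicker, and a real test would also inspect the HDR input to rule out genuine scene changes.

```python
import numpy as np

def suspect_flicker_frames(frames, threshold=0.05):
    """Flag potential global flicker in a tone-mapped sequence.

    frames: iterable of 2D luminance arrays (one per frame). Returns
    the indices where the mean log-luminance jumps by more than
    threshold between consecutive frames.
    """
    means = np.array([np.log(np.mean(f) + 1e-6) for f in frames])
    jumps = np.abs(np.diff(means))
    return np.flatnonzero(jumps > threshold)  # indices of suspect transitions
```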

7.7 Summary

Throughout this chapter, we have emphasized the importance of well-defined and clearly motivated subjective studies. In this context, it is important to appreciate that the outcome of an evaluation, given a certain setup, is only an assessment of the performance of a TMO under the particular conditions chosen. Statistical testing ensures that the results generalize to other observers, but not necessarily to other sequences, intents, and so on. If the conditions are chosen with care, however, the outcome can be expected to generalize reasonably well to a wider selection of conditions. All in all, though, it is not possible to cover every scenario and intent. This is also reflected in the diverse outcomes of the evaluation studies found in the literature, where different experimental designs give significantly different results.

During the last two decades we have seen extensive research in the area of tone mapping, and many new TMOs have been developed. However, thorough evaluations have demonstrated that tone mapping is still an open problem and that many issues remain to be addressed, especially for HDR video.

References

Akyüz A.O., Reinhard E. Perceptual evaluation of tone-reproduction operators using the Cornsweet-Craik-O'Brien illusion. ACM Trans. Appl. Percept. 2008;4(4):1:1–1:29.

Akyüz A.O., Fleming R., Riecke B.E., Reinhard E., Bülthoff H.H. Do HDR displays support LDR content? A psychophysical evaluation. ACM Trans. Graph. 2007;26(3).

Ashikhmin M., Goyal J. A reality check for tone mapping operators. ACM Trans. Appl. Percept. 2006;3(4).

Bennett E.P., McMillan L. Video enhancement using per-pixel virtual exposures. ACM Trans. Graph. 2005;24(3).

Benoit A., Alleysson D., Herault J., Le Callet P. Spatio-temporal tone mapping operator based on a retina model. In: Computational Color Imaging. Berlin/Heidelberg: Springer-Verlag; 2009.

Boitard R., Bouatouch K., Cozot R., Thoreau D., Gruson A. Temporal coherency for video tone mapping. In: Proc. of SPIE 8499, Applications of Digital Image Processing XXXV. 2012.

Čadík M., Wimmer M., Neumann L., Artusi A. Evaluation of HDR tone mapping methods using essential perceptual attributes. Comput. Graph. 2008;32(3).

Debevec P.E., Malik J. Recovering high dynamic range radiance maps from photographs. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’97. New York, NY: ACM Press/Addison-Wesley Publishing Co. 1997:369–378.

Delahunt P.B., Zhang X., Brainard D.H. Perceptual image quality: effects of tone characteristics. J. Electron. Imaging. 2005;14(2).

Drago F., Martens W., Myszkowski K., Seidel H.P. Perceptual evaluation of tone mapping operators with regard to similarity and preference. Stuhlsatzenhausweg, Germany: Max-Planck-Institut für Informatik; 2002 Research Report MPI-I-2002-4-002.

Eilertsen G., Unger J., Wanat R., Mantiuk R. Perceptually based parameter adjustments for video processing operations. In: ACM SIGGRAPH 2014 Talks, SIGGRAPH '14. New York, NY: ACM; 2014:74:1–74:1.

Eilertsen G., Wanat R., Mantiuk R.K., Unger J. Evaluation of tone mapping operators for HDR-video. Comput. Graph. Forum. 2013;32(7):275–284.

Ferwerda J.A., Pattanaik S.N., Shirley P., Greenberg D.P. A model of visual adaptation for realistic image synthesis. In: Proc. of SIGGRAPH ’96. New York, NY: ACM; 1996.

Froehlich J., Grandinetti S., Eberhardt B., Walter S., Schilling A., Brendel H. Creating cinematic wide gamut HDR-video for the evaluation of tone mapping operators and HDR-displays. In: Proceedings of SPIE Electronic Imaging. 2014.

Hunt R.W.G. The Reproduction of Colour. Tolworth, England: Fountain Press; 1995.

Irawan P., Ferwerda J.A., Marschner S.R. Perceptually based tone mapping of high dynamic range image streams. In: Proc. of Eurographics Symposium on Rendering. Aire-la-Ville: Eurographics Association; 2005.

Kronander J., Gustavson S., Bonnet G., Unger J. Unified HDR reconstruction from raw CFA data. In: Proceedings of the IEEE International Conference on Computational Photography. 2013.

Kuang J., Heckaman R., Fairchild M.D. Evaluation of HDR tone-mapping algorithms using a high-dynamic-range display to emulate real scenes. J. Soc. Inform. Disp. 2010;18(7).

Kuang J., Yamaguchi H., Johnson G.M., Fairchild M.D. Testing HDR image rendering algorithms. In: Proc. IS&T/SID 12th Color Imaging Conference, Scottsdale, Arizona. 2004:315–320.

Kuang J., Yamaguchi H., Liu C., Johnson G.M., Fairchild M.D. Evaluating HDR rendering algorithms. ACM Trans. Appl. Percept. 2007;4(2).

Ledda P., Chalmers A., Troscianko T., Seetzen H. Evaluation of tone mapping operators using a high dynamic range display. ACM Trans. Graph. 2005;24(3).

Ledda P., Santos L.P., Chalmers A. A local model of eye adaptation for high dynamic range images. In: Proc. of AFRIGRAPH ’04. New York, NY: ACM; 2004.

Mantiuk R., Daly S., Kerofsky L. Display adaptive tone mapping. ACM Trans. Graph. 2008;27(3).

Mantiuk R., Kim K.J., Rempel A.G., Heidrich W. HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Trans. Graph. 2011;30(4):40:1–40:14.

Mantiuk R.K., Tomaszewska A., Mantiuk R. Comparison of four subjective methods for image quality assessment. Comput. Graph. Forum. 2012;31(8).

McCann J.J., Rizzi A. The Art and Science of HDR Imaging. Chichester, West Sussex, UK: Wiley; 2012.

Park S.H., Montag E.D. Evaluating tone mapping algorithms for rendering non-pictorial (scientific) high-dynamic-range images. J. Vis. Commun. Image Represent. 2007;18(5):415–428.

Pattanaik S.N., Tumblin J., Yee H., Greenberg D.P. Time-dependent visual adaptation for fast realistic image display. In: Proc. of SIGGRAPH ’00. New York, NY: ACM Press/Addison-Wesley Publishing; 2000.

Petit J., Mantiuk R.K. Assessment of video tone-mapping: are cameras’ S-shaped tone-curves good enough? J. Vis. Commun. Image Represent. 2013;24.

Powell M.J.D. An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput. J. 1964;7(2).

Reinhard E., Pouli T., Kunkel T., Long B., Ballestad A., Damberg G. Calibrated image appearance reproduction. ACM Trans. Graph. 2012;31(6).

Reinhard E., Stark M., Shirley P., Ferwerda J. Photographic tone reproduction for digital images. In: Proc. of SIGGRAPH ’02. New York, NY: ACM; 2002.

Silverstein D., Farrell J. Efficient method for paired comparison. J. Electron. Imaging. 2001;10.

Stone J.V. Vision and Brain: How We Perceive the World. Cambridge, MA: MIT Press; 2012.

Tocci M.D., Kiser C., Tocci N., Sen P. A versatile HDR video production system. ACM Trans. Graph. 2011;30(4).

Van Hateren J.H. Encoding of high dynamic range video with a model of human cones. ACM Trans. Graph. 2006;25.

Villa C., Labayrade R. Psychovisual assessment of tone-mapping operators for global appearance and colour reproduction. In: Proc. of Colour in Graphics Imaging and Vision 2010, Joensuu, Finland. 2010.

Ward G. A contrast-based scalefactor for luminance display. In: Heckbert P.S., ed. Graphics Gems IV. San Diego, USA: Academic Press Professional, Inc. 1994.

Ward Larson G., Rushmeier H., Piatko C. A visibility matching tone reproduction operator for high dynamic range scenes. In: ACM SIGGRAPH 97 Visual Proceedings: The Art and Interdisciplinary Programs of SIGGRAPH '97, SIGGRAPH '97. New York, NY: ACM; 1997.

Yeganeh H., Wang Z. Objective quality assessment of tone-mapped images. IEEE Trans. Image Process. 2013;22(2):657–667.

Yoshida A., Blanz V., Myszkowski K., Seidel H.P. Perceptual evaluation of tone mapping operators with real world scenes. In: Proc. of SPIE Human Vision and Electronic Imaging X, San Jose, CA. 2005;5666.

Yoshida A., Mantiuk R., Myszkowski K., Seidel H.P. Analysis of reproducing real-world appearance on displays of varying dynamic range. Comput. Graph. Forum. 2006;25(3).
