Chapter 4

Multiview HDR Video Sequence Generation

R.R. Orozco*; C. Loscos†; I. Martin*; A. Artusi*
* University of Girona, Girona, Spain
† University of Reims Champagne-Ardenne, Reims, France

Abstract

The convergence between high dynamic range (HDR) images/video and stereoscopic/3D images/video is an active research field, with limitations coming from various aspects: acquisition, processing, and display. This work analyzes the latest advances in HDR video and stereo HDR video acquisition from multiple exposures in order to highlight the current progress towards a common target: 3D HDR video. The most relevant existing techniques are presented and discussed.

Keywords

Multiexposure high dynamic range generation; Stereoscopic high dynamic range; Multiexposure stereo matching; High dynamic range video

4.1 Introduction

There is a huge gap between the range of lighting perceived by the human visual system and what conventional digital cameras and displays are able to capture and represent respectively. High dynamic range (HDR) imaging aims at reducing this gap by capturing a more realistic range of light values, such that dark and bright areas can be recorded in the same image or video. Visually, this avoids underexposure and overexposure in such areas.

The combination of the latest advances in different digital imaging areas, such as 4K color video resolution, stereo- and multiscopic video, and HDR imaging, promises an unprecedented video experience for users. However, major challenges of different natures remain to be overcome before such technologies converge. In particular, the whole pipeline of acquisition, compression, transmission, and display of HDR imagery presents unsolved problems that need to be addressed before we can enjoy HDR films on a TV at home. One example is enough to illustrate such challenges: 1 min of HDR video at 30 frames per second and full high-definition resolution requires 42 GB of storage, four times more than traditional digital low dynamic range (LDR) video (Chalmers and Debattista, 2011). Among such challenges, the extension from HDR images to HDR video and stereo HDR video has a very important role.

Techniques for HDR video acquisition are evolving. There are two main approaches to capturing HDR images: HDR sensors or multiple exposures with LDR sensors. Some HDR camera prototypes have lately been presented to the research community (Nayar and Branzoi, 2003; Chalmers et al., 2009; Tocci et al., 2011), but they are not yet available for commercial use. Their commercial counterparts, such as the Viper camera (ThomsomGrassValley, 2005), the Phantom HD camera (Research, 2005), and the Red Epic camera (RedCompany, 2006), are few in number, the range they address remains limited (below 16 f-stops), and they are still unaffordable for most users. Diverse alternatives have been presented to acquire HDR values with conventional cameras. Chapter 2 discussed some techniques based on RAW data acquired directly from the camera sensor and hardware prototypes designed to increase the range of light captured in each frame.

Chapter 3 introduced some computational procedures to recover HDR images from different LDR exposures. Merging the different exposures is a common problem in HDR video and stereoscopic HDR video when the LDR images do not superpose exactly. Pixels are misaligned: pixels corresponding to the same object in the scene do not occupy the same position in all the images. A matching process is thus required to find correspondences between pixels in the different images that represent the same object in the scene.

In this chapter we present the various solutions proposed to go from images to sequences with changes in content and/or viewpoint. One key issue is pixel registration in differently exposed sequences, which makes it possible to create HDR videos or multiview (including stereo) image sequences.

The chapter is divided into three main sections. In Section 4.2, we classify the possible types of misalignment according to the position of the camera and the scene content. Section 4.3 is dedicated to multiexposure HDR video generation from a free-path camera. Section 4.4 is dedicated to multiscopic HDR video.

4.2 HDR and Stereo HDR Video Acquisition

Chapter 3 presented the idea that merging multiple exposures allows one to recover HDR images. Multiple exposures can be acquired in different ways. One is to vary the exposure time so that multiple exposures of the same scene contribute to each video frame. Using deghosting algorithms to produce one HDR image per frame and thus obtain HDR video is not straightforward. Several constraints need to be considered to decide whether an algorithm for HDR image generation can be extended to multiple-exposure video sequences. Even the kind of camera we use is important, because some camera application programming interfaces do not allow exposure times to be adjusted. This section is dedicated to analyzing the two main aspects that need to be considered: temporal and spatial issues in HDR acquisition.

4.2.1 Temporal Considerations

The typical frame rate to play video is around 24 frames per second. This means that we need to create at least 24 HDR frames per second. Unlike for still HDR images, we cannot simply capture long exposures to increase the dynamic range. The sum of the exposure times for each frame must be small enough to guarantee video frame rates. The difference in exposure times between consecutive frames is also limited so that temporal coherence is not compromised. Such limitations must be considered before we can extend methods designed for image deghosting, some of which are suitable for generating only one HDR image from each multiexposed sequence. In the video context, the use of long exposure times to increase the dynamic range is not always possible without compromising video frame rates.
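To make this timing constraint concrete, the short sketch below (Python, with illustrative shutter speeds that are assumptions, not values from any cited system) checks whether a hypothetical exposure bracket fits within the frame budget at 24 frames per second; sensor readout time would tighten the budget further.

```python
# Check whether a multiexposure bracket fits within one video frame period.
# The bracket values are illustrative assumptions, not from a specific camera.

FPS = 24.0
frame_budget = 1.0 / FPS  # ~41.7 ms available per HDR frame

# Hypothetical bracket: low, medium, and high exposures (seconds)
bracket = [1.0 / 500, 1.0 / 125, 1.0 / 30]

total = sum(bracket)
print(f"Frame budget: {frame_budget * 1000:.1f} ms")
print(f"Bracket total: {total * 1000:.1f} ms")
print("Fits in one frame" if total <= frame_budget else "Too slow for video rates")
```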

Besides, stereo HDR video requires at least two views per frame, while some multiscopic displays may accept more than nine different views. Computational approaches in this case are preferred to the use of multiple expensive HDR cameras. To achieve HDR in stereoscopy, we need to generate one HDR image per view for each stereo HDR frame. Most proposed approaches recover HDR values from differently exposed stereo LDR images. Synchronization in capturing the different views is one supplementary key consideration.

Thus, four solutions can be considered:

1. All views have the same exposure time at each take. The exposure varies from one take to another. While synchronization is simplified here, the difficulty is to reconstruct temporally coherent HDR information in overexposed and underexposed areas.

2. Each view has a different exposure time. While finding HDR information in all areas of the images is ensured, synchronization is problematic: either each objective waits for the others before taking the next frame or frames need to be synchronized afterward.

3. Bonnard et al. (2012) proposed an alternative to the first two solutions. It involves placing neutral density filters on the camera objectives in order to simulate different exposure times. An advantage of this is that all objectives use the same exposure time even if the resulting images simulate different exposures. Synchronization is thus reduced to synchronizing the different objectives. A major drawback arises from the fact that each view takes the same exposure at each frame. Underexposed or overexposed areas will remain as such through the entire video.

4. Other alternatives have been proposed to acquire, for each objective, several exposures at once. This can be done, for example, through a beam splitter (Tocci et al., 2011) or a spatially varying mask (Nayar and Branzoi, 2003). Optical elements can split light beams onto different sensors with different exposure settings (Tocci et al., 2011).

4.2.2 Spatial Considerations

While LDR images are being taken for reconstruction of HDR information, spatial variation can occur. In the following we classify these variations according to their cause and the misalignment they produce (Section 4.2.2.1). The types of misalignment are discussed in Section 4.2.2.2. Finally, we review the different approaches specifically designed to manage misalignment of multiexposed LDR images for HDR generation (Section 4.2.2.3).

4.2.2.1 Camera versus scene movement

Misalignment can be classified into different categories according to the kind of movement of the camera and the objects in the scene. Table 4.1 shows the types of possible misalignment present in multiple exposures for HDR generation.

Table 4.1

Different Configurations of the Camera and Scene

  | Camera             | Scene          | Misalignment       | HDR Video
1 | Static             | Static         | None               | Time lapse
2 | Static             | Dynamic        | Local              | Camera constrained
3 | Free path          | Static         | Global             | Scene constrained
4 | Free path          | Dynamic        | Local and global   | General case
5 | Stereo/multiscopic | Static/dynamic | Constrained global | Stereo/multiscopic


A static camera refers to a camera fixed to a tripod or any other support that keeps the objective still during the acquisition. The free-path classification considers camera movement: the camera is either handheld or supported by an articulated device, and therefore follows an arbitrary trajectory. Stereo or multiscopic acquisition includes stereo pairs of cameras or camera rigs composed of three or more horizontally aligned positions of one or more cameras, capturing respectively two or more views of the same scene.

The scene is classified as dynamic if any object moves during the acquisition, no matter what the amount of movement or the size of the object is. Otherwise the scene is considered as static.

In every case it is possible to produce an HDR video. For a static scene captured by a static camera (first row in Table 4.1), it is possible to repeat the acquisition at a given time step and combine the resulting HDR frames into a time-lapse video. Time lapse is a technique in which frames are captured significantly more slowly than video frame rates; when the sequence is played at video frame rates, time appears to pass faster. This is often used to capture a natural phenomenon such as a sunrise or sunset, or in the animation industry. Camera- and scene-constrained sequences are, in general, easier to align than the general case since only one kind of misalignment takes place. These three cases (second, third, and fourth rows in Table 4.1) are used in film production. The last case (fifth row in Table 4.1), concerning stereo and multiscopic cameras, has become very popular since stereo and autostereoscopic displays appeared on the market.

4.2.2.2 Local and/or global misalignment

Misalignments as defined in the previous section can be categorized in three different types:

1. Global misalignment is the consequence of camera motion (change in position) and affects all pixels in the image (third and fifth rows in Table 4.1). It is common in exposure sequences acquired with handheld cameras, although it is possible to find misalignment even in sequences acquired with tripods (camera shake when the shutter mechanism is activated). Between consecutive pairs of images, it is generally a small movement corresponding to translations or rotations. It may cause ghosting in the resulting HDR image, but some efficient techniques have been proposed to correct this misalignment. However, even for small movements, problems of object occlusion and parallax can be difficult to solve.

2. Local misalignment is produced by dynamic objects in the scene and affects only certain areas inside the image (second row in Table 4.1). Capturing a set of LDR images takes at least the sum of the exposure times of the individual pictures. This time is enough to introduce differences in the positions of a dynamic object in the scene. In this case, some areas occluded in some images may be visible in others. Depending on the speed of the dynamic object and the kind of movement, it may produce important differences between the inputs.

3. Local and global misalignment combines the two previous types and concerns the fourth row in Table 4.1. When a camera follows a free path to record a dynamic scene, each frame contains both local and global misalignment. Pixels in the image are affected by transformations of different nature.

Fig. 4.1 shows an example of movement in a common multiexposure sequence. Notice that in stereo pairs (Fig. 4.1A and B), both images were acquired at the same time by two different sensors. Even if the scene is dynamic, both images correspond to the same time and no local misalignment is possible. The only misalignment possible is global, due to changes in the perspective from the two points of view.

Figure 4.1 Different exposure frames of a stereo pair (A, B) from Middlebury (2006) and a dynamic video sequence captured with a handheld camera (C, D).

4.2.2.3 Generated HDR content

If both the scene and the camera remain static during the acquisition, the multiple exposures are aligned. The only possible result is an HDR image. However, even in this case, time-lapse HDR video can be produced by combining HDR frames acquired from static multiexposed LDR images. In such a case, any of the existing techniques for static images (Mann and Picard, 1995; Debevec and Malik, 1997; Mitsunaga and Nayar, 1999) can be used to recover radiance values and merge them into HDR images.

In cases where the camera remains static recording a dynamic scene, it is possible to detect the areas affected by dynamic objects in the scene and treat them locally. Several techniques have been presented for motion detection in the context of HDR image generation (Khan et al., 2006; Jacobs et al., 2008; Gallo et al., 2009; Pece and Kautz, 2010; Heo et al., 2011; Orozco et al., 2012, 2013, 2014), many of them already discussed in Chapter 3.

In the opposite case (static scene and dynamic camera), the movement between consecutive frames is very small. Some computationally efficient methods were proposed to solve such misalignment (Ward, 2003; Grosch, 2006; Skala, 2007). The most difficult case is when both the camera and the scene move. In such a case, dense correspondences between frames are required.

In a handheld video sequence (Fig. 4.1C and D), the images correspond to different time instants. Every pixel in the image is affected by global misalignment due to changes in the position of the camera, and some pixels are also affected by local misalignment due to dynamic objects, like the boat in this figure.

The rest of this chapter is dedicated to analyzing the most important approaches to solving the misalignment described in Table 4.1. Section 4.3 focuses on the cases corresponding to the second to fourth rows, whereas Section 4.4 analyzes existing solutions for the case described in the fifth row.

4.3 Free-Path Single Camera

Most approaches for deghosting in HDR imaging are based on selecting the best exposure as a reference image and solving the misalignment only with respect to that reference. Extending such algorithms to HDR video by repeating the process, taking each frame in a multiexposed video sequence as a reference (see Fig. 4.2), is not always possible. Most of these methods fail if the reference contains large overexposed or underexposed regions, and they do not pay attention to temporal coherence. This is an important drawback for extending such methods to HDR video generation. It is fundamental that the HDR frames in the resulting sequence are free of ghosting and look alike; otherwise, flickering will be noticeable in the resulting HDR video.

Figure 4.2 Multiexposure video sequence alternating three different exposures.

4.3.1 Multiexposure Acquisition Setup

In HDR images, the number of exposures and the shutter speed used for each LDR image vary depending on the scene conditions. It is common to take several exposures and to use long shutter speeds to get details in shadow areas of dark scenes. In contrast, for brighter scenes, less light exposure and thus faster shutter speeds are preferable. However, as discussed in Section 4.2, it is difficult to implement this approach in the HDR video context because of timing constraints.

Similarly to the auto exposure control of digital still cameras, digital video cameras have a function called auto gain control (AGC) that measures the brightness distribution of the scene and calculates the best exposure time for the scene conditions. This function provides an optimal exposure value for the scene.

The ratios of light exposure are measured in f-stops. Each f-stop corresponds to a factor of 2 (doubling or halving the exposure value). The most common approach is to capture additional exposures at fixed multiples of this medium exposure value (eg, ±2 f-stops) to obtain high and low exposures. Many authors alternate only two exposures (low and high) to generate the HDR video frames (Kang et al., 2003; Sand and Teller, 2004; Mangiat and Gibson, 2010), while the most recent approach (Kalantari et al., 2013) alternates three exposures (low, medium, and high) to generate one HDR frame per captured frame (see Fig. 4.2).
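As a rough illustration of this bracketing logic, the following sketch derives low and high shutter speeds ±2 f-stops around a mid exposure provided by an AGC-like routine; the mid exposure and the offsets are assumed values, not taken from any cited system.

```python
# Derive a low/medium/high exposure bracket around an AGC-provided mid exposure.
# The mid exposure and the +/- 2 f-stop offsets are illustrative assumptions.

def bracket_from_mid(mid_exposure_s, stops=2):
    """Each f-stop doubles or halves the exposure time."""
    low = mid_exposure_s / (2 ** stops)   # darker image, shorter shutter
    high = mid_exposure_s * (2 ** stops)  # brighter image, longer shutter
    return low, mid_exposure_s, high

low, mid, high = bracket_from_mid(1.0 / 125)  # assume the AGC suggests 1/125 s
print(f"low: 1/{1/low:.0f} s, mid: 1/{1/mid:.0f} s, high: 1/{1/high:.0f} s")
# -> low: 1/500 s, mid: 1/125 s, high: 1/31 s
```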

4.3.2 Per-Frame HDR Video Generation

To our knowledge, Kang et al. (2003) proposed the first method to extend multiple exposure image methods to video sequences. Their system uses a programmed video camera that temporally alternates long and short exposures. On the basis of the AGC function, this method calculates adaptive exposure values at each frame depending on the scene lighting conditions. The ratio between exposures of consecutive frames can range from 1 f-stop to a maximum of 16 f-stops.

Every HDR frame for a given time t_i is generated using information from the adjacent frames at t_{i-1} and t_{i+1}. The short-exposure frame is reexposed with the long-exposure value. Once the images have been transformed to the same exposure, motion estimation is performed with respect to the two adjacent images. It consists of two steps:

1. Global registration by estimation of an affine transform between them.

2. Gradient-based optical flow to determine a dense motion field for local correction (a minimal sketch of both steps follows this list).
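The sketch below is a rough approximation of this two-step registration using OpenCV, not the authors' implementation: an affine transform estimated by ECC maximization for the global step, followed by Farnebäck dense optical flow for the local correction. The function choices and parameters are assumptions.

```python
import cv2
import numpy as np

def register_pair(ref_gray, src_gray):
    """Approximate global-then-local registration between two exposures that
    have already been mapped to comparable brightness (eg, the short exposure
    boosted to the long one). Inputs are single-channel 8-bit images."""
    # 1. Global registration: estimate an affine transform via ECC maximization.
    #    (Some OpenCV builds also require the inputMask and gaussFiltSize args.)
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    _, warp = cv2.findTransformECC(ref_gray, src_gray, warp,
                                   cv2.MOTION_AFFINE, criteria)
    h, w = ref_gray.shape
    globally_aligned = cv2.warpAffine(
        src_gray, warp, (w, h),
        flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)

    # 2. Local correction: dense gradient-based optical flow (Farneback),
    #    arguments are pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(ref_gray, globally_aligned, None,
                                        0.5, 4, 21, 3, 5, 1.1, 0)

    # Remap the globally aligned image according to the dense flow field.
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(globally_aligned, map_x, map_y, cv2.INTER_LINEAR)
```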

In the regions where the current frame is well exposed, images are merged by a weighted function to prevent ghosting. For the overexposed or underexposed regions, the previous/next frames are bidirectionally interpolated with use of optical flow and a hierarchical homography algorithm.

Despite the novelty of this work and the promising results for some scenes, gradient-based optical flow is not accurate enough to find forward/backward flow fields. Boosting the short exposure to the long one will increase the noise, details such as edges may be lost, and slight variations of brightness may persist because of inaccuracies in the camera response function (CRF). This may produce ghosting and errors in registration for fast nonrigid moving objects.

Sand and Teller (2004) proposed an algorithm to register two different video sequences of the same scene. HDR video is one of its most direct applications, since differently exposed videos can be matched with their method. The matching search starts by selecting feature points and comparing the selected features of the two images. A matching cost is evaluated at selected feature points using two terms:

1. Pixel consistency: instead of comparing equal pixels or patches in the two images, they compare a single pixel in the reference image with a 3 × 3 patch in the source image. Correspondence is evaluated within a window around each pixel and pixel matching probabilities are assigned.

2. Motion regression and consistency: to determine how well a particular correspondence is consistent with its neighbors.

The matches obtained are used to fit motion predictions that are refined in a regression process. After high-likelihood correspondences have been found, a locally weighted linear regression is used to interpolate and extrapolate correspondences for the remaining pixels, yielding a dense correspondence field. This scheme is extended to all frame pairs of the video sequence. The method offers very good results for highly textured scenes but poor results otherwise. Processing each pair of frames might take up to 1.31 s, and full video matching might take several minutes for each second of video input.

Mangiat and Gibson (2010) propose improving the method of Kang et al. (2003) by using a block-based motion estimation algorithm. They also work with a video sequence that alternates two exposure values. They use a CRF recovered from a sequence of 12 static exposures with the method presented by Debevec and Malik (1997). The short exposure is boosted with the CRF to match the long exposure.

They use the H.264/AVC reference software (Sühring, 2008) to calculate block-based forward and backward motion vectors for each frame with respect to the adjacent ones. However, such estimation is likely to fail in saturated areas. A second step of bidirectional motion estimation is performed to fill in the saturated areas with information from the previous and next frames. The cost function is the sum of absolute differences (SAD), with an additional cost term that relates the motion vectors estimated for adjacent frames. Block-based motion estimation is prone to artifacts such as discontinuities at block boundaries. Differences between the images in the radiance domain are detected and assumed to be artifacts. Such pixels are treated as holes and replaced by pixels on the contour of those areas. Nevertheless, poorly registered pixels may still pass to the HDR merging step. The authors therefore propose the use of a cross-bilateral filter to treat the tone-mapped HDR image using edge information at each frame. Despite the different attempts to avoid artifacts, the problem of fast motion (eg, eyes blinking) remains unsolved, and the filtering step, executed on the tone-mapped images, cannot be used for HDR displays.
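As an illustration of block-based motion estimation with a SAD cost, the following numpy sketch performs a generic full-search block matching (not the H.264 reference software used by the authors): for each block of the reference frame it finds the displacement into the target frame that minimizes the SAD within a search window.

```python
import numpy as np

def block_motion_estimation(ref, tgt, block=16, search=8):
    """Full-search block matching: for each block of `ref`, find the
    displacement into `tgt` minimizing the sum of absolute differences.
    Generic illustration; real encoders use much faster search strategies."""
    h, w = ref.shape
    vectors = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            patch = ref[by:by + block, bx:bx + block].astype(np.float32)
            best, best_dv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate block falls outside the frame
                    cand = tgt[y:y + block, x:x + block].astype(np.float32)
                    sad = np.abs(patch - cand).sum()
                    if sad < best:
                        best, best_dv = sad, (dy, dx)
            vectors[by // block, bx // block] = best_dv
    return vectors
```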

Sen et al. (2012) recently presented a method based on a patch-based energy minimization that integrates alignment and reconstruction in a joint optimization for HDR image synthesis. Their method relies on a patch-based nearest-neighbor search proposed by Barnes et al. (2009) and a multisource bidirectional similarity measure inspired by Simakov et al. (2008). This method allows the production of an HDR result that is aligned to one of the exposures and contains information from all the remaining exposures. The results are very accurate but dependent on the quality of the reference image. Artifacts may appear if the reference image has large underexposed or saturated areas.

Kalantari et al. (2013) proposed a new approach for HDR reconstruction from multiexposure video sequences, built on the patch-based synthesis for HDR images of Sen et al. (2012) combined with optical flow. Instead of alternating two exposures, they propose using three different exposure values. The HDR video reconstruction is guided by an energy function that includes terms that map the LDR images into the radiance domain, ensure the similarity of the resulting HDR values with the LDR reference, and enforce temporal coherence. They use the sum of squared differences (SSD) to compare two patches.

Their algorithm uses optical flow for an initial motion estimation step. This estimation helps to compute a window size that constrains the patch-match search and vote step. The results are very accurate for detailed regions in the scene. However, slight flickering appears in large areas of uniform color.

4.4 Multiscopic HDR Video

In this section we address the generation of HDR images for more than two views. We first review the basics of stereoscopic imaging (Section 4.4.1) and epipolar geometry (Section 4.4.2), before we discuss recent contributions to the generation of multiscopic HDR images (Section 4.4.3).

4.4.1 Stereoscopic Imaging

Besides a huge number of colors and a great amount of fine detail, our visual system is able to perceive depth and the 3D shape of objects. Traditional images offer a representation of reality projected in two dimensions. We can guess the distribution of objects in depth thanks to monoscopic cues such as perspective, but we cannot actually perceive depth. Our brain needs to receive two slightly different images to actually perceive depth.

Stereoscopy enables depth perception. Stereo images refer to a pair of images acquired with two cameras horizontally aligned and separated by a distance similar to the distance between our eyes. Stereoscopic display systems project them in such a way that each eye perceives only one of the images. In recent years, stereoscopic video technologies such as stereoscopic video cameras and stereoscopic displays have become available to consumers (Mendiburu et al., 2012; Dufaux et al., 2013; Urey et al., 2011). Stereo video requires the recording of at least two views of a scene, one for each eye. Some autostereoscopic displays render more than nine different views for an optimal experience (Lucas et al., 2013).

Some prototypes have been proposed to acquire stereo HDR content from two or more differently exposed views. Most approaches (Troccoli et al., 2006; Lin and Chang, 2009; Sun et al., 2010; Rufenacht, 2011; Bätz et al., 2014; Akhavan et al., 2014) are based on a rig of two cameras placed in a conventional stereo configuration that captures differently exposed images. Section 4.4.3 focuses on analyzing the different existing approaches for multiscopic HDR acquisition.

4.4.2 Epipolar Geometry

Stereo images permit a viewer to see depth. The geometry that relates 3D objects to their 2D projections in stereo vision is known as epipolar geometry. Depth can be retrieved mathematically from a pair of images with use of epipolar geometry. Fig. 4.3 describes the main components of epipolar geometry. A point x in 3D world coordinates is projected onto the left and right images I_L and I_R, respectively. c_L and c_R are the two centers of projection of the cameras; the plane formed by them and the point x is known as the epipolar plane. x_L and x_R are the projections of x in I_L and I_R, respectively.

Figure 4.3 Main elements of the epipolar geometry: (A) epipolar geometry; (B) epipolar geometry rectified.

For any point x_L in the left image, the distance to x is unknown. According to the epipolar geometry, the corresponding point x_R is located somewhere on the right epipolar line. Epipolar geometry does not give a direct correspondence between pixels. However, it reduces the search for a matching pixel to a single epipolar line.

If the image planes are aligned and their optical axes are made parallel, corresponding epipolar lines in the left and right images become horizontal and collinear. This process is known as “rectification.” After rectification, the search space for a pixel match is reduced to the same image row.
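To illustrate the epipolar constraint, the numpy sketch below computes the epipolar line l = F x_L in the right image for a point in the left image and checks x_R^T F x_L ≈ 0 for a candidate match. The fundamental matrix shown is the one of an ideally rectified pair (an assumed example, not from any dataset); in practice F would come from calibration or robust estimation.

```python
import numpy as np

def epipolar_line(F, x_left):
    """Line l = F @ x_left (homogeneous coefficients a, b, c of a*x + b*y + c = 0)
    on which the match of x_left must lie in the right image."""
    x = np.array([x_left[0], x_left[1], 1.0])
    return F @ x

def satisfies_epipolar(F, x_left, x_right, tol=1e-3):
    """Check the epipolar constraint x_R^T F x_L ~= 0 for a candidate match."""
    xl = np.array([x_left[0], x_left[1], 1.0])
    xr = np.array([x_right[0], x_right[1], 1.0])
    return abs(xr @ F @ xl) < tol

# Fundamental matrix of an ideally rectified pair (horizontal epipolar lines).
F_rectified = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])

line = epipolar_line(F_rectified, (120.0, 64.0))   # -> (0, -1, 64): the row y = 64
print(satisfies_epipolar(F_rectified, (120.0, 64.0), (97.0, 64.0)))  # True: same row
```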

To the best of our knowledge, all methods concerned with stereoscopic HDR content take advantage of the epipolar constraint during the matching process. Rectified image sets are available on the Internet for testing purposes (Middlebury, 2006).

4.4.3 Multiple-Exposure Stereo Matching

4.4.3.1 Problem formulation

The use of stereo matching, or disparity estimation, for pixel matching on differently exposed stereo/multiview images is not straightforward. Stereo matching is the process of finding the pixels in the multiscopic views that correspond to the same 3D point in the scene. The rectified epipolar geometry simplifies this process: it is not necessary to calculate the 3D point coordinates, because the corresponding pixel lies on the same row of the other image. The disparity is the distance d between a pixel and its horizontal match in the other image. Akhavan et al. (2013, 2014) compared the different ways of obtaining disparity maps from HDR, LDR, and tone-mapped stereo images, offering a useful comparison that illustrates how HDR input can have a significant impact on the quality of the result.

Fig. 4.4 shows an example of a differently exposed multiview set corresponding to one frame in a multiscopic system of three views. The main goal of stereo matching is to find the correspondences between pixels to generate one HDR image per view for each frame.

Figure 4.4 “Aloe” set of LDR multiview images from Middlebury (2006): (A) multiscopic different exposures; (B) multiscopic tone-mapped HDR images.

Correspondence methods rely on matching cost functions to compute the similarity of images. It is important to consider that even when one uses radiance space images, there might be brightness differences between views. Such differences may be introduced by the camera because of image noise, slightly different settings, or vignetting. For a good analysis and comparison of existing matching costs and their properties, see Scharstein and Szeliski (2002), Hirschmuller and Scharstein (2009), and Bonnard et al. (2014).

Many approaches have been presented to recover HDR from multiview and multiexposed sets of images. Some of them (Troccoli et al., 2006; Lin and Chang, 2009; Sun et al., 2010) share the same pipeline as in Fig. 4.5. All the work mentioned takes as input a set of images with different exposures acquired with a camera with an unknown response function. In such cases, the disparity maps need to be calculated in the first instance from LDR pixel values. Matching images under large brightness differences is still a big challenge in computer vision.

Figure 4.5 General multiexposed stereo pipeline for stereo HDR. Proposed by Troccoli et al. (2006), used by Sun et al. (2010) and Lin and Chang (2009), and modified later by Bätz et al. (2014).

4.4.3.2 Per frame CRF recovery methods

To our knowledge, Troccoli et al. (2006) proposed the first technique for HDR recovery from multiscopic images of different exposures. They observed that the normalized cross-correlation (NCC) is approximately invariant to exposure changes when the camera has a gamma response function. Under such an assumption, they used the algorithm described by Kang and Szeliski (2004) to compute the depth maps that maximize the correspondence between one pixel and its projection in the other image. The original approach proposed by Kang and Szeliski (2004) used the SSD, but it was replaced by the NCC. Eqs. (4.1) and (4.2) show how to calculate the SSD and NCC, respectively, for two image patches of N pixels centered at q and p in images I_L and I_R.

Images are warped to the same viewpoint with the depth map. Once pixels have been aligned, the CRF is calculated with the method proposed by Grossberg and Nayar (2003) over a selected set of matches. The problem that arises from the use of the NCC is that it leaves an ambiguity among equivalent gamma response functions. With the CRF and the exposures, all images are transformed to radiance space and the matching process is repeated, this time with the SSD. The new depth map improves the previous one and helps to correct artifacts. The warping is updated and HDR values are calculated by a weighted average function.

SSD = \sum_{q,p \in N} \bigl( I_L(q) - I_R(p) \bigr)^2,    (4.1)

NCC = \frac{\sum_{q,p \in N} I_L(q)\, I_R(p)}{\sqrt{\sum_{q,p \in N} I_L(q)^2 \, \sum_{q,p \in N} I_R(p)^2}}.    (4.2)
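A direct numpy transcription of Eqs. (4.1) and (4.2) for a single pair of patches is sketched below (illustrative only; patch extraction and the search over candidate disparities are omitted).

```python
import numpy as np

def ssd(patch_l, patch_r):
    """Sum of squared differences between two equally sized patches (Eq. 4.1)."""
    d = patch_l.astype(np.float64) - patch_r.astype(np.float64)
    return np.sum(d * d)

def ncc(patch_l, patch_r, eps=1e-12):
    """Normalized cross-correlation between two patches (Eq. 4.2);
    approximately invariant to exposure changes under a gamma response."""
    a = patch_l.astype(np.float64).ravel()
    b = patch_r.astype(np.float64).ravel()
    return np.sum(a * b) / (np.sqrt(np.sum(a * a) * np.sum(b * b)) + eps)
```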

The same problem was addressed by Lin and Chang (2009). Instead of the NCC, they use scale-invariant feature transform (SIFT) descriptors to find matches between the LDR stereo images. Because SIFT is not robust to exposure differences, only the matches that are coherent with the epipolar and exposure constraints are selected for the next step. The selected pixels are used to calculate the CRF.

The stereo matching algorithm they propose is based on previous work (Sun et al., 2003): belief propagation is used to calculate the disparity maps. The stereo HDR images are then calculated by means of a weighted average function. Even in the best cases, SIFT is not robust enough when there are significant exposure variations.

A ghost removal technique is used afterward to treat the artifacts due to noise or stereo mismatches. The HDR image is reexposed to match the best exposure, the difference between the two is calculated, and pixels over a threshold are rejected as mismatches. This is risky because HDR values in areas that are underexposed or overexposed in the best exposure may be rejected. The problem of ghosting would be solved, but LDR values may be introduced into the resulting HDR image.

Sun et al. (2010) (inspired by Troccoli et al., 2006) also follow the pipeline described in Fig. 4.5. They assume that the disparity map between two rectified stereo images can be modeled as a Markov random field, and the matching problem is presented as a Bayesian labeling problem. The optimal label (disparity) values are obtained by minimization of an energy function composed of a pixel dissimilarity term (the NCC in their solution) and a disparity smoothness term. It is minimized with the graph cut algorithm to produce initial disparities. The best disparities are selected to calculate the CRF with the algorithm proposed by Mitsunaga and Nayar (1999). Images are converted to radiance space and another energy minimization is performed to remove artifacts, this time with the pixel dissimilarity cost computed as the Hamming distance between candidates.

The methods presented so far have a high computational cost. Calculating the CRF from nonaligned images may introduce errors because the matching between them may not be robust. Two exposures are not enough to obtain a robust CRF with existing techniques. Some of them execute two passes of the stereo matching algorithm, the first one to detect matches for the CRF recovery and the second one to refine the matching results. One might avoid this by calculating the CRF in a previous step using multiple exposures of static scenes. Any of the available techniques (Mann and Picard, 1995; Debevec and Malik, 1997; Mitsunaga and Nayar, 1999; Grossberg and Nayar, 2003) can be used to get the CRF corresponding to each camera. The curves help to transform pixel values into radiance for each image, and the matching process is executed in radiance space images. This avoids one stereo matching step and prevents errors introduced by disparity estimation and image warping.

4.4.3.3 Offline CRF recovery methods

Bonnard et al. (2012) proposed a method to create content that combines depth and HDR video for autostereoscopic displays. Instead of varying the exposure times, they use neutral density filters to capture different exposures. A camera with eight synchronized objectives and three pairs of 0.3, 0.6, and 0.9 neutral density filters, plus two nonfiltered views, provides eight views with four different exposures of the scene, stored in 10-bit RAW files. They used a geometry-based approach to recover depth information from epipolar geometry, and the depth maps drive the pixel matching procedure.
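For context, the sketch below converts neutral density filter optical densities into transmission factors and equivalent f-stop reductions using the generic relation transmission = 10^(-density); the 0.3/0.6/0.9 values mirror those of the setup above, but the code itself is only an illustration of the relation, not part of the cited work.

```python
import math

def nd_filter_stops(optical_density):
    """Equivalent light reduction of an ND filter.
    Transmission = 10**(-density); stops = log2(1 / transmission)."""
    transmission = 10.0 ** (-optical_density)
    stops = math.log2(1.0 / transmission)
    return transmission, stops

for density in (0.3, 0.6, 0.9):
    t, s = nd_filter_stops(density)
    print(f"ND {density}: transmits {t:.1%} of the light, ~{s:.1f} f-stops")
# ND 0.3 -> ~1 stop, ND 0.6 -> ~2 stops, ND 0.9 -> ~3 stops
```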

Bätz et al. (2014) presented a workflow for disparity estimation divided into the following steps:

• Cost initialization consists of evaluating the cost function for all values within a disparity search range. They use the zero-mean normalized cross-correlation (ZNCC), defined in Eq. (4.3):

ZNCC = \frac{\sum_{q,p \in N} \bigl( I_L(q) - \bar{I}_L(q) \bigr) \bigl( I_R(p) - \bar{I}_R(p) \bigr)}{\sqrt{\sum_{q,p \in N} \bigl( I_L(q) - \bar{I}_L(q) \bigr)^2 \, \sum_{q,p \in N} \bigl( I_R(p) - \bar{I}_R(p) \bigr)^2}}.    (4.3)

The matching is performed on the luminance channel of the radiance space images with patches of 9 × 9 pixels. The result of the disparity search is the disparity space image (DSI), a matrix of size m × n × (d + 1) for an image of m × n pixels, where d + 1 is the size of the disparity search range (a minimal cost-volume sketch follows this list).

• Cost aggregation is done to smooth the DSI and find the actual disparity of each pixel in the image. They use an improved version of the cross-based aggregation method described by Mei et al. (2011). This step is performed on the actual RGB images, not on the luminance channel as in the previous step.

• Image warping is responsible for actually shifting all pixels according to their disparities. The main challenge is how to deal with occluded areas between the images. Bätz et al. (2014) propose doing the warping on the original LDR images, which adds a new challenge: dealing with underexposed and overexposed areas. A backward image warping is chosen to implicitly avoid the saturation problems. The algorithm produces a new warped image with the appearance of the reference one by using the target image and the corresponding disparity map. Bilinear interpolation is used to retrieve values at subpixel precision.
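A minimal sketch of the cost-initialization step under these assumptions (ZNCC over the luminance channel, a fixed disparity search range, 9 × 9 patches; aggregation and warping omitted) might look as follows. It is a brute-force illustration, not the authors' implementation.

```python
import numpy as np

def build_dsi(lum_left, lum_right, max_disp, half=4):
    """Disparity space image: ZNCC similarity for every pixel and every candidate
    disparity in [0, max_disp]. Patches are (2*half+1)^2, eg, 9x9 for half=4.
    Higher ZNCC is better, so winner-takes-all uses argmax over the last axis."""
    h, w = lum_left.shape
    dsi = np.full((h, w, max_disp + 1), -1.0, dtype=np.float32)  # ZNCC in [-1, 1]
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = lum_left[y - half:y + half + 1, x - half:x + half + 1]
            ref = ref - ref.mean()
            ref_norm = np.sqrt((ref * ref).sum()) + 1e-12
            for d in range(0, min(max_disp, x - half) + 1):
                cand = lum_right[y - half:y + half + 1,
                                 x - d - half:x - d + half + 1]
                cand = cand - cand.mean()
                cand_norm = np.sqrt((cand * cand).sum()) + 1e-12
                dsi[y, x, d] = (ref * cand).sum() / (ref_norm * cand_norm)
    return dsi

# Winner-takes-all disparities (before any cost aggregation):
# disparity = np.argmax(build_dsi(L, R, 64), axis=2)
```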

Selmanovic et al. (2014) propose generating stereo HDR video from a pair of HDR and LDR videos, using an HDR camera (Chalmers et al., 2009) and a traditional digital camera (Canon 1Ds Mark II) in a stereo configuration. Their work is an extension to video of previous work (Selmanović et al., 2013) focused only on stereo HDR images. In this case, one HDR view needs to be reconstructed from two completely different sources.

Their method proposes three different approaches to generate the HDR video:

1. Stereo correspondence is computed to recover the disparity map between the HDR and LDR images. The disparity map allows the HDR values to be transferred to the LDR image. The SAD (Eq. 4.4) is used as the matching cost function. Both images are transformed to the Lab color space, which is perceptually more uniform than RGB.

SAD = \sum_{q,p \in N} \bigl| I_L(q) - I_R(p) \bigr|.    (4.4)

The selection of the best disparity value for each pixel is based on the winner-takes-all technique: the lowest SAD is selected in each case. An image warping step based on the work of Fehn (2004) is used to generate a new HDR image corresponding to the LDR view. The SAD stereo matcher can be implemented to run in real time, but the resulting disparity maps can be noisy and inaccurate. Overexposed and underexposed pixels may end up in the wrong position. In large areas of the same color, and hence the same SAD cost, the disparity will be constant. Occlusions and reflective or specular objects may also cause artifacts.

2. The expansion operator could be used to produce an HDR image from the LDR view. Detailed state-of-the-art reports on LDR expansion have been presented by Banterle et al. (2009) and Hirakawa and Simon (2011). However, in this case we need the expanded HDR image to remain coherent with the original LDR image. Inverse tone mappers are not suitable because the resulting HDR image may be very different from the acquired one, producing results that cannot be fused through normal binocular vision.
Selmanovic et al. (2014) propose an expansion operator based on a mapping between the HDR and LDR images, using the former as a reference. A reconstruction function (Eq. 4.5) maps LDR to HDR values based on an HDR histogram with 256 bins, where each bin holds the same number of HDR values as the corresponding bin of the LDR histogram.

RF(c) = \frac{1}{Card(\Omega_c)} \sum_{i=M(c)}^{M(c)+Card(\Omega_c)} c_{hdr}(i).    (4.5)

In Eq. (4.5), Ω_c = {j = 1..N : c_ldr(j) = c}, where c = 0..255 is the index of a bin Ω_c, Card(⋅) returns the number of elements in the bin, N is the number of pixels in the image, c_ldr(j) is the intensity value of pixel j, M(c) = Σ_{i=0}^{c−1} Card(Ω_i) is the number of pixels in the previous bins, and c_hdr contains the intensities of all HDR pixels sorted in ascending order. RF is used to compute a look-up table; afterward, expansion can be performed directly by assigning the corresponding HDR value to each LDR pixel. A minimal sketch of this mapping appears after this list.
The expansion runs in real time, is not view dependent, and avoids stereo matching. The main limitation is again on saturated regions.

3. The hybrid method combines the two previous approaches. Two HDR images are generated, one by stereo matching and one by the expansion operator. Pixels in well-exposed regions are expanded with the expansion operator, whereas matches for pixels in underexposed or overexposed regions are found by SAD stereo matching with an additional correction step. A mask of undersaturated and oversaturated regions is created by thresholding pixels over 250 or below 5. The areas outside the mask are filled in with the expansion operator, and the underexposed or overexposed regions are filled in with an adapted version of the SAD stereo matching to recover more accurate values in those regions.

Instead of having the same disparity over the whole underexposed or overexposed region, this variant interpolates disparities from well-exposed edges. Edges are detected by a fast morphological edge detection technique described by Lee et al. (1987). Nevertheless, some small artifacts may still be produced by the SAD stereo matching in such areas.
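Returning to the expansion operator of Eq. (4.5) (item 2 above), a rough numpy sketch of building such a 256-entry look-up table is shown below. It is one interpretation of the equation under the stated definitions, not the authors' code, and it assumes the HDR and LDR images contain the same number of pixels.

```python
import numpy as np

def build_expansion_lut(ldr, hdr):
    """Build a 256-entry LDR->HDR look-up table in the spirit of Eq. (4.5):
    sort the HDR values, then give each LDR bin the mean of as many sorted
    HDR values as that bin holds LDR pixels. `ldr` is uint8, `hdr` is float,
    and both images are assumed to have the same number of pixels."""
    ldr = ldr.ravel()
    hdr_sorted = np.sort(hdr.ravel())
    counts = np.bincount(ldr, minlength=256)                  # Card(Omega_c) per bin
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))   # M(c) per bin
    lut = np.zeros(256, dtype=np.float64)
    for c in range(256):
        if counts[c] > 0:
            lut[c] = hdr_sorted[offsets[c]:offsets[c] + counts[c]].mean()
    return lut

def expand(ldr, lut):
    """Apply the look-up table to expand an LDR image to HDR values."""
    return lut[ldr]
```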

Orozco et al. (2015) presented a method to generate multiscopic HDR images from LDR multiexposure images. They adapted a patch match approach to find matches between stereo images using epipolar geometry constraints. This reduces the search space in the matching process and alleviates the incoherence problem of patch match. Each image in the set of multiexposed images is used as a reference, and matches are sought in all the remaining images. These accurate matches allow the synthesis of images corresponding to each view, which are merged into one HDR image per view that can be used in autostereoscopic displays.

4.5 Conclusions

We have presented a selection of the most significant methods to recover HDR values in misaligned multiple-exposure sequences. These approaches are based on the type of misalignment affecting the sequence of LDR images.

Despite the amount of research focused on this topic, there is not yet a fully robust solution for the two main cases we have analyzed in this chapter: free-path camera and multiscopic different views. The problem of large saturated areas remains unsolved, and temporal coherence problems persist in both HDR and stereo HDR video. However, important advances have been made lately in multiscopic and HDR video, and such progress makes it likely that we will soon have a robust method that also handles large saturated areas.

References

Akhavan T., Kapeller C., Cho J.H., Gelautz M. Stereo HDR disparity map computation using structured light. In: HDRi2014 Second International Conference and SME Workshop on HDR Imaging. 2014.

Akhavan T., Yoo H., Gelautz M. A framework for HDR stereo matching using multi-exposed images. In: Proceedings of HDRi2013 First International Conference and SME Workshop on HDR Imaging, Paper no. 8. Oxford/Malden: The Eurographics Association and Blackwell Publishing Ltd. 2013.

Banterle F., Debattista K., Artusi A., Pattanaik S., Myszkowski K., Ledda P., Chalmers A. High dynamic range imaging and low dynamic range expansion for generating HDR content. Comput. Graph. Forum. 2009;28(8):2343–2367.

Barnes C., Shechtman E., Finkelstein A., Goldman D.B. Patchmatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graph. (Proc. SIGGRAPH). 2009;28(3).

Bätz M., Richter T., Garbas J.U., Papst A., Seiler J., Kaup A. High dynamic range video reconstruction from a stereo camera setup. Signal Process. Image Commun. 2014;29(2):191–202 Special Issue on Advances in High Dynamic Range Video Research.

Bonnard J., Loscos C., Valette G., Nourrit J.M., Lucas L. High-dynamic range video acquisition with a multiview camera. In: Proc. SPIE 8436, Optics, Photonics, and Digital Technologies for Multimedia Applications II, 84360A. 2012.

Bonnard J., Valette G., Nourrit J.M., Loscos C. Analysis of the consequences of data quality and calibration on 3D HDR image generation. In: European Signal Processing Conference (EUSIPCO), Lisbonne, Portugal. 2014.

Chalmers A., Debattista K. HDR video: capturing and displaying dynamic real-world lighting. In: 19th Color Imaging Conference: Color Science and Engineering Systems, Technologies, and Applications, CIC19. 2011:177–180.

Chalmers A., Bonnet G., Banterle F., Dubla P., Debattista K., Artusi A., Moir C. High-dynamic-range video solution. In: ACM SIGGRAPH ASIA 2009 Art Gallery & Emerging Technologies: Adaptation, SIGGRAPH ASIA ’09. New York, NY: ACM; 2009:71.

Debevec P., Malik J. Recovering high dynamic range radiance maps from photographs. In: Proceedings of ACM SIGGRAPH (Computer Graphics). 369–378. 1997;31.

Dufaux F., Pesquet-Popescu B., Cagnazzo M. Emerging Technologies for 3D Video: Creation, Coding, Transmission and Rendering. New York, NY: John Wiley & Sons; 2013.

Fehn C. Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In: Proc. SPIE. 93–104. 2004;5291.

Gallo O., Gelfand N., Chen W.C., Tico M., Pulli K. Artifact-free high dynamic range imaging. In: IEEE International Conference on Computational Photography (ICCP). 2009.

Grosch T. Fast and robust high dynamic range image generation with camera and object movement. In: Vision, Modeling and Visualization, RWTH Aachen. 2006:277–284.

Grossberg M.D., Nayar S.K. Determining the camera response from images: what is knowable? IEEE Trans. Pattern Anal. Mach. Intell. 2003;25(11):1455–1467.

Heo Y.S., Lee K.M., Lee S.U., Moon Y., Cha J. Ghost-free high dynamic range imaging. In: Proceedings of the 10th Asian Conference on Computer vision, Volume Part IV, ACCV’10. Berlin/ Heidelberg: Springer-Verlag; 2011:486–500.

Hirakawa K., Simon P. Single-shot high dynamic range imaging with conventional camera hardware. In: IEEE International Conference on Computer Vision (ICCV), 2011. 2011:1339–1346.

Hirschmuller H., Scharstein D. Evaluation of stereo matching costs on images with radiometric differences. IEEE Trans. Pattern Anal. Mach. Intell. 2009;31(9):1582–1599.

Jacobs K., Loscos C., Ward G. Automatic high-dynamic range generation for dynamic scenes. IEEE Comput. Graph. Appl. 2008;28:24–33.

Kalantari N.K., Shechtman E., Barnes C., Darabi S., Goldman D.B., Sen P. Patch-based high dynamic range video. ACM Trans. Graph. 2013;32(6):202:1–202:8.

Kang S.B., Szeliski R. Extracting view-dependent depth maps from a collection of images. Int. J. Comput. Vis. 2004;58(2):139–163.

Kang S.B., Uyttendaele M., Winder S., Szeliski R. High dynamic range video. ACM Trans. Graph. 2003;22(3):319–325.

Khan E.A., Akyüz A.O., Reinhard E. Ghost removal in high dynamic range images. In: IEEE International Conference on Image Processing. 2006:2005–2008.

Lee J., Haralick R., Shapiro L. Morphologic edge detection. IEEE J. Robot. Autom. 1987;3(2):142–156.

Lin H.Y., Chang W.Z. High dynamic range imaging for stereoscopic scene representation. In: 16th IEEE International Conference on Image Processing (ICIP), 2009. 2009:4305–4308.

Lucas L., Loscos C., Remion Y. 3D Video From Capture to Diffusion. New York, NY: Wiley-ISTE; 2013.

Mangiat S., Gibson J. High dynamic range video with ghost removal. Proc. SPIE. 2010;7798: 779812–779812-8.

Mann S., Picard R.W. On Being Undigital With Digital Cameras: Extending Dynamic Range by Combining Differently Exposed Pictures. Cambridge, MA: Perceptual Computing Section, Media Laboratory, Massachusetts Institute of Technology; 1995.

Mei X., Sun X., Zhou M., Jiao S., Wang H., Zhang X. On building an accurate stereo matching system on graphics hardware. In: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011. 2011:467–474.

Mendiburu B., Pupulin Y., Schklair S. 3D TV and 3D Cinema. Boston: Focal Press; 2012.

Middlebury. Middlebury stereo datasets. 2006. http://vision.middlebury.edu/stereo/data/.

Mitsunaga T., Nayar S. Radiometric self calibration. In: IEEE International Conference on Computer Vision and Pattern Recognition. 374–380. 1999;1.

Nayar S.K., Branzoi V. Adaptive dynamic range imaging: optical control of pixel exposures over space and time. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, Volume 2, ICCV ’03. Washington, DC: IEEE Computer Society; 2003:1168.

Orozco R.R., Martin I., Loscos C., Artusi A. Multiscopic HDR image sequence generation. In: WSCG, 23rd International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2015, Plzen, Czech Republic. 2015.

Orozco R.R., Martin I., Loscos C., Artusi A. Génération de séquences d'images multivues HDR: vers la vidéo HDR. In: 27es journées de l'Association française d'informatique graphique et du chapitre français d'Eurographics, Reims, France, November 2014. 2014.

Orozco R.R., Martin I., Loscos C., Artusi A. Patch-based registration for auto-stereoscopic HDR content creation. In: HDRi2013—First International Conference and SME Workshop on HDR Imaging, Oporto, Portugal, April 2013. 2013.

Orozco R.R., Martin I., Loscos C., Vasquez P.P. Full high-dynamic range images for dynamic scenes. In: Proc. SPIE 8436, Optics, Photonics, and Digital Technologies for Multimedia Applications II. 2012:843609.

Pece F., Kautz J. Bitmap movement detection: HDR for dynamic scenes. In: 2010 Conference on Visual Media Production (CVMP). 2010:1–8.

RedCompany. Red One. 2006. http://www.red.com.

Research V. Phantom HD. 2005. http://www.visionresearch.com.

Rufenacht, D., 2011. Stereoscopic High Dynamic Range Video. Ph.D. thesis, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.

Sand P., Teller S. Video matching. ACM Trans. Graph. 2004;23(3):592–599.

Scharstein D., Szeliski R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002;47(1):7–42.

Selmanović E., Debattista K., Bashford-Rogers T., Chalmers A. Generating stereoscopic HDR images using HDR-LDR image pairs. ACM Trans. Appl. Percept. 2013;10(1):3:1–3:18.

Selmanovic E., Debattista K., Bashford-Rogers T., Chalmers A. Enabling stereoscopic high dynamic range video. Signal Process. Image Commun. 2014;29(2):216–228 Special Issue on Advances in High Dynamic Range Video Research.

Sen P., Kalantari N.K., Yaesoubi M., Darabi S., Goldman D.B., Shechtman E. Robust patch-based HDR reconstruction of dynamic scenes. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2012). 2012;31(6):203:1–203:11.

Simakov D., Caspi Y., Shechtman E., Irani M. Summarizing visual data using bidirectional similarity. In: IEEE Conference on Computer Vision and Pattern Recognition 2008 (CVPR’08). 2008.

Skala P.V., ed. Image Registration for Multi-Exposure High Dynamic Range Image Acquisition. University of West Bohemia; 2007.

Sühring K. H.264/avc reference software. 2008. http://iphome.hhi.de/suehring/tml/.

Sun J., Zheng N.N., Shum H.Y. Stereo matching using belief propagation. IEEE Trans. Pattern Anal. Mach. Intell. 2003;25(7):787–800.

Sun N., Mansour H., Ward R. HDR image construction from multi-exposed stereo LDR images. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), Hong Kong. 2010.

ThomsomGrassValley. Viper filmstream. 2005. http://www.ThomsomGrassValley.com.

Tocci M.D., Kiser C., Tocci N., Sen P. A versatile HDR video production system. ACM Trans. Graph. 2011;30(4):41:1–41:10.

Troccoli A., Kang S.B., Seitz S. Multi-view multi-exposure stereo. In: Third International Symposium on 3D Data Processing, Visualization, and Transmission. 2006:861–868.

Urey H., Chellappan K.V., Erden E., Surman P. State of the art in stereoscopic and autostereoscopic displays. Proc. IEEE. 2011;99:540–555.

Ward G. Fast, robust image registration for compositing high dynamic range photographs from handheld exposures. J. Graph. Tools. 2003;8:17–30.
