Chapter 2. Shader Amortization using Pixel Quad Message Passing (2/4)


354 VI 3D Engine Design

// Gather four floats into one float4:
// (own value, horizontal neighbor, vertical neighbor, diagonal neighbor)
float4 QuadGather2x2(float value)
{
    float4 r = value;
    r.y  = r.x  + ddx(r.x)  * QuadVector.z; // Horizontal
    r.zw = r.xy + ddy(r.xy) * QuadVector.w; // Vertical/Diagonal
    return r;
}
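To see why two derivative instructions are enough to fetch all three neighbors, the gather can be emulated on the CPU. The sketch below is Python, not shader code. It assumes hybrid derivatives (ddx evaluated per row of the quad, ddy per column), and the function name quad_gather_2x2 and the list layout are illustrative choices that do not come from the chapter.

```python
# CPU emulation of a single 2x2 pixel quad (not shader code).
# Assumption: hybrid derivatives, so ddx is taken per row and ddy per column.

def quad_gather_2x2(values):
    """Emulate QuadGather2x2 for every pixel of one quad.

    values[y][x] is the scalar held by the pixel at column x, row y.
    Returns gathered[y][x] = (own, horizontal, vertical, diagonal).
    """
    # Negated quad signs, as stored in QuadVector.z and QuadVector.w:
    # +1 for the even column/row, -1 for the odd one.
    neg_sign = (1.0, -1.0)

    # r.y = r.x + ddx(r.x) * QuadVector.z
    horiz = [[values[y][x] + (values[y][1] - values[y][0]) * neg_sign[x]
              for x in range(2)] for y in range(2)]

    gathered = [[None, None], [None, None]]
    for y in range(2):
        for x in range(2):
            # r.zw = r.xy + ddy(r.xy) * QuadVector.w
            vert = values[y][x] + (values[1][x] - values[0][x]) * neg_sign[y]
            diag = horiz[y][x] + (horiz[1][x] - horiz[0][x]) * neg_sign[y]
            gathered[y][x] = (values[y][x], horiz[y][x], vert, diag)
    return gathered


quad = [[10.0, 20.0], [30.0, 40.0]]
for row in quad_gather_2x2(quad):
    print(row)
```

Each pixel ends up with the same four values, ordered relative to its own position in the quad. This ordering is why later steps have to flip weights per pixel.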

In both of these examples we used the variable QuadVector. Figure 2.2 illustrates the value of QuadVector for each pixel in a quad. Most of the optimizations we perform in this chapter rely on this vector and one other variable called QuadSelect. QuadVector is used to divide two-dimensional symmetric problems into four parts, while QuadSelect is used to choose between two values based on the current pixel’s quadrant.

The following code demonstrates one way to calculate QuadVector and QuadSelect from a pixel’s screen coordinates. The negated/flipped values are also useful and are stored in the z/w components.

void InitQuad(float2 screenCoord)
{
    // This assumes screenCoord contains an integer pixel coordinate
    ScreenCoord = screenCoord;
    QuadVector = frac(screenCoord.xy * 0.5).xyxy;
    QuadVector = QuadVector * float4(4, 4, -4, -4) + float4(-1, -1, 1, 1);
    QuadSelect = saturate(QuadVector);
}
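The arithmetic in InitQuad is easy to check by hand. The following Python sketch mirrors it on the CPU for integer pixel coordinates. The function name init_quad and the tuple return value are illustrative assumptions and not part of the chapter's code.

```python
import math

def init_quad(px, py):
    """Return (quad_vector, quad_select) for integer pixel coordinates."""
    # frac(coord * 0.5) is 0.0 for even coordinates and 0.5 for odd ones.
    fx = px * 0.5 - math.floor(px * 0.5)
    fy = py * 0.5 - math.floor(py * 0.5)

    # QuadVector.xy is the sign of the pixel within its quad (-1 or +1);
    # QuadVector.zw holds the negated signs.
    quad_vector = (fx * 4.0 - 1.0, fy * 4.0 - 1.0,
                   fx * -4.0 + 1.0, fy * -4.0 + 1.0)

    # saturate() clamps to [0, 1], turning the signs into 0/1 selectors.
    quad_select = tuple(min(max(c, 0.0), 1.0) for c in quad_vector)
    return quad_vector, quad_select


for y in range(2):
    for x in range(2):
        print((x, y), init_quad(x, y))
```

Because only the parity of each coordinate matters, every quad on screen receives the same four QuadVector and QuadSelect values.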

While it takes a few instructions to initialize communication within a quad, doing so lets us amortize the cost of several expensive shading operations. First, however, we will identify a few drawbacks and limitations of PQA.

2.6 Limitations of PQA

There are a number of limitations to pixel quad amortization that become immediately apparent. First and foremost, pixel quad message passing works only on hardware that uses hybrid forward and backward derivatives as illustrated in Figure 2.2. When half-resolution derivatives are used, the derivative instructions never touch the bottom-right pixel in the quad. There is no way to communicate that pixel’s value to the other pixels in the quad in that case, thus hybrid derivative support needs to be detected based on the graphics card. Appendix A provides a list of hardware that supports hybrid derivatives at the time this article was written. It is also possible that a hardware vendor could change the way the derivative instructions work, breaking this functionality. Although this seems very unlikely, it is easy enough to write a detection routine to test which type of derivative is used.

[Figure 2.2: two 2 × 2 quads labeled with per-pixel values. QuadVector: (-1, 1), (1, 1) in the top row and (-1, -1), (1, -1) in the bottom row. QuadSelect: (0, 1), (1, 1) in the top row and (0, 0), (1, 0) in the bottom row.]

Figure 2.2. To initialize PQA we calculate two simple values for each pixel. QuadVector contains the x/y sign of the pixel within its quad and is used to perform symmetric operations, while QuadSelect is used to choose between values based on the pixel’s location in the quad.
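The text does not give the detection routine itself. As a way to reason about one, the Python sketch below emulates both derivative models on a single quad. The models are assumptions: "hybrid" takes ddx per row and ddy per column, while "half" takes one ddx from the top row and one ddy from the left column, so it never reads the bottom-right pixel. All function names are hypothetical. A real test would run the same gather in a pixel shader and read the result back.

```python
# CPU emulation of two derivative models on one 2x2 quad (not shader code).
# Assumptions: "hybrid" takes ddx per row and ddy per column; "half" takes a
# single ddx from the top row and a single ddy from the left column.

def ddx(field, y, mode):
    row = y if mode == "hybrid" else 0
    return field[row][1] - field[row][0]

def ddy(field, x, mode):
    col = x if mode == "hybrid" else 0
    return field[1][col] - field[0][col]

def gather_diagonal(values, x, y, mode):
    """Value that pixel (x, y) reconstructs for its diagonal neighbor."""
    neg_sign = (1.0, -1.0)  # QuadVector.z / .w for even and odd positions
    horiz = [[values[r][c] + ddx(values, r, mode) * neg_sign[c]
              for c in range(2)] for r in range(2)]
    return horiz[y][x] + ddy(horiz, x, mode) * neg_sign[y]

def uses_hybrid_derivatives(mode):
    # Mark only the bottom-right pixel. A planar pattern would not work,
    # because linear extrapolation reproduces it under either model.
    marker = [[0.0, 0.0], [0.0, 1.0]]
    return gather_diagonal(marker, 0, 0, mode) == 1.0


print(uses_hybrid_derivatives("hybrid"), uses_hybrid_derivatives("half"))
```

The choice of test pattern matters. A quad whose values vary linearly is reconstructed correctly under both models, so the marker has to be non-planar.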

The second problem that becomes immediately apparent is that there is no interpolation between quads as there would be from a pre-rendered half-resolution buffer. Thus, if we output the same value for an entire quad, it will resemble unfiltered point sampling from a half-resolution frame buffer. This may be acceptable in certain situations, but if we want higher quality results, we still need to compute unique values for each pixel. Our ability to produce pleasing results really depends on the specific problem.

The third problem is that quad-level calculations work effectively only in the current triangle’s domain. For example, we can use pixel quad amortization to accelerate PCF shadow-map sampling in forward rendering, but not nearly as easily in deferred rendering. This is because in the deferred case the quads being rendered are not in object space; thus, a pixel quad may straddle a depth discontinuity, creating a large gap in shadow space. In forward rendering, the entire quad will project into a contiguous location in shadow space, which is what we rely on to amortize costs effectively.

Although there are a number of drawbacks to PQA, we found we could solve these issues for several common graphics problems and still achieve large performance gains. In the following sections we will discuss how to optimize PCF, bilateral upsampling, and basic convolution and blurring with PQA.


2.7 Cross Bilateral Sampling

The cross bilateral filter has been popularized as a means to provide geometry-aware upsampling. If a screen-space buffer is blurred or upsampled using a simple bilinear filter, the features in the low-resolution buffer will bleed across depth boundaries, creating artifacts. The basic idea behind the bilateral filter is to modify the reconstruction kernel to avoid integrating across depth or normal boundaries in the scene. This is achieved by storing a depth and/or normal for each low-resolution sample and assigning filter weight according to not only the distance in screen space to each sample, but also the distance in depth and/or normal space. Bilateral filters usually use Gaussian weighting functions in both depth and screen space; however, [Yang et al. 08] proposed using a simple tent function in screen space, mimicking the effect of a bilinear upsample and therefore requiring only four depth/image samples. No matter what type of weighting function is used, the filter weight is accumulated such that the sample can be normalized by the total accumulated weight:

$$
\frac{\sum_j c_j \, f(\hat{x}, x_j) \, g(|z^{H} - z^{L}_j|)}{\sum_j f(\hat{x}, x_j) \, g(|z^{H} - z^{L}_j|)}
$$

In this example f() is the normal linear filtering weight while g() is a Gaussian falloff based on the difference in depth between the high-resolution and low-resolution depths. One disadvantage of bilateral upsampling is its cost compared with simple bilinear filtering. While a bilinear upsample requires only one hardware filtered sample, a bilateral upsample will require at minimum four point samples and four depth samples. This cost is incurred at the high resolution, thus it often partially defeats the purpose of performing calculations at a lower resolution in the first place. Obviously, if the calculation costs less than eight samples, it will be less expensive to just compute the value at the high resolution.
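As a worked instance of the normalized weighting described above, the Python sketch below combines a tent weight f with a Gaussian depth falloff g over four low-resolution samples. The function name, the sample ordering, and the sigma parameter are assumptions made for illustration. The chapter does not specify a falloff width or a fallback for the case where every sample is rejected.

```python
import math

def bilateral_upsample(samples, hi_depth, fx, fy, sigma):
    """samples: four (value, depth) pairs ordered top-left, top-right,
    bottom-left, bottom-right. fx, fy in [0, 1] locate the high-resolution
    pixel between them. sigma is a hypothetical depth falloff parameter."""
    # Tent weights f: identical to bilinear interpolation weights.
    tent = [(1 - fx) * (1 - fy), fx * (1 - fy), (1 - fx) * fy, fx * fy]
    total_value = 0.0
    total_weight = 0.0
    for (value, depth), f in zip(samples, tent):
        dz = abs(hi_depth - depth)
        g = math.exp(-(dz * dz) / (2.0 * sigma * sigma))  # Gaussian falloff
        total_value += value * f * g
        total_weight += f * g
    if total_weight == 0.0:
        # Every sample was rejected; fall back to plain bilinear weights.
        return sum(value * f for (value, _), f in zip(samples, tent))
    return total_value / total_weight


# Two samples lie on the near surface, two across a depth discontinuity.
samples = [(1.0, 0.5), (1.0, 0.5), (0.0, 0.9), (0.0, 0.9)]
print(bilateral_upsample(samples, hi_depth=0.5, fx=0.5, fy=0.5, sigma=0.01))
```

When all four depths match the high-resolution depth, g is 1 for every sample and the result reduces to an ordinary bilinear upsample.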

The bilateral filter is one example where PQA works without any of the drawbacks mentioned in the previous section. Since bilateral upsampling occurs in screen space, we can set up our low-resolution buffer such that all the pixels in the same high-resolution quad will share the same low-resolution samples. All that is needed then is to share the samples across the quad and let each pixel perform the bilateral filter independently. Here is an example for a 2X upsample of a low-resolution AO texture. To optimize this further to only one sample, the depth can be packed into extra channels of the AO texture.

// Gather quad horizontal/vertical/diagonal samples
float2 AO_D, AO_D_H, AO_D_V, AO_D_D;
AO_D.x = tex2D(lowResDepthSampler, coord).x;
AO_D.y = tex2D(lowResAOSampler, coord).x;
QuadGather2x2(AO_D, AO_D_H, AO_D_V, AO_D_D);

[Figure 2.3 panels: 2X Bilateral Upsample; 4X Bilateral Upsample. Legend: Texel, Quad.]

Figure 2.3. Bilateral upsampling from a half-resolution or quarter-resolution buffer. All quad pixels utilize the same four low-resolution samples. We can therefore perform a bilateral upsample with only one or two texture fetches and two derivative instructions, instead of eight texture fetches.

The bilateral upsample can then be performed as usual for each pixel, with the caveat that tent weights will need to flip to compensate for the samples being flipped in each pixel. A similar approach can be taken for a 4X upsample, or for bilateral blurring operations at any resolution. One extra thing to note is that the low-resolution buffer is shifted half a pixel (see Figure 2.3).

2.8 Convolution and Blurring

Convolution and blurring operations can also be accelerated using PQA. Although we are performing calculations at the pixel quad level, we would not want our result to be output at half-resolution or we might as well simply output a truly half-resolution texture! Thankfully, because we can share results at any point in the shader, we can customize the message delivered to other pixels in the quad in order to perform unique blurs for each pixel. The following code illustrates a 3 × 3 blur with four samples, while Figure 2.4 illustrates this process for a 5 × 5 blur using nine samples:

// Populate messages for neighbors
// m.r: for this pixel, m.g: for the horizontal neighbor,
// m.b: for the vertical neighbor, m.a: for the diagonal neighbor
float4 m = 0;
m.rgba += tex2D(imageSampler, coord).x;
m.rb   += tex2D(imageSampler, coord + QuadVector.xy *
                float2(TEXEL_SIZE.x, 0)).x;   // Horizontal sample
m.rg   += tex2D(imageSampler, coord + QuadVector.xy *
                float2(0, TEXEL_SIZE.y)).x;   // Vertical sample
m.r    += tex2D(imageSampler, coord + QuadVector.xy *
                float2(TEXEL_SIZE.xy)).x;     // Diagonal sample

Figure 2.4. Illustration of a 5 × 5 blur using PQA. The blur kernel footprint of four pixels in a quad (left). Samples taken by each pixel in the quad (middle). Uniquely weighted messages from the red pixel to other pixels in the quad (right).

// Gather messages
float4 h, v, d;
QuadGather2x2(m, h, v, d);

// Weight results for 3x3 blur.
// The messages are sums of 4, 2, 2, and 1 samples, so equal weights
// of 1/9 give a normalized box blur.
float4 result = dot(float4(1, 1, 1, 1) / 9.0,
                    float4(m.x, h.g, v.b, d.w));
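The message scheme can be checked on the CPU against a brute-force blur. The Python sketch below is an emulation, not shader code. It assumes the four samples per pixel are the pixel itself plus its outward horizontal, vertical, and diagonal texels, and it replaces the derivative-based gather with a direct neighbor lookup. Because each message is a plain sum of samples, the sketch divides the combined total by nine. The function names and image layout are illustrative.

```python
def pqa_box_blur_3x3(img, qx, qy):
    """Emulate the message-passing blur for the quad whose top-left pixel is
    (qx, qy), with qx and qy even. Returns {(x, y): blurred value}."""
    messages = {}
    for y in (qy, qy + 1):
        for x in (qx, qx + 1):
            sx = -1 if x % 2 == 0 else 1   # QuadVector.x
            sy = -1 if y % 2 == 0 else 1   # QuadVector.y
            c = img[y][x]
            h = img[y][x + sx]             # outward horizontal sample
            v = img[y + sy][x]             # outward vertical sample
            d = img[y + sy][x + sx]        # outward diagonal sample
            # (for self, for horizontal, for vertical, for diagonal neighbor)
            messages[(x, y)] = (c + h + v + d, c + v, c + h, c)

    result = {}
    for (x, y), m in messages.items():
        hn = messages[(x ^ 1, y)]          # what QuadGather2x2 would deliver
        vn = messages[(x, y ^ 1)]
        dn = messages[(x ^ 1, y ^ 1)]
        # The messages are sums of 4, 2, 2, and 1 samples: nine in total.
        result[(x, y)] = (m[0] + hn[1] + vn[2] + dn[3]) / 9.0
    return result


def box_blur_3x3(img, x, y):
    """Reference: brute-force 3x3 average around (x, y)."""
    return sum(img[y + j][x + i] for j in (-1, 0, 1) for i in (-1, 0, 1)) / 9.0


image = [[float((7 * x + 13 * y + 3 * x * y) % 11) for x in range(6)]
         for y in range(6)]
for pixel, value in pqa_box_blur_3x3(image, 2, 2).items():
    print(pixel, value, box_blur_3x3(image, pixel[0], pixel[1]))
```

Each pixel fetches four texels but receives nine, since the three neighbors' messages supply the remaining five texels of its footprint.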

Unfortunately, though we can gather more samples, it becomes cumbersome to apply unique weights for more complicated filters, especially when bilinear filtering is also applied to increase the kernel width. In our example it would also take several QuadGather operations for a multiple channel texture. While this can be optimized significantly by separating vertical and horizontal messages, we recommend this approach primarily for performing nonseparable and/or nonlinear blurring operations on one or two channel data. In the case that only approximate results are required, we discuss a gradient approximation to support bilinear filtering in Section 2.9.

On Direct3D 11 hardware, PQA should not be used for simple image blurring. In this case DirectCompute or OpenCL can achieve much better performance by applying the same idea in a compute shader. For example, one could output in quad-sized groups of pixels, or even output an entire row of quads in one shader. For this reason PQA should be used only during geometry rasterization on hardware that supports compute shaders. PQA will remain a valid technique in these cases since rasterization is only a semi-parallelizable task.
