354 VI 3D Engine Design
// Gather four floats into one float4
float4 QuadGather2x2(float value)
{
    float4 r = value;
    r.y  = r.x  + ddx(r.x)  * QuadVector.z; // Horizontal
    r.zw = r.xy + ddy(r.xy) * QuadVector.w; // Vertical/Diagonal
    return r;
}
In both of these examples we used the variable QuadVector. Figure 2.2
illustrates the value of QuadVector for each pixel in a quad. Most of the
optimizations we perform in this chapter rely on this vector and one other
variable called QuadSelect. QuadVector is used to divide two-dimensional
symmetric problems into four parts, while QuadSelect is used to choose between
two values based on the current pixel’s quadrant.
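As a minimal sketch of how QuadSelect can choose between two values, assuming each of its components is either 0 or 1 as shown in Figure 2.2: the helper names SelectHorizontal and SelectVertical are hypothetical and do not appear in the original listings.

```hlsl
// Assumed to be initialized once per pixel (see InitQuad);
// each component holds either 0 or 1.
static float4 QuadSelect;

// Hypothetical helper: returns 'a' in the pixels where QuadSelect.x is 0
// and 'b' in the pixels where QuadSelect.x is 1.
float4 SelectHorizontal(float4 a, float4 b)
{
    return lerp(a, b, QuadSelect.x);
}

// Hypothetical helper: the same choice, made on the vertical axis.
float4 SelectVertical(float4 a, float4 b)
{
    return lerp(a, b, QuadSelect.y);
}
```

Because the selector is exactly 0 or 1, lerp acts as a branch-free select here rather than a blend.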
The following code demonstrates one way to calculate QuadVector and
QuadSelect from a pixel’s screen coordinates. The negated/flipped values are
also useful and are stored in z/w components.
void InitQuad(float2 screenCoord)
{
    // This assumes screenCoord contains an integer pixel coordinate
    ScreenCoord = screenCoord;
    QuadVector  = frac(screenCoord.xy * 0.5).xyxy;
    QuadVector  = QuadVector * float4(4, 4, -4, -4) + float4(-1, -1, 1, 1);
    QuadSelect  = saturate(QuadVector);
}
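To make the arithmetic concrete, the following works through the initialization for a single integer pixel coordinate. It assumes the scale is (4, 4, -4, -4) and the bias is (-1, -1, 1, 1), which reproduces the values in Figure 2.2. The symbols $p$ (one coordinate), $q$ (QuadVector), and $s$ (QuadSelect) are introduced here for illustration only.

```latex
\[
\operatorname{frac}(0.5\,p) =
\begin{cases}
0   & \text{if } p \text{ is even} \\
0.5 & \text{if } p \text{ is odd}
\end{cases}
\]
\[
q_{xy} = 4\,\operatorname{frac}(0.5\,p) - 1 =
\begin{cases} -1 & p \text{ even} \\ +1 & p \text{ odd} \end{cases}
\qquad
q_{zw} = -4\,\operatorname{frac}(0.5\,p) + 1 =
\begin{cases} +1 & p \text{ even} \\ -1 & p \text{ odd} \end{cases}
\]
\[
s_{xy} = \operatorname{saturate}(q_{xy}) =
\begin{cases} 0 & p \text{ even} \\ 1 & p \text{ odd} \end{cases}
\qquad
s_{zw} = \operatorname{saturate}(q_{zw}) =
\begin{cases} 1 & p \text{ even} \\ 0 & p \text{ odd} \end{cases}
\]
```

The x/y components therefore carry the pixel's sign within its quad, and the z/w components carry the negated sign.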
While it takes a few instructions to initialize communication within a quad,
doing so allows us to amortize several costly shading operations. First,
however, we will identify a few drawbacks and limitations of PQA.
2. Shader Amortization using Pixel Quad Message Passing 355
[Figure 2.2: per-pixel values within a 2 x 2 quad.
QuadVector:  (-1, 1)  (1, 1)
             (-1,-1)  (1,-1)
QuadSelect:  (0,1)  (1,1)
             (0,0)  (1,0)]
Figure 2.2. To initialize PQA we calculate two simple values for each pixel.
QuadVector contains the x/y sign of the pixel within its quad and is used to
perform symmetric operations, while QuadSelect is used to choose between
values based on the pixel’s location in the quad.
2.6 Limitations of PQA
There are a number of limitations to pixel quad amortization that become
immediately apparent. First and foremost, pixel quad message passing works only
on hardware that uses hybrid forward and backward derivatives as illustrated in
Figure 2.2. When half-resolution derivatives are used, the derivative
instructions never touch the bottom-right pixel in the quad. There is no way to
communicate that pixel’s value to the other pixels in the quad in that case, so
hybrid derivative support needs to be detected for the graphics card in use.
Appendix A provides a list of hardware that supports hybrid derivatives at the
time this article was written. It is also possible that a hardware vendor could
change the way the derivative instructions work, breaking this functionality.
Although this seems very unlikely, it is easy enough to write a detection
routine to test which type of derivative is used.
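One possible shape for such a detection routine is sketched below. It is not the authors' routine: the entry-point name is hypothetical, and the sketch assumes a Shader Model 3.0 full-screen pass in which the VPOS semantic supplies integer pixel coordinates and quads start on even x/y. The idea is to mark only the pixel that half-resolution derivatives never read and then check whether any derivative picks it up.

```hlsl
// Hypothetical detection pass: writes 1 where a derivative instruction
// observed the marker pixel, 0 otherwise.
float4 DetectHybridDerivativesPS(float2 vpos : VPOS) : COLOR0
{
    // 0 for even coordinates, 1 for odd coordinates
    float2 oddPixel = frac(vpos * 0.5) * 2.0;

    // 1 only at the pixel with odd x and odd y (the bottom-right pixel of
    // the quad when y grows downward), 0 at the other three pixels
    float marker = oddPixel.x * oddPixel.y;

    // Half-resolution derivatives never read the marker pixel, so both
    // derivatives are 0 everywhere. Hybrid derivatives see it through ddx
    // in the bottom row and through ddy in the right column.
    float seen = max(abs(ddx(marker)), abs(ddy(marker)));

    return float4(seen, seen, seen, 1.0);
}
```

The application would render this into a small target and read it back: any non-zero texel suggests hybrid derivatives, while an all-zero result suggests half-resolution derivatives. The readback itself is left to the application and is not shown.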
The second problem that becomes immediately apparent is that there is no
interpolation between quads as there would be from a pre-rendered
half-resolution buffer. Thus, if we output the same value for an entire quad,
it will resemble unfiltered point sampling from a half-resolution frame buffer.
This may be acceptable in certain situations, but if we want higher quality
results, we still need to compute unique values for each pixel. Our ability to
produce pleasing results depends on the specific problem.
The third problem is that quad-level calculations work effectively only in the
current triangle’s domain. For example, we can use pixel quad amortization to
accelerate PCF shadow-map sampling in forward rendering, but not nearly as
easily in deferred rendering. This is because in the deferred case the quads
being rendered are not in object space; thus, a pixel quad may straddle a depth
discontinuity, creating a large gap in shadow space. In forward rendering, the
entire quad will project into a contiguous location in shadow space, which is
what we rely on to amortize costs effectively.
Although there are a number of drawbacks to PQA, we found we could solve these
issues for several common graphics problems and still achieve large performance
gains. In the following sections we will discuss how to optimize PCF, bilateral
upsampling, and basic convolution and blurring with PQA.
2.7 Cross Bilateral Sampling
The cross bilateral filter has been popularized as a means to provide
geometry-aware upsampling. If a screen-space buffer is blurred or upsampled
using a simple bilinear filter, the features in the low-resolution buffer will
bleed across depth boundaries, creating artifacts. The basic idea behind the
bilateral filter is to modify the reconstruction kernel to avoid integrating
across depth or normal boundaries in the scene. This is achieved by storing a
depth and/or normal for each low-resolution sample and assigning filter weight
according to not only the distance in screen space to each sample, but also the
distance in depth and/or normal space. Bilateral filters usually use Gaussian
weighting functions in both depth and screen space; however, [Yang et al. 08]
proposed using a simple tent function in screen space, mimicking the effect of
a bilinear upsample and therefore requiring only four depth/image samples. No
matter what type of weighting function is used, the filter weight is
accumulated such that the sample can be normalized by the total accumulated
weight:
\[
c_i^H = \frac{\sum_j c_j^L \, f(\hat{x}_i, x_j) \, g(|z_i^H - z_j^L|)}
             {\sum_j f(\hat{x}_i, x_j) \, g(|z_i^H - z_j^L|)}
\]
In this example f() is the normal linear filtering weight while g() is a
Gaussian falloff based on the difference between the high-resolution and
low-resolution depths. One disadvantage of bilateral upsampling is its cost
compared with simple bilinear filtering. While a bilinear upsample requires
only one hardware filtered sample, a bilateral upsample will require at minimum
four point samples and four depth samples. This cost is incurred at the high
resolution, thus it often partially defeats the purpose of performing
calculations at a lower resolution in the first place. Obviously, if the
calculation costs less than eight texture samples, it will be less expensive to
just compute the value at the high resolution.
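The normalized sum above can be sketched in HLSL for the four-sample case as follows. This is an illustrative sketch rather than the chapter's implementation: the function name, parameter layout, the DEPTH_SHARPNESS constant, and the small epsilon guarding the division are all assumptions.

```hlsl
// Hypothetical constant controlling how quickly g() falls off with the
// depth difference; it would need tuning for a scene's depth range.
static const float DEPTH_SHARPNESS = 16.0;

// lowValues : the four low-resolution samples c_j^L
// lowDepths : the four low-resolution depths z_j^L
// tent      : the four screen-space (bilinear/tent) weights f()
// hiDepth   : the high-resolution depth z_i^H of the current pixel
float BilateralUpsample(float4 lowValues, float4 lowDepths,
                        float4 tent, float hiDepth)
{
    // g(): Gaussian falloff on the depth difference, for all four samples
    float4 d = (hiDepth - lowDepths) * DEPTH_SHARPNESS;
    float4 w = tent * exp(-d * d);

    // Normalize by the total accumulated weight; the epsilon avoids a
    // division by zero when every sample is rejected.
    float total = dot(w, float4(1.0, 1.0, 1.0, 1.0));
    return dot(w, lowValues) / max(total, 1e-5);
}
```

When all four depths match the high-resolution depth, g() is 1 for every sample and the result reduces to an ordinary bilinear blend, provided the tent weights sum to 1.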
The bilateral filter is one example where PQA works without any of the
drawbacks mentioned in the previous section. Since bilateral upsampling occurs
in screen space, we can set up our low-resolution buffer such that all the
pixels in the same high-resolution quad will share the same low-resolution
samples. All that is needed then is to share the samples across the quad and
let each pixel perform the bilateral filter independently. Here is an example
for a 2X upsample of a low-resolution AO texture. To reduce this further to a
single texture fetch, the depth can be packed into extra channels of the AO
texture.
// Gather quad horizontal/vertical/diagonal samples
float2 AO_D, AO_D_H, AO_D_V, AO_D_D;
AO_D.x = tex2D(lowResDepthSampler, coord).x;
AO_D.y = tex2D(lowResAOSampler, coord).x;
QuadGather2x2(AO_D, AO_D_H, AO_D_V, AO_D_D);
[Figure 2.3: two panels, "2X Bilateral Upsample" and "4X Bilateral Upsample";
legend: Texel, Quad.]
Figure 2.3. Bilateral upsampling from a half-resolution or quarter-resolution
buffer. All quad pixels utilize the same four low-resolution samples. We can
therefore perform a bilateral upsample with only one or two texture fetches and
two derivative instructions, instead of eight texture fetches.
The bilateral upsample can then be performed as usual for each pixel, with
the caveat that tent weights will need to flip to compensate for the samples being
flipped in each pixel. A similar approach can be taken for a 4X upsample, or for
bilateral blurring operations at any resolution. One extra thing to note is that
the low-resolution buffer is shifted half a pixel (see Figure 2.3).
2.8 Convolution and Blurring
Convolution and blurring operations can also be accelerated using PQA. Although
we are performing calculations at the pixel quad level, we would not want our
result to be output at half-resolution or we might as well simply output a truly
half-resolution texture! Thankfully, because we can share results at any point in
the shader, we can customize the message delivered to other pixels in the quad
in order to perform unique blurs for each pixel. The following code illustrates a
3 × 3 blur with four samples, while Figure 2.4 illustrates this process for a 5 × 5
blur using nine samples:
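// Populate messages for neighbors
// (r = this pixel, g = horizontal, b = vertical, a = diagonal neighbor)
float4 m = 0;
// Center tap: needed by all four pixels
m.rgba += tex2D(imageSampler, coord).x;
// Horizontal tap: needed by this pixel and its vertical neighbor
m.rb   += tex2D(imageSampler, coord + QuadVector.xy *
                float2(TEXEL_SIZE.x, 0)).x;
// Vertical tap: needed by this pixel and its horizontal neighbor
m.rg   += tex2D(imageSampler, coord + QuadVector.xy *
                float2(0, TEXEL_SIZE.y)).x;
// Diagonal tap: needed by this pixel only
m.r    += tex2D(imageSampler, coord + QuadVector.xy *
                TEXEL_SIZE.xy).x;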
Figure 2.4. Illustration of a 5 × 5 blur using PQA. The blur kernel footprint of four
pixels in a quad (left). Samples taken by each pixel in the quad (middle). Uniquely
weighted messages from the red pixel to other pixels in the quad (right).
// Gather messages
float4 h, v, d;
QuadGather2x2(m, h, v, d);
// Weight results for 3x3 blur
float4 result = dot(float4(4, 2, 2, 1) / 9.0,
                    float4(m.x, h.g, v.b, d.w));
Unfortunately, though we can gather more samples, it becomes cumbersome to
apply unique weights for more complicated filters, especially when bilinear
filtering is also applied to increase the kernel width. In our example it would
also take several QuadGather operations for a multichannel texture. While this
can be optimized significantly by separating vertical and horizontal messages,
we recommend this approach primarily for performing nonseparable and/or
nonlinear blurring operations on one- or two-channel data. In the case that
only approximate results are required, we discuss a gradient approximation to
support bilinear filtering in Section 2.9.
On Direct3D 11 hardware, PQA should not be used for simple image blurring. In
this case DirectCompute or OpenCL can achieve much better performance by
applying the same idea in a compute shader. For example, one could output in
quad-sized groups of pixels, or even output an entire row of quads in one
shader. For this reason, on hardware that supports compute shaders, PQA should
be used only during geometry rasterization. PQA will remain a valid technique
in these cases since rasterization is only a semi-parallelizable task.
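As a rough sketch of the compute-shader alternative, the following Direct3D 11 kernel lets each thread output one quad-sized group of pixels, loading the shared 4 x 4 neighborhood once and reusing it for all four outputs. The resource names, register slots, thread-group size, uniform box weights, and clamp-to-edge addressing are assumptions made for illustration; they are not taken from the chapter.

```hlsl
Texture2D<float>   InputImage  : register(t0); // hypothetical source image
RWTexture2D<float> OutputImage : register(u0); // hypothetical destination

[numthreads(8, 8, 1)]
void BlurQuadCS(uint3 id : SV_DispatchThreadID)
{
    uint width, height;
    InputImage.GetDimensions(width, height);
    int2 maxCoord = int2(width, height) - 1;

    // Top-left pixel of the 2x2 quad written by this thread
    int2 base = int2(id.xy) * 2;

    // Load the 4x4 neighborhood shared by the quad's four 3x3 kernels
    float t[4][4];
    [unroll] for (int y = 0; y < 4; ++y)
    {
        [unroll] for (int x = 0; x < 4; ++x)
        {
            int2 p = clamp(base + int2(x - 1, y - 1), int2(0, 0), maxCoord);
            t[y][x] = InputImage.Load(int3(p, 0));
        }
    }

    // Each output pixel sums its own 3x3 window of the shared samples
    [unroll] for (int qy = 0; qy < 2; ++qy)
    {
        [unroll] for (int qx = 0; qx < 2; ++qx)
        {
            float sum = 0;
            [unroll] for (int ky = 0; ky < 3; ++ky)
                [unroll] for (int kx = 0; kx < 3; ++kx)
                    sum += t[qy + ky][qx + kx];

            int2 p = base + int2(qx, qy);
            if (p.x <= maxCoord.x && p.y <= maxCoord.y)
                OutputImage[uint2(p)] = sum / 9.0;
        }
    }
}
```

With this layout, 16 loads serve four output pixels instead of the 36 that four independent 3 x 3 kernels would need. A dispatch would cover the image with one thread per quad, that is, roughly (width / 16) by (height / 16) groups, rounded up.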