Chapter 22. Additional Topics in Modern Rendering

This is the final chapter in the book, and it’s intended to provide a look at what’s next for a budding graphics developer. As such, the material is presented a little differently. Instead of taking a deep dive on a particular technique, this chapter touches on several topics in modern rendering. In particular, you examine options to improve the performance of your rendering applications and discover ways to improve the quality of your scenes through deferred shading and global illumination. You also learn about compute shaders and data-driven engine architecture.

Rendering Optimization

You can optimize rendering speed using many approaches, but they can be roughly categorized into two groups:

1. Optimizations on the CPU (before data is sent to the GPU)

2. Optimizations on the GPU (shader optimization)

The overarching goal is to reduce the time it takes from when your application begins rendering to when your scene appears on the screen. Typically, the more objects are processed by the rendering pipeline, the longer the time to render. Thus, a common task in rendering optimization is to reduce the number of objects processed by the rendering pipeline. This means culling objects that are not visible to the camera and sorting objects so that they can be optimally rendered by the GPU. Furthermore, it’s important to reduce the overhead of the rendering API and minimize the shader instructions that are executed within each stage of the pipeline.

Myriad approaches exist for each of these topics. The next few sections present just a few optimization options that are commonly used in modern rendering.

View Frustum Culling

Ideally, objects outside the camera’s view frustum should not be passed to the GPU for rendering. These objects have no chance of being written to the render target; they clog up the graphics bus and make their way at least through the early pipeline stages before being discarded. The process of view frustum culling tests the objects in the scene for collision with the view frustum. If an object collides with the frustum, it’s sent to the GPU; otherwise, it’s culled.

A simple approach to testing for frustum collisions is to walk the complete list of objects in the scene. But this quickly becomes unwieldy for scenes with many objects. So the real task becomes pruning the list of objects that will be tested for frustum collision. Bounding volume hierarchies (BVHs) describe a family of data structures commonly used for pruning. A BVH is constructed as a preprocessing step and organizes the list of objects into a hierarchy of bounding volumes. For example, a binary tree could be constructed as a hierarchy of bounding spheres. Each node within the hierarchy contains the smallest bounding sphere that is large enough to contain all its objects. A leaf node is represented by a single object (and its bounding sphere). A nonleaf node contains two child nodes, each consisting of half the objects of its parent node. Pruning the list of objects to render is a matter of traversing the tree and testing the bounding sphere of each node for intersection with the view frustum. If a node’s bounding sphere does not intersect with the view frustum, neither will any of its children, and that entire branch can be rejected from further processing. Leaf node objects whose bounding spheres intersect with the frustum are then sent to the GPU.
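The traversal itself is only a few lines of code. The following sketch assumes a hypothetical SphereTreeNode and Renderable type of your own design, and uses the DirectXMath collision types (BoundingSphere and BoundingFrustum from DirectXCollision.h) for the intersection tests:

#include <DirectXCollision.h>
#include <vector>

struct Renderable;    // hypothetical renderable object in your scene

// Hypothetical sphere-tree node: each node stores the smallest sphere enclosing
// everything beneath it; leaves reference a single renderable object.
struct SphereTreeNode
{
    DirectX::BoundingSphere BoundingVolume;
    SphereTreeNode* Left = nullptr;     // nullptr for leaf nodes
    SphereTreeNode* Right = nullptr;
    Renderable* Object = nullptr;       // non-null only for leaf nodes
};

// Recursively collect the leaf objects whose bounding spheres intersect the frustum.
// If a node's sphere misses the frustum, its entire subtree is rejected.
void CullSphereTree(const SphereTreeNode* node, const DirectX::BoundingFrustum& frustum,
    std::vector<Renderable*>& visibleObjects)
{
    if (node == nullptr || !frustum.Intersects(node->BoundingVolume))
    {
        return;
    }

    if (node->Object != nullptr)    // leaf node
    {
        visibleObjects.push_back(node->Object);
    }
    else
    {
        CullSphereTree(node->Left, frustum, visibleObjects);
        CullSphereTree(node->Right, frustum, visibleObjects);
    }
}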

This approach is not restricted to bounding spheres. You might choose axis-aligned bounding boxes (AABBs) or oriented bounding boxes (OBBs) to encompass the objects. Trees with these bounding volumes are commonly referred to as AABB-trees, OBB-trees, and Sphere-trees. Tradeoffs come into play between the intersection cost and the bounding volume’s tightness of fit. For example, spheres have the quickest intersection computation but often have the worst fit. You get progressively better fit (sphere, AABB, OBB, low-poly hull) with progressively higher intersection cost.

Whereas bounding volume hierarchies subdivide the objects within a scene, space partitioning systems subdivide the space in which the objects reside. Space partitioning systems divide a space into two or more nonoverlapping subsets. Each object within the scene is assigned into exactly one of these regions. As in the previous example, these regions can be organized into a tree, called a space-partitioning tree. A binary-space partitioning tree (or BSP tree), for example, assigns objects to either side of a plane, applying the partitioning scheme recursively to some maximum tree depth. BSP trees are one of the more common forms of space partitioning. Two other common space partitioning systems are quadtrees and octrees.

A quadtree recursively subdivides a space into four quadrants (typically square or rectangular). Objects are assigned to a quadrant until a maximum capacity is reached, at which point the quadrant is subdivided (typically to some maximum depth). Nodes are pruned when the space they encompass does not intersect with the view frustum.

An octree is a three-dimensional version of a quadtree, subdividing its space into eight octants (commonly cubes). However, this is not meant to suggest that quadtrees cannot be used for three-dimensional scenes. Indeed, a quadtree is preferable in a game in which most of the objects reside in the same plane (the xz-plane, for instance). Octrees consume more memory and are more costly to build and traverse than quadtrees. But if your scene is set, for example, in outer space, and you have objects distributed more or less uniformly throughout the space, the improved culling will mitigate the additional overhead of an octree. This hints that the distribution of the objects within the scene should dictate the data structure used for organization. Quadtrees and octrees are uniform space-partitioning systems; they divide the space uniformly. But if your scene is not uniformly distributed, you end up with a lot of empty regions (wasted space). In such scenarios, you might opt for a nonuniform space-partitioning system, such as a k-d tree.

All these systems have tradeoffs, including their implementation complexity, memory requirements, and costs to update. These must be offset by culling performance for a positive impact. The choice of culling system is largely based on a priori knowledge of the application. Are the objects in the scene evenly distributed or clustered together? Are they arranged more or less within the same plane or do they have a vertical component? Furthermore, will you include dynamic objects within your tree, or just static meshes? Dynamic objects require updating the data structure; this could be prohibitively expensive, depending on the complexity of the scene. Ultimately, you have to experiment within your application to determine the best data structure to use.

Occlusion Culling

You’ve already been introduced to backface culling, or not rendering triangles that face away from the camera. And you’ve employed the depth buffer (or z-buffer) to reject pixels that are behind other pixels (z-culling). But these systems are employed only after an object has been sent to the GPU. CPU occlusion culling rejects objects that are fully occluded by other objects, before they are ever sent to the GPU. Consider a scenario in which you are inside a building that’s part of a much larger scene. The walls of the building occlude all the objects outside, which might be the majority of the geometry. Discarding those occluded objects would dramatically improve rendering performance.

Two major occlusion culling approaches exist: potentially visible set (PVS) rendering and portal rendering. PVS rendering involves subdividing the scene into regions and precomputing which objects are visible from the various camera positions within each region. The runtime cost is just a lookup of the PVS associated with the camera position, and the list of visible objects in the set is sent to the GPU. Disadvantages to PVS rendering include the memory overhead for storing the PVS data, the lack of support for entirely dynamic scenes, and the preprocessing time involved in generating the PVS data. Moreover, manual configuration typically is required to establish the PVS data, which can be time consuming. Furthermore, the PVS data becomes stale and must be regenerated as the scene changes.

Portal rendering also divides the scene into regions (or zones), where the geometry bounding each zone occludes everything outside it. Adjacent zones are connected via portals: openings shared between regions through which objects in neighboring zones can potentially be seen. PVS algorithms can precompute visibility within and between zones.

Object Sorting

Occlusion culling is intended to reduce overdraw, or rendering the same pixel more than once. Rendering an opaque pixel, only to have it overwritten by a pixel that’s closer to the camera, is a waste of processing time. The amount of overdraw can be described as a ratio between the number of unnecessarily drawn pixels (overdrawn pixels) and the number of pixels that should be drawn. Ideally, this value should be zero (no overdraw). As the overdraw ratio increases, performance decreases.

Front-to-back sorting (of opaque objects) can reduce overdraw by supporting the z-buffer's attempts to reject pixels that fail the depth test. Such sorting has a CPU cost, but that cost can be reduced by accepting coarser, less accurate sorting.
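As a rough sketch (assuming a hypothetical DrawItem type that stores an object's world-space position), front-to-back sorting can be as simple as ordering the opaque draw list by squared distance to the camera before issuing draw calls:

#include <DirectXMath.h>
#include <algorithm>
#include <vector>

// Hypothetical draw-list entry; Position is the object's world-space center.
struct DrawItem
{
    DirectX::XMFLOAT3 Position;
    // ... buffers, material, and so on
};

// Sort opaque objects front to back. Squared distance avoids a square root
// per comparison and preserves the ordering.
void SortFrontToBack(std::vector<DrawItem>& items, const DirectX::XMFLOAT3& cameraPosition)
{
    using namespace DirectX;

    std::sort(items.begin(), items.end(),
        [&cameraPosition](const DrawItem& a, const DrawItem& b)
        {
            XMVECTOR eye = XMLoadFloat3(&cameraPosition);
            float da = XMVectorGetX(XMVector3LengthSq(XMVectorSubtract(XMLoadFloat3(&a.Position), eye)));
            float db = XMVectorGetX(XMVector3LengthSq(XMVectorSubtract(XMLoadFloat3(&b.Position), eye)));
            return da < db;
        });
}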

An alternative to object sorting is to render the scene with an early z pass. This technique renders the scene twice: first to the depth buffer (with no pixel shader enabled) and then to the frame buffer. When the scene is rendered the second time, only the pixels nearest the camera will pass the depth test; all others will be rejected before execution of the pixel shader.
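A minimal sketch of an early z pass using raw Direct3D 11 calls (outside the book's Effects framework) follows; the context, renderTargetView, depthStencilView, and a depth-equal/no-write ID3D11DepthStencilState named depthEqualState are assumed to already exist:

// Pass 1: depth only. Bind the depth-stencil view with no render target and no
// pixel shader, so only depth values are written.
context->OMSetRenderTargets(0, nullptr, depthStencilView);
context->PSSetShader(nullptr, nullptr, 0);
// ... draw all opaque geometry with normal depth testing and depth writes ...

// Pass 2: full shading. Re-bind the back buffer, switch the depth test to EQUAL
// with depth writes disabled, and draw the scene again with its real shaders.
// Only the front-most surface at each pixel passes, so the pixel shader runs once per pixel.
context->OMSetRenderTargets(1, &renderTargetView, depthStencilView);
context->OMSetDepthStencilState(depthEqualState, 0);
// ... draw all opaque geometry again with the lighting/texturing shaders bound ...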

Shader Optimization

When you’ve culled as many objects as you can and you’ve organized the draw calls for efficient z-culling, you can turn your attention to optimizing the shaders. The shaders in this book have been developed with a bent toward readability rather than performance. This section aims to provide some guidance on improving the speed of your shaders.

Know thy enemy (or, profile your code).

Several tools on the market can help you profile your shaders, but I prefer NVIDIA Nsight Visual Studio Edition. You can obtain this free tool from NVIDIA’s website. Quite a bit of documentation is available, and you have access to instructional videos to get you started.

Indeed, profiling isn’t simply germane to shader optimization; profiling your applications from the CPU side is just as important. Profiling is applied to the application as a whole and is truly the first step in any optimization attempt.

Don’t optimize prematurely.

If you aren’t having a performance issue, there’s little call for investing valuable development time to improve rendering speed.

Pixels outnumber vertices.

Generally, fewer vertices than pixels are moving through the rendering pipeline. Thus, try to move instructions out of the pixel shader and into the vertex shader, particularly if the visual impact is negligible.

A common technique for determining whether the pixel shader is the bottleneck is to compare the performance of the application rendered at different window sizes. The larger the window, the more pixels are rendered. If your performance drops precipitously at higher resolutions, it’s a good bet that your pixel shader is the culprit.

Skip unnecessary instructions.

This recommendation is largely about dynamic branching. For example, if your shader supports fog, test the fog amount at the beginning of the pixel shader. If the pixel is fully fogged, you can skip the rest of the shader. This holds true for a number of “early outs,” including transparency. If you’re alpha blending and the pixel you’re on is fully transparent, you can skip the entire pixel shader.

That said, you’re more likely to be able to skip blocks of code than the whole shader. For example, if you’re using Lambert’s cosine law to compute a diffuse component and the light is “behind” the surface, you can skip additional processing for the diffuse and specular terms. However, you still might have an ambient term. Dynamic branching is your friend in such scenarios. Use it sparingly, but you should almost certainly prefer dynamic branching over executing unnecessary instructions.

Another example is renormalizing data unnecessarily. If the CPU-side application guarantees that a direction vector is already normalized, there’s no need to do it again. Similarly, you might renormalize direction vectors in the pixel shader because the data coming out of the interpolator could be slightly off. However, the visual artifacts might be small enough that you can skip the extra normalization.

Keep your shaders simple.

The more complex your shaders, the slower they are to execute.

Examine the assembly code.

You can ask the HLSL compiler to output the assembly instructions built for each of your shaders. To do so, open the Property Pages for your project and edit the Additional Options field under Configuration Properties, HLSL Compiler, Command Line to add the following option:

/Fc"$(OutDir)ContentEffects\%(Filename).asm"

Listing 22.1 shows the assembly from the release build of the BlinnPhongIntrinsics.fx pixel shader.

Listing 22.1 Assembly Code for the Pixel Shader from BlinnPhongIntrinsics.fx


PixelShader = asm {
    ps_5_0
    dcl_globalFlags refactoringAllowed
    dcl_constantbuffer cb0[2], immediateIndexed
    dcl_constantbuffer cb1[10], immediateIndexed
    dcl_sampler s0, mode_default
    dcl_resource_texture2d (float,float,float,float) t0
    dcl_input_ps linear v1.xyz
    dcl_input_ps linear v2.xy
    dcl_input_ps linear v3.xyz
    dcl_input_ps linear v4.xyz
    dcl_output o0.xyzw
    dcl_temps 3
    dp3 r0.x, v4.xyzx, v4.xyzx
    rsq r0.x, r0.x
    dp3 r0.y, v3.xyzx, v3.xyzx
    rsq r0.y, r0.y
    mul r0.yzw, r0.yyyy, v3.xxyz
    mad r1.xyz, v4.xyzx, r0.xxxx, r0.yzwy
    dp3 r0.x, r1.xyzx, r1.xyzx
    rsq r0.x, r0.x
    mul r1.xyz, r0.xxxx, r1.xyzx
    dp3 r0.x, v1.xyzx, v1.xyzx
    rsq r0.x, r0.x
    mul r2.xyz, r0.xxxx, v1.xyzx
    dp3 r0.x, r2.xyzx, r1.xyzx
    dp3 r0.y, r2.xyzx, r0.yzwy
    ge r0.zw, r0.xxxy, l(0.000000, 0.000000, 0.000000, 0.000000)
    log r0.x, r0.x
    mul r0.x, r0.x, cb1[9].x
    exp r0.x, r0.x
    max r0.y, r0.y, l(0.000000)
    and r0.z, r0.w, r0.z
    and r0.x, r0.x, r0.z
    sample_indexable(texture2d)(float,float,float,float) r1.xyzw, v2.xyxx, t0.xyzw, s0
    min r0.x, r0.x, r1.w
    mul r0.yzw, r0.yyyy, r1.xxyz
    mul r2.xyz, cb0[1].wwww, cb0[1].xyzx
    mul r0.yzw, r0.yyzw, r2.xxyz
    mul r2.xyz, cb0[0].wwww, cb0[0].xyzx
    mad r0.yzw, r2.xxyz, r1.xxyz, r0.yyzw
    mul r1.xyz, cb1[8].wwww, cb1[8].xyzx
    mad o0.xyz, r1.xyzx, r0.xxxx, r0.yzwy
    mov o0.w, l(1.000000)
    ret
    // Approximately 32 instruction slots used
};


By examining the assembly, you can see exactly what instructions are generated from your code and refactor sections that are suspect. That said, the driver processes these instructions further for the actual graphics card. You need a vendor-specific application to view the instructions a particular driver produces.

Use intrinsic functions.

We discussed this concept in Chapter 6, “Lighting Models,” when we introduced the HLSL intrinsic lit() function to compute the diffuse and specular coefficients. Using intrinsic functions yields fewer instructions and has better performance than rolling the same code yourself.

Don’t perform calculations on otherwise constant data.

Recall the light intensity multiplications performed in shaders throughout the text (for example, sampledColor.rgb * ambientColor.rgb * ambientColor.a). The ambientColor.rgb * ambientColor.a product is a constant value for every pixel rendered by the draw call. Such calculations are better done on the CPU and passed into the shader as a constant. Chapter 6 briefly mentioned this, but it is worth emphasizing here. The book’s shaders were authored the other way to make interaction simple within NVIDIA FX Composer.
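For example, the ambient term can be pre-scaled once per frame on the CPU with DirectXMath and uploaded as a single constant. This is only a sketch; the commented material call follows the book's material pattern but is otherwise illustrative:

#include <DirectXMath.h>
using namespace DirectX;

// rgb = ambient color, a = ambient intensity (the convention used by the book's shaders)
XMFLOAT4 ambientColor(0.5f, 0.5f, 0.5f, 0.4f);

// Fold color * intensity once on the CPU instead of once per pixel on the GPU.
XMVECTOR folded = XMVectorScale(XMLoadFloat4(&ambientColor), ambientColor.w);
XMFLOAT3 preScaledAmbient;
XMStoreFloat3(&preScaledAmbient, folded);

// The shader then consumes a single float3 constant, for example:
// mMaterial->AmbientColor() << XMLoadFloat3(&preScaledAmbient);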

This section just scratches the surface of a much larger topic, but hopefully these few guidelines will point you in the right direction.

Hardware Instancing

The final optimization topic we discuss is hardware instancing, or rendering multiple instances of the same mesh data through a single draw call. It’s common to duplicate geometry throughout a scene. However, while each instance shares the same geometry, an instance has (at least) a unique world matrix and possibly other attributes that differentiate the object from other copies. Without hardware instancing, you can render copies of the same mesh by issuing multiple draw calls against the same vertex buffer. And for each call, you update the shader constants specific to each object. Although that works, each draw call and material update has additional API overhead. Hardware instancing mitigates this overhead.

Configuring hardware instancing involves augmenting the data sent to the input-assembler stage with per-instance variables. Listing 22.2 presents a hardware instancing version of the point light shader. Notice that the VS_INPUT structure now contains members for the object’s world matrix, specular power, and specular color.

Listing 22.2 An Instancing Point Light Shader


#include "include\Common.fxh"

/************* Resources *************/

cbuffer CBufferPerFrame
{
    float4 AmbientColor = {1.0f, 1.0f, 1.0f, 0.0f};
    float4 LightColor = {1.0f, 1.0f, 1.0f, 1.0f};
    float3 LightPosition = {0.0f, 0.0f, 0.0f};
    float LightRadius = 10.0f;
    float3 CameraPosition : CAMERAPOSITION;
}

cbuffer CBufferPerObject
{
    float4x4 ViewProjection : VIEWPROJECTION;
}

Texture2D ColorTexture;

SamplerState TrilinearSampler
{
    Filter = MIN_MAG_MIP_LINEAR;
    AddressU = WRAP;
    AddressV = WRAP;
};

/************* Data Structures *************/

struct VS_INPUT
{
    float4 ObjectPosition : POSITION;
    float2 TextureCoordinate : TEXCOORD;
    float3 Normal : NORMAL;
    row_major float4x4 World : WORLD;
    float4 SpecularColor : SPECULARCOLOR;
    float SpecularPower : SPECULARPOWER;
};

struct VS_OUTPUT
{
    float4 Position : SV_Position;
    float3 Normal : NORMAL;
    float2 TextureCoordinate : TEXCOORD;
    float3 WorldPosition : WORLD;
    float Attenuation : ATTENUATION;
    float4 SpecularColor : SPECULAR;
    float SpecularPower : SPECULARPOWER;
};

/************* Vertex Shader *************/

VS_OUTPUT vertex_shader(VS_INPUT IN)
{
    VS_OUTPUT OUT = (VS_OUTPUT)0;

    OUT.WorldPosition = mul(IN.ObjectPosition, IN.World).xyz;
    OUT.Position = mul(float4(OUT.WorldPosition, 1.0f),
ViewProjection);
    OUT.Normal = normalize(mul(float4(IN.Normal, 0), IN.World).xyz);
    OUT.TextureCoordinate = IN.TextureCoordinate;

    float3 lightDirection = LightPosition - OUT.WorldPosition;
    OUT.Attenuation = saturate(1.0f - (length(lightDirection) /
LightRadius));
    OUT.SpecularColor = IN.SpecularColor;
    OUT.SpecularPower = IN.SpecularPower;

    return OUT;
}

/************* Pixel Shader *************/

float4 pixel_shader(VS_OUTPUT IN) : SV_Target
{
    float4 OUT = (float4)0;

    float3 lightDirection = LightPosition - IN.WorldPosition;
    lightDirection = normalize(lightDirection);

    float3 viewDirection = normalize(CameraPosition -
IN.WorldPosition);

    float3 normal = normalize(IN.Normal);
    float n_dot_l = dot(normal, lightDirection);
    float3 halfVector = normalize(lightDirection + viewDirection);
    float n_dot_h = dot(normal, halfVector);

    float4 color = ColorTexture.Sample(TrilinearSampler,
IN.TextureCoordinate);
    float4 lightCoefficients = lit(n_dot_l, n_dot_h, IN.SpecularPower);

    float3 ambient = get_vector_color_contribution(AmbientColor, color.rgb);
    float3 diffuse = get_vector_color_contribution(LightColor,
  lightCoefficients.y * color.rgb) * IN.Attenuation;
    float3 specular = get_scalar_color_contribution(IN.SpecularColor,
min(lightCoefficients.z, color.w)) * IN.Attenuation;

    OUT.rgb = ambient + diffuse + specular;
    OUT.a = 1.0f;

    return OUT;
}

/************* Techniques *************/

technique11 main11
{
    pass p0
    {
        SetVertexShader(CompileShader(vs_5_0, vertex_shader()));
        SetGeometryShader(NULL);
        SetPixelShader(CompileShader(ps_5_0, pixel_shader()));
    }
}


The new members to the VS_INPUT structure replace global shader constants and are now used in the vertex and pixel shaders to transform the object’s position and compute the specular term. Note also that the world matrix component has been removed from the global ViewProjection constant. The ViewProjection matrix will be constant across instances, but the world matrix now comes from the per-instance data.

On the CPU side, you must update your input layout to match the new input signature. In contrast with the D3D11_INPUT_PER_VERTEX_DATA classification you’ve used thus far, your D3D11_INPUT_ELEMENT_DESC structures specify D3D11_INPUT_PER_INSTANCE_DATA for per-instance data elements. The following code shows the array of input element descriptions used for the shader in Listing 22.2:

D3D11_INPUT_ELEMENT_DESC inputElementDescriptions[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 16, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "NORMAL", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 24, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "WORLD", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 0, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "WORLD", 1, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 16, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "WORLD", 2, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 32, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "WORLD", 3, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 48, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "SPECULARCOLOR", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 64, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "SPECULARPOWER", 0, DXGI_FORMAT_R32_FLOAT, 1, 80, D3D11_INPUT_PER_INSTANCE_DATA, 1 }
};

Some additional constructs are noteworthy in these input element descriptions. Notice the four WORLD elements, each holding four 32-bit floating-point values; they are identified by the semantic indices 0, 1, 2, and 3 (the second field of the D3D11_INPUT_ELEMENT_DESC structure). This is how the 16 floats of the 4×4 world matrix shader input are referenced. Also note that all the per-instance elements are passed to the input-assembler stage through slot 1 (the field immediately after the format specification). The IA stage has 16 input slots that accommodate up to 16 vertex buffers in a single draw call. You can store your instance data in a vertex buffer that’s separate from the buffer storing geometry data; that way, the instance buffer can be updated independently. Finally, note that the last field (the InstanceDataStepRate member) is specified with the value 1 for each per-instance element. This is the number of instances to draw using the same per-instance data before advancing to the next element in the instance buffer.
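The instance buffer itself is an ordinary vertex buffer. A sketch of its creation might look like the following; the real demo wraps this in InstancingMaterial::CreateInstanceBuffer(), and InstanceData here stands in for a struct that mirrors the WORLD, SPECULARCOLOR, and SPECULARPOWER elements above:

// Assumes: device is an ID3D11Device*, and instanceData is a std::vector<InstanceData>.
D3D11_BUFFER_DESC bufferDesc;
ZeroMemory(&bufferDesc, sizeof(bufferDesc));
bufferDesc.ByteWidth = static_cast<UINT>(sizeof(InstanceData) * instanceData.size());
bufferDesc.Usage = D3D11_USAGE_DYNAMIC;            // allows per-frame updates from the CPU
bufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;   // bound to input slot 1 alongside the geometry buffer
bufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

D3D11_SUBRESOURCE_DATA instanceSubresourceData;
ZeroMemory(&instanceSubresourceData, sizeof(instanceSubresourceData));
instanceSubresourceData.pSysMem = instanceData.data();

ID3D11Buffer* instanceBuffer = nullptr;
HRESULT hr = device->CreateBuffer(&bufferDesc, &instanceSubresourceData, &instanceBuffer);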

Listing 22.3 presents the initialization and draw calls for the hardware instancing demo. You can find the full source code on the companion website.

Listing 22.3 Initialization and Draw Calls for the Hardware Instancing Demo


void InstancingDemo::Initialize()
{
    SetCurrentDirectory(Utility::ExecutableDirectory().c_str());

    std::unique_ptr<Model> model(new Model(*mGame, "Content\Models\Sphere.obj", true));

    // Initialize the material
    mEffect = new Effect(*mGame);
    mEffect->LoadCompiledEffect(L"Content\Effects\Instancing.cso");
    mMaterial = new InstancingMaterial();
    mMaterial->Initialize(*mEffect);

    // Create vertex buffer
    Mesh* mesh = model->Meshes().at(0);
    ID3D11Buffer* vertexBuffer = nullptr;
    mMaterial->CreateVertexBuffer(mGame->Direct3DDevice(), *mesh,
&vertexBuffer);
    mVertexBuffers.push_back(new VertexBufferData(vertexBuffer,
mMaterial->VertexSize(), 0));

    // Create instance buffer
    std::vector<InstancingMaterial::InstanceData> instanceData;
    UINT axisInstanceCount = 5;
    float offset = 20.0f;
    for (UINT x = 0; x < axisInstanceCount; x++)
    {
        float xPosition = x * offset;

        for (UINT z = 0; z < axisInstanceCount; z++)
        {
            float zPosition = z * offset;

            instanceData.push_back(InstancingMaterial::InstanceData( XMMatrixTranslation(-xPosition, 0, -zPosition), ColorHelper::ToFloat4(mSpecularColor), mSpecularPower));
            instanceData.push_back(InstancingMaterial::InstanceData( XMMatrixTranslation(xPosition, 0, -zPosition), ColorHelper::ToFloat4(mSpecularColor), mSpecularPower));
        }
    }

    ID3D11Buffer* instanceBuffer = nullptr;
    mMaterial->CreateInstanceBuffer(mGame->Direct3DDevice(),
instanceData, &instanceBuffer);
    mInstanceCount = instanceData.size();
    mVertexBuffers.push_back(new VertexBufferData(instanceBuffer,
mMaterial->InstanceSize(), 0));

    // Create index buffer
    mesh->CreateIndexBuffer(&mIndexBuffer);
    mIndexCount = mesh->Indices().size();

    std::wstring textureName = L"Content\Textures\EarthComposite.
jpg"
;
    HRESULT hr = DirectX::CreateWICTextureFromFile(mGame-
>Direct3DDevice(), mGame->Direct3DDeviceContext(), textureName.c_str(),
nullptr, &mColorTexture);
    if (FAILED(hr))
    {
        throw GameException("CreateWICTextureFromFile() failed.", hr);
    }

    mPointLight = new PointLight(*mGame);
    mPointLight->SetRadius(500.0f);
    mPointLight->SetPosition(5.0f, 0.0f, 10.0f);

    mKeyboard = (Keyboard*)mGame->Services().GetService(Keyboard::TypeIdClass());
    assert(mKeyboard != nullptr);

    mProxyModel = new ProxyModel(*mGame, *mCamera, "Content\Models\PointLightProxy.obj", 0.5f);
    mProxyModel->Initialize();
}

void InstancingDemo::Draw(const GameTime& gameTime)
{
    ID3D11DeviceContext* direct3DDeviceContext =
mGame->Direct3DDeviceContext();
    direct3DDeviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

    Pass* pass = mMaterial->CurrentTechnique()->Passes().at(0);
    ID3D11InputLayout* inputLayout = mMaterial->InputLayouts().at(pass);
    direct3DDeviceContext->IASetInputLayout(inputLayout);

    ID3D11Buffer* vertexBuffers[2] = { mVertexBuffers[0]->VertexBuffer,
mVertexBuffers[1]->VertexBuffer };
    UINT strides[2] = { mVertexBuffers[0]->Stride, mVertexBuffers[1]
->Stride };
    UINT offsets[2] = { mVertexBuffers[0]->Offset, mVertexBuffers[1]
->Offset };

    direct3DDeviceContext->IASetVertexBuffers(0, 2, vertexBuffers,
strides, offsets);
    direct3DDeviceContext->IASetIndexBuffer(mIndexBuffer, DXGI_FORMAT_R32_UINT, 0);

    mMaterial->ViewProjection() << mCamera->ViewMatrix() *
mCamera->ProjectionMatrix();
    mMaterial->AmbientColor() << XMLoadColor(&mAmbientColor);
    mMaterial->LightColor() << mPointLight->ColorVector();
    mMaterial->LightPosition() << mPointLight->PositionVector();
    mMaterial->LightRadius() << mPointLight->Radius();
    mMaterial->ColorTexture() << mColorTexture;
    mMaterial->CameraPosition() << mCamera->PositionVector();

    pass->Apply(0, direct3DDeviceContext);

    direct3DDeviceContext->DrawIndexedInstanced(mIndexCount,
mInstanceCount, 0, 0, 0);

    mProxyModel->Draw(gameTime);
}


This code creates a grid of spheres along the xz-plane and builds a vertex buffer from the instance data to be used in the draw call. Note that two buffers are specified in the call to ID3D11DeviceContext::IASetVertexBuffers(). The draw method now invokes ID3D11DeviceContext::DrawIndexedInstanced() with the index count of the shared object and the number of instances to draw. Figure 22.1 shows the output of the hardware instancing demo. The demo’s Update() method could also modify the values within the instance buffer or add and remove instances.
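If the instance buffer was created with D3D11_USAGE_DYNAMIC and CPU write access, such an update is a straightforward map, copy, and unmap. This sketch assumes the demo keeps a CPU-side instanceData vector in sync with what it wants to render:

D3D11_MAPPED_SUBRESOURCE mappedSubresource;
if (SUCCEEDED(direct3DDeviceContext->Map(instanceBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedSubresource)))
{
    // Overwrite the entire buffer with the current per-instance data.
    memcpy(mappedSubresource.pData, instanceData.data(),
        sizeof(InstancingMaterial::InstanceData) * instanceData.size());
    direct3DDeviceContext->Unmap(instanceBuffer, 0);
}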


Figure 22.1 Output of the hardware instancing demo. (Original texture from Reto Stöckli, NASA Earth Observatory. Additional texturing by Nick Zuccarello, Florida Interactive Entertainment Academy.)

Deferred Shading

The rendering you’ve done in this book has been forward rendering—essentially, rendering each object in the scene separately, using lights, textures, and shadows to determine the final color of the pixels associated with an object. You’ve been exposed to the concepts of single-pass rendering (all the lighting for an object is computed in a single shader) and multipass rendering (the final pixel values of an object are composited through multiple draw calls). In any forward rendering system, you are limited in the number of lights that can impact an object. The more lights, the more work the system has to do, and the slower your application will run. But another approach to rendering, called deferred shading, can handle a large number of lights in the scene without significant performance degradation.


Note

Deferred shading is also known as deferred rendering. Unfortunately, this term is overloaded in graphics terminology. Microsoft, in particular, uses the term deferred rendering to refer to buffering commands so that they can be “played back” at some other time. This is entirely different from the deferred shading topic this section discusses.


With deferred shading, all the objects in the scene are first rendered without lighting information. They’re rendered not to the back buffer, but to several off-screen buffers using multiple render targets (MRT). These render targets store information about the scene’s geometry, including the object’s position, normal, color (typically sampled from a diffuse color texture), specular power, and specular intensity. Remember that render targets are just 2D textures—but instead of storing color, these geometry buffers (or G-buffers) store data that influences a second, lighting pass to compute the final color of all the pixels in the scene.

The lighting pass resembles a post-processing effect, such as those in Chapter 18, “Post Processing.” For each light in the scene, you sample the data from the geometry buffers and compute the direct and indirect lighting for each pixel influenced by the light. The final image is a composite of the contribution of all the lights in the scene, mapped to a full-screen quad.
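A minimal sketch of the two passes with raw Direct3D 11 calls follows. The render-target and shader-resource views (mColorRTV/mColorSRV, mNormalRTV/mNormalSRV, mPositionRTV/mPositionSRV) and the context are assumed to have been created already, over back-buffer-sized textures:

// Geometry pass: bind the G-buffer as multiple render targets and render the
// scene once, writing attributes (albedo, normal, position) instead of lit color.
ID3D11RenderTargetView* gBuffer[] = { mColorRTV, mNormalRTV, mPositionRTV };
context->OMSetRenderTargets(3, gBuffer, mDepthStencilView);
// ... draw all scene geometry with the G-buffer shader ...

// Lighting pass: switch back to the back buffer and bind the G-buffer as textures.
context->OMSetRenderTargets(1, &mBackBufferRTV, nullptr);
ID3D11ShaderResourceView* gBufferSRVs[] = { mColorSRV, mNormalSRV, mPositionSRV };
context->PSSetShaderResources(0, 3, gBufferSRVs);
// ... draw a full-screen quad per light (or one pass that loops over the lights),
// accumulating each light's contribution with additive blending ...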

Because the lighting is performed in screen space, the cost to render a frame is primarily a function of the number of lights in the scene and the number of pixels they influence. Geometry data is rendered just once per frame, and costly lighting calculations are performed only against the geometry that’s actually affected. Beyond the performance advantages, deferred shading simplifies a rendering engine’s architecture because all objects in the scene share exactly one shader. Considerable performance overhead is required to switch shaders, so forward rendering engines commonly batch objects with like materials. This entails a more complicated engine architecture that is absent from deferred shading systems.

That said, deferred shading has many disadvantages that still make forward rendering necessary, if not preferable. First, the geometry buffers require large amounts of memory. Second, deferred shading does not handle transparency. This is typically handled by a separate sorting step (from back to front) and a forward rendering pass of transparent objects. Another disadvantage is that, by separating the geometry and lighting stages, typical hardware anti-aliasing no longer functions properly and an additional post-processing technique must be applied to perform edge smoothing.

In the end, a modern rendering engine likely incorporates elements of both forward and deferred shading.

Global Illumination

Global illumination refers to a set of techniques for adding more realistic light to a scene. In particular, global illumination simulates indirect lighting, light that reaches a surface after being reflected by other objects. Consider the ambient lighting used in this book’s shaders. This term was modeled as a constant value and applied uniformly to an entire object. Clearly, this is not how ambient light actually works. Ambient light is the result of the complex inter-reflection of light between the objects in the scene; a surface that’s more occluded (with respect to other reflective surfaces) receives less indirect lighting and appears darker.

A number of popular techniques can simulate indirect lighting, with tradeoffs between physical accuracy and speed. Ambient occlusion describes a family of rendering techniques that attempt to estimate the amount of indirect light that reaches a surface. Ambient occlusion can be pre-calculated and baked into textures. This is a common practice because it adds realism to the lighting with no runtime performance implications. However, such textures don’t respond to dynamic lighting, and the trick can be revealed if the lighting changes or the surrounding environment is modified. A widely adopted real-time technique is known as screen-space ambient occlusion (SSAO). SSAO operates in screen space (as a post-processing step) and is therefore independent of scene complexity. In particular, the scene is rendered to a depth buffer, and the depths around a pixel are sampled to compute an occlusion factor. Furthermore, SSAO can be implemented entirely on the GPU and works well with dynamic objects within the scene. A detailed discussion of SSAO is outside the scope of this text, and we leave it as a topic for the reader to explore.

Another topic in modern rendering (a full discussion of which is likewise outside the scope of this text), and one that is often used for global illumination, is spherical harmonic lighting (SH lighting). Quite a bit of math is involved, but SH lighting can be summarized as a set of techniques that precompute functions (such as the global lighting environment) into spherical harmonic coefficients. All our lighting techniques can be considered simplifications of the rendering equation, a physically based description of how light leaves a surface. The rendering equation is an integral over a hemisphere of directions and is not real-time friendly. SH lighting replaces parts of the rendering equation with spherical functions; instead of evaluating the integral at runtime, the computation reduces to a dot product of SH coefficients. SH lighting can produce highly realistic images in real time, but the position and form of objects within the scene must remain static or use separate sets of coefficients.
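For reference, the rendering equation that these techniques approximate can be written as follows, where L_o is the outgoing radiance at point x in direction ω_o, L_e is emitted radiance, f_r is the BRDF, and the integral gathers incoming radiance L_i over the hemisphere Ω around the surface normal n:

L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)\, d\omega_i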

Compute Shaders

With the release of DirectX 11, Microsoft introduced the DirectCompute API, a library supporting general-purpose programming on the GPU. Compute shaders enable you to move practically any computation onto the GPU, with the intent of offloading work from the CPU and leveraging the massively parallel architecture of the graphics card.

The compute shader (CS) stage is outside the normal rendering pipeline but it can read and write to GPU resources. Thus, although compute shaders are capable of supporting any number of general-purpose processes, some graphics applications are particularly interesting. For example, you can use the compute shader for the deferred shading lighting pass, deferred shading edge smoothing, depth-of-field blurring, or post-processing in general. Outputs from the compute shader can be bound as inputs to the rendering pipeline without ever transferring those outputs to the CPU.

Threads

Compute shaders can run on multiple threads, in parallel, within a thread group. A thread group contains a set of threads with access to the same shared memory. Thread groups are created as a three-dimensional grid whose size is specified when you execute a compute shader through a call to ID3D11DeviceContext::Dispatch(UINT threadGroupCountX, UINT threadGroupCountY, UINT threadGroupCountZ). The number of threads within a group is specified through the numthreads(UINT x, UINT y, UINT z) attribute attached to the compute shader. For example:

[numthreads(32, 32, 1)]
void compute_shader()
{
     // Shader body
}

The total number of threads is calculated by multiplying the number of thread groups by the number of threads per group. For example, a call to ID3D11DeviceContext::Dispatch (32, 24, 1) against the previous compute shader would yield (32×32, 24×32, 1×1) = 1024×768×1 total threads. Work is distributed among these threads according to the number of processors on the GPU.
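In practice, you compute the group counts from the output dimensions, rounding up so that the last, partially filled group still covers the edge texels. A sketch, assuming a 32×32 thread group as in the earlier attribute and textureWidth/textureHeight describing the output texture:

// Number of 32x32 thread groups needed to cover a textureWidth x textureHeight output.
const UINT threadsPerGroup = 32;
UINT groupCountX = (textureWidth + threadsPerGroup - 1) / threadsPerGroup;    // ceiling division
UINT groupCountY = (textureHeight + threadsPerGroup - 1) / threadsPerGroup;

direct3DDeviceContext->Dispatch(groupCountX, groupCountY, 1);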

A Simple Compute Shader

Writing sophisticated compute shaders is beyond the scope of this section, but Listing 22.4 presents a simple compute shader that writes colors to a 2D texture.

Listing 22.4 A Compute Shader That Writes to a 2D Texture


RWTexture2D<float4> OutputTexture;

cbuffer CBufferPerFrame
{
    float2 TextureSize;
    float BlueColor;
};

[numthreads(32, 32, 1)]
void compute_shader(uint3 threadID : SV_DispatchThreadID)
{
    float4 color = float4((threadID.xy / TextureSize), BlueColor, 1);
    OutputTexture[threadID.xy] = color;
}

technique11 compute
{
    pass p0
    {
        SetVertexShader(NULL);
        SetGeometryShader(NULL);
        SetPixelShader(NULL);
        SetComputeShader(CompileShader(cs_5_0, compute_shader()));
    }
}


This listing begins by declaring a read-write (RW) 2D texture (OutputTexture) that the compute shader writes to. This data type allows multiple threads to write to the texture simultaneously. A specific texel can be accessed through a 2D coordinate using array-style conventions.

This compute shader is declared with a single input parameter, threadID, marked with the SV_DispatchThreadID semantic. This parameter contains the unique identifier of the thread that the compute shader is executing within. The shader in Listing 22.4 uses this ID as an index into the output texture. In this way, each execution of the compute shader writes a specific texel. If you do not specify enough total threads (in this example) to cover the entire texture, some of the texels will not be written.

Note that you can declare and access constant buffer data just as you have for vertex and pixel shaders. In this example, the TextureSize variable is used to normalize the thread IDs (whose values are unsigned integers), and the BlueColor variable is used for the texel’s blue channel.

In the application-side demo, an unordered access view (UAV) is bound to the OutputTexture variable. A UAV provides unordered read/write access to a resource from multiple threads. More specifically, multiple threads can simultaneously read from and write to the resource without concern for memory conflicts. Listing 22.5 presents the initialization code for creating the UAV. Note that a shader resource view (SRV) is also created using the same texture and is subsequently bound to a pixel shader for rendering the output of the compute shader.

Listing 22.5 Initializing a UAV and an SRV for the Compute Shader Demo


D3D11_TEXTURE2D_DESC textureDesc;
ZeroMemory(&textureDesc, sizeof(textureDesc));
textureDesc.Width = mGame->ScreenWidth();
textureDesc.Height = mGame->ScreenHeight();
textureDesc.MipLevels = 1;
textureDesc.ArraySize = 1;
textureDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
textureDesc.SampleDesc.Count = 1;
textureDesc.SampleDesc.Quality = 0;
textureDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;

HRESULT hr;
ID3D11Texture2D* texture = nullptr;
if (FAILED(hr = mGame->Direct3DDevice()->CreateTexture2D(&textureDesc,
nullptr, &texture)))
{
    throw GameException("IDXGIDevice::CreateTexture2D() failed.", hr);
}

D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
ZeroMemory(&uavDesc, sizeof(uavDesc));
uavDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
uavDesc.Texture2D.MipSlice = 0;

if (FAILED(hr = mGame->Direct3DDevice()->CreateUnorderedAccessView(texture, &uavDesc, &mOutputTexture)))
{
    ReleaseObject(texture);
    throw GameException("IDXGIDevice::CreateUnorderedAccessView() failed.", hr);
}

D3D11_SHADER_RESOURCE_VIEW_DESC resourceViewDesc;
ZeroMemory(&resourceViewDesc, sizeof(resourceViewDesc));
resourceViewDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
resourceViewDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
resourceViewDesc.Texture2D.MipLevels = 1;

if (FAILED(hr = mGame->Direct3DDevice()->CreateShaderResourceView(texture, &resourceViewDesc, &mColorTexture)))
{
    ReleaseObject(texture);
    throw GameException("IDXGIDevice::CreateShaderResourceView() failed.", hr);
}

ReleaseObject(texture);


Note how the mOutputTexture (UAV) and mColorTexture (SRV) members refer to the same texture.

Drawing the demo is a matter of updating the compute shader’s material, dispatching the compute shader, and rendering the output texture to a full-screen quad. Listing 22.6 presents the demo’s Draw() method.

Listing 22.6 The Draw() Method of the Compute Shader Demo


void ComputeShaderDemo::Draw(const GameTime& gameTime)
{
    ID3D11DeviceContext* direct3DDeviceContext =
mGame->Direct3DDeviceContext();

    // Update the compute shader's material
    mMaterial->TextureSize() << XMLoadFloat2(&mTextureSize);
    mMaterial->BlueColor() << mBlueColor;
    mMaterial->OutputTexture() << mOutputTexture;
    mComputePass->Apply(0, direct3DDeviceContext);

    // Dispatch the compute shader
    direct3DDeviceContext->Dispatch(mThreadGroupCount.x,
mThreadGroupCount.y, 1);

    // Unbind the UAV from the compute shader, so we can bind the same underlying resource as an SRV

    static ID3D11UnorderedAccessView* emptyUAV = nullptr;
    direct3DDeviceContext->CSSetUnorderedAccessViews(0, 1, &emptyUAV,
nullptr);

    // Draw the texture written by the compute shader
    mFullScreenQuad->Draw(gameTime);
    mGame->UnbindPixelShaderResources(0, 1);
}

void ComputeShaderDemo::UpdateRenderingMaterial()
{
    mMaterial->ColorTexture() << mColorTexture;
}


Note how the UAV is unbound from the compute shader stage through the ID3D11DeviceContext::CSSetUnorderedAccessViews() call. This is necessary because the same resource is being used in the full-screen quad’s draw call. More specifically, the full-screen quad is attached to the ComputeShaderDemo::UpdateRenderingMaterial() callback, which sets the material’s ColorTexture() member.

Within the demo, the BlueColor shader variable is updated each frame through the following method:

void ComputeShaderDemo::Update(const GameTime& gameTime)
{
    mBlueColor = (0.5f) * static_cast<float>(sin(gameTime.TotalGameTime())) + 0.5f;
}

Figure 22.2 shows a snapshot of the output of the compute shader demo.


Figure 22.2 Output of the compute shader demo.

This section merely introduces compute shaders. You have an amazing amount of capability available with DirectCompute.

Data-Driven Engine Architecture

All the demos in this book have specified their assets—the models, textures, and materials—in code. This simplified the demos but is not common practice for modern rendering engines. Instead, engines are data driven; their scenes are populated from configuration files. A data-driven engine implies that the game will happily execute even without any (or much) data in the world. Furthermore, modern engines incorporate not only assets from configuration files, but behavior as well. This is typically accomplished through scripting languages. The Unreal Development Kit (UDK), for example, uses UnrealScript to author practically all game code. Unity (another popular game engine) offers a choice between C#, JavaScript, and Boo for game scripting.

Making a game engine data-driven starts by defining a configuration file format and an entry point for loading assets. XML is a common format choice because the data is (at least somewhat) human readable and can be hand-crafted in any text editor. The specific definition of configuration files matches the attributes associated with runtime classes. Furthermore, configuration files are often hierarchical, with a “master” configuration file broken into myriad subfiles that define specific levels or asset types. Multiple developers can work on different sections of the game without concern over conflicting file changes. The following example presents a hypothetical XML configuration format:

<!-- Game.xml, the root asset store of the configuration hierarchy -->
<AssetStore>
  <Items>
    <Item Class="AssetStore" File="ContentLevelsLevel1.xml" />
    <Item Class="AssetStore" File="ContentLevelsLevel2.xml" />
    <Item Class="AssetStore" File="ContentLevelsLevel3.xml" />
  </Items>
</AssetStore>

<!-- Level1.xml -->
<AssetStore>
  <Items>
    <Item Class="Camera" FOV="0.785398" AspectRatio="1.33333" NearPlaneDistance="1.0" FarPlaneDistance="1000.0">
      <Position X="0" Y="0" Z="0" />
      <Direction X="0" Y="0" Z="-1" />
      <Up X="0" Y="1" Z="0" />
    </Item>

    <Item Class="Skybox" CubemapFile="ContentTexturesMaskonaive2_1024.
dds
" Scale="500" />
  </Items>
</AssetStore>

At the heart of this system is the capability to match runtime types with elements in the configuration file. This can be accomplished with the factory pattern, a software design pattern that allows objects to be instantiated through an interface. More specifically, your configuration files store class names used for lookup into a set of factories. Each factory knows how to instantiate a single class, and the factory manager locates the factory associated with a name.

In the previous example, a C++ AssetStore class contains a collection of Item instances. The AssetStore class represents the root configuration data structure. The Item class represents a single asset within the store, and its Class attribute identifies the C++ class to instantiate. If the item’s File attribute is present, the runtime loading system should retrieve the asset’s configuration from a secondary file; otherwise, the configuration data required for the class is included as attributes on the Item element itself. In this example, Camera and Skybox classes will be instantiated and configured with the associated data.

The AssetStore class interacts with a templated Factory class, whose declaration could resemble that of Listing 22.7.

Listing 22.7 The Factory<T> Class Declaration


template <class T>
class Factory
{
public:
    virtual ~Factory();

    virtual const std::string ClassName() const = 0;
    virtual T* Create() const = 0;

    static Factory<T>* Find(const std::string& className);
    static T* Create(const std::string& className);

    static typename std::map<std::string, Factory<T>*>::iterator
Begin();
    static typename std::map<std::string, Factory<T>*>::iterator End();

protected:
    static void Add(Factory<T>* const factory);
    static void Remove(Factory<T>* const factory);

private:
    static std::map<std::string, Factory<T>*> sFactories;
};


This class acts as both factory and factory manager. Each derived class implements the pure-virtual ClassName() and Create() methods, and adds or removes itself from the static map of factories through Add() and Remove(). The static Create() method attempts to find the factory for the given name and invokes the nonstatic Create() method if it finds a factory. Creating factory subclasses is a common requirement (necessary for any class you want to expose to your data-driven configuration system), so you could implement the macro in Listing 22.8 for easy factory definition.

Listing 22.8 ConcreteFactory Class Macro


#define ConcreteFactory(ConcreteProductT, AbstractProductT)            \
class ConcreteProductT ## Factory : public Factory<AbstractProductT>   \
{                                                                       \
public:                                                                 \
    ConcreteProductT ## Factory()                                       \
    {                                                                   \
        Add(this);                                                      \
    }                                                                   \
                                                                        \
    ~ConcreteProductT ## Factory()                                      \
    {                                                                   \
        Remove(this);                                                   \
    }                                                                   \
                                                                        \
    virtual const std::string ClassName() const                         \
    {                                                                   \
        return std::string( #ConcreteProductT );                        \
    }                                                                   \
                                                                        \
    virtual AbstractProductT* Create() const                            \
    {                                                                   \
        AbstractProductT* product = new ConcreteProductT();             \
        return product;                                                 \
    }                                                                   \
};


With this system in place, you would create a Skybox factory with a call such as:

ConcreteFactory(Skybox, RTTI)

And you would instantiate a Skybox with a call such as:

RTTI* skybox = Factory<RTTI>::Create("Skybox");

Populating your game world with objects becomes a matter of deserializing the configuration files, instantiating game objects through the factory system, and adding them to whatever data structure represents your scene.
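A sketch of that load path follows. The XmlElement wrapper, the Scene container, and the Deserialize() call are hypothetical stand-ins for whatever XML parser and scene representation you adopt; the Factory<RTTI> calls match the system described above:

void LoadAssetStore(const XmlElement& assetStore, Scene& scene)
{
    for (const XmlElement& item : assetStore.Child("Items").Children("Item"))
    {
        std::string className = item.Attribute("Class");

        if (item.HasAttribute("File"))
        {
            // Nested asset store: recurse into the secondary configuration file.
            LoadAssetStore(XmlElement::FromFile(item.Attribute("File")), scene);
        }
        else
        {
            RTTI* object = Factory<RTTI>::Create(className);
            if (object != nullptr)
            {
                object->Deserialize(item);  // hypothetical: each class configures itself from its element
                scene.Add(object);
            }
        }
    }
}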

The End of the Beginning

In this chapter, you examined some modern rendering topics. You looked at optimizing the speed of your rendering applications through view frustum culling, occlusion culling, object sorting, shader optimization, and hardware instancing. You also explored deferred shading, global illumination, compute shaders, and data-driven engine architecture.

This is the final chapter of the book. Congratulations on your accomplishments! But instead of seeing this as the end, I hope you think of this as a commencement. You are now at the beginning of an exploration of even more topics in rendering. I sincerely hope you have enjoyed this book.

Exercises

1. Install a graphics profiling tool such as NVIDIA Nsight Visual Studio Edition, and become familiar with the process of profiling your applications.

2. Implement a deferred shading architecture. The discussion in this chapter should be enough to get you pointed in the right direction. You’ll need to investigate multiple render targets (MRT).

3. Implement a screen-space ambient occlusion shader.
