Chapter 8. From CPU to GPU

Writing good rendering code is challenging because of the many changing variables. For most game projects, the art at the start of the project is nothing like the final art. The scenes get more cluttered and complex. Many more materials are present. Vistas are larger, and gameplay still has to fit into the frame alongside all that rendering.

Never mind that a major new OS might drop, new GPUs will become available, and drivers will be updated over the course of your project. For example, Microsoft does quarterly updates to DirectX. They aren’t big, but you do have to test against them. OpenGL might be more stable, but there are still driver updates many times a year.

All these factors affect your game, but more importantly, they can change your performance picture quite a bit. If a driver update makes a specific case slow, you might be sunk unless you issue a patch or do some serious refactoring. Scary stuff!

Beyond all this, you have to know who your target audience is and what sort of hardware they will own when you release your game. You have to choose which of the veritable buffet of rendering APIs and capabilities you can and want to support. And all this might change over the course of a project, too.

You also have to choose a graphics API. DirectX 9 and DirectX 11 are the main options in the Windows world, while everywhere else you get to choose between a couple of flavors of OpenGL (or perhaps even a proprietary API). You might need to ship with support for a software renderer, like Pixomatic from RAD, WARP from Microsoft, or SwiftShader from TransGaming. If you are really cutting edge, you might implement your own renderer on top of OpenCL or CUDA!

The bottom line is that you must have a foundation in your code that will meet the changing needs of your project. You need to structure your architecture carefully, not only to be prepared to meet the performance needs of the project, but also the compatibility, porting, and scalability requirements.

We will go into a detailed discussion of getting maximum performance on the GPU in the following chapters. This chapter is about how to write your renderers so that you will survive that knowledge. Michael Abrash’s Graphics Programming Black Book is a big, thick book, and mostly about how, while working on Quake, John Carmack and Michael Abrash kept completely rewriting their renderer for maximum performance. The state of the art is slower moving than that nowadays, but not by much.

One closing thought before we move into the meat of this chapter: Any time you, as a graphics programmer or optimizer, need inspiration, check out what the demo scene is doing. They can create works of art using nothing more than a few KB of data and clever hacks. There are hundreds of incredible techniques that have come out of that space due to the extremely tight requirements they face.

Project Lifecycle and You

We discussed how optimization fits with a project’s lifecycle in a previous chapter. Let’s revisit it very briefly as it relates to optimizing rendering.

There are many big changes that happen to rendering code as it goes from early project to late project. In the early phases, it does not have to work 100% right or be 100% stable. The art load is light, both in terms of model complexity and the sheer number of objects in the scene. Gameplay and the other subsystems that will strain the performance picture haven’t been fully fleshed out.

By the time you hit the end of the project, the following has happened:

  • Your game is being run on a GPU that was not on the market when you started the project.

  • Rendering is 100% right and 100% stable. Getting this right can require some costly operations.

  • The artists have gone nuts and put a unique material (with its own shader) on every item in the game. You now have 100x the state changes you started with.

  • Gameplay and other systems have grown to require 100% of the CPU per frame.

How do you stay sane through all this? With careful preparation from day one of the project. And if you come in late—God help you, and try to get as much of our advice in as you can!

Points of Project Failure

Let’s start by talking about common performance problems that projects are suffering from by the time they ship. If you have shipped a few games, these will come as no surprise. If you haven’t, take heed and realize that you can’t fix everything up front. It is better to take these problems as they come and spend your time when you need to. The main points of project failure when it comes to rendering are synchronization points, capability management, resource management, global ordering, instrumentation, and debugging capabilities.

Synchronization

A single synchronization point can cause a huge performance loss. The GPU is a deep pipeline, taking rendering commands and resources in at one end and spitting out completed frames at the other. If you do something on the CPU that depends on the GPU's output, you have to wait for the pipeline to drain before you can proceed.

Certain state changes have high overhead. For instance, a newly set pixel shader can't take advantage of hardware that is still busy processing the previous shader program. The GPU will need to wait for the old shader to finish before loading the new shader into the multiprocessors' memory.

Worse than state changes are readbacks. A readback is when you lock a resource on the GPU and read data back from it. The data flow is optimized for pushing content from CPU to GPU, not the other way around, and this can often be very slow.

Locking the framebuffer or depth buffer, especially, has dire performance consequences. Not only must the pipeline be flushed, but all rendering also has to cease while the data is being read. If you are lucky, the drivers will copy it all back into a buffer and give that to you while other rendering is going on. However, this requires very careful coordination to pull off, since programs usually don’t submit render commands while locking a resource.

Or you might be computing some vertex positions on the CPU and streaming them to a vertex buffer. If you’re not careful, you might write code that stores temporary values (for instance by using *=) in the buffer, causing an extremely slow readback. Your authors worked on a project where we were doing that, and just by using a local temporary instead of the VB’s memory, we saw a manyfold speedup in one of our inner loops.
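The fix is simple in code. Here is a hedged sketch (the function names and the raw float layout are invented for illustration) comparing a loop that read-modify-writes the locked buffer against one that computes into a local temporary and writes each value exactly once:

```cpp
#include <cassert>
#include <cstddef>

// 'mapped' stands in for the pointer returned by a vertex buffer lock.
// The *= below reads VB memory back before writing it -- on real
// write-combined memory, that read is extremely slow.
void scaleVerticesSlow(float* mapped, std::size_t count, float s)
{
    for (std::size_t i = 0; i < count; ++i)
        mapped[i] *= s;            // read-modify-write of buffer memory
}

// Compute entirely in a local temporary, then do a single linear write.
void scaleVerticesFast(const float* src, float* mapped,
                       std::size_t count, float s)
{
    for (std::size_t i = 0; i < count; ++i) {
        float v = src[i] * s;      // work happens in a register/local
        mapped[i] = v;             // write each value into the VB once
    }
}
```

Both loops produce identical results; only the memory traffic against the locked buffer differs, which is exactly where the manyfold speedup comes from.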

If you are frequently locking something, you should consider just locking it once. The cost for locking any given resource is the initial stall while the GPU synchronizes with the CPU and the lock is acquired. If you lock again immediately thereafter, the synchronization work is already done, so the second lock is cheap. It’s very smart to minimize the number of locks you acquire on a resource—doing a single lock per resource per frame is the best rule of thumb.

Caps Management

Another crucial piece for any good renderer is capability management. Capabilities, or caps as the DirectX 9 API names them, describe what a given card can do. In addition to binary questions like “Can I do wireframe?” there are some scalars you will probably want to know about, like “What is the highest level of MSAA supported?” or “How much video RAM is available?”

These sorts of things are essential to know because every card and vendor works a little differently, and not just in terms of varying video RAM. OpenGL and DirectX imply, by their nature as standardized APIs, that you can write rendering code once and have it run everywhere. While they certainly make that scenario easier, anyone who tells you your rendering code will always work on every card and computer is a liar.

It is not uncommon for some vendors to ship drivers or hardware that completely breaks in certain cases. Or for hardware/drivers to report that capabilities are available, but not to mention that they are very slow. On OS X, if you use an unsupported GL feature, it will switch you to a software implementation—talk about falling off the fast path!

Bottom line—you need to have a flexible system for indicating what cards need fallbacks or other work-arounds, as well as more mundane things like what level of PS/VS is supported or how much VRAM there is. This is very useful during the time period immediately surrounding launch—it’s easy to add special cases as you run into them and push updates.

Resource Management

A lot of games, especially PC games, are unpleasant to play because of their resource management code. It might take them a long time to load things to the card. They might be wasteful of VRAM and perform badly on all but the highest-end systems. They might break during device resets when switching out of full-screen mode, or when resizing the game window. Level changes might result in resource leaks.

A big cause of performance problems is failing to get resource management right! We’ll talk about what’s needed in the “Managing VRAM” section that follows.

Global Ordering

Ordering your render commands is crucial for good performance. Drivers and the GPU do not have good visibility into your scene's specific needs, so they mostly have to do what you say in the order you say it. And reordering render commands on the fly isn't likely to be a performance win even if it could be done reliably, because of the overhead of figuring out the optimal order.

Every computational problem becomes easier when you add constraints, and figuring out optimal render order is no exception. It’s much, much cheaper for you to submit things properly.

It’s worthwhile to think about the kind of rendering order that might be important. Maybe you want to draw all your reflections and shadow maps before drawing the main scene. Maybe you want to draw opaque stuff front-to-back and translucent stuff back-to-front. Maybe you want to sort draws by material or texture in order to reduce state changes.

Games have shipped doing awful things, like switching the render target in the middle of the frame to draw each reflection. So you can certainly get away with it, at least on the PC. But in those scenarios, the developers had to pay a high price. On consoles, the overhead of changing render targets can make this prohibitive. It’s better to address the problem up front.

We’ll discuss a flexible system to help deal with these sorts of sorting issues later in this chapter, in the “Render Managers” section.

Instrumentation

Good instrumentation means good visibility into your code. Renderers have lots of moving parts. You need to know which ones are moving! If you don’t have a way of seeing that you’re doing 10x too many locks per frame, you’ll never be able to fix it.

The best thing here is to add a counter system and note important operations (like locks, draw calls, state changes, and resource creations/destructions). Tools like PIX and NvPerfHUD can also do this, but onboard tools are great because they are always available, even on your artist’s or tester’s computer.
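A minimal counter system along these lines might look like the following sketch. The class and counter names are hypothetical; a real implementation would display the per-frame values on screen or log them for artists and testers:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical onboard counter system: each important operation bumps
// a named counter, and the counters are read and reset once per frame.
class RenderCounters
{
public:
    void increment(const std::string& name, unsigned amount = 1)
    {
        counters[name] += amount;
    }

    unsigned get(const std::string& name) const
    {
        std::map<std::string, unsigned>::const_iterator it =
            counters.find(name);
        return it == counters.end() ? 0 : it->second;
    }

    void resetFrame() { counters.clear(); }  // call at end of each frame

private:
    std::map<std::string, unsigned> counters;
};
```

In your API wrapper, you would call something like `gCounters.increment("locks")` inside the one-and-only lock routine, so the numbers stay accurate no matter who calls it.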

Debugging

When you are debugging graphics code, tools like NvPerfHUD, PIX, gDEBugger, BuGLe, and GLIntercept are essential. One of the best things you can do is intelligently support their APIs for annotating your app so that it’s easy to include the annotations in your rendering stream when appropriate, without always requiring the tools to be present. Don’t forget your trusty profiler, either, because you can often spot rendering hotspots by looking for unusual numbers of calls to a block of code, API calls, or driver modules that take a large percentage of execution time (for instance, when locking a buffer).

Many projects have delivered subpar products because the developers didn’t have an easy way to see what their code was (or wasn’t) doing. When debugging complex rendering code, visibility into what’s going on behind the scenes is key. Otherwise, you cannot make intelligent decisions—you’re flying blind!

You also need tools in your engine to cross-check that what your code thinks it is doing is what is really happening, as reported by PIX, etc. Some generally useful tools include a command to dump a list of all known resources, a way to see what’s in all your render targets, and the ability to toggle wireframe/solid rendering on objects. There are a ton of other possibilities if you have laid out your code sensibly.

Make sure that you have meaningful error reports that dump enough information so that you can interpret garbled reports from artists/testers. It’s also smart to put warnings in if performance boundaries are exceeded, to help keep people from blowing the budget when building content.

Managing the API

Now that we’ve laid out some of the major problem areas for graphics coders, let’s talk about what you can do to help deal with them. The first step is dealing with the API. Whether you are working with DirectX or OpenGL, there are a lot of things you can do to simplify life and prevent problems before they start to be a barrier.

Assume Nothing

The best rule with any rendering API is to assume nothing. Keep it at arm’s length and use a minimal subset. When developing Quake, Carmack used a minimal subset of OpenGL, and for many years after, no matter what else didn’t work, vendors would be sure to test that Quake ran really well on their hardware. As a result, staying on that minimal subset helped a lot of games run better.

Even on today’s APIs, which are more mature and reliable than they were in those days, it still makes sense to find the common, fast path and use that for everything you can.

GL has GLUT and Direct3D has D3DX. GLUT and D3DX are convenient helper libraries that make it easier to get started with both APIs. However, they are not necessarily written with performance or absolute correctness in mind. (GLUT’s window management code can be buggy and unreliable, while D3DX’s texture management code can be slow and inefficient.) We highly recommend either not using these libraries at all, or if you do, being vigilant and ready to ditch them at the first sign of trouble. With a few exceptions, the things these libraries do can be done readily in your own code.

For more complex operations, like some of the mesh simplification/lighting routines in D3DX, you might consider running them offline and storing the output rather than adding a dependency in your application.

Build Correct Wrappers

Make sure that every basic operation you support—like allocating a vertex buffer or issuing a draw call—happens in one and only one place. This tremendously simplifies problems like “finding that one vertex buffer that is being allocated with the wrong size” or “preventing draw calls during a device reset.” It also makes it easy to be sure you are doing the right thing in terms of calling APIs the correct way, doing proper error handling, consistently using fallbacks, and so on.

Having everything happen in only one place gives you yet another benefit. When you do encounter a problem in the hardware or with the driver, it makes it trivial to work around the problem.

You may want to wrap every single API call in routines that do error checking and validation on their behavior. Having wrappers also makes it easier to do useful things like log every call to a text file—a lot of data, but a very useful trick when you are trying to track down a rendering bug.

Wrappers also help a lot because they make it much simpler to port your rendering code from API to API or from platform to platform. Even large incompatibilities can be localized to a single file or method.

State Changes

State changes are a big part of the performance picture for rendering code. Advice to avoid too many is a perennial part of every GPU vendor’s “best practices” talk. A state change happens whenever you change something via the rendering API, like what the current texture is, what the fill mode is, or which shader is in use. Newer rendering approaches, such as deferred shading, are designed to (among other things) reduce state changes, but it is difficult to remove them entirely.

One very useful trick is to use your API wrapper to suppress duplicate state set calls. Depending on the API and platform, API calls can range from “a relatively cheap, in-process call” to “very costly; requires calling kernel mode driver code.” Even in cheap cases, suppressing duplicate calls is easy and reduces work for the driver. Just track each state’s value and only call the API to set it if the new value passed to the wrapper is different.
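Here is a minimal sketch of that kind of shadowing wrapper. The names are invented, and the `driverCalls` counter stands in for the actual API call a real wrapper would make:

```cpp
#include <cassert>
#include <map>

// Hypothetical state-caching wrapper: remember the last value set for
// each render state and skip the driver entirely when it hasn't changed.
class StateCache
{
public:
    StateCache() : driverCalls(0) {}

    void setRenderState(int state, int value)
    {
        std::map<int, int>::iterator it = shadow.find(state);
        if (it != shadow.end() && it->second == value)
            return;                 // duplicate -- suppress the call
        shadow[state] = value;
        ++driverCalls;              // real code would call the API here
    }

    int driverCalls;                // stand-in for actual driver work

private:
    std::map<int, int> shadow;      // shadow copy of known driver state
};
```

One caveat: after a device reset the shadow copy is stale, so the cache needs to be invalidated whenever the real driver state could have changed behind its back.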

Some drivers will actually batch similar states for you if they are set near to each other. For example, state changes A, B, A, B may actually be submitted to the hardware as A, A, B, B, thus reducing the overall state change count by half.

Draw Calls

State changes tell the graphics card how to draw, but draw calls (DX’s Draw*Primitive calls or GL’s glDraw* calls) are what trigger actual work. Typically, the drivers defer all work until a draw call is issued. This helps reduce the cost of duplicate state changes (although they still have a noticeable cost).

The net result is that when you are profiling, you will typically find that your draw calls are where most of your graphics API-related time is going. This is largely due to work that the driver has to do to queue up all the commands so the GPU will render, which is not to say that a draw call itself doesn’t have noticeable costs. Figure 8.1 shows the time in milliseconds that it takes to draw frames with varying number of draw calls. For comparison, we show two scenarios: no state changes (the best case) and varying the texture bound to sampler zero. Both cases draw to a very small render target—64px square—to minimize the impact of fill-rate while requiring the card and drivers to actually process the change. As you can see, there is a nice linear trend. Texture changes between draws are about triple the cost of straight draws.

Figure 8.1. Time in milliseconds (Y-axis) for varying number of draw calls (X-axis). Notice that changing state for each draw (in this case, the active texture) dramatically increases the cost of each draw call. Each draw call only draws a few pixels in order to minimize actual work the GPU has to perform.

However, this also suggests an upper limit. Every additional piece of state that changes between draws consumes additional milliseconds per frame. You cannot budget your entire frame for draw call submission, so suddenly spending 5 mspf on draw calls seems very generous. For a very simple scene with only textures changing between draw calls, you can do perhaps 6,000 draw calls. But if you are changing shaders, changing render targets, uploading shader constants, and so forth, that will quickly drop off to only one or two thousand calls.

A symptom of draw call issues is clearly visible in Intel’s VTune. When there are too many draw calls, the driver’s CPU consumption is extraordinarily high. If VTune’s module view doesn’t show high levels of utilization in the driver (usually a .dll file with either an NV or ATI in its name), then draw calls aren’t your problem. If the driver does show relatively high utilization and draw call counts are in the 2,000+ range, then your issue is quite clear: your application is CPU bound, which means the GPU is starving.

State Blocks

To remove the need to make dozens or hundreds of calls to the driver to modify the state between draw calls, it’s possible to use state blocks. A state block is a group of related states that are all set at once, reducing GPU and driver workload. So, for instance, you can predefine values for all the states on a texture sampler at startup, and then instead of making a dozen calls to completely set it later, you can simply assign the state block.

One thing to be aware of with state blocks is that support for them can vary, especially on older or lower-end hardware. In some cases, they are emulated by the driver, which simply goes through the values in the state block and sets all the states normally. In that case, simply reducing state changes (as discussed elsewhere) will give you better performance.

As always, make sure to test on your target hardware to make sure that state blocks are a good fit. If the hardware doesn’t support it, the drivers may emulate it more slowly than you could do it yourself.

Instancing and Batching

Are we stuck with fewer than a couple thousand objects? By no means.

For simpler objects like text, particle effects, motion trails, user interfaces, and similar low-poly objects, a common technique is simply to put all the objects into a single dynamic vertex buffer, transformed into world space or screen space as appropriate, and draw them with a single draw call per material. This works well for some scenarios, but it breaks down in more complex cases.
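A sketch of that batching approach follows. The names are hypothetical, and a counter stands in for the real work of filling a dynamic vertex buffer and issuing the draw:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical dynamic batcher: accumulate pre-transformed triangles
// into one array per material, then issue a single draw per material
// instead of one draw per object.
struct Vertex { float x, y, z; };

class TriangleBatcher
{
public:
    void addTriangle(int materialId, const Vertex* tri)
    {
        std::vector<Vertex>& batch = batches[materialId];
        batch.insert(batch.end(), tri, tri + 3);
    }

    // Issue one "draw call" per non-empty material batch; returns how
    // many were issued. Real code would fill a VB and call the API here.
    int flush()
    {
        int drawCalls = 0;
        for (std::map<int, std::vector<Vertex> >::iterator it =
                 batches.begin(); it != batches.end(); ++it) {
            if (!it->second.empty())
                ++drawCalls;
            it->second.clear();
        }
        return drawCalls;
    }

private:
    std::map<int, std::vector<Vertex> > batches;
};
```

With this structure, a thousand particles sharing one material cost one draw call instead of a thousand, at the price of transforming the vertices on the CPU.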

Instancing is a technique that allows you to resubmit the same geometry many times with different parameters, automatically. In short, instancing moves the overhead of draw call submission to the GPU. There are two different versions of instancing and each has positive and negative side effects.

  • Shader Instancing: Shader (or constant) instancing builds on dynamic instancing by moving the per-instance transformation from the CPU to the GPU. A stream of instance data accompanies the vertex data through the use of shader constants. This technique is sensitive to hardware constant minimums: since the attribute data may exceed the space available for constants, developers may need to break rendering into several draw calls. This is a fair trade-off in many cases and may reduce the number of draw calls from many to several. This technique is still memory intensive.

  • Hardware Instancing: Hardware instancing is a feature of shader model 3.0 hardware. It achieves many of the positive elements of the other techniques with one major drawback: it requires shader model 3.0 hardware. Hardware instancing uses one data stream that contains a mesh; a second data stream contains the per-instance attribute information. Using the SetStreamSourceFreq call, you can map vertex data to instance data. This method is both CPU and memory friendly. The instancing isn’t free, however, and some cards may have overhead in the pre-vertex-shader parts of the GPU. In NVIDIA’s PerfHUD, you may notice high levels of utilization in the vertex assembly unit.

Render Managers

The most obvious way to write a renderer is to have a bunch of objects that look something like this:

class MyRenderableObject
{
      Matrix worldXfrm;
      VertexBuffer myMesh;
      void onRender()
      {
        Graphics->setTexture("red");
        Graphics->setBuffer(myMesh);
        Graphics->setTransform(worldXfrm);
        Graphics->draw();
      }
};

Then, for each visible object, the scene-drawing code calls onRender and off you go.

Naturally, over time, each object’s rendering code gets more and more complex. If an object requires a reflection, it might do something like this:

void onRender()
{
     Graphics->setRenderTarget(myReflectionTexture);
     drawSceneForReflection();
     Graphics->setRenderTarget(NULL);
     // Draw the object normally using the reflection texture.
}

Aside from obviously awful things like the previous example, each object ends up with slightly different rendering code for each specialized kind of drawing it does. Your rendering code ends up mixed in with your gameplay and other code. And because each object is working independently, with no global knowledge of the scene, it becomes very difficult to do something like optimize mesh rendering in general—you can only do fixes to each object.

What makes a lot more sense is to create render managers. They have a manager-specific way to submit items for rendering—for instance, a mesh + transform + material for a mesh render manager, but a particle render manager might just need a point in space for each particle. They have common interfaces for sorting and drawing, so that the core renderer can call them all equally. In other words:

class RenderItem;

class IRenderManager
{
      virtual void sort(RenderItem *items, int count)=0;
      virtual void render(RenderItem *items, int count)=0;
};

class MeshRenderManager : public IRenderManager
{
      RenderItem *allocateRenderItem(Mesh &myMesh, Material &myMaterial,
                                     Matrix &worldXform);
      void sort(RenderItem *items, int count);
      void render(RenderItem *items, int count);
} gMeshRenderManager;

void MyObject::onRender()
{
     submitRenderItem(gMeshRenderManager.allocateRenderItem(mesh,
                      material, xfrm));
}

As you can see, now the object that is being drawn can concern itself with describing what it wants drawn, and the MeshRenderManager deals with how to draw it quickly and correctly.

Render Queues

An important piece in the example code shown earlier is the RenderItem and the sort/render calls. You’ll notice that everything works in terms of lists of RenderItems. This is very useful when you want to do things like optimize render order. Look at the definition of RenderItem:

struct RenderItem
{
     float distance;
     IRenderManager *manager;
     unsigned int key;
     void *itemData;
};

You’ll notice a couple of things. First, it is a 16-byte structure (with 32-bit pointers), so it is very cache friendly. Second, there is data to help with sorting, but most of the data for rendering is stored in itemData. With just this small amount of structure, you can handle most sorting quickly and easily.

Because you know what the manager is for each item, you can put opaque and translucent RenderItems into separate lists. You can do this either from explicit knowledge about the manager (for instance, all particle effects are probably translucent), or by querying it when creating the RenderItem (since it will have to know how each RenderItem is drawn).

Once that is done, you can easily sort by the distance field (which stores distance to the closest point on the object) if you want front-back/back-front ordering, or by grouping items by manager and letting the manager sort() them for maximum efficiency (whatever that means for that manager). For efficiency, the manager can encode sort-related data into the key field and only access whatever it stores in itemData when actually rendering.
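To make the sorting concrete, here is a sketch using that layout. The struct is restated so the example stands alone, with `manager` typed as `void*` to keep it self-contained; the comparison functions are hypothetical but illustrate the two orderings described above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Restated for self-containment; mirrors the RenderItem above.
struct RenderItem
{
    float distance;       // distance to closest point on the object
    void* manager;        // IRenderManager* in the real structure
    unsigned int key;     // manager-encoded sort data
    void* itemData;       // manager-specific payload
};

// Opaque items: group by manager, then by key, to minimize state changes.
static bool opaqueLess(const RenderItem& a, const RenderItem& b)
{
    if (a.manager != b.manager) return a.manager < b.manager;
    return a.key < b.key;
}

// Translucent items: back-to-front, so blending composites correctly.
static bool translucentLess(const RenderItem& a, const RenderItem& b)
{
    return a.distance > b.distance;
}

void sortOpaque(std::vector<RenderItem>& items)
{
    std::sort(items.begin(), items.end(), opaqueLess);
}

void sortTranslucent(std::vector<RenderItem>& items)
{
    std::sort(items.begin(), items.end(), translucentLess);
}
```

Because the comparisons touch only the 16-byte items and never chase the itemData pointer, the sort stays cache friendly even for thousands of items.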

Allocation overhead can be a problem, so we recommend pooling the memory used for itemData (per manager) and for RenderItem (per scene).

There are several variants on this idea, mostly relating to the packing. For instance, Quake 3 Arena encodes the entire render command into a single four-byte value. Forsyth recommends using a 64- or 128-bit value, so there’s room for a pointer if needed. The approach we’ve outlined is nice because it is a little easier to follow and extend. With very tight packing there’s a lot of range management required in order to avoid exceeding what you can address with (say) four bits. Of course, it can be worth it when you need your render sorting to really scream.

Since this makes sorting a great deal more flexible and extensible, and completely separates scene traversal and determining what needs to be rendered from the actual API calls required for rendering, it is a major building block for all sorts of rendering improvements. Two obvious improvements: keep a queue per render target, with the targets in a dependency-ordered list, so that you can easily arrange to touch every RT only once; and buffer all the render queues so that rendering can happen on a second thread.

Managing VRAM

We’ve covered some of the major concerns when dealing with your rendering API. The other major piece of any renderer is resource management, which is what this section covers.

In rendering terms, a resource is anything you allocate from the rendering API. It could be a state block, a vertex or index buffer, a texture, a shader, one or more shader constants, a query/fence, or a half dozen other things. Specifically, it is anything you have to allocate from the API, use for rendering, and eventually release.

Getting resource management right is a big pain, but it is essential for good performance. There are two contexts in which performance is important: at device reset time, and during normal rendering. The following sections cover dealing with those contexts.

Dealing with Device Resets

Device resets can happen in both GL and Direct3D programming, although the causes and promises related to them are different. Basically, a device reset occurs any time you need to tear down everything you’ve allocated for rendering and recreate it. In Direct3D this can happen whenever the OS feels like it (or more commonly when you switch resolutions or windowed to/from full screen, or the user switches to another program), and you generally lose most of your data and have to re-upload it. OpenGL is better about preserving data across resets, but it operates similarly.

Good performance during a device reset is important because of the user experience it creates. If a player’s IM program pops up and steals focus from the game, the game needs to be snappy about it; otherwise, the user will become frustrated. When the game starts up and goes to full screen, it needs to be quick so the user knows what is happening. If the program is slow/buggy, the user is less likely to play it.

To deal with device reset, you need to know about every resource you have allocated from the device, and you have to take resource-type-specific steps to preserve it across the reset. The simplest way to do this is to have a base class for your resources that knows what device allocated them, has callbacks for the resource type to implement for device lost/device found, and maintains itself on a per-device list. Then, when a reset happens, you just walk the list, call lost on everything, do the device reset, and when you come back, call found.
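A minimal sketch of that scheme follows. All names are hypothetical, and the `alive` flag stands in for real GPU-side data that would be released and recreated:

```cpp
#include <cassert>
#include <vector>

// Hypothetical base class: every resource registers on a per-device
// list and implements lost/found callbacks for its resource type.
class DeviceResource
{
public:
    DeviceResource() : alive(false) {}
    virtual ~DeviceResource() {}
    virtual void onDeviceLost()  { alive = false; } // release GPU data
    virtual void onDeviceFound() { alive = true;  } // recreate/re-upload
    bool alive;   // stand-in for actual GPU-side allocation state
};

class Device
{
public:
    void registerResource(DeviceResource* r) { resources.push_back(r); }

    // Walk the list, call lost on everything, do the actual reset,
    // then call found on everything.
    void reset()
    {
        for (std::size_t i = 0; i < resources.size(); ++i)
            resources[i]->onDeviceLost();
        // ... the real API-level device reset would happen here ...
        for (std::size_t i = 0; i < resources.size(); ++i)
            resources[i]->onDeviceFound();
    }

private:
    std::vector<DeviceResource*> resources;
};
```

Each concrete resource type (texture, vertex buffer, query) overrides the two callbacks with whatever its API requires, and the reset logic never needs to know the details.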

You may need to keep copies of some resources in system-memory backing store or regenerate procedurally generated content! Thankfully, dealing with device resets is well documented in both DX and GL, so we don’t need to go into the specifics here.

Resource Uploads/Locks

Runtime is the other situation where resources have a big performance contribution. There are three basic guidelines here, and they all revolve around doing exactly what you have to do and no more.

  • Minimize copies: Copying 4MB of texture data onto the card is bad; copying it once in system memory and again to the card is even worse. Look for and avoid extra copies. The only time you should do them is when you need the data on a physically different chunk of RAM (like on the GPU).

  • Minimize locks: This is also crucial. Whenever possible, lock buffers once, write each dword once in linear order, then immediately unlock and move on. We’ll cover this in more detail in the following chapters, but it’s worth mentioning here.

  • Double buffer: This isn’t appropriate for every situation, but if you have data that is changing frequently (once or more every frame), double buffer it so the card has more time to process each frame before it moves on. In some cases, the driver does this for you.

Resource Lifespans

One key to dealing with resource allocations efficiently is to figure out in what timeframe they will be used. Using different allocation strategies for different lifespans can pay big dividends. Consider breaking things down into these three scenarios:

  • Per-frame usage: Keeping a special shared buffer for vertices/indices that are only going to be used once in a frame and then discarded can give you big wins. By reusing the buffer every frame and batching uploads into bigger transfers, you can regain a lot of the performance loss caused by small draws (of <100 triangles).

  • Dynamic resources: These are buffers that are kept around for multiple frames, but changed once every frame (or relatively often). Typically, you allocate these with different flags in order to let the driver be more intelligent, although most drivers detect usage patterns and move data around appropriately.

  • Static resources: These are buffers that are locked and written to once and never changed again. Often, by hinting this to the driver, the driver can be more efficient about keeping them on the card and available for fast access.
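The per-frame scenario above is often implemented as a linear "scratch" allocator: many small vertex/index writes are packed into one shared buffer during the frame, and the whole thing is discarded at end of frame. The sketch below shows the idea with plain memory; the class name and the failure sentinel are illustrative choices, not an established API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a per-frame scratch buffer. Offsets returned by
// alloc() would become base offsets for draw calls in a real renderer.
class FrameScratch {
public:
    explicit FrameScratch(size_t bytes) : m_storage(bytes), m_used(0) {}

    // Returns the offset of the new region, or (size_t)-1 if full.
    size_t alloc(size_t bytes) {
        if (m_used + bytes > m_storage.size()) return (size_t)-1;
        size_t offset = m_used;
        m_used += bytes;
        return offset;
    }

    unsigned char* data(size_t offset) { return &m_storage[offset]; }

    // Call once per frame; all of last frame's contents are discarded.
    void reset() { m_used = 0; }

private:
    std::vector<unsigned char> m_storage;
    size_t m_used;
};
```

Because allocation is just a pointer bump and the buffer is reused every frame, the many tiny uploads that would otherwise accompany small draws collapse into a few large, cheap transfers.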

Look Out for Fragmentation

Fragmentation can rapidly reduce how much of the GPU’s memory you can actually use. It can be worthwhile to round all allocations up to powers of two in order to make fragmentation harder to occur. Do some tests. If you centralize all your allocations, these sorts of changes become much easier to make. The worst-case scenario is discovering that a change like this is required and realizing that you have no wrapper around the API’s resource allocation/deallocation calls.
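Rounding up to the next power of two is a one-liner worth having in any allocation wrapper. A common bit-twiddling version for 32-bit sizes:

```cpp
#include <cassert>
#include <cstdint>

// Round an allocation size up to the next power of two. Limiting block
// sizes to powers of two reduces the variety of hole sizes the allocator
// produces, which makes fragmentation harder to occur.
uint32_t nextPow2(uint32_t n) {
    if (n == 0) return 1;
    --n;                 // so exact powers of two map to themselves
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16; // smear the top bit downward
    return n + 1;
}
```

The trade-off is internal waste (up to nearly 2x per allocation), which is why the text says to measure before committing to it.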

Other Tricks

Here are a few other tricks that will be helpful in managing the performance characteristics of your game.

Frame Run-Ahead

An important factor that can affect perceived performance of a system is frame run-ahead. The drivers will try to buffer commands so that the GPU always has a constant stream of work, which can lead to it running behind the CPU by several frames. At 30Hz, the user might press a button, experience one frame (33ms) of latency while the CPU processes the input and issues the commands to display the result, and then two or three frames (66–100ms) before those commands are visible onscreen. Suddenly, you have an extra 100ms of latency, making your game feel loose, unresponsive, and generally laggy.

When you are running interactively, you will want to prevent this behavior. The easiest way to do so is to force synchronization at certain points; this is a great opportunity to apply your optimization knowledge in reverse, since the same operations you would normally avoid because they stall the pipeline can be used deliberately to synchronize. The usual tricks are issuing a command like an occlusion query or a fence that lets you determine how far the GPU has gotten through your commands, or locking a 1×1 texture that is touched as part of the rendering process and reading back a pixel from it.
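The fence approach amounts to capping how many frames of commands may be in flight: before starting a new frame, the CPU blocks on the oldest outstanding fence. The sketch below simulates that bookkeeping with plain counters; in a real renderer the "fences" would be GPU query/fence objects and the retire step would be a blocking wait. All names here are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical sketch of capping frame run-ahead: the CPU keeps at most
// kMaxFramesAhead fences in flight, retiring the oldest before it may
// begin issuing another frame's commands.
const size_t kMaxFramesAhead = 2;

struct FrameLimiter {
    std::deque<int> fencesInFlight; // frame IDs the GPU hasn't finished yet

    // Call at the start of each CPU frame.
    void beginFrame(int frameId) {
        while (fencesInFlight.size() >= kMaxFramesAhead) {
            // Real code would block on the GPU fence here; we just retire it.
            fencesInFlight.pop_front();
        }
        fencesInFlight.push_back(frameId);
    }
};
```

With `kMaxFramesAhead` set to 2, the driver can still pipeline one full frame of work, but the runaway multi-frame latency described above cannot build up.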

When you profile your code, you’ll see that the GPU or CPU is waiting on these sync points, but that’s probably fine. If you find that one is consistently waiting for the other, it might be an opportunity to scale up the work done on the side that is waiting.

Lock Culling

When culling is working properly, it is invisible. Except by comparing frame rates, there should be no way to tell whether you are drawing everything in the scene or only what is needed. This makes it easy for culling to break silently; if it “mostly” works, you might never know.

A useful technique to fight this is to add a flag to your application that locks the camera information used for culling, i.e., prevents it from being updated. In combination with some debug visualization (like showing bounding boxes of culled items and the frustum of the camera), it becomes trivial to turn on the flag and then freecam around your world to confirm visually that only stuff in the frustum is being drawn.
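The lock-culling flag boils down to one conditional in the camera update: the camera used for culling only tracks the live camera while the flag is off. A minimal sketch, with hypothetical struct and field names:

```cpp
#include <cassert>

// Hypothetical sketch: the frustum used for culling is frozen while the
// debug flag is set, so you can fly the live camera around and visually
// confirm that only objects inside the frozen frustum are drawn.
struct Camera { float x, y, z; /* plus frustum planes in real code */ };

struct CullState {
    Camera cullCamera;  // camera the culling pass actually uses
    bool   lockCulling; // debug flag: freeze the culling camera

    void update(const Camera& liveCamera) {
        if (!lockCulling)
            cullCamera = liveCamera; // normal path: cull from the view camera
        // when locked, cullCamera keeps the frozen frustum
    }
};
```

Paired with bounding-box and frustum visualization, flipping `lockCulling` on and free-flying makes culling bugs obvious at a glance.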

Stupid Texture (Debug) Tricks

Texture sampling is an endlessly useful tool. For example, you can drop your textures to 1×1 in size to check on fill-rate. But there are a lot of other possibilities, too.

You can visualize which mipmap levels are in use by generating a texture with a different color in each mip level; for instance, red at level 0, green at level 1, blue at level 2, etc. Then just play your game and look for the different colors. If you never see red, you can probably drop a level or two on your mips and save some VRAM. Many engines automate either mip visualization or even mip in-use detection for you, so check before you add something like this.
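Generating such a debug mip chain is straightforward: build each level as a solid color from a small palette. The sketch below fills plain RGBA arrays; uploading them as the texture's mip levels is API-specific and omitted, and the function name is illustrative.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: build a mip chain where every level is a solid
// debug color, so the level being sampled is visible at a glance in-game.
struct RGBA { unsigned char r, g, b, a; };

std::vector<std::vector<RGBA> > makeMipDebugChain(int topSize) {
    const RGBA colors[] = {
        {255,   0,   0, 255}, // level 0: red
        {  0, 255,   0, 255}, // level 1: green
        {  0,   0, 255, 255}, // level 2: blue
        {255, 255,   0, 255}, // level 3: yellow
        {255,   0, 255, 255}, // level 4+: magenta
    };
    std::vector<std::vector<RGBA> > chain;
    for (int size = topSize, level = 0; size >= 1; size /= 2, ++level) {
        const RGBA& c = colors[level < 4 ? level : 4];
        chain.push_back(std::vector<RGBA>(size * size, c));
    }
    return chain;
}
```

If red never shows up in gameplay, the top mip is never sampled and can likely be dropped to save VRAM, exactly as described above.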

For checking alignment and sampling issues, checkerboard textures of all varieties are extremely useful. A 1px checker pattern (every other pixel set) makes it easy to tell if you are sampling on texel centers; if you’re right, you get a pretty checker pattern, and if you’re wrong, you get a gray blob. Larger checks are good for checking alignment. (For instance, suppose that a model should be using 16px square regions of the texture; if you put a 16px checkerboard on it, it will be easy to spot if you are crossing boundaries that you ought not cross.)
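A parameterized checkerboard generator covers both cases above: a check size of 1 gives the every-other-texel pattern for testing texel-center sampling, while larger check sizes test region alignment. This sketch writes single-channel texels; the function name is illustrative.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: generate a checkerboard with a given check size in
// texels. checkSize = 1 is the sampling-alignment test pattern; larger
// sizes (e.g., 16) test whether geometry stays inside its texture regions.
std::vector<unsigned char> makeChecker(int width, int height, int checkSize) {
    std::vector<unsigned char> texels(width * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            bool on = ((x / checkSize) + (y / checkSize)) & 1;
            texels[y * width + x] = on ? 255 : 0;
        }
    return texels;
}
```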

Some graphics hardware also has precision limits when you are drawing very large UV ranges or very large pieces of geometry, and a checkerboard texture will make it easy to spot any errors in calculation because they’ll show up as artifacts in the checker pattern. The fix there is usually to make sure that you’re drawing data that’s approximately the same size as the screen, rather than many times larger.

Gradients are also very useful. For instance, a texture where you map (U,V) to the (R,G) color channels makes it easy to see if your texture mapping is working correctly. When you’re generating texcoords procedurally, generating a gradient based on position can make it easy to spot if transforms that should be identical across draw calls really are identical.
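The (U,V)-to-(R,G) gradient texture mentioned above can be generated in a few lines. Correct mapping shows up as a smooth red/green ramp; seams, flips, or wrapping errors stand out immediately. The function name is illustrative.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: a texture mapping (U,V) straight into (R,G), so
// texture-coordinate bugs are visible as breaks in the gradient.
struct RGB { unsigned char r, g, b; };

std::vector<RGB> makeUVGradient(int size) {
    std::vector<RGB> texels(size * size);
    for (int y = 0; y < size; ++y)
        for (int x = 0; x < size; ++x) {
            RGB c;
            c.r = (unsigned char)(x * 255 / (size - 1)); // U -> red
            c.g = (unsigned char)(y * 255 / (size - 1)); // V -> green
            c.b = 0;
            texels[y * size + x] = c;
        }
    return texels;
}
```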

In the brave new world of pixel shaders, you can do most of these tricks with a few instructions in your pixel shader instead of generating a texture.

Conclusion

In this chapter, we’ve presented a survival guide for anyone who is tasked with writing or working with a renderer. We’ve covered the major areas that are essential for good performance and proper functioning, and we’ve outlined specific problems that cause projects to fail, as well as tips and tricks to help make development easier. Our focus so far has been on how to co-exist peacefully with graphics drivers and treat the CPU well. The next few chapters will discuss the specifics of GPU and shader performance.
